Opened 4 years ago

Last modified 5 weeks ago

#744 new defect

Merge tfidf-maxwdf-norm branch

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 1.5.0
Component: Library API
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Nishad Dawkhar implemented the "maxwdf" norm for TfIdfWeight, which is on the tfidf-maxwdf-norm branch in git now.
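For readers unfamiliar with the norm, here is a minimal sketch. It assumes "maxwdf" means the SMART-style max normalisation, where a term's wdf is divided by the largest wdf in the same document (wdfdocmax). The function name is hypothetical and is not taken from the branch.

```cpp
#include <cstdint>

// Hypothetical helper, not Xapian API: scale a term's wdf by the largest
// wdf in the same document (wdfdocmax). The result lies in [0, 1]
// whenever wdf <= wdfdocmax.
inline double maxwdf_norm(std::uint32_t wdf, std::uint32_t wdfdocmax) {
    // A document with no terms has wdfdocmax == 0; avoid dividing by zero.
    if (wdfdocmax == 0) return 0.0;
    return static_cast<double>(wdf) / static_cast<double>(wdfdocmax);
}
```

Under that assumption, a term with wdf 3 in a document whose most frequent term has wdf 6 gets a normalised wdf of 0.5.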

Because this changes the API of the Weight class (by adding a new parameter to get_sumpart()), it can't be merged into 1.4.x. I think it's better to hold off merging to master while these earlier issues remain open:

  • Remote backend support
  • Given we pass doclen and uniqterms to get_sumextra(), it would make sense to pass wdfdocmax to that too.

I'm not 100% happy that we seem to need to add new parameters to get_sumpart() from time to time, because every Weight subclass then needs updating (fixing those in the library is OK, but this also affects user-defined weighting schemes). I wonder if there's a clean and efficient way to avoid this (it needs to be efficient, as this method can get called a lot). Then again, there may be only so many per-doc stats, and this is only the second time we've needed to do this.
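One possible shape for this, purely as a sketch: bundle the per-document stats into a struct and pass it by const reference, so a new stat becomes a new field rather than a new parameter. All names below (DocStats, SketchWeight, WdfOverLen) are hypothetical and are not Xapian API; the real get_sumpart() signature is different.

```cpp
#include <cstdint>

// Hypothetical bundle of per-document statistics. None of these names
// are Xapian API; they only mirror the stats mentioned in this ticket.
struct DocStats {
    std::uint32_t wdf = 0;
    std::uint32_t doclen = 0;
    std::uint32_t uniqterms = 0;
    std::uint32_t wdfdocmax = 0;
};

// Hypothetical stand-in for a Weight-like base class.
class SketchWeight {
  public:
    virtual ~SketchWeight() = default;
    // Adding a field to DocStats later leaves this signature unchanged.
    virtual double get_sumpart(const DocStats& stats) const = 0;
};

// A user-defined scheme that only reads the fields it needs.
class WdfOverLen : public SketchWeight {
  public:
    double get_sumpart(const DocStats& stats) const override {
        // An empty document has doclen == 0; avoid dividing by zero.
        if (stats.doclen == 0) return 0.0;
        return static_cast<double>(stats.wdf) / stats.doclen;
    }
};
```

The trade-off is that the caller has to fill the struct for every call, and adding a field still changes the struct's layout, so user subclasses would need recompiling even though their source stays the same. Whether that is cheap enough for a method called this often would need measuring.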

It'd also be nice to store the wdfdocmax stats (and the uniqterms stats) for all the documents in a chunked stream (like how document lengths are stored). The code in this patch that works them out is correct, but it has to scan the termlist of each document we need the stat for, which is quite a lot of work.
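To illustrate the cost, here is a rough sketch of the per-document scan, using a plain vector of (term, wdf) pairs as a stand-in for a termlist. The types and function name are hypothetical and this is not the code from the branch; with the real library the loop would presumably walk Document::termlist_begin() to termlist_end() and read TermIterator::get_wdf().

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a document's termlist: one (term, wdf) entry per distinct term.
using TermList = std::vector<std::pair<std::string, std::uint32_t>>;

// Hypothetical result type holding the two stats discussed above.
struct ScannedStats {
    std::uint32_t wdfdocmax = 0;
    std::uint32_t uniqterms = 0;
};

// One pass over the termlist: work is linear in the number of distinct
// terms, and it is repeated for every document that needs the stats.
inline ScannedStats scan_termlist(const TermList& terms) {
    ScannedStats out;
    for (const auto& entry : terms) {
        out.wdfdocmax = std::max(out.wdfdocmax, entry.second);
        ++out.uniqterms;
    }
    return out;
}
```

Storing both values at index time would replace this scan with a lookup, which is the motivation for the chunked stream suggested above.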

Change History (1)

comment:1 by Olly Betts, 5 weeks ago

Description: modified (diff)