Opened 16 years ago

Last modified 5 months ago

#264 assigned enhancement

Optimise expand using min weight techniques

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Matcher Version: git master
Severity: minor Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

ESet::Internal::expand() and ExpandWeight could optimise using min weight techniques like those used by the matcher - for example, ExpandWeight could provide an upper bound and expand() could use it to know when further terms can't make it into the ESet.

The gain is probably much less than for the matcher, since termlists aren't chunked, so we take the I/O hit regardless, but this could save a fair bit of CPU so is still useful.

Change History (10)

comment:1 by Olly Betts, 15 years ago

Milestone: 1.1.1 → 1.1.4

Triaging milestone:1.1.1 bugs.

comment:2 by Olly Betts, 15 years ago

Milestone: 1.1.4 → 1.2.0
Status: new → assigned

comment:3 by Olly Betts, 11 years ago

Milestone: 1.2.x → 1.3.x

This isn't 1.2.x material now.

comment:4 by Olly Betts, 9 years ago

Milestone: 1.3.x → 1.4.x

This isn't worth holding up 1.4.0 for. It's an internal change which shouldn't affect the API at all.

comment:5 by Olly Betts, 5 years ago

Version: SVN trunk → git master

comment:6 by Olly Betts, 5 years ago

Looking at the expansion code, it looks to me like we fetch the collection frequency for every term, which is a waste of effort when we're using the default TradEWeight weighting - it's only used by Bo1EWeight. We should have a need_stat() mechanism like Xapian::Weight has.

comment:7 by Olly Betts, 13 months ago

Milestone: 1.4.x → 1.5.0

This seems like it probably needs public API changes, so targeting this at 1.5.0 for now.

comment:8 by Olly Betts, 5 months ago

Milestone: 1.5.0 → 2.0.0

> Looking at the expansion code, it looks to me like we fetch the collection frequency for every term, which is a waste of effort when we're using the default TradEWeight weighting - it's only used by Bo1EWeight.

Addressed by 8dc6f72354d733db17fa564bfb5db51090a8adc3 which I'll backport for 1.4.25.

> We should have a need_stat() mechanism like Xapian::Weight has.

This is still true, but I think the whole mechanism should be reviewed as it looks like it was built back when we stored the termfreq in every posting list entry, which isn't really feasible for an updatable database format. Maybe we can be lazier about this if there are cases where we can rule a term out without knowing its termfreq (and for Bo1 we don't actually use the termfreq anyway). Also it looks like we probably effectively double fetch the termfreqs if use_exact_termfreq is specified (except when we detect that the summed termfreq from the tree is exact already).

Postponing further work for now, as this is an optimisation rather than correctness and I've made a significant improvement to the default case.

comment:9 by Olly Betts, 5 months ago

I've been mulling this over more - at this point the current optimisation is essentially for the benefit of remote shards, where we open a RemoteTermList and on the remote that fetches the termfreq for every term from the remote shard and sends it over. That does result in fetching and sending a fair amount of unused and redundant data, but is likely to be a win over fetching just the termfreq info we want which would require one remote protocol message exchange per term being considered.

For local shards, it's probably unhelpful - when there are multiple documents marked as relevant in the same shard we'll end up fetching the termfreq multiple times for terms which occur in more than one of those relevant documents. There's the upside of avoiding termfreq fetches for shards which don't have a relevant document containing that term (assuming we aren't asked to use exact termfreq), but that comes at the cost of using approximated termfreqs.

I wonder if we ought to push a merging step onto the remote so that if there are multiple relevant documents on a remote we merge termlists for those documents on the remote and send one combined termlist back with termfreqs (which we can then only fetch once).

We could have a similar merging step for just the local shards, and if there are both local and remote shards with relevant documents, a final merge step between the two.

This seems like it would be as good or better in every case, and probably doesn't need a huge amount of new code.

comment:10 by Olly Betts, 5 months ago

> Addressed by 8dc6f72354d733db17fa564bfb5db51090a8adc3 which I'll backport for 1.4.25.

Backported as aafb1aee4dccf8eb37830ce26b3075c7574e563c.
