Opened 17 years ago
Last modified 13 months ago
#264 assigned enhancement
Optimise expand using min weight techniques
Reported by: Olly Betts | Owned by: Olly Betts
Priority: normal | Milestone: 2.0.0
Component: Matcher | Version: git master
Severity: minor | Keywords: (none)
Cc: (none) | Blocked By: (none)
Blocking: (none) | Operating System: All
Description
ESet::Internal::expand() and ExpandWeight could optimise using min weight techniques like those used by the matcher - for example, ExpandWeight could provide an upper bound and expand() could use it to know when further terms can't make it in into the ESet.
The gain is probably much less than for the matcher, since termlists aren't chunked, so we take the I/O hit regardless, but this could save a fair bit of CPU so is still useful.
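A minimal sketch of the idea, with invented names (this is an illustration, not the actual Xapian internals): keep the candidate ESet in a min-heap, and once it is full, use a cheap per-term upper bound to skip the expensive exact weight calculation for terms which can't displace the current minimum.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Hypothetical illustration of min-weight pruning for expand():
// skip computing the exact (expensive) weight for any candidate whose
// cheap upper bound can't beat the current ESet minimum.
struct Candidate {
    std::string term;
    double upper_bound;   // cheap to obtain
    double exact_weight;  // stands in for the expensive weight calculation
};

struct Entry {
    double weight;
    std::string term;
    bool operator>(const Entry& o) const { return weight > o.weight; }
};

std::vector<Entry> expand(const std::vector<Candidate>& cands,
                          size_t maxitems, size_t* skipped) {
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    *skipped = 0;
    for (const auto& c : cands) {
        // A full ESet whose minimum weight is >= the candidate's upper
        // bound can't be improved by this term, so skip it entirely.
        if (heap.size() == maxitems && c.upper_bound <= heap.top().weight) {
            ++*skipped;
            continue;
        }
        double w = c.exact_weight;  // the work we want to avoid
        if (heap.size() < maxitems) {
            heap.push({w, c.term});
        } else if (w > heap.top().weight) {
            heap.pop();
            heap.push({w, c.term});
        }
    }
    std::vector<Entry> eset;
    while (!heap.empty()) { eset.push_back(heap.top()); heap.pop(); }
    return eset;  // ascending weight order
}
```

As the description says, the saving here is CPU only - the termlist still has to be read either way.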
Change History (10)
comment:1 by , 16 years ago
Milestone: 1.1.1 → 1.1.4
comment:2 by , 15 years ago
Milestone: 1.1.4 → 1.2.0
Status: new → assigned
comment:4 by , 10 years ago
Milestone: 1.3.x → 1.4.x
This isn't worth holding up 1.4.0 for. It's an internal change which shouldn't affect the API at all.
comment:5 by , 5 years ago
Version: SVN trunk → git master
comment:6 by , 5 years ago
Looking at the expansion code, it looks to me like we fetch the collection frequency for every term, which is a waste of effort when we're using the default TradEWeight weighting - it's only used by Bo1EWeight. We should have a need_stat() mechanism like Xapian::Weight has.
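A minimal sketch of what such a need_stat()-style mechanism for expand weights might look like - the class and flag names here are invented for illustration and are not the real Xapian API:

```cpp
#include <cassert>

// Hypothetical sketch of a need_stat() mechanism for expand weights,
// modelled loosely on Xapian::Weight: a subclass declares in its
// constructor which statistics it uses, so the expansion loop can skip
// fetching the rest (e.g. skip collection frequency for TradEWeight).
enum stat_flags { TERMFREQ = 1, COLLECTION_FREQ = 2 };

class EWeightSketch {
    unsigned wanted = 0;
  protected:
    // Called by subclass constructors to register a required statistic.
    void need_stat(stat_flags flag) { wanted |= flag; }
  public:
    virtual ~EWeightSketch() = default;
    bool needs(stat_flags flag) const { return (wanted & flag) != 0; }
};

// TradEWeight-like scheme: uses the termfreq, not the collection freq.
class TradLikeEWeight : public EWeightSketch {
  public:
    TradLikeEWeight() { need_stat(TERMFREQ); }
};

// Bo1EWeight-like scheme: uses the collection frequency.
class Bo1LikeEWeight : public EWeightSketch {
  public:
    Bo1LikeEWeight() { need_stat(COLLECTION_FREQ); }
};
```

The expansion loop would then only fetch the collection frequency for a term when needs(COLLECTION_FREQ) reports it is actually wanted.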
comment:7 by , 21 months ago
Milestone: 1.4.x → 1.5.0
This seems like it probably needs public API changes, so targeting 1.5.0 for now.
comment:8 by , 13 months ago
Milestone: 1.5.0 → 2.0.0
> Looking at the expansion code, it looks to me like we fetch the collection frequency for every term, which is a waste of effort when we're using the default TradEWeight weighting - it's only used by Bo1EWeight.

Addressed by 8dc6f72354d733db17fa564bfb5db51090a8adc3 which I'll backport for 1.4.25.

> We should have a need_stat() mechanism like Xapian::Weight has.

This is still true, but I think the whole mechanism should be reviewed as it looks like it was built back when we stored the termfreq in every posting list entry, which isn't really feasible for an updatable database format. Maybe we can be lazier about this if there are cases where we can rule a term out without knowing its termfreq (and for Bo1 we don't actually use the termfreq anyway). Also it looks like we probably effectively double-fetch the termfreqs if use_exact_termfreq is specified (except when we detect that the summed termfreq from the tree is exact already).
Postponing further work for now, as this is an optimisation rather than correctness and I've made a significant improvement to the default case.
comment:9 by , 13 months ago
I've been mulling this over more - at this point the current optimisation is essentially for the benefit of remote shards, where we open a RemoteTermList, which on the remote side fetches the termfreq for every term from that shard and sends it over. That does result in fetching and sending a fair amount of unused and redundant data, but it is likely to be a win over fetching just the termfreq info we want, which would require one remote protocol message exchange per term being considered.
For local shards, it's probably unhelpful - when there are multiple documents marked as relevant in the same shard, we'll end up fetching the termfreq multiple times for terms which occur in more than one of those documents. There's the upside of not fetching the termfreq from shards which don't have a relevant document containing that term (assuming we aren't asked to use exact termfreqs), but that comes at the cost of using approximated termfreqs.
I wonder if we ought to push a merging step onto the remote so that if there are multiple relevant documents on a remote we merge termlists for those documents on the remote and send one combined termlist back with termfreqs (which we can then only fetch once).
We could have a similar merging step for just the local shards, and if there are both local and remote shards with relevant documents, a final merge step between the two.
This seems like it would be as good or better in every case, and probably doesn't need a huge amount of new code.
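A rough sketch of the merging step described above, with invented names: combining the termlists of the relevant documents (on a remote, or locally per shard) yields one entry per unique term, so each termfreq need only be fetched once however many relevant documents contain the term.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: merge the termlists of several relevant
// documents into a single list of unique terms, so that the termfreq
// for each term is fetched exactly once instead of once per relevant
// document containing it.
struct MergedTerm {
    std::string term;
    unsigned reldocs;  // number of relevant documents containing the term
};

std::vector<MergedTerm>
merge_termlists(const std::vector<std::vector<std::string>>& termlists) {
    std::map<std::string, unsigned> counts;
    for (const auto& tl : termlists)
        for (const auto& term : tl)
            ++counts[term];
    std::vector<MergedTerm> merged;
    for (const auto& [term, n] : counts)
        merged.push_back({term, n});  // one entry -> one termfreq fetch
    return merged;  // sorted by term
}
```

A final merge of the per-shard (or per-remote) merged lists would have the same shape, combining the counts for terms which appear in more than one list.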
comment:10 by , 13 months ago
> Addressed by 8dc6f72354d733db17fa564bfb5db51090a8adc3 which I'll backport for 1.4.25.

Backported as aafb1aee4dccf8eb37830ce26b3075c7574e563c.
Triaging milestone:1.1.1 bugs.