#400 closed enhancement (fixed)
Optimise AND_MAYBE when the RHS has a maxweight of 0
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.0 |
Component: | Matcher | Version: | git master |
Severity: | minor | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
If the RHS of an AND_MAYBE clause has an RHS whose maxweight is 0 (or becomes zero during the search), the AND_MAYBE could be replaced with the LHS. The attached patch does this. However, it doesn't quite work, due to two side effects:
- if the AND_MAYBE is part of an OP_SYNONYM, the wdf of the RHS of the AND_MAYBE can have an effect, but will be lost.
- if a percentage weight calculation is in effect, the count of the number of terms matching will miss those in the AND_MAYBE.
This doesn't change API at all, so doesn't need to be fixed during the 1.1 series.
Attachments (1)
Change History (9)
by , 15 years ago
Attachment: | badopt.patch added |
---|
comment:1 by , 12 years ago
Milestone: | 1.2.x → 1.3.x |
---|
comment:2 by , 10 years ago
It is certainly feasible to check if we're under OP_SYNONYM
.
The count of matching subqueries is more problematic though - at present we always do count them, and to avoid that we'd need to have the user specify up front if they want percentages.
I guess this is useful for a PostingSource
which adds weight, and such posting sources quite often match all documents - we could perhaps detect that and know we could do this optimisation (and just adjust the count appropriately).
This seems quite an esoteric case, and unlikely to be a huge win even then, so I don't think it should block 1.4.0.
comment:3 by , 10 years ago
Milestone: | 1.3.x → 1.4.x |
---|
comment:4 by , 5 years ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Status: | new → assigned |
Version: | SVN trunk → git master |
The case where the scale factor is zero (which is what the testcases in the patch test) has been handled since 1.4.10 (9e1023ab5d28532e649715754d5f000038e98f2f) - I tested with the new testcases applied to git master and they pass. This optimisation doesn't cause the problem with percentages highlighted above since we don't count subqueries for which factor == 0.
We don't currently handle the case where the maxweight of the RHS is zero or becomes zero, but I think that's quite easy to do:
-
xapian-core/api/queryinternal.cc
diff --git a/xapian-core/api/queryinternal.cc b/xapian-core/api/queryinternal.cc index c5148ca350e0..4c888d8b0af7 100644
a b QueryAndMaybe::postlist(QueryOptimiser * qopt, double factor) const 2357 2357 } 2358 2358 OrContext ctx(qopt, subqueries.size() - 1); 2359 2359 do_or_like(ctx, qopt, factor, 0, 1); 2360 Xapian::termcount save_total_subqs = qopt->get_total_subqs(); 2360 2361 unique_ptr<PostList> r(ctx.postlist()); 2361 2362 if (!r.get()) { 2362 2363 RETURN(l.release()); 2363 2364 } 2365 if (r->recalc_maxweight() == 0.0) { 2366 // The RHS can't contribute any weight, so can be discarded. Reset 2367 // total_subqs in case we counted any in the RHS so that percentages 2368 // don't get messed up. 2369 qopt->set_total_subqs(save_total_subqs); 2370 RETURN(l.release()); 2371 } 2364 2372 RETURN(new AndMaybePostList(l.release(), r.release(), 2365 2373 qopt->matcher, qopt->db_size)); 2366 2374 }
I'm not sure if we actually need to restore total_subqs - it seems there probably can't be any weighted terms in the RHS if its overall maxweight is zero, but maybe with a custom weighting scheme there could be.
Let's try to get test coverage for the above and apply for 1.5.0. I don't think this additional case is worth patching 1.4.x for.
And I think we can not worry about the case where the max weight starts off non-zero but becomes zero during the match - in such a situation it would be more obvious to implement the PostingSource to simply signal it has reached its end rather than just setting its maxweight to zero, and that will be handled efficiently already.
comment:5 by , 5 years ago
The wdf from the RHS of OP_AND_MAYBE under OP_SYNONYM needs thought still though.
It looks to me like the optimisation of the factor 0 case means that we currently discard any wdf from the RHS in this case. It'd certainly be more correct not to, and probably not hard to implement - we'd just need to have a flag recording if wdf is wanted for the current subquery (or perhaps just pass factor = -1.0 for this case and then we can check factor == 0.0
or factor <= 0.0
depending which cases we care about).
comment:6 by , 5 years ago
I was thinking we set factor = 0.0
for the subquery under OP_SYNONYM
, because it's unweighted, but actually we don't - instead we save and restore the total_subqs value across turning it into a postlist, but keep factor non-zero (if factor is zero when we get to an OP_SYNONYM
we turn it into an OP_OR
anyway, since it's equivalent when the weight isn't used).
That means the existing OP_AND_MAYBE factor == 0.0
optimisation doesn't happen under OP_SYNONYM
so we're good there.
The new optimisation in the patch above ideally should not happen under OP_SYNONYM
though.
comment:7 by , 5 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Committed to git master in 3b0dcc8d731618eb64f6783d7c3e8851b17e1e76.
I don't think this is worth backporting so closing.
comment:8 by , 5 years ago
Just noting I'm doing some reworking to handle this in more cases. It wasn't being handled for OP_AND_MAYBE
directly below another AND-like operator. And for an OP_AND_MAYBE
with multiple right branches (which is probably much less common, especially as we used to only allow one right branch) it would only trigger if they all had maxweight 0, whereas with my changes it will consider them individually.
Optimisation for AND_MAYBE which doesn't quite work right.