Opened 15 years ago

Closed 5 years ago

Last modified 5 years ago

#400 closed enhancement (fixed)

Optimise AND_MAYBE when the RHS has a maxweight of 0

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Matcher Version: git master
Severity: minor Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

If the RHS of an AND_MAYBE clause has an RHS whose maxweight is 0 (or becomes zero during the search), the AND_MAYBE could be replaced with the LHS. The attached patch does this. However, it doesn't quite work, due to two side effects:

  • if the AND_MAYBE is part of an OP_SYNONYM, the wdf of the RHS of the AND_MAYBE can have an effect, but will be lost.
  • if a percentage weight calculation is in effect, the count of the number of terms matching will miss those in the AND_MAYBE.

This doesn't change API at all, so doesn't need to be fixed during the 1.1 series.

Attachments (1)

badopt.patch (7.6 KB ) - added by Richard Boulton 15 years ago.
Optimisation for AND_MAYBE which doesn't quite work right.

Download all attachments as: .zip

Change History (9)

by Richard Boulton, 15 years ago

Attachment: badopt.patch added

Optimisation for AND_MAYBE which doesn't quite work right.

comment:1 by Olly Betts, 12 years ago

Milestone: 1.2.x1.3.x

comment:2 by Olly Betts, 10 years ago

It is certainly feasible to check if we're under OP_SYNONYM.

The count of matching subqueries is more problematic though - at present we always do count them, and to avoid that we'd need to have the user specify up front if they want percentages.

I guess this is useful for a PostingSource which adds weight, and such posting sources quite often match all documents - we could perhaps detect that and know we could do this optimisation (and just adjust the count appropriately).

This seems quite an esoteric case, and unlikely to be a huge win even then, so I don't think it should block 1.4.0.

comment:3 by Olly Betts, 10 years ago

Milestone: 1.3.x1.4.x

comment:4 by Olly Betts, 5 years ago

Milestone: 1.4.x1.5.0
Status: newassigned
Version: SVN trunkgit master

The case where the scale factor is zero (which is what the testcases in the patch test) has been handled since 1.4.10 (9e1023ab5d28532e649715754d5f000038e98f2f) - I tested with the new testcases applied to git master and they pass. This optimisation doesn't cause the problem with percentages highlighted above since we don't count subqueries for which factor == 0.

We don't currently handle the case where the maxweight of the RHS is zero or becomes zero, but I think that's quite easy to do:

  • xapian-core/api/queryinternal.cc

    diff --git a/xapian-core/api/queryinternal.cc b/xapian-core/api/queryinternal.cc
    index c5148ca350e0..4c888d8b0af7 100644
    a b QueryAndMaybe::postlist(QueryOptimiser * qopt, double factor) const  
    23572357    }
    23582358    OrContext ctx(qopt, subqueries.size() - 1);
    23592359    do_or_like(ctx, qopt, factor, 0, 1);
     2360    Xapian::termcount save_total_subqs = qopt->get_total_subqs();
    23602361    unique_ptr<PostList> r(ctx.postlist());
    23612362    if (!r.get()) {
    23622363       RETURN(l.release());
    23632364    }
     2365    if (r->recalc_maxweight() == 0.0) {
     2366       // The RHS can't contribute any weight, so can be discarded.  Reset
     2367       // total_subqs in case we counted any in the RHS so that percentages
     2368       // don't get messed up.
     2369       qopt->set_total_subqs(save_total_subqs);
     2370       RETURN(l.release());
     2371    }
    23642372    RETURN(new AndMaybePostList(l.release(), r.release(),
    23652373                               qopt->matcher, qopt->db_size));
    23662374}

I'm not sure if we actually need to restore total_subqs - it seems there probably can't be any weighted terms in the RHS if its overall maxweight is zero, but maybe with a custom weighting scheme there could be.

Let's try to get test coverage for the above and apply for 1.5.0. I don't think this additional case is worth patching 1.4.x for.

And I think we can not worry about the case where the max weight starts off non-zero but becomes zero during the match - in such a situation it would be more obvious to implement the PostingSource to simply signal it has reached its end rather than just setting its maxweight to zero, and that will be handled efficiently already.

Last edited 5 years ago by Olly Betts (previous) (diff)

comment:5 by Olly Betts, 5 years ago

The wdf from the RHS of OP_AND_MAYBE under OP_SYNONYM needs thought still though.

It looks to me like the optimisation of the factor 0 case means that we currently discard any wdf from the RHS in this case. It'd certainly be more correct not to, and probably not hard to implement - we'd just need to have a flag recording if wdf is wanted for the current subquery (or perhaps just pass factor = -1.0 for this case and then we can check factor == 0.0 or factor <= 0.0 depending which cases we care about).

comment:6 by Olly Betts, 5 years ago

I was thinking we set factor = 0.0 for the subquery under OP_SYNONYM, because it's unweighted, but actually we don't - instead we save and restore the total_subqs value across turning it into a postlist, but keep factor non-zero (if factor is zero when we get to an OP_SYNONYM we turn it into an OP_OR anyway, since it's equivalent when the weight isn't used).

That means the existing OP_AND_MAYBE factor == 0.0 optimisation doesn't happen under OP_SYNONYM so we're good there.

The new optimisation in the patch above ideally should not happen under OP_SYNONYM though.

comment:7 by Olly Betts, 5 years ago

Resolution: fixed
Status: assignedclosed

Committed to git master in 3b0dcc8d731618eb64f6783d7c3e8851b17e1e76.

I don't think this is worth backporting so closing.

comment:8 by Olly Betts, 5 years ago

Just noting I'm doing some reworking to handle this in more cases. It wasn't being handled for OP_AND_MAYBE directly below another AND-like operator. And for an OP_AND_MAYBE with multiple right branches (which is probably much less common, especially as we used to only allow one right branch) it would only trigger if they all had maxweight 0, whereas with my changes it will consider them individually.

Note: See TracTickets for help on using tickets.