Opened 15 years ago

Closed 15 years ago

#340 closed defect (fixed)

PostingSource subclasses cannot signal a change in their maxweight

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.1.1
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently, PostingSource subclasses have a get_maxweight() method, which is called once at the start of a search to determine the maximum possible weight returned by the posting source.

However, it is possible that some PostingSource subclasses may know that the maximum possible returned weight has decreased at certain points during the match process. I'm working on one such postingsource at present, which returns monotonically decreasing weights as the match proceeds.

It would be nice if such postingsource subclasses could notify the matcher that their maximum weight (from this point onward) has decreased. Internally, PostList classes do this by calling matcher->recalc_maxweight, but the postingsource classes don't have access to the matcher object.

I'm not sure what the best API for this would be: one option is to add something like the following to the PostingSource base class:

  private:
    friend ExternalPostlist;
    ExternalPostlist * externalpl;
    void register_externalpl(ExternalPostlist * externalpl_) {
        externalpl = externalpl_;
    }
  protected:
    void notify_new_maxweight();

where notify_new_maxweight() does something like:

    externalpl->recalc_maxweight();

ExternalPostlist would register itself with the posting source before use, and would pass the recalc_maxweight() message on to the matcher when called.

Or, we could pass some proxy object to the reset() method, which the posting source could remember and use to call recalc_maxweight() if needed, or ignore if not needed.

Better ideas welcomed.

Marking for 1.1.0 since this might require us to change the PostingSource API.

Attachments (1)

patch (6.2 KB ) - added by Richard Boulton 15 years ago.
Implementation of this (untested)

Download all attachments as: .zip

Change History (13)

comment:1 by Richard Boulton, 15 years ago

Milestone: 1.1.01.1.1

After discussion on IRC, this doesn't need to block 1.1.0 - we'll just mark the postingsource API as experimental, so we can handle this kind of change. Also, the neatest approach so far seems to be the "add a notify method to PostingSource" approach, which doesn't require the api to be changed.

comment:2 by Richard Boulton, 15 years ago

Owner: changed from Olly Betts to Richard Boulton

comment:3 by Richard Boulton, 15 years ago

Status: newassigned

by Richard Boulton, 15 years ago

Attachment: patch added

Implementation of this (untested)

comment:4 by Richard Boulton, 15 years ago

Resolution: fixed
Status: assignedclosed

Fixed in r12627.

comment:5 by Olly Betts, 15 years ago

Hmm, this API probably seems rather esoteric to the user (your class calls a method to say it has changed, and then later someone calls a method on your class to find out what it actually changed to?!?), and probably leaves them very uncertain if/when to actually call it.

Perhaps it would be better for the subclass to report when its maxweight changes and specify what the new maxweight is? Then the matcher wouldn't need to do its recursive recalculation (that's really only needed when the tree has pruned) - it would just need to keep track of a postingsource's maxweight and reduce its own maxweight by the difference between the postingsource's old and new maxweights. Or perhaps the subclass should supply both old and new? You've implemented more of these than I have, so maybe you can see if you tend to have the old max weight available naturally.

Without a recursive walk of the postlist tree, this should be cheap enough to call pretty much whenever the maxweight reduces (unless the maxweight is inherently expensive to calculate, but that's something the user should have an understanding of as it is their implementation). It might still be better to avoid if there's a small reduction for every document (like an exponentially decaying weight function), but again that's something the user should be able to easily grasp.

comment:6 by Olly Betts, 15 years ago

Resolution: fixed
Status: closedreopened

Just trying out something along these lines to see how it looks (quite promising so far...)

comment:7 by Olly Betts, 15 years ago

Owner: changed from Richard Boulton to Olly Betts
Status: reopenednew

comment:8 by Richard Boulton, 15 years ago

Interesting ideas: I think having the posting source specify the new maxweight is a good idea (but they still need to make sure that a call to get_maxweight returns the new maxweight, as well - unless we take get_maxweight out of the API alltogether, and just have a method for notifying a new one, which would normally be called from the init method to set an initial value).

I think asking the subclass to supply both old maxweight and the new maxweight is just going to result in a lot of duplicated code in subclasses - I can't think of a situation where the old one is naturally available already.

comment:9 by Olly Betts, 15 years ago

OK, this looks good so I've committed the PostingSource end of things to trunk as r12694. Please take a look and tell me if this looks sane or not.

Currently this is crudely bolted to the matcher to just force a recursive call to PostList::recalc_maxweight() - I'm going to work on that part next.

Possibly "tell the matcher the change in maxweight" should be the only mechanism, and also used by all the other postlists - it would certainly be nicer if a reduction in maxweight didn't require work proportional to the size of the query. I will look at how feasible that is.

comment:10 by Richard Boulton, 15 years ago

If I recall correctly, the worry with this scheme was that cumulative rounding errors due to passing the matcher a series of small changes in the maxweight could have the effect of causing the matcher to optimise prematurely. This is probably unlikely to occur often, and, indeed, can probably occur even with the current setup, but the cumulative effect could make it likely enough to be a problem that we need to deal with it.

Olly was looking into ways to control the rounding direction, which would allow us to ensure that the rounding always happened in such a way that invalid optimisations didn't occur. However, this only seems to be possible on a process-global level, which isn't appropriate for a library.

My best idea so far is just to add an epsilon whenever modifying the maxweight, to ensure that the rounding is in the right direction.

comment:11 by Olly Betts, 15 years ago

Status: newassigned

Altering the top-level maxweight directly is also problematic because the intermediate postlists don't get their maxweight for the branch leading to the ExternalPostList updated.

I think you can do rounding locally, but control of rounding is much better supported by C99, which is currently orthogonal to ISO C++. But even if rounding control was simple and fully portable, this approach is going to be kind of delicate I suspect.

Perhaps the mechanism would be better as a "bubble up" rather than a "tell the top to reiterate the whole tree", which would at least make the work more O(log(N)) than O(N) in the number of postlists, but that doesn't solve the accumulation of errors.

So for 1.1.1 I'm just intending to tidy up what we now have. The change to provide a cleaner API to the user is the more urgent part and that's done. At least for now, that can just be connected to the existing mechanism (as it is currently in SVN, but perhaps with a little more cleaning up).

comment:12 by Olly Betts, 15 years ago

Resolution: fixed
Status: assignedclosed

Tidied up in trunk r12802.

Updated docs with a revised "don't call set_maxweight() too often" warning in trunk 12817.

Note: See TracTickets for help on using tickets.