Opened 20 years ago

Closed 16 years ago

Last modified 8 years ago

#50 closed enhancement (fixed)

SynonymPostList

Reported by: Olly Betts Owned by: Richard Boulton
Priority: low Milestone: 1.1.1
Component: Library API Version: SVN trunk
Severity: minor Keywords:
Cc: Blocked By:
Blocking: #104 Operating System: All

Description (last modified by Richard Boulton)

Add synonym postlists, which represents a set of postlists merged together such that each document that occurs in any of the sublists occurs in the synonym list. The termfrequency should ideally be the number of documents that one or more of the terms occurs in, but that's too expensive to find, so we'll need to estimate. Need to be able to take underlying postlists which aren't necessarily just postlists for single terms too.

Attachments (10)

synonym.patch (18.1 KB ) - added by Richard Boulton 17 years ago.
Nearly working implementation
patch (18.0 KB ) - added by Richard Boulton 17 years ago.
Updated patch resolving the cosmetic issues
opsynonym4.patch (24.1 KB ) - added by Richard Boulton 17 years ago.
Updated implementation patch
patch.2 (23.1 KB ) - added by Richard Boulton 17 years ago.
Further updated implementation
opsynonym_changes_12353_12354.patch (44.6 KB ) - added by Richard Boulton 16 years ago.
Latest patch from opsynonym branch
opsynonym_changes_12434_12435.patch (39.7 KB ) - added by Richard Boulton 16 years ago.
Recent version of patch
opsynonym_changes_12471_12472.patch (59.3 KB ) - added by Richard Boulton 16 years ago.
Latest patch from trunk to the opsynonym branch
opsynonym_changes_12490_12492.patch (62.8 KB ) - added by Richard Boulton 16 years ago.
Latest patch from trunk to the opsynonym branch
opsynonym_changes_12590_12591.patch (121.8 KB ) - added by Richard Boulton 16 years ago.
Latest patch
opsynonym_changes_12599_12601.patch (67.2 KB ) - added by Richard Boulton 16 years ago.
Latest patch

Download all attachments as: .zip

Change History (40)

comment:1 by Olly Betts, 20 years ago

Severity: blockernormal
Status: newassigned

comment:2 by Richard Boulton, 18 years ago

Blocking: 104 added

comment:3 by Olly Betts, 18 years ago

Component: otherOther
op_sys: otherAll
rep_platform: OtherAll
Severity: normalenhancement
Version: otherSVN HEAD

comment:4 by Olly Betts, 18 years ago

Component: OtherLibrary API
Priority: highlow

by Richard Boulton, 17 years ago

Attachment: synonym.patch added

Nearly working implementation

comment:5 by Olly Betts, 17 years ago

(As I mentioned on IRC but note here for posterity) the reason this works for SelectPostList but not here is that a tree of AndPostList objects terminates when any object terminatesm, but an OrPostList tree prunes so the entries in the vector<PostList *> become invalid while we're still working.

It might be possible to arrange to remove entries from the vector<> but I don't see a neat way. RefCntPtr is an option, but I think would require rather widespread changes. I think perhaps the best way is to just implement get_wdf() in OrPostList with the semantics we need here - that can recurse the tree to collect the wdf from the currently active OrPostLists. Or perhaps better, to subclass SynonymPostList from OrPostList and build a tree of those in this case.

That would allow eliminating the special SynonymPostList and all the proxying

to its subtree member.

A few comments on the patch:

  • A lot of virtual methods are declared in the synonympostlist.h, which is

something I've been trying to move away from (AIUI, the compiler won't ever be able to inline them, but it means a lot more header to parse when compiling source files so slows compilation - it also often means that headers need to pull in more other headers, and that source changes are more likely to cause a lot of recompilation).

  • I don't think that the skip_to() method should call subtree->get_docid() in

the hope it can avoid calling subtree->skip_to(). In the (probably most common) case when we need to call skip_to() we just end up making an extra call. If OrPostList::skip_to() doesn't minimise work well enough in this case, we should just fix that directly.

  • The licence boilerplate on new files incorrectly uses the old FSF address

(should be Franklin Street not Temple Place).

  • I've been removing the "----START-LICENCE----" stuff when modifying files.

When BrightStation owned the copyright on everything it was handy, but now that the copyright situation is more complex we can't simply make wholesale licence changes with a quick "perl -i" command.

  • The patch indentation as viewed in bugzilla suggests that some lines are space

indented.

comment:6 by Richard Boulton, 17 years ago

Owner: changed from Olly Betts to Richard Boulton
Status: assignednew

comment:7 by Richard Boulton, 17 years ago

Status: newassigned

comment:8 by Richard Boulton, 17 years ago

attachments.isobsolete: 01

comment:9 by Richard Boulton, 17 years ago

I don't think that subclassing SynonymPostList from OrPostList and then building a tree of SynonymPostLists will work ideally, because it won't allow for the operators to decay to other operators.

I think the best implementation is to implement get_wdf() on OrPostList and all it's possible decay products, implement SynonymPostList as a subclass of OrPostList, and build a tree which consists of a SynonymPostList at the top level, and OrPostLists lower down (assuming there are more than 2 items in the synonym list). I think this was what you were getting at with your last suggestion, but I'm not completely sure.

comment:10 by Olly Betts, 17 years ago

Hmm, you can't just make the root of (this part of) the tree a SynonymPostList subclassed from OrPostList, with normal OrPostLists beneath, since it can just get replaced with one of its siblings if when a prune happens.

So I think OrPostList and all the decay products need get_wdf() (which is fine as it's just unimplemented there at present) and then SynonymPostList is required which just overrides get_weight() (etc) and calculates it based on get_wdf() from the sub-tree.

I can't see how to avoid proxying all the methods here without implementing Synonym variants of OrPostList, AndPostList, XorPostList, AndMaybePostList (and perhaps others I missed) which at least at first glance seems excessive, though perhaps they can all just be subclasses of the normal versions which only override a couple of methods...

comment:11 by Richard Boulton, 17 years ago

I think having a single top-level SynonymPostList with a single sub-postlist, and having the SynonymPostList proxy all the relevant methods to the sub-postlist, is the clean approach. In particular, for the other approach we'd have to pass the weight object held in the synonym postlist subclass (or a clone of it) to the new class when decay happens, so we'll have to get fairly intimately involved in the code handling the decay. Also, if we have to subclass all the possible decay products, we're liable to miss one if new possibilities are added, or people submit queries we hadn't thought of (for example, imagine three phrases which are joined with a synonym operator - one of the possible decay products is a phrase postlist).

I'll try implementing the proxying approach, with get_wdf() in all the relevant postlist classes.

by Richard Boulton, 17 years ago

Attachment: patch added

Updated patch resolving the cosmetic issues

comment:12 by Richard Boulton, 17 years ago

attachments.isobsolete: 01

by Richard Boulton, 17 years ago

Attachment: patch.2 added

Updated implementation patch

comment:13 by Richard Boulton, 17 years ago

attachments.isobsolete: 01

comment:14 by Richard Boulton, 17 years ago

There is now a branch (called "opsynonym") in xapian SVN containing the implementation of this feature. The status is essentially still that described in the previous comment, though.

comment:15 by Richard Boulton, 17 years ago

Blocking: 160 added
Operating System: All

comment:17 by Richard Boulton, 17 years ago

Description: modified (diff)
Milestone: 1.1

comment:18 by Richard Boulton, 17 years ago

Blocking: 160 removed

comment:19 by Richard Boulton, 17 years ago

I'm slowly working on merging the opsysnoym branch to HEAD. The outstanding aspects are that the calculation of the percentage weight when the top query doesn't match 100% of the terms is broken for synonyms, which requires reworking, and that relevance reweighting doesn't work correctly with synonyms. I'd be willing to merge to HEAD once all percentage calculation work is complete, because the relevance reweighting stuff is relatively unused, and the incorrect value isn't terribly incorrect (an incorrect value for the relevance term frequency is used in the calculation, but this doesn't break anything badly).

comment:20 by Richard Boulton, 16 years ago

Blocking: 307 added

comment:21 by Olly Betts, 16 years ago

Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.

comment:22 by Olly Betts, 16 years ago

Milestone: 1.1.01.1.1

by Richard Boulton, 16 years ago

Latest patch from opsynonym branch

comment:23 by Richard Boulton, 16 years ago

I've attached the latest patch containing all the changes on the opsynonym branch.

Coverage testing indicates that there is only one line added/changed by which is not covered by tests: the handling for "factor == 0.0" in QueryOptimiser::do_synonym() (ie, the line:

RETURN(do_or_like(query, 0.0));

near the end of queryoptimiser.cc

There are two FIXMEs added by the patch:

1) LocalSubMatch::make_synonym_postlist "FIXME - calculate the reltermfreq to use". Currently, we're not trying to estimate the reltermfreq for synonym queries - we should try to add something, but I don't think it should block applying the patch (we should make a ticket so we don't forget about it, though).

2) QueryOptimiser::do_synonym "FIXME - should we be doing something with the wqf?" I think the answer is "no", but I've not thought it through very carefully.

I think neither of these need block the patch being merged to trunk, and would like to do so shortly after 1.1.0 is released.

comment:24 by Olly Betts, 16 years ago

This is looking good. Almost tempted to try to merge it for 1.1.0, but I think we should probably stick to getting it out the door.

Patch review (to avoid too much probably pointless detailed explanation, I've just fixed bits which I think aren't at all controversial in SVN, but feel free to argue):

In multimatch.cc, it would be nice to say in what situation denom == 0 can happen as it isn't immediately clear to me (I'm guessing something like OP_SYNONYM being the top level query op?)

In queryparser.lemony, line 360 onwards, you add a second vector subqs2 to allow us to carefully OP_OR the OP_SYNONYM-ed terms for each prefix in the "partial term" case, but in the "wildcard term" case, you just OP_SYNONYM all the terms using a single vector. I think these two cases should be consistent (they're really just implicit vs explicit cases of the same thing), and I think we probably should OP_SYNONYM everything.

Why does this new overloaded form of Weight::init_() have a term parameter when we always pass an empty string for it? Is there some likely future use for this?

Is there an easy testcase to add coverage for the missing line?

comment:25 by Richard Boulton, 16 years ago

Very tempted as I am to merge to 1.1.0, let's focus on releasing 1.1.0 instead. We should merge this shortly after release, though.

Bits so far committed are fine.

I believe denom == 0 happens when the only terms in the query are synonym terms. Though reading the code, it looks like in that case termfreqandwts.size() will be 0, so maybe I'm wrong about that. (Or it may have been true in an earlier version of the patch, but is no longer.)

In queryparser.lemony, the intention is to build a query which matches all the possible completions of the word, but matches the exact word with a higher weight. This is desirable because the user hasn't explicitly marked the query as being a wildcard query (unlike in the wildcard case where the user has added a * to the end of the word). That said, I think it would be nicer if the resulting queries applied OP_SYNONYM to all the partial terms (across prefixes), and to all the non prefixed terms: so, if the prefixes are "A" and "B" and we have the terms "foo" and "foot" in the database with each prefix, we'd currently get, for a search for "foo":

(Afoo SYNONYM Afoot) OR (Afoo) OR (Bfoo SYNONYM Bfoot) OR (Bfoo)

whereas it would be better to get:

(Afoo SYNONYM Afoot SYNONYM Bfoo SYNONYM Bfoot) OR (Afoo SYNONYM Bfoo)

It's possible that Term::get_query() should be joining the terms across prefixes with OP_SYNONYM, too. That's a change which is far more likely to affect existing queries than the others, though, so needs some thought.

The Weight::init_() thing is probably just historical reasons.

I would think it's easy to add test coverage for the missing line.

comment:26 by Olly Betts, 16 years ago

OK, I've cleaned up the init_() stuff.

Presumably denom==0 must be possible if the code coverage shows that branch taken?

I think I probably agree about the FLAG_PARTIAL stuff.

Another issue, as discussed on IRC, is clamping wdf of the synonym so it doesn't exceed the document length, or something like that.

by Richard Boulton, 16 years ago

Recent version of patch

comment:27 by Richard Boulton, 16 years ago

I've fixed the FLAG_PARTIAL stuff, clamped the wdf() to the doclength (if wdf is requested by the weight object) - and fixed several bugs in the wdf() calculation while I was at it. I've also fixed a bug in the next() and skip_to() implementations - these were calling the subpostlist with the w_min() value supplied to the synonym, causing them to terminate early in some cases. I've also extended the testsuite to cover these cases.

The only missing bit now, I think is coverage for the missing line. Actually, we should probably check coverage for the patch as it stands now - my recent changes may have added some more lines which aren't covered. (XorPostList::get_wdf() is quite likely not to be covered, for example.)

by Richard Boulton, 16 years ago

Latest patch from trunk to the opsynonym branch

comment:28 by Richard Boulton, 16 years ago

Other missing bit is moving the testcases from api_db.cc to a new test file, to avoid bloating api_db.cc further.

comment:29 by Richard Boulton, 16 years ago

Blocking: 307 removed

(In #307) That potential fix would be done internally, though - the thinking behind this ticket was really to provide an API to allow OP_SYNONYM to more closely follow what would happen if you performed the synonym expansion at index time in a system with wdf multipliers for each field (as in omega), but allowing the relative weighting of fields to be adjusted at search time instead of index time. I don't have a concrete example of wanting to use it in this way, though - it was just a feeling that it would be nice to be "feature complete" here.

I think if the multipliers are in the range (0,1] this wouldn't work well; if you wanted term "title:A" to have 5 times the wdf of term "body:B", you'd implement that by dividing the wdfs of "body:B" by 5 and you'd lose any distinction between documents with fairly different values (eg, 1 and 5) for the wdf of "body:B".

An alternative approach to making one field more important is to use a OR query instead of a synonym query. The weightings which come out of doing that are different, of course (that's the point of OP_SYNONYM), but you could use an OP_SYNONYM as the main query, and an OR query with a fairly low multiplier to give a slight boost to a particular field, if you really needed to adjust the weight in this sort of way.

I think we're both agreed the WONTFIX is an appropriate response, given the difficulty of coming up with a reasonable implementation, and given the lack of definite use for this. Closing as that.

by Richard Boulton, 16 years ago

Latest patch from trunk to the opsynonym branch

comment:30 by Richard Boulton, 16 years ago

I've just gone through the coverage reports again for the changes made in the opsynonym branch (a little more carefully than last time I did this):

There are 3 changes which have no coverage, which I do not propose to fix:

  • weight.cc line 92. This is only called if (stats_needed & DOC_LENGTH_MAX) is true, but this isn't the case for any weighting object existing in core. The equivalent line is also missed for all the other Weight::init_() methods. The fix for all of these would be to write a test with a custom weight which used this.
  • synonympostlist.cc: SynonymPostList::get_weight() when (want_wdf) is false. Again, no this isn't true for any current weighting algorithm other than BoolWeight.
  • synonympostlist.cc: SynonymPostList::get_description(). This is only called from debugging code.

There are also the following functions which I propose to try and add coverage for by writing further tests:

  • andmaybepostlist.cc: AndMaybePostList::get_wdf()
  • andnotpostlist.cc: AndNotPostList::get_wdf()
  • andpostlist.cc: AndPostList::get_wdf()
  • xorpostlist.cc: XorPostList::get_wdf()
  • synonympostlist.cc: SynonymPostList::skip_to()

Other than writing these further tests, and assuming no problems show up when I write them, I think we're in a good position to merge to trunk.

comment:31 by Olly Betts, 16 years ago

The first two could be fixed by writing a suitable subclass of Xapian::Weight, but the effort is probably disproportionate to the gain, especially as we can more carefully inspect that code (it looks good to me).

BTW, DOC_LENGTH_MAX is there as I'm fairly sure some of the DfR schemes will want it.

by Richard Boulton, 16 years ago

Latest patch

by Richard Boulton, 16 years ago

Latest patch

comment:32 by Richard Boulton, 16 years ago

Resolution: fixed
Status: assignedclosed

Merged the implementation of OP_SYNONYM from the opsynonym branch to trunk in r12609, so I'm closing this now. The remaining code on the branch is the fix for ticket #104, which I'll be merging shortly.

Last edited 8 years ago by Olly Betts (previous) (diff)
Note: See TracTickets for help on using tickets.