#50 closed enhancement (fixed)
SynonymPostList
Reported by: | Olly Betts | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | low | Milestone: | 1.1.1 |
Component: | Library API | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | #104 | Operating System: | All |
Description
Add synonym postlists, which represents a set of postlists merged together such that each document that occurs in any of the sublists occurs in the synonym list. The termfrequency should ideally be the number of documents that one or more of the terms occurs in, but that's too expensive to find, so we'll need to estimate. Need to be able to take underlying postlists which aren't necessarily just postlists for single terms too.
Attachments (10)
Change History (40)
comment:1 by , 20 years ago
Severity: | blocker → normal |
---|---|
Status: | new → assigned |
comment:2 by , 18 years ago
Blocking: | 104 added |
---|
comment:3 by , 18 years ago
Component: | other → Other |
---|---|
op_sys: | other → All |
rep_platform: | Other → All |
Severity: | normal → enhancement |
Version: | other → SVN HEAD |
comment:4 by , 18 years ago
Component: | Other → Library API |
---|---|
Priority: | high → low |
by , 17 years ago
Attachment: | synonym.patch added |
---|
comment:5 by , 17 years ago
(As I mentioned on IRC, but noting here for posterity) the reason this works for SelectPostList but not here is that a tree of AndPostList objects terminates when any object terminates, whereas an OrPostList tree prunes, so the entries in the vector<PostList *> become invalid while we're still working.
It might be possible to arrange to remove entries from the vector<>, but I don't see a neat way. RefCntPtr is an option, but I think it would require rather widespread changes. I think perhaps the best way is just to implement get_wdf() in OrPostList with the semantics we need here - that can recurse the tree to collect the wdf from the currently active OrPostLists. Or perhaps better, to subclass SynonymPostList from OrPostList and build a tree of those in this case.
That would allow eliminating the special SynonymPostList and all the proxying to its subtree member.
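A minimal sketch of the get_wdf()-recursion idea. These are hypothetical, heavily simplified classes purely for illustration, not the real Xapian postlist interface:

```cpp
#include <algorithm>
#include <memory>

// Hypothetical, heavily simplified interface -- not the real Xapian classes.
struct PostList {
    virtual ~PostList() {}
    virtual unsigned get_docid() const = 0;
    virtual unsigned get_wdf() const = 0;
};

struct LeafPostList : PostList {
    unsigned docid, wdf;
    LeafPostList(unsigned d, unsigned w) : docid(d), wdf(w) {}
    unsigned get_docid() const override { return docid; }
    unsigned get_wdf() const override { return wdf; }
};

// An OR node reports the wdf summed over whichever children are currently
// positioned on the same document -- the recursion suggested above.
struct OrPostList : PostList {
    std::unique_ptr<PostList> l, r;
    OrPostList(PostList* a, PostList* b) : l(a), r(b) {}
    unsigned get_docid() const override {
        return std::min(l->get_docid(), r->get_docid());
    }
    unsigned get_wdf() const override {
        unsigned did = get_docid();
        unsigned wdf = 0;
        if (l->get_docid() == did) wdf += l->get_wdf();
        if (r->get_docid() == did) wdf += r->get_wdf();
        return wdf;
    }
};
```

When both children sit on the same document, their wdfs are summed; when only one does, only that child contributes.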
A few comments on the patch:
- A lot of virtual methods are declared in synonympostlist.h, which is
something I've been trying to move away from (AIUI the compiler won't ever be able to inline them anyway, and it means a lot more header to parse when compiling source files, which slows compilation - it also often means that headers need to pull in more headers, and that source changes are more likely to cause a lot of recompilation).
- I don't think that the skip_to() method should call subtree->get_docid() in
the hope it can avoid calling subtree->skip_to(). In the (probably most common) case when we need to call skip_to() we just end up making an extra call. If OrPostList::skip_to() doesn't minimise work well enough in this case, we should just fix that directly.
- The licence boilerplate on new files incorrectly uses the old FSF address
(should be Franklin Street not Temple Place).
- I've been removing the "----START-LICENCE----" stuff when modifying files.
When BrightStation owned the copyright on everything it was handy, but now that the copyright situation is more complex we can't simply make wholesale licence changes with a quick "perl -i" command.
- The patch indentation as viewed in bugzilla suggests that some lines are space
indented.
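On the skip_to() point above, a sketch of the suggested shape, with a toy sub-postlist standing in for the real classes (all names here are hypothetical):

```cpp
// Toy stand-in for the subtree -- not the real Xapian class.
struct SubPostList {
    unsigned docid = 0;
    int skip_calls = 0;
    void skip_to(unsigned did) {
        ++skip_calls;
        if (did > docid) docid = did;  // already-positioned case is cheap
    }
};

// Just delegate: no subtree->get_docid() pre-check, since in the common case
// where movement IS needed, the pre-check would just be an extra call.
struct SynonymWrapper {
    SubPostList* subtree;
    void skip_to(unsigned did) { subtree->skip_to(did); }
};
```

If the subtree's skip_to() doesn't already short-circuit the no-movement case efficiently, the fix belongs there rather than in the caller.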
comment:6 by , 17 years ago
Owner: | changed from | to
---|---|
Status: | assigned → new |
comment:7 by , 17 years ago
Status: | new → assigned |
---|
comment:8 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:9 by , 17 years ago
I don't think that subclassing SynonymPostList from OrPostList and then building a tree of SynonymPostLists will work ideally, because it won't allow for the operators to decay to other operators.
I think the best implementation is to implement get_wdf() on OrPostList and all its possible decay products, implement SynonymPostList as a subclass of OrPostList, and build a tree which consists of a SynonymPostList at the top level, with OrPostLists lower down (assuming there are more than 2 items in the synonym list). I think this is what you were getting at with your last suggestion, but I'm not completely sure.
comment:10 by , 17 years ago
Hmm, you can't just make the root of (this part of) the tree a SynonymPostList subclassed from OrPostList, with normal OrPostLists beneath, since it can just get replaced with one of its siblings when a prune happens.
So I think OrPostList and all the decay products need get_wdf() (which is fine as it's just unimplemented there at present) and then SynonymPostList is required which just overrides get_weight() (etc) and calculates it based on get_wdf() from the sub-tree.
I can't see how to avoid proxying all the methods here without implementing Synonym variants of OrPostList, AndPostList, XorPostList, AndMaybePostList (and perhaps others I missed) which at least at first glance seems excessive, though perhaps they can all just be subclasses of the normal versions which only override a couple of methods...
comment:11 by , 17 years ago
I think having a single top-level SynonymPostList with a single sub-postlist, and having the SynonymPostList proxy all the relevant methods to the sub-postlist, is the clean approach. In particular, for the other approach we'd have to pass the weight object held in the synonym postlist subclass (or a clone of it) to the new class when decay happens, so we'll have to get fairly intimately involved in the code handling the decay. Also, if we have to subclass all the possible decay products, we're liable to miss one if new possibilities are added, or people submit queries we hadn't thought of (for example, imagine three phrases which are joined with a synonym operator - one of the possible decay products is a phrase postlist).
I'll try implementing the proxying approach, with get_wdf() in all the relevant postlist classes.
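A rough sketch of that proxying shape. Names and structure here are hypothetical and greatly simplified; the real implementation would also forward next(), skip_to(), at_end(), and the rest of the postlist interface:

```cpp
// Hypothetical, minimal postlist interface -- not the real Xapian one.
struct PostList {
    virtual ~PostList() {}
    virtual unsigned get_wdf() const = 0;
    virtual void next() = 0;
};

// Concrete leaf with a fixed wdf, just for demonstration.
struct FixedPostList : PostList {
    unsigned wdf;
    explicit FixedPostList(unsigned w) : wdf(w) {}
    unsigned get_wdf() const override { return wdf; }
    void next() override {}
};

// Single top-level proxy: holds one sub-postlist (typically an OR tree),
// forwards positional methods to it unchanged, and computes its own weight
// from the subtree's combined wdf.  Decay inside the subtree is invisible
// from here, so no decay product needs a Synonym variant.
struct SynonymPostList : PostList {
    PostList* subtree;
    double factor;  // stand-in for the weight object's contribution
    SynonymPostList(PostList* sub, double f) : subtree(sub), factor(f) {}
    unsigned get_wdf() const override { return subtree->get_wdf(); }
    void next() override { subtree->next(); }
    double get_weight() const { return factor * subtree->get_wdf(); }
};
```

The key property is that pruning or decay below the proxy never invalidates the proxy itself, which sidesteps the sibling-replacement problem described above.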
comment:12 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:13 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:14 by , 17 years ago
There is now a branch (called "opsynonym") in xapian SVN containing the implementation of this feature. The status is essentially still that described in the previous comment, though.
comment:15 by , 17 years ago
Blocking: | 160 added |
---|---|
Operating System: | → All |
comment:17 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.1 |
comment:18 by , 17 years ago
Blocking: | 160 removed |
---|
comment:19 by , 17 years ago
I'm slowly working on merging the opsynonym branch to HEAD. The outstanding issues are that the calculation of the percentage weight when the top query doesn't match 100% of the terms is broken for synonyms, which requires reworking, and that relevance reweighting doesn't work correctly with synonyms. I'd be willing to merge to HEAD once all the percentage calculation work is complete, because the relevance reweighting code is relatively unused, and the incorrect value isn't terribly incorrect (an incorrect value for the relevance term frequency is used in the calculation, but this doesn't break anything badly).
comment:20 by , 16 years ago
Blocking: | 307 added |
---|
comment:21 by , 16 years ago
Bumping milestone to 1.1.1 as this is ready to apply and isn't an incompatible change.
comment:22 by , 16 years ago
Milestone: | 1.1.0 → 1.1.1 |
---|
by , 16 years ago
Attachment: | opsynonym_changes_12353_12354.patch added |
---|
Latest patch from opsynonym branch
comment:23 by , 16 years ago
I've attached the latest patch containing all the changes on the opsynonym branch.
Coverage testing indicates that there is only one added/changed line which is not covered by tests: the handling for "factor == 0.0" in QueryOptimiser::do_synonym(), i.e. the line:
RETURN(do_or_like(query, 0.0));
near the end of queryoptimiser.cc.
There are two FIXMEs added by the patch:
1) LocalSubMatch::make_synonym_postlist "FIXME - calculate the reltermfreq to use". Currently, we're not trying to estimate the reltermfreq for synonym queries - we should try to add something, but I don't think it should block applying the patch (we should make a ticket so we don't forget about it, though).
2) QueryOptimiser::do_synonym "FIXME - should we be doing something with the wqf?" I think the answer is "no", but I've not thought it through very carefully.
I think neither of these need block the patch being merged to trunk, and would like to do so shortly after 1.1.0 is released.
comment:24 by , 16 years ago
This is looking good. Almost tempted to try to merge it for 1.1.0, but I think we should probably stick to getting it out the door.
Patch review (to avoid too much probably pointless detailed explanation, I've just fixed bits which I think aren't at all controversial in SVN, but feel free to argue):
In multimatch.cc, it would be nice to say in what situation denom == 0 can happen, as it isn't immediately clear to me (I'm guessing something like OP_SYNONYM being the top level query op?)
In queryparser.lemony, line 360 onwards, you add a second vector subqs2 to allow us to carefully OP_OR the OP_SYNONYM-ed terms for each prefix in the "partial term" case, but in the "wildcard term" case, you just OP_SYNONYM all the terms using a single vector. I think these two cases should be consistent (they're really just implicit vs explicit cases of the same thing), and I think we probably should OP_SYNONYM everything.
Why does this new overloaded form of Weight::init_() have a term parameter when we always pass an empty string for it? Is there some likely future use for this?
Is there an easy testcase to add coverage for the missing line?
comment:25 by , 16 years ago
Very tempted as I am to merge to 1.1.0, let's focus on releasing 1.1.0 instead. We should merge this shortly after release, though.
Bits so far committed are fine.
I believe denom == 0 happens when the only terms in the query are synonym terms. Though reading the code, it looks like in that case termfreqandwts.size() will be 0, so maybe I'm wrong about that. (Or it may have been true in an earlier version of the patch, but is no longer.)
In queryparser.lemony, the intention is to build a query which matches all the possible completions of the word, but matches the exact word with a higher weight. This is desirable because the user hasn't explicitly marked the query as being a wildcard query (unlike in the wildcard case where the user has added a * to the end of the word). That said, I think it would be nicer if the resulting queries applied OP_SYNONYM to all the partial terms (across prefixes), and to all the non prefixed terms: so, if the prefixes are "A" and "B" and we have the terms "foo" and "foot" in the database with each prefix, we'd currently get, for a search for "foo":
(Afoo SYNONYM Afoot) OR (Afoo) OR (Bfoo SYNONYM Bfoot) OR (Bfoo)
whereas it would be better to get:
(Afoo SYNONYM Afoot SYNONYM Bfoo SYNONYM Bfoot) OR (Afoo SYNONYM Bfoo)
It's possible that Term::get_query() should be joining the terms across prefixes with OP_SYNONYM, too. That's a change which is far more likely to affect existing queries than the others, though, so needs some thought.
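To make the grouping concrete, here is a toy combiner (deliberately not the real Xapian Query API) that just prints the bracketed shape of the two alternatives described above:

```cpp
#include <string>
#include <vector>

// Toy helper purely to show the bracketing -- not the real query classes.
std::string combine(const std::string& op,
                    const std::vector<std::string>& subs) {
    std::string out = "(";
    for (size_t i = 0; i < subs.size(); ++i) {
        if (i) out += " " + op + " ";
        out += subs[i];
    }
    return out + ")";
}

// Current shape: per-prefix SYNONYM groups, OR-ed with the exact terms.
std::string current() {
    return combine("OR", {combine("SYNONYM", {"Afoo", "Afoot"}), "(Afoo)",
                          combine("SYNONYM", {"Bfoo", "Bfoot"}), "(Bfoo)"});
}

// Preferred shape: one SYNONYM across all prefixes, plus one SYNONYM group
// of the exact terms, so the exact word still gets its weight boost.
std::string preferred() {
    return combine("OR",
                   {combine("SYNONYM", {"Afoo", "Afoot", "Bfoo", "Bfoot"}),
                    combine("SYNONYM", {"Afoo", "Bfoo"})});
}
```

The difference is purely in where the SYNONYM grouping boundary falls: per prefix in the current shape, across prefixes in the preferred one.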
The Weight::init_() thing is probably just historical reasons.
I would think it's easy to add test coverage for the missing line.
comment:26 by , 16 years ago
OK, I've cleaned up the init_() stuff.
Presumably denom==0 must be possible if the code coverage shows that branch being taken?
I think I probably agree about the FLAG_PARTIAL stuff.
Another issue, as discussed on IRC, is clamping wdf of the synonym so it doesn't exceed the document length, or something like that.
comment:27 by , 16 years ago
I've fixed the FLAG_PARTIAL stuff, clamped the wdf() to the doclength (if wdf is requested by the weight object) - and fixed several bugs in the wdf() calculation while I was at it. I've also fixed a bug in the next() and skip_to() implementations - these were calling the subpostlist with the w_min() value supplied to the synonym, causing them to terminate early in some cases. I've also extended the testsuite to cover these cases.
The only missing bit now, I think, is coverage for the missing line. Actually, we should probably check coverage for the patch as it stands now - my recent changes may have added some more lines which aren't covered. (XorPostList::get_wdf() is quite likely not to be covered, for example.)
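The clamping mentioned above amounts to something like the following (the helper name is hypothetical; in the real patch this logic lives inside the synonym postlist's wdf handling):

```cpp
#include <algorithm>

// A synonym's wdf is the sum of its subqueries' wdfs, which can exceed the
// document length when several of the synonym's terms occur in the same
// document.  Clamp it so weighting formulas that assume wdf <= doclen
// stay well-behaved.
unsigned synonym_wdf(unsigned summed_wdf, unsigned doclen) {
    return std::min(summed_wdf, doclen);
}
```

Note the clamp only matters when the weight object actually requests the wdf, which is why the patch applies it conditionally.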
by , 16 years ago
Attachment: | opsynonym_changes_12471_12472.patch added |
---|
Latest patch from trunk to the opsynonym branch
comment:28 by , 16 years ago
Other missing bit is moving the testcases from api_db.cc to a new test file, to avoid bloating api_db.cc further.
comment:29 by , 16 years ago
Blocking: | 307 removed |
---|
(In #307) That potential fix would be done internally, though - the thinking behind this ticket was really to provide an API to allow OP_SYNONYM to more closely follow what would happen if you performed the synonym expansion at index time in a system with wdf multipliers for each field (as in omega), but allowing the relative weighting of fields to be adjusted at search time instead of index time. I don't have a concrete example of wanting to use it in this way, though - it was just a feeling that it would be nice to be "feature complete" here.
I think if the multipliers are in the range (0,1] this wouldn't work well; if you wanted term "title:A" to have 5 times the wdf of term "body:B", you'd implement that by dividing the wdfs of "body:B" by 5 and you'd lose any distinction between documents with fairly different values (eg, 1 and 5) for the wdf of "body:B".
An alternative approach to making one field more important is to use a OR query instead of a synonym query. The weightings which come out of doing that are different, of course (that's the point of OP_SYNONYM), but you could use an OP_SYNONYM as the main query, and an OR query with a fairly low multiplier to give a slight boost to a particular field, if you really needed to adjust the weight in this sort of way.
I think we're both agreed that WONTFIX is an appropriate response, given the difficulty of coming up with a reasonable implementation and the lack of a definite use for this. Closing as that.
by , 16 years ago
Attachment: | opsynonym_changes_12490_12492.patch added |
---|
Latest patch from trunk to the opsynonym branch
comment:30 by , 16 years ago
I've just gone through the coverage reports again for the changes made in the opsynonym branch (a little more carefully than last time I did this):
There are 3 changes which have no coverage, which I do not propose to fix:
- weight.cc line 92. This is only called if (stats_needed & DOC_LENGTH_MAX) is true, but this isn't the case for any weighting object existing in core. The equivalent line is also missed for all the other Weight::init_() methods. The fix for all of these would be to write a test with a custom weight which used this.
- synonympostlist.cc: SynonymPostList::get_weight() when (want_wdf) is false. Again, this isn't the case for any current weighting algorithm other than BoolWeight.
- synonympostlist.cc: SynonymPostList::get_description(). This is only called from debugging code.
There are also the following functions, for which I propose to add coverage by writing further tests:
- andmaybepostlist.cc: AndMaybePostList::get_wdf()
- andnotpostlist.cc: AndNotPostList::get_wdf()
- andpostlist.cc: AndPostList::get_wdf()
- xorpostlist.cc: XorPostList::get_wdf()
- synonympostlist.cc: SynonymPostList::skip_to()
Other than writing these further tests, and assuming no problems show up when I write them, I think we're in a good position to merge to trunk.
comment:31 by , 16 years ago
The first two could be fixed by writing a suitable subclass of Xapian::Weight, but the effort is probably disproportionate to the gain, especially as we can more carefully inspect that code (it looks good to me).
BTW, DOC_LENGTH_MAX is there as I'm fairly sure some of the DfR schemes will want it.
comment:32 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |