Ticket #50 (assigned enhancement)

Opened 4 years ago

Last modified 8 months ago

SynonymPostList

Reported by: olly Owned by: richard
Priority: low Milestone: 1.1.0
Component: Library API Version: SVN trunk
Severity: minor Keywords:
Cc: Blocked By:
Operating System: All Blocking: #104, #307

Description (last modified by richard) (diff)

Add synonym postlists, which represents a set of postlists merged together such that each document that occurs in any of the sublists occurs in the synonym list. The termfrequency should ideally be the number of documents that one or more of the terms occurs in, but that's too expensive to find, so we'll need to estimate. Need to be able to take underlying postlists which aren't necessarily just postlists for single terms too.

Attachments

synonym.patch (18.1 kB) - added by richard 18 months ago.
Nearly working implementation
patch (18.0 kB) - added by richard 18 months ago.
Updated patch resolving the cosmetic issues
opsynonym4.patch (24.1 kB) - added by richard 12 months ago.
Updated implementation patch
patch.2 (23.1 kB) - added by richard 12 months ago.
Further updated implementation

Change History

Changed 4 years ago by olly

  • status changed from new to assigned
  • severity changed from blocker to normal

Changed 2 years ago by richard

  • blocking set to 104

Changed 23 months ago by olly

  • rep_platform changed from Other to All
  • version changed from other to SVN HEAD
  • component changed from other to Other
  • op_sys changed from other to All
  • severity changed from normal to enhancement

Changed 20 months ago by olly

  • priority changed from high to low
  • component changed from Other to Library API

Changed 18 months ago by richard

Nearly working implementation

Changed 18 months ago by olly

(As I mentioned on IRC but note here for posterity) the reason this works for SelectPostList? but not here is that a tree of AndPostList? objects terminates when any object terminatesm, but an OrPostList? tree prunes so the entries in the vector<PostList? *> become invalid while we're still working.

It might be possible to arrange to remove entries from the vector<> but I don't see a neat way. RefCntPtr? is an option, but I think would require rather widespread changes. I think perhaps the best way is to just implement get_wdf() in OrPostList? with the semantics we need here - that can recurse the tree to collect the wdf from the currently active OrPostLists?. Or perhaps better, to subclass SynonymPostList? from OrPostList? and build a tree of those in this case.

That would allow eliminating the special SynonymPostList? and all the proxying

to its subtree member.

A few comments on the patch:

* A lot of virtual methods are declared in the synonympostlist.h, which is something I've been trying to move away from (AIUI, the compiler won't ever be able to inline them, but it means a lot more header to parse when compiling source files so slows compilation - it also often means that headers need to pull in more other headers, and that source changes are more likely to cause a lot of recompilation).

* I don't think that the skip_to() method should call subtree->get_docid() in the hope it can avoid calling subtree->skip_to(). In the (probably most common) case when we need to call skip_to() we just end up making an extra call. If OrPostList::skip_to() doesn't minimise work well enough in this case, we should just fix that directly.

* The licence boilerplate on new files incorrectly uses the old FSF address (should be Franklin Street not Temple Place).

* I've been removing the "----START-LICENCE----" stuff when modifying files. When BrightStation? owned the copyright on everything it was handy, but now that the copyright situation is more complex we can't simply make wholesale licence changes with a quick "perl -i" command.

* The patch indentation as viewed in bugzilla suggests that some lines are space indented.

Changed 18 months ago by richard

  • owner changed from olly to richard
  • status changed from assigned to new

Changed 18 months ago by richard

  • status changed from new to assigned

Changed 18 months ago by richard

  • attachments.isobsolete changed from 0 to 1

Changed 18 months ago by richard

I don't think that subclassing SynonymPostList? from OrPostList? and then building a tree of SynonymPostLists? will work ideally, because it won't allow for the operators to decay to other operators.

I think the best implementation is to implement get_wdf() on OrPostList? and all it's possible decay products, implement SynonymPostList? as a subclass of OrPostList?, and build a tree which consists of a SynonymPostList? at the top level, and OrPostLists? lower down (assuming there are more than 2 items in the synonym list). I think this was what you were getting at with your last suggestion, but I'm not completely sure.

Changed 18 months ago by olly

Hmm, you can't just make the root of (this part of) the tree a SynonymPostList? subclassed from OrPostList?, with normal OrPostLists? beneath, since it can just get replaced with one of its siblings if when a prune happens.

So I think OrPostList? and all the decay products need get_wdf() (which is fine as it's just unimplemented there at present) and then SynonymPostList? is required which just overrides get_weight() (etc) and calculates it based on get_wdf() from the sub-tree.

I can't see how to avoid proxying all the methods here without implementing Synonym variants of OrPostList?, AndPostList?, XorPostList?, AndMaybePostList? (and perhaps others I missed) which at least at first glance seems excessive, though perhaps they can all just be subclasses of the normal versions which only override a couple of methods...

Changed 18 months ago by richard

I think having a single top-level SynonymPostList? with a single sub-postlist, and having the SynonymPostList? proxy all the relevant methods to the sub-postlist, is the clean approach. In particular, for the other approach we'd have to pass the weight object held in the synonym postlist subclass (or a clone of it) to the new class when decay happens, so we'll have to get fairly intimately involved in the code handling the decay. Also, if we have to subclass all the possible decay products, we're liable to miss one if new possibilities are added, or people submit queries we hadn't thought of (for example, imagine three phrases which are joined with a synonym operator - one of the possible decay products is a phrase postlist).

I'll try implementing the proxying approach, with get_wdf() in all the relevant postlist classes.

Changed 18 months ago by richard

Changed 18 months ago by richard

  • attachments.isobsolete changed from 0 to 1

Changed 18 months ago by richard

Updated patch resolving the cosmetic issues

Changed 12 months ago by richard

Changed 12 months ago by richard

  • attachments.isobsolete changed from 0 to 1

Changed 12 months ago by richard

Updated implementation patch

Changed 12 months ago by richard

There is now a branch (called "opsynonym") in xapian SVN containing the implementation of this feature. The status is essentially still that described in the previous comment, though.

Changed 12 months ago by richard

  • blocking changed from 104 to 104, 160

Changed 12 months ago by trac

  • platform set to All

Changed 8 months ago by richard

  • description modified (diff)
  • milestone set to 1.1

Changed 8 months ago by richard

  • blocking changed from 104, 160 to 104

Changed 8 months ago by richard

I'm slowly working on merging the opsysnoym branch to HEAD. The outstanding aspects are that the calculation of the percentage weight when the top query doesn't match 100% of the terms is broken for synonyms, which requires reworking, and that relevance reweighting doesn't work correctly with synonyms. I'd be willing to merge to HEAD once all percentage calculation work is complete, because the relevance reweighting stuff is relatively unused, and the incorrect value isn't terribly incorrect (an incorrect value for the relevance term frequency is used in the calculation, but this doesn't break anything badly).

Changed 6 weeks ago by richard

  • blocking changed from 104 to 104, 307
Note: See TracTickets for help on using tickets.