Opened 15 years ago

Closed 12 months ago

#423 closed defect (fixed)

Document termlist get_termfreq() method behaviour depends on whether terms are cached

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Library API Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

The TermIterator objects returned by Document.termlist_begin(), for a document obtained from a database, can sometimes be used to obtain the term frequency, and sometimes can't be. It's unpredictable which behaviour is obtained, unless you know the details of the implementation of caching of terms in Documents.

For example, the following python code uses this to obtain the frequency, and works fine:

import xapian
db=xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE)
doc=xapian.Document()
doc.add_term('foo')
db.add_document(doc)
doc=db.get_document(1)
t=doc.termlist()
item=next(t)
item.termfreq

However, the following code (with one added line) doesn't:

import xapian
db=xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE)
doc=xapian.Document()
doc.add_term('foo')
db.add_document(doc)
doc=db.get_document(1)
doc.termlist_count()   # Added line
t=doc.termlist()
item=next(t)
item.termfreq

For me, this code raises: "InvalidOperationError: Can't get term frequency from a document termlist which is not associated with a database."

This behaviour is because the termlist_count() method causes the terms to be loaded into the document, and Document then uses a MapTermList to return the term iterator.

Not sure of the easiest way to fix this - we could make MapTermList be able to keep a reference to a database, and pass off such requests to the database if set (or, better, subclass MapTermList for documents which are connected to a database).

Change History (9)

comment:1 by Richard Boulton, 15 years ago

Description: modified (diff)

comment:2 by Olly Betts, 14 years ago

Milestone: 1.2.x

comment:3 by Olly Betts, 12 years ago

Milestone: 1.2.x1.3.x

Hmm, unpleasant, but also fairly obscure. Bumping to 1.3.x, though it would be good to improve in 1.2 too (if only with a comment in the API docs or a tweak to make that exception UnimplementedError in this case, or at least give it a more helpful exception message).

comment:4 by Olly Betts, 10 years ago

This would actually be fairly easy to fix by changing Document::termlist_count() to open the termlist and call get_approx_size() on it (which is exact for a leaf termlist) to get this info from the termlist table, or by calling get_freqs() on the database to get the term frequency from the postlist table. The advantage of the former is that it is more cache friendly if we then go on to read the termlist, while the latter works even if there's no termlist table.

Once you actually modify a document from a database, I think it actually makes most sense for a request for the termfreq to fail as the document is no longer one that's actually in the database (though at first encounter this might seem surprising).

However, just to add to the fun, the termfreq reported here is only for the current subdatabase, and to fix that I think we'd need to keep a reference to the whole Xapian::Database rather than just the current subdb. Perhaps TermIterator::get_termfreq() should just be deprecated - all it actually does is to effectively call Database::get_termfreq(*it).

comment:5 by Olly Betts, 9 years ago

Milestone: 1.3.x1.4.x

An issue with deprecating it is that for an all terms iterator, it returns the termfreq efficiently (and probably usefully).

This isn't something I'd want to block 1.4.0 on, so bumping the milestone. If we have a fix it can go in still - the hard part of fixing it seems to really be to decide how best to do so...

comment:6 by Olly Betts, 6 years ago

In current master and 1.4 branch, apitest testcases adddoc2 and adddoc5 are disabled for sharded databases because they would fail due to the termfreq only being for the current shard (as noted in the final paragraph in comment:4).

comment:7 by Olly Betts, 20 months ago

Perhaps we should only support TermIterator::get_termfreq() for an allterms iterator.

in reply to:  7 comment:8 by Olly Betts, 20 months ago

Description: modified (diff)
Milestone: 1.4.x1.5.0

The original issue reported here was actually fixed a while ago by [747ba3ef354a2b0180b18f0d878e1add7697ae5b] which reimplemented Document::Internal and implemented termlist_count() differently (essentially in the way I suggested above). That change wasn't backported so this will be fixed in 1.5.0.

There's also the related issue that the termfreq reported here is only for the current subdatabase.

Replying to Olly Betts:

Perhaps we should only support TermIterator::get_termfreq() for an allterms iterator.

I had a look at what that would involve.

It needs extra work to make this fail in the document case without also making it fail for an iterator from db.termlist_begin(docid).

It requires that REPLY_TERMLIST stop eagerly calling get_termfreq() for each entry, which is a speed-up if this information isn't wanted, but probably a significant slow-down if it is. However this change would break get_eset() with a remote shard. Ideally that would do more work remotely, but I'm not sure how feasible that actually is when there are both local and remote shards. It probably would need get_eset() to work more like the matcher, which would not be a bad thing but is a significant amount of development work.

Overall I think it's best to resolve this part by documenting that the TermIterator here only knows about the shard (effectively the Document object only knows about the shard it is in since get_docid() reports the docid in the shard, so this is reasonably consistent). If you really want the termfreq from the full database then you can keep a reference to it and call db.get_termfreq(*term_iterator).

comment:9 by Olly Betts, 12 months ago

Resolution: fixed
Status: newclosed

Overall I think it's best to resolve this part by documenting that the TermIterator here only knows about the shard (effectively the Document object only knows about the shard it is in since get_docid() reports the docid in the shard, so this is reasonably consistent). If you really want the termfreq from the full database then you can keep a reference to it and call db.get_termfreq(*term_iterator).

Documented in ed5be3aee549cd295474ebe497f2a8512bb82f5e.

Backported for 1.4.25 in 5f8be092464b5649225dfc3a41850ef2ee334f39.

I've also adjusted testcases adddoc2 and adddoc5 to run for sharded databases, but to skip the TermIterator::get_termfreq() checks in 66a48b1ee0122cc15e73d71ded2fff02de5edb6e, backported in f9c8141a3074f215ee912bd17e13c8471ae100d6.

Note: See TracTickets for help on using tickets.