Opened 15 years ago
Closed 12 months ago
#423 closed defect (fixed)
Document termlist get_termfreq() method behaviour depends on whether terms are cached
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.0 |
Component: | Library API | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description (last modified by )
The TermIterator objects returned by Document.termlist_begin(), for a document obtained from a database, can sometimes be used to obtain the term frequency, and sometimes can't be. It's unpredictable which behaviour is obtained, unless you know the details of the implementation of caching of terms in Documents.
For example, the following python code uses this to obtain the frequency, and works fine:
import xapian db=xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE) doc=xapian.Document() doc.add_term('foo') db.add_document(doc) doc=db.get_document(1) t=doc.termlist() item=next(t) item.termfreq
However, the following code (with one added line) doesn't:
import xapian db=xapian.WritableDatabase('foo', xapian.DB_CREATE_OR_OVERWRITE) doc=xapian.Document() doc.add_term('foo') db.add_document(doc) doc=db.get_document(1) doc.termlist_count() # Added line t=doc.termlist() item=next(t) item.termfreq
For me, this code raises: "InvalidOperationError: Can't get term frequency from a document termlist which is not associated with a database."
This behaviour is because the termlist_count() method causes the terms to be loaded into the document, and Document then uses a MapTermList to return the term iterator.
Not sure of the easiest way to fix this - we could make MapTermList be able to keep a reference to a database, and pass off such requests to the database if set (or, better, subclass MapTermList for documents which are connected to a database).
Change History (9)
comment:1 by , 15 years ago
Description: | modified (diff) |
---|
comment:2 by , 14 years ago
Milestone: | → 1.2.x |
---|
comment:3 by , 12 years ago
Milestone: | 1.2.x → 1.3.x |
---|
comment:4 by , 10 years ago
This would actually be fairly easy to fix by changing Document::termlist_count()
to open the termlist and call get_approx_size()
on it (which is exact for a leaf termlist) to get this info from the termlist table, or by calling get_freqs()
on the database to get the term frequency from the postlist table. The advantage of the former is that it is more cache friendly if we then go on to read the termlist, while the latter works even if there's no termlist table.
Once you actually modify a document from a database, I think it actually makes most sense for a request for the termfreq to fail as the document is no longer one that's actually in the database (though at first encounter this might seem surprising).
However, just to add to the fun, the termfreq reported here is only for the current subdatabase, and to fix that I think we'd need to keep a reference to the whole Xapian::Database
rather than just the current subdb. Perhaps TermIterator::get_termfreq()
should just be deprecated - all it actually does is to effectively call Database::get_termfreq(*it)
.
comment:5 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|
An issue with deprecating it is that for an all terms iterator, it returns the termfreq efficiently (and probably usefully).
This isn't something I'd want to block 1.4.0 on, so bumping the milestone. If we have a fix it can go in still - the hard part of fixing it seems to really be to decide how best to do so...
comment:6 by , 6 years ago
In current master and 1.4 branch, apitest testcases adddoc2 and adddoc5 are disabled for sharded databases because they would fail due to the termfreq only being for the current shard (as noted in the final paragraph in comment:4).
follow-up: 8 comment:7 by , 20 months ago
Perhaps we should only support TermIterator::get_termfreq()
for an allterms iterator.
comment:8 by , 20 months ago
Description: | modified (diff) |
---|---|
Milestone: | 1.4.x → 1.5.0 |
The original issue reported here was actually fixed a while ago by [747ba3ef354a2b0180b18f0d878e1add7697ae5b] which reimplemented Document::Internal
and implemented termlist_count()
differently (essentially in the way I suggested above). That change wasn't backported so this will be fixed in 1.5.0.
There's also the related issue that the termfreq reported here is only for the current subdatabase.
Replying to Olly Betts:
Perhaps we should only support
TermIterator::get_termfreq()
for an allterms iterator.
I had a look at what that would involve.
It needs extra work to make this fail in the document case without also making it fail for an iterator from db.termlist_begin(docid)
.
It requires that REPLY_TERMLIST
stop eagerly calling get_termfreq()
for each entry, which is a speed-up if this information isn't wanted, but probably a significant slow-down if it is. However this change would break get_eset()
with a remote shard. Ideally that would do more work remotely, but I'm not sure how feasible that actually is when there are both local and remote shards. It probably would need get_eset()
to work more like the matcher, which would not be a bad thing but is a significant amount of development work.
Overall I think it's best to resolve this part by documenting that the TermIterator
here only knows about the shard (effectively the Document
object only knows about the shard it is in since get_docid()
reports the docid in the shard, so this is reasonably consistent). If you really want the termfreq from the full database then you can keep a reference to it and call db.get_termfreq(*term_iterator)
.
comment:9 by , 12 months ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Overall I think it's best to resolve this part by documenting that the TermIterator here only knows about the shard (effectively the Document object only knows about the shard it is in since get_docid() reports the docid in the shard, so this is reasonably consistent). If you really want the termfreq from the full database then you can keep a reference to it and call db.get_termfreq(*term_iterator).
Documented in ed5be3aee549cd295474ebe497f2a8512bb82f5e.
Backported for 1.4.25 in 5f8be092464b5649225dfc3a41850ef2ee334f39.
I've also adjusted testcases adddoc2 and adddoc5 to run for sharded databases, but to skip the TermIterator::get_termfreq()
checks in 66a48b1ee0122cc15e73d71ded2fff02de5edb6e, backported in f9c8141a3074f215ee912bd17e13c8471ae100d6.
Hmm, unpleasant, but also fairly obscure. Bumping to 1.3.x, though it would be good to improve in 1.2 too (if only with a comment in the API docs or a tweak to make that exception UnimplementedError in this case, or at least give it a more helpful exception message).