#125 closed defect (released)
Python posting iterators should access information lazily
Reported by: | Richard Boulton | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.0 |
Component: | Xapian-bindings | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | #118 | Operating System: | All |
Description (last modified by )
Currently the Python implementation of PostingIter returns a list for each item, containing the docid, length, wdf, and a position iterator. The position iterator in particular is expensive to generate for each item, so it would be better to move to a lazy implementation which returns a Posting object, with accessor methods to get at each of the pieces of information. This would allow any laziness in the Xapian API implementation to benefit Python applications.
I'll take a look at this shortly; it shouldn't be hard to fix, but implementing a backwards-compatible interface would be a pain, so I'll mark it as blocking 1.0. It would also be good to implement similar laziness for all the Python iterator implementations - in particular, MSetIterators (which currently always call get_document() whether the document contents are wanted or not) and TermIterators, which have a similar issue with position lists.
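As a rough illustration of the lazy Posting object proposed above, here is a minimal sketch. It does not use the Xapian bindings: `FakePostingSource`, `Posting`, `postlist` and every attribute name are hypothetical stand-ins, and properties are used in place of accessor methods purely for brevity. The call counter is only there to show that the expensive position iterator is never built unless it is requested.

```python
class FakePostingSource:
    """Hypothetical stand-in for the underlying posting list (not the Xapian API)."""

    def __init__(self, rows):
        # Each row is (docid, doclength, wdf, positions).
        self._rows = rows
        self.positer_calls = 0  # how many times the expensive call was made

    def __len__(self):
        return len(self._rows)

    def field(self, index, column):
        # Cheap lookup of a single value for one posting.
        return self._rows[index][column]

    def positer(self, index):
        # Pretend this is the expensive operation.
        self.positer_calls += 1
        return iter(self._rows[index][3])


class Posting:
    """Lazy view of one posting: nothing is fetched until it is asked for."""

    def __init__(self, source, index):
        self._source = source
        self._index = index

    @property
    def docid(self):
        return self._source.field(self._index, 0)

    @property
    def doclength(self):
        return self._source.field(self._index, 1)

    @property
    def wdf(self):
        return self._source.field(self._index, 2)

    @property
    def positer(self):
        return self._source.positer(self._index)


def postlist(source):
    """Yield one lazy Posting per entry, in order."""
    for index in range(len(source)):
        yield Posting(source, index)


source = FakePostingSource([(1, 10, 2, [3, 7]), (4, 20, 1, [5])])
docids = [posting.docid for posting in postlist(source)]
# Only docids were read, so no position iterator was ever built.
```

With the current list-based interface, the equivalent loop would have built a position iterator for every posting even though only the docid was used.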
Change History (9)
comment:1 by , 18 years ago
Blocking: | 118 added |
---|
comment:2 by , 18 years ago
rep_platform: | PC → All |
---|---|
Status: | new → assigned |
comment:3 by , 18 years ago
The usual python idiom would be to say:
for obj in db.postlist(tname):
    do_stuff_on(obj)
If the methods are on the iterator, and we just return the docid for each posting from next(), this has to change to:
it = db.postlist(tname)
for docid in it:
    do_stuff_on(docid, it)
which is much less tidy, and python programmers are likely to moan.
Yes, returning a posting object each time means that an extra object creation happens each time a posting is returned - but this is Python and objects are being created and destroyed all the time - tidy programming is more important here, I think.
(We already have to use the second form if we want to be able to call skip_to() on the iterator, but that's much more reasonable conceptually, and most use of the postlist iterators from Python probably won't require skip_to(). Actually, I note that skip_to() isn't implemented for the pythonic iterators; I'll fix that, but the fact that no-one has asked for it implies that it's not terribly needed.)
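To make the skip_to() case concrete, here is a small sketch of why the iterator has to be held in a variable. `PostingIter` below is a hypothetical pure-Python class over a sorted list of docids, not the Xapian binding, and the assumed semantics (after `skip_to(n)` the next item returned is the first docid >= n) are an assumption made for the example.

```python
import bisect


class PostingIter:
    """Hypothetical iterator over sorted docids with a skip_to() method."""

    def __init__(self, docids):
        self._docids = sorted(docids)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._docids):
            raise StopIteration
        docid = self._docids[self._pos]
        self._pos += 1
        return docid

    def skip_to(self, docid):
        # Never move backwards: search only from the current position onwards.
        self._pos = bisect.bisect_left(self._docids, docid, self._pos)


it = PostingIter([1, 3, 8, 12, 20])
seen = []
for docid in it:
    seen.append(docid)
    if docid == 3:
        it.skip_to(12)  # needs a reference to the iterator itself
```

Here docid 8 is never visited, because the loop body skipped past it.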
comment:4 by , 18 years ago
I've made a start on tidying up the python iterators: I started with the mset iterator, since it's at the top of the file. The changes are backwards compatible, thanks to a _SequenceMixIn class.
I'll get on to the other iterators later this week (they should take much less time now that I've got the infrastructure in place).
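For readers wondering how a mix-in can keep old list-style code working, here is one possible shape for it. This is a sketch of the general technique only, not the actual _SequenceMixIn from the bindings: the class `SequenceMixIn`, the `_sequence_items` attribute and the fields of `MSetItem` are all hypothetical.

```python
class SequenceMixIn:
    """Hypothetical mix-in: expose named attributes as a read-only sequence."""

    # Subclasses list the attribute names in the order old code expects.
    _sequence_items = ()

    def __len__(self):
        return len(self._sequence_items)

    def __getitem__(self, index):
        names = self._sequence_items[index]  # raises IndexError when out of range
        if isinstance(index, slice):
            return [getattr(self, name) for name in names]
        # Attributes are looked up on demand, so lazy properties stay lazy.
        return getattr(self, names)

    def __iter__(self):
        return (getattr(self, name) for name in self._sequence_items)


class MSetItem(SequenceMixIn):
    """New-style item with named attributes that still unpacks like a list."""

    _sequence_items = ("docid", "weight")

    def __init__(self, docid, weight):
        self.docid = docid
        self.weight = weight


item = MSetItem(7, 0.5)
docid, weight = item   # old style: unpack as a sequence
same_docid = item[0]   # old style: index as a sequence
new_style = item.docid # new style: named attribute
```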
comment:5 by , 18 years ago
Update on status of this bug: About 60% complete, I reckon, but a couple of hours' hacking should finish it off.
MSet, ESet, and Termlist iterators are implemented. Posting, Position, and Value iterators are not.
The only tricky outstanding issue is with the posting and termlist iterators, for which there is no efficient way to access ancillary information (such as, with the posting lists, the document length) once the underlying iterator has moved on. But I think I have a good solution for that.
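The comment above does not say what the solution is, so the following only illustrates the problem and one possible way of handling it; it should not be read as what was actually implemented. In this sketch (all names hypothetical, no Xapian calls) each item copies the cheap values when it is created, and refuses to fetch the expensive ones once the iterator it came from has advanced.

```python
class PostingIter:
    """Hypothetical forward-only iterator over (docid, wdf, positions) rows."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self._current = None
        self._moves = 0  # incremented every time the iterator advances

    def __iter__(self):
        return self

    def __next__(self):
        self._current = next(self._rows)  # StopIteration ends the loop
        self._moves += 1
        return PostingItem(self)


class PostingItem:
    def __init__(self, iterator):
        self._iter = iterator
        self._stamp = iterator._moves  # remember where the iterator was
        docid, wdf, _positions = iterator._current
        # Cheap values are copied eagerly, so they stay valid later.
        self.docid = docid
        self.wdf = wdf

    @property
    def positions(self):
        # Expensive value: only available while the iterator is still here.
        if self._iter._moves != self._stamp:
            raise RuntimeError("iterator has moved on; positions unavailable")
        return list(self._iter._current[2])


it = PostingIter([(1, 2, [3, 7]), (4, 1, [5])])
first = next(it)
first_positions = first.positions  # fine: iterator is still on this posting
second = next(it)                  # from now on, first.positions raises
```

The trade-off in this approach is that code which stores items and inspects them later gets a clear error rather than silently stale data.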
comment:7 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Now implemented and tested for all python iterators, so marking this bug as fixed.
comment:8 by , 18 years ago
Operating System: | → All |
---|---|
Resolution: | fixed → released |
Fixed in 1.0.0 release.
comment:10 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.0.0 |
I totally agree about the laziness - the C++ API is generally careful to be lazy in these cases so it's a shame to blow it in the wrappers.
I'm also happy for us to require minor updates to existing code if it results in a substantially better API going forwards, especially if we can sort this out for 1.0.
Is there a need for a Posting object? It would probably be simpler to just make these methods of the PostingIter (like we do in C++!)