Opened 17 years ago

Closed 17 years ago

Last modified 16 years ago

#125 closed defect (released)

Python posting iterators should access information lazily

Reported by: Richard Boulton
Owned by: Richard Boulton
Priority: normal
Milestone: 1.0.0
Component: Xapian-bindings
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking: #118
Operating System: All

Description (last modified by Richard Boulton)

Currently the Python implementation of PostingIter returns a list for each item, containing the docid, length, wdf, and a position iterator. The position iterator in particular is expensive to generate for each item, so it would be better to move to a lazy implementation which returns a Posting object, with accessor methods to get at each of the pieces of information. This would allow any laziness in the Xapian API implementation to benefit Python applications.
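As a rough illustration of the lazy approach described above, here is a minimal sketch in plain Python. It does not use the Xapian bindings: the `Posting` class, the `lazy_postings` generator, the `load_positions` callback, and the sample rows are all hypothetical stand-ins, chosen only to show that the expensive position data is built when a caller asks for it and not before.

```python
class Posting:
    """Hypothetical posting object: cheap fields are stored directly,
    the position list is only fetched when it is requested."""

    def __init__(self, docid, doclength, wdf, load_positions):
        self.docid = docid
        self.doclength = doclength
        self.wdf = wdf
        self._load_positions = load_positions

    @property
    def positions(self):
        # The expensive call happens here, not when the posting is created.
        return iter(self._load_positions())


def lazy_postings(rows, load_positions):
    """Yield one Posting per (docid, doclength, wdf) row."""
    for docid, doclength, wdf in rows:
        # Bind docid as a default argument so each posting loads its own data.
        yield Posting(docid, doclength, wdf, lambda d=docid: load_positions(d))


# Assumed stand-in for an expensive backend call; it records each use.
position_loads = []


def load_positions(docid):
    position_loads.append(docid)
    return [docid, docid + 10]


rows = [(1, 100, 2), (2, 80, 1), (3, 120, 4)]

# Walking the whole list for docids never touches the position data.
docids = [p.docid for p in lazy_postings(rows, load_positions)]
```

After this runs, `position_loads` is still empty; it only grows when some caller reads `positions` on a particular posting.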

I'll take a look at this shortly; it shouldn't be hard to fix, but implementing a backwards-compatible interface would be a pain, so I'll mark it as blocking 1.0. It would also be good to implement similar laziness for all the Python iterator implementations, in particular MSetIterators (which currently always call get_document() whether or not the document contents are wanted) and TermIterators, which have a similar issue with position lists.

Change History (9)

comment:1 by Richard Boulton, 17 years ago

Blocking: 118 added

comment:2 by Olly Betts, 17 years ago

rep_platform: PC → All
Status: new → assigned

I totally agree about the laziness - the C++ API is generally careful to be lazy in these cases so it's a shame to blow it in the wrappers.

I'm also happy for us to require minor updates to existing code if it results in a substantially better API going forwards, especially if we can sort this out for 1.0.

Is there a need for a Posting object? It would probably be simpler to just make these methods of the PostingIter (like we do in C++!)

comment:3 by Richard Boulton, 17 years ago

The usual python idiom would be to say:

    for obj in db.postlist(tname):
        do_stuff_on(obj)

If the methods are on the iterator, and we just return the docid for each posting from next(), this has to change to:

    it = db.postlist(tname)
    for docid in it:
        do_stuff_on(docid, it)

which is much less tidy, and python programmers are likely to moan.

Yes, returning a posting object each time means that an extra object creation happens each time a posting is returned - but this is Python and objects are being created and destroyed all the time - tidy programming is more important here, I think.
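To make the comparison concrete, the two API shapes under discussion could be sketched roughly as follows. This is plain Python over an in-memory list, not the Xapian bindings: `MethodStyleIter`, `PostingItem`, `object_style_iter`, and the `ROWS` data are hypothetical names used for illustration only.

```python
# Hypothetical in-memory posting data: (docid, wdf) pairs.
ROWS = [(3, 2), (7, 1), (9, 5)]


class MethodStyleIter:
    """Yields bare docids; extra data is read from the iterator itself."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = -1

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos + 1 >= len(self._rows):
            raise StopIteration
        self._pos += 1
        return self._rows[self._pos][0]

    def get_wdf(self):
        # Only meaningful for the posting the iterator currently points at.
        return self._rows[self._pos][1]


class PostingItem:
    """One object per posting, carrying its own data."""

    def __init__(self, docid, wdf):
        self.docid = docid
        self.wdf = wdf


def object_style_iter(rows):
    for docid, wdf in rows:
        yield PostingItem(docid, wdf)


# Style 1: the caller has to keep a name for the iterator.
it = MethodStyleIter(ROWS)
method_style = [(docid, it.get_wdf()) for docid in it]

# Style 2: the loop variable carries everything that is needed.
object_style = [(p.docid, p.wdf) for p in object_style_iter(ROWS)]
```

Both loops produce the same pairs; the difference is only in whether the caller has to hold on to the iterator, which is the tidiness argument made above.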

(We already have to use the second form if we want to be able to call skip_to() on the iterator, but that's much more reasonable conceptually, and most use of the postlist iterators from Python probably won't require skip_to(). Actually, I note that skip_to() isn't implemented for the pythonic iterators; I'll fix that, but the fact that no-one has asked for it implies that it's not terribly needed.)

comment:4 by Richard Boulton, 17 years ago

I've made a start on tidying up the python iterators: I started with the mset iterator, since it's at the top of the file. The changes are backwards compatible, thanks to a _SequenceMixIn class.
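For readers wondering how a mix-in can keep old list-style code working, here is one possible shape for the idea. It is a sketch under assumptions, not the actual `_SequenceMixIn` from the bindings: the `SequenceCompatMixin` class, its `_sequence_fields` attribute, and the `HitItem` example class are all hypothetical.

```python
class SequenceCompatMixin:
    """Hypothetical mix-in: lets an attribute-based item still behave
    like the old list for indexing, len() and unpacking."""

    # Subclasses list attribute names in the old list order.
    _sequence_fields = ()

    def __getitem__(self, index):
        names = self._sequence_fields
        if isinstance(index, slice):
            return [getattr(self, name) for name in names[index]]
        # An out-of-range index raises IndexError, as a list would.
        return getattr(self, names[index])

    def __len__(self):
        return len(self._sequence_fields)

    def __iter__(self):
        return (getattr(self, name) for name in self._sequence_fields)


class HitItem(SequenceCompatMixin):
    """Example item with three fields, assumed for illustration."""

    _sequence_fields = ("docid", "weight", "rank")

    def __init__(self, docid, weight, rank):
        self.docid = docid
        self.weight = weight
        self.rank = rank


item = HitItem(42, 1.5, 0)

# New style: named attributes.
new_style = (item.docid, item.weight, item.rank)

# Old style: positional access and unpacking keep working.
docid, weight, rank = item
old_style = (item[0], item[1], item[2])
```

Because each index is resolved through `getattr`, a field implemented as a lazy property would only be computed when old-style code actually indexes it.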

I'll get on to the other iterators later this week (they should take much less time now that I've got the infrastructure in place).

comment:5 by Richard Boulton, 17 years ago

Update on status of this bug: About 60% complete, I reckon, but a couple of hours hacking should finish it off.

MSet, ESet, and Termlist iterators are implemented. Posting, Position, and Value iterators are not.

The only tricky outstanding issue is for posting and termlist iterators for which there is no efficient way to access ancillary information (such as, with the posting lists, the document length) once the underlying information has moved on. But I think I have a good solution for that.
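The comment does not say which solution was chosen. Purely as an illustration of the problem, one possible approach is for each item to remember where the iterator was when the item was created, and to refuse lazy access once the iterator has moved on. The sketch below is plain Python with hypothetical names (`PostingCursor`, `PostingItem`, `StalePostingError`) and makes no claim about what the bindings actually do.

```python
class StalePostingError(Exception):
    """Raised when lazy data is requested after the iterator has moved."""


class PostingItem:
    def __init__(self, cursor, position, docid):
        self._cursor = cursor
        self._position = position
        # The docid is cheap, so it is copied out immediately.
        self.docid = docid

    @property
    def doclength(self):
        # Lazy field: only valid while the cursor is still on this posting.
        if self._cursor._pos != self._position:
            raise StalePostingError("iterator has moved on")
        return self._cursor._current_doclength()


class PostingCursor:
    """Hypothetical forward-only cursor over (docid, doclength) rows."""

    def __init__(self, rows):
        self._rows = rows
        self._pos = -1

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos + 1 >= len(self._rows):
            raise StopIteration
        self._pos += 1
        return PostingItem(self, self._pos, self._rows[self._pos][0])

    def _current_doclength(self):
        return self._rows[self._pos][1]


cursor = PostingCursor([(1, 100), (2, 80)])
first = next(cursor)
first_length = first.doclength   # fine: the cursor is still on this posting
second = next(cursor)            # the cursor moves; `first` is now stale
```

In this sketch `first.docid` stays readable after the move, while `first.doclength` raises `StalePostingError`. The alternative of copying every field eagerly would avoid the error but give up the laziness the ticket is asking for.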

comment:6 by Richard Boulton, 17 years ago

Now completed for all but Posting iterators.

comment:7 by Richard Boulton, 17 years ago

Resolution: fixed
Status: assigned → closed

Now implemented and tested for all python iterators, so marking this bug as fixed.

comment:8 by Olly Betts, 17 years ago

Operating System: All
Resolution: fixed → released

Fixed in 1.0.0 release.

comment:10 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.0.0