Opened 20 years ago

Closed 18 years ago

Last modified 18 years ago

#47 closed enhancement (released)

AllDocumentIterator

Reported by: Olly Betts Owned by: Richard Boulton
Priority: high Milestone:
Component: Backend-Flint Version: other
Severity: minor Keywords:
Cc: Blocked By:
Blocking: #99 Operating System: other

Description

Provide fake term (empty termname) which indexes all documents, thus providing a clean way to iterate through them. This would be used for a real "NOT" operator.

Attachments (3)

alldocsiterator.patch (8.2 KB ) - added by Olly Betts 18 years ago.
Prototype patch against old version of Xapian
alldocpostlists.2.patch (50.2 KB ) - added by Richard Boulton 18 years ago.
Complete implementation of all document postlists
alldocpostlists.patch (13.3 KB ) - added by Richard Boulton 18 years ago.
Updated patch for inmemory case, and tests

Download all attachments as: .zip

Change History (19)

comment:1 by Olly Betts, 20 years ago

Severity: blockernormal
Status: newassigned

comment:2 by Olly Betts, 19 years ago

This is the prototype patch I created in November 2002 (just to stop it getting lost - it won't apply cleanly now).

comment:3 by Olly Betts, 18 years ago

Component: otherBackend-Flint
Severity: normalenhancement

comment:4 by Richard Boulton, 18 years ago

Cc: richard@… added

Just to note - I'm going to be looking at this for a customer in the next couple of weeks. Added myself to the CC list; Olly should feel free to assign the bug to me if he likes.

comment:5 by Olly Betts, 18 years ago

Cc: olly@… added; richard@… removed
Owner: changed from Olly Betts to Richard Boulton
Status: assignednew

comment:6 by Richard Boulton, 18 years ago

I've brought the patch up to date for the inmemory case - will look at the quartz and flint cases shortly.

I was wondering, though, whether the idea that the empty term is a term which matches all documents should be implemented for other methods. For example, should Xapian::Database::term_exists("") always return true (except for a database which contains no documents). And, should get_termfreq("") and get_collection_freq("") return the size of the database? It seems to me that they should, for consistency of the metaphor, if for no other reason.

by Olly Betts, 18 years ago

Attachment: alldocsiterator.patch added

Prototype patch against old version of Xapian

comment:7 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

comment:8 by Olly Betts, 18 years ago

Cc: olly@… removed

Regarding whether an empty termname should "work" everywhere, I think it probably might as well provided it doesn't require real effort to support (which I don't think it will). I'd be dubious about (for example) add a new member variable to a class just to support this.

comment:9 by Richard Boulton, 18 years ago

Having difficulty implementing this for quartz and flint. I tried using the values table to get the list of document IDs, but unfortunately the values are not stored in docid order (they're keyed by the docid packed using pack_uint_last()). The record and termlist tables also use the same ordering.

I can think of only two possible approaches, without changing the on-disk storage:

  1. Write complicated code to seek around in the table to get the list out in

ascending order.

  1. Read the whole list of document IDs into memory, and then sort it. (Could

use range-compression in memory to avoid using too much memory in the common case.)

Alternatively, we could make a whole new table to store the list of active document IDs, or change the order of the information held in one of these tables to be keyed by something which sorts in document ID order, but that seems a rather invasive change for a small feature.

I'm inclining towards reading the whole thing into memory at present, since seeking randomly through the file sounds likely to be disasterous, performance wise. The whole memory thing could be a lot better, as long as the document IDs currently existing are in reasonably contiguous blocks.

Any better ideas?

comment:10 by Olly Betts, 18 years ago

The ordering is as you describe for quartz, but flint has the entries in docid order in termlist, record, and value tables (this is one of the differences).

I wouldn't worry too much about supporting this for quartz - I think we'll make the current flint the default backend in 1.0 so use of quartz will decline sharply soon.

Incidentally, the value table isn't a good choice, since if there are no values, it'll be empty! Termlist is probably the best choice.

comment:11 by Olly Betts, 18 years ago

Another way to iterate all documents in order, which would work for quartz, is to simply start at n=1 and have something like:

while (n <= get_lastdocid() && !document_exists(n)) ++n;

This is what examples/copydatabase.cc does. It's best in the (common) case where there aren't many gaps, and doesn't suffer from the working set size issues which trying to suck all the document ids into memory does for a large database.

by Richard Boulton, 18 years ago

Attachment: alldocpostlists.patch added

Updated patch for inmemory case, and tests

comment:12 by Richard Boulton, 18 years ago

attachments.isobsolete: 01

comment:13 by Olly Betts, 18 years ago

Seems a reasonable approach, as range compressing should give most of the benefits for a typical non-sparse database (except for bad cases like every other document being there!) but is likely to be better for sparse cases.

It's less good for partial iteration, which might make a difference when the matcher can terminate early.

But I don't think at this point quartz performance is an issue - as I said in an earlier comment, I wouldn't object if this wasn't implemented at all for quartz!

My plan is to try to avoid commits which are disruptive for a few days in case we need a rapid 0.9.8 - there are a lot of changes between 0.9.6 and 0.9.7, and based on past experience we may need a "mopping up" release and it's easier not to have to apply changes to multiple branches. Then we can merge in the utf8 branch and commit this patch and anything else that's queued up.

comment:14 by Richard Boulton, 18 years ago

Blocking: 99 added

comment:15 by Richard Boulton, 18 years ago

Status: newassigned

It's been a few days since the 0.9.7 release now, and there don't seem to have been any major problems. Time to commit this patch?

comment:16 by Olly Betts, 18 years ago

Resolution: fixed
Status: assignedclosed

OK, I've just applied this.

I rearranged the tests a little - I'm hoping to implement the missing methods for the remote backend fairly soon, so I tweaked the new tests to be ready for that - they currently skip if the first call to postlist_begin() throws UnimplementedError (like several existing tests do). This way I won't need to rearrange the testsuite when I implement the missing methods - the tests will just start passing instead of skipping and the UnimplementedError checks can then be retired.

Also get_database() just duplicates what get_database("") already does so I cut that in the interests of avoiding proliferation of different ways to do that same thing.

comment:17 by Olly Betts, 18 years ago

Operating System: other
Resolution: fixedreleased

Fixed in 1.0.0 release.

Note: See TracTickets for help on using tickets.