#47 closed enhancement (released)
AllDocumentIterator
Reported by: | Olly Betts | Owned by: | Richard Boulton |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | Backend-Flint | Version: | other |
Severity: | minor | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | #99 | Operating System: | other |
Description
Provide fake term (empty termname) which indexes all documents, thus providing a clean way to iterate through them. This would be used for a real "NOT" operator.
Attachments (3)
Change History (19)
comment:1 by , 20 years ago
Severity: | blocker → normal |
---|---|
Status: | new → assigned |
comment:2 by , 20 years ago
comment:3 by , 18 years ago
Component: | other → Backend-Flint |
---|---|
Severity: | normal → enhancement |
comment:4 by , 18 years ago
Cc: | added |
---|
Just to note - I'm going to be looking at this for a customer in the next couple of weeks. Added myself to the CC list; Olly should feel free to assign the bug to me if he likes.
comment:5 by , 18 years ago
Cc: | added; removed |
---|---|
Owner: | changed from | to
Status: | assigned → new |
comment:6 by , 18 years ago
I've brought the patch up to date for the inmemory case - will look at the quartz and flint cases shortly.
I was wondering, though, whether the idea that the empty term is a term which matches all documents should be implemented for other methods. For example, should Xapian::Database::term_exists("") always return true (except for a database which contains no documents). And, should get_termfreq("") and get_collection_freq("") return the size of the database? It seems to me that they should, for consistency of the metaphor, if for no other reason.
by , 18 years ago
Attachment: | alldocsiterator.patch added |
---|
Prototype patch against old version of Xapian
comment:7 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:8 by , 18 years ago
Cc: | removed |
---|
Regarding whether an empty termname should "work" everywhere, I think it probably might as well provided it doesn't require real effort to support (which I don't think it will). I'd be dubious about (for example) add a new member variable to a class just to support this.
comment:9 by , 18 years ago
Having difficulty implementing this for quartz and flint. I tried using the values table to get the list of document IDs, but unfortunately the values are not stored in docid order (they're keyed by the docid packed using pack_uint_last()). The record and termlist tables also use the same ordering.
I can think of only two possible approaches, without changing the on-disk storage:
- Write complicated code to seek around in the table to get the list out in
ascending order.
- Read the whole list of document IDs into memory, and then sort it. (Could
use range-compression in memory to avoid using too much memory in the common case.)
Alternatively, we could make a whole new table to store the list of active document IDs, or change the order of the information held in one of these tables to be keyed by something which sorts in document ID order, but that seems a rather invasive change for a small feature.
I'm inclining towards reading the whole thing into memory at present, since seeking randomly through the file sounds likely to be disasterous, performance wise. The whole memory thing could be a lot better, as long as the document IDs currently existing are in reasonably contiguous blocks.
Any better ideas?
comment:10 by , 18 years ago
The ordering is as you describe for quartz, but flint has the entries in docid order in termlist, record, and value tables (this is one of the differences).
I wouldn't worry too much about supporting this for quartz - I think we'll make the current flint the default backend in 1.0 so use of quartz will decline sharply soon.
Incidentally, the value table isn't a good choice, since if there are no values, it'll be empty! Termlist is probably the best choice.
comment:11 by , 18 years ago
Another way to iterate all documents in order, which would work for quartz, is to simply start at n=1 and have something like:
while (n <= get_lastdocid() && !document_exists(n)) ++n;
This is what examples/copydatabase.cc does. It's best in the (common) case where there aren't many gaps, and doesn't suffer from the working set size issues which trying to suck all the document ids into memory does for a large database.
by , 18 years ago
Attachment: | alldocpostlists.patch added |
---|
Updated patch for inmemory case, and tests
comment:12 by , 18 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:13 by , 18 years ago
Seems a reasonable approach, as range compressing should give most of the benefits for a typical non-sparse database (except for bad cases like every other document being there!) but is likely to be better for sparse cases.
It's less good for partial iteration, which might make a difference when the matcher can terminate early.
But I don't think at this point quartz performance is an issue - as I said in an earlier comment, I wouldn't object if this wasn't implemented at all for quartz!
My plan is to try to avoid commits which are disruptive for a few days in case we need a rapid 0.9.8 - there are a lot of changes between 0.9.6 and 0.9.7, and based on past experience we may need a "mopping up" release and it's easier not to have to apply changes to multiple branches. Then we can merge in the utf8 branch and commit this patch and anything else that's queued up.
comment:14 by , 18 years ago
Blocking: | 99 added |
---|
comment:15 by , 18 years ago
Status: | new → assigned |
---|
It's been a few days since the 0.9.7 release now, and there don't seem to have been any major problems. Time to commit this patch?
comment:16 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
OK, I've just applied this.
I rearranged the tests a little - I'm hoping to implement the missing methods for the remote backend fairly soon, so I tweaked the new tests to be ready for that - they currently skip if the first call to postlist_begin() throws UnimplementedError (like several existing tests do). This way I won't need to rearrange the testsuite when I implement the missing methods - the tests will just start passing instead of skipping and the UnimplementedError checks can then be retired.
Also get_database() just duplicates what get_database("") already does so I cut that in the interests of avoiding proliferation of different ways to do that same thing.
comment:17 by , 18 years ago
Operating System: | → other |
---|---|
Resolution: | fixed → released |
Fixed in 1.0.0 release.
This is the prototype patch I created in November 2002 (just to stop it getting lost - it won't apply cleanly now).