Opened 14 years ago

Closed 12 years ago

Last modified 12 years ago

#47 closed enhancement (released)

AllDocumentIterator

Reported by: olly Owned by: richard
Priority: high Milestone:
Component: Backend-Flint Version: other
Severity: minor Keywords:
Cc: Blocked By:
Blocking: #99 Operating System: other

Description

Provide fake term (empty termname) which indexes all documents, thus providing a clean way to iterate through them. This would be used for a real "NOT" operator.

Attachments (3)

alldocsiterator.patch (8.2 KB) - added by olly 12 years ago.
Prototype patch against old version of Xapian
alldocpostlists.2.patch (50.2 KB) - added by richard 12 years ago.
Complete implementation of all document postlists
alldocpostlists.patch (13.3 KB) - added by richard 12 years ago.
Updated patch for inmemory case, and tests

Download all attachments as: .zip

Change History (19)

comment:1 Changed 14 years ago by olly

  • Severity changed from blocker to normal
  • Status changed from new to assigned

comment:2 Changed 13 years ago by olly

This is the prototype patch I created in November 2002 (just to stop it getting lost - it won't apply cleanly now).

comment:3 Changed 12 years ago by olly

  • Component changed from other to Backend-Flint
  • Severity changed from normal to enhancement

comment:4 Changed 12 years ago by richard

  • Cc richard@… added

Just to note - I'm going to be looking at this for a customer in the next couple of weeks. Added myself to the CC list; Olly should feel free to assign the bug to me if he likes.

comment:5 Changed 12 years ago by olly

  • Cc olly@… added; richard@… removed
  • Owner changed from olly to richard
  • Status changed from assigned to new

comment:6 Changed 12 years ago by richard

I've brought the patch up to date for the inmemory case - will look at the quartz and flint cases shortly.

I was wondering, though, whether the idea that the empty term is a term which matches all documents should be implemented for other methods. For example, should Xapian::Database::term_exists("") always return true (except for a database which contains no documents). And, should get_termfreq("") and get_collection_freq("") return the size of the database? It seems to me that they should, for consistency of the metaphor, if for no other reason.

Changed 12 years ago by olly

Prototype patch against old version of Xapian

comment:7 Changed 12 years ago by richard

  • attachments.isobsolete changed from 0 to 1

comment:8 Changed 12 years ago by olly

  • Cc olly@… removed

Regarding whether an empty termname should "work" everywhere, I think it probably might as well provided it doesn't require real effort to support (which I don't think it will). I'd be dubious about (for example) add a new member variable to a class just to support this.

comment:9 Changed 12 years ago by richard

Having difficulty implementing this for quartz and flint. I tried using the values table to get the list of document IDs, but unfortunately the values are not stored in docid order (they're keyed by the docid packed using pack_uint_last()). The record and termlist tables also use the same ordering.

I can think of only two possible approaches, without changing the on-disk storage:

  1. Write complicated code to seek around in the table to get the list out in

ascending order.

  1. Read the whole list of document IDs into memory, and then sort it. (Could

use range-compression in memory to avoid using too much memory in the common case.)

Alternatively, we could make a whole new table to store the list of active document IDs, or change the order of the information held in one of these tables to be keyed by something which sorts in document ID order, but that seems a rather invasive change for a small feature.

I'm inclining towards reading the whole thing into memory at present, since seeking randomly through the file sounds likely to be disasterous, performance wise. The whole memory thing could be a lot better, as long as the document IDs currently existing are in reasonably contiguous blocks.

Any better ideas?

comment:10 Changed 12 years ago by olly

The ordering is as you describe for quartz, but flint has the entries in docid order in termlist, record, and value tables (this is one of the differences).

I wouldn't worry too much about supporting this for quartz - I think we'll make the current flint the default backend in 1.0 so use of quartz will decline sharply soon.

Incidentally, the value table isn't a good choice, since if there are no values, it'll be empty! Termlist is probably the best choice.

comment:11 Changed 12 years ago by olly

Another way to iterate all documents in order, which would work for quartz, is to simply start at n=1 and have something like:

while (n <= get_lastdocid() && !document_exists(n)) ++n;

This is what examples/copydatabase.cc does. It's best in the (common) case where there aren't many gaps, and doesn't suffer from the working set size issues which trying to suck all the document ids into memory does for a large database.

Changed 12 years ago by richard

Updated patch for inmemory case, and tests

comment:12 Changed 12 years ago by richard

  • attachments.isobsolete changed from 0 to 1

comment:13 Changed 12 years ago by olly

Seems a reasonable approach, as range compressing should give most of the benefits for a typical non-sparse database (except for bad cases like every other document being there!) but is likely to be better for sparse cases.

It's less good for partial iteration, which might make a difference when the matcher can terminate early.

But I don't think at this point quartz performance is an issue - as I said in an earlier comment, I wouldn't object if this wasn't implemented at all for quartz!

My plan is to try to avoid commits which are disruptive for a few days in case we need a rapid 0.9.8 - there are a lot of changes between 0.9.6 and 0.9.7, and based on past experience we may need a "mopping up" release and it's easier not to have to apply changes to multiple branches. Then we can merge in the utf8 branch and commit this patch and anything else that's queued up.

comment:14 Changed 12 years ago by richard

  • Blocking 99 added

comment:15 Changed 12 years ago by richard

  • Status changed from new to assigned

It's been a few days since the 0.9.7 release now, and there don't seem to have been any major problems. Time to commit this patch?

comment:16 Changed 12 years ago by olly

  • Resolution set to fixed
  • Status changed from assigned to closed

OK, I've just applied this.

I rearranged the tests a little - I'm hoping to implement the missing methods for the remote backend fairly soon, so I tweaked the new tests to be ready for that - they currently skip if the first call to postlist_begin() throws UnimplementedError? (like several existing tests do). This way I won't need to rearrange the testsuite when I implement the missing methods - the tests will just start passing instead of skipping and the UnimplementedError? checks can then be retired.

Also get_database() just duplicates what get_database("") already does so I cut that in the interests of avoiding proliferation of different ways to do that same thing.

comment:17 Changed 12 years ago by olly

  • Operating System set to other
  • Resolution changed from fixed to released

Fixed in 1.0.0 release.

Note: See TracTickets for help on using tickets.