Opened 16 years ago

Closed 12 years ago

#387 closed enhancement (fixed)

Optimisation for a filter term matching all documents

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 1.3.1
Component: Matcher
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

If a term matches all documents (i.e. termfreq == doccount) and we don't want weight information for it (e.g. we're using OP_FILTER or BoolWeight) then we don't actually need to look at its postings at all - we can just treat it as QueryMatchAll.
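A minimal sketch of that check, as a toy model rather than the matcher's actual code; `TermStats`, `can_skip_postings` and `needs_weight` are names made up for illustration:

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    termfreq: int  # number of documents the term indexes
    doccount: int  # number of documents in the database


def can_skip_postings(stats: TermStats, needs_weight: bool) -> bool:
    """Return True if the term can be treated as a match-all query.

    If weights are wanted (i.e. not OP_FILTER / BoolWeight) the postings
    still have to be read, so the shortcut does not apply.
    """
    return not needs_weight and stats.termfreq == stats.doccount
```

For example, `can_skip_postings(TermStats(termfreq=5, doccount=5), needs_weight=False)` is true, while the same term under a weighted query is not eligible.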

This might not seem very useful, but if it worked at the submatch level, it would allow database selection from within the query string using "database:1" to filter by XDB1, or something like that.

This would also extend naturally to optimising a term which matches exactly a contiguous range of documents, if we start storing the highest and lowest document id a term indexes (which we plan to do at some point): if termfreq == highest_docid - lowest_docid + 1, then we know exactly which documents the term indexes without looking at the postings.
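The arithmetic for the contiguous-range case can be sketched as follows. This assumes the per-term lowest and highest document ids are available, which is not the case yet; `contiguous_docids` is a hypothetical helper name:

```python
from typing import Optional


def contiguous_docids(termfreq: int, lowest_docid: int,
                      highest_docid: int) -> Optional[range]:
    """Return the docids the term indexes if they form one contiguous run.

    If the number of postings equals the size of the docid span, no docid
    inside the span can be missing, so the postings need not be read.
    Returns None when the postings would still have to be consulted.
    """
    if termfreq > 0 and termfreq == highest_docid - lowest_docid + 1:
        return range(lowest_docid, highest_docid + 1)
    return None
```

A term with termfreq 3 spanning docids 10 to 12 must index exactly 10, 11 and 12; with termfreq 2 over the same span there is a gap somewhere, so the function returns None.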

Change History (3)

comment:1 by Richard Boulton, 16 years ago

Do we really want to encourage users to store terms in their database specially for this purpose that match all the documents? Wouldn't a better approach for the situation you suggest be to have a way to generate a query from a database which matches all documents in that database? That is, in Python, something like:

 import xapian

 db1 = xapian.Database('db1')
 db2 = xapian.Database('db2')
 db = xapian.Database()
 db.add_database(db1)
 db.add_database(db2)

 q = xapian.Query('foo')

 # To restrict to db1 (Query(db1) is the proposed constructor,
 # not an existing API)
 db_filter = xapian.Query(db1)
 q = xapian.Query(xapian.Query.OP_FILTER, q, db_filter)

This wouldn't be accessible from the query string without adding explicit support to the QueryParser, but it seems a nicer solution than encouraging the storing of lots of data which is never actually going to be read. Even if the data compresses well, adding it adds an overhead at indexing time.

That said, this optimisation still seems worthwhile, and should be easy and non-invasive to do.

comment:2 by Olly Betts, 16 years ago

I don't think implementing this optimisation is encouraging that, but it makes it work faster if you do it!

Your approach can actually be implemented with the current API by adding only db1 to db, but neither approach allows the user to write database:1 in their query string and have QueryParser handle it. But that isn't really what I intended this ticket to be about - I was just attempting to provide some background to the optimisation...

comment:3 by Olly Betts, 12 years ago

Milestone: 1.3.1
Resolution: fixed
Status: new → closed

Implemented in trunk r16901.

The check is done at the subdatabase level.
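As a rough illustration of why the level matters (a toy model, not the code from r16901; `shards_skipping_postings` and its arguments are made up for this example), the decision is taken per subdatabase from that subdatabase's own statistics:

```python
from typing import List, Tuple


def shards_skipping_postings(shard_stats: List[Tuple[int, int]],
                             needs_weight: bool = False) -> List[bool]:
    """For each subdatabase, say whether the term's postings can be skipped.

    shard_stats holds one (termfreq, doccount) pair per subdatabase.
    """
    return [not needs_weight and termfreq == doccount
            for termfreq, doccount in shard_stats]
```

A "database:1"-style term that indexes every document in the first subdatabase and none in the second, e.g. `[(100, 100), (0, 250)]`, qualifies in the first subdatabase only, even though it does not match all documents in the combined database.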
