Opened 15 years ago
Closed 12 years ago
#387 closed enhancement (fixed)
Optimisation for a filter term matching all documents
Reported by: | Olly Betts | Owned by: | Olly Betts
---|---|---|---
Priority: | normal | Milestone: | 1.3.1
Component: | Matcher | Version: |
Severity: | normal | Keywords: |
Cc: | | Blocked By: |
Blocking: | | Operating System: | All
Description
If a term matches all documents (i.e. termfreq == doccount) and we don't want weight information for it (e.g. we're using OP_FILTER or BoolWeight), then we don't actually need to look at its postings at all - we can just treat it as QueryMatchAll.
This might not seem very useful, but if it worked at the submatch level, it would allow database selection from within the query string using "database:1" to filter by XDB1, or something like that.
This would also naturally extend to optimising a term which matches exactly a contiguous range of documents, once we start storing the highest and lowest document id each term indexes (which we plan to do at some point): if termfreq == highest_docid - lowest_docid + 1, then we know exactly which documents the term indexes without looking at its postings.
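For illustration, here is roughly the equivalent check an application can already make by hand with the Python bindings before building a filter query (the database path and filter term are placeholders); the point of this ticket is for the matcher to apply the same shortcut automatically:

    import xapian

    db = xapian.Database("index")   # placeholder path
    filter_term = "XDB1"            # placeholder boolean filter term

    # If the filter term is indexed by every document, filtering by it can't
    # drop anything, so use MatchAll and avoid reading its posting list.
    if db.get_termfreq(filter_term) == db.get_doccount():
        filter_query = xapian.Query.MatchAll
    else:
        filter_query = xapian.Query(filter_term)

    query = xapian.Query(xapian.Query.OP_FILTER,
                         xapian.Query("keyword"),
                         filter_query)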
Change History (3)
comment:1 by , 15 years ago
Do we really want to encourage users to store terms in their database, specially for this purpose, which match all the documents? Wouldn't a better approach for the situation you suggest be to have a way to generate a query from a database which matches all documents in that database? I.e., in Python, something like:
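A minimal sketch of that proposed API, assuming a hypothetical query_all_documents() method on Database (not part of the current API; paths are placeholders):

    import xapian

    # Combined database over two sub-databases.
    db = xapian.Database()
    db1 = xapian.Database("db1")
    db2 = xapian.Database("db2")
    db.add_database(db1)
    db.add_database(db2)

    main_query = xapian.Query("keyword")

    # Hypothetical method: a query matching every document in db1, used
    # here as a boolean filter to restrict results to that sub-database.
    query = xapian.Query(xapian.Query.OP_FILTER,
                         main_query,
                         db1.query_all_documents())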
This wouldn't be accessible from the query string without adding explicit support to the QueryParser, but it seems a nicer solution than encouraging storing lots of data which isn't ever actually going to be read. Even if it's well-compressible data, adding it is going to add overhead at indexing time.
That said, this optimisation still seems worthwhile, and should be easy and non-invasive to do.
comment:2 by , 15 years ago
I don't think implementing this optimisation is encouraging that, but it makes it work faster if you do it!
Your approach can actually be implemented with the current API by adding only db1 to db (see the sketch below), but neither approach allows the user to write database:1 in their query string and have the QueryParser handle it. But that isn't really what I intended this ticket to be about - I was just attempting to provide some background to the optimisation...
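A minimal sketch of that point using only the existing Python bindings (paths are placeholders): to restrict a search to db1, just search a Database containing only db1 rather than filtering the combined database.

    import xapian

    # The combined database covers both sub-databases.
    db = xapian.Database()
    db.add_database(xapian.Database("db1"))
    db.add_database(xapian.Database("db2"))

    # To search only the first sub-database with the current API, simply
    # build the Enquire over a Database object containing just db1.
    db1_only = xapian.Database("db1")
    enquire = xapian.Enquire(db1_only)
    enquire.set_query(xapian.Query("keyword"))
    matches = enquire.get_mset(0, 10)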
comment:3 by , 12 years ago
Milestone: | → 1.3.1
---|---
Resolution: | → fixed
Status: | new → closed
Implemented in trunk r16901.
The check is done at the subdatabase level.