Ticket #207 (new defect)
Add ability to accelerate wildcard queries for short terms
| Reported by: | richard | Owned by: | richard |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | QueryParser | Version: | SVN trunk |
| Severity: | normal | Keywords: | |
| Cc: | Blocked By: | ||
| Operating System: | All | Blocking: |
Description (last modified by olly) (diff)
When doing a wildcard query (or a partial term query), it may be desirable to precompute the lists of documents for short query terms to avoid very slow searches. One strategy I've experimented with is indexing the first 1, 2, and 3 characters of each term, marked by an I prefix, to so that 1, 2 or 3 letter searches only need to access a single term.
For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".
The expansion would be done on unstemmed terms - if you try and apply it to stemmed words, all sorts of confusion occurs if the stem has a different first 3 characters than the unstemmed form. Wildcards are currently handled by looking for unstemmed forms anyway, so I don't think this is a problem.
Obviously, it might be sensible to use a different maximum prefix length than 3.
Also, it may not be desirable to store all the prefixes: for example, if only
the 3 letter prefixes were stored (rather than the 2 and 1 letter prefixes being stored as well) a search for "w*" could still be implemented more efficiently than before using the conjunction of all the 3 letter prefixes terms which begin with "Iw". However, there could still be a large number of these.
To implement this, support needs to be added to the Term::as_partial_query and Term::as_wildcard_query methods in queryparser/queryparser.lemony. This doesn't necessarily need a query parser flag, since if the terms aren't present, the old behaviour can be used. However, it might be desirable to have a flag to turn the behaviour on to avoid imposing an overhead on wildcard searches in databases without the acceleration terms. Also, support for generating the terms needs to be added to the TermGenerator? - this should be very easy, but will require a new configuration option.
