Ticket #207 (new defect)

Opened 13 months ago

Last modified 7 months ago

Add ability to accelerate wildcard queries for short terms

Reported by: richard
Owned by: richard
Priority: normal
Milestone:
Component: QueryParser
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Operating System: All
Blocking:

Description (last modified by olly)

When doing a wildcard query (or a partial term query), it may be desirable to precompute the lists of documents for short query terms to avoid very slow searches. One strategy I've experimented with is indexing the first 1, 2, and 3 characters of each term, marked by an I prefix, so that 1, 2 or 3 letter searches only need to access a single term.

For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".
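A minimal sketch of how those terms could be generated, in plain Python rather than against the real TermGenerator. The function name acceleration_terms and its parameters are hypothetical, chosen only for illustration:

```python
def acceleration_terms(term, max_prefix_len=3, marker="I"):
    """Return the prefix terms for `term`, followed by the term itself.

    Hypothetical helper: the name, the parameters and the return shape
    are assumptions made for this sketch, not part of any existing API.
    """
    # A term shorter than max_prefix_len only yields prefixes up to its
    # own length.
    longest = min(max_prefix_len, len(term))
    prefixes = [marker + term[:n] for n in range(1, longest + 1)]
    return prefixes + [term]


print(acceleration_terms("words"))
```

Passing a different max_prefix_len changes how many prefix terms are produced per word.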

The expansion would be done on unstemmed terms - if you try and apply it to stemmed words, all sorts of confusion occurs if the stem has a different first 3 characters than the unstemmed form. Wildcards are currently handled by looking for unstemmed forms anyway, so I don't think this is a problem.

Obviously, it might be sensible to use a different maximum prefix length than 3.

Also, it may not be desirable to store all the prefixes: for example, if only the 3 letter prefixes were stored (rather than the 2 and 1 letter prefixes being stored as well), a search for "w*" could still be implemented more efficiently than before by ORing together all the 3 letter prefix terms which begin with "Iw". However, there could still be a large number of these.
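As a rough illustration of that variant, the sketch below collects the 3 letter prefix terms that a search for "w*" would have to combine. It models the database's term list as a sorted Python list; expand_prefix is a hypothetical name, and the sketch ignores words shorter than three letters, which would need separate handling:

```python
from bisect import bisect_left


def expand_prefix(prefix, sorted_terms, marker="I"):
    """List the stored prefix terms that begin with marker + prefix.

    Hypothetical helper for this sketch; `sorted_terms` stands in for
    the sorted term list of a database.
    """
    start = marker + prefix
    # Jump to the first candidate, then walk forward while terms still
    # share the wanted beginning.
    i = bisect_left(sorted_terms, start)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(start):
        matches.append(sorted_terms[i])
        i += 1
    return matches


# Only 3 letter prefix terms are stored alongside the words themselves.
terms = sorted(["Iwor", "Iwil", "Iwas", "Icat", "words", "wild", "was", "cat"])
print(expand_prefix("w", terms))
```

The number of terms returned grows with the number of distinct 3 letter beginnings in the database, which is the cost the paragraph above warns about.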

To implement this, support needs to be added to the Term::as_partial_query and Term::as_wildcard_query methods in queryparser/queryparser.lemony. This doesn't necessarily need a query parser flag, since if the terms aren't present, the old behaviour can be used. However, it might be desirable to have a flag to turn the behaviour on to avoid imposing an overhead on wildcard searches in databases without the acceleration terms. Also, support for generating the terms needs to be added to the TermGenerator - this should be very easy, but will require a new configuration option.
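The fallback logic might look roughly like the sketch below. It is plain Python, not the actual queryparser code: wildcard_terms, terms_with_prefix and the accelerate flag are hypothetical names, and the database's term list is again modelled as a sorted list:

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Old behaviour: expand the wildcard to every term starting with prefix."""
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


def wildcard_terms(prefix, sorted_terms, accelerate=True,
                   max_prefix_len=3, marker="I"):
    """Pick the terms a wildcard query for prefix + "*" should search.

    Hypothetical sketch: `accelerate` plays the role of the optional
    query parser flag discussed above.
    """
    if accelerate and 1 <= len(prefix) <= max_prefix_len:
        accel = marker + prefix
        i = bisect_left(sorted_terms, accel)
        # Use the acceleration term only if the database contains it.
        if i < len(sorted_terms) and sorted_terms[i] == accel:
            return [accel]
    # Flag off, prefix too long, or no acceleration term: old behaviour.
    return terms_with_prefix(sorted_terms, prefix)


accelerated = sorted(["Iw", "Iwo", "Iwor", "word", "words"])
plain = sorted(["word", "words"])
print(wildcard_terms("wo", accelerated))
print(wildcard_terms("wo", plain))
```

With accelerate=False the lookup for the acceleration term is skipped entirely, which is the overhead the flag is meant to avoid.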

Change History

Changed 13 months ago by richard

  • status changed from new to assigned

Changed 13 months ago by trac

  • platform set to All

Changed 7 months ago by olly

  • owner changed from newbugs to richard
  • status changed from assigned to new
  • description modified