Opened 17 years ago

Closed 5 years ago

#207 closed enhancement (incomplete)

Add ability to accelerate wildcard queries for short terms

Reported by: Richard Boulton Owned by: Richard Boulton
Priority: normal Milestone:
Component: QueryParser Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

When doing a wildcard query (or a partial term query), it may be desirable to precompute the lists of documents for short query terms to avoid very slow searches. One strategy I've experimented with is indexing the first 1, 2, and 3 characters of each term, marked by an I prefix, to so that 1, 2 or 3 letter searches only need to access a single term.

For example, "words" would be indexed as "Iw", "Iwo", "Iwor" and "words".

The expansion would be done on unstemmed terms - if you try and apply it to stemmed words, all sorts of confusion occurs if the stem has a different first 3 characters than the unstemmed form. Wildcards are currently handled by looking for unstemmed forms anyway, so I don't think this is a problem.

Obviously, it might be sensible to use a different maximum prefix length than 3. Also, it may not be desirable to store all the prefixes: for example, if only the 3 letter prefixes were stored (rather than the 2 and 1 letter prefixes being stored as well) a search for "w*" could still be implemented more efficiently than before using the conjunction of all the 3 letter prefixes terms which begin with "Iw". However, there could still be a large number of these.

To implement this, support needs to be added to the Term::as_partial_query and Term::as_wildcard_query methods in queryparser/queryparser.lemony. This doesn't necessarily need a query parser flag, since if the terms aren't present, the old behaviour can be used. However, it might be desirable to have a flag to turn the behaviour on to avoid imposing an overhead on wildcard searches in databases without the acceleration terms. Also, support for generating the terms needs to be added to the TermGenerator - this should be very easy, but will require a new configuration option.

Change History (5)

comment:1 by Richard Boulton, 17 years ago

Operating System: All
Status: newassigned

comment:3 by Olly Betts, 17 years ago

Description: modified (diff)
Owner: changed from New Bugs to Richard Boulton
Status: assignednew

comment:4 by Olly Betts, 14 years ago

Description: modified (diff)

comment:5 by Olly Betts, 10 years ago

Type: defectenhancement

Now we have OP_WILDCARD, this would need to happen when that gets turned into a tree of PostList objects.

However, OP_WILDCARD plus an earlier change to reuse the btree cursor when turning the query into postlists mean that wildcards are significantly faster now, so possibly this idea is no longer worthwhile - it would presumably add a significant amount to the database size, and that has caching implications.

Also, the patch in #608 provides a way to greatly reduce the number of terms involved in for short partial queries while still offering useful expansions.

comment:6 by Olly Betts, 5 years ago

Resolution: incomplete
Status: newclosed

I'm going to close this now - wildcard speed is hugely improved by a number of changes since this ticket was opened.

I'm certainly not ruling out the idea entirely, but somebody needs to demonstrate it is actually an approach which helps with current Xapian.

Note: See TracTickets for help on using tickets.