Opened 19 years ago
Last modified 2 years ago
#113 assigned enhancement
QueryParser should allow specifying word characters, phrase generators, etc
| Reported by: | Federico Schwindt | Owned by: | Olly Betts | 
|---|---|---|---|
| Priority: | high | Milestone: | 2.0.0 | 
| Component: | QueryParser | Version: | git master | 
| Severity: | minor | Keywords: | |
| Cc: | Richard Boulton | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
Hi,
I've been using Xapian (0.9.9 and now 0.9.10) recently at work and I've found that the exquisite QueryParser (no irony intended) imposes some serious limitations for certain queries, as it treats some characters specially even when the flags do not contain FLAG_PHRASE.
I'm talking about the method is_phrase_generator(). In the organization I work for we have a mixed setup of HTML documents and code. This includes several references to text in the word_word format. Unfortunately the QueryParser treats underscore as a phrase generator, making it impossible to search for terms indexed using whitespace separators, even when allterms() shows the term exists in the database.
I believe this is an inconsistency and also a limitation in the QueryParser: no matter what flags are used, whenever the query string contains any of the characters defined in is_phrase_generator(), the query is automatically converted to a phrase search (note that these characters can't be changed).
In an ideal world (mine at least), I'd expect the user to define a phrase (using " or any other previously defined character) and if this is not the case the QueryParser should not try to convert the query to anything else (except for the defined operations, OR, AND, etc).
OTOH, I could change the indexing to strip the underscores (and the other characters) and treat every part of word_word as a separate term, but that would also mean that "word word" would match as well, even when that's not what you wanted.
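To make the trade-off above concrete, here is a minimal toy model in plain Python. It is not Xapian's implementation: `index_terms`, `parse_query` and `phrase_matches` are hypothetical helpers invented for this sketch, and the only assumption carried over from the report is that the query side always splits on underscore and treats the parts as a phrase.

```python
# Toy model only: none of these helpers exist in Xapian.

def index_terms(text, split_underscore):
    """Return the terms of a document in positional order."""
    terms = []
    for token in text.split():
        if split_underscore:
            # Workaround: index each part of word_word separately.
            terms.extend(part for part in token.split("_") if part)
        else:
            # Whitespace-separated indexing keeps word_word as one term.
            terms.append(token)
    return terms


def parse_query(query):
    """Hypothetical parser that always treats '_' as a phrase generator."""
    return [part for part in query.split("_") if part]


def phrase_matches(terms, phrase):
    """True if the phrase's words occur at consecutive positions."""
    n = len(phrase)
    return any(terms[i:i + n] == phrase for i in range(len(terms) - n + 1))


doc = "call word_word here"
query = parse_query("word_word")  # becomes the phrase ['word', 'word']

whole = index_terms(doc, split_underscore=False)
split = index_terms(doc, split_underscore=True)

print(phrase_matches(whole, query))  # False: the term word_word is never found
print(phrase_matches(split, query))  # True, but only because indexing changed
print(phrase_matches(index_terms("a word word b", True), query))  # True as well
```

The last line shows the cost of the workaround: once the underscores are stripped at index time, a document containing plain "word word" matches the same query.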
I hope you take this into consideration. Feel free to contact me if you need further details or if I can clarify anything else.
Many thanks,
f.-
Change History (18)
comment:1 by , 19 years ago
| Status: | new → assigned | 
|---|
comment:2 by , 19 years ago
| Blocking: | 118 added | 
|---|
comment:3 by , 19 years ago
| Cc: | added | 
|---|
comment:4 by , 18 years ago
| rep_platform: | PC → All | 
|---|---|
| Version: | 0.9.9 → SVN HEAD | 
I think the best approach here is to allow the list of phrase-generators to be specified, probably via a predicate function ("is this a phrase-generator?")
I don't think we should scrap the concept as it has its uses. For example, people aren't consistent about hyphenating terms, so a search for phrase-generator should probably act similarly to one for "phrase generator". If you experiment with Google, you'll see they generally do this.
The real problem is that in 0.9.x and earlier we overuse this mechanism for things which ought not be handled as phrase searches.
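As a rough illustration of the predicate idea, the sketch below is a toy parser in plain Python, not a proposed Xapian API. The function `parse` and its two predicate parameters are hypothetical names chosen for this example. It groups words joined by a phrase-generator character into one implicit phrase, and the caller decides which characters qualify.

```python
def parse(query, is_word_char, is_phrase_generator):
    """Split a query into groups of words.

    A group with two or more words stands for an implicit phrase.
    Both predicates are supplied by the caller (hypothetical API).
    """
    groups = []  # finished groups
    group = []   # words of the group currently being built
    word = ""    # word currently being built
    for i, ch in enumerate(query):
        if is_word_char(ch):
            word += ch
            continue
        if word:
            group.append(word)
            word = ""
        nxt = query[i + 1] if i + 1 < len(query) else ""
        # A phrase generator only joins words when it sits between two words.
        joins = bool(group) and is_phrase_generator(ch) \
            and nxt != "" and is_word_char(nxt)
        if not joins and group:
            groups.append(group)
            group = []
    if word:
        group.append(word)
    if group:
        groups.append(group)
    return groups


q = "phrase-generator search"

# Hyphen as a phrase generator: acts like "phrase generator" in quotes.
print(parse(q, str.isalnum, lambda c: c == "-"))
# [['phrase', 'generator'], ['search']]

# No phrase generators at all: three independent words.
print(parse(q, str.isalnum, lambda c: False))
# [['phrase'], ['generator'], ['search']]

# Hyphen as a word character: a single term phrase-generator.
print(parse(q, lambda c: c.isalnum() or c == "-", lambda c: False))
# [['phrase-generator'], ['search']]
```

The three calls cover the cases discussed in this ticket: keeping the implicit phrase behaviour, switching it off, and treating the character as part of a word instead.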
comment:5 by , 18 years ago
In SVN HEAD, "_" is now treated as any other word character and an apostrophe with word characters on either side is included in a term.
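The rule described above can be sketched as a small standalone tokenizer. This is only an approximation in plain Python, assuming that "word character" means alphanumeric or underscore; `is_word_char` and `tokenize` are hypothetical names, and the real library code handles many more cases.

```python
def is_word_char(ch):
    """Assumed definition for this sketch: alphanumerics plus underscore."""
    return ch.isalnum() or ch == "_"


def tokenize(text):
    """Split text into terms, keeping an apostrophe only between word characters."""
    terms = []
    word = ""
    for i, ch in enumerate(text):
        prev_ok = bool(word)  # a word is in progress just before this character
        next_ok = i + 1 < len(text) and is_word_char(text[i + 1])
        if is_word_char(ch) or (ch == "'" and prev_ok and next_ok):
            word += ch
        elif word:
            terms.append(word)
            word = ""
    if word:
        terms.append(word)
    return terms


print(tokenize("don't use foo_bar 'quoted'"))
# ["don't", 'use', 'foo_bar', 'quoted']
```

The apostrophe in don't is kept because it has word characters on both sides, while the quoting apostrophes around quoted are dropped.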
The predicate functions should still be specifiable though. I'm tracking term generation/query parsing todo items here:
comment:6 by , 18 years ago
| Blocking: | 120 added; 118 removed | 
|---|---|
| Operating System: | → All | 
I've decided to defer this to 1.0.X. The updated QueryParser and new TermGenerator functions use a rather patchwork set of predicate and conversion functions internally, and I'd like to take a bit of time to come up with a coherent, logical user API for them rather than rushing to add one which we'll later regret.
I believe we can safely add methods for this without breaking ABI backward compatibility.
comment:8 by , 18 years ago
| Description: | modified (diff) | 
|---|---|
| Milestone: | → 1.1 | 
comment:9 by , 18 years ago
| Blocking: | 120 removed | 
|---|
(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.
comment:10 by , 17 years ago
| Description: | modified (diff) | 
|---|---|
| Milestone: | 1.1.0 → 1.1.1 | 
Bumping to milestone:1.1.1
(and fix description wiki formatting)
comment:11 by , 17 years ago
This (untested) patch may help anyone wanting to add a character to those considered as part of a word and happy to patch the library source code:
http://oligarchy.co.uk/xapian/patches/make-hyphen-a-word-character-untested.patch
comment:13 by , 16 years ago
| Priority: | normal → high | 
|---|
comment:15 by , 16 years ago
| Summary: | QueryParser limitation/inconsistency → QueryParser should allow specifying word characters, phrase generators, etc | 
|---|
comment:16 by , 14 years ago
| Milestone: | 1.3.0 → 1.3.x | 
|---|
comment:17 by , 11 years ago
| Milestone: | 1.3.x → 1.3.4 | 
|---|
comment:18 by , 10 years ago
| Milestone: | 1.3.4 → 1.4.x | 
|---|
I really want to get 1.4.0 out, so regrettably bumping this.
comment:19 by , 2 years ago
| Milestone: | 1.4.x → 2.0.0 | 
|---|---|
| Version: | SVN trunk → git master | 
The original motivating case here was addressed long ago and underscore is now treated as a word character, but more control over how a word is defined would be useful in some situations. Currently you can achieve that, but you have to roll your own code to do the jobs of TermGenerator and QueryParser, which is not helpful.
This change doesn't seem appropriate for 1.4.x at this point, so adjusting milestone.


I was already hoping to sort this out for 1.0.
The history of this is that the QueryParser class was originally part of Omega, and made various assumptions about how text was indexed. I've fixed a number of these, but there are more to go.
bug#22 is somewhat related.