Opened 17 years ago

Last modified 5 months ago

#113 assigned enhancement

QueryParser should allow specifying word characters, phrase generators, etc

Reported by: Federico Schwindt
Owned by: Olly Betts
Priority: high
Milestone: 2.0.0
Component: QueryParser
Version: git master
Severity: minor
Keywords:
Cc: Richard Boulton
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Hi,

I've been using xapian (0.9.9 and now 0.9.10) recently at work and I've found that the exquisite QueryParser (no irony intended) imposes some serious limitations on certain queries, as it treats some characters specially even when the flags do not contain FLAG_PHRASE.

I'm talking about the method is_phrase_generator(). In the organization I work for we have a mixed setup of html documents and code. This includes several references to text in the word_word format. Unfortunately the QueryParser treats underscore as a phrase generator, making it impossible to search for terms indexed using whitespace separators, even when allterms() shows the term exists in the database.

I believe this is both an inconsistency and a limitation in the QueryParser: no matter what flags are used, whenever the query string contains any of the characters defined in is_phrase_generator(), the query is automatically converted to a phrase search (note that these characters can't be changed).
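The mismatch described above can be sketched with a toy model. This is not Xapian code: `ASSUMED_PHRASE_GENERATORS`, `index_whitespace`, `parse_query_token` and `search` are hypothetical names, the character set is an assumption based only on the report, and a phrase match is simplified to "every part is an indexed term in the same document".

```python
# Toy model only: the real character set lives in Xapian's is_phrase_generator()
# and may differ; "_" is included because the report says it is treated this way.
ASSUMED_PHRASE_GENERATORS = {"_"}


def index_whitespace(docs):
    """Index each whitespace-separated token as a single term (term -> doc ids)."""
    index = {}
    for docid, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(docid)
    return index


def parse_query_token(token):
    """Split a query token at every assumed phrase generator character."""
    parts, current = [], ""
    for ch in token:
        if ch in ASSUMED_PHRASE_GENERATORS:
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def search(index, token):
    """Simplified phrase lookup: every part must be indexed in the same document."""
    parts = parse_query_token(token)
    if not parts:
        return set()
    postings = [index.get(part, set()) for part in parts]
    return set.intersection(*postings)


index = index_whitespace({1: "call word_word here", 2: "plain text"})
print("word_word" in index)        # is the term present in the index?
print(search(index, "word_word"))  # what does the parsed query actually find?
```

The term is present in the index, yet the query is split into two lookups for "word", neither of which was ever indexed, so nothing matches.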

In an ideal world (mine at least), I'd expect the user to define a phrase (using " or any other previously defined character) and if this is not the case the QueryParser should not try to convert the query to anything else (except for the defined operations, OR, AND, etc).

OTOH, I could change the indexing to strip the underscores (and the other characters) and treat every part of the word_word as a separate term, but that would also mean that "word word" would match as well, which is not what you wanted.

I hope you take this into consideration. Feel free to contact me if you need further details or if I can clarify anything else.

Many thanks,

f.-

Change History (18)

comment:1 by Olly Betts, 17 years ago

Status: new → assigned

I was already hoping to sort this out for 1.0.

The history of this is that the QueryParser class was originally part of Omega, and made various assumptions about how text was indexed. I've fixed a number of these, but there are more to go.

bug#22 is somewhat related.

comment:2 by Richard Boulton, 17 years ago

Blocking: 118 added

comment:3 by Richard Boulton, 17 years ago

Cc: richard@… added

comment:4 by Olly Betts, 17 years ago

rep_platform: PC → All
Version: 0.9.9 → SVN HEAD

I think the best approach here is to allow the list of phrase-generators to be specified, probably via a predicate function ("is this a phrase-generator?").

I don't think we should scrap the concept as it has its uses. For example, people aren't consistent about hyphenating terms, so a search for phrase-generator should probably act similarly to one for "phrase generator". If you experiment with Google, you'll see they generally do this.

The real problem is that in 0.9.x and earlier we overuse this mechanism for things which ought not be handled as phrase searches.
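As a rough illustration of the predicate idea in this comment, here is a minimal sketch. It is not a Xapian API: `parse_token` and `hyphen_only` are hypothetical names, and the caller-supplied callable stands in for the "is this a phrase-generator?" question.

```python
from typing import Callable, List, Tuple


def parse_token(token: str,
                is_phrase_generator: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Split a token wherever the caller's predicate says a character joins a phrase."""
    parts: List[str] = []
    current = ""
    for ch in token:
        if is_phrase_generator(ch):
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    # More than one part means the token is handled as a phrase.
    kind = "phrase" if len(parts) > 1 else "term"
    return kind, parts


def hyphen_only(ch: str) -> bool:
    """Example predicate (an assumption): only "-" generates a phrase."""
    return ch == "-"


print(parse_token("phrase-generator", hyphen_only))
print(parse_token("word_word", hyphen_only))
```

With this predicate the hyphenated query behaves like the quoted phrase, while word_word stays a single term because the caller chose not to treat "_" as a phrase generator.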

comment:5 by Olly Betts, 17 years ago

In SVN HEAD, "_" is now treated as any other word character and an apostrophe with word characters on either side is included in a term.
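The two rules stated here can be sketched with a regular expression. This only approximates the behaviour as described in this comment and is not the library's implementation; `TERM_RE` and `terms` are hypothetical names, and Python's `\w` (letters, digits and underscore) is assumed to be a good enough stand-in for "word character".

```python
import re

# \w already counts "_" as a word character; an apostrophe is kept only
# when it has word characters on both sides.
TERM_RE = re.compile(r"\w+(?:'\w+)*")


def terms(text):
    """Return the terms found in text under the two rules above."""
    return TERM_RE.findall(text)


print(terms("don't split word_word or 'quoted' words"))
```

Under these rules word_word and don't each come out as one term, while the apostrophes around 'quoted' are dropped because they do not sit between word characters.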

The predicate functions should still be specifiable though. I'm tracking term generation/query parsing todo items here:

http://wiki.xapian.org/BraveNewTerms

comment:6 by Olly Betts, 17 years ago

Blocking: 120 added; 118 removed
Operating System: All

I've decided to defer this to 1.0.X. The updated QueryParser and new TermGenerator functions use rather a patchwork set of predicate and conversion functions internally, and I'd like to take a bit of time to come up with a coherent, logical user API for them rather than rushing to add one which we'll later regret.

I believe we can safely add methods for this without breaking ABI backward compatibility.

comment:8 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.1

comment:9 by Richard Boulton, 16 years ago

Blocking: 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:10 by Olly Betts, 15 years ago

Description: modified (diff)
Milestone: 1.1.0 → 1.1.1

Bumping to milestone:1.1.1

(and fix description wiki formatting)

comment:11 by Olly Betts, 15 years ago

This (untested) patch may help anyone wanting to add a character to those considered as part of a word and happy to patch the library source code:

http://oligarchy.co.uk/xapian/patches/make-hyphen-a-word-character-untested.patch

comment:12 by Olly Betts, 15 years ago

Milestone: 1.1.1 → 1.1.4

Triaging milestone:1.1.1 bugs.

comment:13 by Olly Betts, 15 years ago

Priority: normal → high

comment:14 by Olly Betts, 15 years ago

Milestone: 1.1.4 → 1.3.0

Bumping to stay on track for release.

comment:15 by Olly Betts, 14 years ago

Summary: QueryParser limitation/inconsistency → QueryParser should allow specifying word characters, phrase generators, etc

comment:16 by Olly Betts, 12 years ago

Milestone: 1.3.0 → 1.3.x

comment:17 by Olly Betts, 9 years ago

Milestone: 1.3.x → 1.3.4

comment:18 by Olly Betts, 8 years ago

Milestone: 1.3.4 → 1.4.x

I really want to get 1.4.0 out, so regrettably bumping this.

comment:19 by Olly Betts, 5 months ago

Milestone: 1.4.x → 2.0.0
Version: SVN trunk → git master

The original motivating case here was addressed long ago and underscore is now treated as a word character, but more control over how a word is defined would be useful in some situations. Currently you can achieve that, but you have to roll your own code to do the jobs of TermGenerator and QueryParser, which is not helpful.
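To give a sense of what "rolling your own" involves, the following minimal sketch applies one caller-defined word-character predicate to both the indexing side and the query side, which is the part that has to stay consistent. None of this is Xapian code: `tokenize`, `build_index`, `search` and `hyphen_is_word_char` are hypothetical names, and treating "-" as a word character is just an example choice.

```python
def tokenize(text, is_word_char):
    """Split text into terms: maximal runs of characters the predicate accepts."""
    terms, current = [], ""
    for ch in text:
        if is_word_char(ch):
            current += ch
        else:
            if current:
                terms.append(current)
            current = ""
    if current:
        terms.append(current)
    return terms


def hyphen_is_word_char(ch):
    """Example word definition (an assumption): alphanumerics, "_" and "-"."""
    return ch.isalnum() or ch in "_-"


def build_index(docs, is_word_char):
    """Stand-in for the indexing job: term -> set of document ids."""
    index = {}
    for docid, text in docs.items():
        for term in tokenize(text, is_word_char):
            index.setdefault(term, set()).add(docid)
    return index


def search(index, query, is_word_char):
    """Stand-in for the query parsing job: AND of all query terms."""
    query_terms = tokenize(query, is_word_char)
    if not query_terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in query_terms))


docs = {1: "a state-of-the-art parser", 2: "state of the art"}
index = build_index(docs, hyphen_is_word_char)
print(search(index, "state-of-the-art", hyphen_is_word_char))
```

Because the same predicate drives both sides, the hyphenated form is indexed and queried as one term; using different word definitions for indexing and querying would reintroduce the kind of mismatch this ticket started from.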

This change doesn't seem appropriate for 1.4.x at this point, so adjusting milestone.
