Ticket #113 (assigned enhancement)

Opened 21 months ago

Last modified 7 months ago

QueryParser limitation/inconsistency

Reported by: federico.schwindt
Owned by: olly
Priority: normal
Milestone: 1.1.0
Component: QueryParser
Version: SVN trunk
Severity: minor
Keywords:
Cc: richard
Blocked By:
Operating System: All
Blocking:

Description (last modified by richard)

Hi,

I've been using Xapian (0.9.9 and now 0.9.10) recently at work, and I've found that the exquisite QueryParser (no irony intended) imposes some serious limitations on certain queries, because it treats some characters specially even when the flags do not contain FLAG_PHRASE.

I'm talking about the method is_phrase_generator(). The organization I work for has a mixed setup of HTML documents and code, which includes several references to text in the word_word format. Unfortunately, the QueryParser treats the underscore as a phrase generator, making it impossible to search for terms indexed using whitespace separators, even when allterms() shows that the term exists in the database.
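To make the mismatch concrete, here is a minimal toy model in Python. It is not Xapian code: `index_whitespace`, `parse_query_0_9` and `matches` are hypothetical helpers written only for illustration, and the sketch assumes that "_" is the only phrase generator and that phrase positions can be ignored.

```python
def index_whitespace(text):
    """Index by splitting on whitespace only, so word_word stays one term."""
    return set(text.split())


def parse_query_0_9(query):
    """Toy model of the reported behaviour: '_' always forces a phrase."""
    parsed = []
    for word in query.split():
        parts = [p for p in word.split("_") if p]
        if len(parts) > 1:
            parsed.append(("PHRASE", tuple(parts)))
        elif parts:
            parsed.append(("TERM", parts[0]))
    return parsed


def matches(parsed, terms):
    """Loose check: every word the query needs must be an indexed term.

    Positions are ignored, so this is only a necessary condition for a
    real phrase match.
    """
    for kind, value in parsed:
        needed = value if kind == "PHRASE" else (value,)
        if not all(t in terms for t in needed):
            return False
    return True


terms = index_whitespace("see word_word in module.c")
parsed = parse_query_0_9("word_word")
print(parsed)                  # [('PHRASE', ('word', 'word'))]
print(matches(parsed, terms))  # False
```

The term word_word is in the index, but the parsed query looks for the separate term word, so nothing matches.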

I believe this is an inconsistency and also a limitation in the QueryParser: no matter which flags are used, whenever the query string contains any of the characters defined in is_phrase_generator(), the query is automatically converted to a phrase search (note that these characters can't be changed).

In an ideal world (mine at least), I'd expect the user to define a phrase explicitly (using " or any other previously defined character); if they don't, the QueryParser should not try to convert the query to anything else (except for the defined operations: OR, AND, etc.).

OTOH, I could change the indexing to strip the underscores (and the other characters) and treat every part of word_word as a separate term, but that would also mean that "word word" would match as well, which is not what the user wanted.

I hope you will take this into consideration. Feel free to contact me if you need further details or if I can clarify anything else.

Many thanks,

f.-

Change History

Changed 20 months ago by olly

  • status changed from new to assigned

I was already hoping to sort this out for 1.0.

The history of this is that the QueryParser class was originally part of Omega, and made various assumptions about how text was indexed. I've fixed a number of these, but there are more to go.

bug#22 is somewhat related.

Changed 20 months ago by richard

  • blocking set to 118

Changed 20 months ago by richard

  • cc richard@… added

Changed 19 months ago by olly

  • rep_platform changed from PC to All
  • version changed from 0.9.9 to SVN HEAD

I think the best approach here is to allow the list of phrase-generators to be specified, probably via a predicate function ("is this a phrase-generator?")

I don't think we should scrap the concept as it has its uses. For example, people aren't consistent about hyphenating terms, so a search for 'phrase-generator' should probably act similarly to one for '"phrase generator"'. If you experiment with Google, you'll see they generally do this.

The real problem is that in 0.9.x and earlier we overuse this mechanism for things which ought not be handled as phrase searches.
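The predicate idea in this comment could look roughly like the Python sketch below. It is only an illustration of the design, not Xapian's API: `ToyQueryParser` and its `is_phrase_generator` constructor argument are hypothetical names, and no such setter is claimed to exist in the library.

```python
class ToyQueryParser:
    """Toy stand-in for a query parser; not Xapian's QueryParser."""

    def __init__(self, is_phrase_generator=None):
        # Hypothetical hook: a callable answering "is this character a
        # phrase generator?". By default no character is one.
        self.is_phrase_generator = is_phrase_generator or (lambda ch: False)

    def parse(self, query):
        parsed = []
        for word in query.lower().split():
            parts, current = [], ""
            for ch in word:
                if self.is_phrase_generator(ch):
                    # A phrase generator ends the current part.
                    if current:
                        parts.append(current)
                    current = ""
                else:
                    current += ch
            if current:
                parts.append(current)
            if len(parts) > 1:
                parsed.append(("PHRASE", tuple(parts)))
            elif parts:
                parsed.append(("TERM", parts[0]))
        return parsed


# The caller decides: hyphens join words into a phrase, underscores do not.
hyphen_only = ToyQueryParser(is_phrase_generator=lambda ch: ch == "-")
print(hyphen_only.parse("phrase-generator"))
print(hyphen_only.parse("word_word"))
```

With this shape, a search for phrase-generator still behaves like a phrase search, while word_word is left as a single term because the caller's predicate does not list the underscore.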

Changed 19 months ago by olly

In SVN HEAD, "_" is now treated as any other word character and an apostrophe with word characters on either side is included in a term.
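As a rough illustration of these two rules only, the Python sketch below approximates them with a regular expression. It is an assumption-laden toy, not the tokenizer in SVN HEAD: `TERM_RE` and `tokenize` are hypothetical names, and lowercasing is added just for the example.

```python
import re

# \w covers letters, digits and "_", so the underscore is an ordinary word
# character. An apostrophe is kept only when word characters sit on both
# sides of it.
TERM_RE = re.compile(r"\w+(?:'\w+)*")


def tokenize(text):
    """Split text into terms using the two rules described above."""
    return TERM_RE.findall(text.lower())


print(tokenize("Don't split word_word or 'quoted' text"))
# ["don't", 'split', 'word_word', 'or', 'quoted', 'text']
```

Here word_word and don't each survive as one term, while the quote marks around 'quoted' are dropped because they do not have word characters on both sides.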

The predicate functions should still be specifiable though. I'm tracking term generation/query parsing todo items here:

http://wiki.xapian.org/BraveNewTerms

Changed 19 months ago by olly

  • blocking changed from 118 to 120

I've decided to defer this to 1.0.X. The updated QueryParser and new TermGenerator classes use rather a patchwork set of predicate and conversion functions internally, and I'd like to take a bit of time to come up with a coherent, logical user API for them rather than rushing to add one which we'll later regret.

I believe we can safely add methods for this without breaking ABI backward compatibility.

Changed 19 months ago by trac

  • platform set to All

Changed 7 months ago by richard

  • description modified
  • milestone set to 1.1

Changed 7 months ago by richard

  • blocking deleted

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.
