Ticket #113 (assigned enhancement)

Opened 3 years ago

Last modified 6 months ago

QueryParser limitation/inconsistency

Reported by: federico.schwindt
Owned by: olly
Priority: high
Milestone: 1.3.0
Component: QueryParser
Version: SVN trunk
Severity: minor
Keywords:
Cc: richard
Blocked By:
Operating System: All
Blocking:

Description (last modified by olly) (diff)

Hi,

I've been using Xapian (0.9.9 and now 0.9.10) recently at work and I've found that the exquisite QueryParser (no irony intended) imposes some serious limitations on certain queries, as it treats some characters specially even when the flags do not contain FLAG_PHRASE.

I'm talking about the method is_phrase_generator(). In the organization I work for we have a mixed setup of HTML documents and code, which includes several references to text in the word_word format. Unfortunately the QueryParser treats underscore as a phrase generator, making it impossible to search for terms indexed using whitespace separators, even when allterms() shows the term exists in the database.

I believe this is both an inconsistency and a limitation in the QueryParser: whatever flags are used, a query string containing any of the characters recognised by is_phrase_generator() is automatically converted to a phrase search (and note that these characters can't be changed).
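
To make the reported behaviour concrete, here is a minimal C++ sketch of the mechanism as described above. It is not the Xapian source: `implied_phrase_terms` is a hypothetical helper, the `is_phrase_generator` body is a stand-in that lists only `_` and `-` (the actual character set in 0.9.x is not reproduced), and the `flags` parameter exists only to show that it plays no part in the decision.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the hard-coded predicate named in the report.
// Only '_' and '-' are listed; the actual 0.9.x character set is not
// reproduced here.
inline bool is_phrase_generator(unsigned char ch) {
    return ch == '_' || ch == '-';
}

// Split one whitespace-free query word at phrase generators. The flags
// argument is deliberately unused: that is the reported problem, since
// the split happens whether or not the caller asked for phrase support.
std::vector<std::string> implied_phrase_terms(const std::string& word,
                                              unsigned /*flags*/) {
    std::vector<std::string> terms;
    std::string term;
    for (unsigned char ch : word) {
        if (is_phrase_generator(ch)) {
            if (!term.empty()) terms.push_back(term);
            term.clear();
        } else {
            term += static_cast<char>(ch);
        }
    }
    if (!term.empty()) terms.push_back(term);
    return terms;  // more than one element means "run as a phrase search"
}
```

Under these assumptions, `implied_phrase_terms("word_word", 0)` yields the two terms `word` and `word`, so the query would be run as a phrase and the indexed term `word_word` is never looked up.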

In an ideal world (mine at least), I'd expect the user to define a phrase (using " or any other previously defined character) and if this is not the case the QueryParser should not try to convert the query to anything else (except for the defined operations, OR, AND, etc).

OTOH, I could change the indexing to strip the underscores (and the other characters) and treat every part of word_word as a separate term, but that would also mean that "word word" would match as well, which is not what was wanted.

I hope you will take this into consideration. Feel free to contact me if you need further details or if I can clarify anything else.

Many thanks,

f.-

Change History

Changed 3 years ago by olly

  • status changed from new to assigned

I was already hoping to sort this out for 1.0.

The history of this is that the QueryParser class was originally part of Omega, and made various assumptions about how text was indexed. I've fixed a number of these, but there are more to go.

bug#22 is somewhat related.

Changed 3 years ago by richard

  • blocking 118 added

Changed 3 years ago by richard

  • cc richard@… added

Changed 3 years ago by olly

  • rep_platform changed from PC to All
  • version changed from 0.9.9 to SVN HEAD

I think the best approach here is to allow the list of phrase-generators to be specified, probably via a predicate function ("is this a phrase-generator?")

I don't think we should scrap the concept as it has its uses. For example, people aren't consistent about hyphenating terms, so a search for 'phrase-generator' should probably act similarly to one for '"phrase generator"'. If you experiment with Google, you'll see they generally do this.

The real problem is that in 0.9.x and earlier we overuse this mechanism for things which ought not be handled as phrase searches.
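
One way the predicate idea could look, as a rough C++ sketch. None of these names exist in Xapian: `parse_sketch`, `PhraseGeneratorPredicate` and `ParsedQuery` are invented for illustration, and the parsing is reduced to whitespace splitting plus the caller-supplied predicate.

```cpp
#include <cctype>
#include <functional>
#include <string>
#include <vector>

// Caller-supplied answer to "is this a phrase-generator?" (hypothetical).
using PhraseGeneratorPredicate = std::function<bool(unsigned char)>;

// Each inner vector is one query item; more than one term means a phrase.
using ParsedQuery = std::vector<std::vector<std::string>>;

ParsedQuery parse_sketch(const std::string& query,
                         const PhraseGeneratorPredicate& is_phrase_generator) {
    ParsedQuery items;
    std::vector<std::string> item;
    std::string term;
    auto end_term = [&] {
        if (!term.empty()) item.push_back(term);
        term.clear();
    };
    auto end_item = [&] {
        end_term();
        if (!item.empty()) items.push_back(item);
        item.clear();
    };
    for (unsigned char ch : query) {
        if (std::isspace(ch)) {
            end_item();                     // whitespace separates items
        } else if (is_phrase_generator(ch)) {
            end_term();                     // neighbours become a phrase
        } else {
            term += static_cast<char>(ch);  // everything else is term text
        }
    }
    end_item();
    return items;
}
```

With a predicate that accepts only `-`, the query `phrase-generator word_word` would come out as the phrase `phrase generator` followed by the single term `word_word`, which covers both the hyphenation case and the original report.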

Changed 3 years ago by olly

In SVN HEAD, "_" is now treated as any other word character and an apostrophe with word characters on either side is included in a term.

The predicate functions should still be specifiable though. I'm tracking term generation/query parsing todo items here:

 http://wiki.xapian.org/BraveNewTerms
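
The two rules described for SVN HEAD can be sketched in C++ roughly as follows. This is an illustration under simplifying assumptions rather than the library's implementation: `is_word_char` and `split_terms` are hypothetical names, and only ASCII input is handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Word characters for this sketch: ASCII alphanumerics plus '_'.
inline bool is_word_char(unsigned char ch) {
    return std::isalnum(ch) || ch == '_';
}

std::vector<std::string> split_terms(const std::string& text) {
    std::vector<std::string> terms;
    std::string term;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(text[i]);
        bool keep = is_word_char(ch);
        if (!keep && ch == '\'' && !term.empty() && i + 1 < text.size()) {
            // An apostrophe stays in the term only when a word character
            // sits on both sides of it. A non-empty term implies the
            // previous character was a word character.
            keep = is_word_char(static_cast<unsigned char>(text[i + 1]));
        }
        if (keep) {
            term += static_cast<char>(ch);
        } else if (!term.empty()) {
            terms.push_back(term);
            term.clear();
        }
    }
    if (!term.empty()) terms.push_back(term);
    return terms;
}
```

Here `split_terms("word_word")` gives the single term `word_word`, and `don't` stays whole, while the apostrophes in `'quoted'` are dropped because each has a word character on one side only.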

Changed 3 years ago by olly

  • blocking 120 added; 118 removed

I've decided to defer this to 1.0.X. The updated QueryParser and new TermGenerator classes use rather a patchwork set of predicate and conversion functions internally, and I'd like to take a bit of time to come up with a coherent, logical user API for them rather than rushing to add one which we'll later regret.

I believe we can safely add methods for this without breaking ABI backward compatibility.

Changed 3 years ago by trac

  • platform set to All

Changed 22 months ago by richard

  • description modified (diff)
  • milestone set to 1.1

Changed 22 months ago by richard

  • blocking 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

Changed 12 months ago by olly

  • description modified (diff)
  • milestone changed from 1.1.0 to 1.1.1

Bumping to milestone:1.1.1

(and fix description wiki formatting)

Changed 12 months ago by olly

This (untested) patch may help anyone wanting to add a character to those considered as part of a word and happy to patch the library source code:

 http://oligarchy.co.uk/xapian/patches/make-hyphen-a-word-character-untested.patch

Changed 10 months ago by olly

  • milestone changed from 1.1.1 to 1.1.4

Triaging milestone:1.1.1 bugs.

Changed 7 months ago by olly

  • priority changed from normal to high

Changed 6 months ago by olly

  • milestone changed from 1.1.4 to 1.3.0

Bumping to stay on track for release.
