Opened 17 years ago

Closed 9 years ago

Last modified 9 years ago

#128 closed enhancement (fixed)

Allow queryparser to treat some prefixes as literal text

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: high
Milestone: 1.3.1
Component: QueryParser
Version: SVN trunk
Severity: minor
Keywords:
Cc: Olly Betts, Sidnei da Silva, Mark Hammond, c.hack@…
Blocked By:
Blocking:
Operating System: All

Description (last modified by Richard Boulton)

By default, the query parser splits words at spaces and applies lower-casing, stemming, and other normalisation to generate terms.

I believe that it should be possible to override the query parser's default behaviour for fields with a given set of prefixes, such that the query parser will treat some terms as literal text, allowing any character to occur in the term (including spaces and quotes), and not applying stemming or other normalisation to the term.

My thinking is that this can be implemented by adding a third prefix type (which I've called "EXACT_TEXT" for want of a better name), which causes the query parser to put all the characters following the prefix until the next space or ')' into the term (like terms with a "BOOL_FILTER" prefix type). The terms so generated are then included in the query structure in the same way as "FREE_TEXT" terms - ie, they obey surrounding boolean operators, and '+' and '-' prefixes.

In order to allow spaces (and ')' characters) in the terms, the query parser should support basic backslash escaping for the contents of such fields.
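
A minimal sketch of the scanning rule described above, assuming a backslash simply makes the next character literal. The helper name `read_exact_text` is hypothetical; it is not taken from the attached patch or from Xapian.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Xapian or the attached patch.
// Reads one EXACT_TEXT value from `query`, starting at `pos` (the first
// character after "prefix:"). A backslash makes the following character
// literal, so "\ " and "\)" do not end the value. On return, `pos` indexes
// the unescaped terminator (space or ')') or equals query.size().
std::string read_exact_text(const std::string& query, std::size_t& pos) {
    assert(pos <= query.size());
    std::string term;
    while (pos < query.size()) {
        char ch = query[pos];
        if (ch == '\\' && pos + 1 < query.size()) {
            term += query[pos + 1];  // keep the escaped character verbatim
            pos += 2;
            continue;
        }
        if (ch == ' ' || ch == ')') break;  // unescaped terminator
        term += ch;  // ordinary character (or a lone trailing backslash)
        ++pos;
    }
    return term;
}
```

No lower-casing or stemming is applied to the result, which matches the "literal text" intent of the proposal.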

I have a patch which implements this that I'll attach to this bug report shortly. The patch has a few test cases (but more are needed for such a new feature), and I've not written any documentation for it yet.

I know that Sidnei needs this for something he's working on, and I'd be delighted if we managed to get this into 1.0 since I'm going to have to maintain it until it gets committed, but it needs thorough review before being committed and timescales for 1.0 may not allow this.

Attachments (1)

qp_exact_text.patch (6.7 KB ) - added by Richard Boulton 17 years ago.
Draft implementation, no documentation, too few tests

Change History (24)

by Richard Boulton, 17 years ago

Attachment: qp_exact_text.patch added

Draft implementation, no documentation, too few tests

comment:1 by Olly Betts, 17 years ago

Cc: olly@… added
rep_platform: PC → All
Severity: normal → enhancement

I've been thinking about the same problems actually. Fabrice Colin talked about it a bit on the list a while back, and it's also connected to changing how we parse terms.

The idea of EXACT_TEXT vs FREE_TEXT seems rather non-orthogonal. Why shouldn't I be able to specify this for boolean prefixes too? And there are other transformations that should be possible (for example, for gmane I'd love to be able to turn a boolean group:gmane.discuss into the term Gdiscuss).

So my analysis is that the user should be able to specify for each prefix:

  • If it's a boolean filter (apply with OP_FILTER vs default_op)
  • Which characters to include when parsing
  • How to convert the characters parsed into a term

(The parsing and converting may be best combined in some way...)
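
The three per-prefix choices listed above could be bundled roughly as follows. This is only a sketch: `PrefixSpec`, `make_group_spec` and `parse_with_spec` are hypothetical names, and the rule for stripping the leading "gmane." is an assumption based on the group:gmane.discuss example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical per-prefix description; not an existing Xapian type.
struct PrefixSpec {
    bool is_boolean;                                         // OP_FILTER vs default_op
    std::function<bool(char)> is_value_char;                 // which characters to include
    std::function<std::string(const std::string&)> to_term;  // parsed text -> term
};

// Example spec for the gmane case: "group:gmane.discuss" -> "Gdiscuss".
PrefixSpec make_group_spec() {
    PrefixSpec spec;
    spec.is_boolean = true;
    spec.is_value_char = [](char ch) -> bool { return ch != ' ' && ch != ')'; };
    spec.to_term = [](const std::string& value) -> std::string {
        const std::string hierarchy = "gmane.";  // assumed: drop this leading part
        if (value.compare(0, hierarchy.size(), hierarchy) == 0)
            return "G" + value.substr(hierarchy.size());
        return "G" + value;
    };
    return spec;
}

// Collects the value starting at `pos` using the spec's character rule,
// then converts it into a term using the spec's conversion rule.
std::string parse_with_spec(const PrefixSpec& spec, const std::string& query,
                            std::size_t pos) {
    assert(spec.is_value_char && spec.to_term);
    std::string value;
    while (pos < query.size() && spec.is_value_char(query[pos]))
        value += query[pos++];
    return spec.to_term(value);
}
```

Keeping the character rule and the conversion as separate callbacks mirrors the list; the parenthetical remark suggests a real design might merge them.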

Currently we allow the first and a very limited form of the last (i.e. you can specify what to prefix the term with; and there's a global choice of one of three ways of processing terms from FREE_TEXT prefixes).

Take a look at the ValueRangeProcessor hierarchy - I was thinking of taking a similar direction here as well (partly for API consistency and partly cos I can reuse the code!)

I'd like to get that done for 1.0, but April is zooming by at quite a rate, so we'll have to see. Bugs are at least getting ticked off. But even if it misses 1.0, I think it can be added without breaking existing API use or changing the ABI incompatibly, so I'd rather add an API for a partial solution only to deprecate it again.

I've never been very convinced by the idea of allowing backslash-escaping of characters in a search engine UI. Lucene's query parser allows it IIRC, but do you have evidence that people really make use of this feature? I'd be interested to know. Anyway, I think we can provide the flexibility to support it.

Incidentally, your patch seems to add support for "unary HATE", which I don't think is a good idea (in fact I'm pretty sure we already dropped the idea once). The problem is that it's too likely to be invoked accidentally (e.g. by a search for `-fno-gnu-keywords') and the same effect can easily be achieved with "NOT fno-gnu-keywords" if that's really what you mean.

Last edited 9 years ago by Olly Betts

comment:2 by Richard Boulton, 17 years ago

Blocking: 120 added

I spent some time yesterday wondering about the inflexibility of just adding a new prefix type, and was coming to similar conclusions to yours. (I'm assuming you meant "I'd rather _not_ add an API for a partial solution".)

Regarding backslash escaping, I've never wanted to use it personally, but I have seen requests for it by users on the Lucene mailing lists. The biggest downside of it, of course, is that it breaks terms which have backslashes for users who don't know that they need to double backslashes - which could be a problem if the term is something like a windows path with a leading double backslash. We could use an alternative escaping mechanism - since we're only wanting to escape ", we could simply require doubling of " characters (so, " ends the string, and "" adds a single " to the string). The escaping only needs to happen for query strings which have a " immediately following a prefix, of course.
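
The quote-doubling alternative can be sketched as below, under the assumption that the value starts at a " immediately following the prefix. `read_quoted_value` is a hypothetical helper, not Xapian code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper. `pos` must index the opening '"' right after a prefix.
// A single '"' ends the value; a doubled '""' contributes one literal '"'.
// Backslashes get no special treatment. On success, returns true, stores the
// text in `value`, and moves `pos` just past the closing quote. Returns false
// (leaving `pos` and `value` untouched) if there is no opening or closing quote.
bool read_quoted_value(const std::string& query, std::size_t& pos,
                       std::string& value) {
    if (pos >= query.size() || query[pos] != '"') return false;
    std::string out;
    std::size_t i = pos + 1;
    while (i < query.size()) {
        if (query[i] == '"') {
            if (i + 1 < query.size() && query[i + 1] == '"') {
                out += '"';  // "" -> one literal quote
                i += 2;
                continue;
            }
            value = out;     // a lone quote closes the value
            pos = i + 1;
            return true;
        }
        out += query[i++];
    }
    return false;  // unterminated quoted value
}
```

Because backslashes are ordinary characters here, a value such as a Windows path with a leading double backslash passes through unchanged.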

I suspect escaping would mainly be useful for people who are "abusing" the query parser by sending it machine generated queries, but since we can implement it without much difficulty and it can be made optional with your suggested scheme, I think we should keep it available.

Being able to quote "exact" terms to allow spaces in them is definitely useful to users, however.

So: for each field (ie, user entered text before a colon), we'd have a flag indicating whether it's a boolean filter or not, a list of characters to include when parsing (or possibly a list of characters not to include), and flags indicating whether to allow escape characters during the parsing and whether to allow quoting the value in " or some other character.

We'd also have a list of FieldProcessors for each field which are applied to that field in turn, until one returns successfully. For full flexibility, the return value would be able to represent multiple terms, or a structured query, but that would probably be best left simple for now.
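
A sketch of the "try each processor in turn" idea, in the simple form where a processor yields a single term. The callback type, the two example processors, and the "Q" and "XN" term prefixes are all made up for illustration.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical processor signature: returns true and sets `term` on success.
using FieldProcessorFn =
    std::function<bool(const std::string& value, std::string& term)>;

// Applies the processors in order and stops at the first one that succeeds.
bool process_field(const std::vector<FieldProcessorFn>& chain,
                   const std::string& value, std::string& term) {
    for (const auto& proc : chain) {
        if (proc(value, term)) return true;
    }
    return false;  // no processor accepted the value
}

// Example processor: accepts only non-empty, all-digit values.
bool numeric_id(const std::string& value, std::string& term) {
    if (value.empty()) return false;
    for (char ch : value) {
        if (ch < '0' || ch > '9') return false;
    }
    term = "Q" + value;  // "Q" is an arbitrary prefix for this sketch
    return true;
}

// Example processor: accepts anything, used as a last resort.
bool literal_fallback(const std::string& value, std::string& term) {
    term = "XN" + value;  // "XN" is an arbitrary prefix for this sketch
    return true;
}
```

Returning a richer object than a single string would be the route to the "multiple terms, or a structured query" generalisation mentioned above.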

I'll take a look at this either today or tomorrow, unless you're already working on it.

For now, I'm marking this bug as desired for the 1.0 series, but if we can get it into 1.0 that would be very helpful to me.

comment:3 by Olly Betts, 17 years ago

(I'm assuming you meant "I'd rather _not_ add an API for a partial solution".)

Indeed.

Regarding backslash escaping, I've never wanted to use it personally, but I have seen requests for it by users on the Lucene mailing lists. The biggest downside of it, of course, is that it breaks terms which have backslashes for users who don't know that they need to double backslashes - which could be a problem if the term is something like a windows path with a leading double backslash.

Yes, those are the sort of problems I see with the approach. Technical users will hopefully understand the concept of escaping, but non-technical ones probably won't have encountered it, and may have trouble grasping it even if explained to them (I recall a discussion about cave survey data exchange standards where someone failed to grasp for some time how you could escape a delimiter character, and also escape the escape character, thus allowing any string to be represented...)

I was wondering last night if just allowing "..." or '...' would be sufficient, or if people would really want to be able to specify a string with both " and ' in, but your "double the quote to escape it" idea seems good. It shouldn't break random casual query usage (and it's used by some versions of BASIC IIRC, so it will be familiar to many already).

I suspect escaping would mainly be useful for people who are "abusing" the query parser by sending it machine generated queries

If anything, that's an argument for not supporting it! I noticed Lucene's query parser documentation also tries to discourage people from doing this, incidentally.

Being able to quote "exact" terms to allow spaces in them is definitely useful to users, however.

Yes, that's the main use I can see.

So: for each field (ie, user entered text before a colon), we'd have a flag indicating whether it's a boolean filter or not, a list of characters to include when parsing (or possibly a list of characters not to include), and flags indicating whether to allow escape characters during the parsing and whether to allow quoting the value in " or some other character.

A list of characters becomes unworkable with unicode - I think it's better to have a virtual method on the processor object which the user can override - it just needs to decide "is this a word character". That makes the parsing of AT&T, I.B.M., etc which we currently do hard to reproduce under this scheme though, so perhaps the interface needs to be a little more sophisticated. There's a balance to strike between simplicity and functionality here, and it needs to be reasonably efficient too.
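
One way to picture the overridable "is this a word character" hook is sketched below. `WordCharPolicy`, `AmpersandPolicy` and `split_words` are hypothetical names, and the ASCII-only default is an assumption made to keep the sketch short.

```cpp
#include <string>
#include <vector>

// Hypothetical base class; the default treats ASCII letters and digits as
// word characters. A real implementation would consult Unicode categories.
class WordCharPolicy {
public:
    virtual ~WordCharPolicy() {}
    virtual bool is_word_char(char32_t ch) const {
        return (ch >= U'0' && ch <= U'9') || (ch >= U'A' && ch <= U'Z') ||
               (ch >= U'a' && ch <= U'z');
    }
};

// User override: additionally accept '&' inside words.
class AmpersandPolicy : public WordCharPolicy {
public:
    bool is_word_char(char32_t ch) const override {
        return ch == U'&' || WordCharPolicy::is_word_char(ch);
    }
};

// Splits `text` into maximal runs of word characters under `policy`.
std::vector<std::u32string> split_words(const std::u32string& text,
                                        const WordCharPolicy& policy) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t ch : text) {
        if (policy.is_word_char(ch)) {
            current += ch;
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

A per-character predicate like this keeps "AT&T" together, but it would equally accept "&&&" as a word, which illustrates why the interface may need to be more sophisticated.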

We'd also have a list of FieldProcessors for each field which are applied to that field in turn, until one returns successfully. For full flexibility, the return value would be able to represent multiple terms, or a structured query, but that would probably be best left simple for now.

I'd not really thought about this, but a Query object would be more general. Perhaps ultimately features like wildcards should be implemented like this?

I'll take a look at this either today or tomorrow, unless you're already working on it.

I'm not, though there's overlap with other changes. Feel free to work on it if you want, but beware that there are a lot of edge cases in the QueryParser and the test coverage could be better. I'm away this weekend anyway.

comment:4 by Richard Boulton, 17 years ago

As it turned out, I didn't get a chance to look at it. I'll comment here when I do get round to looking at it; or feel free to assign the bug to yourself if you want to look at it before then.

comment:5 by Richard Boulton, 17 years ago

Status: new → assigned

comment:6 by Mark Hammond, 17 years ago

Cc: mhammond@… added
Operating System: All

comment:8 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.1

comment:9 by Richard Boulton, 16 years ago

Blocking: 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:10 by Olly Betts, 15 years ago

Milestone: 1.1.0 → 1.1.1

This is an API addition, so moving to milestone:1.1.1.

comment:11 by Olly Betts, 15 years ago

Milestone: 1.1.1 → 1.1.4

Triaging milestone:1.1.1 bugs.

comment:12 by Christoph Hack, 15 years ago

Cc: c.hack@… added

I just stumbled upon this ticket (ok, ojwb and richardb looked it up for me *g*) and I think this would be a really important API addition. A common use case might be to translate "user:username" to "Uuserid" (or category names to category IDs, and so on).

comment:13 by Olly Betts, 15 years ago

Priority: normal → high

comment:14 by Olly Betts, 15 years ago

Milestone: 1.1.4 → 1.3.0

Bumping to stay on track for release.

comment:15 by Richard Boulton, 15 years ago

I'm quite keen on having some support for this fairly soon, so if I find time to make a working, tested, patch, I might unbump this. Leaving this for 1.3.0 for now, though, since I haven't got time yet!

comment:16 by Olly Betts, 15 years ago

This shouldn't need incompatible API changes, and it should be possible to do it without breaking the ABI, so it could probably go in 1.2.x. I mostly chose 1.3.0 as it's likely to be quite a major change.

There may be scope for "reprieving" the odd ticket, but it's going to reach the point very soon where either we have to delay 1.2.0, or start to chop out new features which are largely in place, and neither is very appealing.

comment:17 by Olly Betts, 14 years ago

Added support for quoting boolean terms with " (and "" is a literal ") to trunk r13823.

comment:18 by Olly Betts, 12 years ago

Milestone: 1.3.0 → 1.3.x

comment:19 by Olly Betts, 12 years ago

Milestone: 1.3.x → 1.3.1
Owner: changed from Richard Boulton to Olly Betts
Status: assigned → new

I have now added FieldProcessor on trunk. I'm unsure of exactly how it should interact with other features, and it would be rather nice if it was possible to implement many of the existing QP features using it - it looks like it isn't currently for most of them, which I feel is a bit of a sign that the current design isn't flexible enough.

Marked as experimental for now. Will be in 1.3.1 in some form so updating milestone.

comment:20 by Olly Betts, 12 years ago

Status: new → assigned

So to clarify, FieldProcessor currently provides the "How to convert the characters parsed into a term" part, and it's set as boolean or not, so that's the first part too.

So this is enough to implement tux21b's «translate "user:username" to "Uuserid"» case, for example.
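
The user:username case might look roughly like this. The sketch is a standalone stand-in that does not use Xapian: `UserIdProcessor` is a hypothetical class, the in-memory map stands in for whatever lookup an application would really use, and the result is a plain term string rather than a query object.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a field processor; it does not derive from any
// Xapian class. Maps the text after "user:" to the term "U" + user id.
class UserIdProcessor {
    std::map<std::string, std::string> ids_;  // username -> user id (assumed source)

public:
    explicit UserIdProcessor(const std::map<std::string, std::string>& ids)
        : ids_(ids) {}

    // Returns "U" + id for a known username, or an empty string if unknown.
    std::string operator()(const std::string& username) const {
        auto it = ids_.find(username);
        return it == ids_.end() ? std::string() : "U" + it->second;
    }
};
```

How an unknown username should be handled (no match, an error, or a fallback term) is a design choice the ticket does not settle; the empty string here is just a placeholder.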

We don't have the "Which characters to include when parsing" part yet. With that, at least some QP features could be handled (like wildcards for example).

I guess a virtual method which says whether a given unicode character is in the word (or not) would work. Another way would be to pass in a std::string::const_iterator pair, though that might be harder to wrap. Or the query string and an offset perhaps?
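
To illustrate the iterator-pair alternative, here is a sketch of a scanner that sees the whole remaining input rather than one character at a time. `scan_wildcard_term` is a hypothetical function, and the rule "ASCII alphanumerics with an optional trailing '*'" is an assumption chosen to echo the wildcard example.

```cpp
#include <cctype>
#include <string>

// Hypothetical scanner: given [begin, end), returns an iterator one past the
// last character belonging to the term. Accepts ASCII alphanumerics followed
// by at most one trailing '*'. A '*' with no preceding word is not accepted.
std::string::const_iterator scan_wildcard_term(
    std::string::const_iterator begin, std::string::const_iterator end) {
    std::string::const_iterator it = begin;
    while (it != end && std::isalnum(static_cast<unsigned char>(*it))) ++it;
    if (it != begin && it != end && *it == '*') ++it;  // trailing wildcard only
    return it;
}
```

A position-dependent rule like "wildcard only at the end" is awkward to express with a per-character predicate, which is one argument for passing a range or an offset instead.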

comment:21 by Olly Betts, 11 years ago

Milestone: 1.3.1 → 1.3.2

It's time we got 1.3.1 out.

comment:22 by Olly Betts, 11 years ago

Milestone: 1.3.2 → 1.3.3

comment:23 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.1.0
Resolution: fixed
Status: assigned → closed
Version: SVN trunk → 1.3.1

Reviewing the discussion, most of the issues raised were addressed by allowing double quotes to be quoted by doubling them, or by the addition of FieldProcessor in 1.3.1.

The only remaining aspect seems to be being able to specify the characters which make up words, etc, which is already covered by #113, so closing this ticket.

comment:24 by Olly Betts, 9 years ago

Milestone: 1.1.0 → 1.3.1
Version: 1.3.1 → SVN trunk