Opened 12 years ago

Closed 10 years ago

#608 closed enhancement (fixed)

as_partial_query / as_wildcarded_query and limiting the expansion of terms

Reported by: Greg
Owned by: Olly Betts
Priority: normal
Milestone: 1.3.3
Component: QueryParser
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

  1. At the moment only as_wildcarded_query uses the max_wildcard_expansion argument, while as_partial_query doesn't.
  2. It would probably be better, instead of throwing an exception, to return N (max_wildcard_expansion) terms ordered by termfreq (a sketch of this idea follows below).
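
A minimal sketch of the idea in point 2, written against the public API (the helper name, the sort-then-truncate approach and the use of Database::allterms_begin() are illustrative choices here, not how the attached patch implements it):

{{{
#!cpp
#include <xapian.h>
#include <algorithm>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Expand `prefix` to at most `max_expansion` terms, preferring the terms
// with the highest term frequency instead of throwing once the limit is hit.
std::vector<std::string>
expand_partial(const Xapian::Database& db,
               const std::string& prefix,
               size_t max_expansion)
{
    std::vector<std::pair<Xapian::doccount, std::string> > terms;
    for (Xapian::TermIterator t = db.allterms_begin(prefix);
         t != db.allterms_end(prefix); ++t) {
        terms.push_back(std::make_pair(t.get_termfreq(), *t));
    }
    // Sort by descending termfreq and keep only the first max_expansion.
    std::sort(terms.begin(), terms.end(),
              std::greater<std::pair<Xapian::doccount, std::string> >());
    if (terms.size() > max_expansion) terms.resize(max_expansion);

    std::vector<std::string> result;
    for (size_t i = 0; i != terms.size(); ++i)
        result.push_back(terms[i].second);
    return result;
}
}}}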

Attachments (3)

x.patch (2.1 KB ) - added by Greg 12 years ago.
A bare-bones implementation, sorting terms based on their frequency
ticket-608-max-wildcard.patch (3.2 KB ) - added by Olly Betts 12 years ago.
updated patch against trunk
608-lazy-wildcard.patch (2.1 KB ) - added by Olly Betts 10 years ago.
New implementation on top of the lazy wildcard support


Change History (14)

by Greg, 12 years ago

Attachment: x.patch added

A bare-bones implementation, sorting terms based on their frequency

comment:1 by Olly Betts, 12 years ago

Milestone: 1.3.2

You mentioned on IRC you'd fixed some issues and would update the patch, so I've been holding off on this. Any updates?

in reply to: comment:1 · comment:2 by Greg, 12 years ago

Replying to olly:

You mentioned on IRC you'd fixed some issues and would update the patch, so I've been holding off on this. Any updates?

This is the latest patch file; I overwrote the previous ones.

by Olly Betts, 12 years ago

updated patch against trunk

comment:3 by Olly Betts, 12 years ago

Status: new → assigned

Sorry for the delay in looking at this.

I've attached an updated version of the patch. Changes are:

  • It's against trunk (since we need to apply this there before we can consider backporting it)
  • A number of tweaks to make it match the coding standards in HACKING
  • Use appropriate types rather than "unsigned int".
  • Pass std::string and TermF by const reference rather than value
  • Use a "shim" iterator to avoid needing the intermediate subqs_partial
  • If there are multiple active prefixes, we pick the best N terms over all of them, rather than picking the best N from the first and only considering the others if we haven't found N
  • Don't bother tracking min_freq until we make the heap, at which point we can just read it off the top of the heap (see the sketch after this list)
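
A rough illustration of the last two points, again just a sketch against the public API rather than the patch itself (best_n_over_prefixes and the use of std::priority_queue are choices made for this example; the point is that one shared min-heap yields the globally best N, with min_freq always available as heap.top()):

{{{
#!cpp
#include <xapian.h>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<Xapian::doccount, std::string> FreqTerm;

// Keep the globally best `n` (termfreq, term) pairs across every active
// prefix, rather than filling the quota from the first prefix alone.
std::vector<std::string>
best_n_over_prefixes(const Xapian::Database& db,
                     const std::vector<std::string>& prefixes,
                     size_t n)
{
    // Min-heap ordered by termfreq, so the least frequent kept term is on top.
    std::priority_queue<FreqTerm, std::vector<FreqTerm>,
                        std::greater<FreqTerm> > heap;
    for (size_t p = 0; p != prefixes.size(); ++p) {
        for (Xapian::TermIterator t = db.allterms_begin(prefixes[p]);
             t != db.allterms_end(prefixes[p]); ++t) {
            FreqTerm ft(t.get_termfreq(), *t);
            if (heap.size() < n) {
                heap.push(ft);
            } else if (ft.first > heap.top().first) {
                heap.pop();   // drop the current least frequent term
                heap.push(ft);
            }
        }
    }
    std::vector<std::string> result;   // least frequent first
    while (!heap.empty()) {
        result.push_back(heap.top().second);
        heap.pop();
    }
    return result;
}
}}}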

Still to do:

  • Needs feature tests
  • Needs corresponding documentation updates
  • I think if we backport this to 1.2, we should default to no limit to avoid a change in behaviour, but for trunk we could reasonably impose a default - limiting the number of terms for FLAG_PARTIAL will probably give better results as well as being faster. Currently this is 100, but is that a good default?
  • Do we want the same feature for wildcards? If so, do we want it instead of the current limit on expansion which throws an exception, or as well?
  • Should we have separate limits in the API for wildcards and partial expansions?

in reply to: comment:3 · comment:4 by Greg, 12 years ago

As far as the magic number (100) goes, we ran a test to find out how large the number would need to be to hit whatever the user typed.

We used our database and actual user queries as the test cases. For us, 100 terms gave an 84% hit ratio given two letters and 98% given three letters; reducing the number of terms to 40 for three letters still gave 95% coverage. In general the number of terms needed rises exponentially from around 80%. For comparison, without sorting, the numbers were 1340 terms for 84% (2 letters) and 600 for 98% (3 letters).

The test is of course only valid for our data and usage pattern; however, the number of letters in the partial term should probably be taken into account when choosing a default or changing max_wildcard_expansion.

comment:5 by Olly Betts, 12 years ago

To check I'm following: that "84% hit ratio" is the proportion of documents which would be matched when completing to the most frequent N terms with the prefix, rather than to all terms with the prefix?

That's certainly useful information, though it may be that returning all the documents the full expansion would return isn't actually the target to aim for - we want to return documents matching what the user is looking for, and it seems likely that the frequencies of occurrence of terms in queries are roughly the same as in documents, so the word they are in the middle of typing is more likely to be one with a higher termfreq. But that's harder to evaluate (you'd need some sort of user relevance assessments, rather than just counting the matches). 100 seems a plausible choice anyway.

comment:6 by Greg, 12 years ago

No, the hit ratio is for the actual term to which the partial 'should' expand, e.g. if my final query is "crocodile dundee" then I expect to get "dundee" as one of the terms for the partial expansion of "crocodile d".

So we basically took all the queries users submitted over a period of time (those that yielded results) and checked how high N needs to be for the last term to actually expand to what they were looking for.

To test it I compiled a build that also sorted (sort_heap) the terms before creating the query, used a high max_wildcard_expansion, then parsed the expanded terms out of get_description() and checked at which position the term the user meant appeared.
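
To make that bookkeeping concrete, a hypothetical fragment (not from the ticket; the rank data shown is invented) of how the coverage for a given N can be read off once you know the position of the intended term in each query's frequency-sorted expansion:

{{{
#!cpp
#include <cstdio>
#include <vector>

// ranks[i] is the 1-based position of the intended term in query i's
// frequency-sorted expansion, or 0 if it never appeared at all.
double coverage_at(const std::vector<unsigned>& ranks, unsigned n)
{
    unsigned hits = 0;
    for (size_t i = 0; i != ranks.size(); ++i)
        if (ranks[i] != 0 && ranks[i] <= n) ++hits;
    return ranks.empty() ? 0.0 : double(hits) / ranks.size();
}

int main()
{
    // Invented example ranks, as if gathered by parsing get_description().
    std::vector<unsigned> ranks;
    ranks.push_back(3); ranks.push_back(57); ranks.push_back(0);
    ranks.push_back(12); ranks.push_back(101);
    std::printf("coverage at N=100: %.0f%%\n",
                100.0 * coverage_at(ranks, 100));
    return 0;
}
}}}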

comment:7 by Olly Betts, 10 years ago

Milestone: 1.3.2 → 1.3.3

Sorry, I think I'm going to have to bump this to 1.3.3 (though the 1.3.2 - 1.3.3 gap should be much much shorter than the 1.3.1 - 1.3.2 gap has turned out to be).

by Olly Betts, 10 years ago

Attachment: 608-lazy-wildcard.patch added

New implementation on top of the lazy wildcard support

comment:8 by Olly Betts, 10 years ago

Version: SVN trunk

I merged the lazy wildcard support earlier today (which includes applying the wildcard expansion limit to partial terms), and I've just attached a patch which implements choosing the most frequent terms instead of throwing an exception.

I think we probably want a way to choose between the exception and picking the most frequent terms, and probably the limit and exception/picking settings should be separate for partial and wildcard - picking the best N seems a good strategy for a partial expansion, but it will probably surprise users to find that a wildcard only expands to a subset of the terms it matches.

Also need tests and docs.

Last edited 10 years ago by Olly Betts

comment:9 by Olly Betts, 10 years ago

I've pushed [11250717e48da2d2c3fa86437c7670a8b8034d8b/git] which provides a choice of modes for limiting expansion: throw an exception (i.e. the existing behaviour), just the first N in term sort order, or the most frequent N terms.

QueryParser now uses the latter for partial expansion.

There's one issue - for multidatabases the limit is applied separately for each subdatabase, which isn't great. OP_ELITE_SET has a similar problem currently too. I think we probably need custom (and slower) handling for both these cases when multiple databases are involved.

Also, QueryParser should probably have separate limits for partial expansion and wildcards.
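
For reference, the three modes correspond to per-query constants passed to the OP_WILDCARD constructor; the names and signature below are taken from how the API ended up in released Xapian and should be treated as an assumption here, since the comment only cites the commit:

{{{
#!cpp
#include <xapian.h>

int main()
{
    // The three limit modes (names assumed from the released API):
    //   WILDCARD_LIMIT_ERROR         - throw if the limit is exceeded (old behaviour)
    //   WILDCARD_LIMIT_FIRST         - keep the first N in term sort order
    //   WILDCARD_LIMIT_MOST_FREQUENT - keep the N most frequent terms
    Xapian::Query q_error(Xapian::Query::OP_WILDCARD, "croc", 100,
                          Xapian::Query::WILDCARD_LIMIT_ERROR);
    Xapian::Query q_first(Xapian::Query::OP_WILDCARD, "croc", 100,
                          Xapian::Query::WILDCARD_LIMIT_FIRST);
    Xapian::Query q_freq(Xapian::Query::OP_WILDCARD, "croc", 100,
                         Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT);
    return 0;
}
}}}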

comment:10 by Olly Betts, 10 years ago

Created #677 for the multidatabase issues.

comment:11 by Olly Betts, 10 years ago

Resolution: fixed
Status: assigned → closed

[52fb536c45d4e6097148203e5b7c38a2b3194f16/git] provides separate limits for partial terms and wildcards (and partial terms now default to a limit of 100) so closing this ticket now.
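
A usage sketch of the separate limits from the QueryParser side; the set_max_expansion() name, its arguments and the flags constants are as in later released Xapian and should be treated as an assumption on my part (the database path is a placeholder):

{{{
#!cpp
#include <xapian.h>

int main()
{
    Xapian::Database db("/path/to/db");   // placeholder path
    Xapian::QueryParser qp;
    qp.set_database(db);

    // Wildcards: cap at 500 expansions, keeping the old throw-on-overflow behaviour.
    qp.set_max_expansion(500, Xapian::Query::WILDCARD_LIMIT_ERROR,
                         Xapian::QueryParser::FLAG_WILDCARD);
    // Partial terms: keep only the 100 most frequent completions.
    qp.set_max_expansion(100, Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT,
                         Xapian::QueryParser::FLAG_PARTIAL);

    Xapian::Query q = qp.parse_query("crocodile d",
                                     Xapian::QueryParser::FLAG_PARTIAL |
                                     Xapian::QueryParser::FLAG_WILDCARD);
    (void)q;
    return 0;
}
}}}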
