Opened 12 years ago
Closed 10 years ago
#608 closed enhancement (fixed)
as_partial_query / as_wildcarded_query and limiting the expansion of terms
Reported by: | Greg | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.3 |
Component: | QueryParser | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
- At the moment only as_wildcarded_query uses the max_wildcard_expansion argument; as_partial_query doesn't.
- Instead of throwing an exception when the limit is exceeded, it would probably be better to return the N (= max_wildcard_expansion) most frequent terms, ordered by termfreq.
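For illustration only - this is not the attached patch, and all the names below are made up - a minimal standalone sketch of the proposed behaviour (expand a partial term to its most frequent completions and OR them together) could look something like this, using Xapian::Database::allterms_begin():

```cpp
#include <xapian.h>

#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<Xapian::doccount, std::string> FreqTerm;  // (termfreq, term)

// Expand `prefix` to at most `max_expansion` terms, preferring the terms
// with the highest termfreq, and combine them with OP_OR.
Xapian::Query
expand_partial(const Xapian::Database& db, const std::string& prefix,
               Xapian::termcount max_expansion)
{
    std::vector<FreqTerm> candidates;
    for (Xapian::TermIterator t = db.allterms_begin(prefix);
         t != db.allterms_end(prefix); ++t) {
        candidates.push_back(FreqTerm(t.get_termfreq(), *t));
    }
    if (candidates.size() > max_expansion) {
        // Move the max_expansion most frequent terms to the front, then
        // drop the rest.
        std::partial_sort(candidates.begin(),
                          candidates.begin() + max_expansion,
                          candidates.end(),
                          [](const FreqTerm& a, const FreqTerm& b) {
                              return a.first > b.first;
                          });
        candidates.resize(max_expansion);
    }
    std::vector<Xapian::Query> subqs;
    for (std::vector<FreqTerm>::const_iterator i = candidates.begin();
         i != candidates.end(); ++i) {
        subqs.push_back(Xapian::Query(i->second));
    }
    return Xapian::Query(Xapian::Query::OP_OR, subqs.begin(), subqs.end());
}
```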
Attachments (3)
Change History (14)
by , 12 years ago
comment:1 by , 12 years ago (follow-up: 2)
Milestone: | → 1.3.2 |
---|
You mentioned on IRC you'd fixed some issues and would update the patch, so I've been holding off on this. Any updates?
comment:2 by , 12 years ago
Replying to olly:
You mentioned on IRC you'd fixed some issues and would update the patch, so I've been holding off on this. Any updates?
This is the latest patch file; I overwrote the previous ones.
comment:3 by , 12 years ago (follow-up: 4)
Status: | new → assigned |
---|
Sorry for the delay in looking at this.
I've attached an updated version of the patch. Changes are:
- It's against trunk (since we need to apply this there before we can consider backporting it)
- A number of tweaks to make it match the coding standards in HACKING
- Use appropriate types rather than "unsigned int".
- Pass std::string and TermF by const reference rather than value
- Use a "shim" iterator to avoid needing the intermediate subqs_partial
- If there are multiple active prefixes, we pick the best N terms over all of them, rather than picking the best N from the first, and only considering the others if we haven't found N
- Don't bother tracking min_freq until we make the heap, at which point we can just read it off the top of the heap (see the sketch below)
Still to do:
- Needs feature tests
- Needs corresponding documentation updates
- I think if we backport this to 1.2, we should default to no limit to avoid a change in behaviour, but for trunk we could reasonably impose a default - limiting the number of terms for FLAG_PARTIAL will probably give better results as well as being faster. Currently this is 100, but is that a good default?
- Do we want the same feature for wildcards? If so, do we want it instead of the current limit on expansion which throws an exception, or as well?
- Should we have separate limits in the API for wildcards and partial expansions?
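To expand on the min_freq bullet in the change list above: once the kept candidates live in a bounded heap ordered so that the least frequent kept term is at the front, the current cut-off frequency is simply the heap's front element. A minimal sketch with illustrative names (not the code in the patch):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<unsigned long, std::string> TermFreqPair;  // (termfreq, term)

// Comparator which makes the *least* frequent kept term sit at the front.
struct FreqGreater {
    bool operator()(const TermFreqPair& a, const TermFreqPair& b) const {
        return a.first > b.first;
    }
};

// Consider `cand` for inclusion in the best-n set held in `heap`.
void
keep_most_frequent(std::vector<TermFreqPair>& heap, std::size_t n,
                   const TermFreqPair& cand)
{
    if (heap.size() < n) {
        heap.push_back(cand);
        std::push_heap(heap.begin(), heap.end(), FreqGreater());
        return;
    }
    // heap.front().first *is* the current minimum kept frequency, so there
    // is no need to track min_freq separately.
    if (cand.first <= heap.front().first) return;
    std::pop_heap(heap.begin(), heap.end(), FreqGreater());
    heap.back() = cand;
    std::push_heap(heap.begin(), heap.end(), FreqGreater());
}
```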
comment:4 by , 12 years ago
As far as the magic number (100) goes, we ran a test to find out what the number would need to be to hit whatever the user typed.
We used our database and actual user queries as the test cases. For us, 100 terms gave an 84% hit ratio given two letters and 98% given three letters; lowering the limit to 40 terms for three letters gave 95% coverage. In general the number of terms needed rises exponentially once you get above roughly an 80% hit ratio. For comparison, without sorting by frequency the corresponding numbers were 1340 terms for 84% (two letters) and 600 for 98% (three letters).
That test is only valid for our data and usage pattern; however, the number of letters in the partial term should probably be taken into account when choosing a default or changing max_wildcard_expansion.
comment:5 by , 12 years ago
To check I'm following: is that "84% hit ratio" the proportion of documents which would be matched when completing to the most frequent N terms with the prefix, rather than to all terms with the prefix?
That's certainly useful information, though it might be that returning all the documents the full expansion would return isn't actually the target to aim for - we want to return documents matching what the user is looking for, and it seems likely that the frequencies of occurrence of terms in queries are going to be roughly the same as in documents, so the word they are in the middle of typing is more likely to be one with a higher termfreq. But that's harder to evaluate (you'd need some sort of user relevance assessments, rather than just counting the matches). 100 seems a plausible choice anyway.
comment:6 by , 12 years ago
No, the hit ratio is for the actual term to which the partial 'should' expand, e.g. if my final query is "crocodile dundee" then I expect to get "dundee" as one of the terms for the partial expansion of "crocodile d".
So we basically took all the queries users submitted over a period of time (those that yielded results) and checked how high N needs to be for the last term to actually expand to what they were looking for.
To test it I compiled a test build that also sorted (sort_heap) the terms before creating the query, used a high max_wildcard_expansion, then parsed the expanded terms out of get_description() and checked what position the term the user meant appeared at.
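For what it's worth, the measurement step described here (leaving aside how the expanded terms were extracted from get_description()) boils down to something like the following sketch; the function names are purely illustrative:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Given the expanded terms in descending-frequency order and the term the
// user actually went on to type, return its 1-based rank, or 0 if absent.
std::size_t
rank_of_intended(const std::vector<std::string>& expansions_by_freq,
                 const std::string& intended)
{
    std::vector<std::string>::const_iterator it =
        std::find(expansions_by_freq.begin(), expansions_by_freq.end(),
                  intended);
    if (it == expansions_by_freq.end()) return 0;
    return static_cast<std::size_t>(it - expansions_by_freq.begin()) + 1;
}

// Fraction of test queries whose intended term falls within the first n
// expansions - the "hit ratio" quoted above.
double
hit_ratio(const std::vector<std::size_t>& ranks, std::size_t n)
{
    if (ranks.empty()) return 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < ranks.size(); ++i)
        if (ranks[i] != 0 && ranks[i] <= n) ++hits;
    return double(hits) / double(ranks.size());
}
```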
comment:7 by , 10 years ago
Milestone: | 1.3.2 → 1.3.3 |
---|
Sorry, I think I'm going to have to bump this to 1.3.3 (though the 1.3.2 - 1.3.3 gap should be much much shorter than the 1.3.1 - 1.3.2 gap has turned out to be).
by , 10 years ago
Attachment: | 608-lazy-wildcard.patch added |
---|
New implementation on top of the lazy wildcard support
comment:8 by , 10 years ago
Version: | → SVN trunk |
---|
I merged the lazy wildcard support earlier today (which includes applying the wildcard expansion limit to partial terms), and I've just attached a patch which implements choosing the most frequent terms instead of giving an exception.
I think we probably want a way to choose between the exception and picking the most frequent terms, and probably the limit and exception/picking settings should be separate for partial and wildcard - picking the best N seems a good strategy for a partial expansion, but it will probably surprise users to find that a wildcard only expands to a subset of the terms it matches.
Also need tests and docs.
comment:9 by , 10 years ago
I've pushed [11250717e48da2d2c3fa86437c7670a8b8034d8b/git] which provides a choice of modes for limiting expansion: throw an exception (i.e. the existing behaviour), just the first N in term sort order, or the most frequent N terms.
QueryParser now uses the latter for partial expansion.
There's one issue - for multidatabases the limit is applied separately for each subdatabase, which isn't great. OP_ELITE_SET has a similar problem currently too. I think we probably need custom (and slower) handling for both these cases when multiple databases are involved.
Also, QueryParser should probably have separate limits for partial expansion and wildcards.
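For reference, assuming the OP_WILDCARD API as it later shipped in the 1.3.x/1.4.x releases (so treat this as a sketch rather than a description of the commit above), the three modes look like this from user code:

```cpp
#include <xapian.h>

#include <string>

// Build a wildcard query for `pattern` capped at `limit` expansions, using
// one of the three modes described above.
Xapian::Query
make_wildcard(const std::string& pattern, Xapian::termcount limit, int mode)
{
    // mode is one of:
    //   Xapian::Query::WILDCARD_LIMIT_ERROR         - throw if over the limit
    //   Xapian::Query::WILDCARD_LIMIT_FIRST         - first `limit` terms in term order
    //   Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT - the `limit` most frequent terms
    return Xapian::Query(Xapian::Query::OP_WILDCARD, pattern, limit, mode);
}
```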
comment:11 by , 10 years ago
Resolution: | → fixed |
---|
Status: | assigned → closed |
[52fb536c45d4e6097148203e5b7c38a2b3194f16/git] provides separate limits for partial terms and wildcards (and partial terms now default to a limit of 100) so closing this ticket now.
A barebone implementation, sorting terms based on their frequency
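Finally, a hedged usage sketch of the QueryParser side that results from the commit in comment:11, as I understand the released API (method name, argument order and flag constants should be double-checked against the version you're using):

```cpp
#include <xapian.h>

// Configure separate expansion limits for wildcards and partial terms.
void configure_parser(Xapian::QueryParser& qp)
{
    // Explicit wildcards: keep the exception-throwing behaviour, but allow
    // up to 1000 expansions (1000 is an arbitrary example value).
    qp.set_max_expansion(1000, Xapian::Query::WILDCARD_LIMIT_ERROR,
                         Xapian::QueryParser::FLAG_WILDCARD);

    // FLAG_PARTIAL: expand to the 100 most frequent matching terms (100
    // being the new default limit mentioned in comment:11).
    qp.set_max_expansion(100, Xapian::Query::WILDCARD_LIMIT_MOST_FREQUENT,
                         Xapian::QueryParser::FLAG_PARTIAL);
}
```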