Opened 17 years ago
Last modified 20 months ago
#167 assigned enhancement
Add mode to query parser to search for both stemmed and unstemmed forms
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 2.0.0 |
Component: | QueryParser | Version: | git master |
Severity: | minor | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description (last modified by )
Now that we store both the stemmed and unstemmed forms of each word in the database, it might be nice to add a new stemming mode to the query parser which takes each word in the query and generates an "OR" query for it with two parts: the unstemmed form and the stemmed form. Each query would then match any document containing words which match the stemmed form, but would give documents containing the unstemmed form a higher weight.
We might call this option "STEM_BOTH", or some better name that someone other than me can think of.
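A minimal sketch of the expansion described above, written in plain Python rather than against the real parser. It assumes the usual convention that stemmed terms carry a "Z" prefix; `toy_stem`, `expand_both` and the tuple-based query tree are hypothetical names made up for illustration, not Xapian API.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_both(query, stem=toy_stem, default_op="OR"):
    """Turn each query word into (word OR Z<stem>) and join the words.

    The result is a nested tuple of the form (operator, child, child, ...).
    Returns None for an empty query.
    """
    per_word = []
    for word in query.lower().split():
        # Unstemmed form first, then the "Z"-prefixed stemmed form.
        per_word.append(("OR", word, "Z" + stem(word)))
    if not per_word:
        return None
    if len(per_word) == 1:
        return per_word[0]
    return (default_op, *per_word)


if __name__ == "__main__":
    print(expand_both("running dogs"))
```

This only shows the shape of the generated query. It does nothing about weighting, so a document containing the exact form is not yet ranked any higher than one containing another inflection.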
Change History (6)
comment:1 by , 17 years ago
Operating System: | → All |
---|---|
Severity: | normal → enhancement |
Status: | new → assigned |
comment:2 by , 17 years ago
Yes, something to adjust the weights might be a good idea. I'm not quite sure what it would do, though: perhaps a synonym, but with the wdf for the unstemmed form given a multiplier, making unstemmed forms match with a higher effective wdf. We probably need to experiment with a few things.
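One way to read the "multiplier" idea, as a rough sketch only: treat the pair as a single synonym-like term whose wdf is the stemmed wdf plus a bonus for each exact-form occurrence. This assumes the stemmed term's wdf already counts every occurrence, including the exact-form ones. The function name and the default multiplier are hypothetical, and nothing here reflects an existing weighting scheme.

```python
def effective_wdf(wdf_stemmed, wdf_unstemmed, multiplier=2.0):
    """Combine the two wdf values into one effective wdf.

    Assumption: wdf_stemmed counts all occurrences that stem to the query
    word, so exact-form occurrences are already included once. Each of them
    gets an extra (multiplier - 1) on top, so it counts `multiplier` times.
    """
    if wdf_unstemmed > wdf_stemmed:
        raise ValueError("unstemmed wdf cannot exceed stemmed wdf")
    return wdf_stemmed + (multiplier - 1.0) * wdf_unstemmed


if __name__ == "__main__":
    # Query word "running": document A contains "running" three times,
    # document B contains "runs" three times.
    print(effective_wdf(3, 3))  # document A
    print(effective_wdf(3, 0))  # document B
```

With a multiplier of 2, document A ends up with an effective wdf of 6 and document B with 3, so the exact match ranks higher while both still match.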
comment:4 by , 12 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.3.x |
comment:5 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|---|
This is just an API addition, so bumping to 1.4.x.
comment:6 by , 7 years ago
Some thoughts on a plan of attack (from an IRC discussion):
I'd suggest first just getting the parser to generate <term> OR <stemmed_term>
and worrying about the best way to weight it once that's working.
The parser is in xapian-core/queryparser/queryparser.lemony
and uses the lemon parser generator (which is similar to yacc and bison).
I think this probably wants to happen in the same places that auto synonyms do, so look at Term::get_query_with_auto_synonyms()
to start with.
comment:7 by , 20 months ago
Milestone: | 1.4.x → 2.0.0 |
---|---|
Version: | SVN trunk → git master |
Perhaps a special query operator would be useful here. The statistics are probably going to be different, since we know that documents indexed by the unstemmed form are (or at least should be) indexed by the stemmed form too.
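To illustrate why the statistics differ, here is a small sketch comparing a generic OR estimate that treats the two terms as independent with the exact figure when one set of documents is contained in the other. Both function names are hypothetical, and the independence formula is a textbook estimate, not necessarily what any existing operator computes.

```python
def or_termfreq_independent(tf_a, tf_b, doc_count):
    """Estimated number of documents matching (A OR B) if A and B were independent."""
    return tf_a + tf_b - tf_a * tf_b / doc_count


def or_termfreq_subset(tf_unstemmed, tf_stemmed):
    """Exact count when every document with the unstemmed term also has the stemmed term.

    The union of a set with one of its subsets is just the larger set.
    """
    if tf_unstemmed > tf_stemmed:
        raise ValueError("unstemmed termfreq cannot exceed stemmed termfreq")
    return tf_stemmed


if __name__ == "__main__":
    # 1000 documents; 100 contain the stemmed form, 40 of those the exact form.
    print(or_termfreq_independent(40, 100, 1000))
    print(or_termfreq_subset(40, 100))
```

In this made-up example the independence estimate gives 136 matching documents, while the true figure is 100, which is the kind of gap a dedicated operator could avoid.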