Opened 17 years ago

Last modified 13 months ago

#167 assigned enhancement

Add mode to query parser to search for both stemmed and unstemmed forms

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: QueryParser
Version: git master
Severity: minor
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Now that we store both the stemmed and unstemmed forms of each word in the database, it might be nice to add a new stemming mode to the query parser which takes each word in the query and generates an "OR" query for it with two parts: the unstemmed form and the stemmed form. Each query would then match any document containing words which match the stemmed form, but documents containing the unstemmed form would get a higher weight.
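To make the proposal concrete, here is a minimal sketch of the expansion in plain Python. It does not use Xapian: `toy_stem`, `index_terms`, `expand_both` and `branches_matched` are all hypothetical names invented for illustration, the suffix stripping is a crude stand-in for a real stemmer, and counting matched branches is only a rough proxy for "higher weight". The "Z" prefix on stemmed terms follows the convention Xapian uses for stemmed terms.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Index every word under both its unstemmed form and its Z-prefixed stem."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        terms.add("Z" + toy_stem(word))
    return terms


def expand_both(query_text):
    """Turn each query word into an (unstemmed, stemmed) OR pair."""
    return [(word, "Z" + toy_stem(word)) for word in query_text.lower().split()]


def branches_matched(doc_terms, pair):
    """Count how many branches of one OR pair the document matches (0, 1 or 2)."""
    return sum(1 for term in pair if term in doc_terms)


exact = index_terms("walking home")    # contains the query word as typed
variant = index_terms("walked home")   # contains only another form of the stem
pair = expand_both("Walking")[0]       # ("walking", "Zwalk")
print(branches_matched(exact, pair), branches_matched(variant, pair))
```

Both documents match the pair, but the one containing the word exactly as typed matches both branches, which is what would give it the extra weight under a plain OR.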

We might call this option "STEM_BOTH", or some better name that someone other than me can think of.

Change History (6)

comment:1 by Olly Betts, 17 years ago

Operating System: All
Severity: normal → enhancement
Status: new → assigned

Perhaps a special query operator would be useful here - the statistics are probably going to be different since we know that documents indexed by the unstemmed form are (or at least should be) indexed by the stemmed form too.
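A small sketch of why the statistics differ, using made-up posting lists (the document ids, the collection size and both function names are assumptions for illustration). Because the unstemmed term's documents are a subset of the stemmed term's, the true frequency of the OR is just the stemmed term's frequency, whereas a generic estimate that treats the two terms as independent comes out too high. The independence formula here is a textbook one, not a claim about how Xapian estimates frequencies.

```python
# Hypothetical posting lists: every document containing "walking" also has "Zwalk".
postings = {
    "walking": {1, 3},
    "Zwalk": {1, 2, 3, 4},
}
collection_size = 10  # assumed number of documents in the database


def actual_or_termfreq(postings, unstemmed, stemmed):
    """Number of documents matching either term, computed from the posting lists."""
    return len(postings[unstemmed] | postings[stemmed])


def independent_or_estimate(postings, unstemmed, stemmed, n_docs):
    """Generic OR estimate that wrongly assumes the two terms are independent."""
    p_a = len(postings[unstemmed]) / n_docs
    p_b = len(postings[stemmed]) / n_docs
    return n_docs * (1 - (1 - p_a) * (1 - p_b))


actual = actual_or_termfreq(postings, "walking", "Zwalk")
estimate = independent_or_estimate(postings, "walking", "Zwalk", collection_size)
print(actual, estimate)
```

A dedicated operator could use the subset relationship directly and report the stemmed term's frequency rather than a combined estimate.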

comment:2 by Richard Boulton, 17 years ago

Yes, something to adjust the weights might be a good idea. I'm not quite sure what it would do, though: perhaps a synonym, but with the wdf for the unstemmed form given a multiplier, making unstemmed forms match with a higher effective wdf. We probably need to experiment with a few things.
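One possible reading of the multiplier idea, sketched in plain Python: treat the pair as a single synonym-like term whose wdf is the stemmed wdf plus the unstemmed wdf scaled by a multiplier. The function name, the formula and the multiplier value of 2.0 are all assumptions for experimentation, not a settled design.

```python
def effective_wdf(stemmed_wdf, unstemmed_wdf, multiplier=2.0):
    """Combine the two wdfs into one, boosting occurrences of the unstemmed form.

    Assumed formula: stemmed wdf plus multiplier times unstemmed wdf.
    """
    return stemmed_wdf + multiplier * unstemmed_wdf


# Document A contains "walking" twice: both forms have a wdf of 2.
# Document B contains "walked" twice: only the stemmed form is present.
doc_a = effective_wdf(stemmed_wdf=2, unstemmed_wdf=2)
doc_b = effective_wdf(stemmed_wdf=2, unstemmed_wdf=0)
print(doc_a, doc_b)
```

With a multiplier of 0 the unstemmed form gives no boost and the pair behaves like the stemmed term alone, which makes this a convenient knob to tune.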

comment:4 by Olly Betts, 11 years ago

Description: modified (diff)
Milestone: 1.3.x

comment:5 by Olly Betts, 9 years ago

Milestone: 1.3.x → 1.4.x

This is just an API addition, so bumping to 1.4.x.

comment:6 by Olly Betts, 6 years ago

Some thoughts on a plan of attack (from an IRC discussion):

I'd suggest first just getting the parser to generate <term> OR <stemmed_term> and worrying about the best way to weight it once that's working.

The parser is in xapian-core/queryparser/queryparser.lemony and uses the lemon parser generator (which is similar to yacc and bison).

I think this probably wants to happen in the same places that auto synonyms do, so look at Term::get_query_with_auto_synonyms() to start with.

comment:7 by Olly Betts, 13 months ago

Milestone: 1.4.x → 2.0.0
Version: SVN trunk → git master