Opened 9 years ago

Closed 8 years ago

#563 closed enhancement (fixed)

Add a mode for indexing only stemmed terms in TermGenerator

Reported by: Vitaliy Filippov
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.11
Component: QueryParser
Version: 1.2.6
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

Many search engines index only stems and discard the exact terms. This can be convenient if you don't need to search for exact terms, and it greatly reduces the size of the index.

It would be good for TermGenerator to have such an indexing mode.

Attachments (1)

modified.patch (7.0 KB) - added by Sehaj Singh Kalra 8 years ago.
Combined Patch for this(#563) as well as for #562


Change History (6)

comment:1 by Olly Betts, 9 years ago

Milestone: 1.3.x

by Sehaj Singh Kalra, 8 years ago

Attachment: modified.patch added

Combined Patch for this(#563) as well as for #562

comment:2 by Sehaj Singh Kalra, 8 years ago

This patch provides matching stemming modes for QueryParser and TermGenerator. The indexing mode can take one of the following four values:

  1. STEM_NONE: Don't index any stemmed words.
  2. STEM_SOME: Index both stemmed and full (non-stemmed) words. (Note: stemmed words are prefixed with "Z".)
  3. STEM_ALL: Index only stemmed words. (Note: stemmed words do NOT have the "Z" prefix.)
  4. STEM_ALL_Z: Index only stemmed words. (Note: stemmed words have the "Z" prefix.)

Correspondingly, a new stemming strategy QueryParser::STEM_ALL_Z has been introduced.
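
To illustrate which terms each of the four modes would produce for a single word, here is a minimal, self-contained sketch. It does not use the Xapian API: `StemMode`, `toy_stem` and `terms_for` are hypothetical names made up for this example, and `toy_stem` is a crude stand-in for a real stemmer.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of the four indexing modes described above.
enum class StemMode { STEM_NONE, STEM_SOME, STEM_ALL, STEM_ALL_Z };

// Toy stand-in for a real stemmer (assumption for illustration only):
// strips a trailing "ing" or a trailing "s".
std::string toy_stem(const std::string& word) {
    if (word.size() > 4 && word.compare(word.size() - 3, 3, "ing") == 0)
        return word.substr(0, word.size() - 3);
    if (word.size() > 2 && word.back() == 's')
        return word.substr(0, word.size() - 1);
    return word;
}

// Return the terms that would be indexed for one word under the given mode.
std::vector<std::string> terms_for(const std::string& word, StemMode mode) {
    std::vector<std::string> terms;
    switch (mode) {
    case StemMode::STEM_NONE:
        // Only the unstemmed word.
        terms.push_back(word);
        break;
    case StemMode::STEM_SOME:
        // The unstemmed word plus the stem with a "Z" prefix.
        terms.push_back(word);
        terms.push_back("Z" + toy_stem(word));
        break;
    case StemMode::STEM_ALL:
        // Only the stem, without a prefix.
        terms.push_back(toy_stem(word));
        break;
    case StemMode::STEM_ALL_Z:
        // Only the stem, with a "Z" prefix.
        terms.push_back("Z" + toy_stem(word));
        break;
    }
    return terms;
}
```

For example, under these assumptions `terms_for("searching", StemMode::STEM_SOME)` yields the two terms `searching` and `Zsearch`, while `StemMode::STEM_ALL` yields only `search`.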

comment:3 by Olly Betts, 8 years ago

Description: modified (diff)

Thanks for the patch. It looks pretty good to me, but a few comments:

Some test coverage for the new modes would be good - we already have tests for the existing STEM_xxx modes in tests/queryparsertest.cc, and for the now default (previously only) stemming mode of TermGenerator in tests/termgentest.cc.

It's better to just write string stem; rather than string stem(""); since std::string objects are empty by default, and the compiler can special case default initialisation and handle it more efficiently (GCC does, I haven't looked at other compilers closely).
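
As a small illustration of that point, the following sketch shows that both forms produce an empty string, so the explicit `("")` adds nothing. The function name `both_empty` is made up for this example.

```cpp
#include <string>

// Both declarations yield an empty std::string; the first simply relies
// on the default constructor.
bool both_empty() {
    std::string a;      // default-initialised, the suggested form
    std::string b("");  // constructed from an empty C string
    return a.empty() && b.empty() && a == b;
}
```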

And a couple of style issues:

Please put a space after keywords followed by an opening bracket (so if (foo) not if(foo)) to distinguish them more clearly visually from function calls.

For Xapian code, we use a 4-space indent, tab-filled, with a tab being 8 spaces wide. I think your editor has tabs set to 4 spaces wide - the indentation of some of the changed lines is too deep with the standard settings anyway.

comment:4 by Olly Betts, 8 years ago

Milestone: 1.3.x → 1.2.11
Status: new → assigned

Applied the remaining half of this patch which corresponds to this ticket in r16628.

Writing testcases revealed that it wasn't adding term positions in all cases where it should have been, so I tweaked it to do that correctly.

Marking to consider backporting.

comment:5 by Olly Betts, 8 years ago

Resolution: fixed
Status: assigned → closed

Backported in r16716 and r16718.
