Opened 7 years ago

Last modified 4 months ago

#750 assigned defect

Teach QueryParser about stopping strategies

Reported by: mgautier Owned by: Olly Betts
Priority: highest Milestone: 1.5.0
Component: QueryParser Version: git master
Severity: normal Keywords:
Cc: kelson@… Blocked By:
Blocking: Operating System: All

Description

It seems that the TermGenerator (with the STOP_ALL and STEM_ALL strategies) does not stop stemmed terms.

For example, with the French stemmer and "le" in the stopwords, the term "lea" will be stemmed to "le" and the term "le" will be added to the document.

Looking into termgenerator_internal.cc (method index_text), it seems that the stopper is never called on the stem, whatever the flags are.

We are using version 1.4.2 but the stopper is never called on the stem even on git master.

Change History (11)

comment:1 by Kelson, 7 years ago

Cc: kelson@… added

comment:2 by mgautier, 7 years ago

Hi,

A little ping.

Does anyone have time to look at it?

Thanks.

comment:3 by Gaurav Arora, 7 years ago

Owner: changed from Olly Betts to Gaurav Arora

I tried looking into the issue and created a pull request proposing a fix.

I have created a test as well which failed with the old version.

See the pull request https://github.com/xapian/xapian/pull/173/. I will review it with Olly and get it fixed.

comment:4 by Gaurav Arora, 7 years ago

There were issues with pull request #173, so I opened a new pull request containing only the changes for this ticket: https://github.com/xapian/xapian/pull/174

comment:5 by Olly Betts, 7 years ago

The patch which exposed this functionality unfortunately mis-documented what two of the three options actually do (STOP_NONE is OK, the others aren't).

The stopper is expected to always be fed the unstemmed form (it takes a word not a stem). Passing stemmed forms to a stopper which is checking a list of words seems a bad idea. The stemmer maps words to stems, and the two are really separate spaces (in some cases, the stem happens to be the same string as one of the words which stems to it, but that doesn't mean the stemmer is mapping words to words). So for example, using the English stemmer, the word "tease" has the stem "teas". But that's nothing to do with the word "teas" (which has the stem "tea").

STOP_STEMMED is actually "check the unstemmed form with the stopper, and if it's a stop word, only index its unstemmed form" - this is a useful thing to do because it means searches for phrases which include stopwords work (the unstemmed forms are indexed with positional information).

STOP_ALL is actually "check the unstemmed form with the stopper, and if it's a stop word, skip the word". At least in English there are cases where a word has multiple meanings, and only one is really a stopword. For example, "can" would probably be on an English stopword list, because it's a form of the irregular verb meaning "to be able to". But it's also a noun (a metal container) and a different regular verb (meaning to put something in such a metal container), etc, and those words shouldn't really be stopwords. So while "cans" and "canned" also stem the same way as "can", it's unhelpful to treat them as stopwords too.

If you use the same stopper when parsing queries, this should work nicely - "can" will also be treated as a stop word in queries, but a search for "canned" will still match "canned" or "cans" in documents.
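To make the three strategies described above concrete, here is a minimal standalone sketch. It is not the TermGenerator code: the stopword list, the lookup-table stemmer and the function names are assumptions made up for illustration, and it models the case where both unstemmed and "Z"-prefixed stemmed terms are normally generated.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins (assumptions, not Xapian classes): a list of unstemmed
// stopwords and a lookup-table "stemmer".
enum class StopStrategy { STOP_NONE, STOP_ALL, STOP_STEMMED };

const std::set<std::string> kStopwords = {"can"};

std::string toy_stem(const std::string& word) {
    static const std::map<std::string, std::string> stems = {
        {"can", "can"}, {"cans", "can"}, {"canned", "can"}};
    auto it = stems.find(word);
    return it == stems.end() ? word : it->second;
}

// Returns the terms generated for one word: the unstemmed form and/or the
// stem with a "Z" prefix. The stopper only ever sees the unstemmed word.
std::vector<std::string> index_word(const std::string& word, StopStrategy s) {
    bool is_stop = s != StopStrategy::STOP_NONE && kStopwords.count(word) > 0;
    std::vector<std::string> terms;
    if (is_stop && s == StopStrategy::STOP_ALL) return terms;  // skip the word
    terms.push_back(word);                                     // unstemmed form
    if (!is_stop) terms.push_back("Z" + toy_stem(word));       // stemmed form
    return terms;
}
```

In this model "can" is dropped under STOP_ALL and indexed only in unstemmed form under STOP_STEMMED, while "canned" still produces both its unstemmed form and the stem, because only the unstemmed word is checked against the stopper.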

English is particularly rife with words with lots of different meanings, and I'm not sure how common this situation is in other languages, but as best I can make out your example "lea" is actually a name (https://en.wiktionary.org/wiki/L%C3%A9a) which happens to stem to the same thing as the article "le", in which case I'd argue that "lea" really shouldn't be treated as a stopword either.

Given STOP_STEMMED is the default, and before this patch it was long-established as the hard-coded behaviour when a stopper was set, changing what it means now to try to match what the current API documentation says would be unhelpful, and I think fixing the documentation makes most sense.

You can actually already stop any word which stems the same way as a stopword by providing a stopper which stems its input before checking it against a list of stems of stopwords, but we could perhaps provide a mode (or a special Stopper subclass) to streamline this, if it's actually a sensible thing to be doing.
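A stopper which stems its input could look roughly like the sketch below. This is a standalone model so that it compiles without Xapian: with the real library the same logic would presumably go in operator() of a Xapian::Stopper subclass, using a Xapian::Stem as the stemmer. The class name StemmingStopper and the toy stemmer are hypothetical.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Standalone model of a stopper that stems its input before checking it.
class StemmingStopper {
  public:
    using StemFn = std::function<std::string(const std::string&)>;

    explicit StemmingStopper(StemFn stem) : stem_(std::move(stem)) {}

    // Store the stem of each stopword rather than the word itself.
    void add(const std::string& stopword) { stop_stems_.insert(stem_(stopword)); }

    // True if the word stems to the same string as some stopword.
    bool operator()(const std::string& word) const {
        return stop_stems_.count(stem_(word)) > 0;
    }

  private:
    StemFn stem_;
    std::set<std::string> stop_stems_;
};

// Hypothetical stemmer covering only the words used in this ticket.
std::string toy_french_stem(const std::string& word) {
    if (word == "lea") return "le";
    if (word == "voiture") return "voitur";
    return word;
}
```

After `StemmingStopper stopper(toy_french_stem); stopper.add("le");`, both `stopper("le")` and `stopper("lea")` return true, which is exactly the behaviour whose usefulness is questioned above.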

Anyway, I'm afraid the patch in that PR isn't an appropriate change.

Last edited 7 years ago by Olly Betts

comment:6 by Olly Betts, 7 years ago

Milestone: 1.4.5

I've fixed the API docs in master in [c6a10416378f0d50cfcf81ca259e2b861e0f4fe4/git] and backported for 1.4.5 in [cddaec720442b602184747c6ca204fe3b6a7cba1/git].

comment:7 by mgautier, 7 years ago

Thanks for your explanation, Olly.

But I'm not sure how to resolve my issue. For now, we are using the STEM_ALL strategy for both QueryParser and TermGenerator, and STOP_ALL for TermGenerator (it seems that QueryParser has no configurable stop strategy).

We use STEM_ALL because we want all words to be stemmed. But we want to have "Lea" indexed. Since "Lea" is not a stopword, it should be indexed whatever the stop strategy is.

I'm not sure what the 'Z' prefix is for, but it seems that we should use the STEM_ALL_Z strategy (for both QueryParser and TermGenerator). This way, "le" (query) would not match "le" (stemmed term). Am I right?

comment:8 by Olly Betts, 7 years ago

Assuming you set the same stopper at query time, this should just work with those settings - testing with a slightly patched version of examples/quest.cc I get:

$ git diff
diff --git a/xapian-core/examples/quest.cc b/xapian-core/examples/quest.cc
index 9c199c340d5f..d79a7bcf414d 100644
--- a/xapian-core/examples/quest.cc
+++ b/xapian-core/examples/quest.cc
@@ -37,15 +37,7 @@ using namespace std;
 
 // Stopwords:
 static const char * sw[] = {
-    "a", "about", "an", "and", "are", "as", "at",
-    "be", "by",
-    "en",
-    "for", "from",
-    "how",
-    "i", "in", "is", "it",
-    "of", "on", "or",
-    "that", "the", "this", "to",
-    "was", "what", "when", "where", "which", "who", "why", "will", "with"
+    "la", "le"
 };
 
 struct qp_flag { const char * s; unsigned f; };
@@ -362,7 +354,7 @@ try {
 
     parser.set_database(db);
     parser.set_stemmer(stemmer);
-    parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
+    parser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
     parser.set_stopper(&mystopper);
 
     Xapian::Query query = parser.parse_query(argv[optind], flags);
$ examples/quest --stemmer fr 'le camion'
Parsed Query: Query(camion@2)
No database specified so not running the query.
$ examples/quest --stemmer fr 'lea seydoux'
Parsed Query: Query((le@1 OR seydoux@2))
No database specified so not running the query.

So le is handled as a stopword, and lea is stemmed to le and included in the search.

There's one slight wrinkle, which is that for terms where search-time stopwording is suppressed (e.g. because le is used in a phrase, or an individual le is quoted, or when the query is entirely composed of stopwords) then le in the query won't be stopped and will match lea in the document, e.g.:

$ examples/quest --stemmer fr '"le voiture"'
Parsed Query: Query((le@1 PHRASE 2 voitur@2))
No database specified so not running the query.
$ examples/quest --stemmer fr '"le" voiture'
Parsed Query: Query((le@1 OR voitur@2))
No database specified so not running the query.
$ examples/quest --stemmer fr 'le la'
Parsed Query: Query((le@1 OR la@2))
No database specified so not running the query.

I think to handle such cases, we'd probably need to explicitly teach QueryParser about the different stop strategies. If we're removing stopwords at index time, then in cases like the above it could do something more appropriate.
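The query-side behaviour seen in the quest output above can be modelled with a small standalone sketch. This is not QueryParser's implementation: the function names and the toy stemmer are assumptions, and phrase and quoting handling is left out. It only shows that the stopper is checked against the unstemmed word and that stopping is suppressed when every word is a stopword.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stemmer covering only the words used in this ticket.
std::string toy_french_stem(const std::string& word) {
    if (word == "lea") return "le";
    if (word == "voiture") return "voitur";
    return word;
}

// Toy model of query-time stopping: drop words the stopper matches (in
// unstemmed form), then stem what is left.
std::vector<std::string> query_terms(const std::vector<std::string>& words,
                                     const std::set<std::string>& stopwords) {
    std::vector<std::string> kept;
    for (const auto& w : words)
        if (stopwords.count(w) == 0) kept.push_back(toy_french_stem(w));
    if (!kept.empty()) return kept;
    // Every word was a stopword: keep them all rather than search for nothing.
    for (const auto& w : words) kept.push_back(toy_french_stem(w));
    return kept;
}
```

With stopwords "la" and "le", the words of 'le camion' reduce to "camion", 'lea seydoux' gives "le" and "seydoux", and 'le la' keeps both terms, mirroring the three parsed queries shown above.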

You could switch to STEM_SOME - that's the default mode of operation, and allows for exact matching of words and exact phrase searches, which you can't achieve if you only index the stemmed forms. The downside is that the database will be larger.

comment:9 by Olly Betts, 7 years ago

Backported the documentation fix for 1.4.5 in [cddaec720442b602184747c6ca204fe3b6a7cba1/git].

comment:10 by Olly Betts, 7 years ago

Milestone: 1.4.5 → 1.4.x
Summary: TermGenerator do not stop stemmed term. → Teach QueryParser about stopping strategies

I think the documentation fix addresses this as far as it will be addressed for 1.4.5.

However, it'd be good to address the wrinkle I mentioned in comment:8 above:

I think to handle such cases, we'd probably need to explicitly teach QueryParser about the different stop strategies. If we're removing stopwords at index time, then in cases like the above it could do something more appropriate.

But that's a longer term thing, so adjusting milestone.

comment:11 by Olly Betts, 4 months ago

Milestone: 1.4.x → 1.5.0
Owner: changed from Gaurav Arora to Olly Betts
Priority: normal → highest
Status: new → assigned

It'd be good to address this, though the cases from comment:8 are never going to be handled entirely satisfactorily due to inherent limitations with index-time stopping:

$ examples/quest --stemmer fr '"le voiture"'
Parsed Query: Query((le@1 PHRASE 2 voitur@2))

If le is a stopword and removed before indexing then the best we can do here would be Query(voitur@2) which will match voiture without le in front.

$ examples/quest --stemmer fr '"le" voiture'
Parsed Query: Query((le@1 OR voitur@2))

Same here.

$ examples/quest --stemmer fr 'le la'
Parsed Query: Query((le@1 OR la@2))

And here we can't do a search at all as the query is entirely made of stopwords. This particular case probably isn't a useful search, but it's possible to come up with queries entirely composed of stopwords. In English the Shakespeare quote "to be or not to be" is one example.
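One possible shape for the strategy-aware handling discussed above is sketched below. This is an assumed design, not existing QueryParser behaviour: the function name, the boolean flag and the toy stemmer are all hypothetical. The idea is that if stopwords were removed at index time they can never match, so they are dropped even inside phrases or quotes, and an empty result tells the caller that no search is possible.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical stemmer covering only the words used in this ticket.
std::string toy_french_stem(const std::string& word) {
    if (word == "lea") return "le";
    if (word == "voiture") return "voitur";
    return word;
}

// Returns the stemmed terms worth searching for. When the index was built
// with stopwords removed, stopwords are dropped unconditionally; an empty
// result means the query consisted only of stopwords.
std::vector<std::string> searchable_terms(const std::vector<std::string>& words,
                                          const std::set<std::string>& stopwords,
                                          bool stopped_at_index_time) {
    std::vector<std::string> terms;
    for (const auto& w : words) {
        if (stopped_at_index_time && stopwords.count(w) > 0) continue;
        terms.push_back(toy_french_stem(w));
    }
    return terms;
}
```

Under this sketch the phrase "le voiture" reduces to the single term "voitur" when stopwords were removed at index time, and 'le la' yields nothing to search for, which matches the limitations listed above.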

I'm going to mark this to do for 1.5.0 for now, but if it proves more involved than I hope it'll likely get postponed as we really need to actually get a new stable release series out (and this could be backported to a stable release).
