Opened 7 years ago
Last modified 11 months ago
#750 assigned defect
Teach QueryParser about stopping strategies
Reported by: | mgautier | Owned by: | Olly Betts |
---|---|---|---|
Priority: | highest | Milestone: | 1.5.0 |
Component: | QueryParser | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | kelson@… | Blocked By: | |
Blocking: | | Operating System: | All |
Description
It seems that the TermGenerator (with the STOP_ALL and STEM_ALL strategies) does not stop stemmed terms.
For example, with the French stemmer and "le" in the stopwords, the term "lea" will be stemmed to "le" and the term "le" will be added to the document.
Looking into termgenerator_internal.cc (method index_text), it seems that the stopper is never called on the stem, whatever the flags are.
We are using version 1.4.2 but the stopper is never called on the stem even on git master.
Change History (11)
comment:1 by , 7 years ago
Cc: | added |
---|
comment:2 by , 7 years ago
comment:3 by , 7 years ago
Owner: | changed from | to
---|
I tried looking into the issue and created a pull request proposing a fix.
I have created a test as well which failed with the old version.
Check the pull request https://github.com/xapian/xapian/pull/173/ . Will review with olly and get it fixed.
comment:4 by , 7 years ago
Issues with #173 pull request. Opened a new pull request with only changes for this ticket. https://github.com/xapian/xapian/pull/174
comment:5 by , 7 years ago
The patch which exposed this functionality unfortunately mis-documented what two of the three options actually do (STOP_NONE is OK, the others aren't).
The stopper is expected to always be fed the unstemmed form (it takes a word not a stem). Passing stemmed forms to a stopper which is checking a list of words seems a bad idea. The stemmer maps words to stems, and the two are really separate spaces (in some cases, the stem happens to be the same string as one of the words which stems to it, but that doesn't mean the stemmer is mapping words to words). So for example, using the English stemmer, the word "tease" has the stem "teas". But that's nothing to do with the word "teas" (which has the stem "tea").
STOP_STEMMED is actually "check the unstemmed form with the stopper, and if it's a stop word, only index its unstemmed form" - this is a useful thing to do because it means searches for phrases which include stopwords work (the unstemmed forms are indexed with positional information).
STOP_ALL is actually "check the unstemmed form with the stopper, and if it's a stop word, skip the word". At least in English there are cases where a word has multiple meanings, and only one is really a stopword. For example, "can" would probably be on an English stopword list, because it's a form of the irregular verb meaning "to be able to". But it's also a noun (a metal container) and a different regular verb (meaning to put something in such a metal container), etc, and those words shouldn't really be stopwords. So while "cans" and "canned" also stem the same way as "can", it's unhelpful to treat them as stopwords too.
If you use the same stopper when parsing queries, this should work nicely - "can" will also be treated as a stop word in queries, but a search for "canned" will still match "canned" or "cans" in documents.
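The behaviour described above can be modelled with a small self-contained sketch. It is not Xapian's implementation: `StopMode`, `terms_for_word` and `toy_stem` are hypothetical names, the stemmer is a toy that only knows the "can" example, and the sketch assumes a STEM_SOME-like setup where both the unstemmed form and a "Z"-prefixed stem are normally indexed.

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of the three stop strategies as described in this
// comment. This is NOT Xapian source code.
enum class StopMode { None, All, Stemmed };

// Toy stand-in for a real stemmer: it only knows the "can" example.
std::string toy_stem(const std::string& word) {
    if (word == "cans" || word == "canned") return "can";
    return word;
}

// Returns the terms that would be indexed for a single word.
std::vector<std::string> terms_for_word(
    const std::string& word,
    const std::set<std::string>& stopwords,
    const std::function<std::string(const std::string&)>& stem,
    StopMode mode)
{
    // The stopper is always fed the unstemmed word, never the stem.
    const bool is_stop = mode != StopMode::None && stopwords.count(word) > 0;

    std::vector<std::string> terms;
    if (is_stop && mode == StopMode::All) {
        return terms;  // Stop word under "stop all": skip it entirely.
    }
    terms.push_back(word);  // Unstemmed form.
    if (!is_stop) {
        terms.push_back("Z" + stem(word));  // Stemmed form with "Z" prefix.
    }
    return terms;
}
```

With "can" as the only stopword, this model drops "can" under `StopMode::All` but still indexes "canned" in both forms, which is the distinction the paragraph above argues for.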
English is particularly rife with words with lots of different meanings, and I'm not sure how common this situation is in other languages, but as best I can make out your example "lea" is actually a name (https://en.wiktionary.org/wiki/L%C3%A9a) which happens to stem to the same thing as the article "le", in which case I'd argue that "lea" really shouldn't be treated as a stopword either.
Given STOP_STEMMED is the default, and before this patch it was long-established as the hard-coded behaviour when a stopper was set, changing what it means now to try to match what the current API documentation says would be unhelpful, and I think fixing the documentation makes most sense.
You can actually already stop any word which stems the same way as a stopword by providing a stopper which stems its input before checking it against a list of stems of stopwords, but we could perhaps provide a mode (or a special Stopper subclass) to streamline this, if it's actually a sensible thing to be doing.
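A minimal sketch of such a stemming stopper follows. To keep it self-contained it does not derive from Xapian's Stopper class; `StemmingStopper`, `StemFn` and `toy_french_stem` are hypothetical names, and the toy stemmer only reproduces the "lea" to "le" mapping reported in this ticket. In real use the stem function would wrap an actual stemmer.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical type: any callable that maps a word to its stem.
using StemFn = std::function<std::string(const std::string&)>;

// Toy stand-in for a French stemmer: only knows the example from this ticket.
std::string toy_french_stem(const std::string& word) {
    if (word == "lea") return "le";
    return word;
}

// Sketch of a stopper that compares stems rather than words.
// It stores the stems of the stopwords and stems each input before lookup.
class StemmingStopper {
    StemFn stem_;
    std::set<std::string> stop_stems_;

  public:
    explicit StemmingStopper(StemFn stem) : stem_(std::move(stem)) {}

    // Register a stopword; its stem is what actually gets stored.
    void add(const std::string& stopword) {
        stop_stems_.insert(stem_(stopword));
    }

    // True if the word stems the same way as any registered stopword.
    bool operator()(const std::string& word) const {
        return stop_stems_.count(stem_(word)) > 0;
    }
};
```

After `StemmingStopper s(toy_french_stem); s.add("le");`, both `s("le")` and `s("lea")` return true. That is exactly the over-stopping of the name discussed above, so whether this is desirable depends on the collection.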
Anyway, I'm afraid the patch in that PR isn't an appropriate change.
comment:6 by , 7 years ago
Milestone: | → 1.4.5 |
---|
I've fixed the API docs in master in [c6a10416378f0d50cfcf81ca259e2b861e0f4fe4/git] and backported for 1.4.5 in [cddaec720442b602184747c6ca204fe3b6a7cba1/git].
comment:7 by , 7 years ago
Thanks for your explanation, olly.
But I'm not sure how to resolve my issue.
For now, we are using the STEM_ALL strategy for QueryParser and TermGenerator, and STOP_ALL for TermGenerator (it seems that QueryParser has no configurable stop strategy).
We used STEM_ALL because we want all words to be stemmed. But we want to have "Lea" indexed. "Lea" not being a stopword, it should be indexed whatever the stop strategy is.
I'm not sure of the use of the 'Z' prefix, but it seems that we should use the STEM_ALL_Z strategy (for both QueryParser and TermGenerator). This way, "le" (query) would not match "le" (stemmed term). Am I right?
comment:8 by , 7 years ago
Assuming you set the same stopper at query time, this should just work with those settings - testing with a slightly patched version of examples/quest.cc I get:

    $ git diff
    diff --git a/xapian-core/examples/quest.cc b/xapian-core/examples/quest.cc
    index 9c199c340d5f..d79a7bcf414d 100644
    --- a/xapian-core/examples/quest.cc
    +++ b/xapian-core/examples/quest.cc
    @@ -37,15 +37,7 @@ using namespace std;
     
     // Stopwords:
     static const char * sw[] = {
    -    "a", "about", "an", "and", "are", "as", "at",
    -    "be", "by",
    -    "en",
    -    "for", "from",
    -    "how",
    -    "i", "in", "is", "it",
    -    "of", "on", "or",
    -    "that", "the", "this", "to",
    -    "was", "what", "when", "where", "which", "who", "why", "will", "with"
    +    "la", "le"
     };
     
     struct qp_flag { const char * s; unsigned f; };
    @@ -362,7 +354,7 @@ try {
     
         parser.set_database(db);
         parser.set_stemmer(stemmer);
    -    parser.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
    +    parser.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
         parser.set_stopper(&mystopper);
     
         Xapian::Query query = parser.parse_query(argv[optind], flags);
    $ examples/quest --stemmer fr 'le camion'
    Parsed Query: Query(camion@2)
    No database specified so not running the query.
    $ examples/quest --stemmer fr 'lea seydoux'
    Parsed Query: Query((le@1 OR seydoux@2))
    No database specified so not running the query.
So le is handled as a stopword, and lea is stemmed to le and included in the search.
There's one slight wrinkle, which is that for terms where search-time stopwording is suppressed (e.g. because le is used in a phrase, or an individual le is quoted, or when the query is entirely composed of stopwords) then le in the query won't be stopped and will match lea in the document, e.g.:

    $ examples/quest --stemmer fr '"le voiture"'
    Parsed Query: Query((le@1 PHRASE 2 voitur@2))
    No database specified so not running the query.
    $ examples/quest --stemmer fr '"le" voiture'
    Parsed Query: Query((le@1 OR voitur@2))
    No database specified so not running the query.
    $ examples/quest --stemmer fr 'le la'
    Parsed Query: Query((le@1 OR la@2))
    No database specified so not running the query.
I think to handle such cases, we'd probably need to explicitly teach QueryParser about the different stop strategies. If we're removing stopwords at index time, then in cases like the above it could do something more appropriate.
You could switch to STEM_SOME - that's the default mode of operation, and allows for exact matching of words and exact phrase searches, which you can't achieve if you only index the stemmed forms. The downside is that the database will be larger.
comment:9 by , 7 years ago
Backported the documentation fix for 1.4.5 in [cddaec720442b602184747c6ca204fe3b6a7cba1/git].
comment:10 by , 7 years ago
Milestone: | 1.4.5 → 1.4.x |
---|---|
Summary: | TermGenerator do not stop stemmed term. → Teach QueryParser about stopping strategies |
I think the documentation fix addresses this as far as it will be for 1.4.5.
However, it'd be good to address the wrinkle I mentioned in comment:8 above:
I think to handle such cases, we'd probably need to explicitly teach QueryParser about the different stop strategies. If we're removing stopwords at index time, then in cases like the above it could do something more appropriate.
But that's a longer term thing, so adjusting milestone.
comment:11 by , 11 months ago
Milestone: | 1.4.x → 1.5.0 |
---|---|
Owner: | changed from | to
Priority: | normal → highest |
Status: | new → assigned |
It'd be good to address this, though the cases from comment:8 are never going to be handled entirely satisfactorily due to inherent limitations with index-time stopping:

    $ examples/quest --stemmer fr '"le voiture"'
    Parsed Query: Query((le@1 PHRASE 2 voitur@2))

If le is a stopword and removed before indexing then the best we can do here would be Query(voitur@2), which will match voiture without le in front.

    $ examples/quest --stemmer fr '"le" voiture'
    Parsed Query: Query((le@1 OR voitur@2))

Same here.

    $ examples/quest --stemmer fr 'le la'
    Parsed Query: Query((le@1 OR la@2))

And here we can't do a search at all as the query is entirely made of stopwords. This particular case probably isn't a useful search, but it's possible to come up with queries entirely composed of stopwords. In English the Shakespeare quote "to be or not to be" is one example.
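One possible shape for the query-side decision is sketched below. This is only an illustration of the trade-off in the three cases above, not a proposed QueryParser design: `filter_query_words` and its `index_dropped_stopwords` flag are hypothetical, and phrases and positions are ignored so that a query is just a list of words.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: decide which query words to keep, given whether
// stopwords were removed at index time. Phrases are not modelled.
std::vector<std::string> filter_query_words(
    const std::vector<std::string>& words,
    const std::set<std::string>& stopwords,
    bool index_dropped_stopwords)
{
    std::vector<std::string> kept;
    for (const std::string& word : words) {
        if (stopwords.count(word) == 0) {
            kept.push_back(word);
        }
    }
    // If stopwords are in the index, a query made only of stopwords can
    // still be run as written. If they were dropped at index time there is
    // nothing to match, so the empty result is returned.
    if (kept.empty() && !index_dropped_stopwords) {
        return words;
    }
    return kept;
}
```

With stopwords "le" and "la" dropped at index time, the words of "le voiture" reduce to just "voiture", while "le la" reduces to an empty list, which matches the limitation described above.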
I'm going to mark this to do for 1.5.0 for now, but if it proves more involved than I hope it'll likely get postponed as we really need to actually get a new stable release series out (and this could be backported to a stable release).
Hi,
A little ping.
Does someone have the time to look at it?
Thanks.