Opened 4 years ago
Closed 2 weeks ago
#812 closed defect (fixed)
Stemming of proper nouns
| Reported by: | Olly Betts | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 2.0.0 |
| Component: | QueryParser | Version: | git master |
| Severity: | normal | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
Currently the QueryParser suppresses stemming for words with an initial capital, with the assumption that these are proper nouns where stemming can be unhelpful.
However, that's a bit English-centric - e.g. nouns in German always have an initial capital, and names are inflected in some languages (Russian and Czech are two I'm aware of but there are likely others).
We could scrap this special handling completely, but it seems useful for some languages. Perhaps each stemmer should be able to report whether it's desirable to do this or not?
This issue is present in 1.4.x, but so far nobody has actually complained about it. Therefore I think we should first decide how best to address it without the restriction of backportability, then look at backporting once we have addressed it.
Change History (7)
comment:1 by , 2 years ago
comment:2 by , 18 months ago
Breakdown for languages which are currently supported or in the process of being reviewed for Snowball:
Should stem with initial capital
- Armenian (alphabet has case)
- Basque (source: https://en.wikipedia.org/wiki/Basque_language)
- Czech
- Finnish
- German
- Hungarian
- Irish (source: https://en.wikipedia.org/wiki/Irish_declension#Vocative though that example doesn't seem to be handled by the Snowball Irish stemmer)
- Lithuanian
- Polish
- Romanian
- Russian
- Turkish
- Ukrainian
Should not stem with initial capital
- Catalan (no evidence of declension for proper nouns in Wikipedia AFAICS; https://en.wiktionary.org/wiki/Category:Catalan_proper_noun_forms lists two words but neither seems relevant - plural of Sicily? titan without a capital)
- Danish (-s indicating possessive: https://en.wikipedia.org/wiki/Danish_grammar#Articles)
- Dutch (Kraaij_pohlmann, Porter) (has genitive-s: https://en.wikipedia.org/wiki/Dutch_language#Genders_and_cases)
- English (English, Earlyenglish, Lovins, Porter)
- French
- Indonesian (no evidence of declension for proper nouns in Wikipedia or wiktionary AFAICS)
- Italian
- Norwegian (has genitive-s, e.g. "Sondres flotte bil ('Sondre's nice car', Sondre being a personal name)")
- Portuguese (no evidence of declension for proper nouns in Wikipedia AFAICS)
- Spanish
- Swedish (has genitive-s, e.g. Köpenhamn -> genitive Köpenhamns)
Alphabet doesn't have upper case
- Arabic
- Farsi/Persian
- Nepali
- Tamil
Need to determine
follow-up: 5 comment:4 by , 2 months ago
I've noticed there's an unhelpful interaction here with handling of apostrophe (#609) for some languages, such as English and Danish where 's is suffixed to indicate possession (for Danish, only for a proper noun - for other nouns just s is suffixed).
We want to include apostrophe as a word character, but doing so hurts here. We could perhaps have more nuanced handling of apostrophe, but it seems tricky to define rules for.
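One reading of the interaction described above, as a minimal self-contained sketch: when apostrophe counts as a word character, a possessive such as Sondre's is a single capitalised word, so the capital-letter heuristic leaves it unstemmed and the 's is never removed. The tokenising regex, `toy_stem` and `query_terms` are all hypothetical stand-ins for illustration, not Xapian's or Snowball's actual code.

```python
import re

# Hypothetical tokeniser: runs of letters, with apostrophes allowed inside a word.
WORD_WITH_APOSTROPHE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def toy_stem(word):
    """Toy stand-in for a real stemmer: only strips a possessive 's."""
    return word[:-2] if word.endswith("'s") else word


def query_terms(text, skip_capitalised=True):
    """Return the term chosen for each word in text.

    Stemmed terms get a Z prefix, as in the examples elsewhere in this ticket.
    """
    terms = []
    for word in WORD_WITH_APOSTROPHE.findall(text):
        lowered = word.lower()
        if skip_capitalised and word[0].isupper():
            # Heuristic applies: the word is used unstemmed, 's and all.
            terms.append(lowered)
        else:
            terms.append("Z" + toy_stem(lowered))
    return terms


print(query_terms("Sondre's car"))
print(query_terms("Sondre's car", skip_capitalised=False))
```

With the heuristic on, the possessive form ends up as its own unstemmed term, so under these assumptions it would not match documents indexed under the bare name.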
Maybe we should have flags to select the handling of capitalised words in QueryParser - we could have "no stem" (1.4.x behaviour), "stem" (don't treat them specially) and "auto" (treat them specially for some languages, which could work by asking the stemmer what to do).
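A rough sketch of how the three proposed modes could fit together, with "auto" deferring to the stemmer. Everything here is hypothetical: `CapitalMode`, `ToyStemmer` and its `skip_capitalised()` method are illustrative names only and are not part of any existing Xapian or Snowball API.

```python
from dataclasses import dataclass
from enum import Enum


class CapitalMode(Enum):
    NO_STEM = "no stem"  # 1.4.x behaviour: never stem capitalised words
    STEM = "stem"        # don't treat capitalised words specially
    AUTO = "auto"        # ask the stemmer what to do


@dataclass(frozen=True)
class ToyStemmer:
    """Hypothetical stemmer that can report its preference."""
    language: str
    capitals_mark_proper_nouns: bool

    def skip_capitalised(self):
        return self.capitals_mark_proper_nouns


def should_stem(word, stemmer, mode):
    """Decide whether word should be stemmed under the given mode."""
    if not word[:1].isupper():
        return True  # the modes only affect words with an initial capital
    if mode is CapitalMode.NO_STEM:
        return False
    if mode is CapitalMode.STEM:
        return True
    return not stemmer.skip_capitalised()


english = ToyStemmer("english", capitals_mark_proper_nouns=True)
german = ToyStemmer("german", capitals_mark_proper_nouns=False)
print(should_stem("London", english, CapitalMode.AUTO))
print(should_stem("Häuser", german, CapitalMode.AUTO))
```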
comment:5 by , 3 weeks ago
Replying to Olly Betts:
> Maybe we should have flags to select the handling of capitalised words in QueryParser - we could have "no stem" (1.4.x behaviour), "stem" (don't treat them specially) and "auto" (treat them specially for some languages, which could work by asking the stemmer what to do).
We already have stem_strategy STEM_SOME which selects the current "don't stem if capitalised", so maybe it would be better to handle this via that.
Either STEM_SOME could be made smart (optionally with a new STEM_SOME_FOR_ALL_LANGUAGES), or we could add a new STEM_SOME_FOR_SOME_LANGUAGES (and probably make that the default).
I think I'm currently leaning towards just enhancing STEM_SOME, which does argue for making this change in 2.0.0. That also wouldn't preclude adding a new option to give the 1.4.x STEM_SOME behaviour later.
comment:6 by , 3 weeks ago
Having thought about the API for this some more, it's perhaps better handled separately to STEM_SOME.
The stem strategy is really about what terms are available, and applies at both index and search time. E.g. in English the word stemming would give the following terms:
| stem_strategy | terms |
|---|---|
| STEM_NONE | stemming |
| STEM_SOME | Zstem stemming |
| STEM_ALL | stem |
| STEM_ALL_Z | Zstem |
(STEM_SOME_FULL_POS gives the same terms as STEM_SOME but both have positional information so terms in positional operators can be stemmed e.g. stemming NEAR searching -> Query((Zstem@1 NEAR 11 Zsearch@2)) rather than Query((stemming@1 NEAR 11 searching@2)).)
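The mapping from strategy to available terms can be sketched as below. This is an illustration only: the strategy names are taken from this ticket, but `toy_stem` is a hypothetical stand-in that handles just enough of English to reproduce the stemming example, and `terms_for` is not a real Xapian function.

```python
def toy_stem(word):
    """Toy stand-in for the English stemmer: strip -ing, undouble a final consonant."""
    if word.endswith("ing"):
        base = word[:-3]
        if len(base) >= 2 and base[-1] == base[-2]:
            base = base[:-1]
        return base
    return word


def terms_for(word, strategy):
    """Return the terms made available for word under each stem strategy."""
    stem = toy_stem(word)
    if strategy == "STEM_NONE":
        return [word]
    if strategy in ("STEM_SOME", "STEM_SOME_FULL_POS"):
        # Both the Z-prefixed stemmed form and the unstemmed form are available.
        return ["Z" + stem, word]
    if strategy == "STEM_ALL":
        return [stem]
    if strategy == "STEM_ALL_Z":
        return ["Z" + stem]
    raise ValueError("unknown strategy: " + strategy)


for name in ("STEM_NONE", "STEM_SOME", "STEM_ALL", "STEM_ALL_Z"):
    print(name, terms_for("stemming", name))
```

Only the STEM_SOME variants yield two candidate terms per word, which is where a choice between stemmed and unstemmed can arise at all.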
However whether capitalised words are stemmed is not a different strategy for what terms are available, but a question of which available term to use.
Also, it's relevant whenever both stemmed and unstemmed terms are available, i.e. with STEM_SOME and STEM_SOME_FULL_POS, so if we tried to make it a new strategy it'd need variants of both. Neither of those variants would really make sense for TermGenerator though.
So perhaps this should be handled via QueryParser flags.
comment:7 by , 2 weeks ago
| Resolution: | → fixed |
|---|---|
| Status: | new → closed |
Addressed by 3542c89387d58c5f0b22dd43319730ab6e351538.
We now only treat an initial capital letter specially for some languages.
There's also a new QueryParser flag (FLAG_NO_PROPER_NOUN_HEURISTIC) to turn this special handling off completely.
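As a simplified model of the resulting behaviour (not the code from the commit): the flag name below comes from this ticket, but its bit value is made up, the language set is an assumption copied from the "Should not stem with initial capital" list in comment:2, and `keeps_capitalised_unstemmed` is a hypothetical helper.

```python
# Hypothetical bit value; only the flag's name comes from this ticket.
FLAG_NO_PROPER_NOUN_HEURISTIC = 1 << 0

# Assumption: languages for which an initial capital is treated specially,
# taken from the "Should not stem with initial capital" list in comment:2.
HEURISTIC_LANGUAGES = frozenset({
    "catalan", "danish", "dutch", "english", "french", "indonesian",
    "italian", "norwegian", "portuguese", "spanish", "swedish",
})


def keeps_capitalised_unstemmed(word, language, flags=0):
    """Return True if word would be left unstemmed by the capital-letter heuristic."""
    if flags & FLAG_NO_PROPER_NOUN_HEURISTIC:
        return False  # heuristic switched off completely
    return word[:1].isupper() and language in HEURISTIC_LANGUAGES


print(keeps_capitalised_unstemmed("London", "english"))
print(keeps_capitalised_unstemmed("London", "english", FLAG_NO_PROPER_NOUN_HEURISTIC))
print(keeps_capitalised_unstemmed("Praha", "czech"))
```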

Turkish too, if apostrophe is treated as a word character (like we currently do) - e.g. Türkiye'dir ("it is Turkey").