Opened 3 years ago
Last modified 4 months ago
#812 new defect
Stemming of proper nouns
Reported by: | Olly Betts | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.5.0 |
Component: | QueryParser | Version: | git master |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description
Currently the QueryParser suppresses stemming for words with an initial capital, with the assumption that these are proper nouns where stemming can be unhelpful.
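As a rough illustration of the behaviour described above (a sketch, not Xapian's actual code): `toy_stem` is a hypothetical stand-in for a real stemmer, `term_for_word` is an invented name, and lowercasing the result is a simplification.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def term_for_word(word: str) -> str:
    """Return a search term for a query word.

    Words with an initial capital are assumed to be proper nouns and
    are left unstemmed; everything else goes through the stemmer.
    """
    if word[:1].isupper():
        return word.lower()          # capitalised: skip stemming
    return toy_stem(word.lower())    # otherwise: stem as usual
```

With this toy stemmer, `term_for_word("wells")` is stemmed to `"well"`, while `term_for_word("Wells")` stays as `"wells"`.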
However, that's a bit English-centric - e.g. nouns in German always have an initial capital, and names are inflected in some languages (Russian and Czech are two I'm aware of but there are likely others).
We could scrap this special handling completely, but it seems useful for some languages. Perhaps each stemmer should be able to report whether it's desirable to do this or not?
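One possible shape for "each stemmer reports whether it's desirable" is sketched below. This is only an assumption about how it might look: `StemmerInfo`, `stem_capitalised` and `make_term` are hypothetical names, not part of any existing API, and `toy_stem` is a stand-in for a real stemming algorithm.

```python
from dataclasses import dataclass
from typing import Callable


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: strips one trailing 's' or 'n'."""
    return word[:-1] if len(word) > 3 and word.endswith(("s", "n")) else word


@dataclass(frozen=True)
class StemmerInfo:
    """A stemmer plus a (hypothetical) flag it reports about itself."""
    stem: Callable[[str], str]
    stem_capitalised: bool  # True if capitalised words should still be stemmed


def make_term(word: str, stemmer: StemmerInfo) -> str:
    """Stem `word` unless it is capitalised and the stemmer opts out."""
    lowered = word.lower()
    if word[:1].isupper() and not stemmer.stem_capitalised:
        return lowered
    return stemmer.stem(lowered)


# Illustrative instances only; the flags follow the reasoning in this ticket.
english_like = StemmerInfo(stem=toy_stem, stem_capitalised=False)
german_like = StemmerInfo(stem=toy_stem, stem_capitalised=True)
```

The parser would then consult the flag rather than hard-coding the capitalisation rule, so a German-style stemmer still stems "Katzen" while an English-style one leaves it alone.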
This issue is present in 1.4.x, but so far nobody has actually complained about it. I therefore think we should decide how best to address it without the constraint of backportability, and then consider backporting once it has been addressed.
Change History (2)
comment:1 by , 11 months ago
comment:2 by , 4 months ago
Breakdown by language for which we currently support stemming (plus Czech, which we don't currently have a stemmer for, but there's one in progress upstream):
Should stem with initial capital
- German2
- German
- Russian
- Turkish
- Czech
Shouldn't stem with initial capital
- Earlyenglish
- English
- French
- Lovins
- Porter
Alphabet doesn't have upper case
- Arabic
- Tamil
Need to determine
- Armenian (alphabet has case)
- Basque
- Catalan
- Danish
- Dutch
- Finnish
- Hungarian
- Indonesian
- Irish
- Italian
- Kraaij_pohlmann (Dutch)
- Lithuanian
- Nepali
- Norwegian
- Portuguese
- Romanian
- Spanish
- Swedish
Turkish too, if apostrophe is treated as a word character (as we currently do), e.g. Türkiye'dir ("it is Turkey").
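The breakdown above could be recorded as a simple lookup table, sketched here. The names `CapitalPolicy`, `POLICY` and `stems_capitalised` are hypothetical, and falling back to the current behaviour (no stemming) for undetermined languages is an assumption, not a decision made in this ticket.

```python
from enum import Enum


class CapitalPolicy(Enum):
    STEM = "should stem with initial capital"
    DONT_STEM = "shouldn't stem with initial capital"
    NO_CASE = "alphabet doesn't have upper case"
    UNDETERMINED = "need to determine"


# Only the languages already classified in this ticket are listed;
# anything else is treated as UNDETERMINED.
POLICY = {
    **dict.fromkeys(
        ["german2", "german", "russian", "turkish", "czech"],
        CapitalPolicy.STEM,
    ),
    **dict.fromkeys(
        ["earlyenglish", "english", "french", "lovins", "porter"],
        CapitalPolicy.DONT_STEM,
    ),
    **dict.fromkeys(["arabic", "tamil"], CapitalPolicy.NO_CASE),
}


def policy_for(language: str) -> CapitalPolicy:
    """Look up the policy for a stemmer name (case-insensitive)."""
    return POLICY.get(language.lower(), CapitalPolicy.UNDETERMINED)


def stems_capitalised(language: str) -> bool:
    """True only where the ticket says capitalised words should be stemmed.

    Assumption: undetermined languages keep the current behaviour.
    For NO_CASE languages the question never arises in practice.
    """
    return policy_for(language) is CapitalPolicy.STEM
```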