Opened 4 years ago

Last modified 21 hours ago

#812 new defect

Stemming of proper nouns

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 2.0.0
Component: QueryParser
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Currently the QueryParser suppresses stemming for words with an initial capital, with the assumption that these are proper nouns where stemming can be unhelpful.

However, that's a bit English-centric - e.g. nouns in German always have an initial capital, and names are inflected in some languages (Russian and Czech are two I'm aware of but there are likely others).

We could scrap this special handling completely, but it seems useful for some languages. Perhaps each stemmer should be able to report whether it's desirable to do this or not?
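As a rough illustration of the "each stemmer reports" idea, here is a minimal Python sketch. The `ToyStemmer` class, its `stem_capitalised` attribute, the `query_term` helper and the suffix rules are all hypothetical stand-ins for illustration; none of them is Xapian's or Snowball's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStemmer:
    """Hypothetical stand-in for a stemmer; not a real Xapian class."""
    language: str
    suffixes: tuple          # toy suffix list, tried in order
    stem_capitalised: bool   # assumed per-stemmer hint

    def __call__(self, word: str) -> str:
        lowered = word.lower()
        for suffix in self.suffixes:
            # Only strip when a stem of at least three letters remains.
            if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
                return lowered[: -len(suffix)]
        return lowered


def query_term(word: str, stemmer: ToyStemmer) -> str:
    """Return the term a parser might look up for `word`."""
    if word[:1].isupper() and not stemmer.stem_capitalised:
        # Probable proper noun for this language: lowercase only.
        return word.lower()
    return stemmer(word)


# Toy configurations: the English one opts out of stemming capitalised
# words, the German one opts in because all nouns are capitalised.
english = ToyStemmer("english", ("ing", "s"), stem_capitalised=False)
german = ToyStemmer("german", ("en", "e"), stem_capitalised=True)
```

With these toy rules, `query_term("Walks", english)` is left unstemmed while `query_term("Katzen", german)` is stemmed, which is the per-language distinction being proposed.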

This issue is present in 1.4.x, but so far nobody has actually complained about it. I therefore think we should decide how best to address it without being constrained by backportability, and then look at backporting once we have addressed it.

Change History (5)

comment:1 by Olly Betts, 2 years ago

Turkish too, if apostrophe is treated as a word character (as we currently do) - e.g. Türkiye'dir ("it is Turkey").

comment:2 by Olly Betts, 18 months ago

Breakdown for languages which are currently supported or in the process of being reviewed for Snowball:

Should stem with initial capital

Should not stem with initial capital

  • Catalan (no evidence of declension for proper nouns in Wikipedia AFAICS; https://en.wiktionary.org/wiki/Category:Catalan_proper_noun_forms lists two words but neither seems relevant - plural of Sicily? titan without a capital)
  • English (English, Earlyenglish, Lovins, Porter)
  • French
  • Indonesian (no evidence of declension for proper nouns in Wikipedia or wiktionary AFAICS)
  • Italian
  • Portuguese (no evidence of declension for proper nouns in Wikipedia AFAICS)
  • Spanish

Alphabet doesn't have upper case

  • Arabic
  • Farsi/Persian
  • Nepali
  • Tamil

Need to determine

  • Danish
  • Dutch (Kraaij_pohlmann, Porter)
  • Norwegian
  • Swedish (has genitive-s, e.g. Köpenhamn -> genitive Köpenhamns)
Last edited 21 hours ago by Olly Betts
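For reference, the breakdown above could be recorded as a simple lookup table. This is only a sketch: the `CapitalPolicy` enum, the `POLICY` table and the `capital_policy` helper are hypothetical names, language names are simplified (stemmer variants such as Porter or Lovins are folded into their language, and Farsi/Persian is keyed as "persian"), and nothing is recorded for "should stem" because no languages are listed under that heading here.

```python
from enum import Enum
from typing import Optional


class CapitalPolicy(Enum):
    NO_STEM = "should not stem with initial capital"
    CASELESS = "alphabet doesn't have upper case"


# Transcribed from the lists above; keys are lower-cased for lookup.
POLICY = {
    **dict.fromkeys(
        ["catalan", "english", "french", "indonesian",
         "italian", "portuguese", "spanish"],
        CapitalPolicy.NO_STEM),
    **dict.fromkeys(
        ["arabic", "persian", "nepali", "tamil"],
        CapitalPolicy.CASELESS),
}

# Languages listed under "Need to determine".
UNDETERMINED = {"danish", "dutch", "norwegian", "swedish"}


def capital_policy(language: str) -> Optional[CapitalPolicy]:
    """Return the recorded policy, or None if undecided or unknown."""
    return POLICY.get(language.lower())
```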

comment:3 by Olly Betts, 7 weeks ago

Milestone: 1.5.0 → 2.0.0

Milestone renamed

comment:4 by Olly Betts, 7 weeks ago

I've noticed there's an unhelpful interaction here with handling of apostrophe (#609) for some languages, such as English and Danish where 's is suffixed to indicate possession (for Danish, only for a proper noun - for other nouns just s is suffixed).

We want to include apostrophe as a word character, but doing so hurts here. We could perhaps have more nuanced handling of apostrophe, but it seems tricky to define rules for.
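To make the interaction concrete, here is a small Python sketch. It assumes a simplified tokeniser in which apostrophe is a word character, and a toy stemmer that only strips a possessive 's; both `terms` and `toy_stem` are hypothetical helpers, not Xapian code.

```python
import re

# Assumption: letters joined by apostrophes form a single word.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def toy_stem(word: str) -> str:
    """Toy stemmer: lowercase and strip a possessive 's only."""
    lowered = word.lower()
    return lowered[:-2] if lowered.endswith("'s") else lowered


def terms(text: str) -> list:
    """Split text into terms, suppressing stemming for capitalised words."""
    out = []
    for match in WORD.finditer(text):
        word = match.group()
        if word[0].isupper():
            out.append(word.lower())   # capitalised: stemming suppressed
        else:
            out.append(toy_stem(word))
    return out
```

Under these assumptions `terms("Olly's ticket")` keeps the possessive on the capitalised word, so the resulting term differs from the one produced for plain "Olly", whereas the lower-case "olly's" has the suffix removed.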

Maybe we should have flags to select the handling of capitalised words in QueryParser - we could have "no stem" (1.4.x behaviour), "stem" (don't treat them specially) and "auto" (treat them specially for some languages, which could work by asking the stemmer what to do).
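A minimal Python sketch of how those three settings might behave. The `CapitalisedMode` enum and the `should_stem` function are hypothetical names for illustration only, and the `stemmer_stems_capitalised` argument stands in for whatever answer the stemmer would give when asked.

```python
from enum import Enum, auto


class CapitalisedMode(Enum):
    NO_STEM = auto()   # 1.4.x behaviour: never stem capitalised words
    STEM = auto()      # don't treat capitalised words specially
    AUTO = auto()      # defer to the stemmer for the language in use


def should_stem(word: str, mode: CapitalisedMode,
                stemmer_stems_capitalised: bool) -> bool:
    """Decide whether `word` should be passed to the stemmer."""
    if not word[:1].isupper():
        return True    # the flag only affects capitalised words
    if mode is CapitalisedMode.STEM:
        return True
    if mode is CapitalisedMode.NO_STEM:
        return False
    return stemmer_stems_capitalised   # AUTO
```

For example, `should_stem("Katzen", CapitalisedMode.AUTO, True)` returns `True`, while the same word under `NO_STEM` is left unstemmed regardless of what the stemmer reports.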

comment:5 by Olly Betts, 22 hours ago (in reply to: comment:4)

Replying to Olly Betts:

> Maybe we should have flags to select the handling of capitalised words in QueryParser - we could have "no stem" (1.4.x behaviour), "stem" (don't treat them specially) and "auto" (treat them specially for some languages, which could work by asking the stemmer what to do).

We already have stem_strategy STEM_SOME which selects the current "don't stem if capitalised", so maybe it would be better to handle this via that.

Either STEM_SOME could be made smart (optionally with a new STEM_SOME_FOR_ALL_LANGUAGES), or we could add a new STEM_SOME_FOR_SOME_LANGUAGES (and probably make that the default).

I think I'm currently leaning towards just enhancing STEM_SOME, which does argue for making this change in 2.0.0. That also wouldn't preclude adding a new option later to give the 1.4.x STEM_SOME behaviour.
