Opened 4 years ago

Last modified 4 days ago

#812 new defect

Stemming of proper nouns

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 1.5.0
Component: QueryParser
Version: git master
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Currently the QueryParser suppresses stemming for words with an initial capital, on the assumption that these are proper nouns, for which stemming can be unhelpful.

However, that's a bit English-centric: in German all nouns have an initial capital, not just proper nouns, and names are inflected in some languages (Russian and Czech are two I'm aware of, but there are likely others).

We could scrap this special handling completely, but it seems useful for some languages. Perhaps each stemmer should be able to report whether suppressing stemming for capitalised words is desirable for its language?
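As a rough illustration of the "each stemmer reports it" idea, here is a minimal standalone sketch. It is not Xapian's API: the `Stemmer` class, the `skip_capitalised` flag, `strip_suffix()` and `term_for()` are all hypothetical names made up for this example, and the suffix stripping is a toy stand-in for a real stemming algorithm. Term prefixes and the rest of query parsing are ignored.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stemmer:
    """Toy stand-in for a stemmer (hypothetical, not Xapian's Stem class)."""
    name: str
    stem: Callable[[str], str]
    # Hypothetical flag: should words with an initial capital be left unstemmed?
    skip_capitalised: bool


def strip_suffix(*suffixes: str) -> Callable[[str], str]:
    """Build a crude suffix stripper, standing in for a real algorithm."""
    def stem(word: str) -> str:
        for suffix in suffixes:
            # Only strip when a reasonably long stem would remain.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word
    return stem


def term_for(word: str, stemmer: Stemmer) -> str:
    """Return the term a parser might generate for one query word."""
    if stemmer.skip_capitalised and word[:1].isupper():
        # Treated as a likely proper noun: lower-case it but do not stem.
        return word.lower()
    return stemmer.stem(word.lower())


# Assumed settings for illustration only.
english = Stemmer("english", strip_suffix("s"), skip_capitalised=True)
german = Stemmer("german", strip_suffix("en", "e"), skip_capitalised=False)
```

With these toy stemmers, `term_for("Brooks", english)` leaves the word unstemmed while `term_for("Katzen", german)` is stemmed despite the capital, so the parser no longer needs a single hard-coded rule.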

This issue is present in 1.4.x, but so far nobody has actually complained about it. I therefore think we should decide how best to address it without being constrained by backportability, and then look at backporting once it is fixed.

Change History (2)

comment:1 by Olly Betts, 21 months ago

Turkish too, if apostrophe is treated as a word character (as we currently do) - e.g. Türkiye'dir ("it is Turkey").

comment:2 by Olly Betts, 14 months ago

Breakdown for languages which are currently supported or in the process of being reviewed for Snowball:

Should stem with initial capital

  • Czech
  • German
  • Russian
  • Turkish
  • Ukrainian

Should not stem with initial capital

  • English (English, Earlyenglish, Lovins, Porter)
  • French
  • Spanish

Alphabet doesn't have upper case

  • Arabic
  • Farsi/Persian
  • Nepali
  • Tamil

Need to determine

  • Armenian (alphabet has case)
  • Basque
  • Catalan
  • Danish
  • Dutch (Kraaij_pohlmann, Porter)
  • Finnish
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Lithuanian
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Swedish
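The breakdown above could be captured as a lookup table along these lines. This is only a sketch under assumptions: the lower-cased language keys, the `CapitalPolicy` enum and both function names are invented for illustration, and falling back to the current behaviour (no stemming of capitalised words) for anything not yet classified is one possible choice, not a decision.

```python
from enum import Enum


class CapitalPolicy(Enum):
    """Hypothetical classification mirroring the groups in the breakdown."""
    STEM = "should stem with initial capital"
    DONT_STEM = "should not stem with initial capital"
    NO_CASE = "alphabet doesn't have upper case"
    UNDETERMINED = "need to determine"


# Language names are taken from the breakdown and lower-cased;
# the keys themselves are an assumption of this sketch.
_GROUPS = {
    CapitalPolicy.STEM: ["czech", "german", "russian", "turkish", "ukrainian"],
    CapitalPolicy.DONT_STEM: ["english", "french", "spanish"],
    CapitalPolicy.NO_CASE: ["arabic", "farsi", "persian", "nepali", "tamil"],
}

# Invert the groups into a language -> policy mapping.
POLICY = {lang: policy for policy, langs in _GROUPS.items() for lang in langs}


def capital_policy(language: str) -> CapitalPolicy:
    """Look up a language; anything unlisted counts as undetermined."""
    return POLICY.get(language.lower(), CapitalPolicy.UNDETERMINED)


def should_stem_capitalised(language: str) -> bool:
    """Stem capitalised words only where the breakdown says we should.

    Assumption: undetermined languages keep the current behaviour, and
    for caseless alphabets the answer does not matter for native words.
    """
    return capital_policy(language) is CapitalPolicy.STEM
```

For example, `should_stem_capitalised("German")` is true, while `capital_policy("Polish")` stays `UNDETERMINED` until someone who knows the language weighs in.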