Opened 3 years ago

Last modified 5 months ago

#812 new defect

Stemming of proper nouns

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: QueryParser Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Currently the QueryParser suppresses stemming for words with an initial capital, on the assumption that these are proper nouns, for which stemming can be unhelpful.

However, that's a bit English-centric - e.g. nouns in German always have an initial capital, and names are inflected in some languages (Russian and Czech are two I'm aware of but there are likely others).

We could scrap this special handling completely, but it seems useful for some languages. Perhaps each stemmer should be able to report whether it's desirable to do this or not?
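To make the "each stemmer reports it" idea concrete, here is a minimal sketch in Python. It is not Xapian's API: the `Stemmer` class, the `stem_capitalised` flag, `process_term` and the `strip_s` placeholder stemmer are all hypothetical names invented for illustration, and the per-language flag values are just the ones suggested in this ticket.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stemmer:
    """Toy stand-in for a stemmer; not Xapian's actual Stem class."""
    language: str
    stem: Callable[[str], str]
    # Hypothetical flag: True if capitalised words should still be stemmed.
    stem_capitalised: bool


def process_term(word: str, stemmer: Stemmer) -> str:
    """Lowercase the word, stemming it unless the capital-letter rule applies."""
    lowered = word.lower()
    if word[:1].isupper() and not stemmer.stem_capitalised:
        # Treat as a likely proper noun: keep the unstemmed form.
        return lowered
    return stemmer.stem(lowered)


def strip_s(word: str) -> str:
    """Placeholder 'stemmer' that just drops a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# Assumed flag values, following the reasoning in this ticket.
english = Stemmer("english", strip_s, stem_capitalised=False)
german = Stemmer("german", strip_s, stem_capitalised=True)
```

With this shape the query parser needs no language-specific knowledge; it only consults the flag on whichever stemmer is in use, so `process_term("Paris", english)` leaves the word unstemmed while `process_term("Autos", german)` stems it.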

This issue is present in 1.4.x, but so far nobody has actually complained about it. Therefore I think we should decide how best to address it without being constrained by backportability, and then look at backporting once we have addressed it.

Change History (2)

comment:1 by Olly Betts, 12 months ago

Turkish too, if apostrophe is treated as a word character (as we currently do) - e.g. Türkiye'dir ("it is Turkey").

comment:2 by Olly Betts, 5 months ago

Breakdown by currently supported stemming language (plus Czech, which we don't currently have a stemmer for, but there's one in progress upstream):

Should stem with initial capital

  • German2
  • German
  • Russian
  • Turkish
  • Czech

Shouldn't stem with initial capital

  • Earlyenglish
  • English
  • French
  • Lovins
  • Porter

Alphabet doesn't have upper case

  • Arabic
  • Tamil

Need to determine

  • Armenian (alphabet has case)
  • Basque
  • Catalan
  • Danish
  • Dutch
  • Finnish
  • Hungarian
  • Indonesian
  • Irish
  • Italian
  • Kraaij_pohlmann (Dutch)
  • Lithuanian
  • Nepali
  • Norwegian
  • Portuguese
  • Romanian
  • Spanish
  • Swedish
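
The breakdown above could be captured as a lookup table, sketched here in Python. The `CapitalPolicy` enum, the `POLICY` table and `policy_for` are hypothetical names, not anything in Xapian; the classification is copied from the lists in this comment, and languages still to be determined are deliberately left as `UNDETERMINED` rather than guessed.

```python
from enum import Enum


class CapitalPolicy(Enum):
    STEM = "should stem with initial capital"
    DONT_STEM = "shouldn't stem with initial capital"
    NO_CASE = "alphabet doesn't have upper case"
    UNDETERMINED = "need to determine"


def _entries(policy, names):
    """Map each language name (lowercased) to the given policy."""
    return {name.lower(): policy for name in names}


# Classification taken from the lists above; keys are lowercased names.
POLICY = {
    **_entries(CapitalPolicy.STEM,
               ["German2", "German", "Russian", "Turkish", "Czech"]),
    **_entries(CapitalPolicy.DONT_STEM,
               ["Earlyenglish", "English", "French", "Lovins", "Porter"]),
    **_entries(CapitalPolicy.NO_CASE, ["Arabic", "Tamil"]),
    **_entries(CapitalPolicy.UNDETERMINED,
               ["Armenian", "Basque", "Catalan", "Danish", "Dutch",
                "Finnish", "Hungarian", "Indonesian", "Irish", "Italian",
                "Kraaij_pohlmann", "Lithuanian", "Nepali", "Norwegian",
                "Portuguese", "Romanian", "Spanish", "Swedish"]),
}


def policy_for(language: str) -> CapitalPolicy:
    """Look up a language, treating unknown names as undetermined."""
    return POLICY.get(language.lower(), CapitalPolicy.UNDETERMINED)
```

Keeping `UNDETERMINED` as an explicit state means a caller can fall back to the existing behaviour for those languages until somebody has checked them.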