#812 new defect

Stemming of proper nouns

Reported by:	Olly Betts	Owned by:	Olly Betts
Priority:	normal	Milestone:	1.5.0
Component:	QueryParser	Version:	git master
Severity:	normal	Keywords:
Cc:		Blocked By:
Blocking:		Operating System:	All

Description

Currently the QueryParser suppresses stemming for words with an initial capital, on the assumption that these are proper nouns, for which stemming can be unhelpful.
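To make the heuristic concrete, here is a simplified sketch of the behaviour described above. It is not the actual QueryParser code: `toy_stem` and `term_for_query_word` are hypothetical names, and the "stemmer" is a deliberately crude stand-in that only strips a trailing "s".

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    lowered = word.lower()
    if lowered.endswith("s") and len(lowered) > 3:
        return lowered[:-1]
    return lowered


def term_for_query_word(word: str, stem=toy_stem) -> str:
    """Mimic the heuristic: words with an initial capital are not stemmed."""
    if word[:1].isupper():
        # Assumed to be a proper noun, so it is only case-folded.
        return word.lower()
    return stem(word)
```

With this sketch, "cats" is stemmed but "Cats" is not, which is the same mechanism that would leave every capitalised German noun unstemmed.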

However, that's a bit English-centric - e.g. nouns in German always have an initial capital, and names are inflected in some languages (Russian and Czech are two I'm aware of but there are likely others).

We could scrap this special handling completely, but it seems useful for some languages. Perhaps each stemmer should be able to report whether it's desirable to do this or not?

This issue is present in 1.4.x, but so far nobody has actually complained about it. I therefore think we should decide how best to address it without being constrained by backportability, and then look at backporting once it has been addressed.

Change History (2)

comment:1 by Olly Betts, 20 months ago

Turkish too, if apostrophe is treated as a word character (like we currently do) - e.g. Türkiye'dir ("it is Turkey").

comment:2 by Olly Betts, 12 months ago

Breakdown by currently supported stemming language (plus Czech, which we don't currently have a stemmer for, but there's one in progress upstream):

Should stem with initial capital

German2
German
Russian
Turkish
Czech

Shouldn't stem with initial capital

Earlyenglish
English
French
Lovins
Porter

Alphabet doesn't have upper case

Arabic
Tamil

Need to determine

Armenian (alphabet has case)
Basque
Catalan
Danish
Dutch
Finnish
Hungarian
Indonesian
Irish
Italian
Kraaij_pohlmann (Dutch)
Lithuanian
Nepali
Norwegian
Portuguese
Romanian
Spanish
Swedish
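One way the breakdown above could feed a per-language decision is sketched below. This is an illustration only, not a proposed Xapian API: `CapitalPolicy`, `POLICY_BY_LANGUAGE` and `suppress_stemming` are hypothetical names, and treating the "need to determine" languages (and any unknown language) like the current behaviour is an assumption made here, not something the ticket decides.

```python
from enum import Enum


class CapitalPolicy(Enum):
    STEM = "stem"                    # should stem with initial capital
    DONT_STEM = "dont_stem"          # shouldn't stem with initial capital
    NO_UPPER_CASE = "no_upper_case"  # alphabet doesn't have upper case
    UNDECIDED = "undecided"          # need to determine


# Language names lowercased from the lists in this comment.
_GROUPS = {
    CapitalPolicy.STEM: ["german2", "german", "russian", "turkish", "czech"],
    CapitalPolicy.DONT_STEM: [
        "earlyenglish", "english", "french", "lovins", "porter",
    ],
    CapitalPolicy.NO_UPPER_CASE: ["arabic", "tamil"],
    CapitalPolicy.UNDECIDED: [
        "armenian", "basque", "catalan", "danish", "dutch", "finnish",
        "hungarian", "indonesian", "irish", "italian", "kraaij_pohlmann",
        "lithuanian", "nepali", "norwegian", "portuguese", "romanian",
        "spanish", "swedish",
    ],
}

POLICY_BY_LANGUAGE = {
    language: policy
    for policy, languages in _GROUPS.items()
    for language in languages
}


def suppress_stemming(word: str, language: str) -> bool:
    """Return True if `word` should be left unstemmed for `language`."""
    policy = POLICY_BY_LANGUAGE.get(language.lower(), CapitalPolicy.UNDECIDED)
    if policy in (CapitalPolicy.STEM, CapitalPolicy.NO_UPPER_CASE):
        return False
    # Assumption: undecided and unknown languages keep the current heuristic.
    return word[:1].isupper()
```

Keeping the undecided group explicit, rather than folding it into a default, makes it easy to move a language into one of the decided groups once its behaviour has been determined.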
