#699 new defect

Better tokenisation of mixed CJK numbers

Reported by:	Olly Betts	Owned by:	Olly Betts
Priority:	normal	Milestone:
Component:	QueryParser	Version:	git master
Severity:	normal	Keywords:	GoodFirstBug
Cc:		Blocked By:
Blocking:		Operating System:	All

Description

From comment:28:ticket:180:

Dai Youli noted on IRC that mixed numbers like 2千3百 (two thousand three hundred) get indexed as four separate terms. That's not terrible, since the same splitting happens at search time, but it's not ideal either: a search for 2千3百 would also match 3千2百, as well as documents containing those characters nowhere near each other.

Perhaps digits that appear among CJK characters should be included in the span of text passed for n-gramming.
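
To make the difference concrete, here is a minimal Python sketch of the idea. It is not Xapian's tokeniser: the names `SPAN_CJK_ONLY`, `SPAN_WITH_DIGITS`, `ngrams` and `terms` are hypothetical, "CJK" is assumed to mean only the U+4E00 to U+9FFF block, "digits" means ASCII 0-9, and only bigrams are generated. It just contrasts splitting at digits with letting digits join a CJK span.

```python
import re

# Assumption: "CJK character" is simplified to the CJK Unified Ideographs
# block. A real tokeniser would need to cover more ranges.
CJK = r"\u4e00-\u9fff"

# Behaviour described in the ticket: digits and CJK runs are separate spans.
SPAN_CJK_ONLY = re.compile(f"[{CJK}]+|[0-9]+")

# Suggested behaviour: a run of CJK characters and ASCII digits is one span,
# provided it contains at least one CJK character. Pure digit runs are
# still matched on their own by the second alternative.
SPAN_WITH_DIGITS = re.compile(f"[0-9{CJK}]*[{CJK}][0-9{CJK}]*|[0-9]+")


def ngrams(span, n=2):
    """Return overlapping n-grams of span, or the span itself if shorter."""
    if len(span) < n:
        return [span]
    return [span[i:i + n] for i in range(len(span) - n + 1)]


def terms(text, span_pattern):
    """Split text into spans, then n-gram only spans containing CJK."""
    result = []
    for match in span_pattern.finditer(text):
        span = match.group()
        if span.isascii():
            # Pure digit run: keep it as a single term.
            result.append(span)
        else:
            result.extend(ngrams(span))
    return result


print(terms("2千3百", SPAN_CJK_ONLY))     # four single-character terms
print(terms("2千3百", SPAN_WITH_DIGITS))  # bigrams spanning digits and CJK
print(terms("3千2百", SPAN_WITH_DIGITS))  # shares no bigram with 2千3百
```

With digits included in the span, 2千3百 and 3千2百 no longer produce any terms in common under this sketch, which is the distinction the ticket asks for. Whether single-character terms should also be emitted, and how full-width digits should be treated, is left open here.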

Change History (1)

comment:1 by Olly Betts, 8 years ago

Keywords:	GoodFirstBug added