Opened 12 years ago

Closed 5 years ago

#594 closed enhancement (incomplete)

Add support for SCWS Chinese segmentation library

Reported by: Olly Betts
Owned by: Olly Betts
Priority: normal
Milestone: 1.4.x
Component: Library API
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

Attachments (3)

xapian-scws-1.3.x-trunk.patch.txt (13.4 KB) - added by Olly Betts 12 years ago.
Updated patch from second email thread
xapian-scws-1.3.x-trunk.updated.patch (14.2 KB) - added by Olly Betts 12 years ago.
Updated patch which compiles without SCWS
xapian-scws-1.3.x-snap.patch (18.1 KB) - added by Olly Betts 10 years ago.
Newer patch from original author


Change History (7)

by Olly Betts, 12 years ago

Updated patch from second email thread

by Olly Betts, 12 years ago

Updated patch which compiles without SCWS

comment:1 by Olly Betts, 12 years ago

I've cleaned up the patch a little, and fixed some warnings so it compiles cleanly without SCWS. I've not tried it with SCWS yet.

comment:2 by Olly Betts, 10 years ago

There's a newer patch, but it isn't based on my cleaned-up version:

http://article.gmane.org/gmane.comp.search.xapian.general/9359

I'll also attach the patch file so it can't get lost.

by Olly Betts, 10 years ago

newer patch from original author

comment:3 by Olly Betts, 8 years ago

Milestone: 1.3.x → 1.4.x

Unfortunately, I think we have to bump this from this release cycle. It's really the sort of change that should go in early in the cycle so there's ample time to shake out problems, not right at the end which is where we now are.

I should have sorted this out earlier in this cycle, but frankly the updated patch which totally ignored the work I'd already done cleaning up the previous one was quite a demotivator for spending time on this.

Anyway, let's get this in early in 1.5.x (which we don't yet have a milestone for so 1.4.x for now).

comment:4 by Olly Betts, 5 years ago

Resolution: incomplete
Status: new → closed

We now have support for CJK segmentation using ICU (merged in [f881f0bd1609]).

The patches here are all sadly very outdated (mostly my fault). But I think at this point closing this ticket makes the most sense.

If SCWS does a better job for Chinese than ICU we could potentially support both (and indeed other segmentation algorithms). I think for maintainability additional alternatives would each need to be wrapped up cleanly in an iterator class in the same way that CJKNgramIterator and CJKWordIterator wrap the ngram and ICU algorithms.
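
To illustrate the kind of wrapping meant here, the following is a rough, self-contained sketch rather than Xapian's actual code. The names SegmentIterator, BigramIterator, PresegmentedIterator and collect are all hypothetical, and the real CJKNgramIterator and CJKWordIterator may well be structured differently. No SCWS or ICU calls are made: the second class simply takes words that some external segmenter is assumed to have produced already.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical common interface (not part of Xapian's API): callers only
// see "is there another token?", "give me the token" and "advance".
class SegmentIterator {
  public:
    virtual ~SegmentIterator() = default;
    virtual bool at_end() const = 0;
    virtual std::u32string token() const = 0;
    virtual void next() = 0;
};

// Simplified stand-in for an n-gram approach: overlapping bigrams over
// Unicode code points.
class BigramIterator : public SegmentIterator {
    std::u32string text;
    std::size_t pos = 0;

  public:
    explicit BigramIterator(std::u32string s) : text(std::move(s)) {}

    // Done once fewer than two code points remain at pos.
    bool at_end() const override { return pos + 2 > text.size(); }

    std::u32string token() const override {
        assert(!at_end());
        return text.substr(pos, 2);
    }

    void next() override {
        assert(!at_end());
        ++pos;
    }
};

// Stand-in for a word-based segmenter: the words are assumed to come from
// an external library (SCWS, ICU, ...) that is not called in this sketch.
class PresegmentedIterator : public SegmentIterator {
    std::vector<std::u32string> words;
    std::size_t pos = 0;

  public:
    explicit PresegmentedIterator(std::vector<std::u32string> w)
        : words(std::move(w)) {}

    bool at_end() const override { return pos >= words.size(); }

    std::u32string token() const override {
        assert(!at_end());
        return words[pos];
    }

    void next() override {
        assert(!at_end());
        ++pos;
    }
};

// Consumer code depends only on the interface, so adding another
// segmentation algorithm means adding one more subclass.
std::vector<std::u32string> collect(SegmentIterator& it) {
    std::vector<std::u32string> out;
    for (; !it.at_end(); it.next()) out.push_back(it.token());
    return out;
}
```

With this shape, code that consumes tokens (collect here) does not need to know which algorithm produced them, which is the maintainability benefit of keeping each alternative behind its own iterator class.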
