Opened 13 years ago
Closed 6 years ago
#594 closed enhancement (incomplete)
Add support for SCWS Chinese segmentation library
| Reported by: | Olly Betts | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.4.x |
| Component: | Library API | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
Attachments (3)
Change History (7)
by , 13 years ago
Attachment: | xapian-scws-1.3.x-trunk.patch.txt added |
---|
by , 13 years ago
Attachment: | xapian-scws-1.3.x-trunk.updated.patch added |
---|
Updated patch which compiles without SCWS
comment:1 by , 13 years ago
I've cleaned up the patch a little, and fixed some warnings so it compiles cleanly without SCWS. I've not tried it with SCWS yet.
comment:2 by , 11 years ago
There's a newer patch, but it's not based on my cleaned-up version:
http://article.gmane.org/gmane.comp.search.xapian.general/9359
I'll also attach the patch file so it can't get lost.
comment:3 by , 9 years ago
Milestone: | 1.3.x → 1.4.x |
---|
Unfortunately, I think we have to bump this from this release cycle. It's really the sort of change that should go in early in the cycle so there's ample time to shake out problems, not right at the end, which is where we now are.
I should have sorted this out earlier in this cycle, but frankly the updated patch which totally ignored the work I'd already done cleaning up the previous one was quite a demotivator for spending time on this.
Anyway, let's get this in early in 1.5.x (which we don't yet have a milestone for so 1.4.x for now).
comment:4 by , 6 years ago
Resolution: | → incomplete |
---|---|
Status: | new → closed |
We now have support for CJK segmentation using ICU (merged in [f881f0bd1609]).
The patches here are all sadly very outdated (mostly my fault). But I think at this point closing this ticket makes the most sense.
If SCWS does a better job for Chinese than ICU we could potentially support both (and indeed other segmentation algorithms). I think for maintainability each additional alternative would need to be wrapped up cleanly in an iterator class, in the same way that CJKNgramIterator and CJKWordIterator wrap the ngram and ICU algorithms.
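To illustrate what "wrapped up cleanly in an iterator class" could look like, here is a rough, self-contained sketch. The names `BigramIterator` and `collect_tokens` are hypothetical and invented for this example; this is not the code of CJKNgramIterator or CJKWordIterator, and it calls neither SCWS nor ICU. It only assumes that a backend can be hidden behind dereference, increment and comparison, so calling code doesn't need to know which algorithm produced the tokens.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch only: NOT Xapian's CJKNgramIterator/CJKWordIterator.
// A trivial "segmenter" (overlapping bigrams of code points) hidden behind
// an input-iterator-style interface.
class BigramIterator {
    std::u32string text;
    std::size_t pos;          // start of the current token, or npos at the end
    std::u32string current;   // the current token

    // Load the token at pos, or switch to the end state if none remains.
    void load() {
        if (pos != std::u32string::npos && pos + 2 <= text.size()) {
            current = text.substr(pos, 2);
        } else {
            current.clear();
            pos = std::u32string::npos;
        }
    }

  public:
    // Begin iterator over the given text.
    explicit BigramIterator(std::u32string s) : text(std::move(s)), pos(0) {
        load();
    }

    // End iterator.
    BigramIterator() : pos(std::u32string::npos) {}

    const std::u32string& operator*() const { return current; }

    BigramIterator& operator++() {
        if (pos != std::u32string::npos) {
            ++pos;
            load();
        }
        return *this;
    }

    // Only the position is compared, which is enough to detect the end.
    bool operator==(const BigramIterator& o) const { return pos == o.pos; }
    bool operator!=(const BigramIterator& o) const { return !(*this == o); }
};

// The caller is written against the iterator interface only, so a wrapper
// around a different backend could be dropped in without changing it.
template <class Iterator>
std::vector<std::u32string> collect_tokens(Iterator it, Iterator end) {
    std::vector<std::u32string> out;
    for (; it != end; ++it) out.push_back(*it);
    return out;
}
```

A caller would write something like `collect_tokens(BigramIterator(text), BigramIterator())`. Under this assumption, an SCWS-backed class would expose the same three operations and keep all library-specific setup and teardown inside itself.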
Updated patch from second email thread