Opened 7 years ago

Closed 3 months ago

#594 closed enhancement (incomplete)

Add support for SCWS Chinese segmentation library

Reported by: olly
Owned by: olly
Priority: normal
Milestone: 1.4.x
Component: Library API
Version:
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Attachments (3)

xapian-scws-1.3.x-trunk.patch.txt (13.4 KB) - added by olly 7 years ago.
Updated patch from second email thread
xapian-scws-1.3.x-trunk.updated.patch (14.2 KB) - added by olly 7 years ago.
Updated patch which compiles without SCWS
xapian-scws-1.3.x-snap.patch (18.1 KB) - added by olly 5 years ago.
newer patch from original author


Change History (7)

Changed 7 years ago by olly

Updated patch from second email thread

Changed 7 years ago by olly

Updated patch which compiles without SCWS

comment:1 Changed 7 years ago by olly

I've cleaned up the patch a little, and fixed some warnings so it compiles cleanly without SCWS. I've not tried it with SCWS yet.

comment:2 Changed 5 years ago by olly

There's a newer patch, but it's not based on my cleaned-up version:

http://article.gmane.org/gmane.comp.search.xapian.general/9359

I'll also attach the patch file so it can't get lost.

Changed 5 years ago by olly

newer patch from original author

comment:3 Changed 3 years ago by olly

  • Milestone changed from 1.3.x to 1.4.x

Unfortunately, I think we have to bump this from this release cycle. It's really the sort of change that should go in early in the cycle so there's ample time to shake out problems, not right at the end which is where we now are.

I should have sorted this out earlier in this cycle, but frankly the updated patch which totally ignored the work I'd already done cleaning up the previous one was quite a demotivator for spending time on this.

Anyway, let's get this in early in 1.5.x (which we don't yet have a milestone for so 1.4.x for now).

comment:4 Changed 3 months ago by olly

  • Resolution set to incomplete
  • Status changed from new to closed

We now have support for CJK segmentation using ICU (merged in [f881f0bd1609]).

The patches here are all sadly very outdated (mostly my fault). But I think at this point closing this ticket makes the most sense.

If SCWS does a better job for Chinese than ICU, we could potentially support both (and indeed other segmentation algorithms). For maintainability, I think each additional alternative would need to be wrapped up cleanly in an iterator class, in the same way that CJKNgramIterator and CJKWordIterator wrap the ngram and ICU algorithms.
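
As a rough illustration of what "wrapped up cleanly in an iterator class" could look like, here is a minimal sketch. It is not taken from the Xapian source or from any of the attached patches: the class name ToySegmentIterator and the helper segment_all are hypothetical, and the "algorithm" is just splitting on ASCII spaces as a stand-in for whatever call into SCWS (or another segmenter) a real wrapper would make. The actual interfaces of CJKNgramIterator and CJKWordIterator may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical wrapper, not part of Xapian: it stands in for a class that
// would hide a segmentation backend (such as SCWS) behind an iterator.
// The iterator keeps a pointer to the text, so the string must outlive it.
class ToySegmentIterator {
    const std::string* text = nullptr;  // nullptr marks the end iterator
    std::size_t pos = 0;                // where the next scan starts
    std::string current;                // the segment operator*() returns

    // Placeholder algorithm: split on ASCII spaces. A real wrapper would
    // ask the segmentation library for the next segment here instead.
    void advance() {
        assert(text);  // advancing the end iterator is a caller bug
        std::size_t start = text->find_first_not_of(' ', pos);
        if (start == std::string::npos) {
            text = nullptr;  // no more segments: become the end iterator
            return;
        }
        std::size_t end = text->find(' ', start);
        if (end == std::string::npos) end = text->size();
        current.assign(*text, start, end - start);
        pos = end;
    }

  public:
    ToySegmentIterator() = default;  // constructs the end iterator
    explicit ToySegmentIterator(const std::string& s) : text(&s) { advance(); }

    const std::string& operator*() const {
        assert(text);  // dereferencing the end iterator is a caller bug
        return current;
    }
    ToySegmentIterator& operator++() {
        advance();
        return *this;
    }

    bool operator==(const ToySegmentIterator& o) const {
        // Two end iterators are equal; otherwise compare text and position.
        return text == o.text && (text == nullptr || pos == o.pos);
    }
    bool operator!=(const ToySegmentIterator& o) const { return !(*this == o); }
};

// Hypothetical caller: collects every segment. Code written against this
// iterator shape would not need to know which algorithm is behind it.
std::vector<std::string> segment_all(const std::string& s) {
    std::vector<std::string> out;
    for (ToySegmentIterator it(s); it != ToySegmentIterator(); ++it)
        out.push_back(*it);
    return out;
}
```

The point of the shape is that the calling code only sees construction from text, dereference, increment, and comparison against an end iterator, so swapping in a different segmenter would mean adding another class like this rather than touching the callers.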
