Opened 15 years ago
Closed 14 years ago
#448 closed enhancement (fixed)
Allow usage of custom stemmers
Reported by: | Evgeny Sizikov | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.1 |
Component: | Library API | Version: | 1.0.17 |
Severity: | normal | Keywords: | |
Cc: | asaf.bartov@… | Blocked By: | |
Blocking: | Operating System: | All |
Description
Currently it seems to be impossible to use custom stemmer with Xapian neither from C++ code nor from Xapian-bindings. This seems to be a nice feature to have,
Attachments (11)
Change History (49)
by , 15 years ago
Attachment: | xapian-core-1.0.17-stem-virtual-destructor.patch added |
---|
by , 15 years ago
Attachment: | xapian-bindings-1.0.17-stem-director.patch added |
---|
comment:1 by , 15 years ago
comment:2 by , 15 years ago
Resolution: | → duplicate |
---|---|
Status: | new → closed |
I'm afraid your patches won't actually work!
First problem is that you need to make operator() virtual too (and ideally get_description()).
However, Xapian::Stem (like many Xapian API classes) is just a wrapper around a reference counted pointer. So although with your patches you can now subclass Xapian::Stem in Python, when you pass an instance of the subclass to Xapian API functions, your Python stemming code won't actually get called, even if you make the operator() virtual.
This is all already covered by #186, so closing this ticket as a duplicate.
by , 15 years ago
Attachment: | xapian_hunspell_stemmer-0.1.tar.gz added |
---|
Python extension which provides a new stemmer for Xapian
by , 15 years ago
Attachment: | test_xapian_hunspell_stemmer.py added |
---|
comment:3 by , 15 years ago
The missed part in my previous comment is that actually I've made a Python extension which provides a new HunspellStem C++ class and a SWIG binding for it.
This DOES actually work and I'm using this in my project. The idea of this ticket would be not to solve the problem of subclassing xapian.Stem in Python but more a low level solution for the problem, which requires developing a C++ extension.
Personally I don't see why this could be covered by #186 in its current form.
comment:4 by , 15 years ago
Component: | Other → Library API |
---|---|
Resolution: | duplicate |
Status: | closed → reopened |
Version: | → 1.0.17 |
OK, your initial description and the contents of the first two patches strongly suggested you were just trying to subclass Xapian::Stem in Python, which really won't work. I don't think you need to enable directors for Xapian::Stem with your actual approach, and I also don't see why need to make Xapian::Stem's destructor virtual (since Xapian::Stem has no other virtual methods, and the destructor of your subclass doesn't do anything, I can't see why you need to do this).
We can't make a change in 1.0.x which involves changing the library ABI incompatibly, and making Xapian::Stem's destructor virtual would be such a change, so if you do need to do that, this would have to be a trunk-only change.
But if these alternative stemmers are useful, then I think it is better to build them in at the C++ API level - it's really an implementation detail whether they are part of Snowball or not, and that way they will automatically work from C++ and any wrapped language.
Casting away const to write into the result of calling std::string::data() is undefined behaviour, so don't do that!
Also, your claim of sole copyright on xapian_hunspell_stemmer.cpp is rather dubious since the definition of create_s() is lifted verbatim from Xapian's steminternal.cc (although it seems create_s() isn't actually used in your code). We need to be particularly careful about tracking the copyright status of files in Xapian as we want to reach a point where we can relicense the code, but there are currently copyright holders who aren't interested in doing so.
by , 15 years ago
Attachment: | xapian_hunspell_stemmer-0.1.1.tar.gz added |
---|
comment:5 by , 15 years ago
- Seems like Xapian::Stem has to have a vtable for allow correct. Without any virtual method it just doesn't work.
- Seems like SWIG "director" feature is not really required, thus xapian-bindings-1.0.17-stem-director.patch could be dropped out. Thanks for pointing that (I've meantioned I'm not a SWIG expert).
- It's OK to put ABI changes into trunk only if I'll get it as a stable release in a near future
- Yeah, sure. We could even have some "standard" stemmers in C++ level and allow one to use all/any of them by configure-time options. The best would be proceeding this way with providing SWIG bindings for each stemmer implementation.
- std::string::data() is a pointer to already prepared buffer of BUFSIZ size (see the line # 158 in the attachment v 0.1.1, so I'm using std::string as a self-contained buffer which will be freed immediately after going out of scopes.
- I didn't manage to build without copying create_s() function into my code - I would say it's a design issue of Xapian::Stem : why doesn't it a protected member of the class but an extern function? If I'm subclassing Xapian::Stem from outside the Xapian core/bindings (e.g. this extension module) I do not have it's implementation but only declaration. If you tell me how to get it to work without copying the code I'd gladly do it that way.
- I agree with you regarding copyrights - just forgot to update headers before attaching it here.
comment:6 by , 15 years ago
To put some more details to ppoint 1. above: if there's no any virtual method in Xapian::Stem (no vtable) I got this error when trying to use the subclassed instance from my extension:
Traceback (most recent call last): File "test.py", line 26, in <module> main() File "test.py", line 17, in main generator.set_stemmer(stem) TypeError: in method 'TermGenerator_set_stemmer', argument 2 of type 'Xapian::Stem const &'
comment:7 by , 15 years ago
It seems after all that the problem is not in C++ level at all. I could use a subclassed Xapian::Stem without the latter to have its destructor "virtual".
The problem is in the SWIG bindings. Could you please give me hint how the type missmatch problem in comment 6 could be workarounded?
comment:8 by , 15 years ago
Milestone: | → 1.2.x |
---|
I've no useful ideas what the problem with SWIG is I'm afraid. You're doing some unusual hacking around with types here, which may be what is confusing it.
It's OK to put ABI changes into trunk only if I'll get it as a stable release in a near future
I'm afraid at this point in the release cycle, you aren't likely to get ABI changes for a new feature into anything other than trunk. There would have to a compelling reason and a clean patch ready for merging. 1.0.x is ABI stable, and we're at the release candidate stage for 1.2.0. We're already months past where we were hoping to release 1.2.0, so we want to avoid changes which might cause further delays.
std::string::data() is a pointer to already prepared buffer of BUFSIZ size (see the line # 158 in the attachment v 0.1.1, so I'm using std::string as a self-contained buffer which will be freed immediately after going out of scopes.
That's simply not a valid thing to do. It might happen to work with the current version of your compiler, but that really means nothing if the language standard doesn't permit it. You need to use a temporary buffer and create (or assign to) the string object from that. It's annoying that you have to have an extra copying step, but that's life.
I didn't manage to build without copying create_s() function into my code - I would say it's a design issue of Xapian::Stem : why doesn't it a protected member of the class but an extern function? If I'm subclassing Xapian::Stem from outside the Xapian core/bindings (e.g. this extension module) I do not have it's implementation but only declaration. If you tell me how to get it to work without copying the code I'd gladly do it that way.
Mostly just because of how the code evolved from Snowball's C support code I guess. But you don't need these Snowball-related structures for your stemmer, so it's just pointless overhead to have them in there in your stemmer and to initialise them.
Stepping back for a moment, I think a cleaner approach to allowing user-provided stemming algorithms would be:
- Rename "Stem::Internal" to "Stem::SnowballImplementation"
- Create an API-visible Stem::Implementation class. + Change Xapian::Stem to hold this in its internal pointer. + Make "Stem::SnowballImplementation" a subclass of this.
- Add a new Xapian::Stem constructor to the API which takes "Stem::Implementation*".
Then you can just create a MyHunspellStem subclass of "Xapian::Stem::Implementation", and wrap it in a Xapian::Stem object to pass to the Xapian API. No need to grub around with Xapian's internals, and have your code break when we change how the Snowball stemmers work. You can just return a std::string instead of messing around with the Snowball support code's odd interface. Should be easier to wrap for SWIG. And no need to have both an internal and wrapper class for each user defined stemmer.
Would that meet your needs?
This looks to me to only involve upwardly compatible API and ABI changes, so could potentially even get backported to 1.0.x. So marking for 1.2.x.
comment:9 by , 15 years ago
Yeah, your suggestion would be just great.
Do you need me to provide an initial implementation?
comment:10 by , 15 years ago
Do you need me to provide an initial implementation?
That would be great. If you need any help, let me know.
comment:11 by , 15 years ago
Issues:
- C# binding fails to compile with this patch - seems to be a SWIG-specific issue, and I have too low SWIG experience to fix it.
- Where should the Stem::Implementation class public declaration be put in? It can't go into stem.h as SWIG complain about Xapian::Internal::RefCntBase base class, and it should be in steminternal.h (as it is now) for user to be able to include and subclass. Any thoughts on that?
by , 15 years ago
Attachment: | xapian-core-stem-implementation.patch added |
---|
The initial implementation of Stem::Implementation according to comment:8
comment:12 by , 15 years ago
I suspect the C# bindings issue will be that SWIG is generating a new .cs source file for the new class being wrapped, so you should just be able to update the list in csharp/Makefile.am.
For the second issue, I think you should be able to just add "%ignore Xapian::Internal::RefCntBase;" to xapian.i before "%include <xapian/stem.h>".
If either (or both!) of those suggestions are wrong, let me know.
comment:13 by , 15 years ago
- fixed - a new file has been added into xapian-bindings/csharp/Makefile.am
- adding "%ignore Xapian::Internal::RefCntBase;" to xapian.i before "%include <xapian/stem.h>" doesn't fix the problem, SWIG still fails during parsing.
The 2. could be fixed if we move Xapian::Stem::Implementation (and SnowballImplementation too) out of the Xapian::Stem - it would be public Xapian::StemImplementation and private/hidden Xapian::StemSnowballImplementation then. Is that acceptable?
comment:14 by , 15 years ago
Ah, the issue is probably that the version of SWIG we're using doesn't handle nested classes. There's improved support in newer SWIG, although not in a released version yet and I think it still has some limitations, so it's probably better to avoid adding such classes to the API.
So yes, Xapian::StemImplementation
seems good.
by , 15 years ago
Attachment: | xapian-core-stem-implementation-2.patch added |
---|
Fixed C# binding build error, moved Xapian::Stem::Implementation into the xapian/stem.h header. The only question is about the name for the Xapian::SnowballStemImplementation class ...
comment:15 by , 15 years ago
Ah, forgot to put a comment for you to get notified.
Well, could you please review the second patch?
comment:16 by , 15 years ago
Actually, trac does notify for attaching a patch, but I'm afraid I've just been acutely short of time recently.
A quick review:
- It's better not to make whitespace changes (wrapping some long lines) in the same patch as functional changes - it makes reviewing harder as the reviewer has to compare the wrapped and unwrapped versions to see if there are actually also functional changes there or not. One logical change per patch is greatly preferable.
Stem::Internal::Implementation
's empty default constructor doesn't seem to be needed.
- In comment:13 and comment:14 we discussed going with
Xapian::StemImplementation
but you haven't after all. Instead you've added a preprocessor check for SWIG in the API headers which seems the wrong approach. Firstly, we want to avoid SWIG-specific logic here, but more importantly we want SWIG to see the "stem implementation" class so that it will wrap it and allow it to be subclassed in Python (or whatever) by a user wanting to implement a stemmer.
comment:17 by , 15 years ago
The problem with Xapian::StemImplementation
is that SWIG fails if got full definition of
class Xapian::StemImplementation : public Xapian::Internal::RefCntBase { ... };
Having %ignore Xapian::Internal::RefCntBase;
in xapian.i
doesn't help.
comment:18 by , 15 years ago
It would be a shame not to be able to subclass your own stemmer in the SWIG-wrapped languages (it may be a little slow compared to a "pure C++" stemmer, but that's only likely to be a problem at index time when working with large amounts of data).
I'll take a look and see if I can get that SWIG to work with that somehow.
by , 15 years ago
Attachment: | xapian-core-stem-implementation-3.r14269.patch added |
---|
Reverts some code alignment not related to the ticket's main idea
comment:20 by , 15 years ago
If we'd like to be able to subclass our own stemmer in the SWIG-wrapped languages we need the Xapian::StemImplementation
to be a "director": %feature("director") Xapian::StemImplementation;
This doesn't work and the SWIG fails with
/home/esizikov/svn/xapian/xapian-core/include/xapian/stem.h:41: Warning(473): Returning a pointer or reference in a director method is not recommended.
When trying to enable the "director" feature for the only operator()
we really need from this interface class (like %feature("director") Xapian::StemImplementation::operator();
) I've got the error like:
modern/xapian_wrap.cc:34173: error: '_swig_thread_allow' was not declared in this scope
In the generated modern/xapian_wrap.cc we have:
... director = SWIG_DIRECTOR_CAST(arg1); upcall = (director && (director->swig_get_self()==swig_obj[0])); try { { try { if (upcall) { Swig::DirectorPureVirtualException::raise("Xapian::StemImplementation::operator ()"); } else { result = (arg1)->operator ()((std::string const &)*arg2); } } catch (...) { SWIG_PYTHON_THREAD_END_ALLOW; Xapian::SetPythonException(); SWIG_fail; } } } catch (Swig::DirectorException&) { SWIG_fail; } ...
As you can see SWIG_PYTHON_THREAD_END_ALLOW;
is used without a proper initialization. It would be OK if the generated code looks like this:
... director = SWIG_DIRECTOR_CAST(arg1); upcall = (director && (director->swig_get_self()==swig_obj[0])); try { SWIG_PYTHON_THREAD_BEGIN_ALLOW; { try { if (upcall) { Swig::DirectorPureVirtualException::raise("Xapian::StemImplementation::operator ()"); } else { result = (arg1)->operator ()((std::string const &)*arg2); } } catch (...) { SWIG_PYTHON_THREAD_END_ALLOW; Xapian::SetPythonException(); SWIG_fail; } } SWIG_PYTHON_THREAD_END_ALLOW; } catch (Swig::DirectorException&) { SWIG_fail; } ...
Seems like be a SWIG bug.
comment:21 by , 15 years ago
Cc: | added |
---|
comment:22 by , 15 years ago
I look forward to seeing this committed, as I'm trying to integrate an external (non-Snowball) stemmer for Hebrew based on libhspell.
by , 15 years ago
Attachment: | xapian-core-stem-implementation-4.r14281.patch added |
---|
Made StemImplementation be a SWIG director - unfinished, has a critical no-go issue.
comment:23 by , 15 years ago
I've attached a patch which makes StemImplementation
to be a SWIG "director" class, which allows its overloading in scripting languages.
Issues:
- Had to change
get_description()
function fromconst char * get_description() const = 0;
toconst std::string get_description() const = 0;
- approve needed.
- I've missed something related to threading because I'm always having the
Fatal Python error: PyEval_SaveThread: NULL tstate
error with process been aborted just before the Python script is going to terminate.
Besides these 2 points it works (as a proof of concept): I'm now able to use a custom stemmer from a Python script:
# -*- coding: utf-8 -*- import sys sys.path.insert(0, '/home/esizikov/svn/xapian/build/xapian-bindings/python/xapian') sys.path.insert(0, '/home/esizikov/svn/xapian/build/xapian-bindings/python/modern') import xapian import hunspell class HunspellStemmer(xapian.StemImplementation): def __init__(self, lang): super(HunspellStemmer, self).__init__() self._h = hunspell.HunSpell('/usr/share/myspell/%s.dic' % lang, '/usr/share/myspell/%s.aff' % lang) self._enc = self._h.get_dic_encoding() def __call__(self, s): return self._h.stem(unicode(s, 'utf-8').encode(self._enc))[0].decode(self._enc) def main(): text = 'платья из золота на продажу' stem_impl = HunspellStemmer('ru_RU') stem = xapian.Stem(stem_impl) print stem('платья') doc = xapian.Document() generator = xapian.TermGenerator() generator.set_document(doc) generator.set_stemmer(stem) generator.index_text(text) ti = doc.termlist_begin() print ti.get_term() query_parser = xapian.QueryParser() query_parser.set_stemmer(stem) query_parser.set_stemming_strategy(xapian.QueryParser.STEM_ALL) for term in query_parser.parse_query(text): print term, print if __name__ == '__main__': main()
comment:24 by , 15 years ago
Ah, forgot to mention issue 3:
stem = xapian.Stem(HunspellStemmer('ru_RU'))
doesn't work - need a reference counting hack like# file: xapian-bindings/python/extra.i # When we set a Stopper into the QueryParser, keep a python reference so it # won't be deleted. This hack can probably be removed once xapian bug #186 is # fixed. __queryparser_set_stopper_orig = QueryParser.set_stopper def _queryparser_set_stopper(self, stopper): self._stopper = stopper return __queryparser_set_stopper_orig(self, stopper) _queryparser_set_stopper.__doc__ = __queryparser_set_stopper_orig.__doc__ QueryParser.set_stopper = _queryparser_set_stopper del _queryparser_set_stopper
comment:25 by , 15 years ago
Milestone: | 1.2.x → 1.2.1 |
---|
OK, I'm just releasing 1.2.0. Will try to get this sorted out for 1.2.1.
comment:26 by , 15 years ago
I've just been working on merging this.
I think issue 2 is probably to do with ownership of the StemImplementation object - SWIG will allocate it with new, but then think it owns it, but once it is passed in to make a Stem object, it no longer does, so it presumably gets deleted twice, and the error is a confusing side effect. I'm currently failing to work out how to use SWIG's DISOWN mechanism here, which seems the relevant one.
comment:27 by , 15 years ago
OK, I was right in comment:26 - adding a call to the Python code to disown the object fixes the error on exit:
stem_impl.this.disown()
I've now got this working automatically using a DISOWN typemap, except that SWIG doesn't seem to currently support that for C# or Java. But we can just not wrap StemImplementation for these languages for now.
Issue 1 is OK - elsewhere in the API we return std::string in similar cases for the same reason. I've made the return value std::string (removing the const) which is consistent with what we do elsewhere. It would probably be slightly better to return a const object here (I've seen that recommended somewhere, since being able to modify the temporary object isn't useful), but I think consistency is more important, and if we change this everywhere, we'll break compatibility with existing user subclasses in the other cases.
Issue 3 is just a matter of copying the solution used elsewhere.
And I really should check the formalities - have you read the "Licensing of patches" section in HACKING? Is that all OK?
comment:28 by , 14 years ago
Yes, it's OK with me. My employer does not claim on code written by me in my free time. And I'm personally agree to put the patch under the terms of both MIT/X and GPL 2 and all later versions licenses.
comment:29 by , 14 years ago
OK, cool.
Issue 3 actually goes away when the DISOWN typemap is added - it was just a side effect of both Python and Xapian thinking they were responsible for deleting the same object.
I've committed the xapian-core parts with a few tweaks (in particular, I've added basic testsuite coverage for the new feature) in r14572. More test cases would be useful, and we ought to write a topic document for this (and stemming in general - the current stemming.html probably has the wrong focus for most users).
I'll try to get the xapian-bindings part sorted out next.
comment:30 by , 14 years ago
OK, cool.
Issue 3 actually goes away when the DISOWN typemap is added - it was just a side effect of both Python and Xapian thinking they were responsible for deleting the same object.
I've committed the xapian-core parts with a few tweaks (in particular, I've added basic testsuite coverage for the new feature) in r14572. More test cases would be useful, and we ought to write a topic document for this (and stemming in general - the current stemming.html probably has the wrong focus for most users).
I'll try to get the xapian-bindings part sorted out next.
comment:31 by , 14 years ago
Gah, I managed to repaste the previous comment, not the one I was trying to make.
DISOWN doesn't actually do the job we want, instead we need to call __disown__()
on the Python StemImplementation object. It seems these two "disown" features mean different things!
So r14598 adds this to the bindings, and is currently only tested for Python (with the __disown__()
calls explicit). We should try to make the implicit before we release this new feature.
comment:32 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
OK, I've managed to get the DISOWN typemap approach to work, so this is fixed in r14733.
comment:33 by , 14 years ago
It is impossible to use in-place instantiation of a custom stem implementation (e.g. stem = xapian.Stem(MyStemmer()) ...) with the current implementation. If the following patch would be applied to the Python bindings then it will fail with segmentation error:
Index: xapian-bindings/python/smoketest2.py =================================================================== --- xapian-bindings/python/smoketest2.py (revision 14737) +++ xapian-bindings/python/smoketest2.py (working copy) @@ -342,6 +342,9 @@ "5 * foo") def test_userstem(): + stem = xapian.Stem(MyStemmer()) + expect(stem('test'), 'tst') + mystem = MyStemmer() stem = xapian.Stem(mystem) expect(stem('test'), 'tst')
comment:34 by , 14 years ago
Actually, I'm getting a segfault further down in smoketest2.py without your patch, where a MyStemmer stemmer is instantiated in just that way.
I suspect this will be the old problem of the python "MyStemmer()" object being deleted while the xapian.Stem() object still has a reference to it, which we had with PostingSource queries. I can try and put a fix for this together tomorrow morning, unless Olly gets there first.
comment:35 by , 14 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Hmm, I wonder why I don't see the SEGV lower down.
So I think we're back to how we were before. Further digging in SWIG suggests that a DISOWN
typemap should be the same as calling __disown__()
, so I'm not sure why this isn't working.
Perhaps Python handles "temporary" objects in a special way?
comment:36 by , 14 years ago
I suspect you weren't seeing the SEGV lower down purely by chance, because the memory in use had been freed but not yet overwritten.
I've applied a change (r14739) which fixes this problem, for python at least. The fix is the same as numerous other such fixes in python/extra.i; for python subclasses of Director classes, there are two objects: the python object, and the C++ Director subclass. The reference being kept automatically is to the C++ Director subclass, so we need to take extra steps to keep a reference to the Python subclass. Ideally, we'd make the C++ Director subclass keep a reference to the corresponding Python subclass automatically, but I don't know how to do that, so we have the set of hacks in extra.i to do the equivalent explicitly.
I'm not sure whether this is a problem in other languages - it may be that SWIG is setting up the references appropriately automatically, but we should really write tests to check.
comment:37 by , 14 years ago
I don't think that's the right fix.
I think we want to do the equivalent of the patch I'll attach in a moment to the generated code.
Also, please change the ownership of a ticket if you're working on it - I was just working on this patch, and am rather surprised to find a fix committed...
by , 14 years ago
Attachment: | call-swig_disown.patch added |
---|
patch showing change I think we want in the gnerated code
comment:38 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Getting this into the generated C++ cleanly seems awkward, so for now I've made it call __disown__()
in the generated Python wrapper in r14742, and marking as closed again.
Personally I need in this feature to be available from Python bindings. For that to work I used a very small patch of both Xapian and Xapian bindings v. 1.0.17 (see attached patches).
What it does is simple declares the destructor of Xapian::Stem (the first patch) to be virtual which enables usage of SWIG directors' for the Stem class (the second patch).
With patched Xapian and Xapian bindings it becomes possible to subclass the Xapian::Stem and Xapian::Stem::Internal with the latter to use another stemmer (personally I'm using Hunspell for that).
Current shortcommings: