#119 closed defect (released)
Finish off Xapian::TermGenerator
Reported by: | Richard Boulton | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.0.0 |
Component: | QueryParser | Version: | SVN trunk |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | #118 | Operating System: | All |
Description
We should provide text processing routines which are compatible with the query parser in xapian core. The obvious way to do this is to move Omega's indextext.cc code across, but it will need reworking to fit in properly.
The name "TextSplitter" will not be used for this class, but gives some kind of idea about what it should do.
We should also add apostrophe normalisation to this class, to match the English stemmer (perhaps this should only be performed for English text, in which case the class will have to be language specific). See http://snowball.tartarus.org/texts/apostrophe.html
(I've assigned this to Olly since he said he has made some progress on this, but am happy to work on it if it would help.)
Change History (13)
comment:1 by , 18 years ago
Blocking: | 118 added |
---|
comment:2 by , 18 years ago
op_sys: | Linux → All |
---|---|
rep_platform: | PC → All |
Status: | new → assigned |
Or perhaps "library api", but that's not important really. BTW, feel free to suggest new components (as you have probably noticed, I added "MSVC makefiles" recently).
comment:3 by , 18 years ago
Summary: | Move Omega's indextext.cc to xapian-core → Finish off Xapian::TermGenerator |
---|
Status update: there's now a "Xapian::TermGenerator" class, which is used by Omega instead of indextext.cc. Bug summary changed to reflect this.
Left to do are:
- Sort out details of exactly how the class indexes (which needs to align with
changes in the QueryParser class).
- Sort out term normalisation - e.g. different apostrophe forms all converted to
"'" (again, QueryParser wants matching features).
- Write some feature tests for TermGenerator.
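The apostrophe normalisation item above can be illustrated with a small standalone sketch. It is not the TermGenerator implementation: the function name `normalise_apostrophes` is hypothetical, and the set of variants it handles (U+2019 and U+201B) is an assumption made for the example.

```cpp
#include <string>

// Sketch only: replace some UTF-8 encoded apostrophe variants with ASCII "'".
// The variants listed here are an assumption for illustration, not
// necessarily the set that Xapian::TermGenerator normalises.
std::string normalise_apostrophes(const std::string& text) {
    // UTF-8 byte sequences for U+2019 (right single quotation mark) and
    // U+201B (single high-reversed-9 quotation mark); both are 3 bytes long.
    static const char* const variants[] = { "\xE2\x80\x99", "\xE2\x80\x9B" };
    std::string out;
    out.reserve(text.size());
    std::string::size_type i = 0;
    while (i < text.size()) {
        bool replaced = false;
        for (const char* v : variants) {
            // compare() clamps the length, so a short tail never matches.
            if (text.compare(i, 3, v) == 0) {
                out += '\'';
                i += 3;
                replaced = true;
                break;
            }
        }
        if (!replaced) out += text[i++];
    }
    return out;
}
```

Applying the same function to both indexed text and query text is what keeps the two sides producing matching terms.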
comment:4 by , 18 years ago
Looks good so far. I was expecting the code to output a vector or list of terms, which could then be added to the document (or used in some other way - for example, for passing to a highlighting routine). However, adding directly to the document is probably a better way, since that'll be the usual use of the code, and the terms can always be extracted again from the document if some other use of them is desired.
One more thing to add to the list to be done is adding the new code to the bindings. I'll take a look at the swig ones now.
comment:5 by , 18 years ago
Passing back a list or vector of terms seems very clumsy - the natural C++ way would be an iterator which returns terms as asked. I looked at this, but it seemed too complex for the timescale we want to get 1.0 out in - but perhaps in the future (we could provide a TermGenerator compatibility wrapper around it easily enough).
Note that highlighting really requires more than just a list of terms. You either need byte offsets into the text, or to get the non-term parts of the text too.
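To make the point about byte offsets concrete, here is a minimal sketch of a splitter that records where each term came from. The `TermSpan` type and `split_with_offsets` function are hypothetical names for this example, and the splitting rule (ASCII alphanumerics only, lower-cased) is a deliberate simplification rather than what TermGenerator does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type: a term plus its location in the source text.
struct TermSpan {
    std::string term;    // lower-cased term
    std::size_t offset;  // byte offset of the term's first byte
    std::size_t length;  // length of the term in bytes
};

// Sketch only: treat runs of ASCII alphanumerics as terms and record the
// byte range each one occupies, so a highlighter can wrap that range.
std::vector<TermSpan> split_with_offsets(const std::string& text) {
    std::vector<TermSpan> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip the non-term bytes between words.
        while (i < text.size() &&
               !std::isalnum(static_cast<unsigned char>(text[i]))) {
            ++i;
        }
        const std::size_t start = i;
        std::string term;
        while (i < text.size() &&
               std::isalnum(static_cast<unsigned char>(text[i]))) {
            term += static_cast<char>(
                std::tolower(static_cast<unsigned char>(text[i])));
            ++i;
        }
        if (i > start) spans.push_back({term, start, i - start});
    }
    return spans;
}
```

With the offsets available, a highlighter can copy the text between spans unchanged and mark up only the spans whose term matches the query.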
If you're looking at SWIG wrappers, try to implement by just getting SWIG to parse the xapian/termgenerator.h header if possible.
comment:6 by , 18 years ago
I've just about got the SWIG wrappers working, but I've had to wrap the utf8iterator too. I've done it by applying %ignore or %rename to the appropriate things, and then "%include"ing the header files, which seems to work okay. I'm reading through the generated code to check that it looks plausible, then I'll commit and add some basic tests to the smoketests.
comment:7 by , 18 years ago
Hmm, thinking about this a bit, perhaps rather than exposing Utf8Iterator in the scripting languages (which is problematic in general, since Utf8Iterator requires that its buffer remains valid while it is active; besides, scripting languages will generally already have their own UTF-8 support) we should provide a C++ helper TermGenerator method which takes a std::string and constructs the Utf8Iterator for you, and only wrap that.
I'll add the C++ method anyway, since that's what most C++ API users will want to generate terms from.
comment:8 by , 18 years ago
I've just come to the same conclusion, after struggling for a while to work out why the iterator wasn't working as I expected.
I'm not quite sure how you can make a helper method to do this, because it needs to ensure that the string is valid for the lifetime of the Utf8Iterator. I'd have just overloaded the two index_text functions to have simple variants which take a std::string instead of a Utf8Iterator. However, I'll leave it to you, since it sounds like you have a better idea.
comment:9 by , 18 years ago
What you suggest is exactly what I had in mind (and I committed just now). Perhaps "convenience method" would have been a better description than "helper method".
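The overloading approach agreed on above can be sketched without Xapian itself. In this example `ByteCursor` is a hypothetical stand-in for Utf8Iterator (a non-owning cursor over a buffer) and `SketchIndexer` is a hypothetical stand-in for TermGenerator; the whitespace splitting is only there to give the method something to do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for Utf8Iterator: a non-owning cursor which is only
// valid while the buffer it points into stays alive.
class ByteCursor {
    const char* p;
    const char* end;
  public:
    ByteCursor(const char* data, std::size_t len) : p(data), end(data + len) {}
    bool at_end() const { return p == end; }
    char operator*() const { return *p; }
    ByteCursor& operator++() { ++p; return *this; }
};

// Hypothetical stand-in for TermGenerator.
class SketchIndexer {
    std::vector<std::string> terms_;
  public:
    // Core method: the caller must keep the underlying buffer alive.
    void index_text(ByteCursor it) {
        std::string term;
        for (; !it.at_end(); ++it) {
            if (*it == ' ') {
                if (!term.empty()) terms_.push_back(term);
                term.clear();
            } else {
                term += *it;
            }
        }
        if (!term.empty()) terms_.push_back(term);
    }

    // Convenience overload: the cursor is created and fully consumed inside
    // this call, while `text` is guaranteed to be alive.
    void index_text(const std::string& text) {
        index_text(ByteCursor(text.data(), text.size()));
    }

    const std::vector<std::string>& terms() const { return terms_; }
};
```

Because the cursor never outlives the call, a binding that wraps only the std::string overload does not have to expose the buffer lifetime rule to the scripting language at all.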
comment:10 by , 18 years ago
TermGenerator now normalises apostrophes.
The remaining to-do items are now listed here: http://wiki.xapian.org/BraveNewTerms
comment:11 by , 18 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
Now all implemented. We could usefully have more test cases, but that's almost always the case...
comment:12 by , 18 years ago
Operating System: | → All |
---|---|
Resolution: | fixed → released |
Fixed in 1.0.0 release.
comment:14 by , 17 years ago
Description: | modified (diff) |
---|---|
Milestone: | → 1.0.0 |
This is wanted for the 1.0 release.
There doesn't seem to be a useful component for this work, so I've put it under QueryParser, since that seems closest for now.