Opened 17 years ago

Closed 17 years ago

Last modified 16 years ago

#119 closed defect (released)

Finish off Xapian::TermGenerator

Reported by: Richard Boulton
Owned by: Olly Betts
Priority: normal
Milestone: 1.0.0
Component: QueryParser
Version: SVN trunk
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking: #118
Operating System: All

Description (last modified by Richard Boulton)

We should provide text processing routines which are compatible with the query parser in xapian core. The obvious way to do this is to move Omega's indextext.cc code across, but it will need reworking to fit in properly.

The name "TextSplitter" will not be used for this class, but gives some kind of idea about what it should do.

We should also add apostrophe normalisation to this class, to match the English stemmer (perhaps this should only be performed for English text, in which case the class will have to be language-specific). See http://snowball.tartarus.org/texts/apostrophe.html

(I've assigned this to Olly since he said he has made some progress on this, but am happy to work on it if it would help.)

Change History (13)

comment:1 by Richard Boulton, 17 years ago

Blocking: 118 added

This is wanted for the 1.0 release.

There doesn't seem to be a useful component for this work, so I've put it under QueryParser, since that seems closest for now.

comment:2 by Olly Betts, 17 years ago

op_sys: Linux → All
rep_platform: PC → All
Status: new → assigned

Or perhaps "library api", but that's not really important. BTW, feel free to suggest new components (as you have probably noticed, I added "MSVC makefiles" recently).

comment:3 by Olly Betts, 17 years ago

Summary: Move Omega's indextext.cc to xapian-core → Finish off Xapian::TermGenerator

Status update: there's now a "Xapian::TermGenerator" class, which is used by Omega instead of index_text.cc. Bug summary changed to reflect this.

Left to do are:

  • Sort out details of exactly how the class indexes (which needs to align with changes in the QueryParser class).

  • Sort out term normalisation - e.g. different apostrophe forms all converted to "'" (again, QueryParser wants matching features).

comment:4 by Richard Boulton, 17 years ago

Looks good so far. I was expecting the code to output a vector or list of terms, which could then be added to the document (or used in some other way - for example, for passing to a highlighting routine). However, adding directly to the document is probably a better way, since that'll be the usual use of the code, and the terms can always be extracted again from the document if some other use of them is desired.

One more thing to add to the list to be done is adding the new code to the bindings. I'll take a look at the swig ones now.

comment:5 by Olly Betts, 17 years ago

Passing back a list or vector of terms seems very clumsy - the natural C++ way would be an iterator which returns terms as asked. I looked at this, but it seemed too complex for the timescale we want to get 1.0 out in - but perhaps in the future (we could provide a TermGenerator compatibility wrapper around it easily enough).

Note that highlighting really requires more than just a list of terms. You either need byte offsets into the text, or to get the non-term parts of the text too.
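To illustrate why a bare list of terms is not enough for highlighting, here is a minimal sketch of a splitter that records byte offsets alongside each term. Everything in it (TermSpan, split_with_offsets, highlight) is a hypothetical name made up for this example and not part of the Xapian API, and it only treats ASCII alphanumerics as word characters, which is far cruder than what TermGenerator does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type: a lower-cased term plus the byte range [begin, end)
// it occupies in the original text.
struct TermSpan {
    std::string term;
    std::size_t begin;
    std::size_t end;
};

// Split on runs of ASCII alphanumerics (a simplification), keeping offsets.
std::vector<TermSpan> split_with_offsets(const std::string& text) {
    std::vector<TermSpan> spans;
    std::size_t i = 0;
    while (i < text.size()) {
        // Skip the non-term bytes between words.
        while (i < text.size() &&
               !std::isalnum(static_cast<unsigned char>(text[i]))) ++i;
        std::size_t start = i;
        std::string term;
        while (i < text.size() &&
               std::isalnum(static_cast<unsigned char>(text[i]))) {
            term += static_cast<char>(
                std::tolower(static_cast<unsigned char>(text[i])));
            ++i;
        }
        if (i > start) spans.push_back({term, start, i});
    }
    return spans;
}

// Wrap every occurrence of `wanted` in square brackets. The offsets let us
// copy the original bytes (including case and punctuation) unchanged.
std::string highlight(const std::string& text, const std::string& wanted) {
    std::string out;
    std::size_t pos = 0;
    for (const TermSpan& s : split_with_offsets(text)) {
        out.append(text, pos, s.begin - pos);  // non-term gap before the term
        bool hit = (s.term == wanted);
        if (hit) out += '[';
        out.append(text, s.begin, s.end - s.begin);
        if (hit) out += ']';
        pos = s.end;
    }
    out.append(text, pos, std::string::npos);  // trailing non-term bytes
    return out;
}
```

With only the normalised terms ("hello", "world") there would be no way to restore the original capitalisation or the punctuation between words; the offsets are what make that possible.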

If you're looking at SWIG wrappers, try to implement by just getting SWIG to parse the xapian/termgenerator.h header if possible.

comment:6 by Richard Boulton, 17 years ago

I've just about got the SWIG wrappers working, but I've had to wrap the utf8iterator too. I've done it by %ignore or %renaming the appropriate things, and then "%include"ing the header files, which seems to work okay. I'm checking through the generated code to check that it looks plausible, then I'll commit and add some basic tests to the smoketests.

comment:7 by Olly Betts, 17 years ago

Hmm, thinking about this a bit: rather than exposing Utf8Iterator in the scripting languages (which is problematic in general, since Utf8Iterator requires that its buffer remains valid while it is active; besides, scripting languages will generally already have their own UTF-8 support), perhaps we should provide a C++ helper TermGenerator method which takes a std::string and constructs the Utf8Iterator for you, and only wrap that.

I'll add the C++ method anyway, since that's what most C++ API users will want to generate terms from.

comment:8 by Richard Boulton, 17 years ago

I've just come to the same conclusion, after struggling for a while to work out why the iterator wasn't working as I expected.

I'm not quite sure how you can make a helper method to do this, because it needs to ensure that the string is valid for the lifetime of the Utf8Iterator. I'd have just overloaded the two index_text functions to have simple variants which take a std::string instead of a Utf8Iterator. However, I'll leave it to you, since it sounds like you have a better idea.

comment:9 by Olly Betts, 17 years ago

What you suggest is exactly what I had in mind (and I committed just now). Perhaps "convenience method" would have been a better description than "helper method".
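For readers following the lifetime concern, the overload pattern can be sketched roughly as below. ByteCursor and ToyGenerator are hypothetical stand-ins invented for this illustration (they are not Xapian's Utf8Iterator or TermGenerator), and the "indexing" just splits on spaces. The point is only that the cursor is built and consumed inside the call, so the caller's string is guaranteed to outlive it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical non-owning cursor over a byte buffer. Like the iterator
// discussed above, it is only valid while the buffer it points into is.
class ByteCursor {
    const char* p;
    const char* end;
public:
    ByteCursor(const char* data, std::size_t len) : p(data), end(data + len) {}
    bool at_end() const { return p == end; }
    char operator*() const { return *p; }
    ByteCursor& operator++() { ++p; return *this; }
};

// Hypothetical generator with both variants of index_text.
class ToyGenerator {
    std::vector<std::string> terms;
public:
    // Low-level variant: the caller must keep the buffer alive.
    void index_text(ByteCursor it) {
        std::string current;
        for (; !it.at_end(); ++it) {
            if (*it == ' ') {
                if (!current.empty()) terms.push_back(current);
                current.clear();
            } else {
                current += *it;
            }
        }
        if (!current.empty()) terms.push_back(current);
    }

    // Convenience variant: `text` is a reference held by the caller for the
    // whole call, and the cursor never escapes this function, so there is
    // no window in which the cursor outlives its buffer.
    void index_text(const std::string& text) {
        index_text(ByteCursor(text.data(), text.size()));
    }

    const std::vector<std::string>& get_terms() const { return terms; }
};
```

Passing a temporary such as `gen.index_text(std::string("two words"))` is also safe here, because the temporary lives until the end of the full expression, which is after the call returns.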

comment:10 by Olly Betts, 17 years ago

TermGenerator now normalises apostrophes.
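As a rough illustration of what this kind of normalisation involves, the sketch below folds three typographic apostrophe forms into ASCII "'" in UTF-8 text. The function name normalise_apostrophes is hypothetical, and the choice of U+2018, U+2019 and U+201B is an assumption for the example; this ticket does not say which forms TermGenerator actually handles, and this is not its implementation.

```cpp
#include <string>

// Replace the UTF-8 encodings of U+2018, U+2019 and U+201B with ASCII "'".
// All three encode as the bytes E2 80 xx and differ only in the last byte.
// (Which code points to fold is an assumption made for this sketch.)
std::string normalise_apostrophes(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        unsigned char c0 = static_cast<unsigned char>(text[i]);
        if (c0 == 0xE2 && i + 2 < text.size()) {
            unsigned char c1 = static_cast<unsigned char>(text[i + 1]);
            unsigned char c2 = static_cast<unsigned char>(text[i + 2]);
            if (c1 == 0x80 && (c2 == 0x98 || c2 == 0x99 || c2 == 0x9B)) {
                out += '\'';
                i += 2;  // skip the two continuation bytes
                continue;
            }
        }
        out += text[i];  // every other byte is copied through unchanged
    }
    return out;
}
```

Applying the same function to both indexed text and query text is what keeps the two sides matching, which is the reason QueryParser needs the equivalent feature.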

Remaining to do items now listed here: http://wiki.xapian.org/BraveNewTerms

comment:11 by Olly Betts, 17 years ago

Resolution: fixed
Status: assigned → closed

Now all implemented. We could usefully have more test cases, but that's almost always the case...

comment:12 by Olly Betts, 17 years ago

Operating System: All
Resolution: fixed → released

Fixed in 1.0.0 release.

comment:14 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.0.0