Opened 17 years ago

Last modified 8 months ago

#178 assigned defect

No remote backend support for: spelling correction, matchdecider

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Backend-Remote Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

The remote database was briefly feature complete, but it's fallen behind again - it doesn't support spelling correction (adding and removing spellings at index time is supported though) or matchdeciders.

We should ideally add these in to it at some point.

Attachments (5)

metadata_and_mvcms_register.patch (5.7 KB ) - added by Paul Rudin 14 years ago.
metadata_and_mvcms_register_and_spelling.patch (14.7 KB ) - added by Paul Rudin 14 years ago.
remote-spell.patch (22.7 KB ) - added by Paul Rudin 14 years ago.
remote-spell-remainder.patch (20.4 KB ) - added by Olly Betts 14 years ago.
Updated patch with unapplied changes
remote-spell-remainder-updated.patch (21.7 KB ) - added by Bruno Deferrari 13 years ago.
Updated remote spelling suggestions patch.

Download all attachments as: .zip

Change History (40)

comment:1 by Richard Boulton, 17 years ago

Blocking: 120 added

We should probably try and get this done in the 1.0.x series.

comment:2 by Richard Boulton, 17 years ago

Owner: changed from Olly Betts to Richard Boulton

comment:3 by Richard Boulton, 17 years ago

Status: newassigned

comment:4 by Olly Betts, 17 years ago

Cc: olly@… added
Summary: Remote database does not support spelling correctionRemote database does not support spelling correction or synonyms

It never supported matchdeciders. I don't believe those or matchspies are implementable in 1.0.x.

Let's make this bug for spelling correction and synonyms, and bug#183 can be for remote support for Xapian::MatchDecider.

comment:5 by Olly Betts, 17 years ago

Operating System: All
Summary: Remote database does not support spelling correction or synonymsNo remote backend support for: spelling correction, synonyms, metadata

And now ... metadata!

comment:7 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.1

comment:8 by Richard Boulton, 16 years ago

Blocking: 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

comment:9 by Olly Betts, 15 years ago

Milestone: 1.1.01.1.1

These aren't incompatible changes (except for remote protocol changes) so bumping to 1.1.1.

comment:10 by Olly Betts, 15 years ago

Milestone: 1.1.11.1.7

Triaging milestone:1.1.1 bugs.

comment:11 by Olly Betts, 15 years ago

Milestone: 1.1.71.2.0

These can be implemented in 1.2.x without incompatible changes. Bumping.

by Paul Rudin, 14 years ago

comment:12 by Paul Rudin, 14 years ago

I've added a patch against the matchspy branch that does metadata and also registers MultiValueCountMatchSpy.

comment:13 by Paul Rudin, 14 years ago

Incidentally - another thing that remote databases don't support: Enquire.get_mset can take a matchdecider, but that has no effect for remote databases.

comment:14 by Olly Betts, 14 years ago

Description: modified (diff)
Summary: No remote backend support for: spelling correction, synonyms, metadataNo remote backend support for: spelling correction, synonyms, matchdecider

The MultiValueCountMatchSpy patch hunk isn't relevant for trunk, as that class doesn't exist there.

This changes the remote protocol incompatibly, so XAPIAN_REMOTE_PROTOCOL_MAJOR_VERSION needs bumping from 33 to 34, which I've done.

Also, the documentation for the remote protocol (remote_protocol.html) needs updating, which I've also done.

RemoteServer::set_metadata() was truncating at zero bytes in the value - I've extended the metadata1 test to cover cases with zero bytes, and fixed the patch.

And the new methods use 2 space indents rather than the standard 4, so I fixed that too.

I assume this work is (C) Lemur, so I updated the headers to reflect that. If that's wrong, let me know.

Committed to trunk with the changes above as r13688.

comment:15 by Olly Betts, 14 years ago

Regarding comment:13, that should throw UnimplementedError rather than silently ignoring the matchdecider.

comment:16 by Paul Rudin, 14 years ago

I've added another patch which includes stuff for remote spellings. However I'm not really sure whether it's doing the right thing - I'm a bit confused about the separation of Database and Database::Internal as far as the various spelling related methods go. I've had to use Database::Internal inside RemoteServer, which nothing else does - so maybe this approach misses something?

open_spelling_wordlist sends them all back in one go, so is maybe not what's wanted? But that isn't used by get_spelling_suggestion which is the main thing I want to work at the moment.

comment:17 by Olly Betts, 14 years ago

It would be much more helpful to have a patch with just the new changes, not one which also includes those I've already applied to trunk.

As I said on IRC, I think Database::get_spelling_suggestion() with remote databases should cause a remote search for the best spelling suggestion and just pass that back. Only the local databases should take part in the OrTermList. You can easily calculate the edit distance for the remote spelling suggestion(s), and compare it/them to that of the best local one to find the best overall answer.

I don't think opening the spelling termlists and streaming them over is the right approach. It involves sending a lot of data, and if we have to read each as a single lump, will be rather slow too. Arrange to stream them is possible, but fiddly, and the effort is misdirected I feel.

So RemoteDatabase::open_spelling_termlist() should never get called and can simply remain unimplemented.

And RemoteDatabase::open_spelling_wordlist() can be implemented using the public API (via Xapian::Database::spellings_begin() and Xapian::Database::spellings_end()).

Also, you're not being zero-byte safe (again) - in RemoteServer::msg_getspellingsuggestion(const string & message):

    string word = string(p); 

should be:

    string word(p, p_end - p);

Otherwise word will be truncated at the first zero byte it contains (not a very likely occurrence, but it is easy to get right.

comment:18 by Olly Betts, 14 years ago

Oh, hmm - except that the search for the best spelling on the remote server should ideally use the spelling frequencies from the combined database. So this would only give the same answers if there's a single sub-database which is remote - for the more general case it won't, though the answers should be reasonable still. If you only care about that case, supporting only that would be an improvement over not supporting remote spelling at all!

I don't immediately see how to solve this issue though. The approach used for matching documents doesn't seem to work here.

comment:19 by Olly Betts, 14 years ago

Re comment:13 and comment:15, trunk as of r13690 throws UnimplementedError if you try to use a MatchDecider with the remote backend.

comment:20 by Olly Betts, 14 years ago

I'm not sure where it happened (IRC maybe), but a while ago we discussed whether MatchDecider was still useful - for the "match spy" use, there's now MatchSpy, and for filtering documents with a functor, PostingSource can do this too, but with more possibilities (e.g. weights, applying to sub-queries, applying more than one) and that already has remote support.

Perhaps we should deprecate MatchDecider in 1.2.0?

comment:21 by Olly Betts, 14 years ago

Milestone: 1.2.x1.1.7

And Paul Rubin has a newer patch for remote spelling support, which is might be nice to integrate before 1.2.0 - rationale: it'll avoid a (probably minor) protocol version bump in 1.2.x, and is likely to be minimally invasive, so if it breaks, it'll probably only be remote spelling support that is broken, and that doesn't work at all currently.

Not worth holding a release for, but marking this ticket for 1.1.7 so we take a concious decision on this and deprecating MatchDecider.

comment:22 by Richard Boulton, 14 years ago

Owner: changed from Richard Boulton to Not currently assigned
Status: assignednew

It might be a good idea to deprecate MatchDecider for 1.2.0 (for removal in 1.3.0?), but I think we should try and provide an easy migration path for users (which would be a PostingSource which implemented the MatchDecider interface, so users could simply subclass that posting source and filter the query with it). PostingSource classes are quite a bit more fiddly to implement than MatchDecider classes.

I'm not 100% sure we should deprecate the MatchDecider interface - while it can be replicated with a PostingSource, it's a more convenient API. We could change the internal implementation to use a PostingSource, which would allow us to simplify the matcher code. For this reason, I don't think there's much need to hurry to deprecate MatchDecider.

I've emailed Paul Rudin to ask if he wants to submit the spelling patch to this ticket.

I don't have time to work on this ticket in the near future, so assigning it to "nobody" for clarity.

comment:23 by Olly Betts, 14 years ago

Cc: Olly Betts removed
Owner: changed from Not currently assigned to Olly Betts
Status: newassigned

Yes, it would be for removal in 1.3.0 (or by 1.4.0 at least, assuming we have another development series, which I think probably makes sense as it generally worked well).

There are various costs to having redundant ways of doing things kept around for no good reason - there's more API for users to get to grips with, and more code around to maintain, which adds to the download and installed sizes, and to the build time of the library and applications using it, and perhaps worst of all for us, the testsuite runtime.

by Paul Rudin, 14 years ago

Attachment: remote-spell.patch added

comment:24 by Paul Rudin, 14 years ago

I've attached a patch that include remote spelling. I'd intended to rework it a bit as it was hacked together quite quickly so that I could do some experimentation, but I'm not sure when I'd get around to it at the moment so here's what I have at the moment.

comment:25 by Olly Betts, 14 years ago

The part which moves and reindents a large chunk of code doesn't apply for me. I didn't look into what had changed underneath, as this seems a bit major to commit right now with 1.2.0 very close. It would also be nicer if we didn't have to poke in internals here.

I've pulled out the add_spelling() and remove_spelling() part, as that looks good, and might be useful to some users. Applied that in r14076. I'll attach a patch with the remaining changes, tidied up a bit.

by Olly Betts, 14 years ago

Updated patch with unapplied changes

comment:26 by Olly Betts, 14 years ago

Milestone: 1.1.51.2.x

Updating milestone.

comment:27 by Olly Betts, 13 years ago

Description: modified (diff)
Summary: No remote backend support for: spelling correction, synonyms, matchdeciderNo remote backend support for: spelling correction, synonyms, matchdecider, metadata_keys_begin()

by Bruno Deferrari, 13 years ago

Updated remote spelling suggestions patch.

comment:28 by Bruno Deferrari, 13 years ago

Just attached an updated version of remote-spell-remainder.patch ('remote-spell-remainder-updated.patch') that applies cleanly against trunk (r15475 right now) and against 1.2.5.

How likely is this to be merged into mainline?

comment:29 by Olly Betts, 13 years ago

Milestone: 1.2.x1.3.0

tizoc: I'm not sure how likely it is - the patch reindents and moves code around, so it's not obvious just from reading it what is actually being changed. I'd have to review it in more detail, which I don't have time for at present.

Marking for milestone 1.3.0 for now, to make sure we consider this patch for the development series.

comment:30 by Olly Betts, 12 years ago

Milestone: 1.3.01.3.x

comment:31 by Olly Betts, 8 years ago

Milestone: 1.3.x1.4.x

Dear me, this has languished a long time. It would be good to this sorted out (at least the spelling part that there's a patch for).

Looking at the patch, it seems to implement approximately what's sketched out in comment:17 and comment:18, except that every candidate causes a fetch of the spelling frequency from each of the remote sub-databases, and that happens individually, incurring the network round-trip latency each time - that seems like it's going to be a performance killer in real-world use.

tizoc - have you been using this patch in production? If so, how well does it work?

In technical terms, it's not a 1.4.0 blocker (it's ABI-compatible and requires a minor protocol version bump), and at this point shortening the path to 1.4.0 is an important goal, so adjusting the milestone.

comment:32 by Bruno Deferrari, 8 years ago

Hi olly.

I used it for a while without issues I think, but I then wrapped the Xapian store with a small HTTP application that opened the database files directly and without using the remote backend (was not needed anymore because that HTTP application was the only store reader/writer, other parts of the system just talked to it).

So, while for a few months it worked fine on what I was working on (I don't remember having any issues), I can't say it was really put to enough stress for enough time to confirm that it is production ready.

comment:33 by Olly Betts, 8 years ago

Thanks for the update. Let's leave this as 1.4.x then.

comment:34 by Olly Betts, 4 years ago

Description: modified (diff)
Milestone: 1.4.x1.5.0
Summary: No remote backend support for: spelling correction, synonyms, matchdecider, metadata_keys_begin()No remote backend support for: spelling correction, matchdecider, metadata_keys_begin()
Version: SVN trunkgit master

Gaurav implemented remote support for synonyms in https://github.com/xapian/xapian/pull/284 (merged to master in 8bff57c22173dfbd3c14a1e1fad280ee8c38c423).

I don't think this or the remaining missing features are 1.4.x material at this point in the release cycle. Moving to 1.5.0 for now.

comment:35 by Olly Betts, 13 months ago

Description: modified (diff)
Summary: No remote backend support for: spelling correction, matchdecider, metadata_keys_begin()No remote backend support for: spelling correction, matchdecider

metadata_keys_begin() was actually implemented a long time ago, back in 2010 in [eacacefcc5d29c95292a7b1a26cd8fa5a904748a] which debuted in 1.2.20. It's tested by testcase metadata5 which runs and passes for remote backends.

comment:36 by Olly Betts, 8 months ago

Milestone: 1.5.02.0.0
Note: See TracTickets for help on using tickets.