Ticket #178 (assigned defect)

Opened 3 years ago

Last modified 6 months ago

No remote backend support for: spelling correction, synonyms, matchdecider

Reported by: richard Owned by: olly
Priority: normal Milestone: 1.2.x
Component: Backend-Remote Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Operating System: All Blocking:

Description (last modified by olly) (diff)

The remote database was briefly feature complete, but it's fallen behind again - it doesn't support spelling correction, synonym expansion, or matchdeciders.

We should add these in to it at some point.

Attachments

metadata_and_mvcms_register.patch Download (5.7 KB) - added by pr100 9 months ago.
metadata_and_mvcms_register_and_spelling.patch Download (14.7 KB) - added by pr100 9 months ago.
remote-spell.patch Download (22.7 KB) - added by pr100 6 months ago.
remote-spell-remainder.patch Download (20.4 KB) - added by olly 6 months ago.
Updated patch with unapplied changes

Change History

Changed 3 years ago by richard

  • blocking 120 added

We should probably try and get this done in the 1.0.x series.

Changed 3 years ago by richard

  • owner changed from olly to richard

Changed 3 years ago by richard

  • status changed from new to assigned

Changed 3 years ago by olly

  • cc olly@… added
  • summary changed from Remote database does not support spelling correction to Remote database does not support spelling correction or synonyms

It never supported matchdeciders. I don't believe those or matchspies are implementable in 1.0.x.

Let's make this bug for spelling correction and synonyms, and bug#183 can be for remote support for Xapian::MatchDecider?.

Changed 3 years ago by olly

  • summary changed from Remote database does not support spelling correction or synonyms to No remote backend support for: spelling correction, synonyms, metadata

And now ... metadata!

Changed 3 years ago by trac

  • platform set to All

Changed 2 years ago by richard

  • description modified (diff)
  • milestone set to 1.1

Changed 2 years ago by richard

  • blocking 120 removed

(In #120) Remove the unfixed dependencies so we can close this bug - they're all marked for the 1.1.0 milestone.

Changed 21 months ago by olly

  • milestone changed from 1.1.0 to 1.1.1

These aren't incompatible changes (except for remote protocol changes) so bumping to 1.1.1.

Changed 17 months ago by olly

  • milestone changed from 1.1.1 to 1.1.7

Triaging milestone:1.1.1 bugs.

Changed 14 months ago by olly

  • milestone changed from 1.1.7 to 1.2.0

These can be implemented in 1.2.x without incompatible changes. Bumping.

Changed 9 months ago by pr100

Changed 9 months ago by pr100

I've added a patch against the matchspy branch that does metadata and also registers MultiValueCountMatchSpy?.

Changed 9 months ago by pr100

Incidentally - another thing that remote databases don't support: Enquire.get_mset can take a matchdecider, but that has no effect for remote databases.

Changed 9 months ago by olly

  • description modified (diff)
  • summary changed from No remote backend support for: spelling correction, synonyms, metadata to No remote backend support for: spelling correction, synonyms, matchdecider

The MultiValueCountMatchSpy patch hunk isn't relevant for trunk, as that class doesn't exist there.

This changes the remote protocol incompatibly, so XAPIAN_REMOTE_PROTOCOL_MAJOR_VERSION needs bumping from 33 to 34, which I've done.

Also, the documentation for the remote protocol (remote_protocol.html) needs updating, which I've also done.

RemoteServer::set_metadata() was truncating at zero bytes in the value - I've extended the metadata1 test to cover cases with zero bytes, and fixed the patch.

And the new methods use 2 space indents rather than the standard 4, so I fixed that too.

I assume this work is (C) Lemur, so I updated the headers to reflect that. If that's wrong, let me know.

Committed to trunk with the changes above as r13688.

Changed 9 months ago by olly

Regarding comment:13, that should throw UnimplementedError rather than silently ignoring the matchdecider.

Changed 9 months ago by pr100

Changed 9 months ago by pr100

I've added another patch which includes stuff for remote spellings. However I'm not really sure whether it's doing the right thing - I'm a bit confused about the separation of Database and Database::Internal as far as the various spelling related methods go. I've had to use Database::Internal inside RemoteServer?, which nothing else does - so maybe this approach misses something?

open_spelling_wordlist sends them all back in one go, so is maybe not what's wanted? But that isn't used by get_spelling_suggestion which is the main thing I want to work at the moment.

Changed 9 months ago by olly

It would be much more helpful to have a patch with just the new changes, not one which also includes those I've already applied to trunk.

As I said on IRC, I think Database::get_spelling_suggestion() with remote databases should cause a remote search for the best spelling suggestion and just pass that back. Only the local databases should take part in the OrTermList. You can easily calculate the edit distance for the remote spelling suggestion(s), and compare it/them to that of the best local one to find the best overall answer.

I don't think opening the spelling termlists and streaming them over is the right approach. It involves sending a lot of data, and if we have to read each as a single lump, will be rather slow too. Arrange to stream them is possible, but fiddly, and the effort is misdirected I feel.

So RemoteDatabase::open_spelling_termlist() should never get called and can simply remain unimplemented.

And RemoteDatabase::open_spelling_wordlist() can be implemented using the public API (via Xapian::Database::spellings_begin() and Xapian::Database::spellings_end()).

Also, you're not being zero-byte safe (again) - in RemoteServer::msg_getspellingsuggestion(const string & message):

    string word = string(p); 

should be:

    string word(p, p_end - p);

Otherwise word will be truncated at the first zero byte it contains (not a very likely occurrence, but it is easy to get right.

Changed 9 months ago by olly

Oh, hmm - except that the search for the best spelling on the remote server should ideally use the spelling frequencies from the combined database. So this would only give the same answers if there's a single sub-database which is remote - for the more general case it won't, though the answers should be reasonable still. If you only care about that case, supporting only that would be an improvement over not supporting remote spelling at all!

I don't immediately see how to solve this issue though. The approach used for matching documents doesn't seem to work here.

Changed 9 months ago by olly

Re comment:13 and comment:15, trunk as of r13690 throws UnimplementedError if you try to use a MatchDecider with the remote backend.

Changed 7 months ago by olly

I'm not sure where it happened (IRC maybe), but a while ago we discussed whether MatchDecider was still useful - for the "match spy" use, there's now MatchSpy, and for filtering documents with a functor, PostingSource can do this too, but with more possibilities (e.g. weights, applying to sub-queries, applying more than one) and that already has remote support.

Perhaps we should deprecate MatchDecider in 1.2.0?

Changed 7 months ago by olly

  • milestone changed from 1.2.x to 1.1.7

And Paul Rubin has a newer patch for remote spelling support, which is might be nice to integrate before 1.2.0 - rationale: it'll avoid a (probably minor) protocol version bump in 1.2.x, and is likely to be minimally invasive, so if it breaks, it'll probably only be remote spelling support that is broken, and that doesn't work at all currently.

Not worth holding a release for, but marking this ticket for 1.1.7 so we take a concious decision on this and deprecating MatchDecider.

Changed 6 months ago by richard

  • owner changed from richard to nobody
  • status changed from assigned to new

It might be a good idea to deprecate MatchDecider for 1.2.0 (for removal in 1.3.0?), but I think we should try and provide an easy migration path for users (which would be a PostingSource which implemented the MatchDecider interface, so users could simply subclass that posting source and filter the query with it). PostingSource classes are quite a bit more fiddly to implement than MatchDecider classes.

I'm not 100% sure we should deprecate the MatchDecider interface - while it can be replicated with a PostingSource, it's a more convenient API. We could change the internal implementation to use a PostingSource?, which would allow us to simplify the matcher code. For this reason, I don't think there's much need to hurry to deprecate MatchDecider.

I've emailed Paul Rudin to ask if he wants to submit the spelling patch to this ticket.

I don't have time to work on this ticket in the near future, so assigning it to "nobody" for clarity.

Changed 6 months ago by olly

  • cc olly removed
  • owner changed from nobody to olly
  • status changed from new to assigned

Yes, it would be for removal in 1.3.0 (or by 1.4.0 at least, assuming we have another development series, which I think probably makes sense as it generally worked well).

There are various costs to having redundant ways of doing things kept around for no good reason - there's more API for users to get to grips with, and more code around to maintain, which adds to the download and installed sizes, and to the build time of the library and applications using it, and perhaps worst of all for us, the testsuite runtime.

Changed 6 months ago by pr100

Changed 6 months ago by pr100

I've attached a patch that include remote spelling. I'd intended to rework it a bit as it was hacked together quite quickly so that I could do some experimentation, but I'm not sure when I'd get around to it at the moment so here's what I have at the moment.

Changed 6 months ago by olly

The part which moves and reindents a large chunk of code doesn't apply for me. I didn't look into what had changed underneath, as this seems a bit major to commit right now with 1.2.0 very close. It would also be nicer if we didn't have to poke in internals here.

I've pulled out the add_spelling() and remove_spelling() part, as that looks good, and might be useful to some users. Applied that in r14076. I'll attach a patch with the remaining changes, tidied up a bit.

Changed 6 months ago by olly

Updated patch with unapplied changes

Changed 6 months ago by olly

  • milestone changed from 1.1.5 to 1.2.x

Updating milestone.

Note: See TracTickets for help on using tickets.