Ticket #295 (assigned enhancement)

Opened 3 months ago

Last modified 8 weeks ago

Add support for multi and remote databases with PostingSources

Reported by: richard Owned by: richard
Priority: normal Milestone: 1.1.0
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Operating System: All Blocking:

Description

Currently, PostingSources? don't work with multiple database searches.

The attached patch adds support for this, as follows:

  • The PostingSource?.reset() method acquires a database parameter. This is passed a subdatabase to iterate over, and is guaranteed to be called by the matcher before any of the other methods are.
  • PostingSource? also acquires a clone() method, which should return a copy of the posting source, with the same initial parameters.

The matcher then clones the posting source for each subdatabase, and calls reset() to set up an iteration over each database.

Drawbacks with the patch:

  • Because I want to supply a "Xapian::Database" reference rather than a "const Xapian::Database::Internal *" to the reset() method, I have construct a new Xapian::Database object in queryoptimiser.cc (line 68). However, this requires me to cast the const away, since there is no way to construct a Xapian::Database object otherwise. This works, and might even be safe (the only non-const methods on Database are reopen() and keep_alive(), which probably shouldn't be called in the middle of iteration, anyway), but is very ugly.
  • SWIG complains about the clone() method, because it returns a raw pointer, which SWIG doesn't know how to handle. My patch just tells SWIG to ignore this method, but this means that SWIG wrapped languages can't subclass PostingSource? in a way which will work with multiple databases.

An alternative to the clone() problem might be to require classes to provide "serialise" and "unserialise" methods. This could even allow (with some extra work) PostingSources? to work with remote databases (we'd have to provide some way to register posting sources by name, similar to that used for Weight objects).

Attachments

multipostsource.patch (11.7 kB) - added by richard 3 months ago.
Patch to allow PostingSource? to work with multi databases
remotepostsource.patch (31.5 kB) - added by richard 2 months ago.
Patch to make PostingSource? work with remote databases
remotepostsource2.patch (29.8 kB) - added by richard 2 months ago.
Latest version of patch

Change History

Changed 3 months ago by richard

Patch to allow PostingSource? to work with multi databases

Changed 2 months ago by richard

  • status changed from new to assigned
  • summary changed from Add support for multi databases with PostingSources to Add support for multi and remote databases with PostingSources

Will shortly be attaching a patch which adds support for this for remote databases, too.

With this patch, PostingSource? requires a clone() method to be implemented in all cases, and also requires a name(), serialise() and unserialise() method to be implemented for remote matches. The PostingSource? subclass needs to be registered with the remote server - ValueWeightPostingSource? is registered automatically.

I've also changed the Query constructor which takes a PostingSource? to take one by reference now, and clone() it. The Query::Internal object then deletes the newly allocated source in its destructor - this is needed to make the code in RemoteServer? which unserialises the query able to supply a newly created source to Query::Internal without having to write lots of code to handle the case of such an owned pointer specifically. It's also less likely to lead to memory leaks in general, I think, and should avoid crashes due to PostingSources? created external to Xapian going out of scope before get_mset() is called. (This has been a particular problem that I've noticed with the Python bindings - easy enough to work around, but a bit startling the first time it happens.)

Changed 2 months ago by richard

Patch to make PostingSource? work with remote databases

Changed 2 months ago by richard

Hmm - while remotepostsource.patch does make PostingSources? work with remote databases, it requires the clone() method to be implemented. This is a problem for the SWIG bindings, since I've had to tell swig to ignore the clone() method, since it returns a raw pointer which SWIG doesn't know what to do with.

Still, if you posting sources are implemented in C++, it works nicely.

Changed 2 months ago by olly

Hmm, it seems to me that clone() is potentially a costly operation, so forcing a call to it for any use of a PostingSource? subclass seems problematic, even without the wrapping problems. It's also a pain to force the user to define it - the earlier patch where you could leave it and get a default implementation which returns NULL so it doesn't work with multi-databases or remotely is easier to work with if you don't care about these situations.

I also wonder if it's better to set the database in clone() and keep reset() as it is now - that seems more logical to me if we don't expect objects to have their database changed.

Changed 2 months ago by richard

Latest version of patch

Changed 2 months ago by richard

Attached a new version of the patch - this time, clone() is optional again (default implementation returns NULL). I've had to add some tracking of whether the Query() owns the supplied posting source again.

Still to do:

  • should the Query constructor call clone on the posting source to get an owned copy (falling back to having a non-owned copy if clone returns NULL), or just assume that the supplied object remains valid indefinitely. I'm tending towards the latter (and that's what in this patch), but this does mean that if clone() is costly but implemented, it's called 1 more time than it needs to be.
  • Haven't looked into setting the database in clone() - note that it would also need to be able to be passed to unserialise() as well as clone(), of course.
  • Can we avoid the const_cast() when making the database object to pass to reset()?
  • general review of patch.

Changed 8 weeks ago by olly

should the Query constructor call clone on the posting source to get an owned copy (falling back to having a non-owned copy if clone returns NULL), or just assume that the supplied object remains valid indefinitely. I'm tending towards the latter (and that's what in this patch), but this does mean that if clone() is costly but implemented, it's called 1 more time than it needs to be.

Um, do you mean "tending towards the former"?

Changed 8 weeks ago by richard

Oops, yes, I meant "tending towards the former".

Note: See TracTickets for help on using tickets.