Opened 16 years ago

Closed 15 years ago

#295 closed enhancement (fixed)

Add support for multi and remote databases with PostingSources

Reported by: Richard Boulton Owned by: Richard Boulton
Priority: normal Milestone: 1.1.0
Component: Matcher Version: SVN trunk
Severity: normal Keywords:
Cc: Blocked By:
Blocking: #206 Operating System: All

Description

Currently, PostingSources don't work with multiple database searches.

The attached patch adds support for this, as follows:

  • The PostingSource.reset() method acquires a database parameter. This is passed a subdatabase to iterate over, and is guaranteed to be called by the matcher before any of the other methods are.
  • PostingSource also acquires a clone() method, which should return a copy of the posting source, with the same initial parameters.

The matcher then clones the posting source for each subdatabase, and calls reset() to set up an iteration over each database.

Drawbacks with the patch:

  • Because I want to supply a "Xapian::Database" reference rather than a "const Xapian::Database::Internal *" to the reset() method, I have construct a new Xapian::Database object in queryoptimiser.cc (line 68). However, this requires me to cast the const away, since there is no way to construct a Xapian::Database object otherwise. This works, and might even be safe (the only non-const methods on Database are reopen() and keep_alive(), which probably shouldn't be called in the middle of iteration, anyway), but is very ugly.
  • SWIG complains about the clone() method, because it returns a raw pointer, which SWIG doesn't know how to handle. My patch just tells SWIG to ignore this method, but this means that SWIG wrapped languages can't subclass PostingSource in a way which will work with multiple databases.

An alternative to the clone() problem might be to require classes to provide "serialise" and "unserialise" methods. This could even allow (with some extra work) PostingSources to work with remote databases (we'd have to provide some way to register posting sources by name, similar to that used for Weight objects).

Attachments (5)

multipostsource.patch (11.7 KB ) - added by Richard Boulton 16 years ago.
Patch to allow PostingSource to work with multi databases
remotepostsource.patch (31.5 KB ) - added by Richard Boulton 16 years ago.
Patch to make PostingSource work with remote databases
remotepostsource2.patch (29.8 KB ) - added by Richard Boulton 16 years ago.
Latest version of patch
remotepostsource3.patch (45.3 KB ) - added by Richard Boulton 15 years ago.
Updated implementation patch (applies to trunk rev 11827)
weightnames.patch (2.1 KB ) - added by Richard Boulton 15 years ago.
Patch to change names of weighting schemes

Download all attachments as: .zip

Change History (21)

by Richard Boulton, 16 years ago

Attachment: multipostsource.patch added

Patch to allow PostingSource to work with multi databases

comment:1 by Richard Boulton, 16 years ago

Status: newassigned
Summary: Add support for multi databases with PostingSourcesAdd support for multi and remote databases with PostingSources

Will shortly be attaching a patch which adds support for this for remote databases, too.

With this patch, PostingSource requires a clone() method to be implemented in all cases, and also requires a name(), serialise() and unserialise() method to be implemented for remote matches. The PostingSource subclass needs to be registered with the remote server - ValueWeightPostingSource is registered automatically.

I've also changed the Query constructor which takes a PostingSource to take one by reference now, and clone() it. The Query::Internal object then deletes the newly allocated source in its destructor - this is needed to make the code in RemoteServer which unserialises the query able to supply a newly created source to Query::Internal without having to write lots of code to handle the case of such an owned pointer specifically. It's also less likely to lead to memory leaks in general, I think, and should avoid crashes due to PostingSources created external to Xapian going out of scope before get_mset() is called. (This has been a particular problem that I've noticed with the Python bindings - easy enough to work around, but a bit startling the first time it happens.)

by Richard Boulton, 16 years ago

Attachment: remotepostsource.patch added

Patch to make PostingSource work with remote databases

comment:2 by Richard Boulton, 16 years ago

Hmm - while remotepostsource.patch does make PostingSources work with remote databases, it requires the clone() method to be implemented. This is a problem for the SWIG bindings, since I've had to tell swig to ignore the clone() method, since it returns a raw pointer which SWIG doesn't know what to do with.

Still, if you posting sources are implemented in C++, it works nicely.

comment:3 by Olly Betts, 16 years ago

Hmm, it seems to me that clone() is potentially a costly operation, so forcing a call to it for any use of a PostingSource subclass seems problematic, even without the wrapping problems. It's also a pain to force the user to define it - the earlier patch where you could leave it and get a default implementation which returns NULL so it doesn't work with multi-databases or remotely is easier to work with if you don't care about these situations.

I also wonder if it's better to set the database in clone() and keep reset() as it is now - that seems more logical to me if we don't expect objects to have their database changed.

by Richard Boulton, 16 years ago

Attachment: remotepostsource2.patch added

Latest version of patch

comment:4 by Richard Boulton, 16 years ago

Attached a new version of the patch - this time, clone() is optional again (default implementation returns NULL). I've had to add some tracking of whether the Query() owns the supplied posting source again.

Still to do:

  • should the Query constructor call clone on the posting source to get an owned copy (falling back to having a non-owned copy if clone returns NULL), or just assume that the supplied object remains valid indefinitely. I'm tending towards the latter (and that's what in this patch), but this does mean that if clone() is costly but implemented, it's called 1 more time than it needs to be.
  • Haven't looked into setting the database in clone() - note that it would also need to be able to be passed to unserialise() as well as clone(), of course.
  • Can we avoid the const_cast() when making the database object to pass to reset()?
  • general review of patch.

comment:5 by Olly Betts, 16 years ago

should the Query constructor call clone on the posting source to get an owned copy (falling back to having a non-owned copy if clone returns NULL), or just assume that the supplied object remains valid indefinitely. I'm tending towards the latter (and that's what in this patch), but this does mean that if clone() is costly but implemented, it's called 1 more time than it needs to be.

Um, do you mean "tending towards the former"?

comment:6 by Richard Boulton, 16 years ago

Oops, yes, I meant "tending towards the former".

comment:7 by Richard Boulton, 15 years ago

Blocking: 206 added

(In #206) Added patches implementing this, and testing it (just with the python bindings, though).

Current patch won't support serialising queries which use PostingSources, but I think that's a reasonable restriction for now. To support it, we'll need to pass an extra argument to the unserialise method giving the list of PostingSources available to be unserialised to, and add ways to serialise and clone Posting sources. This is implemented in the patch attached to ticket #295, so I'm adding a dependency on that ticket for this bug. I'm planning to implement some C++ level tests for this, and then apply the patch, since I could do with the feature for some upcoming work.

comment:8 by Richard Boulton, 15 years ago

Thinking about the remaining issues listed in comment:4 :

  • I think calling clone() in the Query constructor (as done in the patch) is right. The only potential downside is if clone() is very expensive - in which case, the implementor of the PostingSource should probably move the part which is expensive to clone into a reference counted resource. Doing it this way makes the code neat, and also allows users not to worry about ensuring the original PostingSource remains valid (as long as they implement a clone() method).
  • I don't think the database should be set in clone() - I'm fairly convinced that reset() is the correct place to specify the database. A big reason for this is that currently clone() doesn't need to be implemented - if the database was set by it, users who don't want to handle the complication of multiple databases would need to implement a clone() method which returns NULL, but modifies the state of the PostingSource whose clone method was called. This would be bad design, but would also require casting away const, since the clone() method is (and should be) const. Also, it would be unclear what database should be passed to the clone() method called by the Query constructor when passed a posting source.
  • Still don't know how to avoid the const_cast.
  • Have done some review of the patch, but it's basically fine apart from the const_cast as far as I can see.

I think the const_cast is a livable ugliness, so I'm inclined to apply the latest patch fairly soon unless there are major objections.

by Richard Boulton, 15 years ago

Attachment: remotepostsource3.patch added

Updated implementation patch (applies to trunk rev 11827)

comment:9 by Richard Boulton, 15 years ago

Resolution: fixed
Status: assignedclosed

Applied this patch, with minor tidying, to trunk in changeset [11838] (and to the bindings in [11839]). This closes this issue.

comment:10 by Olly Betts, 15 years ago

You didn't actually wait to hear if there were any major objections from me...

comment:11 by Richard Boulton, 15 years ago

Sorry - got a bit carried away perhaps. Are there any major objections from you? ;-)

FWIW, the cause of me getting carried away was that a number of branches and features depending on this (in particular, the geospatial branch and the fix for ticket #206, and also several non-core posting sources I've developed), which I needed to work with multi-databases for a client.

I'm open to comments about changing the implementation or design, though...

comment:12 by Olly Betts, 15 years ago

Resolution: fixed
Status: closedreopened

A couple of minor things:

  • As we briefly discussed on IRC, we should explicitly point out that the database passed to reset() will be a subdatabase in the case of searching of combined dbs, and that the docids will be those of the subdatabase (unless we do already - I didn't look in detail to check).
  • Slightly wondering about what name() should return - Xapian::FooPostingSource is nicely namespaced to avoid clashes with user classes, but is inconsistent with Xapian::Weight - for consistency with that we'd return just "Foo" here (although we're about to overhaul the API for that so we could change it). Having everything start "Xapian::" will make the map operations slower as any comparison of keys will have to scan the whole prefix, but that's probably not a big deal - it does seems a bit clumsy though. Not sure I see a better answer.

And a couple of possibly less minor things:

  • valueweightsource1 is failing when assertions are on (I checked revision 11840 which failed, so it's not a change made since this; revision 11837 skips this test):
$ ./apitest -b multi_chert valueweightsource1 -v
Running tests with backend "multi_chert"...
Running test: valueweightsource1...
RAW
 AssertionErrorAssertionError: matcher/multimatch.cc:267:
within_DBL_EPSILON(wt,pl->recalc_maxweight()) : values were 135 and 100
  • If you enable externalsource4 for the multi backend, it fails, but it looks like it should work to me:
$ ./apitest -b multi_chert externalsource4 -v
Running tests with backend "multi_chert"...
Running test: externalsource4...
OP_SCALE_WEIGHT 0
OP_FILTER
 AssertionErrorAssertionError: matcher/multiandpostlist.cc:57: sum <= MultiAndPostList::get_termfreq_est() : values were 13 and 9

comment:13 by Richard Boulton, 15 years ago

I've added comments about the database passed to posting sources.

I've made a separate ticket (#336) about the assertion error for valueweightsource, so it doesn't get forgotten.

Looking into externalsource4 now.

Regarding the namespacing of name(): the lookup of names isn't exactly in a performance critical location, so I think having neat and clean code is the most important thing. If performance became a problem, we could always do a trick (like defining a custom comparator starting the comparison at the 9th character) at a later date. Therefore, consistency with Xapian::Weight is the only reason to change this. However, I think it would be much better to change Xapian::Weight to match this approach. There currently aren't likely to be many user weighting classes, and even if there are, unless they're using "Xapian::" as the prefix of their name, changing our names won't break anything. We're already changing the network protocol, so now's a good time to change Xapian::Weight::name.

by Richard Boulton, 15 years ago

Attachment: weightnames.patch added

Patch to change names of weighting schemes

comment:14 by Richard Boulton, 15 years ago

Attached a patch which changes the Weight classes names to be full names, for consistency with PostingSources.

comment:15 by Richard Boulton, 15 years ago

Fixed externalsource4 with multi and assertions in r11975. Was a problem with the posting source used in the test.

All that remains for this ticket is therefore the issue of what the name() method should return. I believe the correct solution is to apply weightnames.patch.

comment:16 by Olly Betts, 15 years ago

Resolution: fixed
Status: reopenedclosed

I realise it's not a highly performance sensitive area. I noticed that this was inconsistent with a similar existing API, and was flagging that this, and while thinking about it, it struck me that the use of namespaces rather pessimised the use of a map. But I think namespacing the name does make most sense.

Good point that any incompatible change in Xapian::Weight names for built-in classes is highly unlikely to actually break user code (user weighting schemes will need changing anyway, so it wouldn't be a huge deal if they did).

But applying a patch to change the weight names now is just going to causes conflicts with my uncommitted Xapian::Weight::Internal rewrite, and it's going to be easier to less work to just change the names in that. So thanks, but no thanks...

Note: See TracTickets for help on using tickets.