Opened 20 years ago

Closed 17 years ago

Last modified 16 years ago

#38 closed enhancement (released)

Enable remote backend to support all database methods and matcher features

Reported by: Olly Betts Owned by: Richard Boulton
Priority: low Milestone: 1.0.0
Component: Backend-Remote Version: SVN trunk
Severity: minor Keywords:
Cc: Richard Boulton Blocked By:
Blocking: #118 Operating System: All

Description (last modified by Richard Boulton)

The following database methods aren't implemented:

NetworkDatabase::do_open_post_list() NetworkDatabase::open_position_list() NetworkDatabase::get_doclength() NetworkDatabase::open_allterms() NetworkDatabase::get_collection_freq()

And the matcher doesn't support sort_key, sort_bands, or MatchBiasFunctor.

Attachments (2)

xapian-remote-postlist.patch (10.1 KB ) - added by Richard Boulton 17 years ago.
Basic implementation of remote database postlists
betterdebug.patch (641 bytes ) - added by Olly Betts 17 years ago.
Debug tracing tweaks to help investigate this

Download all attachments as: .zip

Change History (29)

comment:1 by Olly Betts, 20 years ago

Component: Library APIBackend-Remote
Status: newassigned

comment:2 by Olly Betts, 19 years ago

sort_bands has now been removed, so progress of sorts!

comment:3 by Olly Betts, 18 years ago

sorting on a value was implemented in 0.9.2, so the updated list is:

NetworkDatabase::do_open_post_list() NetworkDatabase::open_position_list() NetworkDatabase::get_doclength() NetworkDatabase::open_allterms() NetworkDatabase::get_collection_freq()

And the matcher doesn't support MatchBiasFunctor.

Also, the remote backend doesn't support creating/updating a database (i.e. WritableDatabase)

comment:4 by Olly Betts, 18 years ago

I now have code (not yet checked-in) which implements everything except:

NetworkDatabase::do_open_post_list()

And I think MatchBiasFunctor still isn't supported.

comment:5 by Olly Betts, 18 years ago

Now in SVN.

comment:6 by Olly Betts, 17 years ago

Priority: lowestlow
Version: 0.8.10.9.9

get_lastdocid() is missing too.

And the lack of do_open_post_list() means that the versions of replace_document() and delete_document() which take termnames don't work (they should be handled by passing them across for efficiency, but currently the standard implementation is at a higher level).

comment:7 by Olly Betts, 17 years ago

I've just implemented RemoteDatabase::get_lastdocid().

That leaves: RemoteDatabase::do_open_post_list()

And we ought to handle the termname variants of add_document() and replace_document() on the remote side.

comment:8 by Olly Betts, 17 years ago

Version: 0.9.9SVN HEAD

comment:9 by Richard Boulton, 17 years ago

Owner: changed from Olly Betts to Richard Boulton
Status: assignednew

I've just implemented handling for the termname variants of add_document() and replace_document() on the remote side.

Therefore, RemoteDatabase::do_open_post_list() is the only unimplemented method.

I'm looking at implementing this tomorrow - obviously it isn't sensible to send individual postings over the wire - equivalently, an implementation which sends the whole posting list over wouldn't be terribly scalable. I'll try implementing something which just passes the postlist across the link in fixed-size chunks first, but better ideas would be welcome.

comment:10 by Olly Betts, 17 years ago

Status: newassigned

Thanks for sorting out another 2 methods.

I think the main reason for wanting do_open_post_list() is completeness - if delete_document(TERM) and replace_document(TERM) aren't implemented using it then scalability isn't a major concern, but it's useful if you can just run "delve" on a remote database, for example.

There's also a trade-off between efficiency if the whole posting list is wanted (in which case sending lots of chunks is worse because of latency issues) and efficiency if only a small amount is. The only direct uses of the posting list I can think of (outside of those in xapian-core handled specially for the remote database) are in delve and omindex/scriptindex. In delve we want the whole lot.

In omindex/scriptindex we use it for unique ids to find related document id, so

the whole posting list should be a single entry (and you'd probably arrange to update a large index via the file system when using omindex/scriptindex anyway).

The current test coverage should be pretty good for an implementation which sent the lot, whereas sending chunks would presumably have more corner cases we might not be exercising.

Plus I'd *really* like to get 1.0 released this month, and for it to be a relatively solid release, so I'm getting increasingly concerned about the rate we're still accumulating new code.

So overall I think my preference would be to keep it simple in the initial implementation, then optimise the implementation later if/when we find real world use cases that require it.

comment:11 by Richard Boulton, 17 years ago

Blocking: 118 added

I have two use cases which needing do_open_post_list - one is to be able to get document IDs given a unique term (as in omega/scriptindex), but that could be worked around by running a query instead. The other is to be able to get a list of all documents in the database, so that I can check the contents of the database against an external database as part of a reconciliation step to check data integrity.

Reading the whole posting list into memory is likely to cause problems for this latter case (both because it will cause a lot of memory to be used, and because it could take so long that the network connection could time out). However, your point about implementation complexity is well made, so I'll do a simple implementation first, and optimise it later. I just suspect strongly that I'll need to move to a chunked implementation sooner rather than later to make it work for me.

Regarding 1.0 - I'd very much like to see a 1.0 release sooner rather than later, too - there's certainly enough improvements in 1.0 to merit it. I've started a 1.0 release tracker bug (number #118), and added this bug to it's dependencies.

comment:12 by Olly Betts, 17 years ago

The first use case isn't a problem anyway, since the posting list is only one entry...

The second is more plausible though.

I was thinking you wouldn't need to read all the postings into memory but instead just read from the network as the posting list is advanced with next/skip_to. But now I think about it, that would block other operations on the remote database unless we implemented some form of multiplexing which is just different complexity. So unless we want multiplexing for other reasons, that's probably not a good direction.

But how big a database are you using? If the postings are cached sanely then it's probably going to use well under (2 * database size) bytes (since for dense postings delta compression gives one byte for the docid, and wdf is rarely > 255; for sparse postings we need more bytes per docid, but then we can have at most db size/128 postings which need a 2 byte docid delta).

comment:13 by Richard Boulton, 17 years ago

Database sizes I'm currently working with are up to 20 million documents, but 40Mb memory usage isn't out of the question (and should transfer over a fast link quickly enough not to cause timeouts - the bottleneck will more likely be the time taken to read the entire list from disk).

I'll see how it goes. Thanks for the comments.

comment:14 by Richard Boulton, 17 years ago

Cc: richard@… added

by Richard Boulton, 17 years ago

Basic implementation of remote database postlists

comment:15 by Richard Boulton, 17 years ago

The point c) in the first paragraph of the previous comment was meant to read,

"c) has the entire implementation of the NetworkPostList class in a header file."

comment:16 by Olly Betts, 17 years ago

It just passes the whole postlist across the link and buffers it on the client side. Note that it passes the document lengths across as well as the wdfs and the document IDs (and that document lengths are doubles, so this takes quite a few bytes).

Actually, the double encoding is pretty compact, especially as these will actually be integer values. Integers should be encoded in at most 1 + the number of bytes required to store them (so 37.0 is 2 bytes total, 1023.0 is 3 bytes total).

But the plan is to stop storing the document length in posting lists and add a posting list like structure to hold all the docids just once, so we'll need to revise this at some point. It's not a huge problem for now, but try to avoid building assumptions about it in too deeply (probably not a risk anyway).

This is mainly because it was the easiest way to implement the get_doclength() method for remote postlists, but also because the python bindings currently always return the document length when iterating through a postlist.

They also create a PositionListIterator every time, which is currently worse. I've been meaning to look at doing that lazily (though you probably have better skills for that, so feel free to take a look).

The point c) in the first paragraph of the previous comment was meant to read, "c) has the entire implementation of the NetworkPostList class in a header file."

My informal current rules are that trivial non-virtual methods are OK in headers, but all other methods ought to be defined in a .cc file. I've been avoiding defining inline methods in headers outside of the class, since I think needing to do that is a strong pointer that the method is too large to belong in a header. As usual, not all current source code conforms, though I've done some remedial work. I'll document this in HACKING (though of course I'm open to other opinions

The patch mostly looks good (I've applied the change to kill "N" since that's unrelated to this).

This line in RemoteServer::msg_postlist() is a bit mad! Just assign 0!

Xapian::docid lastdocid = static_cast<Xapian::termpos>(0);

And this in RemoteDatabase::do_open_post_list() is unused:

AutoPtr<NetworkPostList> plist;

Also, the header guards I've been using on new code are of the form XAPIAN_INCLUDED_NET_POSTLIST_H rather than OM_HGUARD_NET_POSTLIST_H.

I've just been going through my random patch collection and found one which eliminates do_open_post_list() - the issue here is that we call term_exists() before trying to open the posting list, which is just extra work for most backends since checking if a term exists will involve looking at the postlist table (for quartz and flint at least, and I suspect this is going to be the case for almost any backend). So it's better to just ask the backend to open the postlist and return an EmptyPostList (or a FooPostList which acts just like an EmptyPostList) if the term doesn't exist! This change should particularly benefit the remote backend as it will eliminate an extra message exchange for each postlist you open.

I'm about to apply this, and it'll clash with this patch, but resolving the clashes should be trivial - if not, prod me and I'll sort out the problems as I'll have created them...

comment:17 by Richard Boulton, 17 years ago

I don't think it will take me very long at all to sort out a better lazy interface for the python postlists, so I recommend leaving that to me (though I won't be doing it today, because I'm off for a couple of days shortly).

Your suggestions on rules for inlining methods in header files sound reasonable, and match with what I'd do (I just didn't in this case because I was focusing on getting something to work).

The mad line in RemoteServer::msg_postlist() was a bad copy and paste error: oops! Whereas the plist line is just odd. Thanks for the code review.

I'm sure resolving any clashes will be easy, and that sounds like a good change to open_post_list, so go ahead and apply. :)

comment:18 by Richard Boulton, 17 years ago

I've committed the implementation for open_post_list() now.

Earlier comments indicate that MatchBiasFunctor may still be outstanding, but I haven't checked if it's been fixed without a clear comment saying so. Other than that, this bug is nearly resolved.

comment:19 by Olly Betts, 17 years ago

MatchBiasFunctor isn't worth worrying about. Once we sort out merging the ExternalPostList stuff MatchBiasFunctor can be removed, though then we can worry about how (or if) to support ExternalPostList for the remote backend.

Don't forget to update docs/remote_protocol.html though!

comment:20 by Richard Boulton, 17 years ago

Oops - I've now updated docs/remote_protocol.html

So, I think this bug can probably be closed now, since all current features which we're going to implement are fixed.

comment:21 by Olly Betts, 17 years ago

This now fails in a debug (--enable-assertions) build:

./apitest -v -b remotetcp replacedoc3 replacedoc4

Also fails with "remote" instead of "remotetcp".

I had a quick look (and will attach a small patch which I used to start to investigate this). It's unclear to me if the postlist doclen is wrong or there's some issue with how we fetch the database doclen in the remote case.

by Olly Betts, 17 years ago

Attachment: betterdebug.patch added

Debug tracing tweaks to help investigate this

comment:22 by Olly Betts, 17 years ago

attachments.isobsolete: 01

(From update of attachment 57) I've now committed changes equivalent to this patch.

comment:23 by Richard Boulton, 17 years ago

I think this issue is due to the postlist doclen being incorrect: I've got some tests which indicate this which I'll check in shortly, but it looks like the document length stored in postlists isn't being updated when replace_document() is called.

comment:24 by Richard Boulton, 17 years ago

attachments.isobsolete: 01

comment:25 by Olly Betts, 17 years ago

Resolution: fixed
Status: assignedclosed

Now fixed in SVN HEAD.

comment:26 by Olly Betts, 17 years ago

Operating System: All
Resolution: fixedreleased

Fixed in 1.0.0 release.

comment:28 by Richard Boulton, 16 years ago

Description: modified (diff)
Milestone: 1.0.0
Note: See TracTickets for help on using tickets.