Ticket #141 (closed enhancement: released)

Opened 21 months ago

Last modified 20 months ago

Writable xapian-tcpsrv should allow queueing of connections

Reported by: mhammond Owned by: olly
Priority: normal Milestone:
Component: Backend-Remote Version: SVN trunk
Severity: minor Keywords:
Cc: richard, sidnei Blocked By:
Operating System: All Blocking: #120

Description

From xapian-devel:

There may be 3 stages of this bug:

* In the first instance a writable server enforces a simple "one connection at a time" lock. * In the second instance, a remote connection includes a 'writable needed' flag when connecting and server enforces only one writable connection. * Finally, we implement a queue, where we respond to the client that a writable connection exists and we will block until it finishes. I'll avoid too much implementation speculation in this initial comment :)

Attachments

test_multi_writers.py (0.5 kB) - added by mhammond 21 months ago.
simple test using python
multi_writer_lock.2.patch (15.8 kB) - added by mhammond 21 months ago.
updated patch
multi_writer_lock.patch (13.3 kB) - added by mhammond 21 months ago.
patch as described

Change History

Changed 21 months ago by mhammond

simple test using python

Changed 21 months ago by olly

  • status changed from new to assigned

Looks pretty good - a few comments:

There's no point using add_database() in the check that we can open all the read-only databases - just create and ignore each one like you do for the writable case.

You're missing spaces in the dissected "Starting server" message (and "port" might as well be factored out too!)

xapian-progsrv inherently only has a single connection which is established right away, so checking that the database can be opened in xapian-progsrv.cc is just adding start-up overhead for no benefit.

Replacing db_->get_description() with dbpaths[0] makes the error context less useful - a little helper function with joins all the elements of dbpaths[0] with spaces between would be better.

initialise() is now only used once, so might as well be merged into the ctor.

"its" should be "it's" (comment in ~RemoteServer?)!

(And unconnected with your changes, RemoteServer?'s default timeout values are probably unused, in which case they should be removed).

If you have time to address those that would be great. Otherwise I can probably sort them out, but not right now as I need to stop being distracted from rejigging index_text.

Changed 21 months ago by olly

  • blocking set to 118
  • summary changed from tcp server doesn't enforce one-writer constraint to xapian-tcpsrv doesn't enforce one-writer constraint

I believe we should try to make sure that at least the first stage as listed in the original report is implemented for 1.0.0, since it's a rather nasty hole in our database locking scheme, and we have a patch to fix it which is "mostly there".

I've now addressed the parenthetical point about the unused timeouts in RemoteServer? - they're history.

I'll have a think about the test harness. It's great for running arbitrary API calls against anonymous databases, but it's hard work to do this sort of stuff as it stands.

Changed 21 months ago by olly

  • blocking changed from 118 to 118, 142

Changed 21 months ago by mhammond

updated patch

Changed 21 months ago by mhammond

Changed 21 months ago by mhammond

  • attachments.isobsolete changed from 0 to 1

Changed 21 months ago by mhammond

patch as described

Changed 21 months ago by olly

I'm working on applying this, but for some reason it causes problems with running tests under valgrind. I'm investigating...

Changed 21 months ago by olly

  • summary changed from xapian-tcpsrv doesn't enforce one-writer constraint to Writable xapian-tcpsrv should allow queueing of connections
  • blocking changed from 118, 142 to 120, 142
  • severity changed from normal to enhancement

Turns out I get the same valgrind issues without the patch too! So committed a revised version (the version here doesn't compile on UNIX, xapian-progsrv doesn't work, and GCC warns about the order of class member initialisation).

So this bug is no longer a 1.0.0 blocker - it would be nice to support queued connections in the 1.0 release series I though, so moving to block that metabug instead and updating the summary.

Changed 21 months ago by richard

  • cc richard@… added
  • blocking changed from 120, 142 to 120

Missed this discussion because I wasn't subscribed to the bug. We really need to work out how to make bugzilla email a list with all bug changes, or move to a new bugtracker (after 1.0, of course).

This bug clearly isn't a blocker for 142, since 142 is now fixed and this one isn't, so I'm removing the block on 142. Incidentally, I've just noticed a load of valgrind failures in the testsuite (exclusively for the remote-tcp backend).

I was about to investigate, but it sounds like Olly is already doing so, so

I'll get on with something else.

Changed 21 months ago by olly

I'd marked 142 as depending on this just to avoid anyone trying to fix that independently since there was a patch here. I realise the relationship didn't really fit "A blocks B". I would have merged the two bugs, but our bugzilla doesn't seem to have a way to do that.

I've pretty much run out of ideas on the valgrind issues, and then noticed that the problem wasn't made worse by this patch, so gave up and applied it. I'm not actively looking at them now, so feel free to.

Oddly I didn't notice them before though, and I do run "make check" pretty often! I did upgrade to feisty fairly recently (while working on the zlib patch - valgrind reported errors with that which turned out to be known problems which there's normally a suppression for but it wasn't in the standard list for the latest zlib library version - I couldn't recompile valgrind from source for some reason, so I gave up and upgraded, before I discovered it was a missing suppression) so perhaps that's related.

I see two different problems:

(A) This fails because with "EOF" errors (all tests fail, but restricting to zerodocid1 makes it easier to track down):

./runtest ./apitest -v -b remote zerodocid1

If you manually hack runsrv to add "2> /dev/null" to the last line (the exec) then the test passes. But redirecting to a file shows that there's no output on stderr. Also, I believe Xapian actually closes fd 2 anyway. So I'm a bit puzzled there.

(B) This reports memory leaks:

./runtest ./apitest -v -b remotetcp zerodocid1

They appear to be memory that's still accessible but that the allocations grow when you rerun the test. The valgrind output appears to just list a load of allocations which either happened before main, or as the test harness initialises. I poked at it for a bit, but this seems to be a bogus report of the type I've seen before, and even if it's real I don't view a "leak" of memory which we still have a pointer to as a showstopper for releasing 1.0.0.

Changed 20 months ago by richard

After investigating the valgrind issues with remotetcp, I quite agree that it looks like a false positive, so doesn't need to block 1.0.0 (though we should try and fix it to clean up the output at some point).

I can't make sense of the problem with the "remote" backend, but I see the same behaviour as you.

Changed 20 months ago by olly

  • attachments.isobsolete changed from 0 to 1

Changed 20 months ago by olly

  • status changed from assigned to closed
  • resolution set to fixed

Closing in favour of bug#145 which covers the remaining issues in this bug.

Changed 20 months ago by olly

  • resolution changed from fixed to released

Fixed in 1.0.0 release.

Changed 20 months ago by trac

  • platform set to All
Note: See TracTickets for help on using tickets.