Opened 18 years ago

Closed 18 years ago

Last modified 18 years ago

#123 closed defect (released)

Adding large documents to remote db may cause a timeout

Reported by: Richard Boulton Owned by: Richard Boulton
Priority: normal Milestone:
Component: Backend-Remote Version: SVN trunk
Severity: normal Keywords:
Cc: Olly Betts, Sidnei da Silva, Mark Hammond Blocked By:
Blocking: #120 Operating System: All

Description

Sidnei has reported to me that adding a large document to a remote database may cause a timeout. His message about this follows:

Something else that might be important here. Large documents might cause a timeout with the standard settings. We might need to raise the default timeout by informing a large timeout value when connecting. This error happening while indexing the plain text extracted from a 5mb PDF file:

File "c:\src\xapian-sei\src\lemur\xapian\sei.py", line 1282, in addDocument

xapid = self._db.add_document(doc)

File "c:\src\dev\instance\es30\lib\python\xapian.py", line 951, in add_document

def add_document(*args): return _xapian.WritableDatabase_add_document(*args)

NetworkTimeoutError: REMOTE:Timeout expired while trying to read; context was: WritableDatabase() (remote:tcp(localhost:8100))

Sidnei - can you add further comments on this bug to this bug report, to help us investigate it. Most importantly, is it possible that the timeout happens before the db.add_document() method is called, because the process of preparing the xapian document from the PDF takes a long time?

If so, the solution is simply to increase the timeout, and we don't need to worry about this being a bug in Xapian core. If the timeout is happening because add_document() itself takes a long time, we need to look into that, since it's not reasonable for a timeout to occur whilst a method is in progress.

It occurs to me, thinking about this, that if a flush() method on a remote database takes a long time, as it may reasonably do, there is nothing to stop a timeout occurring. The only reasonable solution I can see is to make RemoteServer spawn a separate thread with the responsibility of sending a timeout message periodically until the flush has completed. This is probably too complex to implement right now, so will have to wait unless it proves to be a critical problem.

Attachments (3)

zero_timeout.patch (972 bytes ) - added by Mark Hammond 18 years ago.
patch the allows for no timeouts
xapian-remote-server-timeout.patch (822 bytes ) - added by Olly Betts 18 years ago.
Prospective patch to fix timeout handling
network_errors.patch (1.3 KB ) - added by Mark Hammond 18 years ago.
as described

Download all attachments as: .zip

Change History (20)

comment:1 by Mark Hammond, 18 years ago

It turns out this has nothing to do with large documents - it is the 'idle' timeout expiring at the server. You can work-around it by running the server with '-i 2000000' or some other suitably large number (but be warned - too large will apparently overflow silently :)

What happens is:

  • Client establishes a connection, and for whatever reason, does not use it for

the server's idle timeout period.

  • Server throws timeout exception. Server writes this timeout exception to the

client, blocking until the client reads.

  • Client eventually sends request. Immediately reads back the timeout exception.
  • Someone gets a little confused, resulting in server still attempting to write

to the client, but failing (and logging a "write failed" message) as the client has given up due to the error.

It seems strange that a timeout exception would cause the exception message to be written to the client, and otherwise be treated like other errors. The client would need to search for 'timeout' in the remote exception. Would it be better to just close the connection, which should cause the client to then get a local network error, which is can examine to see if it is dead connection?

comment:2 by Richard Boulton, 18 years ago

Blocking: 118 added

Thanks for that analysis Mark - we should be able to fix this somehow soon. Adding to list of issues to fix for 1.0.

Changing the server to just close the connection sounds like a plausible fix - Olly: can you see any reason why this wouldn't be a good idea?

comment:3 by Olly Betts, 18 years ago

Cc: olly@… added
rep_platform: PCAll

I think you usually don't even want idle timeouts (or active timeouts either generally) for a remote WritableDatabase, so being able to specify "no timeout" would be useful (and should perhaps be the default).

The client would need to search for 'timeout' in the remote exception.

The exception should be of type NetworkTimeoutError - no need to search for strings.

But it's clearly mad that the server blocks sending a message to inform the client of a timeout. I think this could even lead to a deadlock if the client blocks trying to write to the server while the server has blocked trying to send the idle timeout exception!

This is really just the standard exception passing happening without us thinking it through. It would be nice to be able to distinguish "idle timeout" from "link died" though.

I don't see this particularly needs to block 1.0 though as it's fixable without API or remote protocol changes. I can see it shouldn't be too hard to fix (if only crudely) and that you might want a fix soon, but that's not the same as being a blocker for 1.0...

comment:4 by Richard Boulton, 18 years ago

Blocking: 118 removed

Point about not being a 1.0 blocker taken - I'll try and get it fixed for it though, but it shouldn't block it. Block removed.

comment:5 by Mark Hammond, 18 years ago

Thanks for the feedback guys,

I think there is some advantage to rejecting connections after some idle time, to prevent a huge number of inactive connections. IIUC, http allows a similar scenario in allowing established connections to be closed, presumably for this same reason. (From rfc2616: "Servers will usually have some time-out value beyond which they will no longer maintain an inactive connection...When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection...A client, server, or proxy MAY close the transport connection at any time. For example, a client might have started to send a new request at the same time that the server has decided to close the "idle" connection. From the server's point of view, the connection is being closed while it was idle, but from the client's point of view, a request is in progress...This means that clients, servers, and proxies MUST be able to recover from asynchronous close events.") - it looks like we have identical concerns?

It would be nice to be able to distinguish "idle timeout" from "link died" though.

Well, on one hand I agree, on the other I'm not so sure. From the client's POV, the server closed the connection. This could have happened for any number of reasons, including the server configuration options - but its not clear to me that the client needs to know why - they would just take the action regardless.

But I do think it should be possible to differentiate an idle timeout from an "active timeout" - an idle timeout means we didn't even *start* processing the request, where an active timeout (generally) means we did start, but the server took too long to process the request - eg, the "flush" example Richard mentioned. An active timeout expiring probably just means we lost the response, but the request was processed - and it probably *is* appropriate the timeout exception be serialized to the client in this case.

This is leading me towards a new exception deriving from NetworkTimeoutError, called (say) NetworkIdleTimeoutError. This new exception would cause the connection to be immediately closed. A NetworkTimeoutError (which should now only be raised for 'active' timeouts) would operate as it does now.

If you guys can offer a little more guidance on the way forward, I'm happy to try and put something together...

comment:6 by Olly Betts, 18 years ago

Hmm, I wrote a lengthy response last week, but it seems I failed to submit it and it's no longer in an open tab in firefox. So sorry if this is rather terse.

You can't build up a "huge number of inactive connections" for a WritableDatabase, because access is exclusive. For read-only access that's relevant though.

A writable remote backend is unlike HTTP in that state is lost if the connection is lost (if the connection drops, unflushed changes are lost). The analogy applies more closely to a read-only backend, apart from not necessarily having the database open at the same revision as before.

So that's why I think a writable remote server should default to not having a timeout (I suggest a timeout value of 0' should mean don't timeout' as I don't see a use for a genuine timeout of 0, and 1ms is close enough to 0 if you did want one for some reason...)

There's no point adding Xapian::NetworkIdleTimeoutError if the plan is to just close the link at this point, since this new exception can't propagate across the link and so will never be seen outside of the remote server. It can just be internal like the "ConnectionClosed" exception in net/remoteserver.cc.

comment:7 by Richard Boulton, 18 years ago

Status: newassigned

by Mark Hammond, 18 years ago

Attachment: zero_timeout.patch added

patch the allows for no timeouts

comment:8 by Olly Betts, 18 years ago

attachments.isobsolete: 01

(From update of attachment 73) Thanks, applying (with some suitable documentation comments!)

comment:9 by Olly Betts, 18 years ago

I'm just about to commit a change so that writable connections default to not timing out (I'm leaving the connection timeout unchanged at 10 seconds).

So to summarise what remains:

  • We shouldn't throw an error for a timeout in the remote server, since that

just blocks on the already unresponsive client! Instead we should just close the connection.

  • OmTime shouldn't overflow if a large timeout is specified (actually, I have a

partial patch for this somewhere I think - I started rewriting OmTime to fix compilation warnings about possible overflow from one of the UNIX vendor compilers, but got distracted by more urgent work...)

If there's anything else I've missed, please note it!

by Olly Betts, 18 years ago

Prospective patch to fix timeout handling

comment:10 by Olly Betts, 18 years ago

Blocking: 120 added

Marking for 1.0 series.

comment:11 by Mark Hammond, 18 years ago

Cc: mhammond@… added

by Mark Hammond, 18 years ago

Attachment: network_errors.patch added

as described

comment:12 by Olly Betts, 18 years ago

attachments.isobsolete: 01

comment:13 by Olly Betts, 18 years ago

attachments.isobsolete: 01

comment:14 by Olly Betts, 18 years ago

MarkH: could you put a more helpful description than "as described" on patches?

It's OK in the context of the comment which adds them, but in the attachment

list it provides no clue as to what the patch is for...

First of all, I've fixed calls to FormatMessage to strip "\r\n".

I've also applied Mark's version of the patch, but refactored quite a bit. We now rethrow NetworkTimeoutError, and handle it in the caller by logging the timeout but only if `verbose' is set, which I think makes most sense and seems to address the pondering comments in the patch. Does that seem good to you?

Currently checking the msvc build still works in buildbot.

I had a quick look at the ERROR_NETNAME_DELETED issue - I think your suggested fix is reasonable, assuming by "public" you don't mean in the external API but rather putting it in common/remoteconnection.h or similar.

comment:15 by Olly Betts, 18 years ago

I think this bug is now either mostly fixed, or perhaps even completely fixed!

It appears the only potentially remaining issue is ERROR_NETNAME_DELETED, but possibly the applied changes make special handling here unnecessary now. But I don't really follow well enough to know.

MarkH: Do you think it's still needed? If so, could you provide a patch for that part?

No great urgency of course, though it would be nice to be able to close this bug!

comment:16 by Mark Hammond, 18 years ago

Resolution: fixed
Status: assignedclosed

Any ERROR_NETNAME_DELETED issues are purely cosmetic - and as such should probably get their own bug when someone really cares enough to open one (which isn't me :) So I'm happy for it to be closed (and doing so!)

comment:17 by Richard Boulton, 18 years ago

Operating System: All
Resolution: fixedreleased

This is fixed in the upcoming 1.0.2 release (actually, I don't think anything has changed since the 1.0.1 release, but everyone seems to agree that it's fixed). Therefore, marking it as closed.

Note: See TracTickets for help on using tickets.