Opened 10 years ago

Closed 8 years ago

#654 closed defect (fixed)

Getting EOF when using get_data() in a remote database

Reported by: German M. Bravo
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.22
Component: Backend-Chert
Version: 1.2.18
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description

I'm getting these exceptions when using get_data() in a remote chert database... in 1.2.18 and trunk (1.3.1):

Sometimes (most of the time): xapian.NetworkError: Received EOF (context: remote:tcp(localhost:8900))

Sometimes (not so often): xapian.DatabaseCorruptError: REMOTE:find_entry failed to find any entry at all! (context: remote:tcp(localhost:8900))

Other times, it works as expected. With Brass, it always seems to work.

I have only been able to reproduce this when using the remote database; however, this seems like a lower-level bug (in the chert backend).

This happens when you have two databases open, one for writing and one for reading. Run a read query on the read database and fetch a document's get_data(), then write something to the write database and commit, then run the read query (with get_data()) on the read database again.

To reproduce: create a new, empty Chert database by launching xapian-tcpsrv --timeout=0 --port=8900 --writable bug in an empty directory.

Then run the attached file with python bug.py; you should receive the exception xapian.NetworkError: Received EOF (context: remote:tcp(localhost:8900)). If not, try running python bug.py a few times.
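
(The attached bug.py isn't reproduced verbatim here, but the pattern described above looks roughly like the following minimal sketch using the Python bindings; the document contents, term and match-all query are placeholders, and only the port 8900 and the read / write-and-commit / re-read sequence come from the steps above.)

import xapian

# Two connections to the xapian-tcpsrv started above on port 8900:
# one writable, one read-only.
wdb = xapian.remote_open_writable("localhost", 8900)
rdb = xapian.remote_open("localhost", 8900)

def read_all_data(db):
    # Run a match-all query and fetch each matching document's data.
    enquire = xapian.Enquire(db)
    enquire.set_query(xapian.Query.MatchAll)
    for match in enquire.get_mset(0, 100):
        match.document.get_data()

read_all_data(rdb)                    # first read: works

doc = xapian.Document()
doc.set_data("placeholder data")      # placeholder content
doc.add_term("placeholder")
wdb.add_document(doc)
wdb.commit()                          # commit on the writable connection

read_all_data(rdb)                    # second read: this is where the
                                      # "Received EOF" NetworkError shows up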

The problem behind the EOF is that when ChertCursor::get_key() is called, Key::read() tries to assign (p + K1) with a Key::length() that is returned as -3.

Attachments (2)

bug.py (1.2 KB ) - added by German M. Bravo 10 years ago.
the_bug.cpp (2.3 KB ) - added by German M. Bravo 9 years ago.
Perhaps the -3 was fixed after fixing #675; however, the EOF error remains.


Change History (21)

by German M. Bravo, 10 years ago

Attachment: bug.py added

comment:1 by German M. Bravo, 10 years ago

The problem doesn't seem to be present in the Brass backend, but only in trunk (Brass in 1.2.18 also has the problem).

comment:2 by Olly Betts, 10 years ago

Milestone: 1.3.2
Status: new → assigned

comment:3 by Olly Betts, 10 years ago

I'm unable to reproduce this - I tried current trunk and the Debian packages of 1.2.17, both on SSD and on spinning disk (since it seems likely there's a timing-related factor, as you report it doesn't always happen).

I also tried 1.2.17 on spinning disk with either or both sides running under valgrind, in case that helped tickle it.

Can you attach a gdb backtrace showing where the bad call to ChertCursor::get_key() happens?

comment:4 by Olly Betts, 9 years ago

@Kronuz: Any chance of that backtrace? Without a better idea of what's going on, it's going to be hard to fix this properly - all I can do is add a check for a bogus key length and bail out, but that doesn't really improve the situation.

comment:5 by Olly Betts, 9 years ago

Component: Other → Backend-Chert
Milestone: 1.3.2 → 1.3.3
Version: 1.2.18

comment:6 by Olly Betts, 9 years ago

I had a look at the "find_entry failed to find any entry at all!" error and I think I may have spotted a bug which could explain at least some of this.

If we try to move the cursor to a key which isn't present, we ought to leave the cursor on the highest key < the requested one. There's also a special item with a "null key" which is always present, so it should always be possible to leave the cursor positioned on an item. However, it looks like you can't call component_of() on the item with the null key, yet in some situations, while positioning the cursor on the key before the requested one, we try to.

This almost explains the issue with the negative key length too, except that I think the key length would be reported as -2 and not -3. I'll run some tests to check I'm reading the code correctly, and try to resolve this.

A backtrace of the call stack leading to the situation where the key length of -3 is returned would still be very helpful.
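
(Purely to illustrate the intended positioning semantics described above, here is a rough Python sketch over a sorted key list with an always-present empty "null key" as a sentinel; this is not the chert code, just the behaviour find_entry is supposed to have.)

import bisect

def find_entry(keys, wanted):
    # keys is sorted and always starts with the "null" (empty) key,
    # so there is always something for the cursor to land on.
    assert keys and keys[0] == ""
    i = bisect.bisect_right(keys, wanted) - 1
    # Exact match: we land on 'wanted' itself.
    # No match: we land on the highest key that sorts before it
    # (possibly the null key at position 0).
    return i, keys[i] == wanted

keys = ["", "apple", "banana", "cherry"]
print(find_entry(keys, "banana"))     # (2, True)  exact match
print(find_entry(keys, "blueberry"))  # (2, False) lands on "banana"
print(find_entry(keys, "aardvark"))   # (0, False) lands on the null key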

comment:7 by Olly Betts, 9 years ago

Does the fix for #675 in [1900eca6d325e9185b9f9f5c75a4a4a545616d7f/git] solve this one too?

That could lead to us reading the wrong data for the block contents, which would explain a key length of -3 (since that would be a zero byte where the length should be), and as you noted there, this ticket is a fairly similar situation.

comment:8 by Olly Betts, 9 years ago

Looking at the component_of() issue again, I was mistaken. The null key that every table has is in fact "component 1 of 1" - the item is 7 bytes long; within that, the key is 3-3=0 bytes long, and it's part 1 of 1:

0001ff0: xx xx xx xx xx xx xx xx xx 00 07 03 00 01 00 01
         .......................... itmlen K part1 of__1

We also set the left key at each branch level to a "null key"; that one lacks the component information, and I was confusing these two different sorts of "null key".
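
(For clarity, here is a rough decoding of those seven bytes following the breakdown above; the field names and the "length field minus 3" rule are taken from this ticket's comments rather than from the chert source, so treat them as informal.)

# The 7-byte "null key" item shown above: 00 07 03 00 01 00 01
item = [0x00, 0x07, 0x03, 0x00, 0x01, 0x00, 0x01]

item_len  = (item[0] << 8) | item[1]   # 7 - total item length ("itmlen")
key_field = item[2]                    # 3 - key length field ("K")
key_len   = key_field - 3              # 0 - actual key bytes (3 - 3 = 0)
component = (item[3] << 8) | item[4]   # 1 - "part 1"
of_total  = (item[5] << 8) | item[6]   # 1 - "of 1"

# A zero byte where the key length field should be would give 0 - 3 = -3,
# matching the bogus Key::length() of -3 discussed earlier in this ticket.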

Last edited 8 years ago by Olly Betts

comment:9 by German M. Bravo, 9 years ago

Unfortunately, this doesn't fix the issue... however, I'm now unable to get a proper backtrace (I can't even remember how I did it), but simple printf()s no longer show -3 as the key length.

I'm attaching a C++ program that reproduces the problem here on both chert and glass.

by German M. Bravo, 9 years ago

Attachment: the_bug.cpp added

Perhaps the -3 was fixed after fixing #675; however, the EOF error remains.

comment:10 by German M. Bravo, 9 years ago

It could very well be a bug in TcpServer instead, because it works with the remote server we have in Xapiand.

comment:11 by Olly Betts, 9 years ago

For me, the server reports:

Got exception NetworkError: Received EOF (context: db_the_bug)
Closing connection.

(It's clearer this is from the server if you run the server in a separate terminal.)

But that just means "client closed the connection" - it's not a bug for a read-only client to do that since there's nothing to commit and so no need for an explicit shutdown exchange. For a writable client, it's useful for the client to wait for the database to actually be closed, since the write lock will still be held until then, and any implicit commit on close won't have happened.

I guess your server just doesn't display that message.

It's not a very useful message, and potentially confusing (as here) - perhaps we should suppress it or reword it (or suppress it by default and only show it with --verbose).

comment:12 by German M. Bravo, 9 years ago

Not at all, I assure you. The server's forked process is the one dying unexpectedly, and the client receives the EOF while waiting for a reply from the server. That's what I saw: when the server was dying due to the -3 length issue, that's what was happening. I believe that is fixed now (after fixing #675), but now perhaps some other error is making the server die, and the effect on the client is the same: EOF. Perhaps you are seeing different output with the attached program? But here the client is the one not receiving the full response from the server.

Last edited 9 years ago by German M. Bravo

comment:13 by Olly Betts, 9 years ago

Testing with your the_bug.cpp, I definitely see the message from the server, not the client - the client seems to complete successfully for me.

On IRC you seemed to confirm that the error wasn't from the client, but you mention the client hanging if I follow correctly:

<Kronuz> olly, you are correct, it's the client closing the connection :P
<Kronuz> ...but I don't know why or why the same program does work with our server (because it looks like the client just hangs)
<Kronuz> something is making the client close the connection while the server is still expecting something

Is the client hanging? If so, can you attach gdb and get a backtrace from where it is stuck?

The server is always still expecting something unless the client explicitly sends MSG_SHUTDOWN, but the client only sends that if it has opened a WritableDatabase. We don't bother for the read-only case - it would just slow things down for no good reason, especially on higher-latency connections.
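
(To illustrate that distinction from the client side, using only the public Python bindings; host and port are placeholders, and the comments paraphrase the protocol behaviour described above rather than showing the actual message exchange.)

import xapian

# Read-only client: when we're done we simply drop the connection.
# As explained above, there's nothing to commit, so no explicit
# shutdown exchange (MSG_SHUTDOWN) is sent.
rdb = xapian.remote_open("localhost", 8900)
# ... run queries ...
rdb.close()

# Writable client: here the client does wait for the database to be
# closed properly, since the write lock is held until then and any
# implicit commit on close needs to have happened on the server.
wdb = xapian.remote_open_writable("localhost", 8900)
# ... add or replace documents ...
wdb.commit()
wdb.close()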

comment:14 by Olly Betts, 9 years ago

Milestone: 1.3.3 → 1.3.4

Kronuz said on IRC he's not had a chance to retest yet.

I don't want to hold up the next release, so I'm going to move this ticket to the next milestone, though if we determine what the remaining issues are and they're easy to fix, they could still go in.

comment:15 by Olly Betts, 8 years ago

Ping?

I'm still unsure if there's really still a bug here. For a read-only database, the client just closes the connection when it is done, and currently the server reports that as Got exception NetworkError: Received EOF, e.g.:

$ bin/xapian-tcpsrv tests/.glass/db= -p 16712
Starting server on port 16712
Listening...
Connection from 127.0.0.1, port 7337
Got exception NetworkError: Received EOF (context: tests/.glass/db=)
Closing connection.
Connection from 127.0.0.1, port 7849
Got exception NetworkError: Received EOF (context: tests/.glass/db=)
Closing connection.
Connection from 127.0.0.1, port 8361
Got exception NetworkError: Received EOF (context: tests/.glass/db=)
Closing connection.

So I'm wondering if that's all you're seeing (and it "works" with your server because you don't print that!).

Those messages are confusing when that's really just what's expected, but we should be able to just suppress them.

comment:16 by Olly Betts, 8 years ago

As of [bcf9b05336b9ea0a53158d678c0a8084ca42e123/git], we now say "Connection closed" for an expected EOF, and "Connection closed unexpectedly" for an unexpected EOF.

comment:17 by Olly Betts, 8 years ago

Milestone: 1.3.4 → 1.3.x

Not a blocker.

comment:18 by Olly Betts, 8 years ago

Milestone: 1.3.x → 1.2.22

Reviewing the discussion, I feel confident that the issues here are a combination of #675 and the confusing not-actually-an-error message (fix noted in comment:16). It doesn't seem useful to keep tickets we believe to be fixed open indefinitely in the absence of any further feedback from the submitter, so let's backport the error message improvement (assuming that's feasible without a lot of work) and close this ticket. If there's still actually a problem here, please comment or open a fresh ticket.

comment:19 by Olly Betts, 8 years ago

Resolution: fixed
Status: assigned → closed

Actually it's already backported in [f2b7b204a22de3f12445c237c8bd399d31b0e39b].
