Opened 10 years ago

Closed 9 years ago

Last modified 9 years ago

#678 closed defect (fixed)

Remote backend doesn't work with large databases

Reported by: matf
Owned by: Olly Betts
Priority: normal
Milestone: 1.2.22
Component: Backend-Remote
Version: 1.2.20
Severity: normal
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by Olly Betts)

file: net/serialise.cc
function: decode_length(const char ** p, const char *end, bool check_remaining)

bug: if (*p == end || shift > 28)

suggested change: if (*p == end || shift > 28*2)

If the database is too big, it will fail.
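
For context, decode_length() reads a value that has been serialised 7 bits per byte, and the shift > 28 test is the guard that stops the loop once more bytes arrive than that limit allows. The following is only a rough sketch of that style of decoder, to show where the constant matters; the names and error handling are illustrative, not the actual Xapian source:

#include <cstddef>
#include <stdexcept>

// Sketch of a 7-bits-per-byte length decoder, least-significant chunk first;
// a set high bit marks the final byte. Illustrative only.
static size_t decode_length_sketch(const char **p, const char *end)
{
    size_t value = 0;
    int shift = 0;
    while (true) {
        // With the limit at 28, only chunks at shifts 0, 7, 14, 21 and 28 are
        // accepted, i.e. 35 bits in total; larger values are rejected here.
        if (*p == end || shift > 28)
            throw std::runtime_error("Bad encoded length: insufficient data");
        unsigned char ch = static_cast<unsigned char>(*(*p)++);
        value |= static_cast<size_t>(ch & 0x7f) << shift;
        if (ch & 0x80) break;   // final byte
        shift += 7;
    }
    return value;
}

Simply raising the limit is not enough on its own where size_t is 32-bit, since the extra chunks would overflow value (see the comments below).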

Change History (11)

comment:1 by matf, 10 years ago

Component: Other → Backend-Remote

comment:2 by Olly Betts, 10 years ago

Just changing the 28 here isn't enough, as we'll overflow size_t on platforms where it is 32-bit.

So this is likely to need more complex changes to fix properly, but in order to do that I need to understand exactly where this is blowing up for you.

Also, did you have a reason to pick 28*2, or was that just a big enough value to make it work in your situation?

Also, what version of Xapian are you using? I can tell it's 1.2.x, but not what "x" is...

comment:3 by Olly Betts, 10 years ago

Please can you let us have the extra information requested.

comment:4 by matf, 10 years ago

When xapian-replicate uses a 6GB index, the search program cannot connect. After changing 28 to 2*28, the search program works normally. I use 1.2.20.

I want to use the remote backend to improve query speed. Is this idea right? Thank you.

comment:5 by Olly Betts, 10 years ago

Component: Backend-Remote → Replication
Milestone: 1.3.4
Status: new → assigned
Summary: RemoteDatabase → Replication of >4GB files fails
Version: 1.2.20

OK, so this is an issue with replication, not the remote backend. Where sizeof(size_t) == 8 (which it generally will be on 64-bit platforms), changing 28 to 63 looks like the correct fix (28 is the greatest multiple of 7 < 32, while 63 is the greatest < 64). But where sizeof(size_t) == 4 (most 32-bit platforms) this will just cause the decoded value to overflow. I'll sort out a proper fix for this.

Using the remote backend (or replication) doesn't automatically improve query speed as such, but both enable approaches which can improve it. But if you have a single database, simply moving it to be remote will probably be slower - there's the extra overhead of serialising data and sending it across the network (though you are splitting the load across two machines to some extent).

But with the remote database, you can spread the load of searching a large data set by partitioning it by document into N databases and putting each on a different server, then searching all N together as remote databases.

And with replication you can efficiently have copies of a database on many servers, which allows a high search load to be split across them.
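
As a rough illustration of the document-partitioning approach above, several remote databases can be combined into a single Xapian::Database and searched as one. The host names and query here are made up, not taken from this ticket:

#include <xapian.h>
#include <iostream>

int main() {
    // Combine N remote shards into one logical database and search them together.
    Xapian::Database db;
    db.add_database(Xapian::Remote::open("search1.example.com", 33333));
    db.add_database(Xapian::Remote::open("search2.example.com", 33333));

    Xapian::Enquire enquire(db);
    enquire.set_query(Xapian::Query("example"));
    Xapian::MSet matches = enquire.get_mset(0, 10);
    std::cout << "estimated matches: " << matches.get_matches_estimated() << std::endl;
}

Each shard holds a subset of the documents, so the matching work is spread across the servers while the client sees a single combined result set.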

comment:6 by matf, 10 years ago

I'm very sorry, I was wrong: it is xapian-tcpsrv that uses the 6GB index, not xapian-replicate, so this is an issue with the remote backend.

comment:7 by Olly Betts, 10 years ago

Component: Replication → Backend-Remote
Summary: Replication of >4GB files fails → Remote backend doesn't work with large databases

Ah, OK - thanks for the correction.

In fact replication was already fixed for this back in 1.2.13:

+ Allow files > 32G to be copied by replication.

(That NEWS entry is wrongly worded - it should say "> 4G" or "with a size which needs a > 32-bit integer").

I just reviewed where decode_length() is called, and I think the problem isn't actually the size of the database in bytes (the remote backend never needs to send that information), but rather the total length of all documents in the database - we send that and then divide it by the number of documents to get the average document length.

Can you check your database with delve (which might be installed as xapian-delve) to get the number of documents and the average length, like so:

$ delve db
UUID = 00cb3616-dfc5-4113-a041-0f3d81961b0b
number of documents = 566
average document length = 109.346
document length lower bound = 2
document length upper bound = 532
highest document id ever used = 566
has positional information = true

So for this example, the total length is approximately 566*109.346 = 61889.836. If I've correctly diagnosed the cause of this, you should get > 4294967295.
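
The same figures can also be pulled out programmatically, opening the database directly rather than through the remote backend; a small sketch (the database path is a placeholder):

#include <xapian.h>
#include <iostream>

int main() {
    Xapian::Database db("/path/to/db");
    // Total document length is (number of documents) * (average document length).
    double total = db.get_doccount() * db.get_avlength();
    std::cout << std::fixed
              << "number of documents = " << db.get_doccount() << "\n"
              << "average document length = " << db.get_avlength() << "\n"
              << "approximate total document length = " << total << std::endl;
}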

comment:8 by Olly Betts, 10 years ago

Milestone: 1.3.4 → 1.2.22

I've split decode_length() into 32- and 64-bit variants, and the appropriate one should now get called in each case, which should fix this.

That's in git master [24c7867693cf5746eab0e1cc50546b3e1bfc8332], which will be in 1.3.4.
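
Roughly, the shape of the fix is to decode into an explicitly sized type and derive the shift limit from that type rather than from size_t. A sketch of the idea (not the actual committed code) might look like:

#include <cstdint>
#include <stdexcept>

// Sketch only: decode into T, allowing as many 7-bit chunks as T can hold.
// For a 32-bit T the limit works out as 28 (4*7); for a 64-bit T it is 63 (9*7).
template <typename T>
static void decode_length_sketch(const char **p, const char *end, T &out)
{
    const int max_shift = (sizeof(T) * 8 - 1) / 7 * 7;
    T value = 0;
    int shift = 0;
    while (true) {
        if (*p == end || shift > max_shift)
            throw std::runtime_error("Bad encoded length");
        unsigned char ch = static_cast<unsigned char>(*(*p)++);
        value |= static_cast<T>(ch & 0x7f) << shift;
        if (ch & 0x80) break;
        shift += 7;
    }
    out = value;
}

// A caller that can legitimately receive values over 32 bits, such as the
// total document length, would then decode into a 64-bit type:
//   uint64_t total_doclen;
//   decode_length_sketch(&p, end, total_doclen);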

We already have a test that total doclength > (1<<32) works, but actually this case is problematic only once it exceeds (1<<35), and the existing test case didn't cause that to happen, so I've extended it to provide a regression test:

[466ae43450e238ce76b0f73fdd27e7c0bfad100a]

It would be useful to have confirmation that your total document length is > (1<<35) (34,359,738,368 i.e. 34 billion and some), as if it isn't then the bug I've fixed isn't actually the one you hit.

We should also backport this.

Last edited 9 years ago by Olly Betts

comment:9 by matf, 10 years ago

UUID = 34c3f9a3-c3e3-11e4-bb81-90fba60951b3
number of documents = 2487586
average document length = 14545.6
document length lower bound = 9
document length upper bound = 656404
highest document id ever used = 2487586
has positional information = true

total document length is 36,181,938,370

Last edited 9 years ago by Olly Betts

comment:10 by Olly Betts, 9 years ago

OK, so the total document length is somewhat over the (1<<35) threshold. Thanks for confirmation.

comment:11 by Olly Betts, 9 years ago

Description: modified
Resolution: fixed
Status: assigned → closed

Backported to the 1.2 branch in [057edbcd/git].
