#678 closed defect (fixed)
Remote backend doesn't work with large databases
Reported by: | matf | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.2.22 |
Component: | Backend-Remote | Version: | 1.2.20 |
Severity: | normal | Keywords: | |
Cc: | | Blocked By: | |
Blocking: | | Operating System: | All |
Description (last modified by )
file: net/serialise.cc
function: decode_length(const char ** p, const char *end, bool check_remaining)
bug: if (*p == end || shift > 28)
change: if (*p == end || shift > 28*2)
If the database is too big, it will fail.
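For context, a simplified sketch of the sort of variable-length decoder involved (illustrative only, not the verbatim Xapian source): each byte carries 7 bits of the value, and the `shift > 28` guard only accepts shifts of 0, 7, 14, 21 and 28, i.e. at most 35 bits, so larger values make the decoder throw.

```cpp
// Simplified sketch of a 7-bits-per-byte length decoder with the guard the
// report refers to; names and details are illustrative, not the Xapian source.
#include <cstddef>
#include <stdexcept>

size_t decode_length_sketch(const char **p, const char *end) {
    size_t len = 0;
    int shift = 0;
    unsigned char ch;
    do {
        // With "shift > 28" the decoder accepts shifts 0, 7, 14, 21, 28,
        // i.e. five 7-bit groups (35 bits); anything larger is rejected.
        if (*p == end || shift > 28)
            throw std::runtime_error("bad encoded length");
        ch = static_cast<unsigned char>(*(*p)++);
        len |= size_t(ch & 0x7f) << shift;
        shift += 7;
    } while ((ch & 0x80) == 0);
    return len;
}
```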
Change History (11)
comment:1 by , 10 years ago
Component: | Other → Backend-Remote |
---|---|
comment:2 by , 10 years ago
Just changing the 28 here isn't enough, as we'll overflow size_t on platforms where it is 32-bit. So this is likely to need more complex changes to fix properly, but in order to do that I need to understand exactly where this is blowing up for you.
Also, did you have a reason to pick 28*2, or was that just a big enough value to make it work in your situation?
Also, what version of Xapian are you using? I can tell it's 1.2.x, but not what "x" is...
comment:4 by , 9 years ago
When xapian-replicate uses a 6G index, the search program cannot connect. After changing 28 to 2*28, the search program works normally. I use 1.2.20.
I want to use the remote backend to improve query speed. Is this idea right? Thank you.
comment:5 by , 9 years ago
Component: | Backend-Remote → Replication |
---|---|
Milestone: | → 1.3.4 |
Status: | new → assigned |
Summary: | RemoteDatabase → Replication of >4GB files fails |
Version: | → 1.2.20 |
OK, so this is an issue with replication, not the remote backend. Where sizeof(size_t) == 8 (which it generally will be on 64 bit platforms), changing 28 to 63 looks like the correct fix (28 is the greatest multiple of 7 < 32, while 63 is the greatest < 64). But where sizeof(size_t) == 4 (most 32 bit platforms) this will just cause the decoded value to overflow. I'll sort out a proper fix for this.
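To illustrate the overflow concern (an assumed scenario, not actual Xapian code): if the guard is simply relaxed, the shift in the decoding loop can reach or exceed the width of size_t on 32-bit platforms.

```cpp
// Illustration only: why relaxing the guard alone is unsafe where size_t is 32-bit.
#include <cstddef>

size_t shift_group(unsigned char ch, int shift) {
    // With sizeof(size_t) == 4, a shift of 35 (the sixth 7-bit group) is
    // already >= the type's width, so this is undefined behaviour; with
    // sizeof(size_t) == 8, shifts up to 63 are well-defined.
    return size_t(ch & 0x7f) << shift;
}
```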
Using the remote backend (or replication) doesn't automatically improve query speed as such, but both enable approaches which can improve it. But if you have a single database, simply moving it to be remote will probably be slower - there's the extra overhead of serialising data and sending it across the network (though you are splitting the load across two machines to some extent).
But with the remote database, you can spread the load of searching a large data set by partitioning it by document into N databases and putting each on a different server, then searching all N together as remote databases.
And with replication you can efficiently have copies of a database on many servers, which allows a high search load to be split across them.
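As a rough sketch of the partitioning approach (hostnames and ports are made up for illustration), each shard is served by its own xapian-tcpsrv and the client searches them together as one combined database:

```cpp
#include <xapian.h>
#include <iostream>

int main() {
    // Hypothetical shard servers, each running xapian-tcpsrv over one
    // partition of the documents.
    Xapian::Database db;
    db.add_database(Xapian::Remote::open("shard1.example.com", 33333));
    db.add_database(Xapian::Remote::open("shard2.example.com", 33333));

    // Search all shards together as a single combined database.
    Xapian::Enquire enquire(db);
    enquire.set_query(Xapian::Query("example"));
    Xapian::MSet matches = enquire.get_mset(0, 10);
    for (Xapian::MSetIterator i = matches.begin(); i != matches.end(); ++i)
        std::cout << i.get_rank() + 1 << ": " << i.get_document().get_data() << "\n";
}
```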
comment:6 by , 9 years ago
I'm very sorry, I was wrong: it is xapian-tcpsrv that uses the 6G index, not xapian-replicate. So it is an issue with the remote backend.
comment:7 by , 9 years ago
Component: | Replication → Backend-Remote |
---|---|
Summary: | Replication of >4GB files fails → Remote backend doesn't work with large databases |
Ah, OK - thanks for the correction.
In fact replication was already fixed for this back in 1.2.13:
+ Allow files > 32G to be copied by replication.
(That NEWS entry is wrongly worded - it should say "> 4G" or "with a size which needs a > 32-bit integer").
I just reviewed where decode_length() is called, and I think the problem isn't actually the size of the database in bytes (the remote backend never needs to send that information), but rather the total length of all documents in the database - we send that and then divide it by the number of documents to get the average document length.
Can you check your database with delve (which might be installed as xapian-delve) to get the number of documents and the average length, like so:
```
$ delve db
UUID = 00cb3616-dfc5-4113-a041-0f3d81961b0b
number of documents = 566
average document length = 109.346
document length lower bound = 2
document length upper bound = 532
highest document id ever used = 566
has positional information = true
```
So for this example, the total length is approximately 566*109.346 = 61889.836. If I've correctly diagnosed the cause of this, you should get > 4294967295.
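Equivalently (a small sketch using the public API, with a made-up database path), the total document length can be estimated as the number of documents times the average document length:

```cpp
#include <xapian.h>
#include <iostream>

int main() {
    // "db" is a placeholder path; point this at the database in question.
    Xapian::Database db("db");
    double total = db.get_doccount() * db.get_avlength();
    std::cout << "approximate total document length = " << total << "\n";
    std::cout << "needs > 32 bits: " << (total > 4294967295.0 ? "yes" : "no") << "\n";
}
```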
comment:8 by , 9 years ago
Milestone: | 1.3.4 → 1.2.22 |
---|---|
I've split decode_length() into 32 and 64 bit variants, and the appropriate one should now get called in each case, which should fix this.
That's in git master [24c7867693cf5746eab0e1cc50546b3e1bfc8332], which will be in 1.3.4.
We already have a test that total doclength > (1<<32) works, but actually this case is only problematic once it exceeds (1<<35) (the shift > 28 check still permits five 7-bit groups, i.e. 35 bits), and the existing test case didn't cause that to happen, so I've extended it to provide a regression test:
[466ae43450e238ce76b0f73fdd27e7c0bfad100a]
It would be useful to have confirmation that your total document length is > (1<<35) (34,359,738,368, i.e. 34 billion and some), as if it isn't then the bug I've fixed isn't actually the one you hit.
We should also backport this.
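One way to picture the split (hypothetical names; the actual change is in the commits referenced above) is a decoder whose shift bound follows the width of the result type, so 32-bit callers keep the 28 limit while 64-bit callers get 63:

```cpp
// Sketch only: a decoder parameterised on the result type, so the shift bound
// tracks the type's width (28 for 32-bit results, 63 for 64-bit results).
#include <stdexcept>
#include <stdint.h>

template<typename T>
T decode_length_t(const char **p, const char *end) {
    // Greatest multiple of 7 below the bit width of T: 28 for 32 bits, 63 for 64.
    const int max_shift = ((int(sizeof(T)) * 8 - 1) / 7) * 7;
    T value = 0;
    int shift = 0;
    unsigned char ch;
    do {
        if (*p == end || shift > max_shift)
            throw std::runtime_error("bad encoded length");
        ch = static_cast<unsigned char>(*(*p)++);
        value |= T(ch & 0x7f) << shift;
        shift += 7;
    } while ((ch & 0x80) == 0);
    return value;
}

// 32-bit and 64-bit variants, chosen by the caller as appropriate:
inline uint32_t decode_length32(const char **p, const char *end) {
    return decode_length_t<uint32_t>(p, end);
}
inline uint64_t decode_length64(const char **p, const char *end) {
    return decode_length_t<uint64_t>(p, end);
}
```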
comment:9 by , 9 years ago
```
UUID = 34c3f9a3-c3e3-11e4-bb81-90fba60951b3
number of documents = 2487586
average document length = 14545.6
document length lower bound = 9
document length upper bound = 656404
highest document id ever used = 2487586
has positional information = true
```
total document length is 36,181,938,370
comment:10 by , 9 years ago
OK, so the total document length is somewhat over the (1<<35) threshold. Thanks for confirmation.
comment:11 by , 9 years ago
Description: | modified (diff) |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
Backported to the 1.2 branch in [057edbcd/git].