#385 closed enhancement (fixed)
Expanding docids (etc) beyond 32 bit types
Reported by: | James Aylett | Owned by: | Olly Betts |
---|---|---|---|
Priority: | normal | Milestone: | 1.3.4 |
Component: | Library API | Version: | SVN trunk |
Severity: | minor | Keywords: | |
Cc: | Blocked By: | ||
Blocking: | Operating System: | All |
Description
Currently, Xapian uses (unsigned int) for types such as docid and termpos. For docid in particular, this limits the total size of the database, and there has been some interest in a Xapian that doesn't have this restriction.
This ticket is here to capture information and thoughts about this.
Attachments (5)
Change History (29)
comment:1 by , 15 years ago
follow-up: 3 comment:2 by , 15 years ago
Version: | → SVN trunk |
---|
Replying to the mailing list message here to keep all the discussion together.
I'm in work-avoidance mode at the moment, so I thought I'd take a pass at this. The simplest approach seemed to be to declare Xapian::unsigned_integer and Xapian::signed_integer, and use them for doccount, doccount_diff, docid, termcount, termcount_diff, termpos, termpos_diff, valueno and valueno_diff, the last two mostly because I couldn't be bothered to figure out why there's a piece of code that iterates over valuenos but wants them to act like docids, or vice versa or something.
I'm not keen on adding new public types for this. They aren't useful in themselves, and these types don't all need to be the same size - they're just all "int" currently as that's "big enough" (or at least were for most people) on all modern plaforms - so a common type for them doesn't really make logical sense either.
If valueno needs changing, that's a bug. Ideally termcount shouldn't need changing (more than 4 billion terms per document doesn't seem like a sane scenario, and isn't going to work sanely with the current termlist storage anyway), but we would need a new type for collection frequency. We probably should have one anyway since the collection frequency of a term which occurs many times in many documents will for many users probably overflow 32 bits before you add 4 billion documents.
Out of the box, with two static_cast<>s (to turn 0u and 1u into our unsigned_integer type instead of just unsigned int)
Both arguments of std::min()
and std::max()
should be the same
type, so the cast should be to that type not whatever it happens to be
typedef-ed currently, so 1u is wrong even as things stand. I've fixed that
in trunk.
this passes all tests except stub on all backends except remote. I believe that stub actually uses remote anyway,
A stub database can refer to any database backend(s), and there are several stub tests which are run over various actual backends, but if the remote tests fail, then stub tests run over the remote backend are likely to as well!
so this just means that the network protocol can't handle 64 bit, which was expected.
The remote protocol uses variable length integer encodings produced by templated functions, so I'd actually expect it would just work. Hard to guess what might be wrong. Running xapian-tcpsrv by hand in one terminal and performing a search on it from another (e.g. via a stub db file and examples/quest) might show what's going on.
I should note that passing the testsuite wouldn't actually be saying much about this patch - it would really need testing with more than 4 billion documents. The testsuite can't sanely do that due to the time and space required, but it can test with really large docids, and I guess people wanting this support can report any issues they hit.
(In a future where we support this, it's unclear as yet to me what the right approach is. 64 bits will slow down access in some cases, but on some CPUs other aspects of 64 bit access will be faster. I guess we just have to profile a lot and perhaps have it as a configure option until/unless it's the clear winner. Shudder.)
It's always going to be slower on a CPU without 64 bit registers, and sadly it's not going to be ABI compatible in general.
Right now, the sanest approach is probably just for people who actually need it to enable it - if you're handling more than 4 billion documents, having to work with a specially built package isn't likely to be a huge deal.
comment:3 by , 15 years ago
Replying to olly:
I'm not keen on adding new public types for this. They aren't useful in themselves, and these types don't all need to be the same size - they're just all "int" currently as that's "big enough" (or at least were for most people) on all modern plaforms - so a common type for them doesn't really make logical sense either.
Yes, that makes sense. It just seemed a convenient place to shove it so I could play around; I wasn't actually expecting so many of the tests to pass without tweaking.
So a patch that just changes the typedefs directly without an intermediate type is appropriate? (I guess a define from config.h?)
If valueno needs changing, that's a bug.
That's what I thought, but I don't really know what's going on in detail enough to figure it out. The error is:
./common/valuelist.h:73: warning: ‘virtual void Xapian::ValueIterator::Internal::skip_to(Xapian::docid)’ was hidden api/documentvaluelist.h:59: warning: by ‘void DocumentValueList::skip_to(Xapian::valueno)’
When Xapian::docid and Xapian::valueno are typedef'd the same, this doesn't matter.
Ideally termcount shouldn't need changing (more than 4 billion terms per document doesn't seem like a sane scenario, and isn't going to work sanely with the current termlist storage anyway), but we would need a new type for collection frequency. We probably should have one anyway since the collection frequency of a term which occurs many times in many documents will for many users probably overflow 32 bits before you add 4 billion documents.
Yes, that all makes sense. I was probably overly-liberal in the types I changed, bitten by the valueno thing and not bothering to fix it properly.
There's another problem if I don't switch termcount:
backends/chert/chert_postlist.cc:1137: error: cannot convert ‘Xapian::termcount*’ to ‘Xapian::doccount*’ for argument ‘3’ to ‘Xapian::docid read_start_of_first_chunk(const char, const char*, Xapian::doccount*, Xapian::termcount*)’
And if termpos and termcount don't match:
api/../backends/multi/multi_termlist.h:51: error: conflicting return type specified for ‘virtual Xapian::termpos MultiTermList::positionlist_count() const’ ./common/termlist.h:101: error: overriding ‘virtual Xapian::termcount Xapian::TermIterator::Internal::positionlist_count() const’
A stub database can refer to any database backend(s), and there are several stub tests which are run over various actual backends, but if the remote tests fail, then stub tests run over the remote backend are likely to as well!
Yes, that sounds like what happened :-) It's stubdb2, which is just remote :../bin/xapian-progsrv .chert/db=apitest_simpledata
.
The remote protocol uses variable length integer encodings produced by templated functions, so I'd actually expect it would just work. Hard to guess what might be wrong. Running xapian-tcpsrv by hand in one terminal and performing a search on it from another (e.g. via a stub db file and examples/quest) might show what's going on.
Server error:
Got exception NetworkError: Bad encoded length: insufficient data
I've tried investigating further, but my gdb-fu isn't really strong enough when forking starts getting involved :-/
Right now, the sanest approach is probably just for people who actually need it to enable it - if you're handling more than 4 billion documents, having to work with a specially built package isn't likely to be a huge deal.
Yes, that's what I thought. I was thinking that keeping a patch that applying reasonably cleanly around would mean that people with that requirement could build it 64 bit and see what happens. Once we get a bit of practical feedback of its actually working, it could be built in properly as a compile option.
comment:4 by , 15 years ago
Status: | new → assigned |
---|
I've hopefully fixed the problem in chert_postlist.cc in r12954.
by , 15 years ago
Patch to enable 64 bit docids and 64 bit termcounts (termpos, valueno left 32 bit)
comment:5 by , 9 years ago
Milestone: | → 1.3.x |
---|
There's a 32-bit assumption in glass currently:
/** How many bits to store the length of a sortable uint in. * * Setting this to 2 limits us to 2**32 documents in the database. If set * to 3, then 2**64 documents are possible, but the database format isn't * compatible. */ const unsigned int SORTABLE_UINT_LOG2_MAX_BYTES = 2;
We should probably resolve that before 1.4.0, so glass is 64-bit ready. Need to check what the consequences of changing that are though.
Marking this for 1.3.x, to at least resolve that.
by , 9 years ago
Attachment: | v1.2.21-x64.patch added |
---|
Here is the patch we're currently using against v1.2.21 . I will submit another patch against v1.3.3 which will be a little more tidy but this is just here in case anybody wants to use it. It passes all the tests on debian and appears to be working on our production systems but we are only using it for searching right now. I've run searches across combined databases with max docid > 232 and it has successfully solved all our DocNotFoundError problems. There are a few outstanding problems with this: 1. Tests don't compile on mac (just scaleweight1 and scaleweight2) 2. There are some hardcoded backend limits which I've not bothered to change
comment:6 by , 9 years ago
Sorry for weird formatting of comment above:
Here is the patch we're currently using against v1.2.21 . I will submit another patch against v1.3.3 which will be a little more tidy but this is just here in case anybody wants to use it. It passes all the tests on debian and appears to be working on our production systems but we are only using it for searching right now. I've run searches across combined databases with max docid > 32bit and it has successfully solved all our DocNotFoundError problems.
There are a few outstanding problems with this:
- Tests don't compile on mac (just scaleweight1 and scaleweight2)
- There are some hardcoded backend limits which I've not bothered to change
comment:7 by , 9 years ago
I've just uploaded the patch for v1.3.3 which passes all the tests and with constants extracted.
One thing I'm a little unsure of is whether or not my changes to SmokeTest.java
make sense. Did the unsigned long long
change cause a non backward compatible change to the java bindings? If so is that fine? If it's not fine can we somehow mitigate the java bindings?
follow-up: 12 comment:8 by , 9 years ago
I think the Java bindings should just continue to use Java long
for docid after this change. That's a signed 64 bit type, but the changes to SmokeTest.java
in the patch look like an incompatible change which require people to write rather clumsy and verbose code.
That may just be as simple as adding something like this to java/java.i
:
%apply long long { Xapian::docid };
Or maybe unsigned
instead of long long
there (currently we end up using Java long
). And similarly for other types like Xapian::doccount
.
The change to decode_length()
is OK for a 64-bit platform, but I think it'll fail to work where size_t
is 32-bit. I guess we either want it to return unsigned long long
, or a separate version of the function for cases where the length can legitimately be > 32 bit. This is actually already buggy for replication (#678).
Is long long
ever > 64 bits currently? C++11 says it must be at least that size, but if there's a platform relevant to us where long
is 64 bit and long long
is 128, we probably want to use unsigned long
(though using an unnecessarily wide type is probably only a performance issues).
It would also be good to have a simple test that this actually works - e.g. a multi database with subdatabases which each have a document with local docid 0xffffffff
, so the docid in the merged database should require 64 bits. It can't run for inmemory (as that reserves O(last_docid) space), but for other backends it ought to work.
We also need to think about how to enable this. I'm thinking probably a configure option, off by default for 1.4.x, unless benchmarking on 32 bit platforms show that this doesn't incur an overhead. It could perhaps be conditionally enabled for 64 bit platforms.
comment:9 by , 9 years ago
[24c7867693cf5746eab0e1cc50546b3e1bfc8332] makes decode_length()
decode an appropriate sized value, which mean the change in net/length.cc
isn't needed. It likely also solves some undetected issues with this patch on 32-bit platforms.
comment:10 by , 9 years ago
I've added the XXX_MAX_DOCID
constants in [57d2464bf81a335bc06841e2135a95b592969677] and merged the fixes for the Snipper
API in [4c7e875befffa981e424d6df7b6bc14f71a0d8ef].
comment:11 by , 9 years ago
The change to make the operator<<
function a template doesn't look right to me, and GCC seems to agree. I think the template would need to be explicitly instantiated for any types it will be used for in order for that to work, but it's a template mostly to avoid having to worry about what types we are going to use it for. Perhaps it is best to just inline the template from the header.
comment:12 by , 9 years ago
Replying to olly:
I think the Java bindings should just continue to use Java
long
for docid after this change. That's a signed 64 bit type, but the changes toSmokeTest.java
in the patch look like an incompatible change which require people to write rather clumsy and verbose code.That may just be as simple as adding something like this to
java/java.i
:%apply long long { Xapian::docid };Or maybe
unsigned
instead oflong long
there (currently we end up using Javalong
). And similarly for other types likeXapian::doccount
.
Agreed. I'll give that a go and see if I can get the smoke tests to run without changes to SmokeTest.java
.
The change to
decode_length()
is OK for a 64-bit platform, but I think it'll fail to work wheresize_t
is 32-bit. I guess we either want it to returnunsigned long long
, or a separate version of the function for cases where the length can legitimately be > 32 bit. This is actually already buggy for replication (#678).
I'm thinking this is no longer needed because of the other change you made?
Is
long long
ever > 64 bits currently? C++11 says it must be at least that size, but if there's a platform relevant to us wherelong
is 64 bit andlong long
is 128, we probably want to useunsigned long
(though using an unnecessarily wide type is probably only a performance issues).
I'm not really sure how to find out the answer to this question. Do we have a list of devices/hardware/OSs or something to check against? Also is there a way to detect this at compile time so we can pivot based on whether or not unsigned long
is already 64bit?
It would also be good to have a simple test that this actually works - e.g. a multi database with subdatabases which each have a document with local docid
0xffffffff
, so the docid in the merged database should require 64 bits. It can't run for inmemory (as that reserves O(last_docid) space), but for other backends it ought to work.
Sounds good. I'll have a go at adding such a test.
We also need to think about how to enable this. I'm thinking probably a configure option, off by default for 1.4.x, unless benchmarking on 32 bit platforms show that this doesn't incur an overhead. It could perhaps be conditionally enabled for 64 bit platforms.
Would the conditional enabling that seems half done through #define USE_64BIT_DOCID
and #define USE_64BIT_TERMCOUNT
be suitable? Could we have those somehow enabled in ./configure
perhaps ./configure --with-64-bit-docids --with-64bit-termcount
? Is there an example of a configuration step in xapian already that I can look at and try to copy that?
comment:13 by , 9 years ago
[14b7af012cd57bb5e6097584f36d8680ca3c8d7e] just changes the testutils operator<<
to use Xapian::docid
instead of unsigned int
, which avoids using a template there. Really the issue is just that the wrong type was being used.
[ decode_length() ]
I'm thinking this is no longer needed because of the other change you made?
Indeed - that should all just work now.
[ sizeof(long long) ]
I'm not really sure how to find out the answer to this question. Do we have a list of devices/hardware/OSs or something to check against?
It's hard to gather a complete list of platforms Xapian works on as people might have successfully built on something and never told us. We're much more likely to hear if they fail, or if a few tweaks are needed.
But we can look at the platforms which are generally in active use (and in this case, anything old isn't going to make long long
wider than it has to, and 64-bit is the minimum requirement).
The current norm is definitely 64-bit long long - e.g. all the architectures Debian supports have sizeof(long long) == 8
. The GCC manual seems to hint that GCC supports platforms where this isn't true, but I don't know an easy way to find out what they are:
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fint128.html#g_t_005f_005fint128 says "There is no support in GCC for expressing an integer constant of type int128 for targets with long long integer less than 128 bits wide" which suggests that there are targets with long long at least 128 bits wide.
C++11 provides uint64_t
in <cstdint>
(at least if such a type exists), though so far we've tried to avoid introducing C++11 assumptions in the API headers (only in the library code) - most compilers currently need C++11 support enabling with a command-line option, and it seems unhelpful to force all C++ projects using Xapian to update their build systems to probe for such an option.
I think we probably just use unsigned long long
- it will always work, and while it may be wider than necessary, that seems mostly a theoretical worry currently.
Would the conditional enabling that seems half done through #define USE_64BIT_DOCID and #define USE_64BIT_TERMCOUNT be suitable?
We don't want to have be defining generically named macros in the API headers (we risk colliding with macros the application using Xapian is using) - so the macros should start XAPIAN_
.
But the basic idea seems OK.
Could we have those somehow enabled in ./configure perhaps ./configure --with-64-bit-docids --with-64bit-termcount? Is there an example of a configuration step in xapian already that I can look at and try to copy that?
This is trickier for things like this which we want to use in the API headers as we can't just stick #include <config.h>
in those.
I'd look at --enable-backend-chert
and XAPIAN_HAS_CHERT_BACKEND
in configure.ac
and include/xapian/version_h.cc
(which is used to generate include/xapian/version.h
).
The options should probably be --enable-X
(--with-X
is conventionally meant to be used when X
is some other software package and --enable-X
when X
is a feature of this package - e.g. --with-java
vs --enable-backend-chert
- the most obvious consequence is the sections they are listed under by configure --help
).
It seems confusing for docids
to be plural in the option name when termcount
isn't; similarly be consistent with 64-bit
vs 64bit
there.
Also is there a way to detect this at compile time so we can pivot based on whether or not unsigned long is already 64bit?
I can't think of one unless we force people to select C++11 and just use uint64_t
, but it seems a bit soon for that. At some point compilers will presumably default to C++11 and this won't be a consideration.
comment:14 by , 9 years ago
I've attached my latest patch on top of master. I've added the --enable-64bit-docid
and --enable-64bit-termcount
options. All tests pass on xapian-core
but there is possibly still the outstanding issue with the java bindings using BigInteger and I also have yet to add a new test for the 64bit functionality. I'll try and tackle those next.
I'm also going to keep a branch on top of master here https://github.com/xapian/xapian/compare/master...DylanGriffith:master-x64
comment:15 by , 9 years ago
Ok so that patch doesn't actually appear to have any problem with the java bindings. Didn't require any changes. I will try now to add a test that asserts that the 32bit overflow no longer causes a problem.
comment:16 by , 9 years ago
Ok just added another patch that includes a test that shows you can now exceed 32bits in combined DB. You can build passing --enable-64bit-docid
and --enable-64bit-termcount
to enable and test this feature.
comment:17 by , 9 years ago
In [b01415df/git] I've fixed the overload resolution issue for decode_length()
by simply providing overloads for unsigned
, unsigned long
, and unsigned long long
.
comment:18 by , 9 years ago
Component: | Other → Library API |
---|
I've updated the patch so configure just picks suitable sized types for docid
and termcount
and substitutes those into include/xapian/version.h
instead (as discussed on IRC yesterday).
I'll really need to build and run the testsuite for all 4 combinations of type sizes, but it built for 32+32 OK, and the testsuite is currently running. So far I'm getting a valgrind moan for testcase topercent2, but that doesn't seem likely to be due to these changes - I'm guessing it's due to newer valgrind/GCC/something, and either a false positive, or a pre-existing issue that wasn't previously reported. Will investigate.
comment:19 by , 9 years ago
The topercent2 issue is fixed by [5b13ee178ee321821b7f7e91fe74a32d2fade3cd].
comment:20 by , 9 years ago
All 4 combinations worked, so I've committed the adjusted patch in [165391d49d0e52ae416f0c75350f79fe626311da].
The bindings needed a small tweak to get SWIG to see XAPIAN_DOCID_BASE_TYPE
, etc but otherwise seem to work, though we probably should have a testcase like exceed32bitcombineddb1
for each language to ensure that 64 bit docids actually work.
I've opened #686 for the related issue of allowing > 32-bit docid values to be used in the disk backends. This requires an encoding change, so isn't feasible for the stable backends (chert and older), but we could probably do it for glass.
I also wonder if 64-bit docids would be a sane default, at least for 64-bit platforms. Dylan: Have you done any profiling of your patch compared to a standard build?
comment:21 by , 9 years ago
Milestone: | 1.3.x → 1.3.4 |
---|
Dylan: Also, were you hoping to get this backported for 1.2.x, or are you happy to just carry a patch until you move to 1.4.x?
comment:22 by , 9 years ago
Olly: We should be fine to carry our patch until 1.4.x.
Thanks for getting that merged.
comment:23 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
We should be fine to carry our patch until 1.4.x.
Cool.
I've spun off the bindings tests as #687. Closing.
comment:24 by , 6 years ago
Related: [bc9e82fec5a243d67f9c4cb9a6eeaedb21190135] adds support for making Xapian::termpos
64-bit via new configure option --enable-64bit-termpos
. I intend to backport this for 1.4.8.
That patch passes all but a stub test (that uses remote:prog), except for the remote backends. (This on Mac OS X.)