Opened 8 years ago

Last modified 14 months ago

#724 new defect

Handle overflow of collection frequency

Reported by: Olly Betts Owned by: Olly Betts
Priority: normal Milestone: 2.0.0
Component: Backend-Glass Version: git master
Severity: normal Keywords:
Cc: Blocked By:
Blocking: Operating System: All

Description

Test case totaldoclen1 under chert was triggering undefined behaviour via signed integer overflow of the collection frequency deltas, and that was causing errors under ubsan.

I've patched the testcase to avoid this, so it tests what it aims to, and the testsuite doesn't fail under ubsan.

We ought to handle overflow of collection frequency better (hence this ticket) but in practice this isn't likely to happen until database get very large unless people deliberately set artificially high wdf (which is what the testcase was doing).

Although glass wasn't triggering ubsan errors the same issue does apply to glass, so marking as Backend-Glass since glass is the default backend for 1.4.x.

Change History (2)

comment:1 by Olly Betts, 8 years ago

[d9de0125313e666f53ad1718c603c5fb68730270/git] was the totaldoclen1 change, and the overflow was happening at line 1125 of backends/chert/chert_database.cc:

i->second.second += cf_delta;

comment:2 by Olly Betts, 14 months ago

Milestone: 1.4.x2.0.0

This seems tricky to fix properly.

There are two problems here really:

One is that cf is the sum of wdf over all the documents, but the same type as wdf so can simply overflow. That probably needs to be thrown as an exception to the application code, though there's not a lot if can really do about it. The other option is to use a type for collection frequency which is wider (>= width(doccount)+width(termcount) would work). We already have Xapian::totallength which is always 64-bit which would be suitable unless --enable-64bit-docid and/or --enable-64bit-termcount is used.

The other is that these deltas are the signed versions of the types, so we can overflow them more easily (and we can potentially overflow tf_delta, though we'd need to batch up >= 231 document changes which isn't really feasible in practice). We can probably use the overflow checking addition and subtraction and flush the pending changes if we would overflow, though this is fiddly to get right, and handling a wdf >= 231 is fiddlier still since we need to apply it directly rather than via the inverter. We could use a wider type here, or track the signs separately to the magnitudes, but both increase the size of the data stored, which is a concern.

Note: See TracTickets for help on using tickets.