Opened 12 years ago

Closed 3 years ago

Last modified 3 years ago

#326 closed defect (fixed)

Change doc length chunk encoding so skipping through a chunk is better than O(n)

Reported by: Richard Boulton Owned by: Olly Betts
Priority: normal Milestone: 1.5.0
Component: Backend-Honey Version: SVN trunk
Severity: normal Keywords:
Cc: charlie@… Blocked By:
Blocking: Operating System: All

Description (last modified by Olly Betts)

I've built a benchmark database of slightly over 100,000 documents from Wikipedia, and indexed these with flint and chert. When searching the resulting databases with 10,000 single-term searches, with the database fully cached, the flint database completes all searches in 1.78 seconds (averaging 0.000178 seconds/query), whereas the chert database completes all searches in 12.58 seconds (averaging 0.001258 seconds/query) - i.e., chert is about 7 times slower than flint.

Note that the chert database is considerably smaller than the flint database, which hopefully means that in the uncached case chert might perform better. However, with databases under 1m documents, we're likely to be IO bound, and performance will be much worse with chert.

Profiling the code with callgrind revealed that around 85% of the CPU time is being spent in get_doclength() calls, and around 90% of that time is spent in ChertPostList::move_forward_in_chunk_to_at_least(). This method calls next_in_chunk() repeatedly to find the appropriate doclen: on average, it calls next_in_chunk() around 30 times per call.
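To illustrate why this is O(n), here's a simplified model of linearly scanning a doclen chunk (this is an illustrative sketch, not the actual ChertPostList code; the struct and function names are made up):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simplified model of a doclen chunk: ascending docids with
// their corresponding document lengths, decodable only by stepping through.
struct SimpleChunk {
    std::vector<unsigned> docids;   // ascending docids in the chunk
    std::vector<unsigned> lengths;  // corresponding document lengths
};

// Linear seek, analogous to move_forward_in_chunk_to_at_least(): every
// entry before the target must be stepped over, so reaching an arbitrary
// docid within the chunk is O(n) in the number of entries skipped.
inline long find_doclen_linear(const SimpleChunk& chunk, unsigned docid) {
    for (std::size_t i = 0; i < chunk.docids.size(); ++i) {
        if (chunk.docids[i] >= docid)
            return chunk.docids[i] == docid ? long(chunk.lengths[i]) : -1;
    }
    return -1; // not in this chunk
}
```

With ~30 entries stepped over per get_doclength() call on average, this loop dominates the profile.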

I don't think this degree of slowdown is acceptable, so we need either to find a way to make the code faster with the existing data structure, or to find a way to allow faster seeking in the doclen list.

I've done some experiments about this, which are detailed in the comments. So far, a combination of avoid_string_operations.patch and optimise_unpack_uint.patch give the best result, but still result in searches 4.5 times slower than with flint. I think a redesign of the way doclen information is stored is needed to get acceptable speeds.

Attachments (6)

small_doclen_chunks.patch (4.8 KB ) - added by Richard Boulton 12 years ago.
Patch to make doclen chunks smaller.
avoid_string_appends_1.patch (3.8 KB ) - added by Richard Boulton 12 years ago.
Patch to speed search by avoiding string appends with output of pack_uint_preserving_sort
avoid_string_appends_and_loop_unroll_2.patch (5.3 KB ) - added by Richard Boulton 12 years ago.
Patch which also unrolls the loop in append_packed_uint_preserving_sort(), avoiding string resizing
avoid_string_operations.patch (20.0 KB ) - added by Richard Boulton 12 years ago.
Patch to speed search by using a fixed size buffer for keys, rather than C++ strings
optimise_unpack_uint.patch (2.7 KB ) - added by Richard Boulton 12 years ago.
Optimise the unpack_uint function in chert with a specialised version for om_uint32
chunktypes.patch (14.7 KB ) - added by Richard Boulton 12 years ago.
A patch to add two different chunk types - sparse and compact


Change History (33)

comment:1 by Richard Boulton, 12 years ago

Status: new → assigned

http://tartarus.org/richard/xapian_perf_analysis/callgrind.out.xapian_search_chert_1 is a callgrind trace for 1000 searches with chert, unpatched.

My first attempt to improve performance was to reduce the size of the doclen postlist chunks, to avoid so much time being spent calling next(). Since around 30 calls to next() were made per search, I reduced the chunksize for doclen lists from 2000 bytes to 70 bytes. After applying xapian-compact to the original and new databases, the new database was a fraction of a percent larger (postlist.DB of 245088256 bytes instead of 245022720 bytes), but searches sped up - 10,000 searches completed in 10.53 seconds instead of 12.58 seconds with the standard doclen chunk size.

Profiling with callgrind (see http://tartarus.org/richard/xapian_perf_analysis/callgrind.out.xapian_search_chert_small_doclen_chunks ) shows that move_forward_in_chunk_to_at_least() is now taking only 23.75% of the CPU time (though this is still undesirably high), but ChertPostList::move_to_chunk_containing() takes 50.51% of the CPU time.

by Richard Boulton, 12 years ago

Attachment: small_doclen_chunks.patch added

Patch to make doclen chunks smaller.

by Richard Boulton, 12 years ago

patch to speed search by avoiding string appends with output of pack_uint_preserving_sort

by Richard Boulton, 12 years ago

patch which also unrolls loop in append_packed_uint_preserving_sort(), avoiding string resizing

by Richard Boulton, 12 years ago

Patch to speed search by using a fixed size buffer for keys, rather than C++ strings

comment:2 by Richard Boulton, 12 years ago

Patches: avoid_string_appends_1.patch, avoid_string_appends_and_loop_unroll_2.patch and avoid_string_operations.patch progressively improve the speeds of the search (against the database with small doclen chunks): the first to 9.38 seconds, the second to 8.56 and finally to 8.08.

The first patch simply avoids some of the C++ string appends and resizing; the second adds some loop unrolling, which lets us avoid even more C++ string appending and resizing. However, I wasn't able to get rid of a fair amount of unnecessary dynamic allocation and string copying using standard C++ strings, so the third patch replaces the use of C++ strings for table keys with a fixed-size char array. This final patch isn't very tidy, but I think it's reasonable to conclude from its performance that the approach is worth looking into: we could have a fixed key buffer on each table, in which keys are formed in place, rather than by copying strings.
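The fixed key buffer idea could look something like this (an illustrative sketch, not the actual patch; BTREE_MAX_KEY_LEN and the struct name are made up):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical bound on key length, standing in for the real Btree limit.
const std::size_t BTREE_MAX_KEY_LEN = 252;

// A fixed-size buffer held by the table, in which keys are built in place,
// avoiding the dynamic allocation and copying that std::string incurs.
struct KeyBuffer {
    char data[BTREE_MAX_KEY_LEN];
    std::size_t len = 0;

    void clear() { len = 0; }

    // Append raw bytes with no heap allocation; fails if the key would
    // exceed the fixed limit.
    bool append(const char* p, std::size_t n) {
        if (len + n > BTREE_MAX_KEY_LEN) return false;
        std::memcpy(data + len, p, n);
        len += n;
        return true;
    }
};
```

The design trade-off is that key construction can no longer grow without bound, but table keys already have a hard length limit, so a fixed buffer loses nothing.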

by Richard Boulton, 12 years ago

Attachment: optimise_unpack_uint.patch added

Optimise the unpack_uint function in chert with a specialised version for om_uint32

comment:3 by Richard Boulton, 12 years ago

Going back to the original profiling, with full size doclen chunks, 55% of CPU is spent in unpack_uint(). I've tried optimising this with a version specialised for uint32, with some success: the patch is in optimise_unpack_uint.patch. With this patch, the compiler inlines the calls to unpack_uint(), so it's hard to directly measure the difference, but search times go from 12.58 seconds to 8.54 seconds, so it's clearly beneficial. callgrind output with this patch is available at http://tartarus.org/richard/xapian_perf_analysis/callgrind.out.xapian_search_chert_optimised_unpack_uint_1

This output shows that the main caller of unpack_uint goes from 4.29 billion cycles without the patch, to 1.96 billion with it - more than doubling its speed.
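A decoder specialised for 32-bit values, in the spirit of (but not identical to) optimise_unpack_uint.patch, might look like this - assuming the usual variable-byte scheme where each byte carries 7 payload bits and the top bit flags continuation:

```cpp
#include <cassert>
#include <cstdint>

// Sketch of a variable-byte decoder specialised for 32-bit values.
// Fixing the result width lets the compiler unroll and inline the loop,
// which a generic template over arbitrary integer types makes harder.
inline bool unpack_uint32(const unsigned char*& p, const unsigned char* end,
                          std::uint32_t* result) {
    std::uint32_t value = 0;
    int shift = 0;
    while (p != end && shift < 35) {
        unsigned char b = *p++;
        value |= std::uint32_t(b & 0x7f) << shift;
        if (!(b & 0x80)) {  // top bit clear: final byte of this value
            *result = value;
            return true;
        }
        shift += 7;
    }
    return false;  // ran out of data, or value too wide for 32 bits
}
```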

comment:4 by Richard Boulton, 12 years ago

Description: modified (diff)

comment:5 by Richard Boulton, 12 years ago

Applying avoid_string_operations.patch and optimise_unpack_uint.patch together, I get a search time of 7.84 seconds with the full size doclen chunks, or 7.70 seconds with the small doclen chunks. I think this is getting towards the limit of what can be achieved with this kind of optimisation, and a better data structure for the doclens which can find document lengths without linear seeking is needed.

Looking at the callgrind output for this (with the full size doclen chunks), ChertPostList::move_to_chunk_containing() is currently using 221 757 608 cycles, whereas with flint, the entire set of calls to next() uses only 214 203 320 cycles. This implies that to get faster performance than flint with chert, we'll need to decrease the time spent in move_to_chunk_containing() - one way to do this would be to decrease the number of calls, by reducing the number of chunks (either by making them larger, or more compact).

comment:6 by Olly Betts, 12 years ago

I did wonder about how best to encode the doclengths at the time but just went for the same as that which the postlist encoding uses (once I'd removed the doclen from there) because that was least coding effort and it seemed to work well enough.

The requirement to be able to splice in updates restricts the choice of data structure somewhat. I'll have a think.

comment:7 by Richard Boulton, 12 years ago

I'd be happy to sacrifice a bit of index-time performance for better search speed...

I wondered about just adding a few skip positions; perhaps storing the offset of every 10th docid in the list (together with the difference between each 10th docid), so a skip_to() would skip to the right group of 10, and then skip to the right individual item.

This could be done hierarchically, and could pick the offsets probabilistically. This is basically a skip list, as described in http://en.wikipedia.org/wiki/Skip_list
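A single-level version of the skip-positions idea could be sketched like this (purely illustrative; the names and the stride of 10 are taken from the description above, not from any patch):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One level of skip positions over a list of docids: sample every 10th
// docid so skip_to() can jump to the right group of 10, then scan only
// within that group.
struct SkipIndex {
    static const std::size_t STRIDE = 10;
    std::vector<unsigned> sampled;  // docids[0], docids[10], docids[20], ...

    void build(const std::vector<unsigned>& docids) {
        sampled.clear();
        for (std::size_t i = 0; i < docids.size(); i += STRIDE)
            sampled.push_back(docids[i]);
    }

    // Return the index in the full list where the group that may contain
    // `target` begins; the caller then scans at most STRIDE entries.
    std::size_t group_start(unsigned target) const {
        std::size_t g = 0;
        while (g + 1 < sampled.size() && sampled[g + 1] <= target)
            ++g;
        return g * STRIDE;
    }
};
```

Adding further levels over the sampled docids turns this into a full skip list, trading extra space and update cost for O(log n) seeks.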

comment:8 by Richard Boulton, 12 years ago

Also, I wonder if it's just the doclength lists which could do with improved skip_to() ability - it's possible the other posting lists could also benefit, but aren't causing quite such obvious problems: either because they're being skipped through less often, or because the chunks contain fewer items since each item is larger - with flint, the items contain the doclengths, so they're going to be quite a bit larger.

comment:9 by Olly Betts, 12 years ago

Milestone: 1.1.0 → 1.1.1

We're not promising that the chert format is stable yet, so -> milestone:1.1.1

by Richard Boulton, 12 years ago

Attachment: chunktypes.patch added

A patch to add two different chunk types - sparse and compact

comment:10 by Richard Boulton, 12 years ago

I've just attached a patch which tries another approach - it adds a separate chunk type for the case where a contiguous run of docids is in the list. This should probably just be an option for the doclen lists, since it's rare to get contiguous runs in other situations, but it was easier to code it generally to start with.

This patch increases the performance quite nicely - from the original 12.58 seconds it goes to 6.8 seconds. get_doclength() is now taking 75% of the time spent under get_mset() (down from 85% unpatched), and of this only 66% is spent in move_forward_in_chunk_to_at_least() (down from 90% of time spent in get_doclength()). Only 5% of the time is spent in unpack_uint() - the majority of the time in move_forward_in_chunk_to_at_least() (86%) is spent in the tight loop iterating through the wdf values looking for the right one to decode.

If we had a datastructure which removed the need to spend any time in this tight loop, but didn't affect anything else (unrealistic, obviously, but an upper bound on what could happen), we could therefore expect the time spent in get mset to decrease by a proportion of: .75 * .66 * .86 = .42. This would result in a total time for the search of 3.9 seconds, rather than 6.8, which would be about 2500 queries per second. This is still half the speed of flint, so we're going to have to think of more than that if we want to equal flint's speed in chert.

(Admittedly, these single-term queries are a worst-case for chert versus flint, since we have two lists to look in rather than 1 - for a multi-term query, the lookup in the extra doclen list will be a lesser proportion of the total work. However, they're probably an important case, and certainly a good place to try and optimise as much as possible.)

comment:11 by Charlie Hull, 12 years ago

Cc: charlie@… added

comment:12 by Olly Betts, 12 years ago

Had an idea last night...

If the doclength encoding is a known fixed length per document, then skip ahead becomes O(1), and move_forward_in_chunk_to_at_least() is just a multiply and an add instruction.

There's some extra space lost to encoding lengths in more bytes than they need, but we also save the overhead of keeping skip lists and it's much easier to update this structure. Also, we only have one list of doclengths with one entry per document so compactness is less critical than it is for posting lists.

At its simplest, the idea would have an initial byte in the chunk giving the fixed width - obviously 1, 2, 3, or 4 bytes are useful. I'd guess it might be worth considering 12 bits, etc. We then encode lengths at that width until we get one which is too long. Updating may require a slightly fiddly split though.

Slightly more complex version:

Header byte = 2 bits for width (1,2,3,4 byte), 6 bits for number of entries of that size - 1

Then we have that many entries of the specified width

Appending requires checking the encoding, and if the width is OK and the count is less than 32, incrementing the count and appending the new length. If not, we append a new header byte and start encoding. We can easily track min, max, and/or average length of the data being appended to judge if we should reduce the encoding width when possible too.

Searching can step forward by up to 32 entries very cheaply. If that doesn't seem enough, we could use a 2 byte header, and potentially add some more widths. We could also keep a min length there and store deltas from it (e.g. 5 bits could store log2 of a width lower bound; 2 or 3 bits for the field width, and 9 or 8 bit count).

I'm going to dump out the gmane document lengths and have a look at how they lie, but perhaps not until 1.1.0 is done.
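The header-byte scheme described above could be sketched as follows (an illustrative sketch of the idea, not actual Xapian code; little-endian entry layout is an assumption):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Header byte: top 2 bits = entry width in bytes minus 1 (so 1..4),
// low 6 bits = number of entries in the run minus 1 (so 1..64).
inline unsigned char make_header(unsigned width, unsigned count) {
    return (unsigned char)(((width - 1) << 6) | (count - 1));
}

// With a fixed width per run, reaching the i-th entry is just a multiply
// and an add - no linear scan through variable-length encodings.
inline std::uint32_t read_entry(const std::string& run, unsigned char header,
                                unsigned i) {
    unsigned width = (header >> 6) + 1;
    std::uint32_t value = 0;
    for (unsigned b = 0; b < width; ++b)
        value |= std::uint32_t((unsigned char)run[i * width + b]) << (8 * b);
    return value;
}
```

Appending then only needs to check whether the new length fits the current width and whether the run count is full; otherwise a new header byte starts a fresh run.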

comment:13 by Olly Betts, 11 years ago

Milestone: 1.1.1 → 1.1.3

comment:14 by Olly Betts, 11 years ago

Priority: normal → highest

This is really a release blocker.

comment:15 by Olly Betts, 11 years ago

Owner: changed from Richard Boulton to Olly Betts
Status: assigned → new

I'm going to work on this ticket next, since it seems to be the most unavoidable of the three tickets left with priority "highest".

comment:16 by Olly Betts, 11 years ago

Milestone: 1.1.3 → 1.1.4
Status: new → assigned

I've committed more efficient pack and unpack functions for chert to trunk in r13578, which should improve the situation here. These sped up searching for "The" on gmane by about 35%. If that same improvement were seen in the original testcase, chert would be 4.6 times slower rather than 7 times, which is a good start.

This patch incorporates the ideas from avoid_string_appends_1.patch. I've not used the unrolled/specialised versions (yet anyway) as I'm worried that inlining these will cause excessive code size growth, filling the CPU cache more. So we need to re-benchmark them in a real situation, not just a microbenchmark. I did address the string::insert() issue which is one issue that the unrolling in avoid_string_appends_and_loop_unroll_2.patch is aimed at. Also, unpack_uint() now finds the length first (which allows a cleaner unpacking loop working backwards), so the trick of checking there are at least 5 bytes available and using a special unrolled version is likely to be less of a gain.

Interestingly, for the gmane search for "The", memcpy() dominates the callgrind profile (both before and after).

Bumping remaining work to 1.1.4.

comment:17 by Olly Betts, 11 years ago

Summary: Searches with chert are slow, due to slow doclen access → Some queries slower with chert than flint due to doclen access

Retitling to better describe the issue (before any of the fixes, chert managed 795 queries per second which isn't "slow", it's just slower than flint is on the same testcase), and this is particular to some queries - with more terms, or when the database isn't entirely in the OS cache, chert is likely to be faster (because it doesn't have to parse the doclength in every postlist, and because it uses less disk space).

The recent changes to the key format are likely to improve chert's speed relative to flint, although probably more for the "can't cache it all case".

While it would be nice to have chert faster than flint for every case before we make it the default backend in the current stable release, I don't think it is useful to hold up 1.2.0 indefinitely when chert is faster than flint for larger databases, just not in this one case you've singled out (and the fair release-blocking test would be trunk with chert against 1.0.18 with flint anyway).

So I'm currently thinking that we shouldn't hold up 1.2.0 for this. We can slot in further tweaks which don't change the format (e.g. to the pack and unpack routines) in 1.2.x if we find them.

Rewriting the doclength storage at this point is potentially destabilising, so I think it's better to put that in the new development backend (brass) instead since we haven't even started to code that yet.

Delaying 1.2.0 won't make brass any faster or slower relative to flint, but it will inevitably push 1.4.0 further back too, which is a long-term loss.

comment:18 by Olly Betts, 11 years ago

Component: Backend-Chert → Backend-Brass
Milestone: 1.1.4 → 1.2.x
Priority: highest → normal

Marking for brass which means we can address the doclength encoding in 1.2.x.

comment:19 by Olly Betts, 10 years ago

Summary: Some queries slower with chert than flint due to doclen access → Change doc length chunk encoding so skipping through a chunk is better than O(n)

retitling

comment:20 by Olly Betts, 9 years ago

Description: modified (diff)

comment:21 by Olly Betts, 8 years ago

1.3.x material now.

comment:22 by Olly Betts, 8 years ago

Milestone: 1.2.x → 1.3.x

comment:23 by Olly Betts, 6 years ago

Component: Backend-Brass → Backend-Glass

comment:24 by Olly Betts, 5 years ago

Milestone: 1.3.x → 1.4.x

It would be good to address this for 1.4.0. It's conceptually simple, but gets more complex in detail for cases when updating the length of an existing document which doesn't fit in the fixed width picked for that chunk.

Since we don't have anything close to a candidate patch for this at this point, I'm reluctantly going to postpone it. And while O(n) sounds bad, n is bounded by the fixed maximum size of a chunk, which is a bit under 2KB, so it actually has an O(1) bound, just with a larger constant than it could have.

comment:25 by Olly Betts, 4 years ago

Milestone: 1.4.x → 1.5.0

Something for the next development series.

comment:26 by Olly Betts, 3 years ago

Resolution: fixed
Status: assigned → closed

The new honey backend which I recently merged to master stores document lengths with a fixed width encoding.

Currently it's a fixed 4 bytes per entry, which is somewhat wasteful - for comparison, a typical glass document length entry needs 3 bytes (the document length values are typically >= 128 and < 16384, which takes 2 bytes, plus another byte to store the docid delta, which is always 0 unless there are deleted documents).
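With a fixed 4 bytes per entry, the lookup inside a chunk reduces to a multiply and an add. A sketch of the idea (illustrative only, not the actual honey code; little-endian layout is an assumption):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of an O(1) doclen lookup in a fixed-width chunk: entry i lives at
// byte offset i * 4, so no scanning is needed to reach an arbitrary docid.
inline std::uint32_t doclen_lookup(const std::vector<unsigned char>& chunk,
                                   std::uint32_t first_did,
                                   std::uint32_t did) {
    std::size_t off = std::size_t(did - first_did) * 4;
    // Assemble a little-endian 4-byte value.
    return std::uint32_t(chunk[off]) |
           std::uint32_t(chunk[off + 1]) << 8 |
           std::uint32_t(chunk[off + 2]) << 16 |
           std::uint32_t(chunk[off + 3]) << 24;
}
```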

My plan is to allow the width to vary per chunk - I'm not sure whether to use byte or bit granularity, or somewhere between. Bit granularity is obviously more compact, but the doclen data is not a huge amount of data, so the additional saving may not justify the increased complexity (and hence encoding and decoding time).

But while there's scope to improve further, the issue here is now addressed so closing.

Last edited 3 years ago by Olly Betts (previous) (diff)

comment:27 by Olly Betts, 3 years ago

Component: Backend-GlassBackend-Honey