Ticket #326 (assigned defect)
Some queries slower with chert than flint due to doclen access
| Reported by: | richard | Owned by: | olly |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.2.x |
| Component: | Backend-Brass | Version: | SVN trunk |
| Severity: | normal | Keywords: | |
| Cc: | charlie@… | Blocked By: | |
| Operating System: | All | Blocking: |
Description (last modified by richard) (diff)
I've built a benchmark database of slightly over 100,000 documents from wikipedia, and indexed these with flint and chert. When searching the resulting databases, with 10,000 single term searches, with the database fully cached, the flint database completes all searches in 1.78 seconds, whereas the chert database completes all searches in 12.58 seconds - ie, chert is about 7 times slower than flint.
Note that the chert database is considerably smaller than the flint database, which hopefully means that in the uncached case chert might perform better. However, with databases under 1m documents, we're likely to be IO bound, and performance will be much worse with chert.
Profiling the code with callgrind revealed that around 85% of the CPU time is being spent in get_doclength() calls, and around 90% of that time is spent in ChertPostList::move_forward_in_chunk_to_at_least(). This method calls next_in_chunk() repeatedly to find the appropriate doclen: on average, it calls next_in_chunk() around 30 times per call.
I don't think this degree of slowdown is acceptable, so we need to either find a way to make the code faster with the existing datastructure, or find a way to allow faster seeking in the doclen list.
I've done some experiments about this, which are detailed in the comments. So far, a combination of avoid_string_operations.patch and optimise_unpack_uint.patch give the best result, but still result in searches 4.5 times slower than with flint. I think a redesign of the way doclen information is stored is needed to get acceptable speeds.
