Opened 6 years ago
Closed 6 years ago
#782 closed enhancement (fixed)
Improve encode_length() performance
| Reported by: | German M. Bravo | Owned by: | Olly Betts |
|---|---|---|---|
| Priority: | normal | Milestone: | 1.5.0 |
| Component: | Backend-Remote | Version: | |
| Severity: | minor | Keywords: | |
| Cc: | | Blocked By: | |
| Blocking: | | Operating System: | All |
Description
One major bottleneck during some profiling I've been doing with the remote protocol is in `Document::serialise()`. The remote protocol uses serialization a lot, and the serializers use `encode_length()` pretty much everywhere; however, the current implementation of `encode_length()` could be made faster.
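For reference, the wire format `encode_length()` produces (paraphrased from the implementation quoted later in this ticket; this sketch is illustrative, not the verbatim xapian-core source) is: values 0..254 take a single byte; larger values get a 0xff marker byte followed by `value - 255` in 7-bit groups, least significant first, with the top bit set on the final group:

```cpp
#include <cassert>
#include <string>

// Illustrative re-statement of the encode_length() wire format; not the
// verbatim xapian-core implementation.
std::string encode_length_sketch(unsigned long long len) {
    std::string result;
    if (len < 255) {
        result += static_cast<char>(len);   // common case: one byte
        return result;
    }
    result += '\xff';                       // marker for the long form
    len -= 255;
    while (true) {
        unsigned char b = len & 0x7f;
        len >>= 7;
        if (!len) {
            result += static_cast<char>(b | 0x80);  // top bit = last group
            return result;
        }
        result += static_cast<char>(b);
    }
}

int main() {
    assert(encode_length_sketch(3) == "\x03");        // single byte
    assert(encode_length_sketch(255) == "\xff\x80");  // marker, then 0 with stop bit
    assert(encode_length_sketch(300) == "\xff\xad");  // marker, then 45 with stop bit
}
```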
I've re-implemented the function and opened a pull request at https://github.com/xapian/xapian/pull/239.
Benchmarking results are as follows:
```
2019-04-25 11:06:05
Running ./benchmark-encode_length
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original    1684 ns       1681 ns     414721
BM_EncodeLength_Optimized    481 ns        480 ns    1421164
```
Attached is the benchmarking code I used.
Edit: I've updated the patch to squeeze out a few more nanoseconds.
Attachments (3)
Change History (20)
by , 6 years ago
Attachment: benchmark-encode_length.cc added

Benchmarking code
comment:1 by , 6 years ago
Description: modified (diff)
comment:2 by , 6 years ago
Curiously I don't see nearly as much of a speed-up on x86 Linux:
```
2019-05-05 18:04:29
Running ./benchmark-encode_length
Run on (8 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original     957 ns        957 ns     725154
BM_EncodeLength_Optimized    752 ns        752 ns     914252
```
I had to comment out the `BENCHMARK_MAIN();` as otherwise linking fails with multiple definitions of `main`, but otherwise that's the same code you show and the same compilation command.
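(Side note for anyone reproducing this: `BENCHMARK_MAIN();` expands to a definition of `main()`, and `-lbenchmark_main` links Google Benchmark's stock `main()`, hence the duplicate definition. Keeping `BENCHMARK_MAIN();` and linking only the core library avoids it; a command along these lines, with the exact flags assumed rather than taken from the attachment:)

```
g++ -std=c++1z -O3 -o benchmark-encode_length benchmark-encode_length.cc -lbenchmark -lpthread
```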
comment:3 by , 6 years ago
I think your microbenchmark may be flawed.
It doesn't use the returned value at all, which is unrealistic and potentially allows the compiler to optimise away parts of the encoding routine. If the compiler can optimise away more of your optimised version than of the current version, yours will appear faster; but we'd never actually call it in a void context like that, so that's not a speed-up we'd ever see in practice.
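(For reference, Google Benchmark's standard tool for this is `benchmark::DoNotOptimize()`, which forces the compiler to treat a value as used; a minimal sketch, where `encode_length_original()` is assumed to be the function from the attached benchmark:)

```cpp
#include <benchmark/benchmark.h>
#include <string>

// Assumed to be provided by the attached benchmark-encode_length.cc.
std::string encode_length_original(unsigned long long len);

static void BM_EncodeLength_Original(benchmark::State& state) {
    unsigned long long x = ~0ULL;  // worst case: all bytes of the value used
    while (state.KeepRunning()) {
        // DoNotOptimize() forces the compiler to treat the result as used,
        // so the encoding work can't be deleted as dead code.
        benchmark::DoNotOptimize(encode_length_original(x));
    }
}
BENCHMARK(BM_EncodeLength_Original);
```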
I changed both loops to this, which is much more like the typical usage pattern:
```cpp
std::string s;
for (auto i = x; i != 0; i >>= 1) {
    s += encode_length_original(i);
}
```
But then both versions seem to take the same time within noise.
I'll re-test on x86-64 when I get a chance.
comment:4 by , 6 years ago
These are my timings after adding the `std::string s` and appending to it:
```
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original    2578 ns       2573 ns     244950
BM_EncodeLength_Optimized   1510 ns       1507 ns     591061
```
comment:5 by , 6 years ago
Maybe not as dramatic, but still faster. And the "optimized" code is clean, IMVHO; nothing eccentric or esoteric, and the generated instructions (at least for x86_64) promote processor pipelining and easy branch prediction.
comment:6 by , 6 years ago
I realised I had stupidly cut-and-pasted the loop above verbatim into both versions. With the appropriate function called in each case, I'm seeing ~1400ns vs ~1200ns for the `s +=` version. Not sure why fixing one version substantially changes the timings for the unchanged one, though. Maybe I should heed those `***WARNING***` messages.

> Maybe not as dramatic, but still faster. And the "optimized" code is clean, IMVHO; nothing eccentric or esoteric, and the generated instructions (at least for x86_64) promote processor pipelining and easy branch prediction.
I'm not arguing against optimising this - just trying to make sure we're optimising based on valid data.
comment:7 by , 6 years ago
I've done some more tests, and also now tested on x86-64 Linux.
The tl;dr version is that it looks to me like your code is probably better with clang, but probably slower with GCC. And that's clearly problematic.
I made some changes to the benchmark code. I don't have actual data, but I think the full range of `unsigned long long` is not a realistic test. Most encoded values will be term lengths or wdf values, both of which I'd expect to mostly fit in one byte. So I changed the start point to be `0xfff`.
I also changed the benchmark loop to what the libbenchmark docs recommend unless you need C++98 compatibility (and which they especially recommend for benchmarked code that doesn't take very long to execute, like ours).
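(Presumably that's the range-`for` form of the measurement loop; a sketch of what the updated benchmark likely looks like, with `encode_length_original()` and the `0xfff` start point as described above and the rest assumed:)

```cpp
static void BM_EncodeLength_Original(benchmark::State& state) {
    // The docs recommend iterating with a range-for over the State object
    // (C++11 and later): it has lower per-iteration overhead than
    // while (state.KeepRunning()), which matters for fast code like this.
    for (auto _ : state) {
        std::string s;
        for (unsigned long long i = 0xfff; i != 0; i >>= 1)
            s += encode_length_original(i);
        benchmark::DoNotOptimize(s);
    }
}
BENCHMARK(BM_EncodeLength_Original);
```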
With that here's what I get on Linux:
```
$ clang++-7 -std=c++1z -pedantic -Wall -Wextra -O3 -lbenchmark -lbenchmark_main -o benchmark-encode_length ./benchmark-encode_length.cc && ./benchmark-encode_length
2019-05-08 18:50:50
Running ./benchmark-encode_length
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original     155 ns        155 ns    4506744
BM_EncodeLength_Optimized    113 ns        113 ns    6190602

$ g++ -std=c++1z -pedantic -Wall -Wextra -O3 -lbenchmark -lbenchmark_main -o benchmark-encode_length ./benchmark-encode_length.cc && ./benchmark-encode_length
2019-05-08 18:50:59
Running ./benchmark-encode_length
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original      88 ns         88 ns    7929005
BM_EncodeLength_Optimized    117 ns        117 ns    5926981
```
So your optimised version improves with clang but regresses with GCC. Interestingly your version seems to have pretty much the same performance with both compilers, but the current code works much better with GCC than clang.
BTW, this is clang using libstdc++ - I tried with `-stdlib=libc++` but the built program segfaults.
I see essentially the same picture on x86 Linux.
So where do we go from here?
I think gathering some representative data on the values which get encoded would be useful, so we can actually make sure we're benchmarking something realistic. In particular, if the single-byte case dominates, that makes a difference.
Having some real data could also inform a decision about whether the current encoding is actually a good choice. Using the simpler format which `pack_uint()` produces is an option, for example.
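(For reference, `pack_uint()` produces an LEB128-style variable-length integer; a paraphrased sketch, not the verbatim pack.h source:)

```cpp
// Sketch of pack_uint(): 7 bits per byte, least significant group first;
// a set top bit means "more bytes follow", a clear top bit ends the value.
template<class U>
void pack_uint_sketch(std::string& s, U value) {
    while (value >= 128) {
        s += static_cast<char>(value | 0x80);  // low 7 bits + continuation bit
        value >>= 7;
    }
    s += static_cast<char>(value);             // final byte, top bit clear
}
```

So values under 128 take a single byte, with no special-case marker like `encode_length()`'s 0xff.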
Do you have a real-world xapiand deployment you could log such data from? If not, I can probably find somewhere to get suitable data. Or perhaps we both should, to get some idea of how much variability there is.
by , 6 years ago
Attachment: benchmark-encode_length.2.cc added
updated benchmark on narrower range of values
comment:8 by , 6 years ago
That is weird. I’ll definitely dig further here, and I’ll also get that real-world sampling.
comment:9 by , 6 years ago
Yes, apparently gcc does a better job of producing fast code here. I tweaked a new "optimized" version which really only moves the returns inside the `if()`s (for some reason that gives a little bit of gain):
```cpp
template<class T>
std::string encode_length_optimized(T len) {
    std::string result;
    if (len < 255) {
        result += static_cast<unsigned char>(len);
        return result;
    } else {
        result += '\xff';
        len -= 255;
        while (true) {
            unsigned char b = static_cast<unsigned char>(len & 0x7f);
            len >>= 7;
            if (!len) {
                result += b | static_cast<unsigned char>(0x80);
                break;
            }
            result += b;
        }
        return result;
    }
}
```
gcc with `(x == 0xfff)`:
```
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
Load Average: 1.70, 2.76, 3.47
--------------------------------------------------------------------
Benchmark                      Time           CPU Iterations
--------------------------------------------------------------------
BM_EncodeLength_Original     132 ns        132 ns    5234075
BM_EncodeLength_Optimized    131 ns        131 ns    5338377
BM_EncodeLength_Pack         108 ns        108 ns    6482442
```
clang with `(x == 0xfff)`:
```
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original     220 ns        220 ns    3019206
BM_EncodeLength_Optimized    219 ns        219 ns    3132061
BM_EncodeLength_Pack          55 ns         55 ns   12631731
```
On the other hand, this is what I'm seeing `encode_length()` being used for:
```
len(sample) = 27,712,252
min(sample) = 0
max(sample) = 254
statistics.median(sample) = 3.0
statistics.mean(sample) = 20.126449596260276
statistics.stdev(sample) = 42.358357658089815
statistics.variance(sample) = 1794.2304634906563
```
So benchmarks with `x == 0xff` follow.
gcc with `(x == 0xff)`:
```
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
Load Average: 1.24, 2.00, 2.18
--------------------------------------------------------------------
Benchmark                      Time           CPU Iterations
--------------------------------------------------------------------
BM_EncodeLength_Original    28.0 ns       28.0 ns   23851060
BM_EncodeLength_Optimized   20.5 ns       20.5 ns   33096144
BM_EncodeLength_Pack        19.0 ns       19.0 ns   36725357
BM_EncodeLength_Char        12.8 ns       12.8 ns   53820486
```
clang with `(x == 0xff)`:
```
2019-05-08 10:37:56
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original     115 ns        115 ns    5985310
BM_EncodeLength_Optimized    115 ns        114 ns    5641522
BM_EncodeLength_Pack          29 ns         29 ns   23824029
BM_EncodeLength_Char          26 ns         26 ns   26767312
```
Given that the statistics from a real-world example (above) show that `len` is almost always (if not always) less than or equal to 254, the second set of benchmarks is more relevant. Clearly, for clang, pack makes a big difference, but not so much with gcc. The "old optimized" version failed to produce any improvements in gcc no matter what, so I discarded it altogether; the "new optimized" version, with the returns inside the `if()`s, makes very little difference in clang but, with gcc, makes `encode_length()` for these small numbers almost as fast as `pack_uint()`. Other than always simply sending the length as a single char (which is evidently always much faster), I'd propose only moving the returns inside the `if()`s (and using `static_cast`, for style only).
by , 6 years ago
Attachment: benchmark-encode_length.3.cc added
Updated optimized version + added pack_uint and pack_char
comment:10 by , 6 years ago
Working on getting some real world data, but in the test suite we see larger sizes:
```
Data points: 1584397
Top 20 values with frequencies:
590387  1
230343  4
149371  3
 99777  0
 25724  2
 17371  6
 13834  18
 13548  8
 11340  5
  8428  7
  8179  40
  7944  61
  6418  9
  5941  11
  5646  96
  5243  21
  5082  4294967295
  4801  10
  4376  17
  3804  16
```
(4294967295 is 0xffffffff so something suspect is going on there...)
127058 values (including those 5082 0xffffffff values) are > 254, which is 8%. Excluding the 0xffffffff values, the mean is 36717.040639771.
But the testsuite data is typically rather artificial.
comment:11 by , 6 years ago
Those 0xffffffff values seem to be `BAD_VALUENO` being passed for the sort key when there's no sorting by value, so that'll currently happen up to once per remote db per search. It's easy to adjust the protocol to avoid this (on master at least - for 1.4 it's probably best to just leave it be).
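(One illustrative way to do that - not necessarily the change actually made, and `append_sort_slot()` is a hypothetical helper - is to bias the slot by one so that "no sorting" serialises as a single 0 byte:)

```cpp
// Hypothetical sketch: serialise "no sort slot" as 0 and real slots as
// slot + 1, so BAD_VALUENO (0xffffffff) never hits the wire in its
// multi-byte encoded form. pack_uint() here is xapian-core's internal
// helper (sketched above).
void append_sort_slot(std::string& msg, Xapian::valueno slot) {
    pack_uint(msg, slot == Xapian::BAD_VALUENO ? 0 : slot + 1);
}
```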
comment:12 by , 6 years ago
Status: new → assigned
BTW, the data above came from adding logging to stderr in `encode_length()` and then running `./apitest -b remoteprog_glass`, so it doesn't include the unit tests of `encode_length()` - only use of the remote protocol by API tests (in both directions).
comment:13 by , 6 years ago
Tweaking the remote protocol to not send the sort slot when we aren't sorting by value at all gives me:
```
Data points: 1579310
Top 20 values with frequencies:
590387  1
230343  4
149371  3
 99771  0
 25724  2
 17371  6
 13834  18
 13548  8
 11340  5
  8428  7
  8179  40
  7948  61
  6418  9
  5941  11
  5674  90
  5243  21
  4801  10
  4376  17
  3804  16
  3573  15
```
121954 of those are > 254, which is 7.72%. The mean is now 39436.6589466286.
I looked at some real-world data (334230 documents from English Wikipedia over two databases) and those have far more large values.
However, calculating the size difference between `encode_length()` and `pack_uint()`, the latter is about 21% smaller for the real-world data and 42% smaller for the testsuite data. For your data it would be a little larger (since `encode_length()` is always one byte there, while `pack_uint()` needs two for values >= 128) - I can't calculate exactly and don't see an easy way to approximate, but given the mean, median and SD there can't be many values >= 128.
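(A worked example of that size trade-off, computed by hand from the two encodings as sketched earlier in this ticket:)

```
value   encode_length()          pack_uint()
127     1 byte  (0x7f)           1 byte  (0x7f)
200     1 byte  (0xc8)           2 bytes (0xc8 0x01)
254     1 byte  (0xfe)           2 bytes (0xfe 0x01)
300     2 bytes (0xff 0xad)      2 bytes (0xac 0x02)
```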
And `pack_uint()` is faster in your tests, and mine confirm that - on x86-64 Linux I get:
```
olly@gemse:~/git/xapian$ g++ -O2 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main
olly@gemse:~/git/xapian$ ./a.out
2019-06-07 13:23:14
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original      47 ns         47 ns   14689866
BM_EncodeLength_Optimized     50 ns         50 ns   14041593
BM_EncodeLength_Pack          23 ns         23 ns   30639009
BM_EncodeLength_Char          15 ns         15 ns   45907718

olly@gemse:~/git/xapian$ g++ -O3 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main
olly@gemse:~/git/xapian$ ./a.out
2019-06-07 13:23:29
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original      46 ns         46 ns   15182687
BM_EncodeLength_Optimized     47 ns         47 ns   15025481
BM_EncodeLength_Pack          23 ns         22 ns   30794134
BM_EncodeLength_Char           9 ns          9 ns   76277125

olly@gemse:~/git/xapian$ g++ --version
g++ (Debian 8.3.0-7) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

olly@gemse:~/git/xapian$ clang++-7 -O2 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main
olly@gemse:~/git/xapian$ ./a.out
2019-06-07 13:24:10
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original      60 ns         60 ns   11231889
BM_EncodeLength_Optimized     61 ns         61 ns   11449092
BM_EncodeLength_Pack          23 ns         23 ns   30410274
BM_EncodeLength_Char          10 ns         10 ns   70272189

olly@gemse:~/git/xapian$ clang++-7 -O3 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main
olly@gemse:~/git/xapian$ ./a.out
2019-06-07 13:24:20
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                      Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original      63 ns         63 ns   11215143
BM_EncodeLength_Optimized     64 ns         64 ns   11054774
BM_EncodeLength_Pack          23 ns         23 ns   29497909
BM_EncodeLength_Char          10 ns         10 ns   71177986
```
We can't switch to the even faster "Char" encoding, since that can't represent values > 255.
So I think we should switch to `pack_uint()` for the remote protocol. It's faster, looks like it will usually be more compact, and it's one fewer encoding to have to worry about.
comment:14 by , 6 years ago
Yes, well, now I'm wondering if I did the sampling right... I have two encode_length functions: the one inside the bundled xapian, and the one I use in my re-implementation of the remote protocol. I'll double-check that, because if the other one is always < 255, we could use char there.
comment:15 by , 6 years ago
I've so far gone through and converted the encoding side to `pack_uint()` - based on that, I don't believe there's anything significant in Xapian's remote protocol which uses `encode_length()` where the value could never be > 255.
There are places where it'll always fit in a byte for most of the current backends (the disk-based backends currently impose a term length limit a bit under 255, though that could change in future, and inmemory doesn't have such a limit).
However, I noticed that a lot of the current uses of `encode_length()` are for the final entry in a string, and there we can use `pack_uint_last()` instead, which just stores the bytes of the integer up to the most significant non-zero byte. Adding that to your benchmark, I find it's the same speed as "Char" for GCC (-O2 or -O3) and for clang at -O3; for clang at -O2 it's in between in speed:
```
BM_EncodeLength_Pack          23 ns         23 ns   30996598
BM_EncodeLength_PackLast      18 ns         18 ns   40014832
BM_EncodeLength_Char          10 ns         10 ns   70941804
```
So we should probably use that where we can. It's also a slightly more compact encoding than `pack_uint()`.
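(A paraphrased sketch of that encoding - raw little-endian bytes with leading zero bytes dropped; the decoder knows where the value ends because it's the last thing in the message:)

```cpp
// Sketch of pack_uint_last(): only usable for the final field, since the
// byte count is implied by "whatever remains of the string". Zero encodes
// to zero bytes.
template<class U>
void pack_uint_last_sketch(std::string& s, U value) {
    while (value) {
        s += static_cast<char>(value & 0xff);
        value >>= 8;
    }
}
```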
comment:16 by , 6 years ago
This is great! I can’t wait for that commit so I can test whether it helps overall.
comment:17 by , 6 years ago
Milestone: → 1.5.0
Resolution: → fixed
Status: assigned → closed
Implemented on master in [db790e9e12bb9b3ebeaf916ac0acdea9a7ab0dd1]. We can't backport this for 1.4.