Opened 5 years ago

Closed 5 years ago

#782 closed enhancement (fixed)

Improve encode_length() performance

Reported by: German M. Bravo
Owned by: Olly Betts
Priority: normal
Milestone: 1.5.0
Component: Backend-Remote
Version:
Severity: minor
Keywords:
Cc:
Blocked By:
Blocking:
Operating System: All

Description (last modified by German M. Bravo)

One major bottleneck I found while profiling use of the remote protocol is Document::serialise(). The remote protocol uses serialization a lot, and the serializers use encode_length() pretty much everywhere; however, the current implementation of encode_length() can be made more performant.

I've re-implemented the function and opened a pull request at https://github.com/xapian/xapian/pull/239.
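
To give an idea of the usage pattern, the serializers essentially length-prefix each field before appending its raw bytes. A minimal sketch (illustrative only; serialise_term() and its fields are assumptions, not the actual Document::serialise() code):

// Illustrative only: how a serializer typically uses encode_length() as a
// length prefix.  serialise_term() is a made-up helper, not Xapian internals.
std::string serialise_term(const std::string& term, unsigned wdf) {
    std::string out;
    out += encode_length(wdf);          // variable-length encoded integer
    out += encode_length(term.size());  // length prefix for the term...
    out += term;                        // ...followed by its raw bytes
    return out;
}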

Benchmarking results are as follows:

2019-04-25 11:06:05
Running ./benchmark-encode_length
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original        1684 ns       1681 ns     414721
BM_EncodeLength_Optimized        481 ns        480 ns    1421164

Attached is the benchmarking code I used.

Edit: I've updated the patch to squeeze out a few more nanoseconds.

Attachments (3)

benchmark-encode_length.cc (1.6 KB) - added by German M. Bravo 5 years ago.
Benchmarking code
benchmark-encode_length.2.cc (1.6 KB) - added by Olly Betts 5 years ago.
updated benchmark on narrower range of values
benchmark-encode_length.3.cc (2.4 KB) - added by German M. Bravo 5 years ago.
Updated optimized version + added pack_uint and pack_char


Change History (20)

by German M. Bravo, 5 years ago

Attachment: benchmark-encode_length.cc added

Benchmarking code

comment:1 by German M. Bravo, 5 years ago

Description: modified

comment:2 by Olly Betts, 5 years ago

Curiously I don't see nearly as much of a speed-up on x86 Linux:

2019-05-05 18:04:29
Running ./benchmark-encode_length
Run on (8 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original         957 ns        957 ns     725154
BM_EncodeLength_Optimized        752 ns        752 ns     914252

I had to comment out BENCHMARK_MAIN(); since otherwise linking fails with multiple definitions of main, but apart from that it's the same code you show and the same compilation command.

comment:3 by Olly Betts, 5 years ago

I think your microbenchmark may be flawed.

It doesn't use the returned value at all, which is unrealistic and potentially allows the compiler to optimise away parts of the encoding routine. If it can optimise away more of your optimised version than of the current version, yours will appear faster, but we'd never actually call it in a void context like that, so that's not a speed-up we'd ever see in practice.

I changed both loops to this, which is much more like the typical usage pattern:

		std::string s;
		for (auto i = x; i != 0; i >>= 1) {
			s += encode_length_original(i);
		}

But then both versions seem to take the same time within noise.
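
(Aside: libbenchmark's benchmark::DoNotOptimize() is another way to stop the compiler discarding a result it considers dead; a minimal sketch of that approach, not what the attached code does:)

std::string s;
for (auto i = x; i != 0; i >>= 1) {
    s += encode_length_original(i);
}
// Tell the benchmark library to treat s as used, so the encoding work can't
// be optimised away even though nothing reads the string afterwards.
benchmark::DoNotOptimize(s);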

I'll re-test on x86-64 when I get a chance.

comment:4 by German M. Bravo, 5 years ago

These are my timings after adding the std::string s and appending to it:

Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original        2578 ns       2573 ns     244950
BM_EncodeLength_Optimized       1510 ns       1507 ns     591061

comment:5 by German M. Bravo, 5 years ago

Maybe not as dramatic, but still faster. And the "optimized" code is clean, IMVHO, nothing eccentric or esoteric; and the generated instructions (at least for x86_64) promote processor pipelining and easy branch prediction.

comment:6 by Olly Betts, 5 years ago

I realised I had stupidly cut-and-pasted the loop above verbatim into both versions. With the appropriate function called in each case I'm seeing ~1400ns vs ~1200ns for the s += version. Not sure why fixing one version changes the timings for the unchanged one substantially though. Maybe I should heed those ***WARNING*** messages.

> Maybe not as dramatic, but still faster. And the "optimized" code is clean, IMVHO, nothing eccentric or esoteric; and the generated instructions (at least for x86_64) promote processor pipelining and easy branch prediction.

I'm not arguing against optimising this - just trying to make sure we're optimising based on valid data.

comment:7 by Olly Betts, 5 years ago

I've done some more tests, and also now tested on x86-64 Linux.

The tl;dr version is that it looks to me like your code is probably better with clang, but probably slower with GCC. And that's clearly problematic.

I made some changes to the benchmark code. I don't have actual data, but I think the full range of unsigned long long is not a realistic test. Most encoded values will be term lengths or wdf values, both of which I'd expect to mostly fit in one byte. So I changed the start point to be 0xfff.

I also changed the benchmark loop to the form the libbenchmark docs recommend unless you need C++98 compatibility (and which they especially recommend for benchmarked code that doesn't take very long to execute, like ours).
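
(A sketch of that recommended loop style; the actual updated code is in the attached benchmark-encode_length.2.cc and may differ in detail:)

static void BM_EncodeLength_Original(benchmark::State& state) {
    // Range-based loop over the benchmark state, as the docs recommend for
    // C++11 and later.
    for (auto _ : state) {
        std::string s;
        for (unsigned long long i = 0xfff; i != 0; i >>= 1)
            s += encode_length_original(i);
        benchmark::DoNotOptimize(s);
    }
}
BENCHMARK(BM_EncodeLength_Original);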

With that here's what I get on Linux:

$ clang++-7 -std=c++1z -pedantic -Wall -Wextra -O3 -lbenchmark -lbenchmark_main -o benchmark-encode_length ./benchmark-encode_length.cc && ./benchmark-encode_length 
2019-05-08 18:50:50
Running ./benchmark-encode_length
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original         155 ns        155 ns    4506744
BM_EncodeLength_Optimized        113 ns        113 ns    6190602
$ g++ -std=c++1z -pedantic -Wall -Wextra -O3 -lbenchmark -lbenchmark_main -o benchmark-encode_length ./benchmark-encode_length.cc && ./benchmark-encode_length 
2019-05-08 18:50:59
Running ./benchmark-encode_length
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original          88 ns         88 ns    7929005
BM_EncodeLength_Optimized        117 ns        117 ns    5926981

So your optimised version improves with clang but regresses with GCC. Interestingly your version seems to have pretty much the same performance with both compilers, but the current code works much better with GCC than clang.

BTW, this is clang using libstdc++; I tried with -stdlib=libc++ but the built program segfaults.

I see essentially the same picture on x86 Linux.

So where do we go from here?

I think gathering some representative data on the values which get encoded would be useful, so we can actually make sure we're benchmarking something realistic. In particular if the single byte case dominates that makes a difference.

Having some real data could also inform a decision about whether the current encoding is actually a good choice. Using the simpler format which pack_uint() produces is an option, for example.
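
(For context, pack_uint() produces a plain varint-style format: roughly 7 bits per byte, with the top bit flagging that more bytes follow. The sketch below shows the general idea only, not the actual code in Xapian's pack.h:)

// Sketch of a generic 7-bits-per-byte varint encoder in the spirit of
// pack_uint(); the exact bit convention in the real pack.h may differ.
template<class U>
void pack_uint_sketch(std::string& s, U value) {
    while (value >= 0x80) {
        s += static_cast<char>((value & 0x7f) | 0x80);  // continuation bit set
        value >>= 7;
    }
    s += static_cast<char>(value);  // final byte, top bit clear
}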

Do you have a real-world xapiand deployment you could log such data from? If not, I'll see if I can find somewhere suitable to get data from. Or perhaps we both should, to get some idea of how much variability there is.

Last edited 5 years ago by Olly Betts

by Olly Betts, 5 years ago

updated benchmark on narrower range of values

comment:8 by German M. Bravo, 5 years ago

That is weird. I’ll definitely dig further here. I’ll also get that real-world sampling.

comment:9 by German M. Bravo, 5 years ago

Yes, apparently gcc does a better job of producing fast code here. I tweaked a new "optimized" version which really only moves the returns inside the if()s (for some reason that gives a little bit of a gain):

template<class T>
std::string
encode_length_optimized(T len)
{
    std::string result;
    if (len < 255) {
        // Lengths below 255 fit in a single byte.
        result += static_cast<unsigned char>(len);
        return result;
    } else {
        // Otherwise: 0xff marker, then the remainder in 7-bit groups, least
        // significant first, with the top bit set on the final group.
        result += '\xff';
        len -= 255;
        while (true) {
            unsigned char b = static_cast<unsigned char>(len & 0x7f);
            len >>= 7;
            if (!len) {
                result += b | static_cast<unsigned char>(0x80);
                break;
            }
            result += b;
        }
        return result;
    }
}

gcc with (x == 0xfff):

Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
Load Average: 1.70, 2.76, 3.47
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_EncodeLength_Original         132 ns          132 ns      5234075
BM_EncodeLength_Optimized        131 ns          131 ns      5338377
BM_EncodeLength_Pack             108 ns          108 ns      6482442

clang with (x == 0xfff):

Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original         220 ns        220 ns    3019206
BM_EncodeLength_Optimized        219 ns        219 ns    3132061
BM_EncodeLength_Pack              55 ns         55 ns   12631731

On the other hand, this is what I'm seeing encode_length() being used for:

  len(sample) = 27,712,252
  min(sample) = 0
  max(sample) = 254
  statistics.median(sample) = 3.0
  statistics.mean(sample) = 20.126449596260276
  statistics.stdev(sample) = 42.358357658089815
  statistics.variance(sample) = 1794.2304634906563

So benchmarks with x == 0xff follow.

gcc with (x == 0xff):

Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
Load Average: 1.24, 2.00, 2.18
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_EncodeLength_Original        28.0 ns         28.0 ns     23851060
BM_EncodeLength_Optimized       20.5 ns         20.5 ns     33096144
BM_EncodeLength_Pack            19.0 ns         19.0 ns     36725357
BM_EncodeLength_Char            12.8 ns         12.8 ns     53820486

clang with (x == 0xff):

2019-05-08 10:37:56
Running ./benchmark-encode_length.2
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 262K (x2)
  L3 Unified 4194K (x1)
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original         115 ns        115 ns    5985310
BM_EncodeLength_Optimized        115 ns        114 ns    5641522
BM_EncodeLength_Pack              29 ns         29 ns   23824029
BM_EncodeLength_Char              26 ns         26 ns   26767312

Given that the statistics from a real-world example (above) show that len is almost always (if not always) less than or equal to 254, the second set of benchmarks is more relevant. Clearly, for clang, pack makes a big difference, but not so much with gcc. The "old optimized" version failed to produce any improvement with gcc no matter what, so I discarded it altogether; the "new optimized" version, with the returns inside the if()s, makes very little difference with clang but makes encode_length() for these small numbers almost as fast as pack_uint() with gcc. Other than always simply sending the length as a single char (which is evidently much faster), I'd propose only moving the returns inside the if()s (and using static_cast, for style only).

by German M. Bravo, 5 years ago

Updated optimized version + added pack_uint and pack_char

comment:10 by Olly Betts, 5 years ago

Working on getting some real world data, but in the test suite we see larger sizes:

Data points: 1584397
Top 20 values with frequencies:
 590387 1
 230343 4
 149371 3
  99777 0
  25724 2
  17371 6
  13834 18
  13548 8
  11340 5
   8428 7
   8179 40
   7944 61
   6418 9
   5941 11
   5646 96
   5243 21
   5082 4294967295
   4801 10
   4376 17
   3804 16

(4294967295 is 0xffffffff so something suspect is going on there...)

127058 values (including those 5082 0xffffffff values) are > 254, which is 8%. Excluding the 0xffffffff values, the mean is 36717.040639771.

But the testsuite data is typically rather artificial.

comment:11 by Olly Betts, 5 years ago

Those 0xffffffff values seem to be BAD_VALUENO being passed for the sort key when there's no sorting by value, so that'll currently happen up to once per remote db per search. It's easy to adjust the protocol to avoid this (on master at least; for 1.4 it's probably best to just leave it be).

comment:12 by Olly Betts, 5 years ago

Status: new → assigned

BTW, the data above came from adding logging to stderr in encode_length() and then running ./apitest -b remoteprog_glass, so it doesn't include unit tests of encode_length(), only use of the remote protocol by API tests (in both directions).
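
(Purely as an illustration of that kind of instrumentation; the actual change isn't shown in this ticket, and encode_length_original() here just stands in for the unmodified encoder:)

#include <iostream>

// Hypothetical sketch: log each value passed to the length encoder so the
// distribution can be analysed offline, then encode exactly as before.
template<class T>
std::string encode_length_logged(T len) {
    std::cerr << len << '\n';            // one value per line on stderr
    return encode_length_original(len);  // unchanged encoding
}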

comment:13 by Olly Betts, 5 years ago

Tweaking the remote protocol to not send the sort slot when we aren't sorting by value at all gives me:

Data points: 1579310
Top 20 values with frequencies:
 590387 1
 230343 4
 149371 3
  99771 0
  25724 2
  17371 6
  13834 18
  13548 8
  11340 5
   8428 7
   8179 40
   7948 61
   6418 9
   5941 11
   5674 90
   5243 21
   4801 10
   4376 17
   3804 16
   3573 15

121954 of those are > 254, which is 7.72%. The mean is now 39436.6589466286.

I looked at some real-world data (334230 documents from English Wikipedia over two databases) and those have a lot more large values.

However, calculating the size difference between encode_length() and pack_uint(), the latter is about 21% smaller for the real-world data and 42% smaller for the testsuite data. For your data it would be a little larger (since encode_length() is always one byte there, while pack_uint() needs two for values >= 128); I can't calculate exactly and don't see an easy way to approximate, but given the mean, median and SD there can't be many values >= 128.

And pack_uint() is faster in your tests, and mine confirm that - on x86-64 Linux I get:

olly@gemse:~/git/xapian$ g++ -O2 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main 
olly@gemse:~/git/xapian$ ./a.out 
2019-06-07 13:23:14
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original          47 ns         47 ns   14689866
BM_EncodeLength_Optimized         50 ns         50 ns   14041593
BM_EncodeLength_Pack              23 ns         23 ns   30639009
BM_EncodeLength_Char              15 ns         15 ns   45907718
olly@gemse:~/git/xapian$ g++ -O3 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main 
olly@gemse:~/git/xapian$ ./a.out 
2019-06-07 13:23:29
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original          46 ns         46 ns   15182687
BM_EncodeLength_Optimized         47 ns         47 ns   15025481
BM_EncodeLength_Pack              23 ns         22 ns   30794134
BM_EncodeLength_Char               9 ns          9 ns   76277125
olly@gemse:~/git/xapian$ g++ --version
g++ (Debian 8.3.0-7) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

olly@gemse:~/git/xapian$ clang++-7 -O2 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main 
olly@gemse:~/git/xapian$ ./a.out 
2019-06-07 13:24:10
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original          60 ns         60 ns   11231889
BM_EncodeLength_Optimized         61 ns         61 ns   11449092
BM_EncodeLength_Pack              23 ns         23 ns   30410274
BM_EncodeLength_Char              10 ns         10 ns   70272189
olly@gemse:~/git/xapian$ clang++-7 -O3 -W -Wall benchmark-encode_length.3.cc -lbenchmark -lbenchmark_main 
olly@gemse:~/git/xapian$ ./a.out 
2019-06-07 13:24:20
Running ./a.out
Run on (8 X 3900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-----------------------------------------------------------------
Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BM_EncodeLength_Original          63 ns         63 ns   11215143
BM_EncodeLength_Optimized         64 ns         64 ns   11054774
BM_EncodeLength_Pack              23 ns         23 ns   29497909
BM_EncodeLength_Char              10 ns         10 ns   71177986

We can't switch to the even faster "Char" encoding, since that can't represent values > 255.

So I think we should switch to pack_uint for the remote protocol. It's faster, looks like it will usually be more compact, and it's one fewer encoding to have to worry about.

comment:14 by German M. Bravo, 5 years ago

Yes, well, now I'm wondering if I did the sampling right... I have two encode_length functions, the one inside the bundled xapian, and the one I use in my re-implementation of the remote protocol. I'll double-check that, because if the other one is always < 255, we could use char there.

comment:15 by Olly Betts, 5 years ago

I've so far gone through and converted the encoding side to pack_uint(); based on that, I don't believe there's anything significant in Xapian's remote protocol which uses encode_length() where the value could never be > 255.

There are places where it'll always fit in a byte for most of the current backends (the disk-based backends currently impose a term length limit a bit under 255, though that could change in future and inmemory doesn't have such a limit).

However, I noticed that a lot of the current uses of encode_length() are for the final entry in a string, and there we can use pack_uint_last() instead, which just stores the bytes from the integer up to the most significant non-zero byte. Adding that to your benchmark, I find it's the same speed as "Char" for GCC (-O2 or -O3) and for clang at -O3. For clang at -O2 it's in between in speed:

BM_EncodeLength_Pack              23 ns         23 ns   30996598
BM_EncodeLength_PackLast          18 ns         18 ns   40014832
BM_EncodeLength_Char              10 ns         10 ns   70941804

So we should probably use that where we can. It's also a slightly more compact encoding than pack_uint().
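
(The idea behind pack_uint_last() is just to emit the value's bytes up to the most significant non-zero one; a rough sketch of that idea, not the actual pack.h code:)

// Sketch: write the value's bytes, least significant first, stopping at the
// most significant non-zero byte.  Only usable for the final field in a
// message, since the decoder just reads until the end of the string.
template<class U>
void pack_uint_last_sketch(std::string& s, U value) {
    while (value) {
        s += static_cast<char>(value & 0xff);
        value >>= 8;
    }
}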

comment:16 by German M. Bravo, 5 years ago

This is great! I can’t wait for that commit, so I can test whether it helps overall.

comment:17 by Olly Betts, 5 years ago

Milestone: 1.5.0
Resolution: fixed
Status: assigned → closed

Implemented on master in [db790e9e12bb9b3ebeaf916ac0acdea9a7ab0dd1]. We can't backport this for 1.4.
