wiki:GSoC2014/Posting list encoding improvements/Journal

Community Bonding Week 2: April 28-May 4

Community Bonding Week 3: May 5-May 11

May 8

I implemented a function
void buildSkipList( vector<unsigned>& src, int level )
@level: desired height of the skiplist
@src: the delta-encoded source vector, e.g. if we have a map<docid,doclen> m with m[4]=3, m[5]=7, m[9]=2, m[15]=3, m[16]=4, m[18]=11, m[27]=20, m[37]=40,
then src will be generated as ( 0 3 1 7 4 2 6 3 1 4 2 11 9 20 10 40 )
After calling buildSkipList( src, 2 ), src becomes
( -1 12 11 -1 2 1 0 3 1 7 -1 2 10 4 2 6 3 -1 .... )
-1 marks the start of skip info.
(12 11) means that by moving forward 12 positions from just after this skip entry, the docid increases by 11.
Then I implemented
int getData( docid did, docid bias, const vector<unsigned>& src )
to search for the doclength of @did.
I also wrote some test functions for the two methods above.
My current plan:
first, implement a skiplist ( build, insert, delete, search ) based on a vector,
then encode this vector into a chunk (string).
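The delta encoding of @src can be sketched like this; make_src and its signature are illustrative assumptions, not the actual code, and the bias is assumed to be the first docid of the chunk:

```cpp
#include <cassert>
#include <map>
#include <vector>

// Sketch: build the delta-encoded source vector from a map<docid,doclen>.
// Each entry contributes (docid - previous docid, doclen); the first delta
// is taken against a bias (assumed to be the first docid of the chunk).
std::vector<unsigned> make_src(const std::map<unsigned, unsigned>& m,
			       unsigned bias)
{
    std::vector<unsigned> src;
    unsigned prev = bias;
    for (const auto& entry : m) {
	src.push_back(entry.first - prev); // docid stored as a delta
	src.push_back(entry.second);       // doclen stored as-is
	prev = entry.first;
    }
    return src;
}
```

With the example map above and bias 4, this reproduces ( 0 3 1 7 4 2 6 3 1 4 2 11 9 20 10 40 ).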

May 10

I ran into some trouble while devising delete and insert for the skiplist.
As I transform the normal skiplist structure into a vector, much of the skiplist's information is lost,
so handling delete may be somewhat troublesome.
For delete, assume we just remove the desired tuple (docid,doclen) and fix every affected offset,
without changing the structure of the skiplist ( i.e. without decreasing or increasing its height ).
When I try to delete a tuple (docid,doclen), there may be some "pointers" (-1,x,x) pointing to that tuple.
Finding all these pointers and making sure every offset is accurate after the delete needs much extra work.
And as I don't change the structure of the skiplist, after some deletes and inserts it may become highly unbalanced,
so search performance will suffer.
That is one approach to delete and insert.
Another approach is to simply rebuild the whole skiplist every time I delete or insert a tuple.
I can't decide which one to choose.
As this skiplist is designed to store the doclens of all docs, if searches greatly outnumber deletes/inserts, I prefer the second approach.

Community Bonding Week 4: May 12-May 19

May 12

I implemented merge_doclen_changes in the skiplist.
The level of the skiplist can now be determined automatically.
Currently, the height of the skiplist is log4(n), where n is the number of entries.
I plan to implement the fixed-width solution in the next few days.
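A minimal sketch of how such an automatic level could be computed; this is one plausible implementation of a log4(n) height, not the actual code:

```cpp
#include <cassert>
#include <cstddef>

// Sketch: skiplist height chosen as floor(log4(n)),
// where n is the number of entries.
unsigned skiplist_level(std::size_t n)
{
    unsigned level = 0;
    while (n >= 4) {
	n /= 4;
	++level;
    }
    return level;
}
```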

May 13

Considered some details of the fixed-width format before implementing it.

May 14

Implemented part of the fixed-width format.
If some contiguous docids are detected while encoding ( referred to as a contiguous block in the following ),
the function switches to fixed-width format.
First, -1 is encoded to indicate that the fixed-width format starts.
Then the number of entries in fixed-width format, the fixed width, the first docid of the contiguous block and the minimum doclen are encoded.

May 16

I modified the fixed-width format.
-1 is encoded as an indicator.
Then the first docid in the block, the number of entries in the block and the fixed width are encoded.
Docids are delta-encoded.
But doclens aren't; I encode the original doclen values.
If I wanted to keep a minimum doclen for the block and encode only the deltas from it,
then to find the minimum doclen I would first have to determine the extent of the block,
i.e. check that all doclen values in the block can be encoded within a certain width.
But as the doclens would be encoded as deltas, I can't know the exact width before the minimum doclen is found.
It seems a little circular.
Also, in my function the block must be longer than a certain length,
otherwise it is encoded in the normal format.

Another problem:
if the first doclen in a contiguous block is much bigger than the rest,
the fixed width will be determined by that one value, which may waste space.
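To illustrate the width problem described above, here is one plausible way to compute the fixed width of a block from its doclens; bits_needed and block_width are hypothetical helpers, not the actual code. A single large doclen, e.g. the first one, pushes the width up for every entry in the block:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Number of bits needed to store value v (v >= 1).
unsigned bits_needed(unsigned v)
{
    unsigned bits = 1;
    while (v >>= 1) ++bits;
    return bits;
}

// The fixed width of a block is dictated by its largest doclen,
// so one outlier wastes bits in every other entry.
unsigned block_width(const std::vector<unsigned>& doclens)
{
    unsigned width = 1;
    for (unsigned d : doclens)
	width = std::max(width, bits_needed(d));
    return width;
}
```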

May 17

Implemented search and merge_doclen_changes for a chunk in fixed-width format.
In merge_doclen_changes, I read the whole chunk and rebuild the original postlist map,
then modify that map according to the map<docid,doclen> of changes and re-encode it back to a chunk.
When the set of changes is small, updating the original chunk directly may be more efficient than rebuilding it,
but for bigger changes I don't think direct updating is faster than rebuilding;
after all, encoding a postlist map to a chunk takes only a little extra calculation.

I still need to handle the case of deleting an entry and the case where the chunk grows too big after updating.

May 19

I wrapped the skip list and the fixed-width format into DoclenChunkWriter and DoclenChunkReader.
DoclenChunkWriter determines which format to use when the chunk is built for the first time.
If the number of docids is smaller than DOCLEN_CHUNK_MIN_SKIPLIST_LENGTH,
it chooses the fixed-width format. (Note: the fixed-width format is the same as the current format when there aren't any contiguous docids.)
If bigger, it detects the contiguous blocks among the docids.
If the number of contiguous blocks is bigger than DOCLEN_CHUNK_MIN_CONTIGUOUS_BLOCKS
( while writing the line above, I suddenly realised this measure isn't suitable: just consider the case where all docids are contiguous ),
it chooses the fixed-width format.
If not, it chooses the skip list.
Currently, once the format of the chunk is determined when it is first built, it never changes after updates.
I don't think that is appropriate.
I will devise some way to change the format after updates, if a change is necessary.

Coding Week 1: May 20-May 25

May 20

I modified the criteria for choosing the format.

May 21

I modified the updating of the fixed-width format, making use of the fact that all docids in the map are in order,
and read brass_postlist.cc to learn some details.

May 23

I made some changes to DoclenChunk.
In the previous version, the DoclenChunk class chose the skip-list or fixed-width format depending on the distribution of the docids in the postlist.
But after seeing the real data from Olly, I don't think the skip list is necessary, so the current DoclenChunk class uses the fixed-width format in all circumstances.

I fixed the problem mentioned on May 16:
I track the ratio good_bytes/used_bytes, and if the ratio is smaller than a threshold, those docids won't be stored in the same block.
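A sketch of how that ratio could be computed for a candidate block. The names and the exact definitions of good_bytes/used_bytes here are my assumptions, not the actual code: good_bytes is the space the doclens would need at their individually minimal widths, used_bytes the space at the block's shared fixed width.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal number of bytes needed to store value v.
unsigned min_bytes(unsigned v)
{
    unsigned n = 1;
    while (v >>= 8) ++n;
    return n;
}

// good_bytes / used_bytes for one candidate fixed-width block.
// If one doclen is much larger than the rest, used_bytes grows and
// the ratio drops, so the block would be split instead.
double block_ratio(const std::vector<unsigned>& doclens)
{
    unsigned width = 1;
    std::size_t good_bytes = 0;
    for (unsigned d : doclens) {
	unsigned b = min_bytes(d);
	width = std::max(width, b);
	good_bytes += b;
    }
    std::size_t used_bytes = std::size_t(width) * doclens.size();
    return double(good_bytes) / double(used_bytes);
}
```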

I modified searching on a fixed-width chunk.
Previously, each search started at the beginning of the chunk;
now I track cur_docid, cur_doclength and pos, like brass_postlist.cc does.
If the desired docid is smaller than cur_docid, the search still starts at the beginning;
if not, it starts at the block containing cur_docid.

May 25

I fixed a fatal bug introduced on May 23,
and did some tests based on archives.doclens.
With -O2:
Fixed Width (merge changes):    15
Variable Width (merge changes): 10
Fixed Width (search):           2
Variable Width (search):        1566
And fixed width needs less space:
Fixed Width:    202948
Variable Width: 304326

And the encoded data is too big; I need to split it into chunks.
But I'd like to split it so that each chunk has the same number of entries, rather than the same size,
because it's difficult to control the chunk size accurately in the fixed-width format,
and if two chunks have the same number of entries, their sizes won't differ very much.
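The split by entry count described above could be sketched as follows; split_postlist is an illustrative name, and max_entries stands in for whatever limit is actually used:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Sketch: split a postlist into chunks holding at most max_entries
// entries each. With a fixed entry count per chunk, chunk sizes vary
// only with the width of the encoded entries, not unboundedly.
std::vector<std::vector<std::pair<unsigned, unsigned>>>
split_postlist(const std::map<unsigned, unsigned>& postlist,
	       std::size_t max_entries)
{
    std::vector<std::vector<std::pair<unsigned, unsigned>>> chunks;
    for (const auto& entry : postlist) {
	if (chunks.empty() || chunks.back().size() == max_entries)
	    chunks.emplace_back();
	chunks.back().push_back(entry);
    }
    return chunks;
}
```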

Coding Week 2: May 26-June 1

May 26

I dealt with splitting chunks and added a header to each chunk,
which led to a major change in my class hierarchy; that change isn't finished yet.

May 27

I finished the remaining part of adding headers to chunks and the change of hierarchy,
and did some basic tests on the new features.
Stricter tests will be done tomorrow.

May 29

I did many tests, as thorough as I could make them, to verify the correctness of the new format,
and dealt with some details about the key of the chunk.
I can start modifying brass_postlist.cc and brass_postlist.h in the next few days.

May 30

I completed modifying BrassPostlistTable.

June 1

I completed modifying BrassPostlist.

Coding Week 3: June 2-June 8

June 3

I fixed some bugs while compiling the new brass_postlist.cc and brass_postlist.h.
They finally compile successfully, but "make check" reports many failures.
I plan to either debug these failures directly or write a test for the new feature first.

June 4

I read the documentation on how to write tests,
and read the brass database source code to debug the failure from ./runtest ./apitest -b brass simplequery1.

June 5

I fixed some bugs in brass_postlist.cc.

June 6

I fixed a bug caused by a mistake in forming the key of the first chunk.
Now, ./apitest backend brass: 206 tests passed, 102 failed, 2 skipped. (without VALGRIND)

June 8

./apitest backend brass: 275 tests passed, 33 failed, 2 skipped.

Coding Week 4: June 9-June 15

June 9

./apitest backend brass: 288 tests passed, 20 failed, 2 skipped.

June 10

./apitest backend brass: 288 tests passed, 19 failed, 2 skipped.
I made a major change to class FixedWidthChunkReader to implement next_in_chunk, which at first I thought was unnecessary to rewrite.

June 11

./apitest backend brass: 298 tests passed, 10 failed, 2 skipped.

June 12

./apitest backend brass: 304 tests passed, 4 failed, 2 skipped.
I fixed a bug triggered by compacting.
Xapian implements compacting by changing the key of a chunk, not the chunk itself.
But at first I assumed a chunk had no connection with its key;
that worked well until I met compacting.

June 13

Now, without VALGRIND, only 3 tests fail.
One is cursordelbug1:
in brass_dbcheck.cc, Line 187,
if (!unpack_uint(&pos, end, &did))
operates on the raw chunk string directly, and as I changed the format, it is bound to fail.
Do I need to fix all such operations on the raw chunk string?

And there are some strange SIGSEGV errors in the tests doclenaftercommit1 and replacedoc2.
hurricanetong@hurricanetong-VirtualBox:~/xapian-1.3/xapian-core/tests$ VALGRIND= ./runtest ./apitest -b brass replacedoc2
Running tests with backend "brass"...
Running test: replacedoc2... SIGSEGV at 0xd5d5d5d9
./apitest backend brass: 0 tests passed, 1 failed.

But when I run this test in gdb, hurricanetong@hurricanetong-VirtualBox:~/xapian-1.3/xapian-core/tests$ VALGRIND= ./runtest gdb ./apitest -b brass replacedoc2
(gdb) r
Running tests with backend "brass"...
/home/hurricanetong/xapian-1.3/xapian-core/tests/apitest backend brass: All 1 tests passed.

doclenaftercommit1 behaves the same way.
What should I do about this strange situation?

June 15

I added Assert, LOGCALL, and so on.
With valgrind,
./apitest backend brass: 287 tests passed, 21 failed, 2 skipped.

Coding Week 5: June 16-June 22

June 16

I fixed a memory leak
and removed some redundant information from the chunk.

June 17

./apitest backend brass: All 308 tests passed, 2 skipped.
There are still some failing tests under valgrind, but they seem to have existed even before my modifications.
And although no tests fail, brass_dbcheck still needs fixing.
I modified brass_postlist.cc based on xapian files I downloaded months ago;
now I'm trying to apply those changes to the latest version.

June 18

modify brass_dbcheck.cc;
now it supports the fixed-width doclen chunk.
write a simple test for this change.

June 20

do some optimization to avoid copies when passing parameters.

June 21

write comments.

June 22

write comments and send my first PR.

Coding Week 6: June 23-June 29 (Midterm deadline June 27)

June 26

modify code style according to Xapian convention.

June 27

write more comments.

June 28

write more comments.
do a slight optimization by reducing the use of a temporary variable.

June 29

start optimizing vsencoding.
I reread the paper about vsencoding, as several months have passed and I had forgotten some details.
I moved some log2 calculations out of the loop; this improves encoding performance by 31%.
Now most of the encoding time is spent finding the globally optimal split,
but I think a locally optimal split may also be fine, so I will try that later.
Also, vsencoding doesn't make use of the fact that the integers in a position list are in order.
https://github.com/HurricaneTong/Xapian/commit/dcb0e271334d38280ada2550a15748bc27d4e024
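The log2 hoisting can be illustrated with a small lookup table computed once outside the hot loop; this is a generic sketch of the technique, not the actual vsencoding code:

```cpp
#include <array>
#include <cassert>

// Build, once, a table of bit lengths for all byte values
// (table[0] == 0, table[v] == floor(log2(v)) + 1 for v >= 1).
std::array<unsigned char, 256> make_bits_table()
{
    std::array<unsigned char, 256> table{};
    for (unsigned v = 1; v < 256; ++v) {
	unsigned bits = 0;
	for (unsigned x = v; x; x >>= 1) ++bits;
	table[v] = static_cast<unsigned char>(bits);
    }
    return table;
}

// Bit length of a 32-bit value via the table: at most three shifts
// and one lookup per call, instead of a loop over all bits.
unsigned bit_length(unsigned v, const std::array<unsigned char, 256>& table)
{
    if (v >> 24) return 24 + table[v >> 24];
    if (v >> 16) return 16 + table[(v >> 16) & 0xff];
    if (v >> 8) return 8 + table[(v >> 8) & 0xff];
    return table[v & 0xff];
}
```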

Coding Week 7: June 30-July 6

June 30

I used local optimization in place of global optimization.
Following is a simple test based on a real position-list file provided by Olly.

code                                 encoding time   size       decoding time
interpolative (used by Xapian now)   709             8554887    898
my original vsencoding               5108            11734072   1267
current vsencoding                   1429            17336081   736

I don't get the high performance the paper reports.
Maybe my original code is poorly optimized, or the code in Xapian is highly optimized.
Current vsencoding decreases decoding time by 18%, at the cost of roughly twice the encoding time and space.
https://github.com/HurricaneTong/Xapian/commit/0ce340cdaa2dc1d504ef3c77017d392b648533db

July 1

do a little optimization.
I profiled the program, and found that in both current vsencoding and interpolative encoding,
most of the time is spent appending a char to a string buffer.
As the buffer in my code is twice the size of the interpolative one,
it makes sense that my code costs twice as much time.
But when using vsencoding, we can't prevent some bits from being wasted, though encoding deltas may reduce that,
and in interpolative encoding it seems no bit is wasted.
So I find it hard to optimize further.
https://github.com/HurricaneTong/Xapian/commit/ec4fd90ae6ed4c7ab4217f8ca903db44798d6571

July 3

As I have changed my computer to a Mac, I have been setting up the development environment these two days.
Almost everything is OK, but I met some trouble with configure and make, so I've decided to keep working in Ubuntu in a virtual machine until that is solved.
And I found something strange:

code                 encoding time   size       decoding time
interpolative        375             8554887    411
current vsencoding   477             17336081   230

Space is still twice as much, but encoding time is only 130% of interpolative's, and decoding time is 55%.
So maybe the performance depends heavily on the machine.

July 4

I used pointers in place of array indices to speed things up.
I also changed the chunk format to make it more logical.
But it seems that I introduced some bugs.
https://github.com/HurricaneTong/Xapian/commit/c345f69e5da9a7e5760ff9bffc2a9549469561d5

July 5

repair the bug introduced yesterday.
https://github.com/HurricaneTong/Xapian/commit/840d2943ba1d6b8d48eefa0b9612a6d51248eb15
integrate vsencoding into the latest xapian.
https://github.com/HurricaneTong/xapian-1/commit/f1b8d69d1aa1b6c2ec4df28d0fbec761f1773e53

July 6

repair a bug introduced yesterday; now all tests pass.
https://github.com/HurricaneTong/xapian-1/commit/25a5791d1da83c10cda8fc6fca9cf51223bf6345
modify code style.
https://github.com/HurricaneTong/xapian-1/commit/1050e5b739220d303ae85d12a7611fb5528b0c59

Coding Week 8: July 7-July 13

July 8

write comments.
https://github.com/HurricaneTong/xapian-1/commit/accebef5f75a40bef272c1e1f1d8b7ccdcf5d5f0

July 9

optimize Unary Encoder.
https://github.com/HurricaneTong/xapian-1/commit/0c23a1faa31a49788427b73b9d6f7af0182c13ad

July 11

write comments and optimize GammaEncoder.

July 12

check code and make a pull request.

July 13

modify comments.
https://github.com/HurricaneTong/xapian-1/commit/15428cca8a927896cd8d5a3c05d385e198c6f8cf

Started applying the skip list to position lists.
My original class needs some changes, based on the experience gained from applying the fixed-width format.
https://github.com/HurricaneTong/Xapian/commit/85634de2e3f7064de17cfa1affbfae751b2a5028

Coding Week 9: July 14-July 20

July 14

refine three major classes and do tests.
https://github.com/HurricaneTong/Xapian/commit/114699c6e536e5014ff39420031ca501a998f2b6

July 16

refine the three major classes of the skip list to fit the integration.
modify merge_changes.
https://github.com/HurricaneTong/xapian-1/commit/4ab92fc032b6cccb432bc29fd90610ce34375fa2

July 18

modify other functions.
https://github.com/HurricaneTong/xapian-1/commit/045644dd4b312e805c6714fc32c1b9687b77c874
repair a bug in the fixed-width-format branch.
https://github.com/HurricaneTong/xapian-1/commit/c4d0d024dacffac0b66a20fddd37c3a076b70a06

July 20

repair some bugs.
https://github.com/HurricaneTong/xapian-1/commit/c2526c8c1e47588515a1906f740d89131b7ef401
./apitest backend brass: 304 tests passed, 8 failed, 3 skipped.

Coding Week 10: July 21-July 27

July 21

repair a bug.
https://github.com/HurricaneTong/xapian-1/commit/bba7d5bf23a10426a15d4be0b37a42766020b873
./apitest backend brass: 310 tests passed, 2 failed, 3 skipped.

July 22

repair the last bug.
This bug seems a little difficult; I haven't found its exact cause yet.

July 23

work out the last bug.
./apitest backend brass: 311 tests passed, 1 failed, 3 skipped.
https://github.com/HurricaneTong/xapian-1/commit/d5e3c7042ad216c32552f984458c03c8ad159c8d
https://github.com/HurricaneTong/xapian-1/commit/e4d146a1eeeefdd3a74e49438bb0572c60e790f6

July 25

modify code style and add log info.
https://github.com/HurricaneTong/xapian-1/commit/5c69c6d1f8da22a730b2c13f9213869c65f9f5ad

July 26

write comments and refine a macro.
https://github.com/HurricaneTong/xapian-1/commit/6457d03217a5b41bdeb560cf2459af5dd5a354f6

July 27

finish writing comments.
https://github.com/HurricaneTong/xapian-1/commit/fdd49dd38c3ca1d243b21b278993997576939b70

Coding Week 11: July 28-August 3

July 30

modify code in brass_dbcheck.cc concerning the skip-list format.
https://github.com/HurricaneTong/xapian-1/commit/0ac583c2ea06928e83ed361faf47ab16bf54cd59
modify a test to cover the behavior of the new brass_dbcheck.cc and the case where a postlist has more entries than MAX_ENTRIES_IN_SKIPLIST_CHUNK.
https://github.com/HurricaneTong/xapian-1/commit/30cd0e9e279bbb135f5e6447de53cb5e51cc992f

August 1

check code and make PR.

August 3

update db_check about vsencoding and modify a corresponding test.
https://github.com/HurricaneTong/xapian-1/commit/0d571bc5d84600b0f09656d842eeedb543b05d22

Coding Week 12: August 4-August 10

Coding Week 13: August 11-August 18 (Final evaluation based on work up to August 18)

Last modified on 03/08/14 02:44:41