wiki:GSoC2014/Posting list encoding improvements/Journal

Community Bonding Week 2: April 28-May 4

Community Bonding Week 3: May 5-May 11

May 8

I implemented a function
void buildSkipList( vector<unsigned>& src, int level )
@level: desired height of the skip list
@src: the differenced source vector, e.g. if we have a map<docid,doclength> m with m[4]=3, m[5]=7, m[9]=2, m[15]=3, m[16]=4, m[18]=11, m[27]=20, m[37]=40,
then src will be generated as ( 0 3 1 7 4 2 6 3 1 4 2 11 9 20 10 40 )
After calling buildSkipList( src, 2 ), src becomes
( -1 12 11 -1 2 1 0 3 1 7 -1 2 10 4 2 6 3 -1 ....)
-1 marks the start of skip info.
(12 11) indicates that by moving offset 12 from the position just after the 11, we reach a point where the docid has increased by 11.
Then I implemented
int getData( docid did, docid bias, const vector<unsigned>& src )
to search for the doclength of @did.
I also wrote some test functions for the two methods above.
My current plan:
first, implement a skip list ( build, insert, delete, search ) based on a vector,
then encode this vector into a chunk ( string ).

May 10

I ran into some trouble devising delete and insert for the skip list.
As I transform the normal skip list format into a vector, much of the skip list's information is lost,
so dealing with delete may be a bit troublesome.
For delete, assume we just remove the desired tuple (docid,doclen) and fix every affected offset,
without changing the structure of the skip list ( i.e. without decreasing or increasing its height ).
When I try to delete a tuple (docid,doclen), there may be some "pointers" (-1,x,x) pointing to that tuple.
Finding all these pointers and making sure every offset stays accurate after the deletion needs much extra work.
And as I don't change the structure of the skip list, after some deleting and inserting the skip list may become highly unbalanced,
which lowers search performance.
That is one approach to delete and insert.
Another approach is to simply rebuild the whole skip list every time I delete or insert a tuple.
I can't decide which one to choose.
As this skip list is designed to store the doclens of all docs, if searches are much more frequent than deletes/inserts, I prefer the second approach.

Community Bonding Week 4: May 12-May 19

May 12

I implemented merge_doclen_changes in the skip list.
The level of the skip list can now be determined automatically.
Currently, the height of the skip list is log4(n), where n is the number of entries.
I plan to implement the fixed-width solution in the next few days.

May 13

Considered some details of the fixed-width format before implementing it.

May 14

Implemented part of the fixed-width format.
If some contiguous docids are detected during encoding ( referred to as a contiguous block in the following ),
the function switches to the fixed-width format.
First, -1 is encoded to indicate that the fixed-width format starts.
Then the number of entries in the fixed-width block, the fixed width, the first docid of the contiguous docids, and the minimum doclen are encoded.

May 16

I modified the fixed-width format.
-1 is encoded as an indicator.
Then the first docid in the block, the number of entries in the block, and the fixed width are encoded.
Docids are encoded after being differenced,
but doclens aren't; I encode the original doclen values.
If I wanted to keep a minimum doclen for the block and encode only the delta from it,
then to find that minimum I would first have to determine the range of the block,
i.e. that all doclen values in the block can be encoded within a certain width.
But since the doclens would be encoded as deltas, I can't get the exact width before the minimum doclen is found.
It seems a little contradictory.
Also, in my function the block must be longer than a certain length,
otherwise it is encoded in the normal format.

Another problem:
if the first doclen in a contiguous block is much bigger than the rest,
the fixed width is determined by that doclen value, which may waste space.

May 17

Implemented search and merge_doclen_changes on a chunk of fixed-width format.
In merge_doclen_changes, I read the whole chunk and rebuild the original postlist map,
then modify the map according to the map<docid,doclen> of changes and re-encode it into a chunk.
When the set of changes is small, updating the original chunk directly may be more efficient than rebuilding it,
but when dealing with bigger change sets, I don't think direct updating is faster than rebuilding.
After all, encoding a postlist map into a chunk takes only a little extra calculation.

Next I will handle the case of deleting an entry and the case where the chunk becomes too big after updating.

May 19

I wrapped the skip list and the fixed-width format into DoclenChunkWriter and DoclenChunkReader.
DoclenChunkWriter determines which format to use when the chunk is built for the first time.
If the number of docids is smaller than DOCLEN_CHUNK_MIN_SKIPLIST_LENGTH,
it chooses the fixed-width format. ( Notice: the fixed-width format is the same as the current format when there aren't any contiguous docids. )
If bigger, it detects the contiguous blocks among the docids.
If the number of contiguous blocks is bigger than DOCLEN_CHUNK_MIN_CONTIGUOUS_BLOCKS,
( while writing the above line, I suddenly realised this measurement isn't suitable; just consider the case where all docids are contiguous )
then it chooses the fixed-width format.
If not, it chooses the skip list.
Currently, once the format is determined when the chunk is first built, it won't change after any update.
I don't think this is suitable.
I will devise some way to change the format after updates, if a change is necessary.

Coding Week 1: May 20-May 25

May 20

I modified the principle for choosing the format.

May 21

I modified the updating of the fixed-width format, making use of the fact that all docids in the map are in order,
and read the to learn about some details.

May 23

I made some changes to DoclenChunk.
In the previous version, the DoclenChunk class chose skip-list or fixed-width format depending on the distribution of the docids in the postlist.
But after seeing the real data from Olly, I don't think the skip list is necessary, so the current DoclenChunk class uses the fixed-width format in all circumstances.

I repaired the problem mentioned on May 16:
I track the ratio good_bytes/used_bytes; if the ratio is smaller than a threshold, those docids won't be stored in the same block.

I modified searching on a fixed-width chunk.
Previously, each search started at the beginning of the chunk;
now I track cur_docid, cur_doclength and pos, like this:
if the desired docid is smaller than cur_docid, the search still starts at the beginning;
if not, the search starts at the block that cur_docid is in.

May 25

I repaired a fatal bug made on May 23,
and did some tests based on archives.doclens.
With -O2:

  Fixed Width ( merge changes ):    15
  Variable Width ( merge changes ): 10
  Fixed Width ( search ):           2
  Variable Width ( search ):        1566

And the fixed-width format needs less space:

  Fixed Width :    202948
  Variable Width : 304326

And the size is too big, so I need to split it into chunks.
But I'd like to split so that each chunk has the same number of entries, rather than the same size,
because it's difficult to control the chunk size accurately in the fixed-width format,
and if two chunks have the same number of entries, their sizes won't differ much.

Coding Week 2: May 26-June 1

May 26

I dealt with splitting chunks and added a header to each chunk,
which led to a major change in my class hierarchy; that change hasn't been completed today.

May 27

I finished the remaining part of adding headers to chunks and the change of hierarchy,
and did some basic tests on the new features.
Stricter tests will be done tomorrow.

May 29

I did as many tests as I could to ensure the correctness of the new format,
and dealt with some details about the key of the chunk.
I can start modifying and brass_postlist.h in the next few days.

May 30

I completed modifying BrassPostlistTable.

June 1

I completed modifying BrassPostlist.

Coding Week 3: June 2-June 8

June 3

I repaired some bugs while compiling the new and brass_postlist.h.
Finally they compile successfully, but when executing "make check" there are many failures.
I plan to debug these failures directly, or to write a test for the new feature first.

June 4

I read documents about how to write tests,
and read brass database source code to debug the failure when executing ./runtest ./apitest -b brass simplequery1.

June 5

I repaired some bugs in

June 6

I repaired a bug: a mistake in forming the key for the first chunk.
Now: ./apitest backend brass: 206 tests passed, 102 failed, 2 skipped. (without VALGRIND)

June 8

./apitest backend brass: 275 tests passed, 33 failed, 2 skipped.

Coding Week 4: June 9-June 15

June 9

./apitest backend brass: 288 tests passed, 20 failed, 2 skipped.

June 10

./apitest backend brass: 288 tests passed, 19 failed, 2 skipped.
I applied a major change to class FixedWidthChunkReader to implement next_in_chunk, which at first I thought was unnecessary to rewrite.

June 11

./apitest backend brass: 298 tests passed, 10 failed, 2 skipped.

June 12

./apitest backend brass: 304 tests passed, 4 failed, 2 skipped.
I repaired a bug in compacting.
Xapian implements compacting by changing the key of a chunk, not the chunk itself.
But at first I thought a chunk had no relevance to its key.
That worked well until I met compacting.

June 13

Now, without VALGRIND, only 3 tests fail.
One is cursordelbug1:
in , line 187,
if (!unpack_uint(&pos, end, &did))
operates on the raw chunk string directly, but as I changed the format, it is bound to fail.
Do I need to repair all such operations on the raw chunk string?
And there are some strange SIGSEGV errors in the tests doclenaftercommit1 and replacedoc2:
hurricanetong@hurricanetong-VirtualBox:~/xapian-1.3/xapian-core/tests$ VALGRIND= ./runtest ./apitest -b brass replacedoc2
Running tests with backend "brass"...
Running test: replacedoc2... SIGSEGV at 0xd5d5d5d9
./apitest backend brass: 0 tests passed, 1 failed.

But when I run this test in gdb:
hurricanetong@hurricanetong-VirtualBox:~/xapian-1.3/xapian-core/tests$ VALGRIND= ./runtest gdb ./apitest -b brass replacedoc2
(gdb) r
Running tests with backend "brass"...
/home/hurricanetong/xapian-1.3/xapian-core/tests/apitest backend brass: All 1 tests passed.

doclenaftercommit1 behaves the same way.
What should I do about this strange situation?

June 15

I added Assert statements, LOGCALL macros, and so on.
With valgrind,
./apitest backend brass: 287 tests passed, 21 failed, 2 skipped.

Coding Week 5: June 16-June 22

June 16

I repaired a memory leak
and deleted some redundant information from the chunk.

June 17

./apitest backend brass: All 308 tests passed, 2 skipped.
There are still some failing tests under valgrind, but they seem to have existed even before I modified the files.
And although no tests fail, brass_dbcheck still needs repairing.
My changes are based on Xapian files I downloaded months ago;
now I'm trying to apply these changes to the latest version.

June 18

now it supports the fixed-width doclen chunk.
wrote a simple test for this change.

June 20

did some optimization to avoid copies when passing parameters.

June 21

write comments.

June 22

write comments and send my first PR.

Coding Week 6: June 23-June 29 (Midterm deadline June 27)

June 26

modify code style according to Xapian convention.

June 27

write more comments.

June 28

write more comments.
do a slight optimization by reducing the use of a temporary variable.

June 29

started optimizing vsencoding.
I reread the paper about vsencoding; as several months have passed, I had forgotten some details.
I moved some log2 calculations out of the loop; this improves encoding performance by 31%.
Now most of the encoding time is spent on finding the globally optimal split.
But I think a locally optimal split may also be OK, so I will try that later.
Also, vsencoding doesn't make use of the fact that the integers in a position list are in order.

Coding Week 7: June 30-July 6

June 30

I used local optimization in place of global optimization.
The following is a simple test based on a real position-list file provided by Olly.

  code                                 encoding time    size        decoding time
  interpolative (used by Xapian now)   709              8554887     898
  my original vsencoding               5108             11734072    1267
  current vsencoding                   1429             17336081    736

I don't get the high performance the paper reports.
Maybe my original code is poorly optimized, or the code in Xapian is highly optimized.
Current vsencoding decreases decoding time by 18%, at the cost of roughly twice the encoding time and space.

July 1

did a little optimization.
I profiled the program and found that, in both current vsencoding and interpolative encoding,
most time is spent on appending a char to a string buffer.
As the buffer in my code is twice as large as interpolative's,
it makes sense that my code costs twice as much time.
But when using vsencoding, we can't prevent some bits from being wasted, even though encoding deltas may decrease that.
And in interpolative encoding, it seems no bit is wasted.
So I find it hard to optimize further.

July 3

As I changed my computer to a Mac, I have been setting up a development environment these two days.
Everything is OK, but I met some trouble with configure and make, so I've decided to keep working in Ubuntu in a virtual machine until that is solved.
And I found something strange:

  code                 encoding time    size        decoding time
  interpolative        375              8554887     411
  current vsencoding   477              17336081    230

Space is still twice as much, but encoding time is only 130%, and decoding time 55%.
So maybe the performance depends heavily on the machine.

July 4

I used pointers in place of array indexing to speed things up.
I also changed the chunk format to make it more logical.
But it seems I introduced some bugs.

July 5

repaired the bug introduced yesterday.
integrated vsencoding into the latest xapian.

July 6

repaired a bug introduced yesterday; now all tests pass.
modified code style.

Coding Week 8: July 7-July 13

July 8

write comments.

July 9

optimize Unary Encoder.

July 11

write comments and optimize GammaEncoder.

July 12

check code and make a pull request.

July 13

modify comments.

Started applying the skip list to the position list.
My original class needs some changes, based on the experience gained from applying the fixed-width format.

Coding Week 9: July 14-July 20

July 14

refine three major classes and do tests.

July 16

refined the three major classes of the skip list to support integration.
modified merge_changes.

July 18

modified other functions.
repaired a bug in the fixed-width format branch.

July 20

repair some bugs.
./apitest backend brass: 304 tests passed, 8 failed, 3 skipped.

Coding Week 10: July 21-July 27

July 21

repair a bug.
./apitest backend brass: 310 tests passed, 2 failed, 3 skipped.

July 22

worked on the last bug.
It seems a little difficult; I haven't found its exact cause yet.

July 23

work the last bug out.
./apitest backend brass: 311 tests passed, 1 failed, 3 skipped.

July 25

modify code style and add log info.

July 26

write comments and refine a macro.

July 27

finish writing comments.

Coding Week 11: July 28-August 3

July 30

modified code concerning the skip-list format.
modified a test to exercise the behavior of the new and the case when a postlist has more entries than MAX_ENTRIES_IN_SKIPLIST_CHUNK.

August 1

check code and make PR.

August 3

updated db_check for vsencoding and modified a corresponding test.

Coding Week 12: August 4-August 10

Coding Week 13: August 11-August 18 (Final evaluation based on work up to August 18)

Last modified on 03/08/14 02:44:41