Table of Contents
- Community Bonding Week 2: April 28-May 4
- Community Bonding Week 3: May 5-May 11
- Community Bonding Week 4: May 12-May 19
- Coding Week 1: May 20-May 25
- Coding Week 2: May 26-June 1
- Coding Week 3: June 2-June 8
- Coding Week 4: June 9-June 15
- Coding Week 5: June 16-June 22
- Coding Week 6: June 23-June 29 (Midterm deadline June 27)
- Coding Week 7: June 30-July 6
- Coding Week 8: July 7-July 13
- Coding Week 9: July 14-July 20
- Coding Week 10: July 21-July 27
- Coding Week 11: July 28-August 3
- Coding Week 12: August 4-August 10
- Coding Week 13: August 11-August 18 (Final evaluation based on work up …
Community Bonding Week 2: April 28-May 4
Community Bonding Week 3: May 5-May 11
I implemented a function

void buildSkipList( vector<unsigned>& src, int level )

@level: desired height of the skip list
@src: the differenced source vector; e.g. if we have a map<docid,doclength> m whose doclen values are 3, 7, 2, 3, 4, 11, 20, 40,
then src will be generated (docids delta-encoded, doclens as-is) as ( 0 3 1 7 4 2 6 3 1 4 2 11 9 20 10 40 ).
After calling buildSkipList( src, 2 ), src becomes
( -1 12 11 -1 2 1 0 3 1 7 -1 2 10 4 2 6 3 -1 ... )
-1 marks the start of skip info.
(12 11) indicates that by offset 12 from the position just after this entry, we reach the place where the docid has increased by 11.
Then I implemented

int getData( docid did, docid bias, const vector<unsigned>& src )

to search for the doclength of @did.
I also wrote some test functions for the two methods above.
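To make the differencing step concrete, here is a minimal sketch of how I read the encoding of the source vector; `make_differenced_src` is a hypothetical helper for illustration, not the actual code:

```cpp
#include <map>
#include <vector>

// Hypothetical helper: flatten a map<docid, doclen> into the differenced
// source vector: each entry contributes (docid delta from the previous
// docid, raw doclen). The first delta is taken relative to docid 0.
std::vector<unsigned> make_differenced_src(const std::map<unsigned, unsigned>& m) {
    std::vector<unsigned> src;
    unsigned prev = 0;
    for (std::map<unsigned, unsigned>::const_iterator it = m.begin();
         it != m.end(); ++it) {
        src.push_back(it->first - prev);  // docid delta
        src.push_back(it->second);        // doclen, stored as-is
        prev = it->first;
    }
    return src;
}
```

Feeding it the example map above reproduces the ( 0 3 1 7 ... ) vector.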
My current plan:
First, implement a skip list (build, insert, delete, search) based on a vector,
then encode this vector into a chunk (string).
I ran into some trouble devising delete and insert for the skip list.
When the usual skip-list structure is flattened into a vector, much of its information is lost,
so delete is somewhat troublesome to handle.
For delete, assume we just remove the desired tuple (docid, doclen) and fix up every affected piece of offset information,
without changing the structure of the skip list (i.e. without decreasing or increasing its height).
When deleting a tuple (docid, doclen), there may be some "pointers" (-1, x, x) pointing to that tuple.
Finding all these pointers and making sure every offset stays accurate after the deletion takes a lot of extra work.
And since the structure of the skip list never changes, after enough deletes and inserts it may become highly unbalanced,
which degrades search performance.
That is one approach to delete and insert.
Another approach is simply to rebuild the whole skip list every time a tuple is deleted or inserted.
I can't decide which one to choose.
Since this skip list is designed to store the doclen of all documents, if searches greatly outnumber deletes/inserts, I prefer the second approach.
Community Bonding Week 4: May 12-May 19
I implemented merge_doclen_changes in the skip list.
The level of the skip list can now be determined automatically.
Currently the height of the skip list is log4(n), where n is the number of entries.
I plan to implement the fixed-width solution in the next few days.
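For reference, the height rule above can be sketched like this; the exact rounding is my choice for illustration, not necessarily what the code does:

```cpp
#include <cstddef>

// Height of the skip list as floor(log4(n)), computed without floating
// point: divide by 4 until the count drops below 4.
int skiplist_level(std::size_t n) {
    int level = 0;
    while (n >= 4) {
        n /= 4;
        ++level;
    }
    return level;
}
```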
Considered some details of the fixed-width format before implementing it.
Implemented part of the fixed-width format.
If some contiguous docids are detected during encoding (called a contiguous block in what follows),
the function switches to the fixed-width format.
First, -1 is encoded to indicate that the fixed-width format starts.
Then the number of entries in the block, the fixed width, the first docid of the contiguous block, and the minimum doclen are encoded.
I modified the fixed-width format.
-1 is encoded as an indicator.
Then the first docid in the block, the number of entries in the block, and the fixed width are encoded.
Docids are delta-encoded, but doclens aren't; I encode the original doclen values.
If I wanted to keep a minimum doclen for the block and encode only each doclen's delta from it,
then to find that minimum I would first have to determine the extent of the block,
i.e. confirm that all doclen values in it fit within a certain width.
But since the doclens would be encoded as deltas, I can't know the exact width before the minimum doclen is found.
The two requirements are a little contradictory.
Also, in my function a block must be longer than a certain length,
otherwise it is encoded in the normal format.
If the first doclen in a contiguous block is much bigger than the rest,
the fixed width is determined by that value, which may waste space.
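The space-waste issue can be seen in a small sketch; I assume here that the fixed width is measured in bytes, which is an assumption about the unit, not a statement of the actual format:

```cpp
#include <cstddef>
#include <vector>

// Bytes needed to store a single value.
unsigned bytes_needed(unsigned v) {
    unsigned b = 1;
    while (v >>= 8) ++b;
    return b;
}

// The fixed width of a block is dictated by its largest doclen, so one
// big value forces every entry in the block to use that width.
unsigned block_width(const std::vector<unsigned>& doclens) {
    unsigned w = 1;
    for (std::size_t i = 0; i < doclens.size(); ++i) {
        unsigned b = bytes_needed(doclens[i]);
        if (b > w) w = b;
    }
    return w;
}
```

A block like {1000, 3, 4, 5} gets width 2, spending 8 bytes where 5 would suffice with per-entry widths.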
Implemented search and merge_doclen_changes on a chunk in fixed-width format.
In merge_doclen_changes, I read the whole chunk and rebuild the original postlist map,
then apply the map<docid,doclen> of changes to it and re-encode the result into a chunk.
When the set of changes is small, updating the original chunk in place may be more efficient than rebuilding it,
but for bigger change sets I don't think in-place updating is faster than rebuilding;
after all, encoding a postlist map into a chunk takes only a little extra calculation.
I still need to handle deleting an entry, and the case where the chunk grows too big after an update.
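The rebuild path of merge_doclen_changes boils down to this sketch (names are hypothetical; the real code works on the encoded chunk string):

```cpp
#include <map>

typedef std::map<unsigned, unsigned> postlist_map;  // docid -> doclen

// After decoding the whole chunk into `original`, apply the changes and
// hand the merged map back for re-encoding. Inserts and in-place updates
// are both plain assignments on the map.
postlist_map apply_changes(postlist_map original, const postlist_map& changes) {
    for (postlist_map::const_iterator it = changes.begin();
         it != changes.end(); ++it) {
        original[it->first] = it->second;  // insert or overwrite
    }
    return original;  // caller re-encodes this into a fresh chunk
}
```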
I wrapped the skip list and the fixed-width format into DoclenChunkWriter and DoclenChunkReader.
DoclenChunkWriter determines which format to use when the chunk is built for the first time.
If the number of docids is smaller than DOCLEN_CHUNK_MIN_SKIPLIST_LENGTH,
it chooses the fixed-width format. (Note: the fixed-width format is identical to the current format when there are no contiguous docids.)
Otherwise, it detects the contiguous blocks among the docids.
If the number of contiguous blocks is bigger than DOCLEN_CHUNK_MIN_CONTIGUOUS_BLOCKS
(while writing that line I suddenly realised this measure isn't suitable: consider the case where all docids are contiguous, which forms just one big block),
it chooses the fixed-width format; if not, it chooses the skip list.
Currently, once the format of a chunk is determined at first build, it never changes on later updates.
I don't think that is appropriate.
I will devise a way to change the format after some updates, if a change becomes necessary.
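The first-build decision reads roughly like this sketch; the constant values, the minimum run length, and the helper names are placeholders, and it deliberately reproduces the flawed block-count measure noted above:

```cpp
#include <cstddef>
#include <vector>

// Placeholder values; the real constants live in the chunk code.
const std::size_t DOCLEN_CHUNK_MIN_SKIPLIST_LENGTH = 2048;
const std::size_t DOCLEN_CHUNK_MIN_CONTIGUOUS_BLOCKS = 4;

// Count maximal runs of docids increasing by exactly 1 that are at
// least min_run entries long.
std::size_t count_contiguous_blocks(const std::vector<unsigned>& ids,
                                    std::size_t min_run) {
    std::size_t blocks = 0, run = 1;
    for (std::size_t i = 1; i <= ids.size(); ++i) {
        if (i < ids.size() && ids[i] == ids[i - 1] + 1) {
            ++run;
        } else {
            if (run >= min_run) ++blocks;
            run = 1;
        }
    }
    return blocks;
}

bool use_fixed_width(const std::vector<unsigned>& docids) {
    if (docids.size() < DOCLEN_CHUNK_MIN_SKIPLIST_LENGTH) return true;
    // Flawed as noted above: one huge contiguous range counts as a
    // single block and so loses to the threshold.
    return count_contiguous_blocks(docids, 3) > DOCLEN_CHUNK_MIN_CONTIGUOUS_BLOCKS;
}
```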
Coding Week 1: May 20-May 25
I modified the rule for choosing the format.
I modified the updating of the fixed-width format, making use of the fact that all docids in the map are in order,
and read brass_postlist.cc to learn some details.
I made some changes to DoclenChunk.
In the previous version, the DoclenChunk class chose skip-list or fixed-width format depending on the distribution of the docids in the postlist.
But after seeing the real data from Olly, I don't think the skip list is necessary, so the current DoclenChunk class uses the fixed-width format in all circumstances.
I fixed the problem mentioned on May 16.
I track the ratio good_bytes/used_bytes; if the ratio falls below a threshold, those docids won't be stored in the same block.
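My reading of that ratio, as a sketch; whether good_bytes counts the per-entry minimal widths is my assumption:

```cpp
#include <cstddef>
#include <vector>

// Bytes needed to store a single value.
unsigned bytes_needed(unsigned v) {
    unsigned b = 1;
    while (v >>= 8) ++b;
    return b;
}

// good_bytes: what the doclens would need with individual widths.
// used_bytes: what the shared fixed width actually spends on them.
double packing_ratio(const std::vector<unsigned>& doclens) {
    if (doclens.empty()) return 1.0;
    unsigned width = 1, good = 0;
    for (std::size_t i = 0; i < doclens.size(); ++i) {
        unsigned b = bytes_needed(doclens[i]);
        good += b;
        if (b > width) width = b;
    }
    return double(good) / double(width * doclens.size());
}
```

A block {1, 1, 1, 1000} scores 5/8 = 0.625, so a low threshold would split the outlier into its own block.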
I modified searching on a fixed-width chunk.
Previously each search started at the beginning of the chunk;
now I track cur_docid, cur_doclength, and pos, as brass_postlist.cc does.
If the desired docid is smaller than cur_docid, the search still starts at the beginning;
otherwise it starts at the block containing cur_docid.
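The cursor idea can be sketched on a flat (docid-delta, doclen) vector; this is an illustrative stand-in, not the chunk-level code, and it scans entries rather than whole blocks:

```cpp
#include <cstddef>
#include <vector>

// Remember cur_docid and the read position; only rewind to the start
// when seeking backwards.
struct DoclenCursor {
    const std::vector<unsigned>* src;
    std::size_t pos;       // index of the next delta in *src
    unsigned cur_docid;
    bool have_cur;

    explicit DoclenCursor(const std::vector<unsigned>& s)
        : src(&s), pos(0), cur_docid(0), have_cur(false) {}

    // Return the doclen of did, or -1 if did is absent.
    int find(unsigned did) {
        if (have_cur && did < cur_docid) {   // backwards seek: restart
            pos = 0;
            have_cur = false;
        }
        if (have_cur && did == cur_docid) return (int)(*src)[pos - 1];
        while (pos < src->size()) {
            unsigned next = (have_cur ? cur_docid : 0) + (*src)[pos];
            cur_docid = next;
            have_cur = true;
            pos += 2;                        // step past (delta, doclen)
            if (next == did) return (int)(*src)[pos - 1];
            if (next > did) break;           // overshot: did not present
        }
        return -1;
    }
};
```

Successive forward lookups continue from the cached position instead of rescanning the whole chunk.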
I fixed a fatal bug made on May 23,
and ran some tests based on archives.doclens.
Fixed width (merge changes): 15
Variable width (merge changes): 10
Fixed width (search): 2
Variable width (search): 1566
And fixed width needs less space:
Fixed width: 202948
Variable width: 304326
But the chunk is too big; I need to split it into chunks.
I'd like to split it so that each chunk holds the same number of entries, rather than being the same size,
because it's difficult to control the size of a chunk accurately in the fixed-width format,
and two chunks with the same number of entries won't differ much in size.
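The equal-entries split can be sketched like this (`split_postlist` and `max_entries` are illustrative names):

```cpp
#include <cstddef>
#include <map>
#include <vector>

typedef std::map<unsigned, unsigned> postlist_map;  // docid -> doclen

// Cut a postlist into chunks holding at most max_entries entries each.
// Chunk sizes then differ by at most max_entries - 1 entries, so the
// encoded sizes stay roughly comparable too.
std::vector<postlist_map> split_postlist(const postlist_map& pl,
                                         std::size_t max_entries) {
    std::vector<postlist_map> chunks;
    postlist_map cur;
    for (postlist_map::const_iterator it = pl.begin(); it != pl.end(); ++it) {
        cur.insert(*it);
        if (cur.size() == max_entries) {
            chunks.push_back(cur);
            cur.clear();
        }
    }
    if (!cur.empty()) chunks.push_back(cur);
    return chunks;
}
```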
Coding Week 2: May 26-June 1
I handled chunk splitting and added a header to each chunk,
which led to a major change in my class hierarchy; that change isn't finished today.
I finished the remaining parts of adding the header to the chunk and changing the hierarchy.
I did some basic tests on the new features.
Stricter tests will be done tomorrow.
I ran as many tests as I could to verify the correctness of the new format,
and sorted out some details about the key of the chunk.
I can start modifying brass_postlist.cc and brass_postlist.h in the next few days.
I finished modifying BrassPostlistTable.
I finished modifying BrassPostlist.
Coding Week 3: June 2-June 8
I fixed some bugs while compiling the new brass_postlist.cc and brass_postlist.h.
They finally compile successfully, but "make check" reports many failures.
I plan either to debug these failures directly, or to write a test for the new feature first.
I read the documentation on how to write tests,
and read the brass backend source code to debug the failure from ./runtest ./apitest -b brass simplequery1.
I fixed some bugs in brass_postlist.cc.
I fixed a bug caused by a mistake in forming the key of the first chunk.
Now: ./apitest backend brass: 206 tests passed, 102 failed, 2 skipped. (without VALGRIND)
./apitest backend brass: 275 tests passed, 33 failed, 2 skipped.
Coding Week 4: June 9-June 15
./apitest backend brass: 288 tests passed, 20 failed, 2 skipped.
./apitest backend brass: 288 tests passed, 19 failed, 2 skipped.
I applied a major change to the class FixedWidthChunkReader to implement next_in_chunk, which at first I thought wouldn't need rewriting.
./apitest backend brass: 298 tests passed, 10 failed, 2 skipped.
./apitest backend brass: 304 tests passed, 4 failed, 2 skipped.
I fixed a bug in compacting.
Xapian implements compacting by changing the key of a chunk, not the chunk itself,
but at first I assumed a chunk had no relation to its key.
That worked fine until I hit compacting.
Now, without VALGRIND, only 3 tests fail.
One is cursordelbug1:
in brass_dbcheck.cc, line 187,
if (!unpack_uint(&pos, end, &did))
operates on the raw chunk string directly, and since I changed the format, it is bound to fail.
Do I need to fix all such operations on the raw chunk string?
There are also some strange SIGSEGV errors in the tests doclenaftercommit1 and replacedoc2.
hurricanetong@hurricanetong-VirtualBox:~/xapian-1.3/xapian-core/tests$ VALGRIND= ./runtest ./apitest -b brass replacedoc2
Running tests with backend "brass"...
Running test: replacedoc2... SIGSEGV at 0xd5d5d5d9
./apitest backend brass: 0 tests passed, 1 failed.
But when I run this test under gdb,
hurricanetong@hurricanetong-VirtualBox:~/xapian-1.3/xapian-core/tests$ VALGRIND= ./runtest gdb ./apitest -b brass replacedoc2
Running tests with backend "brass"...
/home/hurricanetong/xapian-1.3/xapian-core/tests/apitest backend brass: All 1 tests passed.
doclenaftercommit1 behaves the same.
What should I do about this strange situation?
I added Assert, LOGCALL, and so on.
./apitest backend brass: 287 tests passed, 21 failed, 2 skipped.
Coding Week 5: June 16-June 22
I fixed a memory-leak bug
and removed some redundant information from the chunk.
./apitest backend brass: All 308 tests passed, 2 skipped.
There are still some failing tests under valgrind, but they seem to predate my modifications.
And although no tests fail, brass_dbcheck still needs fixing.
My changes to brass_postlist.cc were based on Xapian files I downloaded months ago;
now I'm applying those changes to the latest version,
which now supports the fixed-width doclen chunk.
Wrote a simple test for this change.
Did some optimization to avoid copies when passing parameters.
Wrote comments and sent my first PR.
Coding Week 6: June 23-June 29 (Midterm deadline June 27)
Adjusted the code style to the Xapian convention.
Wrote more comments.
Wrote more comments.
Did a slight optimization by removing a temporary variable.
Started optimizing vsencoding.
I reread the paper on vsencoding; several months have passed and I had forgotten some details.
I moved some log2 calculations out of the loop, which improved encoding performance by 31%.
Now most of the encoding time is spent on finding the globally optimal split.
But I think a locally optimal split may also be acceptable, so I will try that later.
Also, vsencoding doesn't exploit the fact that the integers in a position list are in order.
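The log2 hoisting can be sketched as a precomputed bit-width table; the table layout here is illustrative, not the actual code:

```cpp
#include <vector>

// bits[v] = number of bits needed to represent v (floor(log2(v)) + 1),
// built once before the encoding loop instead of calling log2() for
// every value. Treating v = 0 as needing one bit is my assumption.
std::vector<unsigned char> build_bits_table(unsigned max_value) {
    std::vector<unsigned char> bits(max_value + 1, 1);  // 0 and 1 -> 1 bit
    for (unsigned v = 2; v <= max_value; ++v)
        bits[v] = bits[v >> 1] + 1;
    return bits;
}
```

The encoder then does a table lookup per element, which is what made the 31% difference plausible to me.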
Coding Week 7: June 30-July 6
I used local optimization in place of the global optimization.
Below is a simple test based on a real position-list file provided by Olly.

| code | encoding time | size | decoding time |
| interpolative (used by Xapian now) | 709 | 8554887 | 898 |
| my original vsencoding | 5108 | 11734072 | 1267 |

I don't get the high performance the paper reports.
Maybe my original code is poorly optimized, or the code in Xapian is highly optimized.
The current vsencoding decreases decoding time by 18%, at the cost of twice the encoding time and twice the space.
Did a little optimization.
I profiled the program and found that, in both the current vsencoding and interpolative encoding,
most of the time is spent on appending a char to a string buffer.
As the buffer in my code is twice the size of the interpolative one,
it makes sense that my code costs twice as much time.
But with vsencoding we can't avoid wasting some bits, even though encoding deltas may reduce the waste,
whereas interpolative encoding seems to waste no bits at all.
So I find it hard to optimize further.
As I have switched my computer to a Mac, I have spent these two days setting up the development environment.
Everything is fine, but I hit some trouble with configure and make, so until that is solved I will keep working in Ubuntu in a virtual machine.
And I found something strange: on this machine the space is still twice as much, but the encoding time is only 130% and the decoding time 55%.
So maybe the performance depends heavily on the machine.
I used pointers in place of array indices to speed things up.
I also changed the chunk format to make it more logical.
But it seems I introduced some bugs.
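The pointer-vs-index change is this kind of micro-optimization; the summing loops below are illustrative, not the actual decode loop:

```cpp
#include <cstddef>
#include <vector>

// Index form: each access is base + i.
unsigned sum_indexed(const std::vector<unsigned>& v) {
    unsigned s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Pointer form: walk a raw pointer across the buffer instead of
// re-indexing; both return the same result.
unsigned sum_pointer(const std::vector<unsigned>& v) {
    unsigned s = 0;
    const unsigned* p = v.data();
    for (const unsigned* end = p + v.size(); p != end; ++p) s += *p;
    return s;
}
```

Whether this actually helps depends on how well the compiler already optimizes the indexed form.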
Fixed the bug introduced yesterday.
Integrated vsencoding into the latest xapian.
Fixed a bug introduced yesterday; now all tests pass.
Adjusted the code style.
Coding Week 8: July 7-July 13
Optimized the unary encoder.
Wrote comments and optimized GammaEncoder.
Checked the code and made a pull request.
Started applying the skip list to the position list.
My original classes need some changes, based on the experience gained from applying the fixed-width format.
Coding Week 9: July 14-July 20
Refined three major classes and ran tests.
Refined the three major skip-list classes to fit the integration.
Modified other functions.
Fixed a bug in the fixed-width-format branch.
Fixed some bugs.
./apitest backend brass: 304 tests passed, 8 failed, 3 skipped.
Coding Week 10: July 21-July 27
Fixed a bug.
./apitest backend brass: 310 tests passed, 2 failed, 3 skipped.
Worked on the last bug.
This bug seems a little difficult; I haven't found its exact cause yet.
Worked the last bug out.
./apitest backend brass: 311 tests passed, 1 failed, 3 skipped.
Adjusted the code style and added log info.
Wrote comments and refined a macro.
Finished writing comments.
Coding Week 11: July 28-August 3
Modified the code in brass_dbcheck.cc concerning the skip-list format.
Modified a test to cover the behaviour of the new brass_dbcheck.cc and the case where a postlist has more entries than MAX_ENTRIES_IN_SKIPLIST_CHUNK.
Checked the code and made a PR.
Updated db_check for vsencoding and modified a corresponding test.