wiki:GSoC2012/Bi-gram Language Modeling/Journal


Community Bonding Week 1 - 3: April 23 - May 14

Issues for Unigram Implementation:

  • Decision for Bounds of Unigram Weighting Scheme.
    log(k*min(wdf_max/doc_length_lower_bound,1.0)) with checks for doc_length zero/
  • Compiled the latest repository and fixed a small bug to make the code compatible with GCC 4.7 (my first accepted patch).
  • Forked the Xapian git repository and learned how to push changes to it.
  • Decision on Clamping of Negative value due to log

             for i = 1, ..., n: use max(log(K * P_i), 0), i.e. a negative log value is clamped to 0


  • Decision to provide a user API to select the value of K and which smoothing to use, with sensible defaults.
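
A minimal sketch of the bound and the clamping decided above, written in Python rather than Xapian's C++. The function names and the default value of K are hypothetical, and clamping the bound itself at 0 is an assumption, not something recorded in this journal.

```python
import math


def clamped_log_weight(probabilities, k=100.0):
    """Sum max(log(k * p), 0) over per-term probabilities.

    Terms whose log would be negative contribute 0 instead.
    """
    total = 0.0
    for p in probabilities:
        if p <= 0.0:
            continue  # log is undefined here; treat as no contribution
        total += max(math.log(k * p), 0.0)
    return total


def weight_upper_bound(wdf_max, doc_length_lower_bound, k=100.0):
    """log(k * min(wdf_max / doc_length_lower_bound, 1.0)), guarded for zero length."""
    if doc_length_lower_bound <= 0:
        return 0.0  # check for a zero document length
    ratio = min(wdf_max / doc_length_lower_bound, 1.0)
    if ratio <= 0.0:
        return 0.0
    return max(math.log(k * ratio), 0.0)  # assumption: bound is clamped too
```

With K = 100, a term with probability 0.001 gives log(0.1) < 0 and so adds nothing to the sum.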

Coding Week 1: May 21-May 27

Implementation of Uni-gram Language Model : WeeksBlogPost

  • Decided which smoothing techniques to implement and documented the decision in NOTES.
  • Handling of negative values from the log in the sum formula, and the bound for the optimization.

  • Implementation of Parametric constructor for Uni-gram Language Model.
  • Addition of Smoothing to the Uni-gram Language Model Implementation.
  • Addition of per-document statistics (number of unique terms in the document) to the Xapian architecture, similar to document length.
  • Switched the backend used for the implementation from Chert to Brass.
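
This journal does not say which smoothing techniques were chosen (that is recorded in NOTES). Purely as an illustration of what smoothing a unigram language model involves, here is a sketch of Jelinek-Mercer smoothing, one common technique. The function name, parameter names and default lambda are hypothetical.

```python
def jelinek_mercer(wdf, doc_length, coll_freq, coll_length, lam=0.7):
    """Interpolate the document model with the collection model.

    P(t|d) = (1 - lam) * wdf / doc_length + lam * coll_freq / coll_length
    """
    doc_prob = wdf / doc_length if doc_length > 0 else 0.0
    coll_prob = coll_freq / coll_length if coll_length > 0 else 0.0
    return (1.0 - lam) * doc_prob + lam * coll_prob
```

The collection term keeps the probability above zero for query terms that are missing from a document, which matters once the log is taken.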

Coding Week 2: May 28-June 3


May 29

  • Wrote documentation and a tutorial for the Unigram Language Model weighting class in the new non-generated documentation.

May 30

  • Added 5 test cases for the UnigramLMWeight class and fixed the bugs they uncovered.

May 31

  • Tested the Xapian bindings and wrote a smoke test for the Uni-gram implementation.
  • Explored the code more rigorously to work out exactly where the Bi-gram implementation fits in (notes on this will be added to the NOTES section).

June 1

  • Generated a test coverage report for Xapian and, more interestingly, for the new class (test coverage is 91.1%).

June 2


  • Code exploration for Bigram Integration proposal
  • Completed the Bigram Integration proposal.

June 3

  • Changed index_text to create bi-grams and add them to documents. ReadMore
  • Changed the API of the termgenerator class to let the API user enable bi-gram indexing, which is disabled by default. ReadMore
  • Changed stopword handling, as stopwords need to be removed when forming bi-grams. ReadMore
  • Added DocumentBigramTerm class ReadMore
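
The indexing changes above can be sketched roughly as follows. This is Python for illustration, not the actual termgenerator code: the function name, the flag name and the space used to join the two words are all assumptions, as is dropping stopwords before pairing.

```python
def make_bigrams(tokens, stopwords=frozenset(), enable_bigrams=False):
    """Return the bi-grams for a token stream, or nothing if disabled."""
    if not enable_bigrams:
        return []  # bi-gram indexing is off by default
    # Assumption: stopwords are removed first, so a bi-gram may join two
    # words that were separated by a stopword in the original text.
    kept = [t for t in tokens if t not in stopwords]
    return [f"{first} {second}" for first, second in zip(kept, kept[1:])]
```

For example, `make_bigrams(["the", "quick", "brown", "fox"], {"the"}, True)` yields the pairs "quick brown" and "brown fox".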

Coding Week 3: June 4-June 10


June 4

June 5

June 6

June 7

  • Made Internal Document level changes to table BrassBigramTermList to add bi-grams and support access to them. ReadMore CodeCommit

  • Added an UnImplementedMethod exception for other databases such as chert and remote, as I couldn't find a workaround to disable those backends: they are tightly integrated into the xapian-delve binary and the remote backend.

June 8

  • Adding support to the inverter class to store postlist changes for bigrams. CodeCommit

June 9

  • Documented bi-gram indexing and access to the bigram termlist and postlist in the non-generated documentation. DocumentationCommit
  • Adding methods to merge postlist changes of bigrams in the backend. (Will use the current postlist infrastructure, as it seems to work correctly.)

June 10

  • Checked the backend for errors using the existing regression tests. (Only one test, cursordelbug1, failed due to the newly added changes; it covers a previous bug, see BugTicket.)

Weekend - off

Coding Week 4: June 11-June 17


June 11

  • Deciding on the method for accessing the posting list, based on the matcher infrastructure and the query calling infrastructure. (Since we now treat bi-grams as terms, the calling infrastructure is the same.)
  • Analyzing wildcard query expansion for problems caused by storing bi-grams.
  • Discussed the work done so far and got a suggestion to change the implementation to "treat bi-grams as terms".

June 12

  • Changing the implementation to "treat bi-grams as terms". CodeCommit
  • Adjusted the Bigram Iterator to show only the bi-grams with the new implementation. "There should be some way to iterate the unigrams also, but left it for later as it is not very important (mentioned by Olly)." CodeCommit

June 13

  • Discussion on which document statistics to integrate into the backend and how, and on the key under which to store the new per-document statistics. (The document statistics need to be stored in the backend under new keys, as a posting list entry.)

June 14

  • Implementation of document statistics in the backend (a postlist entry stores the document statistics). CodeCommit ReadMore

June 15

  • GSoC Meetup (off)

June 16

  • Regression testing using the previously written tests available in the tests folder.

June 17

  • Testing Backend for bugs based on tests.

Coding Week 5: June 18-June 24

June 18

  • Analyzing and understanding changes required to query object and query parser for bi-gram implementation.

June 19

  • Backend changes proposed by Olly for changing the Key and per document stats.
  • Analyzing and understanding changes required to query object and query parser for bi-gram implementation.

June 20

  • Per-document statistics in the matcher infrastructure or query parser object (linkage to backend functions). CodeCommit
  • Removing Bugs in the implementation found during regression testing.

June 21

  • Removing bugs in the implementation found during regression testing. (Issues with the writable database: accessing the writable database, i.e. the changes in the inverter, was causing some tests to fail.) CodeCommit

June 22

  • Fixed a bug where the all-docs postlist was failing due to a typo in a writable database function. CodeCommit
  • Updated remove-document and replace-document to handle the document statistics added to the backend. CodeCommit
  • Changed the document length in the termlist to be the sum of wdf for unigrams plus bigrams. CodeCommit
  • Experiment on the time difference between reading the document length from the termlist table and from the postlist table for get_eset. ReadMore
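
The document length change above amounts to the following small calculation, shown here in Python with hypothetical names. The dictionaries map each term (or bi-gram) to its within-document frequency (wdf).

```python
def document_length(unigram_wdf, bigram_wdf):
    """Document length as the sum of wdf over unigrams plus bi-grams."""
    return sum(unigram_wdf.values()) + sum(bigram_wdf.values())
```

A document with unigram wdfs of 2 and 1 and a single bi-gram with wdf 1 therefore has length 4 rather than 3.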

June 23

  • Updated Brass compaction for the new additions to the backend.
  • Regression-tested Brass compaction and fixed the bugs behind the failing compact* tests. CodeCommit

June 24

  • Regression-tested Brass compaction and fixed the bugs behind the failing compact* tests. CodeCommit

June 25

  • Follow the deprecation policy and discuss whether to add get_doclength() or roll back and add get_stats(). CodeCommit

Coding Week 6: June 26-July 1

June 26

  • Understand the git merge that jaylett suggested, work out its benefits, and work on that. GitHub

  • Check whether the changes made to automake are necessary, as asked by jaylett, and reply. Mastercorrection

  • Make sure current master is up to date with the branch, everything compiles well and the test suite passes.


June 27

  • Write the document on the query-level changes and ask for reviews (stretched to the next day). Archive

June 28

  • Write the document on the query-level changes and ask for reviews (stretched to the next day). Archive

June 29

  • Went over the changes made and the work still to be done, and prepared a checked roadmap of progress so far for the review meeting.
  • Review Meeting.

June 30


July 1


Coding Week 7: July 2-July 8

July 2

  • Made the addition of bi-grams in group terms more efficient by using a single iterator instead of two iterators. CodeCommit
  • Added functionality to select whether or not to add bigrams to the Query. CodeCommit
  • Bigrams for the Group Query type are now handled in a single place, as bigrams are unaffected by multi auto synonyms. CodeCommit

July 3

  • Added support for bigrams in Terms, i.e. for NEAR, PHRASE and ADJ queries. CodeCommit
  • Discussion about refactoring of the Language Model Weight, evaluation, stats for title, body etc.

July 4

  • Reviewed the work on Weight and fixed bugs in it.

July 5

  • Refactored UnigramLMWeight to LMWeight and fixed bugs in LMWeight. CodeCommit

July 6


July 7

July 8

  • Adjusted LMWeight to be parametric for all three: unigram, bigram and mixture model.

Coding Week 8: July 9-July 15 (Midterm deadline July 13)

July 9

  • Added support for including bigrams in Weight, and a constructor that lets the user set which gram model to use. CodeCommit

July 10

  • Looked over the evaluation module development in Terrier and Andy's TREC code.

July 11

  • Planned the evaluation module along the lines of Terrier and decided to continue with FIRE data for a while.
  • Hacked the code so that FIRE queries are parsed by the module in the same way as TREC queries. CodeCommit

July 12

  • Hacked the code so that the FIRE dataset is indexed by the module in the same way as the TREC dataset. CodeCommit
  • Completed a run of the code on the FIRE data and produced the result file.
  • Started work on the evaluation module: QRel assessments for a query can be stored in class QRelInMemory. CodeCommit

July 13

July 14 to July 16

  • Now using a TrecQrel object: it can load a Qrel file, report the status of a document for a query, and list all relevant documents for a query. CodeCommit
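
For illustration, loading relevance judgments can be sketched as below. This is not the TrecQrel class from the commit: the function names are hypothetical, and the sketch assumes the standard TREC qrels layout of four whitespace-separated columns (topic, iteration, document number, relevance).

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each topic to a {document number: relevance} dictionary."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, docno, relevance = parts
        qrels[topic][docno] = int(relevance)
    return qrels


def relevant_docs(qrels, topic):
    """Return the set of documents judged relevant for a topic."""
    return {doc for doc, rel in qrels.get(topic, {}).items() if rel > 0}
```

In practice `lines` would be an open qrels file; a topic with no judgments simply yields an empty set.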

Coding Week 9: July 17-July 22

July 17 to July 18

  • Added a generic Evaluation class and a class for Adhoc evaluation. CodeCommit
  • Implemented basic MAP and a printing function for MAP in CodeCommit

July 19 & July 20

  • Improved Evaluation and the write-evaluation function of Adhoc; MAP now works fine. CodeCommit
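
As a reference for what MAP measures, here is a minimal sketch of the standard definition in Python. It is not the code from the commits above, the names are hypothetical, and it assumes each ranked list contains no duplicate documents.

```python
def average_precision(ranked_docs, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    # Dividing by all relevant docs penalises relevant docs never retrieved.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Average the per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    total = sum(
        average_precision(ranked, qrels.get(query, set()))
        for query, ranked in runs.items()
    )
    return total / len(runs)
```

Here `runs` maps a query id to its ranked result list and `qrels` maps a query id to its set of relevant documents.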

July 21 to July 23

  • Fixed the makefile to remove the newly added executable file on clean. CodeCommit
  • Fixed the Makefile, improved the display of evaluation results and added statistics on relevant and retrieved documents. CodeCommit

Coding Week 10: July 24-July 29

July 24

  • Added precision by rank; precision by recall is left to implement. CodeCommit
  • IRC Meeting to discuss evaluation Module.
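
Precision by rank, as added above, is usually computed as in this sketch. It is Python with a hypothetical function name, not the evaluation module's code, and it uses the usual convention of dividing by the cutoff even when fewer documents were retrieved.

```python
def precision_at_rank(ranked_docs, relevant, k):
    """Fraction of the top k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    top_k = ranked_docs[:k]
    return sum(1 for doc in top_k if doc in relevant) / k
```

Evaluating this at a few fixed cutoffs (for example 5, 10 and 20) gives one row of a per-query results table.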

July 25 & July 26

  • Checked the evaluation module manually for one query and fixed a bug.

July 27

  • Implemented writing of evaluation results in column format to support easy manipulation of the results. CodeCommit

July 28 & 29

  • Redirected query formation to the QueryParser module instead of building the query ourselves by splitting it into words. CodeCommit
  • Redirected indexing, stemming and stopword handling to the TermGenerator module of Xapian. CodeCommit

July 30

  • Made the weighting scheme and bigram use configurable by the user through a config file. CodeCommit
  • Compiled and obtained results for all the weighting schemes, with and without bigrams. ResultDocument

Coding Week 11: July 31-August 5

July 31

  • Removed an implementation bug from LMWeight (the bug was overwriting the actual weight value). CodeCommit
  • Test for the Bi-gram implementation in the backend.

August 1

  • Reviewed code of Language Model Weighting Scheme.

August 2

  • IRC meeting to discuss the evaluation module and the problem with bigrams. MeetingNotes

August 3 - August 5

  • Transit to Bangalore to join HP.

Coding Week 12: August 6-August 12

August 6

  • Indexed the collection with stopwords and fixed a bug in the stopword implementation: the stopper was not configured correctly. CodeCommit

August 7

  • Found that one cause of the low results was a high value for the log param. Checked whether setting a large log param value hurts performance, and adjusting it improved performance.

August 8 - August 13

  • Working on improving Uni-gram model.

Coding Week 13: August 13-August 20 (Final evaluation based on work up to August 20)

August 14 - August 15

  • Working on finding and fixing bugs in the Bi-gram implementation.

August 16 - August 17

August 18 - August 19

August 20

  • Cleaning Code
  • Writing Pending Test cases for Bigrams

Last modified on 26/01/16 10:10:43