wiki:GSoC2012/Bi-gram Language Modeling/TODOS

TODOs: Bold ---> Pending Work

  • Finding/analyzing the feasibility of, and the changes needed for, adding the smoothing schemes Absolute Smoothing, Dirichlet, and Jelinek-Mercer. {Decided and added in Notes}
  • Adding and implementing the discussed log trick, handling negative values, and clamping the log output. {Done as per the discussions on IRC and the mailing list}
  • Writing Documentation for Uni-gram Class.
  • Running API test to check compatibility with previous version.
  • Adding a user API in uni-gram modelling for the smoothing and log parameters. {Done; documentation is available in NOTES}
  • Checking Java and other bindings for correct working of uni-gram LM weighting scheme.
  • Bi-gram LM proposal on storage of bi-grams; bounds and log issues will be solved similarly to the uni-gram model.
  • Implement the BigramTokenization and UnigramTokenization classes for Termgenerator and use them. (Not implemented; a different approach of in-place tokenization was followed instead.)
  • Implement DocumentBigramTerm class.
  • Update Document class changes for storing bigrams
  • Updating Database add_document to support bigrams
  • TermlistTable changes to store termlist of bigrams
  • Documentation of Indexing and accessing bigrams.
  • Adding or changing PostListTable to store the postlist of bigrams. (Not required; using the same infrastructure as Postlist.)
  • Analyzing changes to query object and query parser for bi-gram implementation.
  • Per-document statistics in the matcher infrastructure or query parser object. (Linkage to back-end functions)
  • Solving the bugs due to the writable back-end; solved the problem for an uncommitted database.
  • Integrated the back-end compact with per-document statistics.
  • Make summary of the evaluation module.
  • Add check_adhoceval to the evaluation module
  • ~Draw up a list of things to investigate to find why precision is low~
  • ~Reply to Parsenjit sir asking for the TREC Collection~
  • ~Review the Unigram and Bigram papers again to see if something was missed that would explain the low precision~
  • Review the TODO list and find how much time each item will require
  • ~Writing tests for the Bi-gram Language model, Bigram implementation, and Unigram~
  • Comments by Jaylett on last meeting
  • ~Add last meeting to the Meetings Note~
  • ~Index FIRE Collection with Bigrams and Stopword~
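
The smoothing and log items above can be sketched roughly as follows. This is only an illustration using the textbook formulas for the three schemes; the function names, parameter names, and default values are hypothetical and are not the project's API. The log step assumes the "log trick" means scaling the probability by a log parameter before taking the log and clamping negative results to zero; the approach actually decided on is recorded in the Notes.

```python
import math

def jelinek_mercer(tf, doc_len, cf, coll_len, lam=0.7):
    """Linear interpolation of the document model and the collection model."""
    return (1.0 - lam) * (tf / doc_len) + lam * (cf / coll_len)

def dirichlet(tf, doc_len, cf, coll_len, mu=2000.0):
    """Dirichlet prior smoothing: mu pseudo-counts drawn from the collection model."""
    return (tf + mu * (cf / coll_len)) / (doc_len + mu)

def absolute_discount(tf, doc_len, unique_terms, cf, coll_len, delta=0.7):
    """Subtract delta from each seen count and give the freed mass to the collection model."""
    seen = max(tf - delta, 0.0) / doc_len
    freed_mass = delta * unique_terms / doc_len
    return seen + freed_mass * (cf / coll_len)

def clamped_log_weight(prob, log_param=1e6):
    """Assumed log trick: scale prob by log_param, take the log, clamp negatives to 0."""
    if prob <= 0.0:
        return 0.0
    return max(math.log(prob * log_param), 0.0)

# Term seen twice in a 10-term document, 5 times in a 100-term collection.
p = jelinek_mercer(tf=2, doc_len=10, cf=5, coll_len=100, lam=0.5)
weight = clamped_log_weight(p)
```

With these definitions a very small probability (for example 1e-9 with log_param=1e6) yields a weight of 0 rather than a negative value, which is the clamping behaviour the TODO item refers to.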

Things to Investigate for Bigram:

  • Index the collection with StopWords and see if it improves. --> Improved the Performance
  • Check whether setting a large value for the log param hurts the performance --> Improved the Performance
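
To illustrate why stopword handling matters for bigram indexing, here is a minimal sketch of building bigrams from a token stream. It assumes that "with StopWords" means keeping stopwords in the index, which is only one reading of the item above; the `bigrams` helper and the stopword set are hypothetical and are not the project's code.

```python
def bigrams(tokens, stopwords=frozenset()):
    """Return adjacent-token bigrams, after dropping any tokens listed in stopwords."""
    kept = [t for t in tokens if t not in stopwords]
    return [f"{a} {b}" for a, b in zip(kept, kept[1:])]

tokens = "state of the art".split()

# Stopwords kept: every bigram pairs words that were adjacent in the text.
with_stopwords = bigrams(tokens)

# Stopwords removed: the only bigram joins words that were never adjacent.
without_stopwords = bigrams(tokens, stopwords={"of", "the"})
```

Removing stopwords before pairing produces bigrams such as "state art" that never occur in the text, so the two indexing choices give different bigram statistics.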
Last modified on 19/08/12 20:12:23