Table of Contents
Introduction to Bi-gram in Xapian
Bigrams
Bi-gram indexing will be implemented in Xapian for use in Bigram Language Model Implementation ,collocations.All the indexed data while indexing in stored in Btree table.So to easy access bigram will be stored as a bi-gram term.
For a Document with content:
Read a book about the history of read.
Bi-gram will be (with stop words removed) : given are bigram with bigramid
Bi-gram bgid(bigramid) "read book" - 881 "book about" - 882 "about history" - 883 "history read" - 884
Term List Data Storage
Term list table for the document (Lets say did = 88 and make key for terms will add uni and for bi-gram will add bi):
Although these data is stored with several optimization like reusing previous term etc.i only mean to depict a overview of how (not exact) data is stored.
88uni 5 4 read 2 book 1 about 1 history 1 88bi 4 4 read^book 1 book^about 1 about^history 1 history^read 1
Posting List Data Storage
Posting List Table for the Document(document ids are assumed and key will add uni for unigram terms bi for bigram terms and coll for collocation additions:
BRASSPostTable Entry term termfreq collfreq firstdid islast firstdocid lastdocid docid freq docid freq docid freq readuni 3 8 98 1 98 88 98 8 78 9 88 1 BRASSBigramPostListTable Entry bigram bigramfreq collfreq firstdid islast firstdocid lastdocid docid freq docid freq read^bookbi 2 2 67 1 67 74 67 1 74 1 BigramCollocationPostListTable Entry uniqbigramfreq totalbigramfreq firstbigramid islast firstbigramid lastbigramid bigramid freq bigramid freq readcoll 2(with first gram read) 4(with first gram read) 871 1 871 981 871 2 981 2
I am Suggesting a postinglist type entry of unigram and bigramids so that it is possible to use this bigrams for collocation. Otherwise i wasnot able to foresee the use of indexed bi-grams for collocation.
TermGenerator Class(Indexing)
TermGenerator class is used for indexing the document and since Bi-gram integration requires changes to index bi-gram this is first point for start of changes.
Method to Create Bigram
Currently index_text function of termgenerator_internal.cc index just the uni-gram terms from the text.For bi-gram indexing we want to also receive Bi-grams and Uni-gram Both or either.we can be store the previous tokenization and merge the previous and current term here in the index_text function and form the bi-grams.
I initially proposed to have two classes UnigramTokenizer and BigramTokenizer but a more cautious look in code reveal its better to keep code where it is as making these two classes wont make code of index_text more simpler as we anyhow need to check various option for stemming and CJK.So its sane to keep code there it self.
Importance of Using Stop Words in Bigram Indexing
Stop-word is a very important aspect in Bi-gram creation if stop-words are not removed from while creating bigrams then it would effect the statistics of bi-grams.
For example:
Sentence: Read a book about the history of read Bigrams created: "Read a" , "a book" ,"about the", "the history","history of","of read". Actual bigrams(which are meaningful): "Read book", "book history","history read"
Since collocation, bi-grams of "Read book" "book history", "history read" are missed which are important are more sensible to have.It is very difficult to Pre feed stop-word list for all languages.It is left and advised to API user to put stop-words for there language in case he use bi-grams for better results.
API Changes for End User
API User is provided with an option to choose to index bigram or not.By Default bigram indexing is false and will not be done.If user wants to use BigramLMWeight Weighting scheme then he should enable bigram by calline set_bigrams(true):
indexer.set_bigrams(true);
By enabling this option bi-grams will be stored in the backend and used could use BigramLMWeight Weihing scheme.If bigrams are not indexed and user try to use this weighting scheme he will be thrown with exception.
DocumentBigramTerm
DocumentBigram Term class will store the bi-gram term to be stored in the database for bi-gram searching and other features.
Positional Information for Bi-gram Removed
Since the final storage of bi-gram is similar to normal terms of Xapian. I am planning to DocumentBigramTerm similar to OmDocumentTerm class with following changes.
OmDocumentTerm stores positional information of the term also for various advanced functions like Phrase Search,Near Adj .Since these operator are more close to uni-gram and we couldn't foresee equivalent benefits of storing positional information for bi-grams.So, currently we are not storing the bi-gram information for the bi-gram Terms.Positional variable not to be included.
Changes to support Collocations
BigramDocumentTerm will also support the indexing of bi-grams for Collocation which require maintain list of bi-grams with similar first word.so Class will also have variable called term1 which store first term in bi-gram.Class constructor will take care of setting first term from bi-gram.
Rest will be similar to OmDocumentTerm class.
string term1;
Doubts: BigramDocumentTerm class
- Since these are similar to OmDocumentTerm should new DocumentBigramTerm be inherited by OmDocument term or not? is there any benefit of inheriting OmDocument class or i can go with a whole new seprate class?
Since we are skipping positional information.I restrained inheriting the class and went on making whole new seprate class.
- should we store bigrams in database with term1 term2 or term1-term2 (since we do a lot of optimization there and reuse previous term just wondering would having space between term will make diffrence due to optimization)?
I tested it by storing bi-grams in back-end in place of term and it went well.We will store bigram as term1 term2
Document class changes
This class will have two more storage maps to store bigrams and collocation terms:
Way to add bigram in Document Structure
Storing in Map similar to uni-gram.
map<String bigramterm,!BigramDocumentTerm>
//Discarded for Now map<String unigramterm,map<bigramid bgid,bigramterm>> // this map will store all the bigrams where first term is same in bigrams. for eg. (Read,((812, Read-book),(817,Read-newspaper),(987,Read-magazines)))
Internal Functions Supporting Bigrams
Since the open_term_list tries to load all the terms for the particular document and using TermIterator it iterates over the terms.Similar infrastructure was setup for Bi-grams and BigramIterator was implemented to iterate over the bi-grams in particular document.
Methods to support the reading of these storage maps like bigramlist_begin() and to fill them back from database on open_document change open_bigram_list() implementation to support the reading of bigram from database.Internally Document class maintain few more methods to accommodate bi-grams:
open_bigram_list(); need_bigrams(); bigrams_modified();
Moreover these function redirect request to Database class with document-id to full-fill the request.
API Changes for User in Document Class
API user have these functions to work with bigram and these are the changes API users see for Bi-gram.
Function to add bi-grams to the document: void add_bigram(const std::string & bname, Xapian::termcount wdfinc = 1); Function to remove bi-grams from the document void remove_bigram(const std::string & bname); // Remove all bigrams from the document. void clear_bigrams(); The length of the bigramlist - i.e. the number of different bigrams which index this document. Xapian::termcount bigramlist_count() const; // Iterator for the bigrams in this document(Access all bi-gram in this document from this Iterator). BigramIterator bigramlist_begin() const; // return the end iterator for bigram_begin(); BigramIterator bigramlist_end() const;
Doubts:
- I want to generate bigramsid for all the bigrams mainly to store bigram in back-end for easy access for collocation.How can we generate such uniqueid ... docid is self generated how is it generated ?
Since we can do prefix search so storing bigrams in this way will add extra load only. so no need to bigramid.Will first try will prefix search only.
- What is use and in what all cases document is open from Database using open_document function in brass_backend?
Still Question is unanswered to me
- Generation of bigramid is fine but i will also need to access bigram corresponding to bigramid.Where can i easily store and access those (bigramid,bigram) pair in backend
I guess this is not needed till again think of stroing in that format due to extensive use.
- Do you think it will be a good idea implement and integrate collocation in this way.if you can think of better way to access collocation please suggest.(This is really a low priority task now i think it more sane to first implement BigramLMWeight and make that functional than including all functionality of bigrams)
Please give your reviews on this
Bi-gram Iterator
BigramIterator Usage and Work
In Xapain Iterator is basically iterators over the list of bi-gram in the document.Internally !BigramIterator::Internal is represented as BigramList.Hence iterator is basically accessing bi-grams and maintain bi-gram internally.
Bi-gramList or BigramIterator don't maintain the positional information of the bi-grams since positional information is important for Phrase,Near,Adj query which will remain with uni-gram always.Hence avoiding storing positional information to save space.
API User Guide
API user can use bigram in a way similar to TermIterator.Following are the methods BigramIterator Provides to API User.
BigramIterator bi = db.bigramlist_begin(Xapian::docid(1)); BigramIterator biend = db.bigramlist_end(Xapian::docid(1)); Accessing name of bigram : String name = *bi; // Implemnted using operator*(); Equality,Inequality test for bigrams: if(bi == biend) {} // using operator== if(bi != biend) {} //using operator!= Next Bigram: bi++; Skipping to particular bi-gram: skip_to("Read book"); Reading statistics: get_wdf(); get_termfreq();
Database changes
Changes to add_document_() to facilitate bi-grams addition to backend
- Add document set_bigrams() to transfer the term list from document to to TermListTable.
- Stores posting list changes for bigram in inverter class with PostingListChanges class (i.e did need to be add to the term in document for postlist).
Finally calls flush function on both all the posting list doclength, unigram post list , bigram post list to merge and store them in database.
Changes To facilitate access of bi-gram and there posting
- open_bigram_list(did) will open the Bigram list from the TermListTable .This function basically makes a call to the BrassBigramList to open list in type of Iterator for the user.
- open_postbigram_list(bigram) will open the posting list for the given bi-gram and return the pointer of !LeafPostLIst to the matcher and user to access the post-list for bigram.
API changes for user to Database class
Iterate over posting list of bi-gram: PostingIterator postlistbigram_begin(const std::string &bname) const; Given Ending Iterator for !PostbigramList PostingIterator postlist_end(const std::string &) Return bi-gram list to iterate over: BigramIterator bigramlist_begin(Xapian::docid did) const; and corresponding end iterator Return bi-gram iterator over all bi-gram in database. TermIterator allbigrams_begin() const; and corresponding end iterator Return true if bigram exists bigram_exists()
BrassBiGramList Changes
Opens bigram list in termlist tablenin a way similar to !BRASSTermList to support reading.Thisclass is basically the !BigramIterator::Internal so it provide access to Bigram list through BigramIterator.
Way data is stored is exactly similar to Termist of Xapian.The key used for bi-gram is slightly different.It packs did(document id to string after hex value:
\x42\x49
which stands for Bi.So key would be Bi1,Bi2 etc.
BrassBigramPostList Changes
Opens the posting list of Bigram stored in postlist table and provide access to Bigrams post-list .Currently storing the Postlist for bigram as considering the bigram as terms with space in between.Since the key to store the postlist is the term itself so storing the bigrams in a similar infrastructure wont hurt the normal terms statistics and postlist.
open_post_list(bigram) will return object of this class BrassPostList and user can iterate over posting list of thee bigram.
In discussion with only we got to know potential problem could occur in wildcard Query expansion.(Currently analyzing this).
QueryParser Changes
Need to intergrate BiGram Term creation in the query with a flag set. Need to make changes to calling infrastructure .QueryParser and query internal object makes call to the open_post_list function to get postlist of term.Need to tweak it to call BrassBigramPostList in case term is bigram.
Rest every thing will be similar as BrassBigramPostList will be similar to BrassPostList .