wiki:GSoC2011/LTR/Notes

Brief overview of how to get started with svn

Below are some brief commands which can be handy for setting up and working with your branch.

To create your branch:

svn copy svn://svn.xapian.org/trunk/ svn://svn.xapian.org/branches/gsoc2011-parth

You do not yet have a local copy of the branch, so go to /home/your/branch/ and run the command below:

svn checkout svn://svn.xapian.org/branches/gsoc2011-parth .

Now you can make your changes in your local copy and later commit them to your branch with the command below:

svn ci -m "msg" .

For Reference: http://www.math-linux.com/spip.php?article118

QueryLevelNorm

When we calculate the features for a particular query and its documents, a feature may take any real value. With QueryLevelNorm we want the values normalized to lie between 0 and 1. To map the values into [0,1], we divide each value of a particular feature by the maximum value of that feature among all the documents retrieved for that query, hence the name QueryLevelNorm.

Example:

In the snippet of a training file below, each line represents a query-document pair.

1 qid:1 1:32.12 2:31.11 3:1.21 #docid:12345
0 qid:1 1:43.23 2:21.43 3:3.12 #docid:12321
1 qid:1 1:12.12 2:33.99 3:6.32 #docid:22323

Here the first column is the relevance judgement, which is only available for machine learning and not at real-time ranking; the second column is the query id; from the third column onwards are the feature values in 'featureid:value' form; and everything after # is a comment containing the document id.

Now for QueryLevelNorm, the maximum value of feature 1 in the above example is 43.23, so we divide all the values of feature 1 for that query (over all its retrieved documents) by 43.23, so that they now lie between 0 and 1. We perform the same operation for every feature.
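
A minimal sketch of this two-pass normalization in C++ (the names here are illustrative, not the actual xapian-letor API, and non-negative feature values are assumed):

#include <cstddef>
#include <map>
#include <vector>

typedef std::map<int, double> FeatureMap;   // featureid -> value

// Normalize the feature values of all documents retrieved for one query.
void query_level_norm(std::vector<FeatureMap> & docs) {
    std::map<int, double> max_val;   // per-feature maximum over this query's documents

    // First pass: find the maximum value of each feature.
    for (std::size_t i = 0; i < docs.size(); ++i) {
        for (FeatureMap::const_iterator it = docs[i].begin(); it != docs[i].end(); ++it) {
            if (it->second > max_val[it->first])
                max_val[it->first] = it->second;
        }
    }

    // Second pass: divide each value by that maximum so it lies in [0,1]
    // (features whose maximum is 0 are left unchanged to avoid dividing by zero).
    for (std::size_t i = 0; i < docs.size(); ++i) {
        for (FeatureMap::iterator it = docs[i].begin(); it != docs[i].end(); ++it) {
            double m = max_val[it->first];
            if (m > 0) it->second /= m;
        }
    }
}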

The whole idea of QueryLevelNorm, and of normalization in general, is to bring all the data points (in our case, queries) onto a common scale, so that the machine learning is not biased towards particular kinds of queries when judging relevance.

Mathematical Equations in RST

I have written most of the equations in LaTeX and now have to export them to HTML/XML, probably using something like the docutils raw role (".. role:: raw-latex(raw)" with a ":format: latex" option) or similar.

Some clues:

http://docutils.sourceforge.net/FAQ.html#how-can-i-include-mathematical-equations-in-documents

http://www1.american.edu/econ/itex2mml/mathhack.rst

Merging trunk into the branch using 'svn merge'

A couple of minor points to remember while merging trunk into the branch (an example merge command follows the list):

  1. Use the top-level directory for all the commands (and not the ../core/ directory).
  2. After merging, if a file is in conflict then most of the time 'tc' (theirs-conflict) is the right answer.
  3. If you want to compile after merging, start from ./configure, then make, then make install. [I usually tended to start straight from 'make'.]
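
For example, assuming SVN 1.5+ merge tracking and that the current directory is the top level of the branch checkout, the merge itself would look something like:

svn merge svn://svn.xapian.org/trunk .

followed by the usual 'svn ci' once any conflicts are resolved.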

IR Evaluation of Letor ranking scheme

It is very important to test the performance of ranking algorithms in terms of standard measures like MAP and NDCG, so the MAP and NDCG scores on the INEX 2010 queries are reported below. Note that these are not directly comparable to the scores reported in the INEX proceedings, because the indexed dataset is not the whole collection: 2,000,038 out of the 2,666,190 documents are indexed. The first 75% of the 2010 query set is used to train the Letor model and the remaining 25% is used to test the performance of the system. The main aim is to see the performance change with respect to Xapian's existing BM25 scoring scheme; this post must not be confused with systems being compared under the standard settings of INEX.

Ranking Scheme    MAP [1]    NDCG [2]
BM25              0.3130     0.4146
Letor             0.5184     0.6130

These scores were obtained using the evaluation script 'Eval-Score-4.0.pl' from the LETOR 4.0 dataset [3], which is a standard benchmark collection for the Letor framework.
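
For reference, a commonly used formulation of NDCG at rank k is given below (the exact gain and discount used by the evaluation script may differ in detail):

\mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}}

where rel_i is the relevance judgement of the document at rank i and IDCG@k is the DCG@k of the ideal ordering, so NDCG values lie in [0,1].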

References:

[1] http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html

[2] Kalervo Järvelin, Jaana Kekäläinen: Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002)

[3] http://research.microsoft.com/en-us/um/beijing/projects/letor/letor4dataset.aspx

Refactoring directions

The idea is to handle features independently of the ranking module. So the code refactoring can be seen as two different modules: 1. transformation of the documents in a Xapian::MSet into a Letor::RankList, and 2. assigning a score to each document in the Letor::RankList.

As can be seen, the first module just calculates the features and stores the MSet in a data structure suitable for Letor. As soon as the Letor::RankList is ready, we can pass it to the ranking module to assign a score to each document in the RankList.

The refactoring of the first module can be seen as below:

The method prepare_training_file() in letor_internal.cc should go roughly like this:

FeatureManager->set_query(query);
FeatureManager->set_database(db);

RankList rl;
for (each doc in MSet) {
    map<int, double> doc_feature = FeatureVector->transform(doc);   // calculate the feature values for doc
    rl.add(doc_feature);
}

// At this point the RankList is ready.

The refactoring of the second module can be seen as below:

Ranker.cc will be the abstract class, and all new ranking algorithms will extend it by implementing the learn(), save(), load() and score() methods; a sketch is given below.
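
A hedged sketch of what that base class could look like (the class and method names follow the notes above, but the exact signatures are an assumption, not the final xapian-letor API):

#include <string>
#include <vector>

class RankList;   // one query's documents with their feature values, as built above

// Hypothetical abstract base class; concrete rankers (e.g. an SVM-based
// ranker) derive from it and implement the four pure virtual methods.
class Ranker {
  public:
    virtual ~Ranker() { }

    // Learn a model from the training RankLists.
    virtual void learn(const std::vector<RankList> & training_data) = 0;

    // Persist the learnt model to a file.
    virtual void save(const std::string & model_file) const = 0;

    // Load a previously saved model.
    virtual void load(const std::string & model_file) = 0;

    // Assign a score to each document in the given RankList.
    virtual void score(RankList & rl) const = 0;
};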
