Project plan
My main objective this summer is the algorithmic part of the learning-to-rank module. I will add several algorithms to Xapian-Letor, as follows:
1. ListMLE and ListNet
The existing implementations of ListNet and ListMLE in Xapian-Letor were too similar to another open-source implementation, so I will rewrite both of them based on my own understanding. A brief sketch of the ListMLE loss I have in mind is given below.
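To make the rewrite concrete, here is a minimal sketch of the ListMLE negative log-likelihood (the Plackett-Luce likelihood of the ground-truth permutation) for a single query. The function and parameter names are illustrative only, not existing Xapian-Letor API, and the scores are assumed to be ordered by the ground-truth ranking (best first).

```cpp
#include <cmath>
#include <vector>

// ListMLE loss for one query: scores are assumed to be in ground-truth order.
// (Illustrative sketch; not an existing Xapian-Letor function.)
double listmle_loss(const std::vector<double>& scores)
{
    double loss = 0.0;
    for (size_t i = 0; i < scores.size(); ++i) {
        double denom = 0.0;
        for (size_t k = i; k < scores.size(); ++k)
            denom += std::exp(scores[k]);
        // Plackett-Luce: log-probability of selecting document i next.
        loss -= scores[i] - std::log(denom);
    }
    return loss;
}
```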
2. Metric module
A metric module is very important for letor, since we need standard measures to evaluate a trained model. As far as I can see, the metric module should provide precision, recall, MAP, NDCG and ERR. I will add a metric module to Xapian-Letor this summer; a small NDCG sketch is given below.
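As an example of what the module would compute, here is a minimal NDCG@k sketch, assuming graded relevance labels are given in ranked order. The function names and signatures are illustrative, not part of Xapian-Letor.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// DCG@k over graded relevance labels in ranked order (illustrative sketch).
double dcg_at_k(const std::vector<int>& rels, size_t k)
{
    double dcg = 0.0;
    for (size_t i = 0; i < std::min(k, rels.size()); ++i)
        dcg += (std::pow(2.0, rels[i]) - 1.0) / std::log2(i + 2.0);
    return dcg;
}

// NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.
double ndcg_at_k(std::vector<int> rels, size_t k)
{
    double dcg = dcg_at_k(rels, k);
    std::sort(rels.begin(), rels.end(), std::greater<int>());
    double idcg = dcg_at_k(rels, k);
    return idcg > 0.0 ? dcg / idcg : 0.0;
}
```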
3. Normalization module
Currently, feature normalization is done by dividing each feature value by its maximum. This is a linear normalization, but the more usual linear form is (v - min) / (max - min). Further, I want to add z-score normalization, which performed well in my previous experience. I also want to implement an abstract class named Normalizer which can be inherited by multiple normalization methods. Normalization can be used not only when computing features but also in rank aggregation. A rough sketch of the class hierarchy follows.
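The sketch below shows the kind of Normalizer abstraction I have in mind, with min-max and z-score subclasses. The class and method names are my own proposal, not existing Xapian-Letor classes.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Proposed abstract base class for normalization methods (illustrative).
class Normalizer {
  public:
    virtual ~Normalizer() {}
    virtual void normalize(std::vector<double>& values) const = 0;
};

// Linear normalization: v -> (v - min) / (max - min).
class MinMaxNormalizer : public Normalizer {
  public:
    void normalize(std::vector<double>& values) const override {
        if (values.empty()) return;
        auto mm = std::minmax_element(values.begin(), values.end());
        double min = *mm.first, range = *mm.second - *mm.first;
        if (range == 0.0) return;
        for (double& v : values) v = (v - min) / range;
    }
};

// Z-score normalization: v -> (v - mean) / standard deviation.
class ZScoreNormalizer : public Normalizer {
  public:
    void normalize(std::vector<double>& values) const override {
        if (values.empty()) return;
        double mean = std::accumulate(values.begin(), values.end(), 0.0) / values.size();
        double var = 0.0;
        for (double v : values) var += (v - mean) * (v - mean);
        double sd = std::sqrt(var / values.size());
        if (sd == 0.0) return;
        for (double& v : values) v = (v - mean) / sd;
    }
};
```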
4. Ranking aggregation module
We need to blend multiple ranking results into a final ranking. In this module, I plan to implement two ways of blending: linear blending and voting-model blending. In linear blending, I will normalize each predicted score with one of the normalization methods above and then combine the scores in a linear formula. For the voting model, I want to use Borda Count: "Each voter ranks a fixed set of c candidates in order of preference. For each voter, the top ranked candidate is given c points, the second ranked candidate is given c−1 points, and so on." A small sketch of Borda Count is given below.
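Here is a minimal Borda Count sketch: each input ranking awards c points to its top document, c−1 to the next, and so on, and documents are then ordered by total points. The document-id type and container layout are illustrative assumptions, not the Xapian-Letor API.

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Blend several rankings of document ids with Borda Count (illustrative sketch).
std::vector<unsigned> borda_count(const std::vector<std::vector<unsigned>>& rankings)
{
    std::map<unsigned, double> points;
    for (const auto& ranking : rankings) {
        size_t c = ranking.size();
        for (size_t pos = 0; pos < c; ++pos)
            points[ranking[pos]] += static_cast<double>(c - pos);
    }
    std::vector<unsigned> result;
    for (const auto& p : points) result.push_back(p.first);
    std::sort(result.begin(), result.end(),
              [&points](unsigned a, unsigned b) { return points.at(a) > points.at(b); });
    return result;
}
```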
5. RankBoost and AdaRank
The letor module currently has three ranking algorithms: one based on SVM and two based on neural networks. I want to add two boosting-based ranking algorithms, RankBoost and AdaRank. Both use AdaBoost-style optimization, but RankBoost optimizes pairwise preferences between documents, while AdaRank directly optimizes the final evaluation metric. A simplified sketch of the pairwise weight update used in RankBoost is given below.
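The following is a simplified, illustrative sketch of a RankBoost-style pair weight update for one boosting round: the weight of a (worse, better) document pair grows when the chosen weak ranker mis-orders it. The data structures and names are assumptions for illustration only.

```cpp
#include <cmath>
#include <map>
#include <utility>

// D: weight per (worse, better) document pair; h: scores from this round's
// weak ranker; alpha: the weak ranker's weight. (Illustrative sketch only.)
void update_pair_weights(std::map<std::pair<int, int>, double>& D,
                         const std::map<int, double>& h, double alpha)
{
    double Z = 0.0;
    for (auto& p : D) {
        int worse = p.first.first, better = p.first.second;
        // The weight increases when the weak ranker scores the "worse"
        // document at least as high as the "better" one.
        p.second *= std::exp(alpha * (h.at(worse) - h.at(better)));
        Z += p.second;
    }
    for (auto& p : D) p.second /= Z;  // renormalize to a distribution
}
```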
Optional tasks:
1. Relevance assessment module
The relevance file is important for model training. If users don't have a relevance file but still want to obtain one, we can use the "pooling" method: collect the results for a query from different ranking schemes and ask the user to label them. If I still have time this summer, I want to finish this for Xapian. A minimal sketch of pooling is shown below.
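A rough sketch of the pooling idea: take the union of the top-k results from several ranking schemes and present that pool for manual labelling. The function name and types are illustrative assumptions, not existing Xapian-Letor API.

```cpp
#include <set>
#include <vector>

// Union of the top-k document ids from several runs (illustrative sketch).
std::set<unsigned> build_pool(const std::vector<std::vector<unsigned>>& runs, size_t k)
{
    std::set<unsigned> pool;
    for (const auto& run : runs)
        for (size_t i = 0; i < run.size() && i < k; ++i)
            pool.insert(run[i]);
    return pool;
}
```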
2. LambdaMART algorithm
LambdaMART is, to the best of my knowledge, the state-of-the-art learning-to-rank algorithm, so if I have enough time this summer, or some free time afterwards, I would be glad to implement it.