Report
Usage
Usage: ./bin/letor_evaluation <Options>
Options:
-a, --training_set=TRAINING_SET_PATH training_set path (required_argument)
-e, --test_set=TEST_SET_PATH test_set path (required_argument)
-r, --ranker=RANKER specify the ranking algorithm
Supported algorithms (optional_argument, default=ListNet):
0: svmrank
1: ListNet
2: ListMLE
3: Adarank
4: Hybrid model
-m, --metric=METRIC specify the metric to evaluate the ranking result
Supported metrics: MAP, NDCG, ERR (optional_argument, default=NDCG)
0: MAP
1: NDCG
2: ERR
-i, --iterations=ITERATIONS the number of iterations (optional_argument, default=25)
-l, --learning_rate=LEARNING_RATE learning rate (optional_argument, default=0.01)
-h, --help display this help and exit
-v, --version output version information and exit
Example: ./bin/letor_evaluation -a /MQ2008/Fold1/train.txt -e /MQ2008/Fold1/test.txt -r 1 -m 1 -i 25 -l 0.05
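The training and test files here are the unmodified MQ2008 files from LETOR 4.0, which follow the usual LETOR/SVMlight-style layout: one query-document pair per line, with a relevance label, a query id and numbered feature values, followed by a comment. The line below only illustrates that layout and is not taken from the dataset:
0 qid:10002 1:0.056537 2:0.000000 ... 46:0.076923 #docid = GX008-86-4444840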
Evaluation
Dataset: LETOR 4.0 MQ2008 Fold1
Parameters:
SVM-rank: default parameters (from LibSVM)
ListNet: iterations 25, learning rate 0.01
ListMLE: iterations 25, learning rate 0.01
Adarank: iterations 25
Hybrid model: ListNet & ListMLE & Adarank
Note:
- The evaluation uses the LETOR 4.0 dataset. Note that these results are not directly comparable to the baselines reported on MSRA's website. The main objective is to gauge the performance of the new ranking algorithms and the rank aggregation model, so in the evaluation I simply picked casual parameters, which may be far from optimal.
- The original aggregation model only had a borda-fuse algorithm, whose blending result is unsatisfactory, as shown below. So I added a z-score blending algorithm, which blends significantly better than both borda-fuse and the best single model; it is now the default blending algorithm in letor_evaluation. A minimal sketch of the z-score idea follows this list.
- In the "Hybird model", I combine the ranking result of ListNet, ListMLE and Adarank. The blending result of these three models has a better performance than a combination of all the four single models.
- For simplicity, the NDCG and ERR scores here are computed over the full length of each ranking result rather than truncated at a cutoff k (NDCG@k, ERR@k); a sketch of the full-length NDCG computation also follows this list.
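As a reference for the z-score note above, here is a minimal sketch of the idea, not the actual code in the letor module; the function name zscore_blend, the std::map-based input layout and the string docids are illustrative assumptions. Each ranker's scores for a query are normalised to zero mean and unit variance, the normalised scores are summed per document, and the documents are re-ranked by the combined score.

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

// score_lists[r] maps docid -> score given by ranker r for one query
// (hypothetical layout; the real module works on its own ranklist/MSet types).
std::vector<std::string>
zscore_blend(const std::vector<std::map<std::string, double>>& score_lists)
{
    std::map<std::string, double> blended;
    for (const auto& scores : score_lists) {
        if (scores.empty()) continue;
        // Mean and standard deviation of this ranker's scores.
        double mean = 0.0;
        for (const auto& p : scores) mean += p.second;
        mean /= scores.size();
        double var = 0.0;
        for (const auto& p : scores) var += (p.second - mean) * (p.second - mean);
        double sd = std::sqrt(var / scores.size());
        if (sd == 0.0) sd = 1.0;  // all scores equal: avoid dividing by zero
        // Accumulate each document's z-score across rankers.
        for (const auto& p : scores)
            blended[p.first] += (p.second - mean) / sd;
    }
    // Order documents by blended score, highest first.
    std::vector<std::pair<std::string, double>> ranked(blended.begin(), blended.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::string, double>& a,
                 const std::pair<std::string, double>& b) { return a.second > b.second; });
    std::vector<std::string> result;
    result.reserve(ranked.size());
    for (const auto& p : ranked) result.push_back(p.first);
    return result;
}
```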
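Likewise, a minimal sketch of the full-length NDCG mentioned above, using the standard exponential gain and logarithmic discount; the function name ndcg_full and the plain label-vector input are assumptions, not the module's API.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// labels[i] is the graded relevance label of the document at rank i (0-based),
// in the order produced by the ranker.
double ndcg_full(const std::vector<int>& labels)
{
    // DCG of the labels in the given order: sum of (2^rel - 1) / log2(i + 2).
    auto dcg = [](const std::vector<int>& l) {
        double s = 0.0;
        for (std::size_t i = 0; i < l.size(); ++i)
            s += (std::pow(2.0, l[i]) - 1.0) / std::log2(static_cast<double>(i) + 2.0);
        return s;
    };
    // Ideal ordering: labels sorted in decreasing order of relevance.
    std::vector<int> ideal(labels);
    std::sort(ideal.begin(), ideal.end(), std::greater<int>());
    double idcg = dcg(ideal);
    return idcg > 0.0 ? dcg(labels) / idcg : 0.0;
}
```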
Single model results:
Ranker | SVM-rank | ListNet | ListMLE | Adarank |
MAP | 0.359551 | 0.452270 | 0.449688 | 0.462444 |
NDCG | 0.451360 | 0.522194 | 0.521012 | 0.526589 |
ERR | 0.334137 | 0.478187 | 0.468892 | 0.461910 |
Hybrid model results:
Aggregator | borda-fuse | z-score |
MAP | 0.454378 | 0.465396 |
NDCG | 0.527090 | 0.533131 |
ERR | 0.483000 | 0.492659 |
Future Work
Maybe this work can be considered for next year's GSoC project.
- Merge code (important): There are two versions of the letor module now. One is Jiarong's version, which uses a new MSet instead of the ranklist, and the other is mine, which still uses the old ranklist.
- How to use letor in Xapian (important): In this year's version of letor, people need to provide both a training set and a test set in order to use letor. But how can the training and test sets, and especially the relevance labels, be obtained naturally from the documents in Xapian's database? This is an important question for using letor.
- Clean code: there is some unused code in letor.
- Directory structure: most of the files currently live in one directory; they could be split into ranker, metric, feature and so on, which would be clearer.
- Documentation: clearer documentation.
- Add tree-based ranking models, such as LambdaMART.