Details
Ranking is a central part of almost all information retrieval problems, and academic research has shown that Learning-to-Rank performs better than unsupervised ranking models such as TF-IDF or BM25. Currently the letor system in Xapian is untested and not ready to be released, and this project aims to make it releasable by the end of the summer. This will have a direct impact on the ranking effectiveness of Xapian. The project also aims to make xapian-evaluation releasable, which will provide a platform for evaluating the performance of all the ranking schemes currently present in Xapian against standard benchmark datasets. This will benefit all the organisations currently using Xapian.
Add more detailed tests of both the higher-level API and the lower-level pieces (so, for instance, a particular Feature can be tested independently of the overall API), and also look at corner cases and exceptional behaviour.
This part involves writing tests for each part of the API that currently has none. I have written a sample unit test for one feature in #226, which would be easy to extend to the remaining 16 features; such a test broadly comprises three checks (title, body and whole document) for an individual feature such as TfDoclenFeature, with the last one being the weight. Currently the FeatureList tests only cover the six default features (like TfDoclenFeature), so adding the other features and testing them would be worthwhile. This project also involves adding various rankers and score metrics, which will need their own individual tests. Finally, the current ranker tests (for rankers like ListMLE, ListNet and SVMRanker) are not detailed enough to exercise each method, so I would like to add more detailed tests that cover the entire ranker API.
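As a rough illustration, here is a minimal sketch of what such a per-feature test could look like, written in the style of Xapian's test harness (DEFINE_TESTCASE, TEST_EQUAL, TEST_REL). The exact FeatureList and FeatureVector signatures, the headers, the test database name and the assumption that a single feature yields three values (title, body, whole) are my assumptions from reading the current code and would need checking against the real headers:

    // Sketch of a per-feature unit test for xapian-letor (signatures assumed).
    #include <xapian.h>
    #include <xapian-letor.h>
    #include "apitest.h"
    #include "testutils.h"
    #include <vector>

    DEFINE_TESTCASE(tfdoclenfeature_values, generated) {
        Xapian::Database db = get_database("apitest_ranker");
        Xapian::Query query("score");
        Xapian::Enquire enquire(db);
        enquire.set_query(query);
        Xapian::MSet mset = enquire.get_mset(0, 10);

        // Build a FeatureList holding only the feature under test, so it is
        // exercised independently of the other features and of the weight.
        std::vector<Xapian::Feature*> features;
        features.push_back(new Xapian::TfDoclenFeature());
        Xapian::FeatureList flist(features);

        std::vector<Xapian::FeatureVector> fv =
            flist.create_feature_vectors(mset, query, db);
        TEST_EQUAL(fv.size(), mset.size());
        // One value each for the title, the body and the whole document.
        TEST_EQUAL(fv[0].get_fcount(), 3);
        // Normalised feature values should stay within [0, 1].
        for (double val : fv[0].get_fvals()) {
            TEST_REL(val, >=, 0.0);
            TEST_REL(val, <=, 1.0);
        }
        return true;
    }

The same pattern would then be repeated for the other features, and an analogous set of testcases would check each ranker's public methods.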
Create an evaluation and performance reporting system for letor, so that both usefulness and speed can be investigated. This should use public datasets. Integrate remaining rankers and scorers.
This part includes fixing the xapian-evaluation module, whose code is currently quite messy. Known issues include inconsistent indentation and a segmentation fault when it is used with articles that do not end with "</DOC>": the parser cannot determine where the document ends and falls into an infinite loop, and the fix is to derive the expected end tag automatically from the document's start tag. There is also a segmentation fault with very large files, because in src/trec_index.cc the curpos variable is an int, which overflows after a certain point; a simple fix is to change it to long long, so that the check defined at line 149 of src/trec_index.cc no longer fails and the parser cannot enter an infinite loop. After these issues are fixed, the module should also no longer segfault on invalid input formats. More issues are likely to come up during further investigation.
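To make the overflow concrete, here is a small self-contained illustration (not the real trec_index.cc code; the parsing loop is simplified away) of why a 32-bit offset cannot work for collections larger than 2 GiB, and why widening curpos to a 64-bit type lets the end-of-file check terminate the loop:

    // Toy illustration of the curpos problem; not the real parser code.
    // Once the byte offset passes INT_MAX, a 32-bit curpos can no longer hold
    // the true position, so a check like the one at trec_index.cc:149 can keep
    // looping.  With a 64-bit offset the same loop terminates normally.
    #include <climits>
    #include <iostream>

    int main() {
        const long long file_length = 3LL << 30;  // pretend a 3 GiB TREC file
        const long long chunk = 1 << 20;          // consume 1 MiB per iteration
        long long curpos = 0;                     // proposed fix: 64-bit offset

        while (curpos < file_length)              // terminates: no overflow
            curpos += chunk;

        std::cout << "final offset " << curpos << " fits in int? "
                  << (curpos <= INT_MAX ? "yes" : "no") << '\n';  // prints "no"
        // With `int curpos` the additions would overflow long before reaching
        // file_length, the value would wrap and the loop could spin forever.
        return 0;
    }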
Currently running make dist fails with a libtool error "cannot find the library '/home/vaibhav/Desktop/xapian/xapian-core/libxapian-1.5.la' or unhandled argument '/home/vaibhav/Desktop/xapian/xapian-core/libxapian-1.5.la'". I plan to solve this issue during the community bonding period.
Add bindings support via our existing SWIG-based bindings approach, so that we get a range of languages at once. At least initially we don't need to be able to subclass any Letor classes in the bindings, just use Letor functionality from other languages.
● My main goal for this summer is to make the letor module releasable, evaluate it through xapian-evaluation on the FIRE or INEX dataset, and report its performance against standard benchmark datasets. This will involve writing tests for the feature APIs, which have not been thoroughly tested.
● Fix bugs that I come across while testing.
● Write bindings for letor in different languages and tests for them.
● If time permits, polish the xapian-evaluation code or rewrite it in Python to make it easier to add support for other input formats, such as XML.