Learning to Rank Stabilisation project
Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking. Each document is represented by a vector of features. These features try to distinguish the levels of relevancy between documents, and are usually a number or a boolean flag. The BM25 score for a document is a common feature; for webpages, the number of in links and out links are two potential features.
In the academic literature, Learning-to-Rank has been shown to perform better than unsupervised ranking models like TF-IDF or BM-25, especially in document retrieval and web-page retrieval.
Some of the state-of-the-art learning to rank algorithms are listed below with the title of their reference paper and basic information. Most of these papers are freely available on the web.
- ListNet - "Learning to Rank: From Pairwise Approach to Listwise Approach" (based on neural net)
- RankNet - "Learning to Rank using Gradient Descent" (based on neural net)
- ListMLE - "Listwise Approach to Learning to Rank - Theory and Algorithm" (based on neural net)
- LamdaRank - "Learning to Rank with Nonsmooth Cost Functions" (based on neural net)
- "Learning to rank with extremely randomized trees" (based on random forest and decision trees)
- "AdaRank - A Boosting Algorithm for Information Retrieval" (based on Boosting)
- YetiRank - "Winning The Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank" (based in gradient boosting)
and many more.
Letor in Xapian
Xapian currently has experimental Letor support, originally implemented as a GSoC project in 2011, and further worked on in 2012, by two projects in 2014, and by stablisation projects in 2016, 2017, 2018 and 2019. There was also a project in 2017 to mine relevance judgements from logs of user clicks on search results.
The ultimate aim is to provide an end-to-end system, including a module to extract the features responsible for ranking and also preparing the training data, for both users interesting in machine-learning driven result ranking, and researchers investigating improvements to LTR performance.
Xapian's LTR support consists of a number of distinct pieces:
- feature extraction and representation (feature vectors and so forth)
- a range of rankers
- a range of scorers
- "core" code that ties this all together
The current situation is that the API mostly makes sense, although further use may suggest improvements. We have reasonably high test coverage for the Letor system, although some further testing for edge cases / exceptions is always possible.
The most recent stabilisation project identified a number of problems in our earlier implementation work, and fixed some of them.
There are some open pull requests that would be good to complete and merge:
- add letor documention to the user guide; this probably needs to be expanded as well
- fix various ranker implementation problems
- python3 bindings
The other things that need doing include:
- bindings and user guide examples for other languages (except python2, since the language itself will be unsupported from 2020). At least initially we don't need to be able to subclass any Letor classes in the bindings, just use Letor functionality from other languages
- investigate research that can drive tangible performance improvements above BM25+
- a performance and evaluation system to work with large-scale and real-world datasets. This should use public datasets (either INEX2009, FIRE, or something similar; we can provide access to FIRE). This will probably be at least partly based on evaluation work done previously (samuelharden's repo and vivekp's fork which supports some additional rankers)
- there are likely to be further API improvements that make sense
- add make dist / make distcheck profiles to buildbot, and ensure letor builds on all our other integrated CI (currently AppVeyor and Travis)
Here are some lower-priority ideas on how to further develop our Letor system once we have it in a releasable state:
- Add support for the database backend to track the length of fields would avoid having to handle this specially as a feature for Letor, and would also allow implementation of weighting schemes like BM25F and PL2F.
- Use a linear regression approach to combine the scores given by different Rankers
- A feature reduction technique can be added to eliminate redundant features.
We can help you with all the required knowledge of IR, although prior exposure to ML concepts would be helpful.
- These blog posts (part 1&2) give a good introduction to the basics of learning to rank from a search engine perspective
- Previous projects:
- Previous "Stabilisation" projects:
- Yet another overview of Learning to Rank
- Sample training file generated using xapian-letor from INEX data
- Solid C++
- High level of attention to detail
- Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential