wiki:GSoCProjectIdeas/LearningtoRankStabilisation

Learning to Rank Stabilisation project

Introduction

Learning to Rank (Letor) is the application of Machine Learning (ML) to Information Retrieval (IR), in particular to the problem of ranking. Each document is represented by a vector of features. These features try to distinguish the levels of relevancy between documents, and are usually a number or a boolean flag. The BM25 score for a document is a common feature; for webpages, the number of in links and out links are two potential features.

In the academic literature, Learning-to-Rank has been shown to perform better than unsupervised ranking models like TF-IDF or BM-25, especially in document retrieval and web-page retrieval.

Some of the state-of-the-art learning to rank algorithms are listed below with the title of their reference paper and basic information. Most of these papers are freely available on the web.

  • ListNet - "Learning to Rank: From Pairwise Approach to Listwise Approach" (based on neural net)
  • RankNet - "Learning to Rank using Gradient Descent" (based on neural net)
  • ListMLE - "Listwise Approach to Learning to Rank - Theory and Algorithm" (based on neural net)
  • LamdaRank - "Learning to Rank with Nonsmooth Cost Functions" (based on neural net)
  • "Learning to rank with extremely randomized trees" (based on random forest and decision trees)
  • "AdaRank - A Boosting Algorithm for Information Retrieval" (based on Boosting)
  • YetiRank - "Winning The Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank" (based in gradient boosting)

and many more. Some of these we have implementations for (although not all have been merged yet).

Letor in Xapian

Xapian currently has experimental Letor support, originally implemented as a GSoC project in 2011, and further worked on in 2012, by two projects in 2014, and by stablisation projects in 2016 and 2017. There was also a project in 2017 to mine relevance judgements from logs of user clicks on search results.

The ultimate aim is to provide an end-to-end system, including a module to extract the features responsible for ranking and also preparing the training data.

Xapian's LTR support consists of a number of distinct pieces:

  • feature extraction and representation (feature vectors and so forth)
  • a range of rankers
  • a range of scorers
  • "core" code that ties this all together

Project summary

The current situation is that the API mostly makes sense, although further use may suggest improvements. We have reasonably high test coverage for the Letor system, although some further testing for edge cases / exceptions, and some unit tests at the level of Features would be helpful. There is an unmerged PR for adding letor documentation to the user guide, and we definitely want bindings support so Letor functionality can be used from languages other than C++. We also would like a performance and evaluation system to work with large-scale and real-world datasets.

The focus this year needs to be squarely on turning letor into a stable piece of software with good test coverage and enough documentation that people new to letor can easily start to use it. We suggests the following pieces of work need doing to achieve that:

  1. Create practical code examples that use the core features and API, fleshing out the user guide
  2. Based on this experience, propose (and document) any changes to the user-facing APIs that will make it a more practical system to use (in particular, there are magic numbers in part of the current user-facing API which should be replaced either with symbolic constants or directly using C++ polymorphism features)
  3. Add more detailed tests of both the higher-level API and the lower-level pieces (so for instance a particular Feature can be tested independently of the overall API), and also looking at corner cases and exceptional behaviour
  4. Ensure that make dist works and produces a useful distribution archive (this allows us to do our first proper release of letor); we will probably want to add make distcheck to one of the buildbot configurations
  5. Create an evaluation and performance reporting system for letor, so that both usefulness (accuracy and precision of results) and speed can be investigated. This should use public datasets (either INEX2009, FIRE, or something similar; we can provide access to FIRE). This will probably be at least partly based on evaluation work done previously (samuelharden's repo and vivekp's fork which supports some additional rankers)
  6. Add bindings support, via our existing SWIG-based bindings approach so that we get a range of languages at once. At least initially we don't need to be able to subclass any Letor classes in the bindings, just use Letor functionality from other languages

Here are some lower-priority ideas on how to further develop our Letor system once we have it in a releasable state:

  1. Add support for the database backend to track the length of fields would avoid having to handle this specially as a feature for Letor, and would also allow implementation of weighting schemes like BM25F and PL2F.
  2. Use a linear regression approach to combine the scores given by different Rankers
  3. A feature reduction technique can be added to eliminate redundant features.

We can help you with all the required knowledge of IR and ML.

Resources

Writing a good proposal

It's essential with this project to be able to break the work into small, achievable pieces, each of which can be reviewed and merged before moving on to the next. When constructing a timeline, it's important to ensure there is sufficient time for each milestone to be reviewed by the rest of the Xapian team, including any time you may need to incorporate suggestions, so that sub-projects can be merged to Xapian's master branch. (This will also aid subsequent merges, because you won't need to spend significant time rebasing back onto master any subsequent work.) For this reason, it is important that you take advantage of the opportunity before GSoC proper starts to get familiar with contributing to Xapian by getting a small change merged to master. This will give you a chance to get used to the tools needed (git and so forth), ensure your development environment is set up, and get you into the swing of Xapian's coding and development style.

You may wish to plan to use the time while waiting for feedback on a sub-project to finish documentation for that piece of work, and perhaps to start writing example code for the next sub-project. (The documentation is in a different repository, so can be reviewed in parallel to planning and coding a subsequent sub-project without causing rebase problems. Example code can initially be created and reviewed in a separate repository, and copied over once it's ready.)

Skills required

  • Solid C++
  • High level of attention to detail
  • Knowledge of Information Retrieval (IR) and Machine Learning (ML) useful but not essential
Last modified 6 months ago Last modified on 29/03/19 21:07:11