wiki:GSoC2017/LetorClickstream/ProjectPlan

Details

Click logs are a valuable source of relevance information. Compared to editorial labels, clicks are much cheaper to obtain and reflect what users currently find relevant. Click models have been shown to be very successful at deriving relevance judgements from click logs. Xapian currently provides an experimental letor API, and support for mining relevance judgements from click logs with proven click models, so that training files can be regenerated as new clicks arrive, would greatly complement this module. It could also be used by others to mine relevance from the click logs of their own search applications. Data mined from the logs can substantially improve the document ranking function learned by the letor module (e.g. RankSVM).

Logging the required data from Omega

As discussed during the pre-GSoC period, click data from the search results page can be logged by implementing a second template and redirecting the result links through it. Logging from the result page template itself is not possible, because the clicks happen after that template has been used. This will also involve configuring the OmegaScript $log command to write a second log format in addition to the existing one, so that the click model can consume the log directly and produce relevance judgements without further preprocessing.

Training the DBN model or the Simplified DBN model requires click data with the following fields on each line:

● ID: an identifier for the entry.
● QUERY: text of the query (the tab character \t is not allowed).
● URLs: list of the URLs of the documents displayed on the result page. The list could be stored as JSON or as plain text.
● CLICKS: list of click counts, stored as JSON or as plain text, whichever is easier or faster to parse. Each element is the number of times the corresponding URL was clicked.
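To make the record layout concrete, here is a minimal parsing sketch. It assumes the four fields are tab-separated (suggested by the ban on tabs in QUERY, but not stated above) and that URLs and CLICKS are stored as JSON arrays. The names ClickRecord and parse_log_line are hypothetical and not part of Omega or xapian-letor.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ClickRecord:
    entry_id: str
    query: str
    urls: List[str]    # URLs in the order they were displayed
    clicks: List[int]  # clicks[i] is the click count for urls[i]


def parse_log_line(line: str) -> ClickRecord:
    """Parse one log line of the assumed form ID<TAB>QUERY<TAB>URLs<TAB>CLICKS."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    entry_id, query, urls_json, clicks_json = fields
    urls = json.loads(urls_json)
    clicks = json.loads(clicks_json)
    if len(urls) != len(clicks):
        raise ValueError("URLs and CLICKS must have the same length")
    return ClickRecord(entry_id, query, urls, clicks)


if __name__ == "__main__":
    sample = '1\tlandslide malaysia\t["http://a.example", "http://b.example"]\t[0, 2]\n'
    print(parse_log_line(sample))
```

Checking that URLs and CLICKS have equal length catches truncated or corrupted lines before they reach the click model.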

Mining the relevance judgements

The academic literature describes broadly two representative methods for mining information from click-through log data: Preference Pairs and Click Models. Of the two, click models have been shown to be more successful. I'll be implementing the Dynamic Bayesian Network (DBN) click model, one of the proven click models used for mining relevance judgements and ranking web search results.
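As a rough illustration of what such a model estimates, the sketch below computes relevance with the Simplified DBN, not the full DBN with EM inference. It uses plain counting: attractiveness is the fraction of examined impressions that were clicked, satisfaction is the fraction of clicks that were the last click in a session, and relevance is their product. Several things here are assumptions of the sketch: every result down to the last click counts as examined, sessions without clicks count as fully examined, and no smoothing priors are applied. The function name sdbn_relevance is hypothetical, and Python is used only for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One session: (query, displayed URLs in rank order, click count per URL).
Session = Tuple[str, List[str], List[int]]


def sdbn_relevance(sessions: Iterable[Session]) -> Dict[Tuple[str, str], float]:
    """Estimate relevance = attractiveness * satisfaction per (query, url)."""
    attr_num = defaultdict(int)  # examined and clicked
    attr_den = defaultdict(int)  # examined
    sat_num = defaultdict(int)   # clicked and last click of the session
    sat_den = defaultdict(int)   # clicked

    for query, urls, clicks in sessions:
        clicked_ranks = [i for i, c in enumerate(clicks) if c > 0]
        # Assumption: with no clicks, the user examined the whole list.
        last = clicked_ranks[-1] if clicked_ranks else len(urls) - 1
        for rank in range(last + 1):
            key = (query, urls[rank])
            attr_den[key] += 1
            if clicks[rank] > 0:
                attr_num[key] += 1
                sat_den[key] += 1
        if clicked_ranks:
            sat_num[(query, urls[last])] += 1

    relevance = {}
    for key, examined in attr_den.items():
        attractiveness = attr_num[key] / examined
        # A URL that was never clicked has no satisfaction evidence.
        satisfaction = sat_num[key] / sat_den[key] if sat_den[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance


if __name__ == "__main__":
    demo = [
        ("q", ["a", "b", "c"], [1, 0, 1]),
        ("q", ["a", "b", "c"], [1, 0, 0]),
    ]
    for key, value in sorted(sdbn_relevance(demo).items()):
        print(key, value)
```

In the demo, "c" is counted only in the first session, because in the second session it sits below the last click and is treated as unexamined.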

Project Workflow

● Implement a template (calling it “Log template” for now) that records the click data in the format specified earlier. It will sit between the query template and the web page that a search result links to.
● Generate the “Query” file (used by xapian-letor to generate its training file) by implementing an automatic mechanism that parses the required data from the click logs. (Language to use: Python)
● Implement the DBN model and obtain a query relevance dataset with editorial relevance judgements to train the model. (Language to use: C++)
● Use the trained model to predict relevance judgements from the Omega click data.
● Finally, implement a mechanism to automatically generate the “Qrel” file (used by xapian-letor) from the predicted relevance judgements. (Language to use: Python)
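The last step could look roughly like the sketch below, which turns predicted relevance probabilities into integer grades and formats one line per (query, document) pair. The line layout "query-id Q0 document-id grade" is an assumption modelled on TREC-style qrel files and should be checked against the xapian-letor documentation. The three-level grading and the names to_grade and qrel_lines are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def to_grade(relevance: float, levels: int = 3) -> int:
    """Map a probability in [0, 1] to an integer grade in 0..levels-1."""
    if not 0.0 <= relevance <= 1.0:
        raise ValueError("relevance must lie in [0, 1]")
    # Equal-width buckets; a relevance of exactly 1.0 falls in the top grade.
    return min(int(relevance * levels), levels - 1)


def qrel_lines(relevance: Dict[Tuple[str, str], float], levels: int = 3) -> List[str]:
    """Format predictions as 'qid Q0 docid grade' lines (assumed layout)."""
    lines = []
    for (qid, docid), value in sorted(relevance.items()):
        lines.append(f"{qid} Q0 {docid} {to_grade(value, levels)}")
    return lines


if __name__ == "__main__":
    predictions = {("q1", "d2"): 0.2, ("q1", "d1"): 1.0, ("q2", "d1"): 0.5}
    print("\n".join(qrel_lines(predictions)))
```

Sorting the keys keeps the output deterministic, which makes regenerated files easy to diff as new click data arrives.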

Project Timeline

Please click here to see project timeline.

Last modified on 05/20/17 12:45:37