
Project Work Product

In a nutshell, the project had the following two broad parts at its core:

  • the first was logging search data from Omega and post-processing it into the required format;
  • the second was mining relevance judgements from the logged search data, which required implementing the proposed Dynamic Bayesian Network click model.

Note: To get into the details of the project (e.g. project plan, workflow and timeline), please go to the project's main page.

I have been working on my fork of the Xapian project on GitHub: ivmarkp/xapian

Merged

The following lists the work that has been merged in xapian master:

  • Clicklog template that records click data in the format specified in the project plan; it sits between the query template and the web page a search result links to #161.
  • Postprocess script that generates, from the click log data, the final clickstream log file used to train click models and the query file for Xapian Letor, along with its documentation and tests #161.
  • New $hash{} OmegaScript command (as a part of enabling clickstream logging in Omega) #160

Link to the commits: https://github.com/xapian/xapian/commits/master?author=ivmarkp

In progress

The following lists the work that is currently in progress and will be merged in xapian master soon:

  • Implementation of Simplified DBN (SDBN) click model with its documentation and tests #170.
  • generate-qrel-file command that generates the qrel file needed to prepare the training file for letor #170.

Future Work

The following are some ideas for future work on this project:

  • Implement MLE algorithm for the DBN click model.
  • Implement log-likelihood evaluation method.
  • Implement more click models! (e.g. DCM and UBM).
  • End-to-end use of letor with Omega (training the letor module on a training file obtained from Omega click data, and using the letor module to display relevant search results at the top of the SERP).

Steps for using letor module with Omega

You currently need the latest Xapian git master installed on your system - if not, see https://xapian.org/bleeding for how to check out the code from git and build it. If you've not used Omega before, see https://xapian.org/docs/omega/quickstart.html for a quick introduction.

First, we need to enable clickstream logging in Omega so that search and click logs are generated whenever a search is performed. To do that, run the Omega CGI with the activatelog template by setting the FMT parameter to activatelog in the CGI URL in your browser's address bar.

The clicklog and activatelog templates were added in this project, along with a few changes to the query template, to enable clickstream logging.

If logging was enabled successfully, you should see two extra log files in the directory /var/log/omega: search.log and clicks.log. We need these files to generate the final log file.

Omega provides a utility command, postprocess, that processes these log files to generate the final log file and the query file. The final log file is used to train the click model, as we'll see later, and the query file is used by letor to generate its training file.

postprocess is installed with Omega and is straightforward to use. From the directory where postprocess is installed, pass it the paths to search.log and clicks.log, followed by the paths to save final.log and query.txt to:

postprocess /path/to/search.log /path/to/clicks.log /path/to/final.log /path/to/query.txt

You can also run postprocess -h to see help text about using the command.

Note: You can also call the two functions provided by postprocess, generate_combined_log and generate_query_file, individually from your own program. Please refer to its documentation for more details.

Now that we have the final.log file, we are ready to train the click model, which provides relevance judgments, and to generate a qrel file from them. Omega provides the Simplified DBN click model, accessible through its API or via the utility command generate-qrel-file. To use generate-qrel-file, you only need to specify the path to the final.log file and the path to save the qrel file to:

generate-qrel-file /path/to/final.log /path/to/qrel/file

You can also run ./generate-qrel-file --help to see help text about using the command.
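To give a feel for what a Simplified DBN click model computes, here is a minimal Python sketch of the standard SDBN estimate. It is an illustration under stated assumptions, not the implementation in #170: the in-memory session tuples are a hypothetical structure (not the final.log format), the function name is made up, and no smoothing or priors are applied. SDBN assumes the user examines every result down to the last click, so each session only counts results at or above the last clicked position.

```python
from collections import defaultdict


def sdbn_relevance(sessions):
    """Estimate relevance per (query, doc) pair with Simplified DBN counts.

    `sessions` is a hypothetical structure for this sketch: an iterable of
    (query, ranked_doc_ids, clicked_doc_ids) tuples.
    """
    examined = defaultdict(int)   # times a doc was at or above the last click
    clicked = defaultdict(int)    # times a doc was clicked
    satisfied = defaultdict(int)  # times a doc was the last click in a session

    for query, docs, clicks in sessions:
        clicks = set(clicks)
        click_ranks = [i for i, doc in enumerate(docs) if doc in clicks]
        if not click_ranks:
            continue  # sessions without clicks carry no signal in SDBN
        last = click_ranks[-1]
        for rank in range(last + 1):
            key = (query, docs[rank])
            examined[key] += 1
            if docs[rank] in clicks:
                clicked[key] += 1
                if rank == last:
                    satisfied[key] += 1

    relevance = {}
    for key, n_examined in examined.items():
        attractiveness = clicked[key] / n_examined
        satisfaction = satisfied[key] / clicked[key] if clicked[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance


if __name__ == "__main__":
    demo = [
        ("q", ["d1", "d2", "d3"], ["d1"]),
        ("q", ["d1", "d2", "d3"], ["d1", "d3"]),
        ("q", ["d1", "d2", "d3"], ["d2"]),
    ]
    for key, score in sorted(sdbn_relevance(demo).items()):
        print(key, score)
```

In the demo sessions, d3 is clicked every time it is examined and is always the last click, so it scores highest; d1 is clicked often but is the last click only half of the time, so its estimate is lower.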

Note: The click model API and generate-qrel-file are in the last stages of development. They will be available for use only after #170 is merged, so I've refrained from including any documentation links for them. However, you should be able to find the documentation in #170 itself.

At this point, we have both the query file and the qrel file, so we can now use the letor module to generate its training file.

To generate the letor training file, pass the xapian-prepare-trainingfile command the database to search, the paths to the query file and the qrel file, and the path to save the training file to:

xapian-prepare-trainingfile --db=DIR --msize=MSIZE /path/to/query/file /path/to/qrel/file /path/to/save/training/file

where DIR = path/to/database/to/search and MSIZE = maximum number of matches to return. This generates a training file at the path given as the last argument.

Next, we train the letor model using the xapian-train command:

xapian-train --db=DIR /path/to/training/file MODEL_METADATA_KEY

where DIR = path/to/database/to/search and MODEL_METADATA_KEY = metadata key to save the model under.

Now, to see how the documents are re-ranked (that is, assigned new relevance scores by the trained letor model, compared to what was in the training file), we use the xapian-rank command, which displays documents sorted by their new scores after re-ranking by the letor model:

xapian-rank --db=DIR --msize=MSIZE --stemmer=LANG --prefix=PFX:TERMPFX --boolean-prefix=PFX:TERMPFX MODEL_METADATA_KEY QUERY

where DIR = path/to/database/to/search, MSIZE = maximum number of matches to return, LANG = stemming language (the default is English; pass 'none' to disable stemming), PFX:TERMPFX = a prefix and a boolean prefix for the --prefix and --boolean-prefix arguments respectively, MODEL_METADATA_KEY = metadata key pointing to a saved letor model, and QUERY = any search query.
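To see the whole walkthrough in one place, the following Python sketch assembles the five command lines described above, in order. It only builds and prints them (a dry run); nothing is executed. The function name, the workdir layout and the file names qrel.txt and training.txt are assumptions made for this illustration, and generate-qrel-file will only exist once #170 is merged.

```python
import shlex


def letor_pipeline(db, msize, search_log, clicks_log, workdir, model_key, query):
    """Return the argv list for each step, in the order described above.

    The file names under `workdir` are hypothetical choices for this sketch.
    """
    final_log = f"{workdir}/final.log"
    query_file = f"{workdir}/query.txt"
    qrel_file = f"{workdir}/qrel.txt"
    training_file = f"{workdir}/training.txt"
    return [
        # 1. Combine search.log and clicks.log into final.log and query.txt.
        ["postprocess", search_log, clicks_log, final_log, query_file],
        # 2. Train the click model and write the qrel file.
        ["generate-qrel-file", final_log, qrel_file],
        # 3. Build the letor training file from the query and qrel files.
        ["xapian-prepare-trainingfile", f"--db={db}", f"--msize={msize}",
         query_file, qrel_file, training_file],
        # 4. Train the letor model and store it under a metadata key.
        ["xapian-train", f"--db={db}", training_file, model_key],
        # 5. Re-rank the results for a query with the trained model.
        ["xapian-rank", f"--db={db}", f"--msize={msize}", model_key, query],
    ]


if __name__ == "__main__":
    steps = letor_pipeline(
        db="/path/to/database",
        msize=10,
        search_log="/var/log/omega/search.log",
        clicks_log="/var/log/omega/clicks.log",
        workdir="/tmp/letor",
        model_key="letor_model",
        query="example query",
    )
    for argv in steps:
        # Dry run: print each command with shell quoting applied.
        print(shlex.join(argv))
```

Keeping each command as an argv list rather than a single string avoids quoting problems with queries that contain spaces. To actually run a step, each list could be passed to subprocess.run(argv, check=True), provided the tools are installed and on your PATH.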

Note: For more details about the letor API, please refer to its documentation.

If something I described above doesn't work for you, just let us know on the #xapian IRC channel so you can get help from a wider audience.

Last modified on 02/04/19 04:49:20