Project Work Product
In a nutshell, the project had the following two broad parts at its core:
- the first was logging the search data from Omega and post-processing it into the required format,
- and the second was mining relevance judgements from the logged search data, which required implementing the proposed Dynamic Bayesian Network click model.
Note: To get into the details of the project (e.g. project plan, workflow and timeline), please go to the project's main page.
I have been working on my fork of the Xapian project on Github: ivmarkp/xapian
The following lists the work that has been merged in xapian master:
- clicklog template, which sits between the query template and the web page a search result links to, and records the click data in the format specified in the project plan #161.
- postprocess script, which generates the final clickstream log file used to train click models and the query file for Xapian Letor from the click log data, along with its documentation and tests #161.
- OmegaScript command (as a part of enabling clickstream logging in Omega) #160.
Link to the commits: https://github.com/xapian/xapian/commits/master?author=ivmarkp
The following lists the work that is currently in progress and will be merged in xapian master soon:
- Implementation of Simplified DBN (SDBN) click model with its documentation and tests #170.
- generate-qrel-file command that generates the qrel file needed to prepare the training file for letor #170.
The following are some ideas for future work on this project:
- Implement MLE algorithm for the DBN click model.
- Implement log-likelihood evaluation method.
- Implement more click models! (e.g. DCM and UBM).
- End-to-end use of letor with omega (training the letor module on training file obtained from Omega click data and using the letor module for displaying relevant search results on top of SERP).
Steps for using letor module with Omega
You currently need the latest Xapian git master installed on your system - if not, see https://xapian.org/bleeding for how to check out the code from git and build it. If you've not used Omega before, see https://xapian.org/docs/omega/quickstart.html for a quick introduction.
First, we need to enable clickstream logging in Omega so that search and click logs are generated whenever searches are performed on it. To do that, run the Omega CGI with the activatelog template by changing the FMT parameter value to activatelog in the CGI URL in your browser's address bar. clicklog was added in this project, along with the activatelog template and a few changes to the query template, to enable clickstream logging.
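As a small illustration of switching the FMT parameter, here is a sketch that rewrites an Omega CGI URL so that it uses the activatelog template. The host, CGI path and the helper name with_template are assumptions made up for this example; only the FMT parameter and the activatelog template name come from the steps above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_template(url, template):
    """Return `url` with its FMT query parameter set to `template`.

    Any existing FMT value is dropped; all other parameters are kept
    in their original order.
    """
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "FMT"
    ]
    params.append(("FMT", template))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical local Omega URL, used purely for illustration.
example_url = "http://localhost/cgi-bin/omega?DB=default&P=xapian&FMT=query"
print(with_template(example_url, "activatelog"))
```

Editing the URL by hand in the address bar achieves exactly the same thing; the helper is only useful if you want to script test searches.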
If logging was enabled successfully, you should see two extra log files in your Omega log directory: search.log and clicks.log. We are going to need those files now to generate the final log file.
Omega provides a utility command, postprocess, for post-processing the log files to generate the final log file and the query file. The final log file is used for training the click model, as we'll see later, and the query file is used by letor to generate its training file.
postprocess is readily available with the Omega installation and using it is straightforward. Just specify the paths to search.log and clicks.log, followed by the paths to save final.log and query.txt to, in the directory where postprocess is installed:
postprocess /path/to/search.log /path/to/clicks.log /path/to/final.log /path/to/query.txt
You can also run postprocess -h to see help text about using the command.
Note: You can also call the two functions provided in postprocess (generate_query_file is one of them) individually from your own program. Please refer to its documentation for more details.
Now that we have the final.log file, we are ready to train the click model to obtain relevance judgments and generate a qrel file from them. Omega provides the Simplified DBN click model, which is accessible through its API or via the utility command generate-qrel-file. To use generate-qrel-file, you only need to specify the path to the final.log file and the path to save the qrel file to:
generate-qrel-file /path/to/final.log /path/to/qrel/file
You can also run ./generate-qrel-file --help to see help text about using the command.
Note: The click model API and generate-qrel-file are in the last stages of development. They will be available for use only after #170 is merged, so I've refrained from including any documentation links for them. However, you should be able to find the documentation in #170 itself.
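To give an idea of what the Simplified DBN click model computes, here is a minimal sketch of the usual SDBN counting estimates. It is not the implementation from #170: the session format, the function name estimate_sdbn and the sample data are assumptions for illustration, and sessions without any click are assumed to have been examined in full.

```python
from collections import defaultdict


def estimate_sdbn(sessions):
    """Estimate SDBN relevance for each (query_id, doc_id) pair.

    `sessions` is an iterable of (query_id, ranked_doc_ids, clicked_doc_ids).
    Attractiveness = clicks / times the document was examined.
    Satisfaction   = times the document was the last click / clicks.
    Relevance      = attractiveness * satisfaction.
    """
    attr_num = defaultdict(int)
    attr_den = defaultdict(int)
    sat_num = defaultdict(int)
    sat_den = defaultdict(int)

    for query_id, docs, clicked in sessions:
        clicked = set(clicked)
        click_ranks = [rank for rank, doc in enumerate(docs) if doc in clicked]
        # SDBN treats everything up to the last click as examined.
        # Assumption: with no click, the whole list counts as examined.
        last = click_ranks[-1] if click_ranks else len(docs) - 1
        for rank, doc in enumerate(docs[: last + 1]):
            key = (query_id, doc)
            attr_den[key] += 1
            if doc in clicked:
                attr_num[key] += 1
                sat_den[key] += 1
                if rank == last:
                    sat_num[key] += 1

    relevance = {}
    for key, examined in attr_den.items():
        attractiveness = attr_num[key] / examined
        satisfaction = sat_num[key] / sat_den[key] if sat_den[key] else 0.0
        relevance[key] = attractiveness * satisfaction
    return relevance


# Hypothetical sessions for one query: (query id, ranking shown, clicked docs).
sample_sessions = [
    ("q1", ["d1", "d2", "d3"], ["d2"]),
    ("q1", ["d1", "d2", "d3"], ["d1", "d2"]),
    ("q1", ["d1", "d2", "d3"], ["d1"]),
]
for (query_id, doc_id), score in sorted(estimate_sdbn(sample_sessions).items()):
    print(query_id, doc_id, round(score, 3))
```

Documents that were never examined (d3 in the sample) get no estimate at all. A real implementation would typically smooth the counts as well, which this sketch leaves out.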
At this point, we have both the query file and the qrel file, so we can now use the letor module to generate its training file.
To generate the letor training file, just specify the paths to the query file and the qrel file, the index, and the path to save the training file to:
xapian-prepare-trainingfile --db=DIR --msize=MSIZE /path/to/query/file /path/to/qrel/file /path/to/save/training/file
where DIR = path/to/database/to/search and MSIZE = maximum number of matches to return. It will generate a training file at the path provided.
Next, we train the letor model using:
xapian-train --db=DIR /path/to/training/file MODEL_METADATA_KEY
where DIR = path/to/database/to/search and MODEL_METADATA_KEY is the metadata key to save the model to.
Now, to see how the documents are re-ranked (assigned new relevance scores based on the trained letor model compared to what was in the training file), we will use the xapian-rank command, which displays documents sorted by their new scores after re-ranking by the letor model:
xapian-rank --db=DIR --msize=MSIZE --stemmer=LANG --prefix=PFX:TERMPFX --boolean-prefix=PFX:TERMPFX MODEL_METADATA_KEY QUERY
where DIR = path/to/database/to/search, MSIZE = maximum number of matches to return, LANG = stemming language (the default is English, or pass 'none' to disable stemming), PFX:TERMPFX = a prefix and a boolean prefix for the --prefix and --boolean-prefix arguments respectively, MODEL_METADATA_KEY = metadata key pointing to a saved letor model, and QUERY = any search query.
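To tie the steps together, here is a sketch that assembles the five commands described above in order, so the whole chain can be reviewed or scripted. The function name build_pipeline and the intermediate file names (qrel.txt, training.txt) are assumptions for this example. The --stemmer, --prefix and --boolean-prefix options of xapian-rank are left out on the assumption that they are optional, and all tools are assumed to be on your PATH.

```python
import shlex
from pathlib import PurePosixPath


def build_pipeline(log_dir, work_dir, db, msize, model_key, query):
    """Return the argument lists for each step, in the order they should run."""
    log_dir = PurePosixPath(log_dir)
    work_dir = PurePosixPath(work_dir)
    final_log = str(work_dir / "final.log")
    query_file = str(work_dir / "query.txt")
    qrel_file = str(work_dir / "qrel.txt")          # hypothetical file name
    training_file = str(work_dir / "training.txt")  # hypothetical file name
    return [
        # 1. Merge the search and click logs into final.log and query.txt.
        ["postprocess", str(log_dir / "search.log"),
         str(log_dir / "clicks.log"), final_log, query_file],
        # 2. Train the click model and write relevance judgements.
        ["generate-qrel-file", final_log, qrel_file],
        # 3. Build the letor training file from the query and qrel files.
        ["xapian-prepare-trainingfile", f"--db={db}", f"--msize={msize}",
         query_file, qrel_file, training_file],
        # 4. Train the letor model and store it under a metadata key.
        ["xapian-train", f"--db={db}", training_file, model_key],
        # 5. Re-rank the results for a query with the trained model.
        ["xapian-rank", f"--db={db}", f"--msize={msize}", model_key, query],
    ]


# Hypothetical paths and values; print the commands rather than run them.
for argv in build_pipeline("/var/log/omega", "/tmp/letor", "/srv/xapian/db",
                           10, "sdbn_model", "xapian search"):
    print(shlex.join(argv))
```

Printing first makes it easy to check the paths. Once they look right, each list can be passed to subprocess.run(argv, check=True) so that the chain stops at the first failing step.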
Note: For more details about the letor API, please refer to its documentation.
If something described above doesn't work for you, just let us know on the #xapian IRC channel so you can get help from a wider audience.