Project Work Product
In a nutshell, the project had two broad parts at its core:
- the first was logging search data from Omega and post-processing it into the required format;
- the second was mining relevance judgements from the logged search data, which required implementing the proposed Dynamic Bayesian Network click model.
Note: To get into the details of the project (e.g. project plan, workflow and timeline), please go to the project's main page.
I have been working on my fork of the Xapian project on GitHub: ivmarkp/xapian
Merged
The following lists the work that has been merged in xapian master:
- Clicklog template to record the click data in the format specified in the project plan, which acts between the query template and the web page linked from a search result #161.
- Postprocess script to generate the final clickstream log file used to train click models and the query file for Xapian Letor from click log data, along with its documentation and tests #161.
- New $hash{} OmegaScript command (as part of enabling clickstream logging in Omega) #160.
Link to the commits: https://github.com/xapian/xapian/commits/master?author=ivmarkp
In progress
The following lists the work that is currently in progress and will be merged in xapian master soon:
- Implementation of Simplified DBN (SDBN) click model with its documentation and tests #170.
- generate-qrel-file command that generates the qrel file needed to prepare the training file for letor #170.
Future Work
The following are some ideas for future work on this project:
- Implement MLE algorithm for the DBN click model.
- Implement log-likelihood evaluation method.
- Implement more click models! (e.g. DCM and UBM).
- End-to-end use of letor with omega (training the letor module on training file obtained from Omega click data and using the letor module for displaying relevant search results on top of SERP).
Steps for using letor module with Omega
You currently need the latest Xapian git master installed on your system - if not, see https://xapian.org/bleeding for how to check out the code from git and build it. If you've not used Omega before, see https://xapian.org/docs/omega/quickstart.html for a quick introduction.
First, we need to enable clickstream logging in Omega so that search and click logs are generated whenever searches are performed on it. To do that, run the Omega CGI with the activatelog template by changing the FMT parameter value to activatelog in the CGI URL in your browser's address bar.
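As an illustration, the following minimal Python sketch builds such a URL. The host and CGI path are hypothetical and depend on how Omega is installed on your web server; FMT is the parameter described above and P is Omega's query parameter.

```python
from urllib.parse import urlencode

# Hypothetical location of the Omega CGI; adjust to your web server setup.
OMEGA_CGI = "http://localhost/cgi-bin/omega"

# P carries the query text, FMT selects the template to render.
params = {"P": "xapian letor", "FMT": "activatelog"}
url = OMEGA_CGI + "?" + urlencode(params)
print(url)
```

Opening the printed URL in a browser should have the same effect as editing the FMT value by hand in the address bar.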
The clicklog template was added in this project, along with the activatelog template and a few changes to the query template, to enable clickstream logging.
If logging was enabled successfully, you should see two extra log files in the directory /var/log/omega, i.e. search.log and clicks.log. We are going to need those files now to generate the final log file.
Omega provides a utility command, postprocess, for post-processing the log files to generate the final log file and the query file. The final log file is used for training the click model, as we'll see later, and the query file is used by letor to generate its training file.
postprocess is readily available with an Omega installation and using it is straightforward. From the directory where postprocess is installed, just specify the path to the search.log file, the path to the clicks.log file, and the paths where final.log and query.txt should be saved:
postprocess /path/to/search.log /path/to/clicks.log /path/to/final.log /path/to/query.txt
You can also run postprocess -h to see help text about using the command.
Note: You can also call the two functions generate_combined_log and generate_query_file provided in postprocess individually from your own program. Please refer to its documentation for more details.
Now that we have the final.log file, we are ready to train the click model to provide us with relevance judgments and generate the qrel file from them. Omega provides the Simplified DBN click model, which is accessible through its API or via the utility command generate-qrel-file. To use generate-qrel-file you only need to specify the path to the final.log file and the path to save the qrel file to:
generate-qrel-file /path/to/final.log /path/qrel/file
You can also run ./generate-qrel-file --help to see help text about using the command.
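To make the idea behind the Simplified DBN model more concrete, here is a minimal Python sketch of the counting estimate commonly used for SDBN. It is not the code from #170 and it does not read final.log: the in-memory session tuples, the estimate_sdbn name and the treatment of sessions without clicks are assumptions made for illustration, and smoothing priors are left out.

```python
from collections import defaultdict

def estimate_sdbn(sessions):
    """Estimate SDBN relevance per (query, doc) pair.

    sessions: iterable of (query, ranked_doc_ids, clicked_doc_ids).
    This input shape is hypothetical, not the final.log format.
    """
    examined = defaultdict(int)   # shown at or above the last click
    clicked = defaultdict(int)    # clicked while examined
    satisfied = defaultdict(int)  # was the last click of a session

    for query, docs, clicks in sessions:
        click_set = set(clicks)
        clicked_ranks = [i for i, d in enumerate(docs) if d in click_set]
        # Assumption: with no clicks, every shown document was examined.
        last = clicked_ranks[-1] if clicked_ranks else len(docs) - 1
        for i, doc in enumerate(docs[:last + 1]):
            key = (query, doc)
            examined[key] += 1
            if doc in click_set:
                clicked[key] += 1
                if i == last:
                    satisfied[key] += 1

    relevance = {}
    for key, n_examined in examined.items():
        n_clicked = clicked.get(key, 0)
        attractiveness = n_clicked / n_examined
        satisfaction = satisfied.get(key, 0) / n_clicked if n_clicked else 0.0
        # SDBN relevance is attractiveness times satisfaction.
        relevance[key] = attractiveness * satisfaction
    return relevance

# Three hypothetical sessions for the same query.
sessions = [
    ("q1", ["d1", "d2", "d3"], ["d1", "d2"]),
    ("q1", ["d1", "d2", "d3"], ["d2"]),
    ("q1", ["d1", "d2", "d3"], ["d1"]),
]
relevance = estimate_sdbn(sessions)
```

Documents ranked below the last click (d3 here) are never counted as examined, so they receive no estimate at all rather than a score of zero.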
Note: The click model API and generate-qrel-file are in the last stages of development. They will be available for use only after #170 is merged, so I've refrained from including any documentation links for them. However, you should be able to find the documentation in #170 itself.
At this point, we have both the query file and the qrel file so we can now use letor module to generate its training file.
To generate the letor training file, pass the paths to the query file and the qrel file, the index, and the path to save the training file to the xapian-prepare-trainingfile command:
xapian-prepare-trainingfile --db=DIR --msize=MSIZE /path/to/query/file path/to/qrel/file /path/to/save/training/file
where DIR = path/to/database/to/search and MSIZE = maximum number of matches to return. It will generate a training file at the <trainingfile> path provided.
Next, we train the letor model using the xapian-train command:
xapian-train --db=DIR /path/to/training/file MODEL_METADATA_KEY
where DIR = path/to/database/to/search and MODEL_METADATA_KEY is the metadata key to save the model to.
Now, to see how the documents are re-ranked (assigned new relevance scores based on the trained letor model compared to what was in the training file), we will use the xapian-rank command, which displays documents sorted by their new scores after re-ranking by the letor model:
xapian-rank --db=DIR --msize=MSIZE --stemmer=LANG --prefix=PFX:TERMPFX --boolean-prefix=PFX:TERMPFX MODEL_METADATA_KEY QUERY
where DIR = path/to/database/to/search, MSIZE = maximum number of matches to return, LANG = stemming language (the default is English, or pass 'none' to disable stemming), PFX:TERMPFX = a prefix and a boolean prefix for the --prefix and --boolean-prefix arguments respectively, MODEL_METADATA_KEY = metadata key pointing to a saved letor model, and QUERY = any search query.
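To recap the order of the batch steps, the following Python sketch assembles the four command lines used above, from postprocess through xapian-train. It only builds and prints them and does not run anything. The letor_pipeline helper and all paths and values are hypothetical, and generate-qrel-file is only available once #170 is merged.

```python
import shlex

def letor_pipeline(search_log, clicks_log, final_log, query_file,
                   qrel_file, training_file, db, msize, model_key):
    """Return the batch commands from this walkthrough as argv lists."""
    return [
        # 1. Combine the Omega logs into final.log and the letor query file.
        ["postprocess", search_log, clicks_log, final_log, query_file],
        # 2. Train the click model and write the qrel file.
        ["generate-qrel-file", final_log, qrel_file],
        # 3. Build the letor training file from the query and qrel files.
        ["xapian-prepare-trainingfile", f"--db={db}", f"--msize={msize}",
         query_file, qrel_file, training_file],
        # 4. Train the letor model and store it under a metadata key.
        ["xapian-train", f"--db={db}", training_file, model_key],
    ]

# Hypothetical paths and values, for illustration only.
commands = letor_pipeline(
    search_log="/var/log/omega/search.log",
    clicks_log="/var/log/omega/clicks.log",
    final_log="/tmp/final.log",
    query_file="/tmp/query.txt",
    qrel_file="/tmp/qrel.txt",
    training_file="/tmp/training.txt",
    db="/path/to/database",
    msize=10,
    model_key="letor_model",
)
for argv in commands:
    print(shlex.join(argv))
```

Each list could be passed to subprocess.run(argv, check=True) to execute the step; xapian-rank is left out because it takes an interactive query rather than a file.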
Note: For more details about the letor API, please refer to its documentation here.
If something I described above doesn't work for you, just let us know on the #xapian IRC channel so you can get help from a wider audience.