In current IR systems, most users probe into the top few results and neglect the rest. In order to increase user’s satisfaction, the presented result set should not only be relevant to the search topic, but should also present a variety of perspectives, that is, the results should be different from one another, especially for ambiguous queries. The effectiveness of web search and the satisfaction of users can be enhanced through providing various results of a search query in a certain order of relevance and concern, known as diversification. This project will allow diversification of search results without requiring external data, such as query logs, and equip the developer to provide diversified and relevant results to users out-of-the-box with Xapian.
The proposed methods are taken from (2), which include the following:
It is an explicit method that uses document vectors and their relevance scores for producing diversified search results. It consists of using the Greedy Local Search algorithm (details on page 6 of (2) with objective function based on MPT (details on page 13 of (2). According to the paper, average latency of GLS-MPT is ~800ms (table V of (2) per query, which involves preprocessing (storing pairwise distance of top N documents, N = 100) and the actual diversification which finally returns top-k (k = 20) diversified results. Note: k = 20 and N = 100 are suitable, as usually, users don't page too far and even if they do, it's suitable to let the results tail off in relevance.
The second stage involves optimising GLS-MPT to reduce the latency to a desirable limit. It involves clustering the top N documents and storing the distance between each document in N and each cluster (there are k clusters). They term this method C2-GLS (details on page 7 and 8 of (2)). The clustering method used is LC (3). The average latency reduces to ~20-40 ms (table V of (2)), while maintaining considerable effectiveness. This involves implementing LC clustering and modifying the code of GLS-MPT. LC clustering can be added as an algorithm in xapian-core/cluster. Note: K-Means clustering could be used for clustering here, but according to (2) (table V), LC clustering is faster with a speedup of around x7 (~20 ms vs ~138ms) and provides almost equal quality of rankings.
For evaluation of the above methods, the ClueWeb09 Category B corpus with queries from TREC 2009/2010 diversification task exist as a standard dataset for evaluation of diversification. The evaluation metrics include:
Please click here to see project timeline.
(1) Search Result Diversification Santos et al. 2015 (Survey Paper)
(2) Scalable and Efficient Web Search Result Diversification Naini et al. 2016
(3) Modelling efficient novelty-based search result diversification in metric spaces Gil-Costa et al. 2013
(4) Novelty and Diversity in Information Retrieval Evaluation Clarke et al. 2008
(5) Intent-based diversification of web search results: Metrics and algorithms. Chapelle et al. 2011b