GSoC2018/Diversification: ProjectPlan

File ProjectPlan, 3.3 KB (added by icebyte, 6 years ago)
Line 
1= Details =
2
3In current IR systems, most users probe into the top few results and neglect the rest. In order to increase user’s satisfaction, the presented result set should not only be relevant to the search topic, but should also present a variety of perspectives, that is, the results should be different from one another, especially for ambiguous queries. The effectiveness of web search and the satisfaction of users can be enhanced through providing various results of a search query in a certain order of relevance and concern, known as diversification. This project will allow diversification of search results without requiring external data, such as query logs, and equip the developer to provide diversified and relevant results to users out-of-the-box with Xapian.
4
5The proposed methods are taken from [#ref1 [2]], which include the following:
6
7● GLS-MPT:
8It is an explicit method that uses document vectors and their relevance scores for producing diversified search results. It consists of using the Greedy Local Search algorithm (details on page 6 of [2]) with objective function based on MPT (details on page 13 of [2]). According to the paper, average latency of GLS-MPT is ~800ms (table V of [2]) per query, which involves preprocessing (storing pairwise distance of top N documents, N = 100) and the actual diversification which finally returns top-k (k = 20) diversified results. Note: k = 20 and N = 100 are suitable, as usually, users don't page too far and even if they do, it's suitable to let the results tail off in relevance. [[BR]]
9
10● C2-GLS
11The second stage involves optimising GLS-MPT to reduce the latency to a desirable limit. It involves clustering the top N documents and storing the distance between each document in N and each cluster (there are k clusters). They term this method C2-GLS (details on page 7 and 8 of [2]). The clustering method used is LC [3]. The average latency reduces to ~20-40 ms (table V of [2]), while maintaining considerable effectiveness. This involves implementing LC clustering and modifying the code of GLS-MPT. LC clustering can be added as an algorithm in xapian-core/cluster. Note: K-Means clustering could be used for clustering here, but according to [2] (table V), LC clustering is faster with a speedup of around x7 (~20 ms vs ~138ms) and provides almost equal quality of rankings. [[BR]]
12
13== Evaluation ==
14For evaluation of the above methods, the ClueWeb09 [https://lemurproject.org/clueweb09/ Category B] corpus with queries from TREC [http://trec.nist.gov/data/web/09/wt09.topics.full.xml 2009]/[http://trec.nist.gov/data/web/10/wt2010-topics.xml 2010] diversification task exist as a standard dataset for evaluation of diversification. The evaluation metrics include:
15
16● - nDCG [#ref4 [4]] [[BR]]
17● ERR-IA [#ref5 [5]] (stretch goal)[[BR]]
18
19= Project Timeline =
20
21Please click [wiki:"../ProjectTimeline" here] to see project timeline.
22
23[=#ref1 Search Result Diversification Santos et al. 2015 (Survey Paper)]
24[=#ref2 Scalable and Efficient Web Search Result Diversification Naini et al. 2016]
25[=#ref3 Modelling efficient novelty-based search result diversification in metric spaces Gil-Costa et al. 2013]
26[=#ref4 Novelty and Diversity in Information Retrieval Evaluation Clarke et al. 2008]
27[=#ref5 Intent-based diversification of web search results: Metrics and algorithms. Chapelle et al. 2011b]