GSoC2018/Diversification: WorkProduct

File WorkProduct, 2.4 KB (added by icebyte, 6 years ago)
= Clustering of search results =

WORK PRODUCT

The project I worked on this year adds clustering functionality to the Xapian API. I had worked on this project in GSoC 2016, and we were able to merge an initial version of the API this year. The API clusters search results using spherical KMeans clustering. Currently, cosine distance is the only metric used to calculate similarity between documents; more can be added in the future.
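To illustrate the idea behind spherical KMeans with cosine distance, here is a minimal pure-Python sketch. It is not the Xapian implementation: the function names, the random initialisation, and the stopping rule are assumptions made for illustration only.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors: 1 - dot(a, b)."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def spherical_kmeans(docs, k, max_iters=20, seed=0):
    """Cluster term-weight vectors; returns one cluster index per document."""
    points = [normalize(d) for d in docs]
    rng = random.Random(seed)
    # Assumed initialisation: pick k distinct documents as starting centroids.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iters):
        # Assignment step: each document joins its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Update step: the new centroid is the mean of its members,
        # projected back onto the unit sphere.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels
```

Calling `spherical_kmeans(vectors, 2)` on a list of term-weight vectors returns a list of cluster indices, one per input vector. Normalising both the documents and the centroids is what makes the method "spherical": only the direction of a vector matters, not its length.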

I have been using the BBC news datasets available [http://mlg.ucd.ie/datasets/bbc.html here] for testing. On this dataset, clustering 1000 documents takes 2.5 - 3 seconds without any dimensionality reduction and 1.5 - 2 seconds with dimensionality reduction. When the resulting ClusterSet was passed to ClusterEvaluation, which currently implements only the silhouette coefficient, it returned an average silhouette coefficient of 0.7.
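For readers unfamiliar with the silhouette coefficient, the sketch below shows how an average value can be computed with cosine distance. It is a standalone Python illustration, not the code of ClusterEvaluation; the function names and the treatment of singleton clusters (scored as 0) are assumptions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_silhouette(points, labels, dist=cosine_distance):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: contributes 0 (assumed convention)
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that documents have been assigned to the wrong cluster.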

The main parts of this project are:

* Merge in work on the clustering API from last year
* Add stemming to reduce the dimensionality of document vectors
* Add stopword removal
* Implement KMeans with triangle inequality optimization
* Add a Cluster Evaluation class to evaluate clusters
* Document all work done
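The triangle inequality optimization mentioned in the list above can be sketched as follows. This is a simplified Python illustration of the general idea (skip a distance computation when the centroid-to-centroid distance already proves the candidate cannot be closer), not the code written for Xapian. It assumes Euclidean distance, which is a true metric; for unit-length vectors it ranks neighbours in the same order as cosine distance. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    Returns (labels, skipped), where skipped counts the point-to-centroid
    distance computations avoided thanks to the triangle inequality.
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[euclidean(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_dist = euclidean(p, centroids[0])
        for c in range(1, k):
            # If d(best, c) >= 2 * d(p, best), the triangle inequality gives
            # d(p, c) >= d(p, best), so centroid c cannot be strictly closer.
            if cc[best][c] >= 2.0 * best_dist:
                skipped += 1
                continue
            d = euclidean(p, centroids[c])
            if d < best_dist:
                best, best_dist = c, d
        labels.append(best)
    return labels, skipped
```

The saving grows with the number of clusters, because the k x k centroid distances are computed once per pass while the pruned comparisons would otherwise be repeated for every document.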

== Merged ==

A clustering API which supports KMeans clustering has been merged into master this year. The following components have been merged:

* Initial Clustering API (Link to merged commit [https://github.com/xapian/xapian/commit/4848901bd2d6c2c134b6fe3d3237b1f540650581 here])
* Stopword Removal (Link to merged commit [https://github.com/xapian/xapian/commit/d4a89263f858f202a676dfaedc630cc43e8beb24 here])
* Stemming
* Move Round Robin Clusterer to testsuite
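To show why stopword removal reduces the dimensionality of document vectors, here is a small Python sketch. It does not use the Xapian API: the tokenisation (lowercase, whitespace split) and the tiny stopword list are assumptions chosen only for illustration.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and", "in", "to", "is"})


def term_vector(text, stopwords=frozenset()):
    """Map each term to its frequency, dropping any term in `stopwords`."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in stopwords)


def vocabulary(vectors):
    """Sorted list of all distinct terms, i.e. the vector space dimensions."""
    return sorted(set().union(*vectors))
```

For the two documents `"The price of oil is rising"` and `"A film in the cinema"`, the vocabulary shrinks from ten terms to five once the stopwords are dropped, so every document vector has half as many dimensions to compare.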

All merged commits are listed [https://github.com/xapian/xapian/commits/master/?author=richhiey1996 here].

== Work in Progress ==

* '''''Writing up documentation for the Getting Started with Xapian guide.'''''

Since diversification is new functionality in Xapian, I will be adding information on how to use the API to the Getting Started with Xapian guide, available [https://getting-started-with-xapian.readthedocs.io/en/latest/ here].

== Future Work ==

* '''''Evaluation on ClueWeb09'''''

The current diversification implementation still needs to be evaluated on the ClueWeb09 Category B data set using TREC 2009/2010 topical queries, and the results compared with those of the original paper.