= Clustering of search results =

WORK PRODUCT

This year's project added clustering functionality to the Xapian API. I had worked on this project in GSoC 2016, and an initial version of the API was merged this year. The API clusters search results using spherical KMeans clustering. Currently, the only measure used to calculate similarity between documents is cosine distance; more can be added in the future.

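To make the idea of spherical KMeans with cosine similarity concrete, here is a minimal sketch in plain Python. It is not the Xapian implementation: the function names (`normalize`, `cosine_similarity`, `spherical_kmeans`), the random initialisation and the toy vectors are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """For unit-length vectors the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, max_iters=20, seed=0):
    """Cluster vectors by cosine similarity; returns one cluster index per vector."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    # Assumed initialisation: pick k distinct documents as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [-1] * len(points)
    for _ in range(max_iters):
        # Assign each document to the most similar centroid.
        new_assignments = [
            max(range(k), key=lambda c: cosine_similarity(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no document changed cluster
        assignments = new_assignments
        # Recompute each centroid as the normalised mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignments


# Toy term-weight vectors: two documents about one topic, two about another.
docs = [
    [3.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 2.0, 2.0],
]
labels = spherical_kmeans(docs, k=2)
```

The step that makes this "spherical" is that both the documents and the centroids are projected back onto the unit sphere, so only the direction of a document vector matters and not its length.
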
I have been using the BBC news datasets available [http://mlg.ucd.ie/datasets/bbc.html here] for testing. On this dataset, clustering 1000 documents takes 2.5 - 3 seconds without any dimensionality reduction and 1.5 - 2 seconds with dimensionality reduction. When the resulting ClusterSet was passed to ClusterEvaluation, which currently implements only the silhouette coefficient, it returned an average silhouette coefficient of 0.7.

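For readers unfamiliar with the silhouette coefficient, the following sketch shows one common way to compute the average value with cosine distance. It is an illustration only and does not reproduce the ClusterEvaluation code; the helper names, the convention that a singleton cluster scores 0, and the toy data are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def average_silhouette(vectors, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members)
            / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


docs = [
    [3.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 2.0, 2.0],
]
good = average_silhouette(docs, [0, 0, 1, 1])  # topic-consistent clusters
bad = average_silhouette(docs, [0, 1, 0, 1])   # clusters that mix the topics
```

Values close to 1 mean documents sit much nearer to their own cluster than to any other; values near 0 or below suggest overlapping or wrongly assigned clusters.
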
The main parts of this project are:

* Merge in work on the clustering API from last year
* Add in stemming to reduce dimensionality of document vectors
* Add in stopword removal
* Implement KMeans with triangle inequality optimization
* Add in a ClusterEvaluation class to evaluate clusters
* Documentation of all work done

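The triangle inequality optimization in the list above can be illustrated with a short sketch. This is an assumed, simplified form of the idea (skip a centroid when it is provably no closer than the current best), not the planned Xapian code. It uses Euclidean distance, which satisfies the triangle inequality; for unit-length vectors the squared Euclidean distance equals 2 - 2·cos, so the nearest centroid is the same as under cosine distance. The names `euclidean`, `nearest_centroid` and `centroid_dists` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, centroids, centroid_dists):
    """Return (index of nearest centroid, number of distances computed)."""
    best = 0
    best_dist = euclidean(point, centroids[0])
    computed = 1
    for c in range(1, len(centroids)):
        # Triangle inequality: d(best, c) <= d(point, best) + d(point, c).
        # If d(best, c) >= 2 * d(point, best), then d(point, c) >= d(point, best),
        # so centroid c cannot be strictly closer and its distance is skipped.
        if centroid_dists[best][c] >= 2.0 * best_dist:
            continue
        dist = euclidean(point, centroids[c])
        computed += 1
        if dist < best_dist:
            best, best_dist = c, dist
    return best, computed


centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
# Centroid-to-centroid distances are computed once per iteration and reused
# for every document.
centroid_dists = [[euclidean(a, b) for b in centroids] for a in centroids]

near_index, near_computed = nearest_centroid([1.0, 1.0], centroids, centroid_dists)
far_index, far_computed = nearest_centroid([9.0, 1.0], centroids, centroid_dists)
```

The saving comes from replacing per-document distance computations with a lookup in the small centroid-to-centroid table, which matters most when document vectors are high-dimensional.
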
== Merged ==

A clustering API which supports KMeans clustering has been merged into master this year. The following components have been merged:

* Initial Clustering API (link to merged commit [https://github.com/xapian/xapian/commit/4848901bd2d6c2c134b6fe3d3237b1f540650581 here])
* Stopword Removal (link to merged commit [https://github.com/xapian/xapian/commit/d4a89263f858f202a676dfaedc630cc43e8beb24 here])
* Stemming
* Move Round Robin Clusterer to testsuite

Link containing all merged commits: [https://github.com/xapian/xapian/commits/master/?author=richhiey1996]

== Work in Progress ==

* '''''Writing up documentation for the Getting Started with Xapian guide.'''''

Since diversification is a new functionality in Xapian, I will be adding information on how to use the API to the Getting Started with Xapian guide [https://getting-started-with-xapian.readthedocs.io/en/latest/ here].

== Future Work ==

* '''''Evaluation on ClueWeb09'''''

The current diversification implementation needs to be evaluated on the ClueWeb09 Category B data set using TREC 2009/2010 topical queries, and the results compared with those of the original paper.