Clustering of search results
WORK PRODUCT
The project that I worked in this year was to do with adding in a clustering functionality to the API. I had worked on this project in GSoC 2016 and we have been able to merge in an initial version of the API this year. The API brings in functionality to cluster search results using spherical KMeans clustering. Currently the distance metric to calculate similarity between documents being used in cosine distance and there can be more added in the future.
I have currently been using the BBC news datasets available here for test purposes. On this dataset, it takes 2.5 - 3 seconds to cluster 1000 documents without any dimensionality reduction and 1.5 - 2 seconds to cluster 1000 documents with dimensionality reduction. When the resulting ClusterSet was passed to ClusterEvaluation which currently implements only the Silhouette coefficient, it returns an average silhouette coefficient of 0.7.
The main parts of this project are:
- Merge in work on the clustering API from last year
- Add in stemming to reduce dimensionality of document vectors.
- Add in stopword removal
- Implement KMeans with triangle inequality optimization
- Add in Cluster Evaluation class to evaluate clusters
- Documentation of all work done
Merged
A clustering API which supports KMeans clustering has been merged into master this year. Following components have been merged in this year.
- Initial Clustering API (Link to merged commit here)
- Stopword Removal (Link to merged commit here
- Stemming
- Move Round Robin Clusterer to testsuite
Link containing all merged commits https://github.com/xapian/xapian/commits/master/?author=richhiey1996
Work in Progress
- Writing up documentation for the Getting Started with Xapian guide.
Since clustering is a new functionality in Xapian, I would be adding information on how to use the API in the Xapian getting started guide here
- Cluster Evaluation
Once search results are clustered, the result is returned in a ClusterSet. The quality of clustering within this ClusterSet can be calculated by internal clustering measures which are to be implemented. Currently, silhouette coefficient works and the others are to be implemented.
- Triangle inequality
Unfortunately, triangle inequality has not been able to provide a significant speed up and I will be trying to optimize it a little more before starting a PR.
Future Work
- Merge in work on cluster evaluation
The ClusterEvaluation is a feature which is important because the quality of returned clusters can be determined by various internal indices. Thus initial clustering parameters can be chosen properly.
- Improve clustering run time through optimizations.
The run time speed of clustering needs to be improved since it should be useful for interactive search too. Hence using an optimized algorithm, or adding improvements to the current code would be important.
- Implement more initialization techniques and distance metrics.
Currently, we are initializing by taking every k'th point in our collection of points and assigning it as the initial centroid. This is a cheap operation and works fine because the MSet already has some sorted order. But it would be good to have more initialization techniques like kmeans++ and random initialization going forward.
- Providing important cluster labels
Providing the important terms within clusters are important because this is something that a user of the API might want. Its useful to see the important terms in a cluster as it gives some more insight into the search results. This coulod be added onto Omega as a feature where we can see the clusters and the important terms within it.