Clustering of Search Results
This project aims to cluster the documents returned by a database search, so that similar documents are grouped together. KMeans, one of the best-known clustering techniques, is used to cluster documents based on their document vectors, which are built from the TF-IDF weights of the terms within each document. Because KMeans is a local search method, it can get stuck in a local optimum. To improve the quality of the clusters returned, Particle Swarm Optimization was to be used together with KMeans.
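As a rough illustration of how document vectors can be built from TF-IDF weights, here is a minimal sketch. It is not the project's implementation: the function name `tfidf_vectors`, the whitespace tokenizer, and the weighting variant (raw term frequency times `log(N / df)`) are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Assumed weighting: raw term frequency * log(N / document frequency).
    Other TF-IDF variants exist; this one is only for illustration.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical example documents.
docs = [
    "xapian search engine",
    "search results clustering",
    "clustering search results",
]
vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every document (here "search") gets weight zero, so it does not influence how documents are grouped.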
The main parts of the project so far:
- Build an API that users can call and that is also used internally to build Clusterers
- Implement Round Robin clusterer, a simple clusterer to work with the initial API
- Refactor the API to adapt to KMeans
- Implement Cosine similarity and Euclidean similarity metrics
- Implement KMeans clustering algorithm with random initialization
- Implement KMeans++ initialization to improve the selection of initial centroids
- Evaluation of the implemented Clusterers
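The two similarity measures in the list above can be sketched as follows for sparse `{term: weight}` vectors. The function names and the dictionary representation are assumptions made for this example, not the project's API. Note that the Euclidean measure is a distance, so smaller values mean more similar documents.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention assumed here: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse {term: weight} vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))
```

Cosine similarity ignores vector length, which is often convenient for documents of different sizes, while Euclidean distance is sensitive to it.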
The following components have been merged into xapian/richhiey1996/cluster. This branch will be merged into master once the entire API has been thoroughly tested, evaluated and documented to a satisfactory standard.
- Initial API for clustering
- Round Robin clusterer (Test Clusterer to work on existing API)
The merged commits are listed at: https://github.com/xapian/xapian/commits/richhiey1996/cluster?author=richhiey1996
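To show why a round robin clusterer is a convenient test case for a clustering API, here is a minimal sketch. It assumes round robin means that the document at position i goes into cluster i modulo k; the function name `round_robin_clusters` is hypothetical.

```python
def round_robin_clusters(documents, k):
    """Deal documents into k clusters in turn, ignoring their content."""
    if k <= 0:
        raise ValueError("k must be positive")
    clusters = [[] for _ in range(k)]
    for i, doc in enumerate(documents):
        # Document i lands in cluster i mod k.
        clusters[i % k].append(doc)
    return clusters
```

Because the assignment needs no document vectors or similarity measure, such a clusterer exercises the API's plumbing without depending on any other component.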
Work to be Merged
Currently, I am completing my work on KMeans and refactoring the API to support it. A pull request contains the work on KMeans and the refactored clustering API.
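For readers unfamiliar with the algorithm, the following is a minimal sketch of KMeans with KMeans++ seeding on dense points. It is not the code in the pull request: the function names, the dense list-of-floats representation, the squared Euclidean distance, and the fixed seed are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_pp_init(points, k, rng):
    """Pick k initial centroids, favouring points far from those already chosen."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:
            # All points coincide with a centroid; fall back to a uniform pick.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=100, seed=0):
    """Return (assignment, centroids) after alternating assign/update steps."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # No point changed cluster, so the result is stable.
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical data: two well-separated groups of two points each.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
assignment, centroids = kmeans(points, 2)
```

The loop stops at the first stable assignment, which is a local optimum and not necessarily the best clustering. That is why the choice of initial centroids matters.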
Work in Progress
- Write up documentation and evaluate the Clusterers with external clustering measures
- Tidying up the API by moving out parts that aren't required in the public API
- Use PIMPL with public API classes to hide non-public data members and functions within the classes
- Implement more initialization techniques, mainly Particle Swarm Optimization
- Dimensionality reduction of document vectors
- Improve speed of Clusterers
- Complete the missing documentation and improve the existing documentation