Clustering of Search Results

This project aimed to cluster the documents returned by a database search, grouping similar documents together. KMeans, one of the best-known clustering techniques, is used to cluster documents based on their document vectors, which are built from the TF-IDF weights of the terms within each document. Because KMeans performs a local search, it can get stuck in a local optimum. To improve the quality of the clusters returned, Particle Swarm Optimization was to be used with KMeans.
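To make the idea of TF-IDF document vectors concrete, here is a minimal sketch. It assumes the common weighting tf * log(N / df); the project's actual weighting scheme is not stated here, and the names TermFreqs, DocVector and tfidf_vectors are hypothetical and not part of the Xapian API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only (not Xapian classes).
using TermFreqs = std::map<std::string, int>;     // term -> count in one document
using DocVector = std::map<std::string, double>;  // term -> TF-IDF weight

// Build one sparse TF-IDF vector per document.
// Assumed weighting: tf * log(N / df), where N is the number of documents
// and df is the number of documents containing the term.
std::vector<DocVector> tfidf_vectors(const std::vector<TermFreqs>& docs) {
    assert(!docs.empty());

    // Count, for every term, how many documents contain it.
    std::map<std::string, int> doc_freq;
    for (const auto& doc : docs)
        for (const auto& entry : doc)
            ++doc_freq[entry.first];

    const double n = static_cast<double>(docs.size());
    std::vector<DocVector> vectors;
    vectors.reserve(docs.size());
    for (const auto& doc : docs) {
        DocVector vec;
        for (const auto& entry : doc) {
            const double idf = std::log(n / doc_freq[entry.first]);
            vec[entry.first] = entry.second * idf;
        }
        vectors.push_back(vec);
    }
    return vectors;
}
```

With this weighting, a term that occurs in every document gets weight zero, so it contributes nothing to the similarity between two documents.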

The main parts of the project so far:

Building an API that can be used by the users, and internally to build up Clusterers

Implement Round Robin clusterer, a simple clusterer to work with the initial API

Refactor the API to adapt to KMeans

Implement Cosine similarity and Euclidean similarity metrics

Implement KMeans clustering algorithm with random initialization

Implement KMeans++ initialization for improving initial centroids selection

Evaluation of the implemented Clusterers
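The metrics, the KMeans++ initialization and the KMeans iteration listed above can be sketched as follows. This is a simplified illustration on dense vectors using Euclidean distance for assignment, not the code from the branch; Point, kmeanspp_init and kmeans are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <numeric>
#include <random>
#include <vector>

using Point = std::vector<double>;  // dense document vector (illustrative)

double euclidean_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

double cosine_similarity(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) return 0.0;  // convention for zero vectors
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// KMeans++ seeding: the first centroid is uniform at random; each later one
// is drawn with probability proportional to the squared distance to the
// nearest centroid chosen so far.
std::vector<Point> kmeanspp_init(const std::vector<Point>& pts, std::size_t k,
                                 std::mt19937& rng) {
    assert(k > 0 && k <= pts.size());
    std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);
    std::vector<Point> centroids(1, pts[pick(rng)]);
    std::vector<double> d2(pts.size(), std::numeric_limits<double>::max());
    while (centroids.size() < k) {
        for (std::size_t i = 0; i < pts.size(); ++i) {
            const double d = euclidean_distance(pts[i], centroids.back());
            d2[i] = std::min(d2[i], d * d);
        }
        const double total = std::accumulate(d2.begin(), d2.end(), 0.0);
        if (total == 0.0) {  // all points coincide with a centroid
            centroids.push_back(pts[pick(rng)]);
        } else {
            std::discrete_distribution<std::size_t> weighted(d2.begin(), d2.end());
            centroids.push_back(pts[weighted(rng)]);
        }
    }
    return centroids;
}

// Lloyd's iteration: assign each point to its nearest centroid, then move
// each centroid to the mean of its points, until assignments stop changing.
std::vector<std::size_t> kmeans(const std::vector<Point>& pts, std::size_t k,
                                std::mt19937& rng, int max_iters = 100) {
    std::vector<Point> centroids = kmeanspp_init(pts, k, rng);
    std::vector<std::size_t> label(pts.size(), 0);
    for (int iter = 0; iter < max_iters; ++iter) {
        bool changed = false;
        for (std::size_t i = 0; i < pts.size(); ++i) {
            std::size_t best = 0;
            double best_d = euclidean_distance(pts[i], centroids[0]);
            for (std::size_t c = 1; c < k; ++c) {
                const double d = euclidean_distance(pts[i], centroids[c]);
                if (d < best_d) { best_d = d; best = c; }
            }
            if (best != label[i]) { label[i] = best; changed = true; }
        }
        if (!changed && iter > 0) break;
        std::vector<Point> sums(k, Point(pts[0].size(), 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < pts.size(); ++i) {
            for (std::size_t j = 0; j < pts[i].size(); ++j)
                sums[label[i]][j] += pts[i][j];
            ++counts[label[i]];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // keep an empty cluster's centroid
            for (double& x : sums[c]) x /= static_cast<double>(counts[c]);
            centroids[c] = sums[c];
        }
    }
    return label;
}
```

Random initialization would simply replace kmeanspp_init with k points drawn uniformly; KMeans++ tends to spread the initial centroids apart, which reduces the chance of a poor local optimum.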

Merged

The following components have been merged into xapian/richhiey1996/cluster. This branch will be merged into master once the entire API has been thoroughly tested, evaluated and documented.

Initial API for clustering

Round Robin clusterer (Test Clusterer to work on existing API)

Commits that have been merged: https://github.com/xapian/xapian/commits/richhiey1996/cluster?author=richhiey1996
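For illustration, a round robin clusterer can be as small as the sketch below: document i goes to cluster i mod k, regardless of content, which is why it is only useful for exercising an API. The function name and the use of plain unsigned document IDs are assumptions, not the merged interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: distribute document IDs over k clusters in turn.
// No similarity is computed; this only tests the plumbing around a clusterer.
std::vector<std::vector<unsigned>> round_robin_cluster(
        const std::vector<unsigned>& doc_ids, std::size_t k) {
    assert(k > 0);
    std::vector<std::vector<unsigned>> clusters(k);
    for (std::size_t i = 0; i < doc_ids.size(); ++i)
        clusters[i % k].push_back(doc_ids[i]);
    return clusters;
}
```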

Work to be Merged

I am currently completing my work on KMeans and refactoring the API to support it. A pull request contains the work on KMeans and the refactored clustering API.

Work in Progress

Writing up documentation and evaluating with external clustering measures

Tidying up the API by moving out parts that aren't required in the public API

Future Work

Use PIMPL with public API classes to hide non-public data members and functions within the classes

Implement more initialization techniques, mainly Particle Swarm Optimization

Dimensionality reduction of document vectors

Improve speed of Clusterers

Complete the documentation and improve what already exists
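As a rough picture of the PIMPL item above, the sketch below shows the general pattern in a single file. ClusterSet and its Internal class are hypothetical, and std::unique_ptr is used only for brevity; the project may well use a different smart pointer and different class names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// --- What would live in the public header (hypothetical class) ---
class ClusterSet {
  public:
    explicit ClusterSet(std::size_t num_clusters);
    ~ClusterSet();  // defined where Internal is a complete type
    std::size_t size() const;

  private:
    class Internal;                      // only forward-declared publicly
    std::unique_ptr<Internal> internal;  // all data members hide behind this
};

// --- What would live in the source file ---
class ClusterSet::Internal {
  public:
    explicit Internal(std::size_t n) : num_clusters(n) {}
    std::size_t num_clusters;
};

ClusterSet::ClusterSet(std::size_t num_clusters)
    : internal(new Internal(num_clusters)) {
    assert(num_clusters > 0);
}

ClusterSet::~ClusterSet() = default;

std::size_t ClusterSet::size() const { return internal->num_clusters; }
```

Because users of the header see only the pointer, members of Internal can change without altering the public class layout.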

Last modified on 22/08/16 21:44:57