wiki:GSoC2017/Clustering/Journal

Community Bonding Week 1: May 6–May 12

Started work on KMeans by beginning an implementation of Elkan's k-means (KMeans that uses the triangle inequality to skip unnecessary distance computations and so improve performance). Also fixed a few issues in the current KMeans PR. Work in this period will be a little slow because of exams, which run until the end of the month.
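To illustrate the idea, here is a minimal Python sketch of one triangle inequality test in the assignment step. It is not the code from the PR: the names `dist` and `assign_with_pruning` are made up for this example, and it shows only the simplest pruning rule rather than the full algorithm with per-point bounds. The rule is that if d(c, c') >= 2 * d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' never needs to be computed.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid.

    Returns (labels, skipped), where skipped counts the point-to-centroid
    distance computations avoided by the triangle inequality test.
    Hypothetical helper for illustration only.
    """
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)]
          for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_d = dist(p, centroids[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(p, c_best), the triangle inequality
            # gives d(p, c_j) >= d(p, c_best), so c_j cannot be closer.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


points = [(0, 0), (0.1, 0), (10, 10), (10.2, 9.9)]
centroids = [(0, 0), (10, 10)]
labels, skipped = assign_with_pruning(points, centroids)
```

The saving grows with the number of centroids and with how well separated they are, since the k*k centroid distances are shared by every point in the pass.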

Community Bonding Week 2: May 13–May 19

Working on getting the KMeans PR (https://github.com/xapian/xapian/pull/149) into shape. Will continue to improve the API by moving to classes that hold reference-counted pointers to internal classes.

Community Bonding Week 3: May 20–May 26

No work done due to university exams.

Community Bonding Week 4: May 27–June 2 (work begins May 30)

No work done due to university exams.

Coding Week 3: June 3–June 9

Work on moving the classes to PIMPL implementations.

Coding Week 4: June 10–June 16

Work on the review of the PIMPL classes and start work on dimensionality reduction, since clustering high-dimensional document vectors takes too long to run.

1) Removal of stopwords

2) Removal of other words which might not be important

3) Start implementing a way to test KMeans and other clusterers

Coding Week 5: June 17–June 23

Worked through the review of PR 149 (https://github.com/xapian/xapian/pull/149) by James and Olly and made all the requested changes. Also started discussing how to approach stopword removal and stemming. Started a PR for stopword removal: https://github.com/richhiey1996/xapian/pull/2.

I opened this PR against my own fork because PR 149 had not been merged yet.

Coding Week 6: June 24–June 30 (evaluations: June 26–30)

Started working on dimensionality reduction. PR 149 has finally been merged and closed. Goals for this week:

1) Make stopword removal PR ready for merge

2) Start a PR for stemming. Since we discard all unstemmed terms, the stemmed forms of stopwords that remain in the document termlist must be removed as well. As Olly suggested, the best approach is a subclass of the Stopper class that also stores and recognises the stemmed forms of the stopwords (SimpleStopper does not do this). I will work on that and get this PR ready for merge.
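A rough Python sketch of the stemmed-stopword idea, under stated assumptions: `toy_stem` is a hypothetical stand-in for a real stemmer, and `StemmedStopper` and `filter_terms` are names invented for this example, not the classes in the PR. The point it shows is that a stopper which stores only the original stopwords misses their stemmed forms once the termlist contains stemmed terms only.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class StemmedStopper:
    """Stopper that recognises stopwords and their stemmed forms."""

    def __init__(self, stem, words=()):
        self._stem = stem          # callable mapping a word to its stem
        self._stopwords = set()
        for w in words:
            self.add(w)

    def add(self, word):
        # Store both forms so either one is recognised later.
        self._stopwords.add(word)
        self._stopwords.add(self._stem(word))

    def __call__(self, term):
        return term in self._stopwords


def filter_terms(terms, stopper):
    """Return the terms that the stopper does not reject."""
    return [t for t in terms if not stopper(t)]


stopper = StemmedStopper(toy_stem, ["having", "being", "the"])
raw_terms = ["having", "clustering", "the", "documents"]
stemmed_terms = [toy_stem(t) for t in raw_terms]
kept = filter_terms(stemmed_terms, stopper)
```

With this toy stemmer, "having" becomes "hav" in the stemmed termlist. A plain set holding only "having" would let "hav" through, while the stopper above rejects it because the stem was stored when the stopword was added.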

July 1–August 14

Added stopword removal and stemming, and moved the RoundRobin clusterer from the public API to the tests.

August 15–August 18

Work on the triangle inequality optimization and the ClusterEvaluation class, and start separate PRs for them as soon as possible.

Coding Week 14: August 19–August 25 (evaluations: August 21–29)

Final Evaluations: August 26–August 29

Last modified on 14/08/17 20:11:31