wiki:GSoC2014/Arabic Support/Journal

Community Bonding Week 2: April 28-May 4

Community Bonding Week 3: May 5-May 11

Community Bonding Week 4: May 12-May 19

Coding Week 1: May 20-May 25

May 20

  • Basic Normalizer: I started from scratch (I could only work for a few hours because of urgent travel).

branch: not yet. documentation I needed: Snowball.

May 21

  • Test samples: I browsed through many Arabic corpora and chose this one because it covers diverse topics:

Motaz K. Saad and Wesam Ashour, "OSAC: Open Source Arabic Corpus", 6th ArchEng International Symposiums, EEECS’10 the 6th International Symposium on Electrical and Electronics Engineering and Computer Science, European University of Lefke, Cyprus, 2010.

branch: N/A. documentation I needed: Omega.

May 22-23-24

  • Stopwords: basic Arabic stopword list, containing about 10k words (counting all forms): link
  • Stopwords: I also included stopword lists for other languages from the Snowball project, e.g. English stopwords.
  • Stopwords: updated the Arabic stopword list by eliminating many words that may also appear as non-stopwords and by eliminating variant forms: Arabic stopword list
  • Stopwords: worked on loading stopwords from a file.

branch: stopword, documentation I needed: autotools, SWIG.
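The file-loading step mentioned above can be sketched roughly as follows. This is only an illustration in Python, not the code from the stopword branch: the function names `load_stopwords` and `load_stopwords_from_file` are hypothetical, and the file format (whitespace-separated words, with `|` starting a comment as in the Snowball stopword lists) is an assumption.

```python
def load_stopwords(lines):
    """Parse an iterable of text lines into a set of stopwords."""
    stopwords = set()
    for line in lines:
        # Assumption: '|' starts a comment, as in Snowball's stopword lists.
        text = line.split("|", 1)[0]
        # A line may hold zero, one, or several whitespace-separated words.
        stopwords.update(text.split())
    return stopwords


def load_stopwords_from_file(path):
    """Read a UTF-8 stopword file (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return load_stopwords(handle)
```

The resulting set could then be added word by word to a stopper object, for example through the `add()` method of Xapian's `SimpleStopper`.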

Coding Week 2: May 26-June 1

May 26

  • Testing environment: I indexed the chosen corpus using Omega and tried searching and other operations on it.

May 27-28

  • Stopwords: continued working on loading stopwords from a file.

pull-request: https://github.com/xapian/xapian/pull/35

May 29-30

  • Sphinx documentation: finished the work on the Sphinx documentation patch.

pull-request: https://github.com/xapian/xapian/pull/34

Coding Week 3: June 2-June 8

June 2 - 4

TODO

June 5 - 7

  • Normalizer: gathered the Unicode code points of the Arabic letters and proposed a prototype for the normalizer.

branch: normalizer_cpp

Coding Week 4: June 9-June 15

June 9-10

  • Normalizer: worked on the implementation of the normalizer. It works now; the next step is to integrate it.

Example of normalization: مؤيًّدًا ==> مءيدا
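A minimal Python sketch that reproduces this example is given here for illustration. It is not the code from the normalizer_cpp branch. The rules are assumptions inferred from the example alone: the diacritics U+064B to U+0652 and the tatweel U+0640 are dropped, and hamza written on waw or yeh is reduced to a bare hamza (U+0621). The names `DIACRITICS`, `HAMZA_MAP` and `normalize` are hypothetical.

```python
# Assumption: fathatan..sukun (U+064B..U+0652) are the marks to strip.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}
TATWEEL = "\u0640"  # elongation character, carries no meaning

# Assumption: hamza on waw (U+0624) or on yeh (U+0626) becomes bare hamza.
HAMZA_MAP = {"\u0624": "\u0621", "\u0626": "\u0621"}


def normalize(word):
    """Strip diacritics and tatweel, and unify hamza carriers."""
    out = []
    for ch in word:
        if ch in DIACRITICS or ch == TATWEEL:
            continue  # drop the mark entirely
        out.append(HAMZA_MAP.get(ch, ch))
    return "".join(out)


# The journal's example word, spelled out as code points:
# meem, waw-hamza, yeh, fathatan, shadda, dal, fathatan, alef
sample = "\u0645\u0624\u064A\u064B\u0651\u062F\u064B\u0627"
print(normalize(sample))
```

A fuller normalizer would probably also have to decide how to treat the alef variants and other letter forms, which the example above does not cover.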

branch: normalizer_cpp

June 11-12

branch: normalizer_cpp

June 13

  • Stemmer: basic structure of the Arabic stemmer: defining the letters.

branch: stemmer_snowball

Coding Week 5: June 16-June 22

June 17

  • Romanization: implementation of the ISO 233 romanization standard (changes)

June 18

  • Stemmer: prototype of an aggressive Arabic stemmer for prefixes and suffixes (changes)
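
The general idea of stripping prefixes and suffixes can be sketched in Python as below. This is only an illustration under stated assumptions, not the Snowball prototype from the stemmer_snowball branch: the affix lists, the minimum remaining length of three letters, and the name `strip_affixes` are all hypothetical choices made for the example.

```python
# Assumption: a small illustrative set of affixes, longest prefixes first
# so that "wal-" is tried before "al-" or "w-".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def strip_affixes(word, min_len=3):
    """Remove at most one prefix and one suffix, keeping >= min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


print(strip_affixes("والكتاب"))
print(strip_affixes("المعلمون"))
```

The length check is what keeps short words from being reduced to nothing. A more aggressive stemmer would presumably loosen it or strip repeatedly, at the cost of conflating more unrelated words.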

June 19-21

Coding Week 6: June 23-June 29 (Midterm deadline June 27)

June 23

Coding Week 7: June 30-July 6

Coding Week 8: July 7-July 13

Coding Week 9: July 14-July 20

Coding Week 10: July 21-July 27

Coding Week 11: July 28-August 3

Coding Week 12: August 4-August 10

Coding Week 13: August 11-August 18 (Final evaluation based on work up to August 18)

Last modified on 24/06/14 02:28:20