wiki:GSoC2012/QueryParser/Journal

Community Bonding Period: April 23-May 20

  • Forked git repository.
  • Compiled latest repository.
  • Brushed up on scanners, parsers, and LL and LR grammars.
  • Read the Xapian docs (source docs and API docs) to get acquainted with the code.
  • Got acquainted with the source code of TermGenerator and QueryParser and with the use of the Lemon parser generator.

Coding Week 1: May 21-May 27

Coding Week 2: May 28-June 3

  • Surveyed the query syntax offered by other search engines, compared it with the current Xapian query syntax, and proposed some new features on that basis. FileCommit
  • Tested the real-world queries in queryparsertest.cc after disabling the code that parses with flags off.
  • Worked out why ~130 queries fail. For each failing query, wrote down the tokens produced, the reason for failure, etc. in plain text. FileCommit
  • Summarized the errors found in the above procedure in a plain text file. FileCommit

Coding Week 3: June 4-June 10

  • Got acquainted with the Lucene QueryParser and tried the failing real-world queries on Lucene.
  • Summarized the behaviour of the Lucene QueryParser on those queries in a plain text file. FileCommit
  • Wrote the code for an emoticon extractor (report/emoticon.cc). CodeCommit
  • Tried and tested solutions for the other parse errors summarized earlier, in a series of small commits. CodeCommit
    Details of the solutions are in report/solutions.txt.

Coding Week 4: June 11-June 17

Coding Week 5: June 18-June 24

  • Got acquainted with the concepts of link grammar via Introduction to Link Grammar Parser.
  • Went through the Link Grammar mailing list for ideas regarding POS tagging.
  • Worked out the differences and similarities between the commonly used Penn Treebank style of POS tagging and the links that Link Grammar generates.
    This was confusing at first, since Link Grammar uses a dependency grammar style rather than the more common constituency grammar style.
  • After going through the Link Grammar documentation and mailing list and trying some other POS-specific parsers for comparison, several points emerged.
  • Modified the queryparser doc to correct a wrong parse and change the wording, as olly pointed out. DocumentationCommit
  • Fixed some typos in report/summary.rst and deleted the backup file from the GitHub repo.
  • Modified queryparser.lemony according to comments given by olly on earlier commits. CodeCommit
  • Modified the testcases present in queryparsertest.cc according to comments given by olly. CodeCommit
  • Worked out what to do, and how, about turning the error recovery code on and off and about returning the corrected query to the user. These points are listed on the TODOS wiki page.
    The scheme for handling the parse error flags is described here.

Coding Week 6: June 25-July 1

  • Got acquainted with Link Grammar API via Link Grammar API documentation.
  • Browsed the Link Grammar source code to get familiarized with the code.
  • Explored different ways (and their pros and cons) in which Link Grammar can be used in Xapian to provide POS tags.
  • Modified queryparser.lemony to ensure that negative numbers are not treated as hate (-) terms. CodeCommit
  • Added testcases for negative numbers. CodeCommit
  • Made a remote repo to keep track of the commits in the xapian main branch. Merged it with my working branch "mybranch". Commit
  • Made a copy of TODOS on the repo. Commit
  • Discussion regarding the error recovery flags/API etc. going on at present.
  • Writing class to extract POS using Link Grammar.

Coding Week 7: July 2-July 8

  • Refactored the error recovery code and made the following changes: CodeCommit
    • Introduced a struct (parse_error_s) to encode the errors into types.
    • Introduced a flag, FLAG_ERROR_RECOVERY to switch off/on the error recovery code.
    • Added two API methods: get_error_detail() and get_error_description_string() to interact with the struct representing the parse error types.
    • Modified examples/quest.cc to represent the use of get_error_description_string() method.
  • Modified queryparsertest.cc. CodeCommit
  • Added a check for the Link Grammar library to the configure script, and used the macro it defines via config.h so that compilation does not fail when the Link Grammar library is not present on the machine. CodeCommit
  • Wrote the Link Grammar interface header file. CodeCommit
  • Wrote the Link Grammar interface implementation file. CodeCommit
  • Implemented POS-tag-based indexing in TermGenerator using Link Grammar. CodeCommit
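
The error recovery changes listed under the first item above could look roughly like the following sketch. Only the names parse_error_s, FLAG_ERROR_RECOVERY, get_error_detail() and get_error_description_string() come from this journal. The error categories, the struct fields, the flag value, the return types and the ToyQueryParser class are assumptions made for illustration, and this is not the code from the actual commit.

```cpp
#include <string>

// Hypothetical error categories; the journal does not list the real ones.
enum parse_error_type {
    PARSE_ERROR_NONE,
    PARSE_ERROR_UNMATCHED_BRACKET
};

// Stand-in for parse_error_s; both fields are assumptions.
struct parse_error_s {
    parse_error_type type;
    std::string::size_type position;  // offset of the problem in the query
};

// Toy parser that only checks bracket balance. Not Xapian::QueryParser.
class ToyQueryParser {
    parse_error_s last_error;

  public:
    // Flag name is from the journal; its value here is an assumption.
    enum { FLAG_ERROR_RECOVERY = 1 };

    ToyQueryParser() : last_error{PARSE_ERROR_NONE, std::string::npos} {}

    // Returns true if the query is accepted: either it is well-formed, or
    // it is malformed and FLAG_ERROR_RECOVERY is set. The error, if any,
    // is recorded in both cases.
    bool parse(const std::string& query, unsigned flags) {
        last_error = parse_error_s{PARSE_ERROR_NONE, std::string::npos};
        int depth = 0;
        for (std::string::size_type i = 0; i < query.size(); ++i) {
            if (query[i] == '(') {
                ++depth;
            } else if (query[i] == ')') {
                if (depth == 0) {
                    // Stray closing bracket.
                    last_error = parse_error_s{PARSE_ERROR_UNMATCHED_BRACKET, i};
                    break;
                }
                --depth;
            }
        }
        if (last_error.type == PARSE_ERROR_NONE && depth > 0) {
            // Ran out of input with a bracket still open.
            last_error =
                parse_error_s{PARSE_ERROR_UNMATCHED_BRACKET, query.size()};
        }
        if (last_error.type == PARSE_ERROR_NONE) return true;
        return (flags & FLAG_ERROR_RECOVERY) != 0;
    }

    // Typed error information for callers that want to branch on it.
    const parse_error_s& get_error_detail() const { return last_error; }

    // Human-readable form, e.g. for printing in a tool such as quest.
    std::string get_error_description_string() const {
        switch (last_error.type) {
            case PARSE_ERROR_NONE: return "no error";
            case PARSE_ERROR_UNMATCHED_BRACKET: return "unmatched bracket";
        }
        return "unknown error";
    }
};
```

In this sketch a caller that passes FLAG_ERROR_RECOVERY still gets a usable result for a malformed query and can then ask get_error_description_string() what was wrong, which is the kind of use the quest.cc change above describes.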

Coding Week 8: July 9-July 15 (Midterm deadline July 13)

  • Added a simple example, examples/pos_index.cc, to show the features of the POS-based indexing. CodeCommit
    Given a sentence, it does the following:
    1. Indexes it as a Xapian document using POS support from Link Grammar.
    2. Shows the linkage diagram produced for the given sentence.
    3. Shows the POS tags extracted for the words of the sentence.
    4. Shows the structure of the constituent tree (noun phrase, verb phrase, etc.) produced for the given sentence.

It also contains two sample testcases and the corresponding output produced, as well as the output of delve for the corresponding document, at the end of the file.

Coding Week 9: July 16-July 22

  • Explored options for Sentence Breaking Implementation.
  • Went through and got familiarized with the Sentence Break Iterator of ICU.
  • Added a check for ICU's BreakIterator class to the configure script. CodeCommit
  • Integrated ICU's Sentence Break Iterator into the existing POS-based indexing in TermGenerator. CodeCommit
  • Working on the Link Grammar implementation in QueryParser.
  • Off. Busy with college re-opening.

Coding Week 10: July 23-July 29

Coding Week 11: July 30-August 5

  • Added testcases for LinkGrammar integration in QueryParser. CodeCommit
  • Merging with the main branch and resolving conflicts.

Coding Week 12: August 6-August 12

Coding Week 13: August 13-August 20 (Final evaluation based on work up to August 20)

Last modified on 26/01/16 10:10:43