GSoC2012/QueryParser/TODOS – Xapian

TODO's Bold ---> Pending Work

Understanding the QueryParser and TermGenerator code and making the queryparser doc.
Transferring the above produced doc to rst format.
Exploring the syntax generally available in other search engines and figuring out the WHAT and HOW regarding the additions to Xapian QueryParser syntax.
Testing the real-world-queries present in queryparsertest.cc after disabling the re-parse with flags off code.
Figuring out the details of the queries failed in the above test.
Summarizing the errors in the QueryParser.
Using Lucene and understanding its QueryParser source-code.
Figuring out the behavior of Lucene on the ~130 queries which failed after disabling the reparse with flags off code.
Writing the emoticon extractor.
Trying and testing the solutions for the other Parse errors found earlier.
Updating the queryparser doc according to the reviews and doing the other things mentioned in the review.
Adding testcases to queryparsertest.cc for the solutions to Parse errors.
Making the revised roadmap to put the GSoC project on track, updating the GSoC wiki page and making the Journal and TODOS page.
Explore Link Grammar, its concepts and its differences with respect to Constituency Grammar/Phrase Structure Grammar style.

Going through the documentation and the mailing list of Link Grammar, as well as some other POS based parsers such as the Stanford Parser, suggests the following main points (needs discussion):
- Link Grammar doesn't do POS tagging as such, but it does subscript words with strings like ".n" for nouns, ".v" for verbs, ".a" for adjectives and ".e" for adverbs. This is mentioned clearly in Section 3.4 of Introduction to Link Grammar Parser.
- These subscripts CAN be used as POS tags, and in our case it may be appropriate to use these basic tags only rather than detailed Penn Treebank style POS tags which other POS specific parsers like Stanford parser etc. use.
- Here are two mails from the Link Grammar mailing list in which one of the maintainers (Linas Vepstas) discusses POS tagging with Link Grammar: Link1 and Link2.
- Using Link Grammar for basic tagging will also mean that we won't tag ALL words with their corresponding POS tags (For example the common word "the" in sentences).
- RelEx, a Java program based on Link Grammar, applies some rules to the output of Link Grammar and does feature tagging. Do we need any of these feature(s) for our work? In the mail mentioned in Link1 above, it is suggested that it may be possible to take the rules used by RelEx and code them ourselves. [Discussion to be done]
- Olly suggested that tagging Noun Phrases also will be good. [Discussion in-between].
  - It seems that this can be done using Link Grammar, since recent versions provide a new post-processing feature, the Phrase Parser, which generates constituents such as noun phrases, verb phrases, and prepositional phrases.
- Link Grammar seems unable to generate links for incomplete sentences. This won't be a problem when tagging words during indexing (since documents being indexed usually contain complete sentences), but it can be a problem at the later stage of POS tagging queries: a query would have to be a complete sentence to get POS tagged using Link Grammar.
  - I tried incomplete sentences in both Link Grammar online parser as well as Stanford online Parser (just for the purpose of comparison). It revealed that Stanford Parser was able to tag POS to all words in incomplete sentences as well, but this wasn't the case for Link Grammar Parser. (Example incomplete sentences which can be a query - "Latest Watches" , "new chrome version" and many more...).
  - Since RelEx uses basic POS tags only as mentioned in feature tagging, and not detailed POS tags, it also doesn't do POS tagging for all the words. BUT, it MIGHT be able to give POS tags to incomplete sentences, haven't tried that yet. Will try and update here.
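
The idea of using the Link Grammar subscripts as coarse POS tags can be sketched roughly as below. This is only an illustration under assumptions: `TaggedWord` and `split_subscript` are hypothetical names (they are not part of Link Grammar or Xapian), the input is assumed to be a word as printed by the parser with its subscript attached (e.g. "dog.n"), and only the four subscripts listed above are recognised. Anything else, including words with no subscript such as "the", is left untagged.

```cpp
#include <string>

// Hypothetical result type: a word plus a coarse POS tag.
struct TaggedWord {
    std::string word;  // the word with any recognised subscript removed
    std::string tag;   // "noun", "verb", "adjective", "adverb", or "" if untagged
};

// Hypothetical helper: split a subscripted word such as "dog.n" into the
// word and a coarse POS tag. Only ".n", ".v", ".a" and ".e" are handled;
// any other suffix is treated as part of the word and left untagged.
TaggedWord split_subscript(const std::string& token) {
    const std::string::size_type dot = token.rfind('.');
    if (dot == std::string::npos || dot == 0) {
        return TaggedWord{token, ""};  // no subscript at all
    }
    const std::string sub = token.substr(dot + 1);
    std::string tag;
    if (sub == "n") {
        tag = "noun";
    } else if (sub == "v") {
        tag = "verb";
    } else if (sub == "a") {
        tag = "adjective";
    } else if (sub == "e") {
        tag = "adverb";
    } else {
        return TaggedWord{token, ""};  // unrecognised suffix, e.g. "3.5"
    }
    return TaggedWord{token.substr(0, dot), tag};
}
```

With this sketch, `split_subscript("dog.n")` gives the word "dog" with tag "noun", while `split_subscript("the")` gives "the" with an empty tag, which matches the point above that not ALL words would get a POS tag.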

Making changes in queryparser.lemony and queryparsertest.cc according to the comments given by Olly on the GitHub repo.

Regarding the flags for error recovery code, after some thinking I came up with a scheme. It's written here.

Got acquainted with Link Grammar API
Browsed the Link Grammar source code to get familiarized with the code.
Explored different ways (and their Pros and Cons) in which Link Grammar can be used in xapian to provide POS tags.
Modified queryparser.lemony to ensure that negative numbers are not "hated" (i.e. the leading "-" is not treated as the hate operator)!
Added testcases for negative numbers.
Made a remote repo to keep track of the commits in the xapian main branch. Merged it with my working branch "mybranch".

Discussion regarding the error recovery flags/API etc. going on at present.
Writing class to extract POS using Link Grammar

Last modified on 26/01/16 10:10:43