Project Work Product
The goal of the project is to add a math equation search feature to the Xapian library. I conducted a literature survey with the help of my mentor and decided to implement the approach described in the Tangent paper: https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf.
Details about the Tangent system and its implementation in Xapian are available on the project's main page.
All of my work is available in my fork of the Xapian codebase: guruhegde/xapian
The following lists the work that has been merged in Xapian master:
- Implementation of Dice Coefficient weight metric #196
Link to the commits: https://github.com/xapian/xapian/commits/master?author=guruhegde
The following lists the work that is currently in progress and will be merged once review is complete.
The following lists the work that is still being implemented and is not yet complete:
- Prototype of math search #201 - This PR contains the bulk of the work done during the GSoC period.
It implements the end-to-end system and includes a good number of tests.
- Speed up test-suite #210
This PR contains an initial prototype of the idea.
Code Coverage Report
Presentation MathML is an XML application for representing mathematics. We need to parse MathML to generate features.
I decided to parse the MathML elements and generate the symbol layout tree in a single pass. This turned out to be tricky. As I understand it, the parser needs to account for syntax errors, for example by checking that every tag is closed. Similarly, generating the symbol layout tree requires understanding the semantics of each element; for example, an <mfrac> element must have exactly two children. Handling all the corner cases turned out to be quite involved and requires better planning.
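The two kinds of checks described above can be illustrated with a minimal Python sketch. It is not the code from the PR: it assumes Python's standard xml.etree.ElementTree parser is used to catch syntax errors, and the names EXPECTED_CHILDREN, local_name and check_mathml are hypothetical helpers introduced only for this example. The child counts cover just a few Presentation MathML elements.

```python
import xml.etree.ElementTree as ET

# Number of children required by a few Presentation MathML layout elements.
# (Hypothetical table for illustration; it is far from complete.)
EXPECTED_CHILDREN = {"mfrac": 2, "msup": 2, "msub": 2, "msubsup": 3}


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/1998/Math/MathML}'."""
    return tag.rsplit("}", 1)[-1]


def check_mathml(source):
    """Return a list of problems; an empty list means none were detected."""
    try:
        # Syntax level: unclosed or mismatched tags raise ParseError here.
        root = ET.fromstring(source)
    except ET.ParseError as err:
        return [f"syntax error: {err}"]

    problems = []
    # Semantic level: check the child count of every known layout element.
    for elem in root.iter():
        name = local_name(elem.tag)
        expected = EXPECTED_CHILDREN.get(name)
        if expected is not None and len(elem) != expected:
            problems.append(
                f"<{name}> expects {expected} children, found {len(elem)}"
            )
    return problems
```

Separating the two levels like this keeps the syntax handling inside the XML parser, so only the semantic rules have to be written by hand.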
We wanted to test the prototype early, so the first iteration covered only simple MathML expressions, which I parsed iteratively. For phase 3, I planned to switch to a recursive approach. The dataset on which we wanted to evaluate the prototype contains many equations with mrow elements, and a recursive approach seems a better way to parse such sub-expressions. Building the symbol layout tree while parsing recursively poses some challenges: for example, when an msup element is encountered, its parsed child elements need to be added to the layout tree as branches.
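A rough Python sketch of the recursive idea follows. It is only an illustration under simplifying assumptions, not the MathMLParser code: the Node class, build and to_text are hypothetical names, only mi, mn, mo, mrow, msup and msub are handled, and a script is attached to the last symbol of its base.

```python
import xml.etree.ElementTree as ET


class Node:
    """One symbol in a simplified symbol layout tree (hypothetical structure)."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.next = None   # symbol to the right on the same baseline
        self.above = None  # superscript branch
        self.below = None  # subscript branch


def build(elem):
    """Recursively convert elem into a chain of nodes; return (head, tail)."""
    tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace
    if tag in ("mi", "mn", "mo"):
        node = Node((elem.text or "").strip())
        return node, node
    if tag in ("math", "mrow"):
        head = tail = None
        for child in elem:
            c_head, c_tail = build(child)
            if c_head is None:  # skip empty sub-expressions
                continue
            if head is None:
                head = c_head
            else:
                tail.next = c_head  # link sub-expressions left to right
            tail = c_tail
        return head, tail
    if tag in ("msup", "msub"):
        base, script = list(elem)  # raises ValueError unless exactly 2 children
        head, tail = build(base)
        s_head, _ = build(script)
        if tail is None:
            raise ValueError(f"<{tag}> has an empty base")
        # The script becomes a branch of the last symbol of the base.
        if tag == "msup":
            tail.above = s_head
        else:
            tail.below = s_head
        return head, tail
    raise ValueError(f"unsupported element <{tag}>")


def to_text(node):
    """Serialise the tree in a LaTeX-like form, handy for debugging."""
    if node is None:
        return ""
    out = node.symbol
    if node.above is not None:
        out += "^{" + to_text(node.above) + "}"
    if node.below is not None:
        out += "_{" + to_text(node.below) + "}"
    return out + to_text(node.next)
```

Returning both the head and the tail of each sub-expression is what lets an mrow link its children and lets msup attach its branch without walking the chain again.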
- Currently, parsing MathML and generating the symbol layout tree are both done within the MathMLParser class. If we decouple the two, any existing XML parser can be used.
(Looking back, this is the idea the mentors suggested, and following it would have helped me complete the evaluation of the system too.)
- Complete the implementation of the parser to generate suitable terms from math formulae (symbol pair tuples in the case of the Tangent method).
- Evaluate the implementation on a suitable dataset.
- Try new approaches or improve the existing system.
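To make the symbol pair tuples mentioned in the list above more concrete, here is a small Python sketch. It is a simplified reading of the Tangent idea, not the Xapian implementation: it assumes each tuple is (ancestor, descendant, distance, vertical offset), that distance is the number of edges between the two symbols, and that the vertical offset is the count of superscript edges minus subscript edges on that path. The Tangent paper gives the exact definitions. Node, descendants, all_nodes and symbol_pairs are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One symbol in a simplified symbol layout tree (hypothetical structure)."""
    symbol: str
    next: Optional["Node"] = None   # symbol to the right on the baseline
    above: Optional["Node"] = None  # superscript branch
    below: Optional["Node"] = None  # subscript branch


def descendants(node, dist=1, vert=0):
    """Yield (descendant, distance, vertical offset) for every node below node."""
    for child, step in ((node.next, 0), (node.above, 1), (node.below, -1)):
        if child is not None:
            yield child, dist, vert + step
            yield from descendants(child, dist + 1, vert + step)


def all_nodes(root):
    """Yield root followed by every node reachable from it."""
    yield root
    for child, _, _ in descendants(root):
        yield child


def symbol_pairs(root):
    """Return (ancestor, descendant, distance, vertical offset) tuples."""
    return [
        (anc.symbol, desc.symbol, dist, vert)
        for anc in all_nodes(root)
        for desc, dist, vert in descendants(anc)
    ]
```

For the expression x^2 + y, built as Node("x", above=Node("2"), next=Node("+", next=Node("y"))), this yields four tuples: three with x as the ancestor and one for the pair (+, y). Each tuple could then be encoded as a single index term.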