Project Work Product
The goal of the project is to add a math equation search feature to the Xapian library. I conducted a literature survey with the mentor's help and decided to implement the approach described in the Tangent paper: https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf.
Details about the Tangent system and its implementation in Xapian are available on the project's main page.
All of my work is available in my fork of the Xapian codebase: guruhegde/xapian
Merged
The following work has been merged into Xapian master:
- Implementation of Dice Coefficient weight metric #196
Link to the commits: https://github.com/xapian/xapian/commits/master?author=guruhegde
Under Review
The following work is under review and will be merged once the review is complete:
- Add unique terms bound stats to HoneyVersion class #209.
In Progress
The following work is still being implemented and is not yet complete:
- Prototype of math search #201: this PR contains the bulk of the work done during the GSoC period.
It implements the end-to-end system and includes a good number of tests.
- Speed up test-suite #210
This PR contains an initial prototype of the idea.
Code Coverage Report
Challenges
Presentation MathML is an XML application for representing mathematics. We need to parse MathML to generate features.
I decided to parse the MathML elements and build the symbol layout tree in a single pass. This turned out to be tricky. The parser has to account for syntax errors, for example by checking that every tag is closed. Building the symbol layout tree also requires understanding the semantics, for example that an <mfrac> element must have exactly two children. Handling all the corner cases turned out to be quite involved and requires better planning.
We wanted to test the prototype early, so the first iteration covered only simple MathML expressions, which I parsed iteratively. During phase 3, I planned to implement a recursive approach. The dataset on which we wanted to evaluate the prototype has many equations that contain mrow elements, and a recursive approach seems better suited to parsing such sub-expressions. Building the symbol layout tree during recursive parsing poses some challenges. For example, when an msub or msup element is encountered, the parsed child elements need to be added to the layout tree as branches.
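The recursive approach described above can be illustrated with a small Python sketch. It is not the Xapian `MathMLParser` code: the `Node` class, the `build` function, and the choice to model a fraction as a `frac` node with `above` and `below` branches are all assumptions made for illustration. Well-formedness checks (such as unclosed tags) are left to Python's standard `xml.etree.ElementTree` parser, while the semantic checks (such as the child count of `mfrac`) happen while the tree is built.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One symbol in the layout tree (hypothetical structure for this sketch)."""
    symbol: str
    next: Optional["Node"] = None   # symbol to the right on the same baseline
    above: Optional["Node"] = None  # superscript or numerator branch
    below: Optional["Node"] = None  # subscript or denominator branch


def local(tag):
    """Strip an XML namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def rightmost(node):
    """Follow `next` links to the last symbol on the baseline."""
    while node.next is not None:
        node = node.next
    return node


def build(elem):
    """Return the first node of the baseline chain for `elem`, or None if empty."""
    tag = local(elem.tag)
    children = list(elem)

    if tag in ("mi", "mn", "mo"):
        # Token elements become leaf symbols.
        return Node((elem.text or "").strip())

    if tag in ("math", "mrow"):
        # Parse each sub-expression recursively and chain them left to right.
        head = tail = None
        for child in children:
            sub = build(child)
            if sub is None:
                continue
            if head is None:
                head = sub
            else:
                tail.next = sub
            tail = rightmost(sub)
        return head

    if tag in ("msub", "msup"):
        if len(children) != 2:
            raise ValueError(f"<{tag}> needs exactly two children")
        base, script = build(children[0]), build(children[1])
        if base is None:
            raise ValueError(f"<{tag}> has an empty base")
        # The script hangs off the last symbol of the base as a branch.
        anchor = rightmost(base)
        if tag == "msub":
            anchor.below = script
        else:
            anchor.above = script
        return base

    if tag == "mfrac":
        if len(children) != 2:
            raise ValueError("<mfrac> needs exactly two children")
        bar = Node("frac")  # assumed placeholder symbol for the fraction bar
        bar.above = build(children[0])
        bar.below = build(children[1])
        return bar

    raise ValueError(f"unsupported element <{tag}>")
```

Calling `build(ET.fromstring(mathml_string))` returns the leftmost symbol of the expression. Malformed XML is rejected by `ET.fromstring` with `ET.ParseError` before `build` runs, and elements outside this small subset raise `ValueError`.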
Future Work
- Currently, parsing MathML and generating the symbol layout tree are both done within the MathMLParser class. If we decouple them, then any existing XML parser can be used.
(Looking back, this is the idea the mentors suggested, and it would have helped me complete the evaluation of the system too.)
- Complete the implementation of the parser so that it generates suitable terms from math formulae (symbol pair tuples in the case of the Tangent method).
- Evaluate the implementation on a suitable dataset.
- Try new approaches or improve the existing system.
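To make the term generation item more concrete, here is a rough Python sketch of how symbol pair tuples could be enumerated from a symbol layout tree. The `Node` class and the function names are hypothetical, and the distance convention is an assumption of this sketch: every edge adds 1 to the distance `d`, and an `above` or `below` edge changes the vertical offset `v` by +1 or -1. The exact definition used by Tangent should be checked against the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One symbol in the layout tree (hypothetical structure for this sketch)."""
    symbol: str
    next: Optional["Node"] = None   # symbol to the right on the same baseline
    above: Optional["Node"] = None  # superscript branch
    below: Optional["Node"] = None  # subscript branch


# Edge name and its assumed contribution to the vertical offset.
EDGES = (("next", 0), ("above", 1), ("below", -1))


def descendants(node, d=0, v=0):
    """Yield (descendant, d, v) for every node reachable from `node`."""
    for attr, dv in EDGES:
        child = getattr(node, attr)
        if child is not None:
            yield child, d + 1, v + dv
            yield from descendants(child, d + 1, v + dv)


def all_nodes(root):
    """Yield the root and every node below it."""
    yield root
    for child, _, _ in descendants(root):
        yield child


def symbol_pairs(root):
    """Return (ancestor, descendant, d, v) tuples for all ancestor/descendant pairs."""
    return [
        (a.symbol, b.symbol, d, v)
        for a in all_nodes(root)
        for b, d, v in descendants(a)
    ]


# Layout tree for x^2 + y, built by hand for the example.
tree = Node("x", above=Node("2"), next=Node("+", next=Node("y")))
pairs = symbol_pairs(tree)
```

Under the conventions assumed here, `pairs` for `x^2 + y` contains four tuples: `("x", "+", 1, 0)`, `("x", "y", 2, 0)`, `("x", "2", 1, 1)` and `("+", "y", 1, 0)`. Each tuple could then be serialized into a string and indexed as a term.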
