Project Work Product
The goal of the project is to add a math equation search feature to the Xapian library. After a literature survey carried out with the help of my mentor, I decided to implement the approach described in the Tangent paper: https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf.
Details about the Tangent system and its implementation in Xapian are available on the project's main page.
All of my work is available in my fork of the Xapian codebase: guruhegde/xapian
Merged
The following work has been merged into Xapian master:
- Implementation of Dice Coefficient weight metric #196
Link to the commits: https://github.com/xapian/xapian/commits/master?author=guruhegde
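For reference, the Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it for two plain sets of terms. It only illustrates the formula and is not the code merged in #196; the function name `dice_coefficient` and the choice to return 0 for two empty sets are assumptions made for this example.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Dice coefficient between two term sets: 2|A ∩ B| / (|A| + |B|).
// Hypothetical helper for illustration; not part of the Xapian API.
// Returns 0.0 when both sets are empty (a convention chosen for this sketch).
double dice_coefficient(const std::set<std::string>& a,
                        const std::set<std::string>& b)
{
    if (a.empty() && b.empty()) return 0.0;
    std::size_t common = 0;
    for (const auto& term : a) {
        if (b.count(term)) ++common;  // term occurs in both sets
    }
    return 2.0 * common / (a.size() + b.size());
}
```

Two identical non-empty sets score 1, and two disjoint sets score 0.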
Under Review
The following work is currently under review and will be merged once the review is complete:
- Add unique terms bound stats to HoneyVersion class #209.
In Progress
The following work is still being implemented and has not yet reached completion:
- Prototype of math search #201 - This PR contains the bulk of the work done during the GSoC period: an implementation of the end-to-end system with a good number of tests.
- Speed up test-suite #210
This PR contains an initial prototype of the idea.
Code Coverage Report
Challenges
Presentation MathML is an XML application for representing mathematics. We need to parse MathML to generate features.
I decided to parse the MathML elements and generate the symbol layout tree in a single pass. This turned out to be tricky. The parser needs to account for syntax errors, for example by checking that every tag is closed. Generating the symbol layout tree additionally requires understanding the semantics, for example that an <mfrac> element must have exactly two children. Handling all the corner cases turned out to be quite involved and requires better planning.
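The kind of semantic check mentioned above can be sketched as a small arity table. This is only an illustration: the function name `has_valid_arity` is hypothetical, the table covers just a few Presentation MathML layout elements, and it is not taken from the MathMLParser code in the PR.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Required child counts for a few Presentation MathML layout elements.
// Hypothetical helper: returns true when `tag` is not in the table
// (no fixed arity is enforced here) or when the child count matches.
bool has_valid_arity(const std::string& tag, std::size_t child_count)
{
    static const std::map<std::string, std::size_t> required = {
        {"mfrac", 2}, {"msub", 2},   {"msup", 2},  {"msubsup", 3},
        {"mroot", 2}, {"munder", 2}, {"mover", 2}, {"munderover", 3},
    };
    auto it = required.find(tag);
    return it == required.end() || it->second == child_count;
}
```

Keeping such rules in one table makes it easier to report a malformed element before any layout tree nodes are created for it.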
We wanted to check the prototype early, so the first iteration covered only simple MathML expressions, which I parsed iteratively. During phase 3, I planned to implement a recursive approach. The dataset on which we wanted to evaluate the prototype has many equations that contain mrow elements, and a recursive approach seems a better way to parse such sub-expressions. Building the symbol layout tree while parsing recursively poses some challenges: for example, when an msub or msup element is encountered, the parsed child elements need to be added to the layout tree as branches.
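A minimal sketch of the recursive idea, assuming the MathML has already been parsed into a simple element tree. The `Element` and `Symbol` types and the `build` function are hypothetical names invented for this example, only `mi`, `mn`, `mo`, `msub`, `msup` and row-like containers are handled, and this is not the code in the PR.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical, already-parsed MathML element.
struct Element {
    std::string tag;
    std::string text;
    std::vector<Element> children;
};

// Hypothetical symbol layout tree node with three kinds of branches.
struct Symbol {
    std::string label;
    std::unique_ptr<Symbol> next, above, below;
};

// Follow `next` links to the last symbol on the baseline.
Symbol* tail(Symbol* s)
{
    while (s->next) s = s->next.get();
    return s;
}

// Recursively build a layout tree; returns nullptr for empty or
// malformed input.
std::unique_ptr<Symbol> build(const Element& e)
{
    if (e.tag == "mi" || e.tag == "mn" || e.tag == "mo") {
        auto s = std::make_unique<Symbol>();
        s->label = e.text;
        return s;
    }
    if (e.tag == "msub" || e.tag == "msup") {
        if (e.children.size() != 2) return nullptr;  // malformed script
        auto base = build(e.children[0]);
        if (!base) return nullptr;
        // Attach the script as a branch of the last base symbol.
        Symbol* last = tail(base.get());
        auto script = build(e.children[1]);
        if (e.tag == "msub") last->below = std::move(script);
        else last->above = std::move(script);
        return base;
    }
    // mrow, math and other containers: chain children left to right.
    std::unique_ptr<Symbol> head;
    Symbol* last = nullptr;
    for (const auto& child : e.children) {
        auto sub = build(child);
        if (!sub) continue;
        if (!head) {
            head = std::move(sub);
            last = tail(head.get());
        } else {
            last->next = std::move(sub);
            last = tail(last->next.get());
        }
    }
    return head;
}
```

For an mrow containing msub(x, i), + and y, this produces the chain x, +, y with i hanging below x.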
Future Work
- Currently, parsing MathML and generating the symbol layout tree are both done within the MathMLParser class. If we decouple them, any existing XML parser can be used.
(Looking back, this is the idea my mentors suggested, and it would have helped me complete the evaluation of the system too.)
- Complete the implementation of the parser to generate suitable terms from math formulae (symbol pair tuples in the case of the Tangent method).
- Evaluate the implementation on a suitable dataset.
- Try new approaches or improve the existing system.
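To illustrate the symbol pair tuples mentioned in the list above, here is a rough sketch that walks a symbol layout tree and emits one tuple per ancestor/descendant pair. The `Symbol` and `SymbolPair` types and both function names are hypothetical. The sketch assumes the distance is the number of edges on the path and the vertical offset is the count of above edges minus below edges; the exact definitions should be checked against the Tangent paper.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical symbol layout tree node.
struct Symbol {
    std::string label;
    std::unique_ptr<Symbol> next, above, below;
};

// Hypothetical tuple: (ancestor, descendant, path length, vertical offset).
struct SymbolPair {
    std::string ancestor, descendant;
    int distance;
    int vertical;
};

// Emit a pair (from, node) and continue to everything reachable from node.
void collect_from(const Symbol& from, const Symbol* node,
                  int distance, int vertical, std::vector<SymbolPair>& out)
{
    if (!node) return;
    out.push_back({from.label, node->label, distance, vertical});
    collect_from(from, node->next.get(), distance + 1, vertical, out);
    collect_from(from, node->above.get(), distance + 1, vertical + 1, out);
    collect_from(from, node->below.get(), distance + 1, vertical - 1, out);
}

// Emit pairs for every symbol in the tree paired with its descendants.
void collect_pairs(const Symbol* root, std::vector<SymbolPair>& out)
{
    if (!root) return;
    collect_from(*root, root->next.get(), 1, 0, out);
    collect_from(*root, root->above.get(), 1, 1, out);
    collect_from(*root, root->below.get(), 1, -1, out);
    collect_pairs(root->next.get(), out);
    collect_pairs(root->above.get(), out);
    collect_pairs(root->below.get(), out);
}
```

Each tuple could then be serialised into a string and indexed as a term, which is one possible way to map the pairs onto a term-based index.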