
Project Work Product

The goal of the project is to add a math equation search feature to the Xapian library. After a literature survey conducted with the help of my mentor, I decided to implement the approach described in the Tangent paper: https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf

Details about the Tangent system and its implementation in Xapian are available on the project's main page.

All of my work is available in my fork of the Xapian codebase: guruhegde/xapian

Merged

The following work has been merged into Xapian master:

  • Implementation of Dice Coefficient weight metric #196

Link to the commits: https://github.com/xapian/xapian/commits/master?author=guruhegde
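
For readers unfamiliar with the Dice coefficient, here is a minimal sketch of the textbook set-based definition, 2|A ∩ B| / (|A| + |B|). It is written in Python for brevity and is not the code merged in #196; the function name and the handling of two empty sets are assumptions made for this illustration.

```python
def dice_coefficient(query_terms, doc_terms):
    """Dice coefficient of two term collections, each treated as a set."""
    q, d = set(query_terms), set(doc_terms)
    if not q and not d:
        # Convention chosen for this sketch: two empty sets score 0.
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))


if __name__ == "__main__":
    # Two of the three terms on each side are shared.
    print(dice_coefficient(["a", "b", "c"], ["b", "c", "d"]))
```

The score is 1 when both sides contain exactly the same terms and 0 when they share none.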

Under Review

The following lists the work that is currently in progress and will be merged once review is completed.

In Progress

The following lists the work that is still being implemented and has not yet reached completion.

  • Prototype of math search #201 - This PR contains the bulk of the work done during the GSoC period.

This PR contains the implementation of an end-to-end system with a good number of tests.

This PR contains an initial prototype of the idea.

Code Coverage Report

Code coverage summary

Challenges

  • Presentation MathML is an XML application for representing mathematics. We need to parse MathML to generate features.

I decided to parse the MathML elements and generate the symbol layout tree in a single pass. This turned out to be tricky. As I understand it, the parser needs to account for syntax errors, for example by checking that every tag is closed. Similarly, generating the symbol layout tree requires understanding the semantics of each element, for example that an <mfrac> element must have exactly two children. Handling all the corner cases turned out to be quite involved and requires better planning.
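
The two kinds of checks described above can be sketched as follows. This is an illustrative Python example, not the MathMLParser code from the PR: it leaves well-formedness (closed tags and so on) to the standard library's xml.etree.ElementTree and then applies a small arity table. The table and the names check_mathml and local_name are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Required child counts for a few Presentation MathML layout elements.
EXPECTED_CHILDREN = {"mfrac": 2, "msub": 2, "msup": 2, "msubsup": 3}


def local_name(tag):
    """Drop a leading '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def check_mathml(source):
    """Return a list of problems; an empty list means none were detected."""
    try:
        # Syntax level: unclosed or mismatched tags raise ParseError.
        root = ET.fromstring(source)
    except ET.ParseError as err:
        return [f"syntax error: {err}"]

    problems = []
    # Semantic level: check the number of children of each known element.
    for elem in root.iter():
        name = local_name(elem.tag)
        expected = EXPECTED_CHILDREN.get(name)
        if expected is not None and len(elem) != expected:
            problems.append(
                f"<{name}> expects {expected} children, found {len(elem)}"
            )
    return problems


if __name__ == "__main__":
    print(check_mathml("<math><mfrac><mi>a</mi></mfrac></math>"))
```

Keeping the arity rules in a table makes it easier to see which elements are covered and which corner cases are still missing.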

We wanted to test the prototype early, so the first iteration covered only simple MathML expressions, which I parsed iteratively. During phase 3, I planned to switch to a recursive approach: the dataset on which we wanted to evaluate the prototype contains many equations that use the mrow element, and recursion seems a better way to parse such sub-expressions. Building the symbol layout tree during recursive parsing poses its own challenges. For example, when an msub or msup element is encountered, the parsed child elements need to be added to the layout tree as branches.
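
A minimal sketch of the recursive idea, in Python rather than the C++ of the PR. It assumes a simplified symbol layout tree in which each node has at most one "next", "above" and "below" branch, supports only mi, mn, mo, mrow, msub and msup, and attaches a script to the last symbol of its base. The Node class and the build function are hypothetical names used only here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    symbol: str
    next: Optional["Node"] = None   # symbol to the right on the same baseline
    above: Optional["Node"] = None  # superscript branch
    below: Optional["Node"] = None  # subscript branch


def build(elem):
    """Return (first, last) nodes of the baseline chain built from elem."""
    name = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace
    if name in ("mi", "mn", "mo"):
        node = Node((elem.text or "").strip())
        return node, node
    if name in ("math", "mrow"):
        first = last = None
        for child in elem:
            head, tail = build(child)  # recurse into the sub-expression
            if first is None:
                first = head
            else:
                last.next = head       # chain siblings left to right
            last = tail
        if first is None:
            raise ValueError(f"empty <{name}>")
        return first, last
    if name in ("msub", "msup"):
        if len(elem) != 2:
            raise ValueError(f"<{name}> needs exactly two children")
        base_head, base_tail = build(elem[0])
        script_head, _ = build(elem[1])
        # The script becomes a branch of the base, not part of the baseline.
        if name == "msup":
            base_tail.above = script_head
        else:
            base_tail.below = script_head
        return base_head, base_tail
    raise ValueError(f"unsupported element <{name}>")


if __name__ == "__main__":
    xml = ("<math><mrow><msup><mi>x</mi><mn>2</mn></msup>"
           "<mo>+</mo><mi>y</mi></mrow></math>")
    root, _ = build(ET.fromstring(xml))
    print(root)
```

Returning both ends of the chain lets an enclosing mrow link the next sibling without walking the chain again. Note that the XML parsing here is done by an off-the-shelf parser, and only the tree construction is custom.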

Future Work

  • Currently, parsing MathML and generating the symbol layout tree are both done within the MathMLParser class. If we decouple the two, any existing XML parser can be used.

(Looking back, this is the idea the mentors suggested, and following it would have helped me complete the evaluation of the system too.)

  • Complete the implementation of the parser to generate suitable terms from math formulae (symbol pair tuples in the case of the Tangent method).
  • Evaluate the implementation on a suitable dataset.
  • Try new approaches or improve the existing system.
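
To illustrate what generating symbol pair tuples from a layout tree could look like, here is a simplified Python sketch. It loosely follows the symbol-pair idea of the Tangent method, but the tuple layout used here (ancestor symbol, descendant symbol, path length, net baseline change) and the Node class are assumptions for this example. The exact definition in the paper and in the PR may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    symbol: str
    next: Optional["Node"] = None   # symbol to the right on the same baseline
    above: Optional["Node"] = None  # superscript branch
    below: Optional["Node"] = None  # subscript branch


# Baseline change contributed by each edge type (assumed encoding).
EDGES = (("next", 0), ("above", 1), ("below", -1))


def descendants(node, dist=0, vert=0):
    """Yield (descendant, path length, vertical offset) for all nodes under node."""
    for attr, dv in EDGES:
        child = getattr(node, attr)
        if child is not None:
            yield child, dist + 1, vert + dv
            yield from descendants(child, dist + 1, vert + dv)


def symbol_pairs(root):
    """Return one tuple per (ancestor, descendant) pair in the tree."""
    nodes = [root] + [child for child, _, _ in descendants(root)]
    return [
        (anc.symbol, desc.symbol, dist, vert)
        for anc in nodes
        for desc, dist, vert in descendants(anc)
    ]


if __name__ == "__main__":
    # Layout tree for x^2 + y, built by hand.
    tree = Node("x", above=Node("2"), next=Node("+", next=Node("y")))
    for pair in symbol_pairs(tree):
        print(pair)
```

Each tuple could then be serialised into a string and indexed as a term, which is the step that remains to be completed in the parser.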
Last modified on 09/08/18 10:40:36