# Project Work Product

The goal of the project is to add a math equation search feature to the Xapian library. After a literature survey conducted with my mentor's help, I decided to implement the approach described in the Tangent paper: https://www.cs.rit.edu/~rlaz/files/sigir-tangent.pdf.

Details about the Tangent system and its implementation in Xapian are available on the project's main page.

All of my work is available in my fork of the Xapian codebase: guruhegde/xapian

## Merged

The following work has been merged into Xapian master:

- Implementation of Dice Coefficient weight metric #196

Link to the commits: https://github.com/xapian/xapian/commits/master?author=guruhegde

## Under Review

The following work is under review and will be merged once the review is complete:

- Add unique terms bound stats to the `HoneyVersion` class #209

## In Progress

The following work is still being implemented and is not yet complete:

- Prototype of math search #201

  This PR contains the bulk of the work done during the GSoC period: an end-to-end implementation of the system with a good number of tests.

- Speed up test-suite #210

  This PR contains an initial prototype of the idea.

### Code Coverage Report

### Challenges

`Presentation MathML` is an XML application for representing mathematics. We need to parse MathML to generate features.

I decided to parse the MathML elements and build the symbol layout tree in a single pass. This turned out to be tricky. As I understood it, the parser needs to account for syntax errors, for example by checking that every tag is closed. Similarly, generating the symbol layout tree requires understanding the semantics, for example that an `<mfrac>` element can have only two children. Handling all the corner cases turned out to be quite involved and requires better planning.
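The two kinds of checks described above can be sketched separately. The following is a minimal illustration in Python, not the C++ code from the PR: it leans on the standard `xml.etree.ElementTree` parser for syntax errors and uses a hypothetical `REQUIRED_CHILDREN` table for the child-count rule. The names `check_arity` and `validate` are made up for this example, and only a handful of elements are covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical table for illustration: required child counts for a few
# Presentation MathML elements. A real parser would cover more elements.
REQUIRED_CHILDREN = {"mfrac": 2, "msub": 2, "msup": 2, "msubsup": 3}


def check_arity(node):
    """Return error strings for elements with the wrong number of children."""
    errors = []
    tag = node.tag.rsplit("}", 1)[-1]  # drop an XML namespace prefix, if any
    expected = REQUIRED_CHILDREN.get(tag)
    if expected is not None and len(node) != expected:
        errors.append(f"<{tag}> expects {expected} children, found {len(node)}")
    for child in node:
        errors.extend(check_arity(child))
    return errors


def validate(mathml):
    """Parse a MathML string and report syntax or child-count problems."""
    try:
        root = ET.fromstring(mathml)  # raises ParseError on unclosed tags
    except ET.ParseError as exc:
        return [f"syntax error: {exc}"]
    return check_arity(root)
```

For instance, `validate("<math><mfrac><mi>a</mi></mfrac></math>")` returns one error mentioning `mfrac`, while a well-formed fraction with two children returns an empty list.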

We wanted to check the prototype early, so the first iteration covered only simple MathML expressions, which I parsed iteratively. During phase 3, I planned to implement a recursive approach. The dataset on which we wanted to evaluate the prototype has a lot of equations that contain `mrow` elements, and a recursive approach seems a better way to parse such sub-expressions. Building the symbol layout tree while parsing recursively poses some challenges: for example, when an `msub` or `msup` element is encountered, the parsed child elements need to be added to the layout tree as branches.
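To make the recursive idea concrete, here is a small Python sketch. It is an illustration only, not the `MathMLParser` implementation: the `Node` class, its `next`/`above`/`below` fields, and the `parse` and `parse_mathml` helpers are assumptions made for this example, and only `math`, `mrow`, `mi`, `mn`, `mo`, `msub`, and `msup` are handled. Each call returns the first and last node of the chain it built, so that an enclosing `mrow` can link siblings and an `msub` or `msup` can attach its script as a branch.

```python
import xml.etree.ElementTree as ET


class Node:
    """One symbol in a simplified symbol layout tree (hypothetical layout)."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.next = None    # symbol to the right on the same baseline
        self.above = None   # superscript branch
        self.below = None   # subscript branch


def parse(elem):
    """Recursively convert an element into a (first, last) chain of nodes."""
    tag = elem.tag.rsplit("}", 1)[-1]  # drop an XML namespace prefix, if any
    if tag in ("mi", "mn", "mo"):
        node = Node((elem.text or "").strip())
        return node, node
    if tag in ("math", "mrow"):
        first = last = None
        for child in elem:
            head, tail = parse(child)
            if head is None:        # skip empty sub-expressions
                continue
            if first is None:
                first = head
            else:
                last.next = head    # link to the previous sibling's tail
            last = tail
        return first, last
    if tag in ("msub", "msup"):
        base, script = elem         # raises ValueError unless exactly two children
        head, tail = parse(base)
        branch, _ = parse(script)
        if tag == "msup":
            tail.above = branch
        else:
            tail.below = branch
        return head, tail
    raise ValueError(f"unsupported element: {tag}")


def parse_mathml(text):
    """Return the first node of the layout tree for a MathML string."""
    first, _ = parse(ET.fromstring(text))
    return first
```

Returning both ends of the chain is one way to avoid re-walking the tree when appending siblings; an empty base inside `msub` or `msup` is not handled in this sketch.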

## Future Work

- Currently, parsing MathML and generating the symbol layout tree are both done within the `MathMLParser` class. If we decouple them, any existing XML parser can be used.

  (Looking back, this is the idea the mentors suggested, and it would have helped me complete the evaluation of the system too.)

- Complete the implementation of the parser to generate suitable terms from math formulae (symbol pair tuples in the case of the Tangent method).
- Evaluate the implementation on a suitable dataset.
- Try new approaches or improve the existing system.
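
As a rough illustration of the symbol pair tuples mentioned in the list above, the Python sketch below enumerates `(s1, s2, distance, vertical offset)` tuples from a small hand-built layout tree. This is a simplified reading of the Tangent paper, not the Xapian implementation: the `Node` dataclass and the `descendants` and `symbol_pairs` helpers are hypothetical, distance is taken to be the number of edges between the two symbols, and the vertical offset is assumed to go up by one for a superscript edge and down by one for a subscript edge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One symbol in a simplified symbol layout tree (hypothetical layout)."""
    symbol: str
    next: Optional["Node"] = None    # right neighbour on the same baseline
    above: Optional["Node"] = None   # superscript branch
    below: Optional["Node"] = None   # subscript branch


def descendants(node, dist=0, offset=0):
    """Yield (symbol, distance, vertical offset) for every node under `node`."""
    for child, step in ((node.next, 0), (node.above, 1), (node.below, -1)):
        if child is not None:
            yield child.symbol, dist + 1, offset + step
            yield from descendants(child, dist + 1, offset + step)


def symbol_pairs(root):
    """Return all (s1, s2, distance, vertical offset) tuples in the tree."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        pairs.extend((node.symbol, s, d, v) for s, d, v in descendants(node))
        stack.extend(c for c in (node.next, node.above, node.below) if c is not None)
    return pairs
```

For the expression x² + y, built as `Node("x", above=Node("2"), next=Node("+", next=Node("y")))`, this yields four tuples, including `("x", "2", 1, 1)` and `("x", "y", 2, 0)`. Each tuple could then be serialized into a term for indexing.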