Math formulas are generally represented in MathML format. Currently, we support only the `Presentation MathML` format.
The equation x = 2 + b / c in Presentation MathML format:
<math> <mi> x </mi> <mo> = </mo> <mn> 2 </mn> <mo> + </mo> <mfrac> <mi> b </mi> <mi> c </mi> </mfrac> </math>
We parse the math equation and create a symbol layout structure, which is a visual representation of the MathML markup. In this structure, the symbols of the equation are connected by edges, and each edge represents the spatial relationship between the two symbols it connects. The spatial relationship can be above, below, adjacent, within, etc.
Symbol layout structure of the above equation:
Symbol pair tuples are generated from the layout tree structure by taking combinations of symbol pairs that lie within a certain path distance of each other. Symbol pair tuple format: [S1, S2, path with spatial relation]. For example, [V!xO!=N], where N stands for next.
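The tuple generation described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Node` and `Edge` types, the function names, and the `max_distance` parameter are all hypothetical. It also assumes that each symbol is paired only with symbols below it in the layout tree, and that the tuple text is simply S1, S2 and the relation path concatenated (written here without the surrounding brackets).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical layout-tree types (illustrative only, not Xapian API).
struct Edge {
    char relation;      // spatial relation letter, e.g. 'N' for next
    std::size_t child;  // index of the child node in the tree vector
};

struct Node {
    std::string label;        // typed symbol, e.g. "V!x" or "O!="
    std::vector<Edge> edges;  // outgoing edges to child symbols
};

// Emit one tuple for every symbol reachable below `current`, pairing it
// with `ancestor`, as long as the path stays within `max_distance` edges.
// `path` holds the relation letters walked so far from `ancestor`.
void collect_pairs(const std::vector<Node>& tree, std::size_t ancestor,
                   std::size_t current, std::string& path,
                   std::size_t max_distance, std::vector<std::string>& out)
{
    if (path.size() >= max_distance) return;
    for (const Edge& e : tree[current].edges) {
        assert(e.child < tree.size());
        path.push_back(e.relation);
        // Assumed tuple text: S1, S2 and the relation path concatenated.
        out.push_back(tree[ancestor].label + tree[e.child].label + path);
        collect_pairs(tree, ancestor, e.child, path, max_distance, out);
        path.pop_back();
    }
}

// Generate tuples starting from every node of the tree in turn.
std::vector<std::string> symbol_pairs(const std::vector<Node>& tree,
                                      std::size_t max_distance)
{
    std::vector<std::string> out;
    std::string path;
    for (std::size_t i = 0; i < tree.size(); ++i)
        collect_pairs(tree, i, i, path, max_distance, out);
    return out;
}
```

For a two-node chain in which `V!x` has an `N` edge to `O!=`, calling `symbol_pairs` with a distance of 1 yields the single string `V!xO!=N`, which matches the example tuple above.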
==Key points about implementation in Xapian==
- The math term structure (symbol pair tuple) is different from the terms generated from free text, so we can't use the existing `TermGenerator` class. We decided to add a new API class, `MathTermGenerator`, to handle equations in MathML format.
- I planned to store the tree structure in a `std::vector`. This avoids frequent heap memory allocations and hence gives better performance. I use the equation size as a heuristic to estimate the tree structure size and the symbol pair tuple size. These estimates are used to preallocate capacity for the `std::vector` to avoid frequent reallocations. Once the symbol pair tuples have been generated from the layout tree, the memory for the tree is released.
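
The storage strategy in the last point might look something like the sketch below. Everything here is an assumption made for illustration: the `LayoutTree` and `LayoutNode` names, the index-based parent links, and in particular the `estimate_node_count` heuristic (one node per 10 bytes of MathML, roughly the length of `<mi>x</mi>`) are not taken from the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node stored by value in one contiguous vector.
struct LayoutNode {
    std::string label;   // typed symbol, e.g. "V!x"
    std::size_t parent;  // index of the parent node, or LayoutTree::npos
    char relation;       // spatial relation to the parent, e.g. 'N'
};

// Assumed heuristic: a minimal element such as "<mi>x</mi>" takes about
// 10 bytes, so the byte count divided by 10 bounds the number of symbols.
std::size_t estimate_node_count(std::size_t mathml_bytes)
{
    return mathml_bytes / 10 + 1;
}

class LayoutTree {
    std::vector<LayoutNode> nodes;

  public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    // Reserve capacity up front so that adding nodes does not reallocate
    // while the estimate holds.
    explicit LayoutTree(std::size_t mathml_bytes)
    {
        nodes.reserve(estimate_node_count(mathml_bytes));
    }

    // Append a node and return its index; children refer to parents by index.
    std::size_t add(const std::string& label, std::size_t parent, char relation)
    {
        assert(parent == npos || parent < nodes.size());
        nodes.push_back(LayoutNode{label, parent, relation});
        return nodes.size() - 1;
    }

    std::size_t size() const { return nodes.size(); }
    std::size_t capacity() const { return nodes.capacity(); }

    // Give the memory back once the tuples have been generated. Swapping
    // with an empty temporary frees the buffer, unlike clear().
    void release() { std::vector<LayoutNode>().swap(nodes); }
};
```

Linking nodes by index rather than by pointer keeps the whole tree in a single allocation, and the links stay valid even if the estimate turns out too small and the vector has to grow.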