Opened 6 years ago

Last modified 6 years ago

#761 assigned task

Implement symbol layout tree from presentation MathML expression

Reported by: Guruprasad Hegde
Owned by: Guruprasad Hegde
Priority: normal
Milestone:
Component: Other
Version:
Severity: normal
Keywords:
Cc: Olly Betts, Gaurav Arora
Blocked By:
Blocking:
Operating System: All

Description

This ticket is to discuss the implementation of the symbol layout tree.

Some background information:

Presentation MathML

Presentation MathML is one of the formats used to represent math expressions in documents. Presentation elements are broadly classified into two types:

  • Token elements
    • mi, mo, mn: these elements correspond to a visible symbol: an identifier (mi), an operator such as +, / or % (mo), or a number (mn).
  • Layout schemata
    • mrow, mfrac, msqrt, mroot, mfenced: these elements are used to represent fractions and radicals, or to group subexpressions.
    • msub, msup, msubsup, munder, mover, mmultiscripts: these elements are used to attach scripts (subscripts, superscripts, underscripts, overscripts) to a base.
    • mtable, mtr, mtd: these elements correspond to tables, matrices, and vectors.
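To make the two groups concrete, here is a minimal Python sketch that walks a small presentation MathML fragment and reports which group each element belongs to. It is for illustration only: it uses Python's standard xml.etree.ElementTree rather than the hand-written parser proposed in this ticket, and the names classify, walk and SAMPLE are made up for the example.

```python
import xml.etree.ElementTree as ET

# Element groups exactly as listed above.
TOKEN_ELEMENTS = {"mi", "mo", "mn"}
LAYOUT_ELEMENTS = {
    "mrow", "mfrac", "msqrt", "mroot", "mfenced",
    "msub", "msup", "msubsup", "munder", "mover", "mmultiscripts",
    "mtable", "mtr", "mtd",
}


def classify(tag):
    """Return 'token', 'layout' or 'other' for a MathML element name."""
    if tag in TOKEN_ELEMENTS:
        return "token"
    if tag in LAYOUT_ELEMENTS:
        return "layout"
    return "other"


# x^2 + 1 in presentation MathML (no XML namespace, to keep the sketch short).
SAMPLE = (
    "<math><mrow><msup><mi>x</mi><mn>2</mn></msup>"
    "<mo>+</mo><mn>1</mn></mrow></math>"
)


def walk(mathml):
    """List (tag, kind, text) for every element, in document order."""
    root = ET.fromstring(mathml)
    return [
        (el.tag, classify(el.tag), (el.text or "").strip())
        for el in root.iter()
    ]


for tag, kind, text in walk(SAMPLE):
    print(tag, kind, text)
```

The token elements carry the visible symbols (x, 2, +, 1), while the layout elements only describe how those symbols are arranged.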

Symbol layout tree

Generally, a math expression is a group of symbols (integers, operators, summation and integral signs, etc.) written on a horizontal line, with special structures such as subscripts, superscripts and the limits of an integral or summation written above or below that line. The tree is built by traversing the expression from left to right, starting with the first symbol. It will be a deep tree, with branches representing scripts or radical indices.

Each node in the tree represents either a symbol or a grouping construct such as a table, vector, matrix or parenthesized expression.

Every node is assigned a label. A label has two parts: node_type and value. The node type can be integer, operator, variable, matrix, etc., and reflects the kind of value stored in the node. For example, to represent the integer 2 in the symbol tree, a node is created with the label N!2.
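The label format described above could be sketched in Python as follows. Only the N prefix for integers comes from the example in this ticket; the V (variable) and O (operator) prefixes, and the names Label and label_for_token, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """A node label: a node type plus the value stored in the node."""
    node_type: str
    value: str

    def __str__(self):
        # Rendered as <node_type>!<value>, e.g. N!2 for the integer 2.
        return f"{self.node_type}!{self.value}"


# Mapping from MathML token element to node type.
# "N" is taken from the ticket; "V" and "O" are assumed for this sketch.
NODE_TYPE_FOR_TAG = {"mn": "N", "mi": "V", "mo": "O"}


def label_for_token(tag, text):
    """Build the label for a token element such as <mn>2</mn>."""
    return Label(NODE_TYPE_FOR_TAG[tag], text)


print(label_for_token("mn", "2"))  # N!2
```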

Every edge represents a spatial relationship between two adjacent symbols. For example, an edge of type next means that the two symbols are adjacent on a horizontal line, and an edge of type above means that the parent node is the base and the child node is its superscript.
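As a rough illustration of how nodes and edges fit together, the Python sketch below builds a tiny tree for x^2 + 1 using only the next and above edges described here. It handles just math, mrow, msup and the token elements, and it relies on Python's xml.etree.ElementTree instead of the custom parser proposed in this ticket. The Node class, the build and render helpers, and the V and O label prefixes are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Only "N" is from the ticket; "V" and "O" are assumed prefixes.
PREFIX = {"mn": "N", "mi": "V", "mo": "O"}


class Node:
    def __init__(self, label):
        self.label = label
        self.edges = {}  # edge type ("next", "above") -> child Node


def build(el):
    """Return the (first, last) nodes of the horizontal chain for el."""
    if el.tag in PREFIX:
        node = Node(f"{PREFIX[el.tag]}!{(el.text or '').strip()}")
        return node, node
    if el.tag in ("math", "mrow"):
        first = last = None
        for child in el:
            c_first, c_last = build(child)
            if c_first is None:  # empty group: nothing to link
                continue
            if first is None:
                first = c_first
            else:
                last.edges["next"] = c_first  # adjacent on the baseline
            last = c_last
        return first, last
    if el.tag == "msup":
        base, script = list(el)
        b_first, b_last = build(base)
        s_first, _ = build(script)
        b_last.edges["above"] = s_first  # the base is the parent of the superscript
        return b_first, b_last
    raise ValueError(f"unsupported element: {el.tag}")


def render(node):
    """Serialise a tree as text, listing the above edge before next."""
    out = node.label
    for edge in ("above", "next"):
        if edge in node.edges:
            out += f"[{edge}:{render(node.edges[edge])}]"
    return out


def slt_from_mathml(mathml):
    first, _ = build(ET.fromstring(mathml))
    return first


tree = slt_from_mathml(
    "<math><msup><mi>x</mi><mn>2</mn></msup><mo>+</mo><mn>1</mn></math>"
)
print(render(tree))  # V!x[above:N!2][next:O!+[next:N!1]]
```

The root is the first symbol (x). Its above branch holds the superscript, and the next chain continues along the baseline.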

Complete details on the symbol layout tree can be found in the wiki: https://github.com/guruhegde/xapian-gsoc-diary/blob/master/docs/slt.rst/ (link to be updated at a later point)

Implementation:

After considering various options for parsing MathML, I feel it is better to implement our own parser rather than use an existing XML parser. Having studied the code of rapidxml (an XML parser) and MyHtmlParser (from Omega), I think this can be done in the allocated time slot.

Question:

  • Interface for indexing math expressions: do we provide a new method in the TermGenerator class (e.g. index_math), or build a new API class such as MathTermGenerator? Please suggest any other way to do it.

Another option I have in mind is to keep using the TermGenerator.index_text interface for indexing: if a <math> tag is detected, the text up to the matching </math> tag is treated as a math expression and passed to the math indexing module. (I guess we use UtfIterator, so the iterator would be passed to the math indexing module.)
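For illustration only, the detection step of this second option could look roughly like the Python sketch below. It uses a regular expression rather than an iterator over the text, it is not Xapian code, and the names MATH_SPAN and split_text_and_math are hypothetical.

```python
import re

# Matches one <math ...> ... </math> span, non-greedily, across newlines.
MATH_SPAN = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL)


def split_text_and_math(text):
    """Return ('text' | 'math', segment) pairs in input order."""
    parts = []
    pos = 0
    for m in MATH_SPAN.finditer(text):
        if m.start() > pos:
            # Plain text before the math span.
            parts.append(("text", text[pos:m.start()]))
        # The math span itself, which would go to the math indexing module.
        parts.append(("math", m.group(0)))
        pos = m.end()
    if pos < len(text):
        parts.append(("text", text[pos:]))
    return parts


for kind, segment in split_text_and_math(
    "Euler: <math><mi>e</mi></math> is a constant"
):
    print(kind, repr(segment))
```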

Change History (3)

comment:1 by Olly Betts, 6 years ago

I think you really need to add a new class for parsing these expressions. It would certainly be surprising to users for TermGenerator to suddenly start to interpret its input as XML instead of plain text, and in the future we'll probably want to support other formats for writing expressions.

comment:2 by Guruprasad Hegde, 6 years ago

> I think you really need to add a new class for parsing these expressions.

Right, I added a new class.

Patch in progress at https://github.com/xapian/xapian/pull/201

comment:3 by Guruprasad Hegde, 6 years ago

Status: new → assigned