I am building a project and doing research on math-aware search (my project is h...

dginev · on April 8, 2016

I am the person behind generating the original NTCIR math datasets, and probably most of the research-produced MathML out there. We've recently presented that we have more than 350 million formulas from arXiv converted over to MathML, together with the rest of the papers as HTML5.

As someone who has stared at arXiv TeX/LaTeX for years, I can testify that you don't want to be looking at TeX math in actual LaTeX documents; there is a lot more that goes in there beyond the toy formula syntax used on the web.

Also, as someone who has worked on math search engines and math-rich NLP for a few years, I find that complaining that you have a structured, machine-parseable representation for mathematics and wanting TeX instead sounds naive. On one hand, the MathML formulas in the datasets could already preserve the source TeX (the TeX annotations may even be there, I can't remember right now), should you need it directly. On the other hand, you can use any structured methods, such as the ones used by content-based search engines like MathWebSearch, or handpick any relevant information from the MathML tree to feed it back into a statistical algorithm, as done for example by the WebMIAS search engine.

The most fundamental bit to understand if you're doing research on automated processing of human mathematics is that formulas are two-dimensional objects best represented as trees, be they layout trees describing the presentation, or operator trees describing the content, or some other hybrid tree that tries doing both (such as LaTeXML's XMath spec).
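To make the layout-tree versus operator-tree distinction concrete, here is a minimal sketch for the formula x^2 + 1. The nested-tuple encoding and the helper names `depth` and `leaves` are assumptions made for illustration; the labels are borrowed loosely from presentation and content MathML, and this is not the XMath format.

```python
# Two hypothetical tree encodings of the formula x^2 + 1, written as nested
# tuples of the form (label, child, child, ...). Bare strings are leaves.

# Layout (presentation) tree: what is drawn where.
layout_tree = ("mrow",
               ("msup", ("mi", "x"), ("mn", "2")),
               ("mo", "+"),
               ("mn", "1"))

# Operator (content) tree: what is applied to what.
operator_tree = ("plus",
                 ("power", ("ci", "x"), ("cn", "2")),
                 ("cn", "1"))


def depth(tree):
    """Return the depth of a nested-tuple tree; a bare string has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max((depth(child) for child in tree[1:]), default=0)


def leaves(tree):
    """Collect the leaf strings of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    collected = []
    for child in tree[1:]:
        collected.extend(leaves(child))
    return collected


print(leaves(layout_tree))    # the "+" survives as a leaf token
print(leaves(operator_tree))  # the "+" has become the root label instead
```

In the layout tree the plus sign is just another token in a row, while in the operator tree it is the node that owns its two arguments; a search engine can index either shape.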

ga6840 · on April 8, 2016

1. In the NTCIR (main) dataset, I see many cases where <m:math> does not contain an alttext attribute (and thus no TeX). I asked the LaTeXML author, Bruce Miller <bruce.miller@nist.gov>, about this, and he said LaTeXML always puts the source TeX string in the alttext attribute on <m:math>. So I assume you are using an outdated LaTeXML version? I really want to plead with NTCIR to ensure the original LaTeX annotation is kept in the main dataset, or to provide both MathML and LaTeX versions of the corpus so researchers can choose freely. This would allow LaTeX-only math search engines to compare results with MathML search engines. As you know, it is hard to convert all of the formulas back into LaTeX correctly.

2. I wish the NTCIR corpus were not so difficult to download (I once wrote a request for it, but no one replied). Please make it publicly accessible, just like MIaS does: https://mir.fi.muni.cz/mias/

3. My search engine (http://tkhost.github.io/opmes) actually uses a structural method, but I still gave up on MathML and parse TeX directly instead. Why? In TeX I can just omit irrelevant commands like "\color" and "\mbox" and focus only on a handful of math-related TeX commands, and the results are great. My search engine can only handle "toy formula syntax", but maybe it is better than MathWebSearch (https://zbmath.org/formulae/) and even beats Tangent (http://saskatoon.cs.rit.edu/tangent/random) on long queries. But in MathML, I have no idea why I need to read its lengthy spec, and I see no reason to write a MathML parser.
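As a rough illustration of what omitting "\color" and "\mbox" could look like, here is a minimal sketch. It is not the OPMES parser; the function name `simplify_tex` and the choice to drop "\color{...}" entirely while keeping the contents of "\mbox{...}" are assumptions, and the regular expressions only handle arguments without nested braces.

```python
import re

# Assumption: \color{...} carries no mathematical meaning, so drop it whole.
DROP_WITH_ARG = re.compile(r"\\color\s*\{[^{}]*\}")

# Assumption: \mbox{...} is only a wrapper, so keep what is inside it.
UNWRAP = re.compile(r"\\mbox\s*\{([^{}]*)\}")


def simplify_tex(tex):
    """Remove \\color{...} and unwrap \\mbox{...} (non-nested arguments only)."""
    tex = DROP_WITH_ARG.sub("", tex)
    tex = UNWRAP.sub(r"\1", tex)
    # Collapse the whitespace left behind by the removals.
    return " ".join(tex.split())


print(simplify_tex(r"\color{red} x^2 + \mbox{if } y"))
```

A pattern-based filter like this only works on the restricted formula syntax discussed here; it makes no attempt to expand macros or handle nested groups.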

The NTCIR-math conference (and its unfriendly website) makes me unwilling to submit a single paper.

dginev · on April 8, 2016

1. Correct, the dataset was generated back in 2013 and will probably be regenerated for the next NTCIR issue.

2. There are annoying copyright issues with making the datasets available for public use. We're working with arXiv to resolve that; it's out of our control for now. It's a long-lasting frustration of mine that the datasets can't be simply made public.

3. You can omit anything you like from the MathML; it is in no way inferior to omitting from TeX. "but maybe it is better than MWS" - prove it, submit to NTCIR, and beat everyone. Also, being better than MWS is not an argument that MWS should be denied the very data it needs to run. At the same time you can still obtain whatever degradation you need from the presentation MathML. Failing to recognize any claim to correctness other than your own without any substantive proof is not a reasonable position and I urge you to reconsider.

"I have no idea why I need to read its lengthy spec, and I see no reason to write a MathML parser."

You don't need to write a parser, you can use an off-the-shelf parser for XML/HTML5 and handle the MathML reliably and appropriately. In fact you can reuse that from any open source search engine for math, MWS included. Writing a TeX parser on the other hand is something I will always roll my eyes at, since actual real world TeX is not something you can "parse", or do anything with reliably, unless you have a full TeX implementation underneath. Which is 1000x harder than using a parser to deal with MathML.
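For instance, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The sample formula, the helper names `local_name` and `to_tuple`, and the choice of `mspace` as the element to omit are assumptions made for illustration; this is not taken from MWS or any NTCIR tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of the <m:math> elements discussed above.
SAMPLE = (
    '<m:math xmlns:m="http://www.w3.org/1998/Math/MathML" alttext="x^{2}+1">'
    '<m:mrow><m:msup><m:mi>x</m:mi><m:mn>2</m:mn></m:msup>'
    '<m:mo>+</m:mo><m:mspace width="0.2em"/><m:mn>1</m:mn></m:mrow>'
    '</m:math>'
)


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def to_tuple(element, drop=("mspace",)):
    """Convert an element to (name, children...) or (name, text) for leaves,
    omitting any child whose local name is listed in `drop`."""
    children = [to_tuple(child, drop) for child in element
                if local_name(child) not in drop]
    if children:
        return (local_name(element), *children)
    return (local_name(element), (element.text or "").strip())


root = ET.fromstring(SAMPLE)
source_tex = root.get("alttext")  # None when the attribute is missing
tree = to_tuple(root)

print(source_tex)
print(tree)
```

The XML parsing itself is a single call; which nodes to keep, drop, or rewrite into a different tree shape is the part that is specific to each search engine.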

Finally, whining about NTCIR's UI being imperfect as a reason not to submit is just childish.

ga6840 · on April 10, 2016

Thank you for informing me on my first two questions, so now I understand NTCIR's problem.

At first I tried to compare my results (MAP, recall, precision) with those of NTCIR participants, but it took a lot of effort to get the dataset, after which I found I could not convert MathML back into TeX with much confidence. Most importantly, my parser-generated tree structure is fine-tuned for and very dependent on TeX input; I cannot just take the MathML tree structure directly, and I would need much more effort than just importing an existing XML parser. Because of this, I cannot compare my results with mainstream NTCIR researchers. I definitely tried very hard, but sadly I gave up. If NTCIR someday provides TeX data for the competition (even if a request is needed), I will consider comparing my results with NTCIR participants, and will be able and willing to do so (in order to "prove" it).

Writing a TeX parser only for math search is not that difficult; I have written one, and it parses most user-created content on math.stackexchange.com. Although I cannot convince you that I get better results, I can argue that parsing the search-relevant TeX subset is effortless (if you only care about math-related TeX); I even open-sourced my search engine's TeX parser. Again, the problem is that it is not that easy to grab an XML parser and reuse it in my project. I believe a good math-aware search engine needs a tree structure very different from the one MathML represents. You get a tree by reusing the MWS parser, so WHAT? That tree is not the tree I want, and I would need a lot of effort to convert it. The easy way for me would be to convert MathML back into TeX (since I already build my tree from TeX), but sadly that turns out to be too complicated to be worth a shot.

ga6840 · on April 10, 2016

Lastly, there is more than childishness behind my complaining about NTCIR and refusing to submit a paper. I gave up putting unworthy, duplicated effort into implementing a MathML parser that generates the expression tree I need (this step, rather than just parsing XML, is the most difficult), and instead focused on finding another conference to publish my work. It turns out my paper (a demo) got accepted at ECIR 2016, so I am glad I did not waste too much time on NTCIR; otherwise I would have missed ECIR.