Wednesday, May 23, 2012

Expanding to Digits, Beginning Parser Stuff

We are expanding our dataset to include digits, which will allow us to start fooling around with super/subscripts!


So far we are missing 9,7 from the dataset, but today after class I'm going to add the rest.




FO logic symbols and digits 0-9 except 9,7 (25 total classes)


Parsing

I've modified the basic python script I've been using for parsing to include stuff about subscripts. 

This code will most likely be completely changed within the week, but this is simple prototyping to get a feel for the problem space.

I'm searching out tools and papers to help. I'm planning to try to finish parsing as fast as possible so that I can get back to perfecting the classification/extraction parts of the pipeline.

Right now I'm assuming that the extraction file will output position information similar to what is described in this paper[0]. The following three images are all from [0]. I'm thinking that in addition to direction information distance will also have to be taken into account, so that implicit newlines can be added. 



This will be then used to create a graph like so:



Then a graph representation will be used to determine if a symbol is a superscript/subscript/other and if the symbol should back-off its classification to another possible classification (e.g., negation shouldn't be a superscript, so if the second most probable symbol is a 2 it should output)



I used some of the holdout data to make symbols to represent 2^12. I hardcoded the positional information and edited the parser file to use this data.

Output of the play parser for superscripts




Some papers I'm going to be looking at:

[0] Verification of Mathematical Formulae Based on a Combination of Context-Free Grammar and
Tree Grammar

http://www.mkm-ig.org/meetings/mkm06/Presentations/Sexton.pdf


Towards a Parser for Mathematical Formula Recognition


Tools I'm looking into are:










No comments:

Post a Comment