Wednesday, May 30, 2012

Parser Progress, Dataset update

We've completed adding digits to the dataset.

Our dataset is currently inconsistent: a few of the symbols were converted to binary images while the rest were not. To test whether this helps classification, we're going to make all the data binary and re-train the classifier.

If accuracy improves, we'll make binary thresholding part of pre-processing. If it doesn't, we'll revert the data and replace the images that are currently binary with non-binary versions.
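
To make the pre-processing step concrete, here is a minimal sketch of fixed-cutoff binary thresholding. It assumes 8-bit grayscale images stored as nested lists and a cutoff of 128; the function name `binarize`, the representation, and the cutoff are all illustrative assumptions, not what our pipeline actually uses.

```python
def binarize(image, threshold=128):
    """Map each grayscale pixel (assumed 0-255) to 0 or 1 using a fixed cutoff.

    The default cutoff of 128 is an assumption for illustration only.
    """
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in image]


# A tiny made-up 2x3 "image" to show the effect.
sample = [
    [0, 200, 255],
    [90, 128, 30],
]
print(binarize(sample))
```

Applying the same function to every image before training would make the whole dataset consistent, which is what the experiment needs.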

_______________________________________________________________________________

I've begun work on the parser. As I haven't taken compilers yet, I've found it hard to get started, but I've been exploring the problem space.

I've started to play with PLY, a Python implementation of lex and yacc. Using it, I've made a hacky prototype that can substitute some of the characters.

The parser has three overlapping goals:

1) Substituting the classifier's coded output with the corresponding LaTeX
0 -> \forall

2) Using the context-free grammar (CFG) and positional data to make the subscripts and superscripts work:
56 with positional data "6 is to the upper right of 5" -> x^{y}

3) When the parser encounters a syntax error, it should "back off" through the soft classifier results (i.e., the top 3 classes the symbol could be) to a symbol that fits the CFG
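
The first two goals can be sketched in plain Python (this is not the PLY prototype). Only the mapping 0 -> \forall comes from the example above; the entries for codes 5 and 6, the `relation` labels, and the function name `to_latex` are assumptions made up for illustration.

```python
# Hypothetical code table. Only 0 -> \forall appears in the post; the
# entries for 5 and 6 are assumed so that the example produces x^{y}.
CODE_TO_LATEX = {0: r"\forall", 5: "x", 6: "y"}


def to_latex(symbols):
    """Turn (code, relation) pairs into a LaTeX string.

    `relation` describes where a symbol sits relative to the previous one:
    "inline", "super" (raised) or "sub" (lowered). These labels are an
    assumed encoding of the positional data.
    """
    out = ""
    for code, relation in symbols:
        tex = CODE_TO_LATEX[code]
        if relation == "super":
            out += "^{" + tex + "}"
        elif relation == "sub":
            out += "_{" + tex + "}"
        else:
            # Keep a command such as \forall from fusing with a following letter.
            if out and out[-1].isalpha() and tex[0].isalpha():
                out += " "
            out += tex
    return out


print(to_latex([(5, "inline"), (6, "super")]))
print(to_latex([(0, "inline"), (5, "inline")]))
```

A real version would have to derive the relation labels from bounding boxes rather than receive them ready-made.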


I talked to Professor Jhala, and he gave me some advice:

There are three steps:

1) Create a parse tree based on the code/position info from the classifier.

2) Evaluate different possible trees by "backing-off" some of the classifications.

3) Read off the tree to translate it to LaTeX
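
Here is a rough sketch of how those three steps might fit together. Everything in it is an assumption for illustration: the toy grammar (`formula -> QUANT VAR formula | VAR`), the label names, the probabilities, and the brute-force search over label combinations. It does not use PLY, and it is not the approach Professor Jhala described in any detail.

```python
from itertools import product

# Hypothetical classifier labels with their grammar category and LaTeX.
CATEGORY = {"forall": "QUANT", "exists": "QUANT", "x": "VAR", "y": "VAR"}
LATEX = {"forall": r"\forall", "exists": r"\exists", "x": "x", "y": "y"}


def parse(labels):
    """Step 1: build a tree for the toy grammar
    formula -> QUANT VAR formula | VAR.
    Returns a nested tuple, or None on a syntax error."""
    if len(labels) == 1 and CATEGORY[labels[0]] == "VAR":
        return ("var", labels[0])
    if (len(labels) >= 3 and CATEGORY[labels[0]] == "QUANT"
            and CATEGORY[labels[1]] == "VAR"):
        body = parse(labels[2:])
        if body is not None:
            return ("quant", labels[0], labels[1], body)
    return None


def best_tree(candidates):
    """Step 2: back off through the soft results.

    `candidates` holds, per symbol, a list of (label, probability) pairs.
    Combinations are tried from most to least probable and the first one
    that parses wins. This is exponential in the number of symbols, so it
    only suits very short inputs."""
    def joint(combo):
        p = 1.0
        for _, prob in combo:
            p *= prob
        return p

    for combo in sorted(product(*candidates), key=joint, reverse=True):
        tree = parse([label for label, _ in combo])
        if tree is not None:
            return tree
    return None


def read_off(tree):
    """Step 3: walk the tree and emit LaTeX."""
    if tree[0] == "var":
        return LATEX[tree[1]]
    _, quant, var, body = tree
    return LATEX[quant] + " " + LATEX[var] + " " + read_off(body)


# Made-up soft output: the top guess for the second symbol is wrong.
candidates = [
    [("forall", 0.6), ("x", 0.4)],
    [("forall", 0.5), ("x", 0.45), ("y", 0.05)],
    [("y", 0.9), ("x", 0.1)],
]
print(read_off(best_tree(candidates)))
```

In this made-up input the most probable labeling (forall, forall, y) does not parse, so the search falls back to the next most probable one, which does.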


He also pointed me to the following links:

An OCaml parsing homework assignment for CSE130

Happy, a parser generator for Haskell

A blog post about using Happy to write a compiler that might be useful
