We've completed adding digits to the dataset.
There is a disunity in our dataset where a few of the symbols were made binary. We're going to test to see if this helps classification by making all the data binary and re-training the classifier.
If accuracy improves, we'll make binary thresholding part of pre-processing. If it doesn't help, we'll revert the data and re-add the now binary images as non-binary.
_______________________________________________________________________________
I've begun work on the parser. As I haven't taken compilers yet , I've found it hard to begin but I've been exploring the problem space.
I've started to play with PLY, a python implementation of lex and yacc. Using this I've made a hacky thing that can substitute some of the characters.
The parser has three overlapping goals:
1) substituting the classifier's coded output with the corresponding LaTeX
0 -> \forall
2) Using the CFG and positional data to make the subscripts and superscripts work:
56 with positional data "6 is to the upper left of 5" -> x^{y}
3) When the parser encounters syntactic errors, the parser should "back-off" the soft classifier results (i.e., the top 3 classes the symbol could be) to a symbol that fits the CFG
I talked to professor Jhala and he gave me some advice:
I have to do 3 steps,
1) Create a parse tree based on the code/position info from the classifier.
2) Evaluate different possible trees by "backing-off" some of the classifications.
3) Read off the tree to translate it to LaTeX
He also pointed me to the following links:
An Ocaml parsing homework assignment for CSE130
Happy, a parser generator for Haskell
A blog post about using Happy to write a compiler that might be useful
Wednesday, May 30, 2012
Wednesday, May 23, 2012
Expanding to Digits, Beginning Parser Stuff
We are expanding our dataset to include digits, which will allow us to start fooling around with super/subscripts!
[0] Verification of Mathematical Formulae Based on a Combination of Context-Free Grammar and
Tree Grammar
http://www.mkm-ig.org/meetings/mkm06/Presentations/Sexton.pdf
So far we are missing 9,7 from the dataset, but today after class I'm going to add the rest.
FO logic symbols and digits 0-9 except 9,7 (25 total classes) |
Parsing
I've modified the basic python script I've been using for parsing to include stuff about subscripts.
This code will most likely be completely changed within the week, but this is simple prototyping to get a feel for the problem space.
I'm searching out tools and papers to help. I'm planning to try to finish parsing as fast as possible so that I can get back to perfecting the classification/extraction parts of the pipeline.
Right now I'm assuming that the extraction file will output position information similar to what is described in this paper[0]. The following three images are all from [0]. I'm thinking that in addition to direction information distance will also have to be taken into account, so that implicit newlines can be added.
This will be then used to create a graph like so:
Then a graph representation will be used to determine if a symbol is a superscript/subscript/other and if the symbol should back-off its classification to another possible classification (e.g., negation shouldn't be a superscript, so if the second most probable symbol is a 2 it should output)
I used some of the holdout data to make symbols to represent 2^12. I hardcoded the positional information and edited the parser file to use this data.
Output of the play parser for superscripts |
Some papers I'm going to be looking at:
Tree Grammar
http://www.mkm-ig.org/meetings/mkm06/Presentations/Sexton.pdf
Towards a Parser for Mathematical Formula Recognition
Tools I'm looking into are:
Monday, May 14, 2012
The Quest for the Proper Proper Subset Classification
Switch to SVM
We’ve switched from logistic regression with HOG features to Support Vector Machines using a variant of SIFT descriptors.
Our code is a slightly modified version of this vl_feat example code.
Here is a description of what the code does, as far as I currently understand it:
Here is a description of what the code does, as far as I currently understand it:
The code first trains a bag of visual words “vocabulary” based on PHOW descriptors (a variant of dense SIFT, implemented in vl_feat and described in this paper: Image Classification using Random Forests and Ferns. The "vocabulary" is created by running k-means on the PHOW descriptors.
The vocabulary comes from a subset of the training data, and afterwards all the
training data is given a feature description based on PHOW descriptors and the vocabulary.
A feature map is then computed using the chi-squared kernel.
The classification is done through all-vs-one classification, I'm thinking about possibly altering it to training a bunch of binary classifiers then doing a decision tree or a voting thing. Also, I need to make the output soft for the parser.
This is a good tutorial for SVM's.
Experimental Results
Methodology: We have 300 samples for each of the 17 classes. I wrote a python script that randomly selected 20 of each of the class examples to be "holdout" data. I haven't modified the code yet to do cross-validation.
The 20 holdout data samples per class were used to create various formulas in first-order logic (the definition of function and the definition of proper subset). When the segmentation and extractor get better new novel data will be used instead of these hold-out samples for testing the effectiveness of the system.
At first I used 50 random samples as training and the remaining 230 as test. I observed that most of the errors were due to classifying symbols as negation.
Observing the (scores,classes) output of the svm one-vs-all classifier, I saw that for the "negation errors" the correct label usually was the 2nd most likely.
Here is an example of the scores for a "z" that was mistaken for a "negation":
scores class
-0.2888 11 (neg)
-0.3795 17 (z)
-0.4663 9 (left paren)
-0.4823 7 (if then)
-0.5354 12 (or)
-0.5367 16 (y)
-0.5602 10 (ne)
-0.5620 13 (right paren)
-0.5640 14 (subset)
-0.5786 4 (elem)
-0.6220 15 (x)
-0.6455 2 (R)
-0.6618 5 (exist)
-0.6747 3 (and)
-0.6786 1 (F)
-0.7181 6 (for all)
-0.7594 8 (iff)
I decided to create alternate samples that used a tilde instead of the hook-like symbol for negation, to see how that would affect the results.
The 20 holdout data samples per class were used to create various formulas in first-order logic (the definition of function and the definition of proper subset). When the segmentation and extractor get better new novel data will be used instead of these hold-out samples for testing the effectiveness of the system.
At first I used 50 random samples as training and the remaining 230 as test. I observed that most of the errors were due to classifying symbols as negation.
Observing the (scores,classes) output of the svm one-vs-all classifier, I saw that for the "negation errors" the correct label usually was the 2nd most likely.
Here is an example of the scores for a "z" that was mistaken for a "negation":
scores class
-0.2888 11 (neg)
-0.3795 17 (z)
-0.4663 9 (left paren)
-0.4823 7 (if then)
-0.5354 12 (or)
-0.5367 16 (y)
-0.5602 10 (ne)
-0.5620 13 (right paren)
-0.5640 14 (subset)
-0.5786 4 (elem)
-0.6220 15 (x)
-0.6455 2 (R)
-0.6618 5 (exist)
-0.6747 3 (and)
-0.6786 1 (F)
-0.7181 6 (for all)
-0.7594 8 (iff)
I decided to create alternate samples that used a tilde instead of the hook-like symbol for negation, to see how that would affect the results.
In the end I trained 4 models and tested each of them on 2 different formulas:
Models:
1) 50 train
2) 230 train
3) 50 train (tilde for negation)
4) 230 train (tilde for negation)
Models:
1) 50 train
2) 230 train
3) 50 train (tilde for negation)
4) 230 train (tilde for negation)
Formulas:
1)Function definition
Correct classification for function definition |
2)Proper Subset definition
Proper classification, definition of proper subset |
PARAMS that may need to be tweaked:
There are a few parameters in the svm code. I need do more testing to see how tweaking these will affect classification results.
Number of words: This is the "vocabulary" created by k-means. It was originally set to 300, but I increased it to 600. I'm not sure yet if this helped or hurt, more testing is needed.
K-means algorithm: The vl_feat implementation of k-mean can use the standard Lloyd's algorithm or UCSD professor Elkan's accelerated k-mean's algorithm. Current results use Elkan's algorithm, but I'm not sure if this sped-up version gives the same results for our (relatively) small dataset.
Kernel: The feature map is computed using the vl_homkermap function. This function has a few parameters and potentially other kernel mappings could be swapped for it.
Confusion Matrices
"Hook" negation:
50 training examples, 230 test 230 training, 50 test Tilde negation:
230 train, 50 test, tilde for negation Parsing Results for Holdout Data Formulas |
50 training with tilde for negation (4 errors) |
230 training with tilde for negation (4 errors) |
Current Goals:
Classification seems "good enough", next to do is:
1) Implementing the graph structure/Context Free Grammar for parsing
2) Improving segmentation and extraction
3) Expanding the data to include numbers.
Sunday, May 6, 2012
Non-Maximal Suppression
One problem faced when locating characters in an image is extraneous bounding boxes. When the segmenter is done detecting characters and placing bounding boxes for the image, there will be some false positives found, as well as overlapping bounding boxes.
One way to deal with this problem is Non Maximal Suppression (NMS). The idea behind NMS is to cluster bounding box groups, and then apply a heuristic to determine the best box in each cluster. I used kmeans to cluster the bounding boxes, and for each cluster, chose the bounding box closest to that cluster point.
Before NMS, on an input containing 7 characters, there were 21 bounding boxes found. After NMS, there were only 13 bounding boxes left, much closer to the real number of characters input. Here is a plot of the bounding boxes after NMS has been applied. Although it is hard to see a visual difference between the plots before and after NMS, you can see that the boxes left still contain characters.
Image with NMS applied
Goals:
1) Become familiar with VLFeat
2) Run cross validation, and test some more data to see how NMS performs with different expressions
3) Try SIFT vs HOG and compare accuracy results.
4) Fix the extraction method of data from the bounding boxes. Currently, the centroid of the character in the bounding box is found, and then extracted by carving out the pixels around it in the original image. A more useful approach would be to pad the information contained in the bounding box with whitespace, which will help when extracting characters that are nearby each other, such as exponents, or the range of an integral.
Subscribe to:
Posts (Atom)