Monday, May 14, 2012

The Quest for the Proper Proper Subset Classification

Switch to SVM 

We’ve switched from logistic regression with HOG features to Support Vector Machines using a variant of SIFT descriptors. 

Our code is a slightly modified version of this vl_feat example code.

Here is a description of what the code does, as far as I currently understand it:

The code first trains a bag of visual words "vocabulary" based on PHOW descriptors (a variant of dense SIFT, implemented in vl_feat and described in this paper: Image Classification using Random Forests and Ferns). The "vocabulary" is created by running k-means on the PHOW descriptors.

The vocabulary comes from a subset of the training data; afterwards, all of the training data is given a feature description based on its PHOW descriptors quantized against the vocabulary.
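
Roughly, the vocabulary step looks like this (a minimal MATLAB sketch, not our actual code; names like trainImages and numWords are mine, and the PHOW options shown are just illustrative):

    % Collect PHOW (dense multi-scale SIFT) descriptors from training images.
    % trainImages is assumed to be a cell array of grayscale images.
    descrs = {} ;
    for i = 1:numel(trainImages)
      im = im2single(trainImages{i}) ;
      [~, d] = vl_phow(im, 'Step', 3, 'Sizes', [4 6 8 10]) ;
      descrs{end+1} = d ;
    end

    % Subsample the descriptors and cluster them into a visual-word vocabulary.
    descrs = single(vl_colsubset(cat(2, descrs{:}), 1e5)) ;
    numWords = 600 ;
    vocab = vl_kmeans(descrs, numWords, 'Algorithm', 'Elkan') ;

    % Each image is then described by a histogram over the vocabulary.
    kdtree = vl_kdtreebuild(vocab) ;
    [~, d] = vl_phow(im2single(trainImages{1}), 'Step', 3, 'Sizes', [4 6 8 10]) ;
    words = double(vl_kdtreequery(kdtree, vocab, single(d))) ;
    hist = vl_binsum(zeros(numWords, 1), 1, words) ;
    hist = single(hist / sum(hist)) ;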

An explicit feature map approximating the chi-squared kernel is then computed, so that a fast linear SVM can be used in place of a much slower kernel SVM.
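
In vl_feat this is done with a homogeneous kernel map, which expands each histogram so that a plain linear SVM approximates a chi-squared-kernel SVM. A sketch, assuming hists is the numWords x numImages matrix of histograms from above:

    % Explicit feature map for the chi2 kernel. Order n = 1 expands each
    % dimension into 2n+1 = 3, so psix is (3 * numWords) x numImages.
    psix = vl_homkermap(hists, 1, 'kernel', 'kchi2', 'gamma', .5) ;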


Classification is done one-vs-all. I'm thinking about altering this to train a set of binary classifiers and then combine them with a decision tree or a voting scheme. I also need to make the output soft (per-class scores rather than a single label) for the parser.
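
For reference, one-vs-all training with soft score output could look like the following. This is a sketch, not our actual code: it uses vl_svmtrain from current vl_feat (the example we started from has its own solver), and labels, psixTest, and the lambda value here are assumptions:

    % Train one binary linear SVM per class on the mapped features.
    numClasses = 17 ;
    lambda = 1 / (10 * size(psix, 2)) ;          % regularizer; needs tuning
    W = zeros(size(psix, 1), numClasses) ;
    B = zeros(1, numClasses) ;
    for c = 1:numClasses
      y = 2 * (labels == c) - 1 ;                % +1 for class c, -1 otherwise
      [W(:, c), B(c)] = vl_svmtrain(psix, y, lambda) ;
    end

    % Soft output: a full score vector per test sample, not just the argmax.
    scores = bsxfun(@plus, W' * psixTest, B') ;  % numClasses x numTest
    [~, predictions] = max(scores, [], 1) ;      % hard labels, if needed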

This is a good tutorial on SVMs.





Experimental Results


Methodology: We have 300 samples for each of the 17 classes. I wrote a Python script that randomly selects 20 examples of each class as "holdout" data. I haven't yet modified the code to do cross-validation.
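
The script itself is in Python and not shown here; the same per-class split in MATLAB would look something like this (a sketch, assuming labels is a 1 x 5100 vector of class indices):

    % Mark 20 random samples per class as holdout; the rest are train/test.
    holdout = false(size(labels)) ;
    for c = 1:17
      idx = find(labels == c) ;
      pick = idx(randperm(numel(idx), 20)) ;   % 20 random examples of class c
      holdout(pick) = true ;
    end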

The 20 holdout samples per class were used to compose various formulas in first-order logic (the definition of a function and the definition of a proper subset). When segmentation and extraction get better, new novel data will be used instead of these holdout samples to test the effectiveness of the system.

At first I used 50 random samples per class as training data and the remaining 230 as test data. I observed that most of the errors were due to symbols being misclassified as negation.

Observing the (score, class) output of the one-vs-all SVM classifier, I saw that for these "negation errors" the correct label was usually the second most likely.


Here is an example of the scores for a "z" that was mistaken for a "negation":

     score   class
   -0.2888   11 (neg)
   -0.3795   17 (z)
   -0.4663    9 (left paren)
   -0.4823    7 (if then)
   -0.5354   12 (or)
   -0.5367   16 (y)
   -0.5602   10 (ne)
   -0.5620   13 (right paren)
   -0.5640   14 (subset)
   -0.5786    4 (elem)
   -0.6220   15 (x)
   -0.6455    2 (R)
   -0.6618    5 (exist)
   -0.6747    3 (and)
   -0.6786    1 (F)
   -0.7181    6 (for all)
   -0.7594    8 (iff)




I decided to create alternate samples that use a tilde (~) instead of the hook-like symbol (¬) for negation, to see how that would affect the results.

In the end I trained 4 models and tested each of them on 2 different formulas:

Models:
1) 50 train
2) 230 train
3) 50 train (tilde for negation)
4) 230 train (tilde for negation)

Formulas:
1) Function definition

Correct classification for function definition 



2) Proper Subset definition


Proper classification, definition of proper subset




Parameters that may need to be tweaked:


There are a few parameters in the SVM code. I need to do more testing to see how tweaking these will affect classification results.

Number of words: This is the size of the "vocabulary" created by k-means. It was originally set to 300, but I increased it to 600. I'm not sure yet if this helped or hurt; more testing is needed.

K-means algorithm: The vl_feat implementation of k-means can use the standard Lloyd's algorithm or UCSD professor Elkan's accelerated k-means algorithm. Current results use Elkan's algorithm. Since Elkan's method is an exact acceleration (it uses the triangle inequality to skip distance computations), it should match Lloyd's given the same initialization, but I haven't verified this on our (relatively) small dataset.
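
One way to check is to seed vl_feat's random number generator and compare the energies of the two algorithms directly. A sketch, assuming vl_kmeans draws its random initialization from the vl_twister state:

    % With identical initialization, Elkan's exact acceleration should
    % reach the same clustering energy as plain Lloyd iterations.
    vl_twister('state', 0) ;
    [~, ~, eLloyd] = vl_kmeans(descrs, numWords, 'Algorithm', 'Lloyd') ;
    vl_twister('state', 0) ;
    [~, ~, eElkan] = vl_kmeans(descrs, numWords, 'Algorithm', 'Elkan') ;
    fprintf('Lloyd: %.4f  Elkan: %.4f\n', eLloyd, eElkan) ;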

Kernel: The feature map is computed using the vl_homkermap function. This function has a few parameters (the approximation order, the kernel type, and gamma), and other kernel maps could potentially be swapped in for the chi-squared one, as sketched below.
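
For the record, the swappable pieces look like this (a sketch; the kernel names and the order parameter N are from the vl_homkermap documentation):

    % Same data, different kernel maps / orders; each could replace kchi2.
    psiChi2   = vl_homkermap(hists, 1, 'kernel', 'kchi2') ;
    psiInters = vl_homkermap(hists, 1, 'kernel', 'kinters') ;  % intersection
    psiJS     = vl_homkermap(hists, 2, 'kernel', 'kjs') ;      % Jensen-Shannon
    % Output dimension is (2N+1) x input dimension, so the order N trades
    % accuracy of the approximation against feature size.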



Confusion Matrices




"Hook" negation:
50 training examples, 230 test




230 training, 50 test




Tilde negation:


50 train, 230 test, tilde for negation



230 train, 50 test, tilde for negation




Parsing Results for Holdout Data Formulas






Function Formula (7 symbols from holdout):


50 training (0 errors!)



230 training (0 errors!)


50 training with tilde for negation (2 errors)



230 training with tilde for negation (1 error, left paren for x)




Proper Subset Formula (30 symbols from holdout):


50 training (7 errors)





230 train (8 errors)







50 training with tilde for negation (4 errors)
230 training with tilde for negation (4 errors)


Current Goals:






Classification seems "good enough", so the next steps are:

1) Implement the graph structure / context-free grammar for parsing
2) Improve segmentation and extraction
3) Expand the data to include numbers

