Wednesday, June 6, 2012

!Closing Remarks!

This is the closing blog post! The quarter has gone fast!

All the code is and will continue to be available here.

I'm going to write a bit about previous work (because I never wrote a post about that), list some papers I found helpful, go over each part of the pipeline, and conclude with what I learned and what I would do differently if I had a paradox-free time machine.

Previous Work:

The abstract for our project proposal was:

While handwriting provides an efficient means to write mathematical symbols quickly, it is a poor medium for rapid exchange and editing of documents. Meanwhile, advanced typesetting systems like LaTeX and MathML have provided an environment where mathematical symbols can be typeset with precision, but at the cost of typing time and a steep learning curve. In order to facilitate the exchange, preservation and ease of editing of mathematical documents, we propose a method of offline handwritten equation recognition. Our system takes a handwritten document, for example a student's calculus homework, then partitions, classifies and parses the document into LaTeX.

The following two links show similar but on-line versions of what we were trying to accomplish:

http://webdemo.visionobjects.com/equation.html
http://detexify.kirelabs.org/classify.html

In reading through papers, I found that this project was highly ambitious. Survey papers repeatedly mentioned that classification and parsing/verification were mostly tackled as separate research problems.

While there is the standard MNIST dataset for digits, there didn't seem to be an equally standard dataset for handwritten mathematical symbols. The closest thing is the dataset from the infty project, which was used in some of the papers we read while researching the project. Unfortunately, we never got the dataset to work.

Papers That Helped the Most:

Entire project:
This was the paper we read for beginning the project:
http://www.eecs.ucf.edu/courses/cap6938/fall2008/penui/readings/matsakis-MEng-99.pdf

Classification/Feature Descriptors
http://hal.archives-ouvertes.fr/docs/00/54/85/12/PDF/hog_cvpr2005.pdf
http://www.umiacs.umd.edu/~joseph/support-vector-machines4.pdf
http://eprints.pascal-network.org/archive/00004062/01/ShalevSiSr07.pdf
http://eprints.pascal-network.org/archive/00003046/01/bosch07a.pdf

Verification/Parsing:
http://www.springerlink.com/content/6217472316636n13/fulltext.pdf?MUD=MP

Synthetic data (something we considered but abandoned):
http://www.stanford.edu/~acoates/papers/coatesetal_icdar_2011.pdf

Meta-links (These links had links to useful papers):
http://www.inftyproject.org/en/articles_ocr.html
http://www.handwritten.net/mv/papers/

My Work:

Segmentation:

There were two schemes that were attempted for segmentation.

The two methods were based on the following two stackoverflow posts:

http://stackoverflow.com/questions/5305712/how-to-perform-character-segmentation-in-matlab

http://stackoverflow.com/questions/6972918/detect-a-rectangle-bound-of-an-character-or-object-in-black-white-or-binary-imag

Both are based on finding connected-components.

Because of this, the segmenter couldn't detect symbols made of multiple disconnected components, most egregiously '='.

This was annoying, but in the interest of finishing on time we postponed the fix, which might have been a sliding-window way of detecting symbols.

Extraction:

Once the bounding boxes were found by the segmenter, we had to extract the symbols. This involved embedding the symbols in new matrices and outputting positional information about how each character related to the previous ones. By calculating the centroid of each bounding box, we were able to output this information to help with parsing.
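As a rough illustration of the positional output described above, here is a minimal Python sketch. It is not the project's extraction code (that was written in matlab); the function names, the (x, y, width, height) box format and the tolerance value are all assumptions made for this example. It computes the centroid of each bounding box and reports whether a symbol sits above, below or level with the previous one.

```python
def centroid(box):
    """Return the centre (cx, cy) of a box given as (x, y, width, height)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def vertical_relation(prev_box, box, tolerance=0.25):
    """Describe where `box` sits vertically relative to `prev_box`.

    Image coordinates are assumed, so y grows downward. `tolerance` is a
    hypothetical fraction of the previous box's height used as a dead zone.
    """
    prev_cy = centroid(prev_box)[1]
    cy = centroid(box)[1]
    dead_zone = tolerance * prev_box[3]
    if cy < prev_cy - dead_zone:
        return "above"
    if cy > prev_cy + dead_zone:
        return "below"
    return "level"


base = (0, 10, 20, 20)       # a full-size symbol
exponent = (22, 0, 10, 10)   # a smaller symbol up and to the right
relation = vertical_relation(base, exponent)
```

A relation of "above" for a symbol that also lies to the right of its predecessor is the kind of hint a parser could use to emit a superscript.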

Classification:

We spent the first few weeks toying with logistic regression. After we implemented cross-validation we found the classifier worked abysmally.

Later, we switched to SVMs and used vl_feat. By modifying example code for object recognition, we were able to increase accuracy tremendously.

The details of this switch are described in this earlier post.

Parsing:

Parsing/Verification ended up being an interesting project in itself. After a week or two of hacks and false starts, I was able to put together something with limited functionality using PLY, a python implementation of lex/yacc.

Future:

Despite this being the 'final post', there are a few things we're still working on.

Segmentation will probably not be modified this late.

Extraction isn't tested completely or integrated into the parser yet.

Classification has a few small adjustments to be tested.

Parsing can't handle all subscripts/superscripts yet; it fails in some edge cases, such as when multiple variables have both superscripts and subscripts one after the other.

What I've Learned/Would Do Different!:

I took this class at the same time as CSE141/141L. 141L is another class with a major project and in the first few weeks I had trouble balancing it. Time management is very important, and I think if I had planned more of the quarter earlier I could have gotten further.

I regret not getting segmentation for the '=' sign!!

I regret never using mechanical turk to generate a dataset.

I've learned a lot about SVMs, and about working on a large open-ended project.

Scope and focus have been recurring themes. As the quarter progressed, I had to adjust the scope of what I wanted to accomplish while attempting to focus on the parts that were most essential.

Conclusion:

Thanks to everyone in class for their helpful comments and advice.

This post hasn't had any pictures, so please enjoy this picture of superman programming:

Pictured: Computer Vision

Monday, June 4, 2012

Parsing update

Accuracy with only 3 binary training types

All classes binary

Accuracy didn't increase.

Parsing

My parser now creates a syntax tree as a list of tuples.

I've found this blog post to be very useful.

The form it outputs is like this:

('EXP',
[('QUANT', ' \\forall ', ' x',
('BINOP', ' \\in', ('VAR', ' x'),
('BINOP', ' \\leftrightarrow', ('VAR', ' y'),
('BINOP', ' \\subset', ('VAR', ' x'), ('VAR', ' y')
)
)]
)

This works for correctly classified examples from the first order logic.
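One possible way to read LaTeX back off a tree of this shape is a small recursive walk. The sketch below is only a guess at how that could work, not the parser's actual code: it assumes the node layouts ('EXP', [children]), ('QUANT', quantifier, variable, body), ('BINOP', operator, left, right) and ('VAR', name), inferred from the output above, and the function name to_latex is made up.

```python
def to_latex(node):
    """Recursively turn a tuple-based syntax tree into a LaTeX string."""
    kind = node[0]
    if kind == 'EXP':
        # ('EXP', [child, ...]): concatenate the children in order
        return ''.join(to_latex(child) for child in node[1])
    if kind == 'QUANT':
        # ('QUANT', quantifier, bound variable, body)
        _, quantifier, variable, body = node
        return quantifier + variable + to_latex(body)
    if kind == 'BINOP':
        # ('BINOP', operator, left, right): print the operator infix
        _, operator, left, right = node
        return to_latex(left) + operator + to_latex(right)
    if kind == 'VAR':
        return node[1]
    raise ValueError('unknown node type: %r' % (kind,))


# A small tree in the same style as the parser output
tree = ('EXP', [('QUANT', ' \\forall ', ' x',
                 ('BINOP', ' \\in', ('VAR', ' x'), ('VAR', ' y')))])
latex = to_latex(tree)
```

Unknown node types raise an error here; a parser that has to cope with classifier mistakes would probably want a dummy node type instead.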

TODO:

Parser:

1) Add rules with positional information for {super,sub} scripts

2) Figure out how to walk the tree and print out LaTeX

3) Add dummy rules/node-types for when the classifier makes mistakes

4) Figure out how to add back-off rules to make more possible trees,

figure out how to select one as the most probable tree

Classifier:

Add sigma and integral signs

cross-validation!!!

play with different settings.

Extraction:

Output positional information

Wednesday, May 30, 2012

Parser Progress, Dataset update

We've completed adding digits to the dataset.

There is a disunity in our dataset where a few of the symbols were made binary. We're going to test to see if this helps classification by making all the data binary and re-training the classifier.

If accuracy improves, we'll make binary thresholding part of pre-processing. If it doesn't help, we'll revert the data and re-add the now binary images as non-binary.
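For reference, binary thresholding as a pre-processing step can be as simple as the following sketch. The image-as-list-of-rows representation, the function name and the threshold of 128 are assumptions for illustration; the project's own pre-processing was done in matlab.

```python
def binarize(image, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0/1 values.

    Pixels darker than `threshold` are treated as ink (1), the rest as
    background (0). The threshold of 128 is an arbitrary assumption.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


gray = [[0, 200],
        [127, 128]]
binary = binarize(gray)
```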

_______________________________________________________________________________

I've begun work on the parser. As I haven't taken compilers yet, I've found it hard to begin, but I've been exploring the problem space.

I've started to play with PLY, a python implementation of lex and yacc. Using this I've made a hacky thing that can substitute some of the characters.

The parser has three overlapping goals:

1) substituting the classifier's coded output with the corresponding LaTeX
0 -> \forall

2) Using the CFG and positional data to make the subscripts and superscripts work:
56 with positional data "6 is to the upper left of 5" -> x^{y}

3) When the parser encounters syntactic errors, the parser should "back-off" the soft classifier results (i.e., the top 3 classes the symbol could be) to a symbol that fits the CFG
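To make the first two goals concrete, here is a minimal sketch that does not use PLY at all. The code table only contains the mappings implied by the examples above (0 for \forall, 5 for x, 6 for y); the position tags 'inline', 'super' and 'sub' and the function name are made up for this illustration.

```python
# Hypothetical code table; only these three mappings follow the examples above.
CODE_TO_LATEX = {0: '\\forall', 5: 'x', 6: 'y'}


def symbols_to_latex(symbols):
    """Turn (code, position) pairs into a LaTeX string.

    `position` describes the symbol relative to the previous one and is
    assumed to be 'inline', 'super' or 'sub'.
    """
    parts = []
    for code, position in symbols:
        latex = CODE_TO_LATEX[code]
        if parts and position == 'super':
            parts[-1] += '^{' + latex + '}'   # attach to the previous symbol
        elif parts and position == 'sub':
            parts[-1] += '_{' + latex + '}'
        else:
            parts.append(latex)
    return ' '.join(parts)


plain = symbols_to_latex([(0, 'inline'), (5, 'inline')])
scripted = symbols_to_latex([(5, 'inline'), (6, 'super')])
```

This handles only one level of scripts attached to the immediately preceding symbol, which is exactly the limitation a CFG with positional rules is meant to remove.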

I talked to professor Jhala and he gave me some advice:

I have to do 3 steps,

1) Create a parse tree based on the code/position info from the classifier.

2) Evaluate different possible trees by "backing-off" some of the classifications.

3) Read off the tree to translate it to LaTeX
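The back-off idea in step 2 could be sketched as a search over the classifier's top few guesses per symbol. Everything here is an assumption for illustration: the is_valid function is a toy stand-in for a real grammar check (its single rule, that negation may not be a superscript, comes from the example used elsewhere on this blog), and the scores are invented.

```python
from itertools import product


def is_valid(labels, positions):
    """Toy stand-in for a grammar check: negation may not be a superscript."""
    return not any(label == 'neg' and position == 'super'
                   for label, position in zip(labels, positions))


def back_off(candidates, positions):
    """Return the highest-scoring label sequence accepted by is_valid.

    candidates: one list per symbol of (score, label) pairs (top-k guesses).
    positions:  one position tag per symbol ('inline' or 'super').
    """
    ranked = sorted(product(*candidates),
                    key=lambda combo: sum(score for score, _ in combo),
                    reverse=True)
    for combo in ranked:
        labels = [label for _, label in combo]
        if is_valid(labels, positions):
            return labels
    return None  # no combination fits the rules


# Invented scores: the classifier prefers 'neg' for the raised symbol
candidates = [[(-0.2, 'x'), (-0.5, 'y')],
              [(-0.3, 'neg'), (-0.4, '2')]]
positions = ['inline', 'super']
best = back_off(candidates, positions)
```

Enumerating every combination grows exponentially with formula length, so this only makes sense for short formulas or a small k; a real parser would prune during parsing instead.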

He also pointed me to the following links:

An Ocaml parsing homework assignment for CSE130

Happy, a parser generator for Haskell

A blog post about using Happy to write a compiler that might be useful

Wednesday, May 23, 2012

Expanding to Digits, Beginning Parser Stuff

We are expanding our dataset to include digits, which will allow us to start fooling around with super/subscripts!

So far we are missing 9,7 from the dataset, but today after class I'm going to add the rest.

FO logic symbols and digits 0-9 except 9,7 (25 total classes)

Parsing

I've modified the basic python script I've been using for parsing to include stuff about subscripts.

This code will most likely be completely changed within the week, but this is simple prototyping to get a feel for the problem space.

I'm searching out tools and papers to help. I'm planning to try to finish parsing as fast as possible so that I can get back to perfecting the classification/extraction parts of the pipeline.

Right now I'm assuming that the extraction file will output position information similar to what is described in this paper[0]. The following three images are all from [0]. I'm thinking that in addition to direction information distance will also have to be taken into account, so that implicit newlines can be added.

This will be then used to create a graph like so:

Then a graph representation will be used to determine if a symbol is a superscript/subscript/other and if the symbol should back off its classification to another possible classification (e.g., negation shouldn't be a superscript, so if the second most probable symbol is a 2, the parser should output the 2 instead).

I used some of the holdout data to make symbols to represent 2^12. I hardcoded the positional information and edited the parser file to use this data.

Output of the play parser for superscripts

Some papers I'm going to be looking at:

[0] Verification of Mathematical Formulae Based on a Combination of Context-Free Grammar and Tree Grammar

http://www.mkm-ig.org/meetings/mkm06/Presentations/Sexton.pdf

Towards a Parser for Mathematical Formula Recognition

Tools I'm looking into are:

http://wiki.python.org/moin/LanguageParsing

http://en.wikipedia.org/wiki/Yacc

http://en.wikipedia.org/wiki/GNU_bison

Monday, May 14, 2012

The Quest for the Proper Proper Subset Classification

Switch to SVM

We’ve switched from logistic regression with HOG features to Support Vector Machines using a variant of SIFT descriptors.

Our code is a slightly modified version of this vl_feat example code.

Here is a description of what the code does, as far as I currently understand it:

The code first trains a bag of visual words “vocabulary” based on PHOW descriptors (a variant of dense SIFT, implemented in vl_feat and described in this paper: Image Classification using Random Forests and Ferns). The "vocabulary" is created by running k-means on the PHOW descriptors.

The vocabulary comes from a subset of the training data, and afterwards all the training data is given a feature description based on PHOW descriptors and the vocabulary.

A feature map is then computed using the chi-squared kernel.
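To illustrate the bag-of-visual-words step described above, here is a deliberately simplified Python sketch. It is not the vl_feat code and ignores details of that example (PHOW extraction, k-means training, any spatial binning); the function names and the tiny two-dimensional "descriptors" are made up. Each descriptor is assigned to its nearest vocabulary word and the image is described by the normalised histogram of word counts.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary vector closest (squared Euclidean) to descriptor."""
    distances = [sum((d - w) ** 2 for d, w in zip(descriptor, word))
                 for word in vocabulary]
    return distances.index(min(distances))


def bag_of_words(descriptors, vocabulary):
    """Normalised histogram of visual-word counts for one image."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = float(sum(counts)) or 1.0   # avoid dividing by zero for empty input
    return [count / total for count in counts]


# Toy vocabulary of two "words" and four descriptors from one image
vocabulary = [[0, 0], [10, 10]]
descriptors = [[1, 1], [9, 9], [0, 1], [1, 0]]
histogram = bag_of_words(descriptors, vocabulary)
```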

The SVM solver is called "pegasos" and is based on this paper.

The classification is done one-vs-all. I'm thinking about possibly changing it to train a bunch of binary classifiers and then combine them with a decision tree or a voting scheme. Also, I need to make the output soft for the parser.

This is a good tutorial for SVMs.

Experimental Results

Methodology: We have 300 samples for each of the 17 classes. I wrote a python script that randomly selected 20 of each of the class examples to be "holdout" data. I haven't modified the code yet to do cross-validation.
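The holdout selection can be sketched in a few lines of Python. This is a reconstruction for illustration, not the script mentioned above: the dictionary layout, the function name and the fixed seed are assumptions.

```python
import random


def split_holdout(samples_by_class, holdout_per_class=20, seed=0):
    """Randomly move `holdout_per_class` samples of every class into a holdout set."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    remaining, holdout = {}, {}
    for label, samples in samples_by_class.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)
        holdout[label] = shuffled[:holdout_per_class]
        remaining[label] = shuffled[holdout_per_class:]
    return remaining, holdout


# 17 classes with 300 samples each, as in the dataset described above
data = {label: list(range(300)) for label in range(17)}
remaining, holdout = split_holdout(data)
```

Each class ends up with 20 holdout samples and 280 left over for training and testing.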

The 20 holdout data samples per class were used to create various formulas in first-order logic (the definition of function and the definition of proper subset). When the segmentation and extractor get better, novel data will be used instead of these hold-out samples for testing the effectiveness of the system.

At first I used 50 random samples as training and the remaining 230 as test. I observed that most of the errors were due to classifying symbols as negation.

Observing the (scores,classes) output of the svm one-vs-all classifier, I saw that for the "negation errors" the correct label usually was the 2nd most likely.

Here is an example of the scores for a "z" that was mistaken for a "negation":

scores class
-0.2888 11 (neg)
-0.3795 17 (z)
-0.4663 9 (left paren)
-0.4823 7 (if then)
-0.5354 12 (or)
-0.5367 16 (y)
-0.5602 10 (ne)
-0.5620 13 (right paren)
-0.5640 14 (subset)
-0.5786 4 (elem)
-0.6220 15 (x)
-0.6455 2 (R)
-0.6618 5 (exist)
-0.6747 3 (and)
-0.6786 1 (F)
-0.7181 6 (for all)
-0.7594 8 (iff)
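A "soft" output for the parser could simply be the top few classes by score. The sketch below uses the five highest scores from the table above; the dictionary layout and the function name top_k are assumptions for illustration.

```python
# The five highest scores from the table above, keyed by class name
scores = {
    'neg': -0.2888,
    'z': -0.3795,
    'left paren': -0.4663,
    'if then': -0.4823,
    'or': -0.5354,
}


def top_k(scores, k=3):
    """Return the k class names with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


soft_prediction = top_k(scores)
```

Here the correct label "z" is second in the list, so a parser that can back off to the runner-up has a chance of recovering from the mistake.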

I decided to create alternate samples that used a tilde instead of the hook-like symbol for negation, to see how that would affect the results.

In the end I trained 4 models and tested each of them on 2 different formulas:

Models:
1) 50 train
2) 230 train
3) 50 train (tilde for negation)
4) 230 train (tilde for negation)

Formulas:

1)Function definition

Correct classification for function definition

2)Proper Subset definition

Proper classification, definition of proper subset

PARAMS that may need to be tweaked:

There are a few parameters in the svm code. I need to do more testing to see how tweaking these will affect classification results.

Number of words: This is the size of the "vocabulary" created by k-means. It was originally set to 300, but I increased it to 600. I'm not sure yet if this helped or hurt; more testing is needed.

K-means algorithm: The vl_feat implementation of k-means can use the standard Lloyd's algorithm or UCSD professor Elkan's accelerated k-means algorithm. Current results use Elkan's algorithm, but I'm not sure if this sped-up version gives the same results for our (relatively) small dataset.

Kernel: The feature map is computed using the vl_homkermap function. This function has a few parameters and potentially other kernel mappings could be swapped for it.

Confusion Matrices

"Hook" negation:

50 training examples, 230 test

230 training, 50 test

Tilde negation:

50 train, 230 test, tilde for negation

230 train, 50 test, tilde for negation

Parsing Results for Holdout Data Formulas

Function Formula (7 symbols from holdout):

50 training ( 0 errors!)

230 training ( 0 errors!)

50 training with tilde for negation (2 errors)

230 training with tilde for negation (1 error, left paren for x)

Proper Subset Formula (30 symbols from holdout):

50 training (7 errors)

230 train (8 errors)

50 training with tilde for negation (4 errors)

230 training with tilde for negation (4 errors)

Current Goals:

Classification seems "good enough", next to do is:

1) Implementing the graph structure/Context Free Grammar for parsing

2) Improving segmentation and extraction

3) Expanding the data to include numbers.

Sunday, May 6, 2012

Non-Maximal Suppression

One problem faced when locating characters in an image is extraneous bounding boxes. When the segmenter is done detecting characters and placing bounding boxes for the image, there will be some false positives found, as well as overlapping bounding boxes.

One way to deal with this problem is Non-Maximal Suppression (NMS). The idea behind NMS is to cluster bounding box groups, and then apply a heuristic to determine the best box in each cluster. I used kmeans to cluster the bounding boxes and, for each cluster, chose the bounding box closest to that cluster's center.

Before NMS, on an input containing 7 characters, there were 21 bounding boxes found. After NMS, there were only 13 bounding boxes left, much closer to the real number of characters input. Here is a plot of the bounding boxes after NMS has been applied. Although it is hard to see a visual difference between the plots before and after NMS, you can see that the boxes left still contain characters.
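The clustering approach described above might look roughly like the following Python sketch. It is a simplified stand-in, not the code that produced these results: it clusters box centers with a plain Lloyd's k-means, picks evenly spaced boxes as initial centers, and takes k as given (how the number of clusters was chosen is not covered here). The box format (x, y, width, height) and all function names are assumptions.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def box_center(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm starting from the given initial centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda i: squared_distance(point, centers[i]))
            clusters[nearest].append(point)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old center if a cluster ends up empty
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return centers


def suppress(boxes, k):
    """Keep, for each of k clusters, the box whose center is nearest the cluster center."""
    points = [box_center(b) for b in boxes]
    step = max(1, len(points) // k)
    centers = kmeans(points, points[::step][:k])
    kept = []
    for center in centers:
        best = min(boxes, key=lambda b: squared_distance(box_center(b), center))
        if best not in kept:
            kept.append(best)
    return kept


# Two characters, each detected three times with slight offsets
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (2, 0, 10, 10),
         (100, 0, 10, 10), (101, 1, 10, 10), (99, 0, 10, 10)]
kept = suppress(boxes, 2)
```

Note that this differs from the usual overlap-based NMS used in object detection, which greedily keeps the highest-scoring box and discards boxes that overlap it too much.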

Image with NMS applied

Goals:
1)   Become familiar with VLFeat
2)   Run cross validation, and test some more data to see how NMS performs with different expressions
3)   Try SIFT vs HOG and compare accuracy results.
4)   Fix the extraction method of data from the bounding boxes. Currently, the centroid of the character in the bounding box is found, and then extracted by carving out the pixels around it in the original image. A more useful approach would be to pad the information contained in the bounding box with whitespace, which will help when extracting characters that are nearby each other, such as exponents, or the range of an integral.

Wednesday, April 25, 2012

!Useful Matlab Tips!

Matlab feature: varargin

Matlab documentation for the feature: http://www.mathworks.com/help/techdoc/ref/varargin.html

Description: This allows you to specify a variable number of arguments as input.

How this helped me:

A major part of our development process was working on parts of the pipeline as scripts to get functionality to work. In this phase, we set default values to load for many things.

After we got basic functionality, this became a slight problem as we had to turn the scripts into functions to piece together all the code.

By specifying variable number of arguments, we were able to keep the script functionality for quick testing while still having the ability to piece things together.

The most useful place this happened was in getDataMat.m
This function takes one of our dataset images, reads in each cell that contains a symbol, and outputs a matrix of the data and the corresponding class descriptions.

getDataMat turns each of these symbols into a matrix

By using varargin, I was able to make it have 3 modes of operation:

1) The original script functionality

2) Ability to specify a directory, then all the *.jpg images from the directory will be grabbed.
EX: [x y] = getDataMat('directory', 'images/logic');

3) Ability to specify a variable number of file paths to read in:
EX: [x y] = getDataMat('images/logic/exist_1.jpg','images/logic/forAll_1.jpg');
[x y] = getDataMat('images/logic/forAll_1.jpg');

HOWTO use/Screenshots of relevant code: (I hope this doesn't contain bugs!!)

To use it, just include varargin as a parameter.
EX: function [data_x data_y] = getDataMat(varargin)

Then do:

nVarargs = length(varargin);

and have an if statement that changes the functionality based on the number of arguments.

Note: varargin is a cell array, so you will have to index it like so: varargin{i}

This can cause some tricky type errors, so be careful!!

Example usage!

By using this feature, tedious coding to play with different subsets of the data was eliminated! I hope it helps you too!

Useful links to help use this feature:

http://blogs.mathworks.com/pick/2008/01/08/advanced-matlab-varargin-and-nargin-variable-inputs-to-a-function/
http://makarandtapaswi.wordpress.com/2009/07/15/varargin-matlab/

Monday, April 23, 2012

Prototype: Iteration 1

All major parts of the pipeline have been touched now, and given basic functionality.

For ease of implementation, we decided that working with a proper subset of 1st order logic would be a reasonable goal. This allows us to get a "working" prototype while side-stepping issues like superscripts/subscripts and bounding box problems. As we get this prototype to increase accuracy on first-order logic, we will look into ways of expanding it to deal with more complex mathematical structures.

The subset of first-order logic that we will be working with for now includes the following symbols:

16 symbols total

It is quite literally a *proper* subset, as it only includes the proper subset symbol! In this scheme equals is created using not and not-equals.

New Dataset:

To help with this goal we have started creating a second toy dataset, made out of 300 examples each of 16 symbols. The scripts that do preprocessing and turn these images into matrices have been turned into functions that automate a lot of what we were tediously doing with matlab commands before.

So far we have 5 symbols finished.

Extraction:

Oren wrote code that takes the bounding boxes from the segmenter and extracts the characters into individual images. We are trying to keep this code as independent of the segmenter as possible. The extractor works well, but it is given lots of garbage by the segmenter.

The next iteration of this code will have to output positional information about the symbols as well, so that a graph structure can be created to help the parser.

Classification:

We have trained on the 5 symbols we have data for (forAll, exist, x, y, R). Accuracy was greater than chance! However, only 2 classes were ever predicted (exist and x), so the 40% accuracy didn't seem very meaningful.

The next step with the classifier is to find new features, and to modify it so that it gives "soft" predictions of the top few most probable classes. This soft output will be helpful for the parser.

Parsing:

I wrote a simple substitution-based linear parser in python using scipy/numpy. This was a task that was much easier to do in python than in matlab. Creating complex graph structures and parsing them is not something I am looking forward to doing in matlab, but would be a joy in python, so I'm planning on using python for this process for the time being.

If we have time, the code may be ported to matlab (or the rest of the matlab code might be ported to python!!)!

Prototype Iteration 1 Demo:

To test the prototype the following formula was scanned:

This sentence is the definition of a function.

We ran it through the segmenter, which output 21 boxes given these 7 symbols.

Of those, 6 were garbage from a smear on the paper. The rest were slight translations of the correct character extraction.

We cheated and filtered out the garbage and repetitions for the sake of a first demo. The seven selected images were then run through the classifier. The classifier's output was then run through the parser, resulting in the following .tex file:

LaTeX generated by parser.

Which results in this after pdflatex:

Our system's prediction

While the original sample's true result should have been:

The correct classification/parsing

TODO:

Classification is the most important now; we need to find features with greater discriminative power
We have decided to read at least one computer vision paper a week in order to get ideas and get a good feel for what we should make our final paper look like.
We need to start work on outputting the graph structure for the parser.

Looking ahead (perhaps too far), the following is a tentative list of how we will expand the scope of this project as the 1st order logic accuracy increases:

1) add numbers

2) add subscripts/superscripts

3) Discrete Math symbols

4) Calculus symbols

5) Linear Algebra (matrices)

6) ???????

7) Profit!!

I expect to get to at least 3, possibly 4. Once work begins on creating the graph structure for parsing, we can see how distant a goal linear algebra is.

Wednesday, April 18, 2012

Pipeline + random updates

To help with planning the project I created a diagram of the planned pipeline for HERC using http://www.diagram.ly/. The most updated pipeline image can be found in the repo here.

As shown, the pre-processing, localization, and classification components have "Basic Functionality" implemented, though they have a long way to go until they have "Decent Functionality".

A current goal is to try to get "Basic Functionality" implemented for all the components (even if they don't exactly "work") so we can do ceiling analysis. Ceiling analysis is where you see how much a part of the pipeline working 100% would increase overall performance of the system. To do this you iteratively hand-feed correct results from each part to the rest of the parts and see how much the entire system's accuracy increases.

This is a method to help prioritize which parts of the pipeline to work on. More information can be found on the Stanford ml-class preview videos site in unit XVIII, video "Ceiling Analysis: What Part of the Pipeline to Work on Next".
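The bookkeeping behind ceiling analysis is just a running difference. In the sketch below the stage names follow our pipeline, but the accuracy numbers are made up purely to show the arithmetic (we have not measured anything like this yet), and the function name is hypothetical.

```python
def ceiling_analysis(baseline, accuracy_with_perfect_stages):
    """Return the overall-accuracy gain attributed to each pipeline stage.

    accuracy_with_perfect_stages: (stage, accuracy) pairs in pipeline order,
    where accuracy is the whole system's accuracy when that stage and all
    earlier stages are hand-fed correct results.
    """
    gains = []
    previous = baseline
    for stage, accuracy in accuracy_with_perfect_stages:
        gains.append((stage, accuracy - previous))
        previous = accuracy
    return gains


# Made-up numbers for illustration only, NOT measurements from this project
baseline = 40
measurements = [('localization', 55), ('extraction', 60),
                ('classification', 95), ('parsing', 100)]
gains = ceiling_analysis(baseline, measurements)
```

With these invented numbers the largest gain would come from classification, so that is where the next chunk of effort would go.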

________________________________________________________________________

Classifier Improvement

Last post, we implemented cross validation and discovered that our classifier didn't perform better than chance. We were just using the raw pixels so this wasn't too surprising.

I found a file on the matlab fileexchange for HOG (Histogram of Oriented Gradients) features. Putting our toy dataset through it transformed a 1000x24963 matrix into a 1000x83 matrix.

This significantly reduced the run time, from 8 minutes for one fold to 32 seconds for all 5 folds.

The accuracy also went up, from 5.7% to 11.4%, which is *better* than chance. Confusion matrices have been generated for each fold and put in mistakes.mat in the repo, but we haven't interpreted them yet to figure out where to go from here.

The speed increase made it easier to explore the effect of different values of the lambda parameter for regularized logistic regression.

High values of lambda didn't help

Exploring closer to .1

The values
[0 0.0900 0.1000 0.1100 0.1250 0.1500 0.2000 0.5000 0.7500 1.0000 10.0000 20.0000] were tested.

The corresponding mean cross-validation accuracies were:

[10.5000 11.2000 11.4000 10.6000 11.2000 10.6000 10.3000 9.9000 10.3000 10.7000 9.6000 9.7000]

In the end .1, the value that lambda had originally been set to, worked best.
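Picking the winner from the two lists above is a one-liner; the snippet below just pairs each lambda with its mean cross-validation accuracy and takes the maximum (variable names are arbitrary).

```python
lambdas = [0, 0.09, 0.1, 0.11, 0.125, 0.15, 0.2, 0.5, 0.75, 1.0, 10.0, 20.0]
accuracies = [10.5, 11.2, 11.4, 10.6, 11.2, 10.6, 10.3, 9.9, 10.3, 10.7, 9.6, 9.7]

# Pair each accuracy with its lambda and keep the pair with the highest accuracy
best_accuracy, best_lambda = max(zip(accuracies, lambdas))
```

The spread between the best and worst settings is under two percentage points, so lambda is clearly not the main lever here.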

Accuracies for different values of lambda at each fold can be found here.

___________________________________________________________________________

Localization Update

We still haven't gotten around to figuring out how to use some of the infty dataset. We need to parse some comma-separated values in .txt files to do this. Some of the dataset, however, was given in raw images. I passed these through the localizer to see how it performed. It does well (which is to be expected, as these aren't noisy handwritten images), but still has the same problems with non-connected parts.

____________________________________________________________________________

Misc

A few things accomplished since Monday:

Cleaned up some code (vectorizing stuff, replacing own implementation of a few things with matlab commands, added/deleted/fixed comments)
Started research on making synthetic data to expand toy dataset
Made folder of each toy dataset sample as an individual image, to help with feature extraction/testing

Sunday, April 15, 2012

Cross-Validation

When training a classifier, data is separated into a training set, and a testing set. The training set is used to learn about the aspects that differentiate items, and the testing set is used to evaluate the results. While it is tempting to use all the data to train the classifier, this can cause problems if the classifier merely memorizes the training set, and fails at classifying novel samples.

This effect is called overfitting.

(image from http://victoranchidin.blogspot.com/2009/10/overfitting-vs-overtraining.html)

In order to avoid overfitting, there is a technique called cross-validation. Cross-validation takes a dataset, breaks it into partitions, and uses them to separately train and test a classifier. By using the data to separately train and test in different combinations, it is easier to see how well a classifier performs.

(image from nltk)

Cross-Validation helps prevent fallacious reasoning when testing hypotheses suggested by the data (http://en.wikipedia.org/wiki/Testing_hypotheses_suggested_by_the_data). The problem occurs when a trend is found in data, then that same data is used to evaluate whether that trend is real. According to Wikipedia, this is sometimes called a type III error.

We first implemented cross-validation the easiest way that came to mind, manually breaking our data into partitions using if statements and careful matrix indexing. There is also a matlab command, cvpartition (http://www.mathworks.com/help/toolbox/stats/cvpartition.html), which breaks your data up for you. Later on, we might switch to this for analyzing our results.
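For comparison, the partitioning itself is short when written as a Python generator. This is a sketch of the general technique rather than our matlab code; the function name is made up, and in practice the samples should be shuffled first so that each fold contains a mix of classes.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first few folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 5 folds over 1000 samples: each fold tests on 200 and trains on 800
folds = list(k_fold_indices(1000, 5))
```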

Results of running 5-fold cross-validation with a few different lambda values.

lambda	fold 1	fold 2	fold 3	fold 4	fold 5	average
0.1	3.5	6.5	8	4	6.5	5.7
.5	4	7.5	4	6	4.5	5.2
1	4	7	5	5	6.5	5.5
10	4	9	5.5	5.5	5	5.9
100	3.5	6	4.5	5.5	4	4.8

As you can see, our accuracy is not very good. With 20 classes, ~5% accuracy isn’t better than a classifier that always outputs the same value. Our low accuracy could be due to our small sample size. Our next step is to collect more data and see if this could improve results.

It also takes a long time to train our classifier (~8 minutes). It might be possible to reduce the size of the image files in order to decrease the runtime, and see if that makes a difference when training with more data.

Sunday, April 8, 2012

Version Control, TODO list

We created github accounts (Kyle, Oren) for version control. Neither of us is currently familiar with github, but we are using this as an opportunity to learn a new and useful tool.

The HERC repo can be found here.

Goals/TODO for the week (mostly collected from previous posts):

Create a detailed specification
Become more familiar with github
Clean up the code we have, increase documentation.
Write a README

Figure out how to use InftyMDB
possibly make toydataset2
finalize planned range of math symbols for dataset

Look into ways to increase accuracy.

Cross validation

Different lambda parameters

Run the code on a larger dataset.
Consider alternate ways to classify (SVM?)

Write code to extract characters based on bounding box to feed into classifier
Read more papers on the character localization/extraction
Think about fixing = vs - problem

Localization/character extraction!

One of the most difficult problems we anticipated going into this project was extracting characters for the classifier. With a little bit of searching I was able to find this post on stackoverflow that had a quick, easy solution.

The general idea is:

1) Make the image matrix binary
2) Select connected regions using matlab's bwlabel function
3) Get bounding boxes for these connected regions.
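The three steps above can be sketched without matlab. The Python below is an illustrative stand-in for bwlabel plus bounding-box extraction, not the code from the stackoverflow post: it assumes the image is already a binary list of rows, uses 8-connectivity, and returns boxes as (min_row, min_col, max_row, max_col). The function name is made up.

```python
from collections import deque


def bounding_boxes(binary):
    """Return (min_row, min_col, max_row, max_col) for each 8-connected region of 1s."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel
            seen[r][c] = True
            queue = deque([(r, c)])
            box = [r, c, r, c]
            while queue:
                y, x = queue.popleft()
                box = [min(box[0], y), min(box[1], x),
                       max(box[2], y), max(box[3], x)]
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append(tuple(box))
    return boxes


# A tiny '=' sign: two bars that are not connected to each other
equals = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
boxes = bounding_boxes(equals)
```

The '=' example comes back as two separate boxes, one per bar, which is exactly the multi-part symbol problem with this approach.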

Modifying that code slightly led to these results:

The bounding boxes for the digits are nearly perfect.

So far the biggest problem to fix with localization is disambiguating = and -. More generally, symbols with multiple non-connected parts are a challenge. I'm thinking of maybe adding a heuristic that searches vertically, over a distance proportional to max(height, width), for composite parts whose bounding boxes should be merged. Another technique entirely may be the answer though, or backing off that problem to the parser. Parsing based on position (=, {super,sub}scripts) is the new biggest challenge overall.

Localization/bounding box/Character extraction TODO:

write code to extract characters based on bounding box to feed into classifier
read more papers on the subject
fix = vs - problem
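The vertical-merge heuristic floated above could look something like this Python sketch. It is untested on real scans and entirely hypothetical: boxes are assumed to be (min_row, min_col, max_row, max_col), and the factor parameter scaling the search distance is an invented tuning knob.

```python
def should_merge(a, b, factor=1.0):
    """True if boxes a and b overlap horizontally and are vertically close.

    Boxes are (min_row, min_col, max_row, max_col). The search distance is
    factor * max(height, width) of box a; `factor` is a hypothetical knob.
    """
    height = a[2] - a[0] + 1
    width = a[3] - a[1] + 1
    reach = factor * max(height, width)
    overlaps_horizontally = a[1] <= b[3] and b[1] <= a[3]
    gap = max(b[0] - a[2], a[0] - b[2]) - 1  # empty rows between the boxes
    return overlaps_horizontally and 0 <= gap <= reach


def merge(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


top_bar = (1, 1, 1, 3)      # upper bar of an '='
bottom_bar = (3, 1, 3, 3)   # lower bar, one empty row below
merged = merge(top_bar, bottom_bar) if should_merge(top_bar, bottom_bar) else None
```

A rule like this would also glue together things that should stay separate, such as a minus sign directly above another line of text, so it would need the distance limit tuned or a check from the parser.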

Handwritten Equation Recognition/Classification