When training a classifier, the data is separated into a training set and a testing set. The training set is used to learn the features that differentiate the classes, and the testing set is used to evaluate the results. While it is tempting to use all of the data to train the classifier, doing so can cause problems if the classifier merely memorizes the training set and then fails to classify novel samples. This effect is called overfitting.
To avoid overfitting, there is a technique called cross-validation. Cross-validation breaks a dataset into partitions (folds) and rotates which partition is held out: in each round, the classifier is trained on the remaining partitions and tested on the held-out one. By training and testing on different combinations of the data, it is easier to see how well the classifier generalizes.
(Illustration of cross-validation, image from NLTK.)
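To make the procedure concrete, here is a minimal sketch of what 5-fold cross-validation looks like in MATLAB. The helper functions trainClassifier and classifyAccuracy are placeholders (not our actual code), and X/y/lambda stand for the feature matrix, labels, and regularization parameter.

```matlab
% Minimal 5-fold cross-validation sketch.
% trainClassifier and classifyAccuracy are hypothetical placeholders.
k = 5;
n = size(X, 1);                    % X: one sample per row, y: labels
foldId = mod(randperm(n), k) + 1;  % randomly assign each sample to a fold
acc = zeros(k, 1);
for f = 1:k
    testIdx  = (foldId == f);      % hold out one fold for testing
    trainIdx = ~testIdx;           % train on the remaining k-1 folds
    model  = trainClassifier(X(trainIdx, :), y(trainIdx), lambda);
    acc(f) = classifyAccuracy(model, X(testIdx, :), y(testIdx));
end
mean(acc)                          % average accuracy across the folds
```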
We first implemented cross-validation the easiest way that came to mind, manually breaking our data into partitions with if statements and careful matrix indexing. MATLAB also provides a command, cvpartition (http://www.mathworks.com/help/toolbox/stats/cvpartition.html), which breaks the data up for you; later on, we might switch to it for analyzing our results.
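For reference, the same loop written with cvpartition might look like the sketch below (again with trainClassifier and classifyAccuracy as hypothetical placeholders for our own training and evaluation code). cvpartition also handles stratifying the folds by class label, which the manual indexing does not.

```matlab
% 5-fold cross-validation using cvpartition instead of manual indexing.
c = cvpartition(y, 'KFold', 5);    % stratified 5-fold partition of the labels y
acc = zeros(c.NumTestSets, 1);
for f = 1:c.NumTestSets
    trainIdx = training(c, f);     % logical mask for this fold's training set
    testIdx  = test(c, f);         % logical mask for this fold's test set
    model  = trainClassifier(X(trainIdx, :), y(trainIdx), lambda);
    acc(f) = classifyAccuracy(model, X(testIdx, :), y(testIdx));
end
mean(acc)
```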
Results of running 5-fold cross-validation with a few different lambda values (classification accuracy, in percent):
lambda | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | average |
0.1 | 3.5 | 6.5 | 8 | 4 | 6.5 | 5.7 |
0.5 | 4 | 7.5 | 4 | 6 | 4.5 | 5.2 |
1 | 4 | 7 | 5 | 5 | 6.5 | 5.5 |
10 | 4 | 9 | 5.5 | 5.5 | 5 | 5.9 |
100 | 3.5 | 6 | 4.5 | 5.5 | 4 | 4.8 |
As you can see, our accuracy is not very good. With 20 classes, chance performance is 1/20 = 5%, so ~5% accuracy is no better than a classifier that always outputs the same label. Our low accuracy could be due to our small sample size. Our next step is to collect more data and see if that improves the results.
It also takes a long time to train our classifier (~8 minutes). It might be possible to reduce the size of the image files to decrease the runtime, and then see whether that makes a difference when training with more data.
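One way to try this would be to downsample the images before feature extraction, for example with imresize (Image Processing Toolbox). The folder names, file format, and 0.5 scale factor below are assumptions chosen just for illustration.

```matlab
% Shrink every image in a folder to reduce training time (sketch only).
% 'images', 'images_small', '*.png', and the 0.5 factor are example values.
if ~exist('images_small', 'dir'), mkdir('images_small'); end
files = dir(fullfile('images', '*.png'));
for i = 1:numel(files)
    img   = imread(fullfile('images', files(i).name));
    small = imresize(img, 0.5);    % halve each dimension -> ~4x fewer pixels
    imwrite(small, fullfile('images_small', files(i).name));
end
```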