
Monday, May 14, 2012

The Quest for the Proper Proper Subset Classification

Switch to SVM 

We’ve switched from logistic regression with HOG features to Support Vector Machines using a variant of SIFT descriptors. 

Our code is a slightly modified version of this vl_feat example code.

Here is a description of what the code does, as far as I currently understand it:

The code first builds a bag-of-visual-words "vocabulary" from PHOW descriptors (a variant of dense SIFT, implemented in vl_feat and described in this paper: Image Classification using Random Forests and Ferns). The vocabulary is created by running k-means on the PHOW descriptors.

The vocabulary comes from a subset of the training data; afterwards, every training sample is given a feature description based on its PHOW descriptors and the vocabulary.
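
Here's a rough sketch of that vocabulary step (the vl_feat calls mirror the demo code we started from, but trainImages is a placeholder and the parameter values are just examples):

    % Build the bag-of-visual-words vocabulary from PHOW descriptors (sketch).
    % Assumes vl_feat is on the path; trainImages and numWords are placeholders.
    numWords = 600;
    descrs = {};
    for i = 1:numel(trainImages)
      im = im2single(imread(trainImages{i}));
      [~, d] = vl_phow(im, 'Step', 3, 'Sizes', [4 6 8 10]);   % dense SIFT at several scales
      descrs{end+1} = single(d);
    end
    descrs = cat(2, descrs{:});                                % one descriptor per column
    vocab  = vl_kmeans(descrs, numWords, 'Algorithm', 'Elkan');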

An explicit feature map approximating the chi-squared kernel is then computed for each histogram of visual words.
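Continuing the sketch above: each image gets encoded as a histogram of visual-word counts, and vl_homkermap expands that histogram into an explicit feature vector approximating the chi-squared kernel (the kd-tree quantization mirrors the vl_feat demo; the variable names are mine):

    % Encode one image as a visual-word histogram, then apply the
    % chi-squared homogeneous kernel map (sketch, continuing from above).
    kdtree = vl_kdtreebuild(vocab);
    [~, d] = vl_phow(im2single(im), 'Step', 3, 'Sizes', [4 6 8 10]);
    words  = vl_kdtreequery(kdtree, vocab, single(d));     % nearest visual word per descriptor
    h      = histc(double(words), 1:numWords);             % word counts
    h      = single(h(:) / sum(h));                        % L1-normalized histogram (column)
    psi    = vl_homkermap(h, 1, 'kernel', 'kchi2');        % explicit chi^2 feature map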


Classification is done one-vs-all. I'm thinking about changing this to training a set of binary classifiers and then combining them with a decision tree or a voting scheme. I also need to make the output soft (per-class scores rather than a single hard label) so the parser can use it.
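A minimal sketch of the one-vs-all loop and of keeping the raw scores around as the soft output (vl_svmtrain is the linear SVM trainer in current vl_feat releases, so the demo we adapted may use an older equivalent; psix stands for the matrix of kernel-mapped features, and lambda is just a placeholder value):

    % One-vs-all linear SVMs on the kernel-mapped features (sketch).
    % psix: numFeatures x numTrain, labels: 1 x numTrain; lambda is a guess.
    numClasses = 17;
    lambda = 1e-4;
    W = zeros(size(psix, 1), numClasses);
    B = zeros(1, numClasses);
    for c = 1:numClasses
      y = 2 * (labels == c) - 1;                 % +1 for class c, -1 for everything else
      [W(:, c), B(c)] = vl_svmtrain(psix, y, lambda);
    end
    scores = W' * psixTest + repmat(B', 1, size(psixTest, 2));   % soft output for the parser
    [~, predicted] = max(scores, [], 1);                         % hard label when needed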

This is a good tutorial for SVMs.





Experimental Results


Methodology: We have 300 samples for each of the 17 classes. I wrote a Python script that randomly selects 20 examples from each class to serve as "holdout" data. I haven't modified the code yet to do cross-validation.
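The split itself is simple; here is roughly the same logic in MATLAB (the real split is done by the Python script, and samplesByClass is an assumed cell array with 300 file names per class):

    % Hold out 20 random samples per class (sketch of the split logic).
    rng(0);                                    % fix the seed so the split is repeatable
    holdout = cell(1, 17);
    train   = cell(1, 17);
    for c = 1:17
      idx = randperm(numel(samplesByClass{c}));
      holdout{c} = samplesByClass{c}(idx(1:20));
      train{c}   = samplesByClass{c}(idx(21:end));
    end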

The 20 holdout samples per class were used to assemble several first-order logic formulas (the definition of a function and the definition of a proper subset). Once segmentation and extraction get better, genuinely novel data will be used instead of these holdout samples to test the effectiveness of the system.

At first I used 50 random samples per class for training and the remaining 230 for testing. Most of the errors came from symbols being misclassified as negation.

Looking at the (scores, classes) output of the one-vs-all SVM classifier, I saw that for these "negation errors" the correct label was usually the second most likely.


Here is an example of the scores for a "z" that was mistaken for a "negation":

    scores   class
   -0.2888   11 (neg)        
   -0.3795   17 (z)      
   -0.4663    9 (left paren)        
   -0.4823    7 (if then)        
   -0.5354   12 (or)            
   -0.5367   16 (y)        
   -0.5602   10 (ne)          
   -0.5620   13 (right paren)            
   -0.5640   14 (subset)        
   -0.5786    4 (elem)      
   -0.6220   15 (x)        
   -0.6455    2 (R)          
   -0.6618    5 (exist)      
   -0.6747    3 (and)      
   -0.6786    1 (F)      
   -0.7181    6 (for all)    
   -0.7594    8 (iff)      
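
A quick way to quantify that observation is to check, for every misclassified test sample, whether the true class was the runner-up; a small sketch (scores and trueLabels are assumed variable names):

    % Fraction of errors where the correct class was ranked second (sketch).
    % scores: numClasses x numTest, trueLabels: 1 x numTest.
    [~, ranked] = sort(scores, 1, 'descend');
    top1  = ranked(1, :);
    top2  = ranked(2, :);
    wrong = (top1 ~= trueLabels);
    frac  = sum(wrong & (top2 == trueLabels)) / max(sum(wrong), 1);
    fprintf('Errors where the true class was 2nd: %.1f%%\n', 100 * frac);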




I decided to create alternate samples that used a tilde instead of the hook-like symbol for negation, to see how that would affect the results.

In the end I trained 4 models and tested each of them on 2 different formulas:

Models:
1) 50 train
2) 230 train
3) 50 train (tilde for negation)
4) 230 train (tilde for negation)

Formulas:

1) Function definition

Correct classification for the function definition

2) Proper subset definition

Correct classification for the definition of proper subset




Parameters that may need to be tweaked:


There are a few parameters in the SVM code. I need to do more testing to see how tweaking these affects classification results.

Number of words: the size of the "vocabulary" created by k-means. It was originally set to 300, but I increased it to 600. I'm not sure yet whether this helped or hurt; more testing is needed.

K-means algorithm: the vl_feat implementation of k-means can use either the standard Lloyd's algorithm or UCSD professor Elkan's accelerated k-means algorithm. Current results use Elkan's algorithm, but I'm not sure whether the sped-up version gives the same results on our (relatively) small dataset.

Kernel: the feature map is computed with the vl_homkermap function. It has a few parameters of its own, and other kernel maps could potentially be swapped in.
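
For reference, here is roughly where each of those parameters lives in the code (a sketch only; the option names are vl_feat's, and the specific values are either the ones mentioned above or guesses):

    % Where the tweakable parameters show up (sketch).
    numWords = 600;                                         % vocabulary size, was 300 originally
    vocab = vl_kmeans(descrs, numWords, ...
                      'Algorithm', 'Elkan');                % or 'Lloyd'
    psix  = vl_homkermap(hists, 1, ...                      % order of the approximation
                         'kernel', 'kchi2', ...             % chi-squared kernel map
                         'gamma', 0.5);                      % homogeneity degree (a guess)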



Confusion Matrices




"Hook" negation:
50 training examples, 230 test




230 training, 50 test




Tilde negation:


50 train, 230 test, tilde for negation



230 train, 50 test, tilde for negation




Parsing Results for Holdout Data Formulas






Function Formula (7 symbols from holdout):


50 training (0 errors!)



230 training (0 errors!)


50 training with tilde for negation (2 errors)



230 training with tilde for negation (1 error, left paren for x)




Proper Subset Formula (30 symbols from holdout):


50 training (7 errors)





230 train (8 errors)







50 training with tilde for negation (4 errors)
230 training with tilde for negation (4 errors)


Current Goals:






Classification seems "good enough"; the next things to do are:

1) Implementing the graph structure/Context Free Grammar for parsing
2) Improving segmentation and extraction 
3) Expanding the data to include numbers.


Wednesday, April 18, 2012

Pipeline + random updates

To help with planning the project, I created a diagram of the planned pipeline for HERC using http://www.diagram.ly/. The most up-to-date pipeline image can be found in the repo here.





As shown, the pre-processing, localization, and classification components have "Basic Functionality" implemented, though they have a long way to go until they have "Decent Functionality".

A current goal is to get "Basic Functionality" implemented for all of the components (even if they don't exactly "work") so we can do ceiling analysis. Ceiling analysis measures how much the overall performance of the system would increase if one part of the pipeline worked perfectly. To do this, you iteratively hand-feed correct results from each part to the remaining parts and see how much the entire system's accuracy increases.

This is a method to help prioritize which parts of the pipeline to work on. More information can be found on the Stanford ml-class preview videos site, in the unit XVIII video "Ceiling Analysis: What Part of the Pipeline to Work on Next".
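
A rough sketch of what that ceiling-analysis loop looks like (everything here, including runPipeline and the stage list, is hypothetical scaffolding rather than code from the repo):

    % Ceiling analysis sketch: progressively substitute ground truth for each
    % stage's output and record end-to-end accuracy (all names hypothetical).
    stages = {'preprocessing', 'localization', 'classification', 'parsing'};
    acc = zeros(1, numel(stages) + 1);
    acc(1) = runPipeline(testSet, {});                % baseline: no ground truth injected
    for s = 1:numel(stages)
      % give the system perfect output for stages 1..s, run the rest normally
      acc(s + 1) = runPipeline(testSet, stages(1:s));
    end
    % The jump from acc(s) to acc(s+1) estimates how much perfecting stage s is worth.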

________________________________________________________________________


Classifier Improvement



Last post, we implemented cross-validation and discovered that our classifier didn't perform better than chance. We were just using raw pixels, so this wasn't too surprising.

I found a file on the MATLAB File Exchange for HOG (Histogram of Oriented Gradients) features. Running our toy dataset through it transformed a 1000x24963 matrix into a 1000x83 matrix.

This made things dramatically faster: from 8 minutes for a single fold down to 32 seconds for all 5 folds.

The accuracy also went up, from 5.7% to 11.4%, which is *better* than chance. Confusion matrices have been generated for each fold and put in mistakes.mat in the repo, but we haven't interpreted them yet to figure out where to go from here.
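
As a sketch of how the feature matrix gets built (the File Exchange function is assumed here to be called HOG and to return one descriptor per grayscale image; the exact name and output length depend on the particular submission):

    % Build the HOG feature matrix for the toy dataset (sketch).
    % images: cell array of file names; HOG: the File Exchange function (assumed name).
    X = [];
    for i = 1:numel(images)
      im = im2double(imread(images{i}));
      if size(im, 3) == 3, im = rgb2gray(im); end
      f = HOG(im);                       % ~83-dimensional descriptor in our case
      X(i, :) = f(:)';                   % one row per sample -> 1000 x 83 overall
    end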




The speed increase made it easier to explore the effect of different values of the lambda parameter for regularized logistic regression.





High values of lambda didn't help

Exploring closer to .1





The following lambda values were tested, with the corresponding mean cross-validation accuracies:

    lambda    mean CV accuracy (%)
    0         10.5
    0.09      11.2
    0.10      11.4
    0.11      10.6
    0.125     11.2
    0.15      10.6
    0.20      10.3
    0.50       9.9
    0.75      10.3
    1.00      10.7
    10         9.6
    20         9.7



In the end, 0.1, the value lambda had originally been set to, worked best.

Accuracies for different values of lambda at each fold can be found here.
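
The sweep itself is just a loop over lambda values around the cross-validation we already had; a sketch (crossValAccuracy is a stand-in name for our 5-fold CV routine):

    % Sweep lambda for regularized logistic regression (sketch).
    % crossValAccuracy is a stand-in for our 5-fold cross-validation routine.
    lambdas = [0 0.09 0.1 0.11 0.125 0.15 0.2 0.5 0.75 1 10 20];
    meanAcc = zeros(size(lambdas));
    for i = 1:numel(lambdas)
      foldAcc    = crossValAccuracy(X, y, 5, lambdas(i));   % accuracy on each fold
      meanAcc(i) = mean(foldAcc);
    end
    [best, bestIdx] = max(meanAcc);
    fprintf('Best lambda = %g (mean CV accuracy %.1f%%)\n', lambdas(bestIdx), best);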


___________________________________________________________________________


Localization Update



We still haven't gotten around to figuring out how to use some of the infty dataset; we need to parse some comma-separated values in .txt files to do that. Some of the dataset, however, was given as raw images. I passed these through the localizer to see how it performed. It does well (which is to be expected, since these aren't noisy handwritten images), but it still has the same problems with non-connected parts.
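
For the .txt parsing, textscan should be enough once we settle on the column layout; a sketch (the field order here is a placeholder, not the actual infty format):

    % Read a comma-separated .txt annotation file (sketch; the column layout
    % is a placeholder, not the real infty format).
    fid = fopen('annotations.txt');
    C   = textscan(fid, '%d %s %d %d %d %d', 'Delimiter', ',');   % id, label, bounding box
    fclose(fid);
    ids    = C{1};
    labels = C{2};
    boxes  = [C{3} C{4} C{5} C{6}];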


 ____________________________________________________________________________


Misc


A few things accomplished since Monday:

  • Cleaned up some code (vectorized things, replaced our own implementations of a few operations with built-in MATLAB commands, added/deleted/fixed comments)
  • Started research on making synthetic data to expand toy dataset
  • Made a folder with each toy dataset sample as an individual image, to help with feature extraction/testing