Hi,

Until now, our algorithm could only perform binary classification; in other words, it could only separate one class from everything else. I was wondering whether we could improve it into a multi-class classifier and use it to classify images.

For this post, I’ll be using the well-known MNIST database for training and classification. There are 10 different classes of handwritten digits, 0 to 9.

## The MNIST dataset

From Wikipedia:

> The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems.

The MNIST database contains 60k training images and 10k testing images. Each image is 28 × 28 pixels with 8 bits of grayscale resolution. The complete database can be found here.

In order to load the images, we’ll use functions already written for Matlab/Octave, which can be found here.
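For readers not on Octave, here is a minimal Python sketch of an equivalent loader. It assumes the raw IDX files from the MNIST site (e.g. `train-images-idx3-ubyte`) have already been downloaded and decompressed; the function names are my own, not from the post's code.

```python
import struct
import numpy as np

def load_mnist_images(path):
    """Parse an MNIST IDX image file into an (N, rows*cols) float array in [0, 1]."""
    with open(path, "rb") as f:
        # IDX image header: magic number 2051, then count, rows, cols (big-endian)
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(n, rows * cols).astype(np.float64) / 255.0

def load_mnist_labels(path):
    """Parse an MNIST IDX label file into an (N,) uint8 array of digits 0-9."""
    with open(path, "rb") as f:
        # IDX label header: magic number 2049, then count
        magic, n = struct.unpack(">II", f.read(8))
        assert magic == 2049, "not an IDX label file"
        return np.frombuffer(f.read(), dtype=np.uint8)
```

Each 28 × 28 image comes back flattened to a 784-element row, which is the shape we need below.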

## Multi-class Logistic Regression

As we know, our logistic regression algorithm can only tell us “yes, it’s most probably X” or “no, it’s most probably not X”. With this in mind, we can build 10 of these classifiers, one for each digit, and classify a digit against the other nine (the one-vs-all strategy).

Recall from the Logistic Regression post that we have the following:

z = a_0x_0 + a_1x_1...

h(x) = g(z) = \dfrac{1}{1 + e^{-z}}

and then

z \ge 0 \to y = 1

z < 0 \to y = 0
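The hypothesis and the decision rule above can be sketched in a few lines of Python (the post's actual code is in Octave; the parameter values here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(a, x):
    """Hypothesis h(x) = g(a^T x); a and x both include the bias term x_0 = 1."""
    return sigmoid(a @ x)

# Decision rule: z >= 0  <=>  h(x) >= 0.5  =>  y = 1
a = np.array([-1.0, 2.0])   # hypothetical parameters a_0, a_1
x = np.array([1.0, 0.8])    # x_0 = 1 (bias), one feature
print(h(a, x) >= 0.5)       # z = -1 + 2*0.8 = 0.6 >= 0, so this prints True
```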

For the Iris dataset, we had four features (sepal length/width and petal length/width), so a was a 1×5 vector. Remember that x_0 is always 1. For the MNIST database, the images are 28×28 pixels, which means we’ll have 784 features.

As we need 10 classifiers, we’ll have to calculate all parameters for ten z equations. In form of vectors we have:

z = a_0x_0 + a_1x_1 ... = a^Tx

In this case, instead of a 1 x M vector a, we’ll have a 10 x M matrix (which we’ll call A_{10x785}), where M is the number of features plus one; in our case, 785. This is easy to do in Octave using the same code from the logistic regression post.
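As a Python sketch of that training step (the post's real code is Octave; `train_one_vs_all` and its defaults are my own names, with the learning rate and zero-initialized A taken from the results section below), batch gradient descent fits one row of A per digit:

```python
import numpy as np

def train_one_vs_all(X, y, num_classes=10, alpha=0.6, iters=200):
    """Batch gradient descent for one-vs-all logistic regression.

    X: (N, M-1) feature matrix (pixel values scaled to [0, 1])
    y: (N,) integer labels
    Returns A of shape (num_classes, M), bias column first.
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])          # prepend x_0 = 1
    A = np.zeros((num_classes, Xb.shape[1]))      # start value 0 for the entire matrix
    for c in range(num_classes):
        t = (y == c).astype(np.float64)           # 1 for class c, 0 for the other nine
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-(Xb @ A[c])))
            A[c] -= alpha / n * (Xb.T @ (h - t))  # gradient of the logistic log-loss
    return A
```

Each pass of the outer loop relabels the data as “this digit vs. everything else” and reruns the same gradient descent we already had.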

First we’ll need to train one classifier per class. In the code we’ll have a loop that repeats the very same gradient descent steps for each class. Once all parameters are found, we classify: we run the 10 classifiers and pick the one with the highest probability. Below is an example for the digit seven, where we can see clearly that h_7 gives a probability of about 99.99%:

h_0 = 7.8545e^{-06}

h_1 = 2.0393e^{-08}

h_2 = 2.2768e^{-04}

h_3 = 1.6831e^{-03}

h_4 = 1.3354e^{-04}

h_5 = 8.2681e^{-05}

h_6 = 3.1222e^{-08}

\mathbf{h_7 = 9.9988e^{-01}}

h_8 = 1.0459e^{-03}

h_9 = 2.0026e^{-02}
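The “pick the highest probability” step is just an argmax over the ten hypotheses. A Python sketch (again, `classify` is a hypothetical name, and the demo matrix below is fabricated just to exercise the function):

```python
import numpy as np

def classify(A, x):
    """Run all 10 classifiers on one image and pick the most probable digit.

    A: (10, 785) parameter matrix; x: (784,) flattened pixel vector.
    Returns the predicted digit and the vector of 10 probabilities.
    """
    xb = np.concatenate([[1.0], x])      # prepend x_0 = 1
    h = 1.0 / (1.0 + np.exp(-(A @ xb)))  # one probability per digit
    return int(np.argmax(h)), h

# Toy demo: a parameter matrix where only classifier 7 has a large bias,
# so any input is classified as a seven.
A_demo = np.zeros((10, 785))
A_demo[7, 0] = 5.0
digit, probs = classify(A_demo, np.zeros(784))
print(digit)  # prints 7
```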

## Results

First I trained the algorithm on all 60k samples with a few different parameter settings. It took some time to run all those calculations!

We can clearly see that the learning rate and the number of iterations have a big impact on the final accuracy of the classifier. As usual, the code is available on GitHub.

### 60k training images

Learning rate \alpha = 0.6

A_{10x785} start value: 0 for entire matrix

iteration: 200

Hits: 9036, Miss: 964. Total: 10000

Multi-class Logistic Regression accuracy: 90.36%

Learning rate \alpha = 0.6

A_{10x785} start value: 0 for entire matrix

iteration: 500

Hits: 8954, Miss: 1046. Total: 10000

Multi-class Logistic Regression accuracy: 89.54%

Learning rate \alpha = 0.6

A_{10x785} start value: 0.1 for entire matrix

iteration: 50

Hits: 8807, Miss: 1193. Total: 10000

Multi-class Logistic Regression accuracy: 88.07%

### 5k training images

Learning rate \alpha = 0.3

A_{10x785} start value: 0 for entire matrix

iteration: 50

Hits: 8607, Miss: 1393. Total: 10000

Multi-class Logistic Regression accuracy: 86.07%
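The hits/miss/accuracy numbers above come from running the trained classifiers over the 10k test images and comparing against the true labels. A Python sketch of that evaluation (function name is mine, not the post's):

```python
import numpy as np

def evaluate(A, X_test, y_test):
    """Count hits and misses over a test set.

    A: (10, M) trained parameter matrix; X_test: (N, M-1); y_test: (N,) labels.
    Returns (hits, misses, accuracy_percent).
    """
    n = X_test.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X_test])
    # argmax of z equals argmax of h, since the sigmoid is monotonic
    pred = np.argmax(Xb @ A.T, axis=1)
    hits = int((pred == y_test).sum())
    return hits, n - hits, 100.0 * hits / n
```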

Bye and until next time!
