## Logistic Regression – Hands on

Hello!

Today we’ll get our hands dirty and test the logistic regression algorithm. For this post, we are going to use the well-known iris flower data set. This dataset has three classes of flowers, which can be classified according to their sepal width/length and petal width/length. From the dataset source: “One class is linearly separable from the other 2 […]”, which makes this dataset handy for our purposes of binary classification.

## Dataset

This dataset has 150 measurements of sepal width/length and petal width/length, 50 for each type of flower. The first thing to do is understand our data by plotting all combinations between sepal width/length and petal width/length as in the image below:

Figure 1 – Iris flower data visualization (Iris Setosa in blue)

As we can see, Iris Setosa can be linearly separated from the other two classes in every combination of features. In other words, we can draw a straight line that splits the flowers into two groups, and that is our task: finding the equation of that line. Remember, $z = a_0x_0 + a_1x_1 + \dots = a^Tx$ is linear in our example, but it could be any function that allows us to classify, for example, Iris Versicolor against the other two.

## Results

As we can see, the logistic regression classifier was able to classify Iris Setosa without errors.

Now, let’s take a look at the other two classes:

Figure 2 – Iris Versicolor vs all other Iris flowers

Figure 3 – Iris Virginica vs all other Iris flowers

As we can see, both Versicolor and Virginica have samples that overlap each other. If we train and classify on these classes, we’ll get classification errors. For this test I used 117 samples for training and 33 for testing the algorithm.
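The one-vs-all labelling and the 117/33 split can be sketched in a few lines of Python. The post does not say how the split was drawn, so the seeded shuffle and the helper names `one_vs_rest_labels` and `split_train_test` are assumptions made for illustration only.

```python
import random

def one_vs_rest_labels(species, positive):
    """Return 1 for samples of the positive class and 0 for all others."""
    return [1 if s == positive else 0 for s in species]

def split_train_test(rows, n_train, seed=0):
    """Shuffle a copy of rows and cut it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle: an assumption
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 150 iris rows: only the species label matters here.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
labels = one_vs_rest_labels(species, "versicolor")

indices = list(range(len(species)))
train_idx, test_idx = split_train_test(indices, n_train=117)
print(len(train_idx), len(test_idx), sum(labels))  # 117 33 50
```

The same labelling is repeated once per class, so three binary classifiers are trained in total.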

After training, here are the parameters found:

$\alpha = 0.01$ and 10000 iterations

Iris Setosa:  $a_0 = 0.32488, a_1 = 0.52351, a_2 = 1.80415, a_3 = -2.81494, a_4 = -1.30496$

Iris Versicolor: $a_0 = 0.90126, a_1 = 0.68625, a_2 = -1.92044, a_3 = 0.47489, a_4 = -1.42768$

Iris Virginica: $a_0 = -1.2021, a_1 = -2.0900, a_2 = -1.8512, a_3 = 3.0392, a_4 = 2.9099$
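For reference, here is a minimal Python sketch of how parameters like these could be fitted with batch gradient descent, using the $\alpha = 0.01$ and 10000 iterations quoted above. It is not the Octave code behind this post: the function names, the zero initialization and the toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic(X, y, alpha=0.01, iterations=10000):
    """Batch gradient descent on the logistic loss.

    X holds feature rows WITHOUT the bias term; x_0 = 1 is prepended here.
    Returns the parameter list [a_0, a_1, ..., a_n].
    """
    rows = [[1.0] + list(x) for x in X]
    m, n = len(rows), len(rows[0])
    a = [0.0] * n  # zero initialization: an assumption
    for _ in range(iterations):
        grad = [0.0] * n
        for x, target in zip(rows, y):
            err = sigmoid(sum(ai * xi for ai, xi in zip(a, x))) - target
            for j in range(n):
                grad[j] += err * x[j]
        a = [ai - alpha * g / m for ai, g in zip(a, grad)]
    return a

# Toy, centered 1-D data (made up for illustration, not taken from the iris set)
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]

a = train_logistic(X, y)
predictions = [1 if sigmoid(a[0] + a[1] * x[0]) >= 0.5 else 0 for x in X]
print(predictions)
```

For the iris problem, `X` would hold the four measurements of each training sample and `y` the 0/1 labels of the class being separated from the rest.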

We need to remember that:

$z = a_0x_0 +a_1x_1 +a_2x_2 +a_3x_3 +a_4x_4$

$h(x) = \dfrac{1}{1 + e^{-z}}$

where:

$x_0$ : 1

$x_1$ : Sepal length

$x_2$ : Sepal width

$x_3$ : Petal length

$x_4$ : Petal width
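To make the formulas concrete, the Python sketch below evaluates $h(x)$ with the Iris Setosa parameters listed earlier. It assumes the parameters apply to the raw measurements in centimetres (the post does not mention any feature scaling) and uses the usual 0.5 threshold, which is equivalent to checking $z \geq 0$. The function name `hypothesis` is hypothetical.

```python
import math

# Setosa-vs-rest parameters a_0..a_4 as reported in this post
A_SETOSA = [0.32488, 0.52351, 1.80415, -2.81494, -1.30496]

def hypothesis(a, features):
    """Compute h(x) = 1 / (1 + exp(-a^T x)) with x_0 = 1 prepended."""
    x = [1.0] + list(features)
    z = sum(ai * xi for ai, xi in zip(a, x))
    return 1.0 / (1.0 + math.exp(-z))

# Two rows from the iris data set:
# sepal length, sepal width, petal length, petal width (cm)
setosa_sample = [5.1, 3.5, 1.4, 0.2]
versicolor_sample = [7.0, 3.2, 4.7, 1.4]

h_setosa = hypothesis(A_SETOSA, setosa_sample)
h_versicolor = hypothesis(A_SETOSA, versicolor_sample)

# Classify as Setosa when h(x) >= 0.5 (threshold is an assumption)
print(h_setosa, h_setosa >= 0.5)
print(h_versicolor, h_versicolor >= 0.5)
```

Under these assumptions the Setosa row lands well above 0.5 and the Versicolor row well below it.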

Table 1 below summarizes the results:

Table 1 – Results over testing samples

| | True positive / negative | False positive | False negative | Accuracy |
|---|---|---|---|---|
| Iris Setosa | 33 | 0 | 0 | 100% |
| Iris Versicolor | 25 | 2 | 6 | 75.76% |
| Iris Virginica | 32 | 1 | 0 | 96.97% |
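The accuracy column is simply the number of correctly classified samples divided by the total. A short Python check over the Table 1 counts (the helper name `accuracy` is hypothetical):

```python
def accuracy(correct, false_positive, false_negative):
    """Fraction of samples classified correctly (true positives plus true negatives)."""
    total = correct + false_positive + false_negative
    return correct / total

# (correct, false positives, false negatives) over the 33 test samples
table_1 = {
    "Iris Setosa": (33, 0, 0),
    "Iris Versicolor": (25, 2, 6),
    "Iris Virginica": (32, 1, 0),
}

for name, counts in table_1.items():
    print(f"{name}: {100 * accuracy(*counts):.2f}%")
# Iris Setosa: 100.00%
# Iris Versicolor: 75.76%
# Iris Virginica: 96.97%
```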

Most probably we would have worse results if I had used more data for testing. I think I got lucky with the 11 samples selected, which weren’t close to the boundaries. Out of curiosity, I tested again over all data (training and test) and got the following (which we all know we shouldn’t be doing =D):

Table 2 – Results over all samples

| | True positive / negative | False positive | False negative | Accuracy |
|---|---|---|---|---|
| Iris Setosa | 150 | 0 | 0 | 100% |
| Iris Versicolor | 106 | 12 | 32 | 70.67% |
| Iris Virginica | 147 | 3 | 0 | 97.33% |

As usual, I wrote the code in Octave and the source is available on my GitHub page.

In the next post we’ll test our classifier with the MNIST dataset and cover multi-class classification.

See you then!

Marcelo Jo

Marcelo Jo is an electronics engineer with 10+ years of experience in embedded systems, a postgraduate in computer networks and a master’s student in computer vision at Université Laval in Canada. He shares his knowledge in this blog when he is not enjoying his wonderful family – wife and 3 kids. Life couldn’t be better.