Hello!
Today we’ll get our hands dirty and test the logistic regression algorithm. For this post we are going to use the well-known iris flower dataset. This dataset has three classes of flowers which can be classified according to their sepal width/length and petal width/length. From the dataset source, “One class is linearly separable from the other 2 […]”, which makes this dataset handy for our purpose of binary classification.
Dataset
This dataset has 150 measurements of sepal width/length and petal width/length, 50 for each type of flower. The first thing to do is to understand our data by plotting all pairwise combinations of sepal width/length and petal width/length, as in the image below:
Figure 1 – Iris flower data visualization (Iris Setosa in blue)
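If you want to reproduce one panel of that grid yourself, here's a minimal Octave sketch. It assumes the measurements are loaded into a 150×4 matrix X (columns: sepal length, sepal width, petal length, petal width) with the 50 Setosa samples first; the variable names are my assumptions, not necessarily the ones in the post's code:

```octave
% Hypothetical layout: rows 1-50 Setosa, 51-150 the other two classes.
setosa = 1:50;
others = 51:150;
plot(X(setosa, 3), X(setosa, 4), 'bo', ...   % Setosa in blue circles
     X(others, 3), X(others, 4), 'rx');      % other classes in red crosses
xlabel('Petal length (cm)');
ylabel('Petal width (cm)');
legend('Iris Setosa', 'Other two classes');
```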
As we can see, the Iris Setosa flower can be linearly separated from the other two classes in every combination of features. In other words, we can draw a straight line which splits the flowers into two different groups, and that is exactly what we must do: find the equation of that boundary. Remember, z = a_0x_0 + a_1x_1 + \dots = a^Tx is linear in our example, but it could be any function that lets us classify, say, Iris Versicolor against the other two.
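To see why a straight line does the job: the hypothesis crosses 0.5 exactly where z = 0, so with two features the boundary z = a_0 + a_1x_1 + a_2x_2 = 0 can be solved for x_2 and drawn directly. A quick sketch, with illustrative placeholder weights (not the trained ones):

```octave
% The decision boundary h(x) = 0.5 is exactly z = a0 + a1*x1 + a2*x2 = 0,
% i.e. the line x2 = -(a0 + a1*x1)/a2. Weights below are placeholders.
a0 = -3; a1 = 0.5; a2 = 1.0;
x1 = linspace(1, 7, 100);          % e.g. a petal-length range
x2 = -(a0 + a1 .* x1) ./ a2;       % points where z = 0
plot(x1, x2, 'k-');                % every point on this line scores 0.5
```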
Results
As we can see, the logistic regression classifier was able to classify Iris Setosa without errors.
Now, let’s take a look at the other two classes:
Figure 2 – Iris Versicolor vs all other Iris flowers
Figure 3 – Iris Virginica vs all other Iris flowers
As we can see, both Versicolor and Virginica have samples which overlap each other, so if we try to train and classify we’ll get some classification errors. For this test I used 117 samples for training and 33 for testing the algorithm.
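The post doesn't show how the split was made; one simple way to carve out 117 training and 33 test samples in Octave is to shuffle the row indices. The variable names here are assumptions for illustration:

```octave
% Hypothetical split: shuffle the 150 row indices, hold out 33 for testing.
m = 150;
idx = randperm(m);
test_idx  = idx(1:33);
train_idx = idx(34:end);          % the remaining 117 samples
Xtrain = X(train_idx, :);  ytrain = y(train_idx);
Xtest  = X(test_idx, :);   ytest  = y(test_idx);
```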
After training, here are the parameters found:
\alpha = 0.01 and 10000 iterations
Iris Setosa: a_0 = 0.32488, a_1 = 0.52351, a_2 = 1.80415, a_3 = -2.81494, a_4 = -1.30496
Iris Versicolor: a_0 = 0.90126, a_1 = 0.68625, a_2 = -1.92044, a_3 = 0.47489, a_4 = -1.42768
Iris Virginica: a_0 = -1.2021, a_1 = -2.0900, a_2 = -1.8512, a_3 = 3.0392, a_4 = 2.9099
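These parameters come out of plain batch gradient descent on the log-loss. Here is a minimal sketch of such a training loop with the same \alpha and iteration count; it assumes Xtrain is the 117×5 design matrix with a leading column of ones and ytrain a 0/1 vector for the class being trained, and is not necessarily the exact code from the repository:

```octave
% Batch gradient descent for logistic regression (sketch).
sigmoid = @(z) 1 ./ (1 + exp(-z));
alpha = 0.01;                     % learning rate from the post
iters = 10000;                    % iteration count from the post
m = rows(Xtrain);
a = zeros(columns(Xtrain), 1);    % parameter vector a_0 .. a_4
for k = 1:iters
  h = sigmoid(Xtrain * a);                        % hypothesis, all samples
  a = a - alpha * (Xtrain' * (h - ytrain)) / m;   % gradient step
end
```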
We need to remember that:
z = a_0x_0 + a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4

h(x) = \dfrac{1}{1 + e^{-z}}
where:
x_0 : 1
x_1 : Sepal length
x_2 : Sepal width
x_3 : Petal length
x_4 : Petal width
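To make this concrete, plugging the learned Iris Setosa parameters into h(x) for the dataset's first sample (a Setosa with sepal 5.1 × 3.5 cm and petal 1.4 × 0.2 cm) yields roughly 0.99, well above the 0.5 threshold:

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));
a = [0.32488; 0.52351; 1.80415; -2.81494; -1.30496];  % Setosa parameters
x = [1; 5.1; 3.5; 1.4; 0.2];   % x_0 = 1, then the four measurements
h = sigmoid(a' * x)            % approx 0.99 -> classified as Setosa
```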
Table 1 below summarizes the results:
Table 1 – Results over testing samples
| Class | True positive / negative | False positive | False negative | Accuracy |
|---|---|---|---|---|
| Iris Setosa | 33 | 0 | 0 | 100% |
| Iris Versicolor | 25 | 2 | 6 | 75.76% |
| Iris Virginica | 32 | 1 | 0 | 96.97% |
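For reference, the table's counts can be computed like this, continuing the hypothetical variable names from the sketches above:

```octave
% ytest is a 0/1 label vector over the 33 test samples.
sigmoid = @(z) 1 ./ (1 + exp(-z));
pred  = sigmoid(Xtest * a) >= 0.5;     % thresholded predictions
tp_tn = sum(pred == ytest);            % true positives + true negatives
fp    = sum(pred == 1 & ytest == 0);   % false positives
fn    = sum(pred == 0 & ytest == 1);   % false negatives
accuracy = 100 * tp_tn / numel(ytest);
```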
Most probably we would get worse results if I used more data for testing. I think I got lucky with the 11 test samples selected per class, which weren’t close to the boundaries. Out of curiosity, I tested again over all the data (training and test combined) and got the following (which we all know we shouldn’t be doing =D):
Table 2 – Results over all samples
| Class | True positive / negative | False positive | False negative | Accuracy |
|---|---|---|---|---|
| Iris Setosa | 150 | 0 | 0 | 100% |
| Iris Versicolor | 106 | 12 | 32 | 70.67% |
| Iris Virginica | 147 | 3 | 0 | 97.33% |
As usual, I wrote the code in Octave and the source is available on my GitHub page.
In the next post we’ll test our classifier on the MNIST dataset and cover multi-class classification.
See you then!