Hi folks,
In the last post we talked about simple linear regression, where we calculated the “trend line” of a set of points for a single variable. But what if we have more than a single variable? How can we solve it?
Imagine that we have a database of student grades with the following data: grade, country, school name, grade level, student sex and student age. With multiple linear regression we could try to predict the grade of a student knowing the student’s country, school name, grade level, sex and age. Of course we don’t need to use all of this information; actually, one of the main skills of a data scientist is finding the best data to use. Most of the time we have irrelevant data, or even data that must be worked on before it becomes useful. An obvious example is the student’s name, which is not (or shouldn’t be) correlated with the grade.
Multiple vs Simple
In simple linear regression we had y = ax + b. In the last post’s example, y was the car price and x was the year of manufacture.
Now, for our student grade example, we would have something like
y = a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5 + b
where:
y = student grade
x_1 = country
x_2 = school name
x_3 = grade level
x_4 = student sex
x_5 = student age
Do you remember the error function that we had to minimize?
\displaystyle error = \frac{1}{2m}\sum_{i=1}^{m} \left (y - h(x) \right )^2 \text {\hspace{10 mm} \small eq. 1}
We can still minimize it with gradient descent, just as we did for simple linear regression. But instead of using
h(x) = ax + b
we’ll use
h(x) = a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5 + b.
Or even better, we’ll use
h(x) = a_0x_0 + a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5.
where a_0 = b and x_0 = 1. This will simplify the partial derivatives and the calculations in Octave later on.
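To make this concrete, here is a minimal Octave sketch of the idea; the variable names are mine and not necessarily the ones used in the repository linked at the end:

% X_raw : m x 5 matrix with one student per row (country, school name,
%         grade level, sex, age), already encoded as numbers
% y     : m x 1 vector with the grades
% a     : 6 x 1 parameter vector [a_0; a_1; ...; a_5]
m = size(X_raw, 1);
X = [ones(m, 1), X_raw];              % first column is x_0 = 1 for every sample
h = X * a;                            % h(x) for every sample in a single matrix product
cost = sum((y - h) .^ 2) / (2*m);     % eq. 1, the error we want to minimize

This is exactly why the x_0 trick is convenient: with the extra column of ones, evaluating h(x) for all samples becomes a single matrix product.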
Gradient Descent
It’s the same approach; the only difference is that we’ll have more partial derivatives, as shown below:
\displaystyle \frac{\partial\,error}{\partial a_n} = -\frac{1}{m}\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5) \right ) \text {\hspace{6 mm} \small eq. 2}
In the partial derivative \displaystyle \frac{\partial\,error}{\partial a_n} , we use n, which can be 0, 1, 2, 3, 4 or 5, so that we don’t have to write six separate equations. Don’t forget that x_0 is always 1.
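In Octave, reusing the X, y and a variables from the sketch above, all six partial derivatives of eq. 2 can be computed at once:

grad = -(1/m) * X' * (y - X * a);   % 6 x 1 vector; grad(n+1) is the partial derivative with respect to a_n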
Updating the parameters follows the same rule as for simple linear regression:
\displaystyle new \textunderscore a_n = a_n - \alpha\frac{\partial\,error}{\partial a_n} \text {\hspace{8 mm} \small eq. 4}
where \alpha is the learning rate and n can be 0, 1, 2, 3, 4 or 5.
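Putting eq. 2 and eq. 4 together, a gradient descent loop in Octave could look like the sketch below; alpha and num_iters are illustrative values, not the ones from my actual code:

alpha = 0.01;                        % learning rate
num_iters = 1000;                    % illustrative number of iterations
cost_history = zeros(num_iters, 1);  % to check convergence later

for it = 1:num_iters
  h = X * a;                                     % current hypothesis for all samples
  grad = -(1/m) * X' * (y - h);                  % eq. 2: all partial derivatives at once
  a = a - alpha * grad;                          % eq. 4: update every a_n
  cost_history(it) = sum((y - h) .^ 2) / (2*m);  % eq. 1
end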
Results
It’s hard to visualize what happens since we have more than 3 dimensions, so I ran tests with only two features (age and sex), and the results are in the images below.
First we can see how grade depends on age and sex.
Figure 1 – grade vs age and sex
Of course, in this example sex is not a continuous variable (the only two possible values are 0 and 1), but the idea for continuous variables is still the same.
We can see that our algorithm converges as the iterations progress, which is a good sign. By the way, for this example the learning rate \alpha was 0.01 instead of the 0.001 used in the last post.
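If you keep the cost of every iteration (the cost_history vector from the sketch above), checking convergence in Octave is just a matter of plotting it:

plot(1:num_iters, cost_history);
xlabel('iteration');
ylabel('error (eq. 1)');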
As we are doing a linear regression, we expect to end up with a plane that passes through the sample points right in the “middle”, in other words, right where the mean squared error (MSE) is minimized. Never forget that we could end up in a local minimum instead of the global minimum.
Figure 3 – Linear regression result (age)
Figure 4 – Linear regression result (sex)
In figure 3 we can see how the plane divides the age samples, and in figure 4 how it divides the sex samples. So, in multiple linear regression, we try to find the equation that minimizes the MSE with respect to all the selected features.
I did everything in Octave, and if you are interested in the source code, check my GitHub page at https://github.com/marcelojo/multi_linear_regression.git.
Bye