Linear Regression with Normal Equation

Hello,

In this post we’ll show another way different to solve the problem of error minimization in linear regression. Instead of using gradient descent, we could solve this linear system using matrices.

Normal equation

As we know, a system of linear equations can be solved by matrices. In this case, we don’t need to define learning rate ${\textstyle \alpha}$ , we don’t need to iterate, we don’t need to scale features.

We minimize the square errors by taking the derivatives and setting them to zero and then we solve this linear system with the following matrix equation:

a = (X^T X)^{-1} X^T y

Remember: $a$ is a vector $a_0, a_1, ... , a_n$

How this equation works and why? As seen before, once we define the error function, the whole idea is find a way to minimize it. In our case we are using mean squared error (MSE). We can get the minimum error by taking the derivatives:

\displaystyle \frac{\partial}{\partial a_n} = -\frac{1}{m}\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + ... + a_nx_n) \right ) \text {\hspace{6 mm} \small eq. 1}

and setting them to zero

-\frac{1}{m}\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + ... = a_nx_n) \right ) = 0 \text {\hspace{6 mm} \small eq. 2}

Working on the equation we have

\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + ... + a_nx_n) \right ) = 0 \text {\hspace{14 mm} \small eq. 3}

y - (a_0x_0 + a_1x_1 + ... + a_nx_n) = 0 \text {\hspace{46 mm} \small eq. 4}

and finally

(a_0x_0 + a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5) = y \text {\hspace{25 mm} \small eq. 5}

Now, we can solve this using matrices:

i) $X a = y$

ii) $(X^T X) a = X^T y$

iii) $a = (X^T X)^{-1} X^T y$

You can find a very good (and very long) explanation on how it works here.

I wrote a script on octave to calculate the parameters using normal equation method with both datasets used for simple and multiple linear regression. Here is the results compared to those using gradient descent:

Simple linear regression

Starting gradient descent algorithm with $a_0 = 0$ and $a_1 = 0$ and 10000 iterations

	Normal equation	Gradient descent
$a_0$	1.2921	1.4826
$a_1$	9.7777	0.065248
Error	53.776	55.529

We can see that gradient descent finally converged to a local minimum.

Multiple linear regression

Starting the algorithm with $a_0 = 0$ , $a_1 = 0$ and $a_2 = 0$ and 10000 iterations

	Normal equation	Gradient descent
$a_0$	77.09660	73.632116
$a_1$	0.33745	0.652041
$a_2$	-1.01820	-0.859574
Error	11.529	11.716

I did another test with 100,000 iterations and for multiple linear regression it finally converged to the global minimum since I got the very same parameters.

Conclusion

As we can see, normal equation is another way to solve linear regression. It’s more direct, it finds the optimum solution and we don’t need any parameters to be tuned. But why we don’t use it always? Why do we need gradient descent?

Well, as anything is perfect, normal equation need much more processing that gradient descent. For few features (less than 10,000 accordingly to Andrew Ng) it’s acceptable but more than that, we can start thinking about gradient descent. Gradient descent can be thought as an approximation of the ideal solution since it may or may not converge to the global minimum.

See ya!

Marcelo Jo

Marcelo Jo is an electronics engineer with 10+ years of experience in embedded system, postgraduate in computer networks and masters student in computer vision at Université Laval in Canada. He shares his knowledge in this blog when he is not enjoying his wonderful family – wife and 3 kids. Live couldn’t be better.

Marcelo Jo