Hello,

In this post we'll show a different way to solve the problem of error minimization in linear regression. Instead of using gradient descent, we can solve the linear system directly with matrices.

## Normal equation

As we know, a system of linear equations can be solved with matrices. In this case we don't need to define a learning rate \alpha, we don't need to iterate, and we don't need to scale features.

We minimize the squared errors by taking the derivatives, setting them to zero, and then solving the resulting linear system with the following matrix equation:

a = (X^T X)^{-1} X^T y

**Remember**: a is the vector (a_0, a_1, \dots, a_n)
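As a quick sketch, the formula can be computed directly with NumPy (the original script was in Octave; the tiny dataset below is made up for illustration, not one of the post's datasets):

```python
import numpy as np

# Made-up data for illustration: y is roughly 2x with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix: x_0 is a column of ones, so a_0 acts as the intercept.
X = np.column_stack([np.ones_like(x), x])

# a = (X^T X)^{-1} X^T y
a = np.linalg.inv(X.T @ X) @ X.T @ y

print(a)  # a[0] is the intercept a_0, a[1] is the slope a_1
```

No learning rate, no iterations, no feature scaling: one matrix expression returns the parameters.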

How and why does this equation work? As seen before, once we define the error function, the whole idea is to find a way to minimize it. In our case we are using the mean squared error (MSE). We can find the minimum error by taking the derivatives:

\displaystyle \frac{\partial\,\mathrm{MSE}}{\partial a_n} = -\frac{1}{m}\sum_{i=1}^{m} x_n^{(i)} \left( y^{(i)} - (a_0x_0^{(i)} + a_1x_1^{(i)} + \dots + a_nx_n^{(i)}) \right) \qquad \text{(eq. 1)}

and setting them to zero

-\frac{1}{m}\sum_{i=1}^{m} x_n^{(i)} \left( y^{(i)} - (a_0x_0^{(i)} + a_1x_1^{(i)} + \dots + a_nx_n^{(i)}) \right) = 0 \qquad \text{(eq. 2)}

Working on the equation we have

\sum_{i=1}^{m} x_n^{(i)} \left( y^{(i)} - (a_0x_0^{(i)} + a_1x_1^{(i)} + \dots + a_nx_n^{(i)}) \right) = 0 \qquad \text{(eq. 3)}

y - (a_0x_0 + a_1x_1 + \dots + a_nx_n) = 0 \qquad \text{(eq. 4)}

and finally

(a_0x_0 + a_1x_1 + a_2x_2 + \dots + a_nx_n) = y \qquad \text{(eq. 5)}

Now, we can solve this using matrices. Strictly speaking, eq. 5 cannot hold exactly for every sample; stacking one copy of it per data point gives the overdetermined system in step i), and multiplying both sides by X^T in step ii) is what enforces the zero-derivative conditions of eq. 3:

i) X a = y

ii) (X^T X) a = X^T y

iii) a = (X^T X)^{-1} X^T y
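In code, step ii) is actually the one worth implementing: rather than forming the inverse from step iii), it is numerically safer to hand the system (X^T X) a = X^T y to a linear solver. A minimal NumPy sketch (the data here is assumed, not from the post):

```python
import numpy as np

# Assumed toy data: intercept column x_0 = 1 plus two features.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 3.0]])
y = np.array([5.0, 8.0, 12.0, 16.0])

# Step ii): solve (X^T X) a = X^T y without ever computing the inverse.
a = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem directly from X a ~ y.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(a, a_lstsq))  # both routes give the same parameters
```

Either way, the result satisfies the zero-derivative conditions: the residual y - X a is orthogonal to every column of X.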

You can find a very good (and very long) explanation of how it works here.

I wrote a script in Octave to calculate the parameters using the normal equation method with both datasets used for simple and multiple linear regression. Here are the results compared to those obtained with gradient descent:

**Simple linear regression**

Starting the gradient descent algorithm with a_0 = 0 and a_1 = 0, and running 10,000 iterations:

| | Normal equation | Gradient descent |
|---|---|---|
| a_0 | 1.2921 | 1.4826 |
| a_1 | 9.7777 | 0.065248 |
| Error | 53.776 | 55.529 |

We can see that after 10,000 iterations gradient descent still had not fully converged: its a_1 is far from the normal equation's solution.
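This effect is easy to reproduce. Here is a rough sketch (with made-up data, not the post's dataset) of batch gradient descent next to the normal equation: with too few iterations the parameters are still off, while with enough iterations they match the closed-form solution:

```python
import numpy as np

# Made-up data: a line y = 1.3 + 9.8x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1.3 + 9.8 * x + rng.normal(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
m = len(y)

# Closed-form parameters via the normal equation.
a_ne = np.linalg.solve(X.T @ X, X.T @ y)

def gradient_descent(iters, alpha=0.01):
    a = np.zeros(2)
    for _ in range(iters):
        # Step against the MSE gradient of eq. 1: -(1/m) X^T (y - X a).
        a += alpha / m * X.T @ (y - X @ a)
    return a

a_gd_short = gradient_descent(100)
a_gd_long = gradient_descent(100_000)
print(a_ne, a_gd_short, a_gd_long)
```

The short run is still drifting toward the optimum; the long run agrees with the normal equation to many decimal places.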

**Multiple linear regression**

Starting the algorithm with a_0 = 0, a_1 = 0, and a_2 = 0, and running 10,000 iterations:

| | Normal equation | Gradient descent |
|---|---|---|
| a_0 | 77.09660 | 73.632116 |
| a_1 | 0.33745 | 0.652041 |
| a_2 | -1.01820 | -0.859574 |
| Error | 11.529 | 11.716 |

I ran another test with 100,000 iterations, and for multiple linear regression it finally converged to the global minimum, since I got the very same parameters as with the normal equation.

## Conclusion

As we can see, the normal equation is another way to solve linear regression. It's more direct, it finds the optimal solution, and no parameters need to be tuned. But why don't we always use it? Why do we need gradient descent at all?

Well, since nothing is perfect, the normal equation needs much more processing than gradient descent: inverting X^T X costs roughly O(n^3) in the number of features. For a modest number of features (fewer than 10,000, according to Andrew Ng) that's acceptable, but beyond that we should start thinking about gradient descent. Gradient descent can be thought of as an approximation of the ideal solution, since it may or may not converge to the global minimum.

See ya!
