Linear Regression with Normal Equation


In this post we’ll show another way different to solve the problem of error minimization in linear regression. Instead of using gradient descent, we could solve this linear system using matrices.


Normal equation

As we know, a system of linear equations can be solved by matrices. In this case, we don’t need to define learning rate {\textstyle \alpha}, we don’t need to iterate, we don’t need to scale features.

We minimize the square errors by taking the derivatives and setting them to zero and then we solve this linear system with the following matrix equation:


 a = (X^T X)^{-1} X^T y


Remember: a is a vector a_0, a_1, ... , a_n


How this equation works and why? As seen before, once we define the error function, the whole idea is find a way to minimize it. In our case we are using mean squared error (MSE). We can get the minimum error by taking the derivatives:


\displaystyle \frac{\partial}{\partial a_n} = -\frac{1}{m}\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + ... + a_nx_n) \right ) \text {\hspace{6 mm} \small eq. 1}


and setting them to zero


 -\frac{1}{m}\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + ... = a_nx_n) \right ) = 0 \text {\hspace{6 mm} \small eq. 2}


Working on the equation we have


\sum_{i=1}^{m} x_n \left ( y - (a_0x_0 + a_1x_1 + a_2x_2 + ... + a_nx_n) \right ) = 0 \text {\hspace{14 mm} \small eq. 3}


 y - (a_0x_0 + a_1x_1 + ... + a_nx_n) = 0 \text {\hspace{46 mm} \small eq. 4}


and finally


 (a_0x_0 + a_1x_1 + a_2x_2 + a_3x_3 + a_4x_4 + a_5x_5) = y \text {\hspace{25 mm} \small eq. 5} 


Now, we can solve this using matrices:


i) X a = y

ii)(X^T X) a = X^T y

iii)  a = (X^T X)^{-1} X^T y


You can find a very good (and very long) explanation on how it works here.


I wrote a script on octave to calculate the parameters using normal equation method with both datasets used for simple and multiple linear regression. Here is the results compared to those using gradient descent:


Simple linear regression


Starting gradient descent algorithm with  a_0 = 0 and a_1 = 0 and 10000 iterations


Normal equationGradient descent


We can see that gradient descent finally converged to a local minimum.


Multiple linear regression


Starting the algorithm with  a_0 = 0 , a_1 = 0 and a_2 = 0 and 10000 iterations


Normal equationGradient descent


I did another test with 100,000 iterations and  for multiple linear regression it finally converged to the global minimum since I got the very same parameters.




As we can see, normal equation is another way to solve linear regression. It’s more direct, it finds the optimum solution and we don’t need any parameters to be tuned. But why we don’t use it always? Why do we need gradient descent?

Well, as anything is perfect, normal equation need much more processing that gradient descent. For few features (less than 10,000 accordingly to Andrew Ng) it’s acceptable but more than that, we can start thinking about gradient descent. Gradient descent can be thought as an approximation of the ideal solution since it may or may not converge to the global minimum.


See ya!


Marcelo Jo

Marcelo Jo is an electronics engineer with 10+ years of experience in embedded system, postgraduate in computer networks and masters student in computer vision at Université Laval in Canada. He shares his knowledge in this blog when he is not enjoying his wonderful family – wife and 3 kids. Live couldn’t be better.


Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.