Gradient descent tricks

Hi people,

The last two posts were about linear regression. I explained a bit of the theory and left an example to test the algorithm, which works but could be improved. How can we do that?

Feature scaling

One of the problems with gradient descent arises when two features are not on the same scale, for example house price ($200,000 – $1,000,000) vs. number of bedrooms (1 – 5). When that happens, the descent moves quickly along the small-range dimension and very slowly along the large-range one, which leads to oscillation and slow convergence.

The main idea of feature scaling is to put all features on the same (or a close) scale so that gradient descent converges faster.

Figure 1 – Scaled features (left) vs. unscaled features (right). Image taken from the internet.

So we can imagine that a learning rate \alpha that works well for one feature could be too high for the other. It's easy to picture: if we drop a small marble onto the surface on the right, it will oscillate a lot before settling in the middle, while on the left it would stop much more quickly.

The idea is to bring each feature's values into a range between 0 and 1 (or close to it), so that all features end up on a similar scale. But how can we scale a feature?

There are several techniques (honestly, I don't know why one would be better than another – if anyone knows, please leave a comment =D):

 

i) Rescaling

In this technique, we subtract the minimum from the input values and divide by the range of values (max – min).

 

 x_i = \frac{\displaystyle x_i - x_{min}}{\displaystyle x_{max} - x_{min}}
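As an illustration, here is a minimal NumPy sketch of this rescaling (the helper name is mine, not code from the earlier posts):

```python
import numpy as np

def rescale(x):
    # min-max rescaling: maps every value of x into the [0, 1] range
    return (x - x.min()) / (x.max() - x.min())

bedrooms = np.array([1, 2, 3, 4, 5], dtype=float)
print(rescale(bedrooms))  # [0.   0.25 0.5  0.75 1.  ]
```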

 

ii) Mean normalization

In this technique, we subtract the mean of the input values and divide by the range of values (max – min).

 

x_i = \frac{\displaystyle x_i - \mu}{\displaystyle x_{max} - x_{min}}
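A corresponding sketch for mean normalization (again, just an illustrative helper assuming NumPy):

```python
import numpy as np

def mean_normalize(x):
    # subtract the mean, then divide by the range (max - min);
    # the result is centered around 0, roughly within [-1, 1]
    return (x - x.mean()) / (x.max() - x.min())

bedrooms = np.array([1, 2, 3, 4, 5], dtype=float)
print(mean_normalize(bedrooms))  # [-0.5  -0.25  0.    0.25  0.5 ]
```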

 

iii) Standardization

 

In this technique, we subtract the mean of the input values and divide by the standard deviation.

 

x_i = \frac{\displaystyle x_i - \mu}{\displaystyle \sigma}
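And a sketch for standardization (the z-score), using the population standard deviation that NumPy's std computes by default:

```python
import numpy as np

def standardize(x):
    # subtract the mean and divide by the standard deviation (z-score)
    return (x - x.mean()) / x.std()

bedrooms = np.array([1, 2, 3, 4, 5], dtype=float)
print(standardize(bedrooms))  # roughly [-1.41 -0.71  0.    0.71  1.41]
```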

 

It's worth remembering that you have to save the values used in the feature scaling, as you'll need them later to convert the results back to real values.
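For example, with the min-max rescaling above, keeping the minimum and maximum lets you map results back to the original units later (a hypothetical sketch, not code from the earlier posts):

```python
import numpy as np

def rescale_with_params(x):
    # keep the min and max so we can undo the scaling later
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def unscale(x_scaled, x_min, x_max):
    # inverse of the rescaling above: back to the original units
    return x_scaled * (x_max - x_min) + x_min

prices = np.array([200_000, 450_000, 1_000_000], dtype=float)
scaled, p_min, p_max = rescale_with_params(prices)
print(unscale(scaled, p_min, p_max))  # [ 200000.  450000. 1000000.]
```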

 

Learning rate

Another issue with gradient descent is that it can oscillate and never converge, or even overflow. We use the learning rate \alpha to tune the speed of convergence. If it's too high, gradient descent will oscillate, overflow, or never reach the minimum. If it's too small, it will take too many iterations to converge.

Personally, I start with 0.001, 0.01… to check what happens first, then increase or decrease the learning rate in multiples of 3. I found that multiplying by 10 was very often too much.

As mentioned before, there is a way to check whether gradient descent is converging: track the error against the iteration number and see what is going on. The error should decrease over time.
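As a rough sketch of what that looks like in practice (this is my own minimal batch gradient descent for linear regression, not necessarily the exact code from the previous posts), we can record the cost at every iteration and compare a few learning rates:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=500):
    # Batch gradient descent for linear regression.
    # X is assumed to already contain a column of ones for the intercept.
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(n_iters):
        error = X @ theta - y
        costs.append(error @ error / (2 * m))  # squared-error cost for the current theta
        theta -= (alpha / m) * (X.T @ error)
    return theta, costs

# Toy data: one (already scaled) feature plus the intercept column.
X = np.c_[np.ones(5), np.linspace(0.0, 1.0, 5)]
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1):
    _, costs = gradient_descent(X, y, alpha=alpha)
    print(f"alpha={alpha}: cost after 500 iterations = {costs[-1]:.4f}")

# Plotting costs against the iteration number gives a curve like Figure 2:
# it should decrease steadily; if it blows up, alpha is too high.
```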

 

Figure 2 – Error vs. iteration

Conclusion

We saw that gradient descent can work well, but under some conditions it can be unstable or too slow. The techniques discussed in this post help mitigate or solve these issues, allowing the algorithm to perform better.

 

See you next time.

 

Marcelo Jo

Marcelo Jo is an electronics engineer with 10+ years of experience in embedded systems, a postgraduate in computer networks, and a master's student in computer vision at Université Laval in Canada. He shares his knowledge on this blog when he is not enjoying his wonderful family – wife and 3 kids. Life couldn't be better.

LinkedIn 
