Intuitions on L1 and L2 Regularisation


This article was first published in Towards Data Science.


Overfitting is a phenomenon that occurs when a machine learning or statistics model is tailored to a particular dataset and is unable to generalise to other datasets. This usually happens in complex models, like deep neural networks.

Regularisation is a process of introducing additional information in order to prevent overfitting. The focus for this article is L1 and L2 regularisation.

There are many explanations out there but honestly, they are a little too abstract, and I'd probably forget them and end up revisiting those pages, only to forget again. In this article, I will be sharing with you some intuitions on why L1 and L2 work, using gradient descent. Gradient descent is simply a method to find the 'right' coefficients through (iterative) updates using the value of the gradient. (This article shows how gradient descent can be used in a simple linear regression.)

Content

0) L1 and L2
1) Model
2) Loss Functions
3) Gradient descent
4) How is overfitting prevented?

Let’s go!


0) L1 and L2

L1 and L2 regularisation owe their names to the L1 and L2 norm of a vector w respectively. Here's a primer on norms:

1-norm (also known as L1 norm): ‖w‖₁ = |w₁| + |w₂| + … + |wₙ|
2-norm (also known as L2 norm or Euclidean norm): ‖w‖₂ = √(w₁² + w₂² + … + wₙ²)
p-norm: ‖w‖ₚ = (|w₁|ᵖ + |w₂|ᵖ + … + |wₙ|ᵖ)^(1/p)
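As a quick sketch (not from the original article; the vector and p here are arbitrary), these norms can be computed in NumPy:

```python
import numpy as np

w = np.array([3.0, -4.0])  # an arbitrary weight vector

l1 = np.abs(w).sum()                    # 1-norm: |3| + |-4| = 7
l2 = np.sqrt((w ** 2).sum())            # 2-norm: sqrt(9 + 16) = 5
p = 3
lp = (np.abs(w) ** p).sum() ** (1 / p)  # p-norm with p = 3
```

`np.linalg.norm(w, ord=1)` and `np.linalg.norm(w)` compute the same 1- and 2-norms directly.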

A linear regression model that implements the L1 norm for regularisation is called lasso regression, and one that implements the L2 norm for regularisation is called ridge regression. To implement these two, note that the linear regression model stays the same:

ŷ = wx + b

but it is the calculation of the loss function that includes these regularisation terms:

Loss function with no regularisation: L = (y − ŷ)²
Loss function with L1 regularisation: L₁ = (y − ŷ)² + λ|w|
Loss function with L2 regularisation: L₂ = (y − ŷ)² + λw²

The regularisation terms are 'constraints' to which an optimisation algorithm must 'adhere' when minimising the loss function, on top of minimising the error between the true y and the predicted ŷ.
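To make the three loss functions concrete, here is a minimal sketch in plain Python for a single data point (x, y); the function names are mine, not the article's:

```python
def predict(w, b, x):
    # Simple linear model: y_hat = w*x + b
    return w * x + b

def loss(w, b, x, y):
    # Squared error, no regularisation: L = (y - y_hat)^2
    return (y - predict(w, b, x)) ** 2

def loss_l1(w, b, x, y, lam):
    # L1: squared error plus lam * |w|
    return loss(w, b, x, y) + lam * abs(w)

def loss_l2(w, b, x, y, lam):
    # L2: squared error plus lam * w^2
    return loss(w, b, x, y) + lam * w ** 2
```

Even for a perfect fit (zero error), loss_l1 and loss_l2 stay positive whenever w ≠ 0 — the penalty never 'sleeps'.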

1) Model

For simplicity, we define a simple linear regression model ŷ with one independent variable:

ŷ = wx + b

Here I have used the deep learning conventions w ('weight') and b ('bias').

In practice, simple linear regression models are not prone to overfitting. As mentioned in the introduction, deep learning models are more susceptible to such problems due to their model complexity.

As such, do note that the expressions used in this article are easily extended to more complex models, not limited to linear regression.

2) Loss Functions

To demonstrate the effect of L1 and L2 regularisation, let’s define 3 loss functions:

  • L
  • L1
  • L2

2.1) Loss function with no regularisation

We define the loss function L as the squared error, where error is the difference between y (the true value) and ŷ (the predicted value):

L = (y − ŷ)²

Let's assume our model will overfit when trained with this loss function.

2.2) Loss function with L1 regularisation

Based on the above loss function, adding an L1 regularisation term to it looks like this:

L₁ = (y − ŷ)² + λ|w|

where the regularisation parameter λ > 0 is manually tuned. Let's call this loss function L1. Note that |w| is differentiable everywhere except at w = 0. We will need this later.

2.3) Loss function with L2 regularisation

Similarly, adding an L2 regularisation term to L looks like this:

L₂ = (y − ŷ)² + λw²

where again, λ > 0.

3) Gradient descent

Now, let's use gradient descent optimisation to find w. Recall that updating the parameter w in gradient descent is as follows:

w := w − η · ∂L/∂w

Let's substitute the last term in the above equation with the gradient of L, L1 and L2 w.r.t. w.

L:

w := w − η · 2x(wx + b − y)

L1:

w := w − η · (2x(wx + b − y) + λ·sign(w))

L2:

w := w − η · (2x(wx + b − y) + 2λw)
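The three update rules can be sketched in plain Python (the function names are mine; sign(0) is taken as 0, a common subgradient choice at the point where |w| is not differentiable):

```python
def grad_w(w, b, x, y):
    # Gradient of L = (y - (w*x + b))^2 with respect to w
    return 2 * x * (w * x + b - y)

def update(w, b, x, y, eta):
    # Plain gradient descent step (no regularisation)
    return w - eta * grad_w(w, b, x, y)

def update_l1(w, b, x, y, eta, lam):
    # L1 step: add the subgradient lam * sign(w)
    sign = (w > 0) - (w < 0)
    return w - eta * (grad_w(w, b, x, y) + lam * sign)

def update_l2(w, b, x, y, eta, lam):
    # L2 step: add the gradient 2 * lam * w
    return w - eta * (grad_w(w, b, x, y) + 2 * lam * w)
```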


4) How is overfitting prevented?

For readability purposes, let's perform the following substitutions in the equations above:

  • η = 1,
  • H = 2x(wx + b − y)

Thus we have as follows:

L:

w := w − H (Eqn. 0)

L1:

w := w − H − λ·sign(w) (Eqn. 1.1)

L2:

w := w − H − 2λw (Eqn. 1.2)

4.1) L vs. {L1 and L2}

Observe the differences between the weight updates with the regularisation parameter λ and without it.

Intuition A:

Let's say with Eqn. 0, executing w−H gives us a w value that leads to overfitting. Then, intuitively, Eqns. 1.1–2 will reduce the chances of overfitting, because introducing λ makes us shift away from the very w that was going to cause overfitting problems.

Intuition B:

Let's say an overfitted model means that we have a w value that is perfect for our model. 'Perfect' meaning that if we substituted the data (x) back in the model, our prediction ŷ would be very, very close to the true y. Sure, it's good, but we don't want perfect. Why? Because this means our model is only meant for the dataset on which it was trained. It would produce predictions far off from the true value on other datasets. So we settle for less than perfect, with the hope that our model can also make close predictions on other data. To do this, we 'taint' this perfect w in Eqn. 0 with a penalty term λ in Eqns. 1.1–2.

Intuition C:

Notice that H (as defined here) depends on the model (w and b) and the data (x and y). Updating the weights based only on the model and data in Eqn. 0 can lead to overfitting, which leads to poor generalisation. On the other hand, in Eqns. 1.1–2, the final value of w is not only influenced by the model and data, but also by a predefined parameter λ which is independent of the model and data. Thus, we can prevent overfitting if we set an appropriate value of λ, though too large a value will cause the model to be severely underfitted.

4.2) L1 vs. L2

Observe the differences between the weight updates with L1 regularisation and L2 regularisation.

Intuition D:

We shall now focus our attention on L1 and L2, and rewrite Eqns. 1.1 and 1.2 by rearranging their λ and H terms as follows:

L1:

w := (w − λ) − H   if w > 0 (Eqn. 3.1)
w := (w + λ) − H   if w < 0 (Eqn. 3.2)

L2:

w := (1 − 2λ)w − H

For L1 (Eqn. 3.1), if w is positive, the regularisation parameter λ>0 will push w to be less positive, by subtracting λ from w. Conversely in Eqn. 3.2, if w is negative, λ will be added to w, pushing it to be less negative. Hence, this has the effect of pushing w towards 0.

This is of course pointless in a 1-variable linear regression model, but L1 will prove its prowess in 'removing' useless variables in multivariate regression models. You can also think of L1 as reducing the number of features in the model altogether. Here is an arbitrary example of L1 having 'pushed' some variables in a multivariate linear regression model (the coefficients are purely illustrative):

ŷ = 0.46x₁ + 0.001x₂ + 0.83x₃ + 0.002x₄ + 0.001x₅ + 1.1

So how does pushing w towards 0 help with overfitting? As mentioned above, as w goes to 0, we are reducing the number of features by reducing the variable importance. In the equation above, we see that x₂, x₄ and x₅ are almost 'useless' because of their small coefficients, hence we can remove them from the equation. This in turn reduces the model complexity, making our model simpler. A simpler model can reduce the chances of overfitting.

While L1's push depends on the sign of w, L2, on the other hand, simply scales w down towards 0 regardless of its sign.
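To see both behaviours at once, here is a toy sketch (the data, λ, step size and the batch subgradient loop are all mine, not from the article): a two-feature regression where feature 0 genuinely matters (true coefficient 2.0) and feature 1 barely does (true coefficient 0.2):

```python
import numpy as np

# Orthogonal toy design: feature 0 has true coefficient 2.0,
# feature 1 a barely-useful 0.2
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
y = np.array([2.0, 0.2, 2.0, 0.2])

def fit(l1=0.0, l2=0.0, eta=0.01, steps=2000):
    w = np.zeros(2)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y)   # gradient of the summed squared error
        grad += l1 * np.sign(w)        # L1 subgradient (0 at w = 0)
        grad += 2 * l2 * w             # L2 gradient
        w -= eta * grad
    return w

w_plain = fit()          # ≈ [2.0, 0.2]: the unregularised fit
w_l1 = fit(l1=1.0)       # ≈ [1.75, ~0]: the barely-useful weight is pushed to ~0
w_ridge = fit(l2=1.0)    # ≈ [1.33, 0.13]: both weights scaled down, neither zeroed
```

The sign-dependent L1 step drives the small coefficient to hover around 0, while L2 merely rescales every coefficient towards 0 by the same factor.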


Special thanks to Yu Xuan, Ren Jie, Daniel and Derek for ideas, suggestions and corrections to this article. Also thank you C Gam for pointing out the mistake in the derivative.

Follow me on Twitter @remykarem for digested articles and demos on AI and Deep Learning.

References

Norm (Mathematics) (wikipedia.org)

Lasso (Statistics) (wikipedia.org)

Lasso and Ridge Regularization (medium.com)
