Here’s an intuitive explanation of why L1 regularization shrinks weights to 0.
Regularization is a popular method to prevent models from overfitting. The idea is simple: I want to keep my model weights small, so I will add a penalty for having large weights. The two most common methods of regularization are Lasso (or L1) regularization and Ridge (or L2) regularization. They penalize the model by either the absolute value of its weights (L1) or the square of its weights (L2). This raises two questions: which one should I choose, and why does Lasso perform feature selection?
The Old Way
You often hear the saying “L1 regularization tends to shrink the coefficients of unimportant features to 0, but L2 does not” in all the best explanations of regularization. Visual explanations usually consist of diagrams like the very popular picture from Elements of Statistical Learning by Hastie, Tibshirani, and Friedman, a version of which also appears in Pattern Recognition and Machine Learning by Bishop.
I have found these diagrams unintuitive, and so made a simpler one that feels much easier to understand.
The New Way
Here’s my take, step by step with visualizations. First of all, the diagrams above actually depict three-dimensional surfaces, which do not translate well to a book or screen. Instead, let us get back to basics with a linear dataset.
First, we create a really simple dataset with just one weight: y = w*x, where the true weight is w = 0.5. Our linear model will try to learn the weight w.
Pretending we do not know the correct value of w, we pick candidate values of w at random and calculate the loss (mean squared error) for each one. The loss is 0 at w=0.5, which is the correct value of w as we defined earlier. As we move further away from w=0.5, the loss increases.
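To make this step concrete, here is a minimal sketch of the loss calculation. The x values, the grid of candidate weights (used here instead of random draws so the result is reproducible), and the names TRUE_W, mse_loss and w_grid are my own assumptions for illustration, not the original code behind the plots.

```python
import numpy as np

# Assumed toy data: y = w * x with the true weight w = 0.5.
# The x values themselves are arbitrary and only for illustration.
TRUE_W = 0.5
x = np.linspace(-1.0, 1.0, 21)
y = TRUE_W * x

def mse_loss(w, x, y):
    """Mean squared error of the one-weight model y_hat = w * x."""
    return np.mean((y - w * x) ** 2)

# Evaluate the loss over a grid of candidate weights (step of 0.01).
w_grid = np.linspace(-1.0, 1.5, 251)
losses = np.array([mse_loss(w, x, y) for w in w_grid])

# The candidate with the smallest loss should sit at the true weight.
best_w = w_grid[np.argmin(losses)]
print(best_w, losses.min())
```

Plotting losses against w_grid should give the bowl-shaped curve described above, with its bottom at w=0.5.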
Now we plot our regularization terms. The L1 penalty is 0 when w is 0, and increases linearly as you move away from w=0. The L2 penalty is also 0 at w=0, but increases quadratically as you move away from it.
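As a rough sketch, the two penalty terms can be written as small functions. The names l1_penalty and l2_penalty and the strength parameter lam (the lambda mentioned later) are assumptions of mine, with lam set to 1.0 purely for illustration.

```python
import numpy as np

def l1_penalty(w, lam=1.0):
    """L1 (Lasso) penalty: grows linearly with the absolute weight."""
    return lam * np.abs(w)

def l2_penalty(w, lam=1.0):
    """L2 (Ridge) penalty: grows with the square of the weight."""
    return lam * w ** 2

# Five evenly spaced weights: -1, -0.5, 0, 0.5, 1
w_grid = np.linspace(-1.0, 1.0, 5)
l1_vals = l1_penalty(w_grid)
l2_vals = l2_penalty(w_grid)
print(l1_vals)
print(l2_vals)
```

Both penalties are 0 at w=0, but halfway out (w=0.5) the L2 penalty is only a quarter of its value at w=1, while the L1 penalty is half.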
Now the fun part. Regularized loss is calculated by adding your loss term to your regularization term. Doing this for each of our losses above gets us the blue (L1 regularized losses) and red (L2 regularized losses) curves below.
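Written out for the one-weight model, the two regularized losses look like this. The notation is my own: n is the number of data points, and lambda is the regularization strength, assumed to be the same for both curves.

```latex
\begin{aligned}
L_{\text{L1}}(w) &= \frac{1}{n}\sum_{i=1}^{n} (y_i - w x_i)^2 + \lambda \lvert w \rvert \\
L_{\text{L2}}(w) &= \frac{1}{n}\sum_{i=1}^{n} (y_i - w x_i)^2 + \lambda w^2
\end{aligned}
```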
In the case of L1 regularized loss (blue line), the value of w that minimizes the loss is at w=0. For L2 regularized loss (red line), the value of w that minimizes the loss is lower than the actual value (which is 0.5), but does not quite hit 0.
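Here is a small sketch that reproduces this comparison numerically. The x values, the grid, and especially the value lam = 0.5 (named LAM below) are assumptions chosen for illustration; the plots were not necessarily made with these numbers.

```python
import numpy as np

# Assumed toy setup: true weight 0.5, same regularization strength for both.
TRUE_W = 0.5
LAM = 0.5
x = np.linspace(-1.0, 1.0, 21)
y = TRUE_W * x

# Unregularized mean squared error for each candidate weight.
w_grid = np.linspace(-1.0, 1.5, 251)
mse = np.array([np.mean((y - w * x) ** 2) for w in w_grid])

# Regularized loss = loss term + regularization term.
l1_total = mse + LAM * np.abs(w_grid)
l2_total = mse + LAM * w_grid ** 2

# Weight that minimizes each curve on the grid.
w_plain = w_grid[np.argmin(mse)]
w_l1 = w_grid[np.argmin(l1_total)]
w_l2 = w_grid[np.argmin(l2_total)]
print(w_plain, w_l1, w_l2)
```

With these assumed values, the L1 minimum lands on w=0, while the L2 minimum lands at about 0.21: below the true value of 0.5, but not at 0. With a much smaller lam, the L1 minimum in this sketch would also stop short of 0, so the result depends on the penalty being strong enough relative to the loss.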
There you have it: for the same value of lambda (the weight given to the regularization term), L1 regularization has shrunk the feature weight all the way down to 0, while L2 regularization has not!
Another way of thinking about this is in the context of using gradient descent to minimize the loss function. We follow the gradient of the loss function to the point where the loss is minimized, and regularization adds its own gradient to the gradient of the unregularized loss. L1 regularization adds a gradient of constant magnitude at every value of w other than 0, while the gradient added by L2 regularization is proportional to w and so shrinks as we approach 0. Therefore, at values of w that are very close to 0, gradient descent with L1 regularization keeps pushing w towards 0 just as hard, while the push from L2 regularization weakens the closer you get to 0.
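The gradient view can be sketched in a few lines as well. Everything here is an assumption for illustration: the x values, the strength LAM = 0.5, the learning rate LR, the starting weight, and the helper names. The L1 update uses the sign of w as a subgradient, which is one simple choice rather than how any particular library does it.

```python
import numpy as np

# Assumed toy setup: true weight 0.5, penalty strength 0.5, small step size.
TRUE_W, LAM, LR = 0.5, 0.5, 0.01
x = np.linspace(-1.0, 1.0, 21)
y = TRUE_W * x

def mse_grad(w):
    """Gradient of the mean squared error with respect to w."""
    return np.mean(-2.0 * x * (y - w * x))

def l1_grad(w):
    """(Sub)gradient of LAM * |w|: constant magnitude away from 0."""
    return LAM * np.sign(w)

def l2_grad(w):
    """Gradient of LAM * w**2: proportional to w."""
    return 2.0 * LAM * w

# Push added by each penalty as w gets closer to 0.
near_zero = np.array([0.1, 0.01, 0.001])
l1_push = l1_grad(near_zero)  # stays at 0.5
l2_push = l2_grad(near_zero)  # shrinks along with w

def descend(penalty_grad, w=1.0, steps=2000):
    """Plain gradient descent on loss + penalty from a starting weight."""
    for _ in range(steps):
        w -= LR * (mse_grad(w) + penalty_grad(w))
    return w

w_l1 = descend(l1_grad)
w_l2 = descend(l2_grad)
print(l1_push, l2_push, w_l1, w_l2)
```

In this sketch the L2 run settles at about 0.21, while the L1 run ends up within a step size of 0. Because the L1 push never fades, plain gradient descent with a fixed step hovers around 0 rather than landing on it exactly.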
This article was originally published here.