On regularization
Regularization
But before optimization, we need to learn about regularization. Regularization is essentially a new term we add to our loss function to deliberately make the model perform slightly worse on the training set. Yep, really.

$$L = \underbrace{\frac{1}{N}\sum_{i} L_i}_{\text{data loss}} + \underbrace{\lambda R(W)}_{\text{regularization loss}}$$

where:
- $\lambda$ is a new hyperparameter that controls the regularization strength.
- $R(W)$ is a regularization function of choice.
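As a minimal sketch of how the two terms combine (a linear model with mean squared error stands in for the data loss here; both choices are assumptions, any model and data loss work the same way):

```python
import numpy as np

def regularized_loss(W, X, y, lam):
    # Data loss: how well the model fits the training set.
    # Assumed here: a linear model scored with mean squared error.
    preds = X @ W
    data_loss = np.mean((preds - y) ** 2)
    # Regularization loss: penalizes large weights
    # (sum of squared weights shown as an example R).
    reg_loss = np.sum(W ** 2)
    return data_loss + lam * reg_loss
```

With `lam = 0` this reduces to the plain data loss; increasing `lam` trades training fit for smaller weights.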
The core intuition behind adding regularization to our loss is to avoid overfitting during training. By allowing the model to fit the training data slightly worse, we usually obtain better generalization and improved performance on unseen data.
Typical regularization functions are:
- $R(W) = \sum_{k}\sum_{l} W_{k,l}^2$ — L2 regularization: square each weight and sum them (the L2 penalty on a single weight $w$ is $w^2$).
- $R(W) = \sum_{k}\sum_{l} |W_{k,l}|$ — L1 regularization: take the absolute value of each weight and sum them (the L1 penalty on a single weight $w$ is $|w|$).
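Both penalties are one-liners in NumPy (a minimal sketch; the function names are ours):

```python
import numpy as np

def l2_penalty(W):
    # Square each weight and sum them.
    return np.sum(W ** 2)

def l1_penalty(W):
    # Take the absolute value of each weight and sum them.
    return np.sum(np.abs(W))
```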
When introducing nearest neighbors, we talked about L1 and L2 distances as ways of measuring distances between vectors. Regularization is closely related: here we measure how far the weight vector is from zero, effectively measuring the size of the weight vector.
To fully understand the effect of regularization, in my opinion it is a good idea to revisit this section once we cover training and backpropagation. The following explanation is slightly forward-looking. During training, weights are updated using gradient descent:

$$W \leftarrow W - \eta \nabla_W L$$

where:
- $L$ is the loss function and $\nabla_W L$ is the gradient of the loss with respect to the weights.
- $\eta$ is the learning rate.

For simplicity, we will temporarily ignore the data loss and focus only on the regularization term $\lambda R(W)$.
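The update rule itself is tiny in code (a sketch; the gradient is passed in as a given here, since how to compute it is covered under backpropagation):

```python
import numpy as np

def gradient_step(W, grad_W, lr):
    # One gradient descent update: move against the gradient,
    # scaled by the learning rate lr.
    return W - lr * grad_W
```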
L2 regularization uses $R(W) = \sum W^2$ and its gradient is $\nabla_W \lambda R(W) = 2\lambda W$. When we plug this into the update rule:

$$W \leftarrow W - \eta \cdot 2\lambda W = W(1 - 2\eta\lambda)$$
Intuition: each training step multiplies the weight by a number slightly smaller than 1.
- If W = 10, it shrinks noticeably.
- If W = 0.01, it shrinks only a tiny amount.
As weights get closer to zero, the shrinking effect becomes weaker. Because of this, L2 usually makes weights small, but they rarely become exactly zero.
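We can watch this multiplicative shrinkage directly (toy numbers; the learning rate and $\lambda$ values are arbitrary assumptions for the demo):

```python
import numpy as np

# L2-only update: W <- W * (1 - 2 * lr * lam)
lr, lam = 0.1, 0.5
W = np.array([10.0, 0.01])
for _ in range(50):
    W = W * (1 - 2 * lr * lam)
# Both weights shrink toward zero, but neither becomes exactly zero:
# each step only removes a fixed *fraction* of the remaining weight.
print(W)
```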
L1 regularization uses $R(W) = \sum |W|$ and its gradient is $\nabla_W \lambda R(W) = \lambda\,\text{sign}(W)$. When we plug this into the update rule (for simplicity ignoring the loss gradient):

$$W \leftarrow W - \eta\lambda\,\text{sign}(W)$$

E.g. for $W > 0$ the expression becomes $W - \eta\lambda$, and for $W < 0$ it becomes $W + \eta\lambda$.
Note: only the sign changes; the magnitude of $W$ does not influence the step size.
Note: $\text{sign}(\cdot)$ is applied element-wise to each element in $W$, so $\text{sign}(W)$ is a vector and not a scalar.
Intuition: each training step subtracts a fixed amount from the weight.
Because small weights get pushed just as strongly as large ones, they often get pushed all the way to zero.
This is why L1 regularization tends to produce sparse models, where many weights are exactly zero.
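A toy run of the L1-only update shows the sparsity effect. A plain sign-based step would oscillate around zero, so this sketch clamps a weight to exactly zero once a step would flip its sign (that clamping rule is our assumption for the demo, not part of the basic update; the learning rate and λ are arbitrary):

```python
import numpy as np

lr, lam = 0.1, 0.5
step = lr * lam  # fixed amount subtracted from every weight each update
W = np.array([10.0, 0.01, -0.3])
for _ in range(50):
    W_new = W - step * np.sign(W)
    # Clamp weights whose sign flipped: they have reached zero.
    W_new[np.sign(W_new) != np.sign(W)] = 0.0
    W = W_new
# Small weights are driven exactly to zero; the large one just shrinks.
print(W)
```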