On regularization

Regularization

But before optimization, we need to learn about regularization. Regularization is an extra term we add to our loss function to deliberately make the model perform slightly worse on the training set. Yep, really.

L(W) = \underbrace{L_{\text{data}}(W)}_{\text{loss}} + \underbrace{\lambda R(W)}_{\text{regularization}}

where:

  • λ is a new hyperparameter that controls the regularization strength.
  • R(W) is a regularization function of our choice.

The core intuition behind adding regularization to our loss is to avoid overfitting during training. By allowing the model to fit the training data slightly worse, we usually obtain better generalization and improved performance on unseen data.
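In code, the regularized loss is just the data loss plus a weighted penalty. A minimal NumPy sketch (the weights and the data-loss value of 1.25 are made up for illustration; the R passed in here happens to be the L2 penalty covered below):

```python
import numpy as np

def regularized_loss(data_loss, W, lam, R):
    # L(W) = L_data(W) + lambda * R(W)
    return data_loss + lam * R(W)

W = np.array([0.5, -1.0, 2.0])
loss = regularized_loss(1.25, W, lam=0.1, R=lambda W: np.sum(W ** 2))
print(loss)  # 1.25 + 0.1 * (0.25 + 1.0 + 4.0) = 1.775
```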

Typical regularization functions are:

  • L2 regularization: Square each weight and sum them.

    R(W) = \|W\|_2^2 = \sum_k \sum_l W_{k,l}^2

    L2 penalty on a single weight
  • L1 regularization: Take the absolute value of each weight and sum them.

    R(W) = \|W\|_1 = \sum_k \sum_l |W_{k,l}|

    L1 penalty on a single weight
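Both penalties are one-liners in NumPy. A small sketch with a made-up weight matrix:

```python
import numpy as np

W = np.array([[0.5, -1.0],
              [2.0,  0.1]])

# L2: square each weight and sum
R_l2 = np.sum(W ** 2)     # 0.25 + 1.0 + 4.0 + 0.01 = 5.26

# L1: take the absolute value of each weight and sum
R_l1 = np.sum(np.abs(W))  # 0.5 + 1.0 + 2.0 + 0.1 = 3.6

print(R_l2, R_l1)
```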

When introducing nearest neighbors, we talked about L1 and L2 distances as ways of measuring the distance between vectors. Regularization is closely related: here we measure how far the weight vector is from zero,

\|W\|_p = \|W - 0\|_p = d_p(W, 0)

effectively measuring the size of the weight vector.

To fully understand the effect of regularization, imo it's a good idea to revisit this section once we cover training and backpropagation. The following explanation is slightly forward-looking. During training, weights are updated using gradient descent:

W \leftarrow W - \alpha \nabla_W L(W)

where:

  • L(W) is the loss function and \nabla_W L(W) = \nabla_W L_{\text{data}}(W) + \lambda \nabla_W R(W) is the gradient of the loss with respect to the weights
  • α is the learning rate

For simplicity, we will temporarily ignore \nabla_W L_{\text{data}}(W) and focus only on \lambda \nabla_W R(W).


L2 regularization uses R(W) = \|W\|_2^2, and its gradient is \nabla_W \|W\|_2^2 = 2W.

When we plug this into the update rule:

W \leftarrow W - \alpha \lambda \cdot 2W

W \leftarrow W(1 - 2\alpha\lambda)

Intuition: each training step multiplies the weight by a number slightly smaller than 1.

  • If W = 10, it shrinks noticeably.
  • If W = 0.01, it shrinks only a tiny amount.

As weights get closer to zero, the shrinking effect becomes weaker. Because of this, L2 usually makes weights small, but they rarely become exactly zero.
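We can watch this multiplicative decay happen in a few lines. A sketch that ignores the data-loss gradient entirely (the α and λ values are arbitrary):

```python
import numpy as np

alpha, lam = 0.1, 0.05   # arbitrary learning rate and regularization strength
W = np.array([10.0, 0.01])

# L2-only update: W <- W * (1 - 2*alpha*lam), i.e. multiply by 0.99 each step
for _ in range(100):
    W = W * (1 - 2 * alpha * lam)

print(W)  # roughly [3.66, 0.00366]: both shrank by the same factor,
          # but neither weight is exactly zero
```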

L1 regularization uses R(W) = \|W\|_1, and its gradient is:

\nabla_W \|W\|_1 = \text{sign}(W) = \begin{cases} 1 & W > 0 \\ -1 & W < 0 \end{cases}

When we plug this into the update rule ( for simplicity ignore the loss gradient ):

W \leftarrow W - \alpha \lambda \cdot \text{sign}(W)

E.g. for W = 10 the update becomes W - \alpha\lambda \cdot 1, and for W = -10 it becomes W - \alpha\lambda \cdot (-1) = W + \alpha\lambda.

Note: only the sign changes; the magnitude of W does not influence the step size.

Note: W is a vector, not a scalar; sign(W) is applied element-wise to each element of W.

Intuition: each training step subtracts a fixed amount αλ from the weight.

Because small weights get pushed just as strongly as large ones, they often get pushed all the way to zero.

This is why L1 regularization tends to produce sparse models, where many weights are exactly zero.
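To see the contrast with L2, compare one L1 step on a large and a small weight (same arbitrary α and λ as before). One caveat: with this plain sign-based update a weight near zero oscillates around it rather than landing exactly on zero; in practice, sparsity comes from tricks like soft-thresholding that clamp such weights to exactly zero.

```python
import numpy as np

alpha, lam = 0.1, 0.05
W = np.array([10.0, 0.01])

# L1-only update: subtract the fixed amount alpha*lam, regardless of magnitude
step = alpha * lam * np.sign(W)
print(step)   # [0.005 0.005]: identical step for both weights

W_new = W - step
print(W_new)  # [9.995 0.005]: the small weight lost half its value,
              # the large one barely changed in relative terms
```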
