On regularization

Regularization

But before optimization, we need to learn about regularization. Regularization is an extra term we add to our loss function to deliberately make the model perform slightly worse on the training set. Yep, really.

L(W) = \underbrace{L_{\text{data}}(W)}_{\text{loss}} + \underbrace{\lambda R(W)}_{\text{regularization}}

where:

  • λ is a new hyperparameter that controls the regularization strength.
  • R(W) is a regularization function of our choice.

The core intuition behind adding regularization to our loss is to avoid overfitting during training. By allowing the model to fit the training data slightly worse, we usually obtain better generalization and improved performance on unseen data.
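In code, the regularized loss is just the data loss plus a weighted penalty. A minimal NumPy sketch (the weights and the data-loss value of 1.25 are made up for illustration; the R passed in here happens to be the L2 penalty covered below):

```python
import numpy as np

def regularized_loss(data_loss, W, lam, R):
    # L(W) = L_data(W) + lambda * R(W)
    return data_loss + lam * R(W)

W = np.array([0.5, -1.0, 2.0])
loss = regularized_loss(1.25, W, lam=0.1, R=lambda W: np.sum(W ** 2))
print(loss)  # 1.25 + 0.1 * (0.25 + 1.0 + 4.0) = 1.775
```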

Typical regularization functions are:

  • L2 regularization: Square each weight and sum them.

    R(W) = \|W\|_2^2 = \sum_k \sum_l W_{k,l}^2

    L2 penalty on a single weight
  • L1 regularization: Take the absolute value of each weight and sum them.

    R(W) = \|W\|_1 = \sum_k \sum_l |W_{k,l}|

    L1 penalty on a single weight
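Both penalties are one-liners in NumPy. A small sketch with a made-up weight matrix:

```python
import numpy as np

W = np.array([[0.5, -1.0],
              [2.0,  0.1]])

# L2: square each weight and sum
R_l2 = np.sum(W ** 2)     # 0.25 + 1.0 + 4.0 + 0.01 = 5.26

# L1: take the absolute value of each weight and sum
R_l1 = np.sum(np.abs(W))  # 0.5 + 1.0 + 2.0 + 0.1 = 3.6

print(R_l2, R_l1)
```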

When introducing nearest neighbors, we talked about L1 and L2 distances as ways of measuring the distance between vectors. Regularization is closely related: here we measure how far the weight vector is from zero,

\|W\|_p = \|W - 0\|_p = d_p(W, 0)

effectively measuring the size of the weight vector.

To fully understand the effect of regularization, imo it's a good idea to revisit this section once we cover training and backpropagation. The following explanation is slightly forward-looking. During training, weights are updated using gradient descent:

W \leftarrow W - \alpha \nabla_W L(W)

where:

  • L(W) is the loss function and \nabla_W L(W) = \nabla_W L_{\text{data}}(W) + \lambda \nabla_W R(W) is the gradient of the loss with respect to the weights
  • α is the learning rate

For simplicity, we will temporarily ignore \nabla_W L_{\text{data}}(W) and focus only on \lambda \nabla_W R(W).


L2 regularization uses R(W) = \|W\|_2^2, and its gradient is \nabla_W \|W\|_2^2 = 2W.

When we plug this into the update rule:

W \leftarrow W - \alpha \lambda \cdot 2W

W \leftarrow W(1 - 2\alpha\lambda)

Intuition: each training step multiplies the weight by a number slightly smaller than 1.

  • If W = 10, it shrinks noticeably.
  • If W = 0.01, it shrinks only a tiny amount.

As weights get closer to zero, the shrinking effect becomes weaker. Because of this, L2 usually makes weights small, but they rarely become exactly zero.
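We can watch this multiplicative decay happen in a few lines. A sketch that ignores the data-loss gradient entirely (the α and λ values are arbitrary):

```python
import numpy as np

alpha, lam = 0.1, 0.05   # arbitrary learning rate and regularization strength
W = np.array([10.0, 0.01])

# L2-only update: W <- W * (1 - 2*alpha*lam), i.e. multiply by 0.99 each step
for _ in range(100):
    W = W * (1 - 2 * alpha * lam)

print(W)  # roughly [3.66, 0.00366]: both shrank by the same factor,
          # but neither weight is exactly zero
```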

L1 regularization uses R(W) = \|W\|_1, and its gradient is:

\nabla_W \|W\|_1 = \text{sign}(W) = \begin{cases} 1 & W > 0 \\ -1 & W < 0 \end{cases}

When we plug this into the update rule ( for simplicity ignore the loss gradient ):

W \leftarrow W - \alpha \lambda \cdot \text{sign}(W)

E.g. for W = 10 the update becomes W - \alpha\lambda \cdot 1, and for W = -10 it becomes W - \alpha\lambda \cdot (-1) = W + \alpha\lambda.

Note: only the sign changes; the magnitude of W does not influence the step size.

Note: W is a vector, not a scalar; sign(W) is applied element-wise to each element of W.

Intuition: each training step subtracts a fixed amount αλ from the weight.

Because small weights get pushed just as strongly as large ones, they often get pushed all the way to zero.

This is why L1 regularization tends to produce sparse models, where many weights are exactly zero.
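To see the contrast with L2, compare one L1 step on a large and a small weight (same arbitrary α and λ as before). One caveat: with this plain sign-based update a weight near zero oscillates around it rather than landing exactly on zero; in practice, sparsity comes from tricks like soft-thresholding that clamp such weights to exactly zero.

```python
import numpy as np

alpha, lam = 0.1, 0.05
W = np.array([10.0, 0.01])

# L1-only update: subtract the fixed amount alpha*lam, regardless of magnitude
step = alpha * lam * np.sign(W)
print(step)   # [0.005 0.005]: identical step for both weights

W_new = W - step
print(W_new)  # [9.995 0.005]: the small weight lost half its value,
              # the large one barely changed in relative terms
```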
