On optimization

So far, we have defined:

  • A score function that maps inputs to class scores.
  • A loss function that tells us how good or bad our parameters are.
  • A regularization term that helps prevent overfitting.

What remains is arguably the most important part of machine learning: how do we actually find the parameters that minimize the loss?

This process is called optimization.

Gradient Descent

The fundamental idea behind most modern deep learning training is Gradient Descent.

Suppose our model has parameters W. These parameters may include all weight matrices and bias vectors in the network. We can think of them collectively as a single large parameter vector.

Our goal is to minimize the loss:

L(W)

Gradients

When the loss depends on many parameters, we need to know how the loss changes with respect to each parameter independently. We measure this using partial derivatives.

For example:

  • \frac{\partial L}{\partial W_{ij}} measures how the loss changes when we slightly change the element W_{ij} while keeping all other parameters fixed.

Rather than writing derivatives for every element separately, we group them into the gradient:

\nabla_W L(W)

The gradient has the same shape as W and points in the direction of steepest increase of the loss.

Since we want to minimize the loss, we update parameters by moving in the opposite direction:

W_{t+1} = W_t - \alpha \nabla_W L(W_t)

where:

  • \alpha = learning rate (step size)
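As a minimal sketch of this update rule, the loop below applies gradient descent to a toy quadratic loss L(W) = \sum W^2 (the function names and values here are illustrative, not from a real training setup):

```python
import numpy as np

def loss(W):
    return np.sum(W ** 2)   # toy loss with minimum at W = 0

def gradient(W):
    return 2 * W            # analytical gradient of the toy loss

W = np.array([3.0, -2.0])   # arbitrary starting point
alpha = 0.1                 # learning rate

for _ in range(100):
    W = W - alpha * gradient(W)   # step opposite the gradient

print(loss(W))  # the loss shrinks toward zero
```

Each step multiplies W by (1 - 2α), so with a well-chosen learning rate the iterates contract geometrically toward the minimum.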

Numerical vs Analytical Gradients

The gradient can be approximated numerically or computed analytically.

  • Numerical gradients approximate derivatives by measuring how the loss changes when parameters are slightly perturbed. This method is slow and approximate, so it is mainly used as a debugging tool.
  • Analytical gradients are computed exactly using calculus. In deep learning, they are computed efficiently using backpropagation, which I will cover in the next notes.
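The two approaches are typically combined in a gradient check: compare the analytical gradient against a centered-difference approximation. A minimal sketch with a toy loss (names are hypothetical):

```python
import numpy as np

def loss(W):
    return np.sum(W ** 3)      # toy loss; its analytical gradient is 3 * W**2

def analytical_grad(W):
    return 3 * W ** 2

def numerical_grad(f, W, h=1e-5):
    grad = np.zeros_like(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h
        f_plus = f(W)           # loss with this coordinate nudged up
        W.flat[i] = old - h
        f_minus = f(W)          # loss with this coordinate nudged down
        W.flat[i] = old         # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)  # centered difference
    return grad

W = np.array([1.0, -2.0, 0.5])
print(np.max(np.abs(numerical_grad(loss, W) - analytical_grad(W))))
```

If the implementation is correct, the maximum difference should be tiny (on the order of h squared for the centered formula).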

Dataset Strategies

Computing gradients over the entire dataset can be expensive. Different strategies trade off stability, speed, and noise:

| Strategy | Uses | Update Frequency | Pros | Cons |
|---|---|---|---|---|
| Batch Gradient Descent | Entire dataset | Per epoch | Stable and accurate gradients | Very slow for large datasets, high memory usage |
| Stochastic Gradient Descent (SGD) | Single sample | Per sample | Very fast updates, helps escape local minima | Noisy and unstable convergence |
| Mini-Batch Gradient Descent | Small subset of data | Per batch | Good balance between speed and stability, efficient on GPUs | Still stochastic |

Mini-batch gradient descent is the most commonly used strategy in practice.
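In code, mini-batch gradient descent samples a random subset of the data at each step. A sketch on a toy linear-regression problem (the dataset and all names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # toy dataset: 1000 samples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w                                     # targets from a known linear model

W = np.zeros(5)
learning_rate, batch_size = 0.1, 32

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ W - yb) / batch_size   # gradient of the mean squared error
    W -= learning_rate * grad

print(W)  # approaches the true weights
```

Each update only touches 32 of the 1000 samples, yet the noisy steps still converge to the correct parameters.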

While vanilla gradient descent updates parameters using only the current gradient, modern optimizers improve convergence by adapting how updates are applied.

Momentum

Momentum improves gradient descent by introducing a velocity term that keeps track of past gradients.

V_{t+1} = \mu V_t + \nabla_W L(W_t)

W_{t+1} = W_t - \alpha V_{t+1}

where:

  • V = velocity (running average of gradients)
  • \mu = momentum coefficient (typically 0.9-0.99)

# mu ~ 0.9; V starts as zeros with the same shape as W
V = np.zeros_like(W)
while True:
    dW = compute_gradient(W)   # gradient of the loss at the current W
    V = mu * V + dW            # accumulate a running velocity of gradients
    W -= learning_rate * V     # step along the velocity, not the raw gradient

Momentum behaves like pushing a ball downhill: it speeds up progress along consistent directions and damps oscillations across steep valleys.

RMSProp

RMSProp adapts the learning rate for each parameter using the history of squared gradients.

S_{t+1} = \rho S_t + (1 - \rho)(\nabla_W L(W_t))^2

W_{t+1} = W_t - \alpha \frac{\nabla_W L(W_t)}{\sqrt{S_{t+1}} + \epsilon}

where:

  • S = running average of squared gradients
  • \rho = decay rate (typically 0.99)
  • \epsilon = small constant for numerical stability

# decay_rate ~ 0.99; cache starts as zeros with the same shape as W
cache = np.zeros_like(W)
while True:
    dW = compute_gradient(W)
    cache = decay_rate * cache + (1 - decay_rate) * dW**2  # running average of squared gradients
    W -= learning_rate * dW / (np.sqrt(cache) + 1e-7)      # per-parameter adaptive step

RMSProp balances learning across parameters with different scales: parameters with consistently large gradients take smaller steps, while parameters with small gradients take larger steps.

Adam (Adaptive Moment Estimation)

Adam combines Momentum and RMSProp by tracking both the mean and variance of gradients.

First moment (Momentum):

M_{t+1} = \beta_1 M_t + (1 - \beta_1)\nabla_W L(W_t)

where:

  • \beta_1 \approx 0.9

Second moment (RMSProp-style):

V_{t+1} = \beta_2 V_t + (1 - \beta_2)(\nabla_W L(W_t))^2

where:

  • \beta_2 \approx 0.999

Final update:

Bias-corrected moments:

\hat{M}_{t+1} = \frac{M_{t+1}}{1 - \beta_1^{t+1}}

\hat{V}_{t+1} = \frac{V_{t+1}}{1 - \beta_2^{t+1}}

This correction accounts for the zero initialization of M_t and V_t, preventing the moment estimates from being artificially small at the start of training.
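To see the effect numerically, suppose every gradient equals some constant g. This short check (toy values, not real training) shows the raw first moment starts far below g, while the corrected estimate recovers g exactly:

```python
beta1 = 0.9
g = 4.0      # pretend every gradient is the constant g
m = 0.0      # zero-initialized first moment

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    # m starts at only 0.4 and crawls upward, while m_hat equals g from step one
    print(t, round(m, 4), round(m_hat, 4))
```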

W_{t+1} = W_t - \alpha \frac{\hat{M}_{t+1}}{\sqrt{\hat{V}_{t+1}} + \epsilon}

# m, v start as zeros with the same shape as W; t counts update steps
m, v, t = np.zeros_like(W), np.zeros_like(W), 0
while True:
    t += 1
    dW = compute_gradient(W)

    m = beta1 * m + (1 - beta1) * dW      # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dW**2   # second moment (RMSProp-style)

    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)

    W -= learning_rate * m_hat / (np.sqrt(v_hat) + 1e-7)

Adam maintains stable gradient directions thanks to Momentum and adapts the step size per parameter like RMSProp. It works well across many deep learning tasks and is often used as a strong default optimizer.


The classic way to compare optimizers is to visualize their trajectories on the same loss surface.

(Interactive demo: optimizer trajectories compared on a shared loss surface.)

Where Do Gradients Come From?

All optimizers above assume we can compute:

\nabla_W L(W)

For simple models this derivative can be computed manually. For neural networks, however, the loss is composed of many nested functions (layers).

Computing gradients efficiently requires a systematic method called backpropagation, which uses the chain rule over computational graphs.
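As a tiny preview, the chain rule multiplies the local derivatives of each nested function. A sketch for the composed function L = \sigma(wx)^2, with made-up values, verified against a small perturbation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, x = 0.5, 2.0

# Forward pass: evaluate the nested functions step by step.
z = w * x          # inner linear function
s = sigmoid(z)     # activation
L = s ** 2         # loss

# Backward pass: multiply local derivatives (the chain rule).
dL_ds = 2 * s
ds_dz = s * (1 - s)          # derivative of the sigmoid
dz_dw = x
dL_dw = dL_ds * ds_dz * dz_dw

# Numerical check with a small perturbation of w.
h = 1e-6
L_perturbed = sigmoid((w + h) * x) ** 2
print(dL_dw, (L_perturbed - L) / h)  # the two values agree closely
```

Backpropagation applies exactly this bookkeeping systematically, layer by layer, over the whole computational graph.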

That is the focus of the next note.
