On optimization

So far, we have defined:

  • A score function that maps inputs to class scores.
  • A loss function that tells us how good or bad our parameters are.
  • A regularization term that helps prevent overfitting.

What remains is arguably the most important part of machine learning: how do we actually find the parameters that minimize the loss?

This process is called optimization.

Gradient Descent

The fundamental idea behind most modern deep learning training is Gradient Descent.

Suppose our model has parameters W. These parameters may include all weight matrices and bias vectors in the network. We can think of them collectively as a single large parameter vector.

Our goal is to minimize the loss:

L(W)

Gradients

When the loss depends on many parameters, we need to know how the loss changes with respect to each parameter independently. We measure this using partial derivatives.

For example:

  • \frac{\partial L}{\partial W_{ij}} measures how the loss changes when we slightly change the element W_{ij} while keeping all other parameters fixed.

Rather than writing derivatives for every element separately, we group them into the gradient:

\nabla_W L(W)

The gradient has the same shape as W and points in the direction of steepest increase of the loss.

Since we want to minimize the loss, we update parameters by moving in the opposite direction:

W_{t+1} = W_t - \alpha \nabla_W L(W_t)

where:

  • \alpha = learning rate (step size)
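As a minimal sketch of this update rule, the loop below applies gradient descent to a toy quadratic loss L(W) = \sum W^2 (the function names and values here are illustrative, not from a real training setup):

```python
import numpy as np

def loss(W):
    return np.sum(W ** 2)   # toy loss with minimum at W = 0

def gradient(W):
    return 2 * W            # analytical gradient of the toy loss

W = np.array([3.0, -2.0])   # arbitrary starting point
alpha = 0.1                 # learning rate

for _ in range(100):
    W = W - alpha * gradient(W)   # step opposite the gradient

print(loss(W))  # the loss shrinks toward zero
```

Each step multiplies W by (1 - 2α), so with a well-chosen learning rate the iterates contract geometrically toward the minimum.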

Numerical vs Analytical Gradients

The gradient can be approximated numerically or computed analytically.

  • Numerical gradients approximate derivatives by measuring how the loss changes when parameters are slightly perturbed. This method is slow and approximate, so it is mainly used as a debugging tool.
  • Analytical gradients are computed exactly using calculus. In deep learning, they are computed efficiently using backpropagation, which I will cover in the next notes.
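The two approaches are typically combined in a gradient check: compare the analytical gradient against a centered-difference approximation. A minimal sketch with a toy loss (names are hypothetical):

```python
import numpy as np

def loss(W):
    return np.sum(W ** 3)      # toy loss; its analytical gradient is 3 * W**2

def analytical_grad(W):
    return 3 * W ** 2

def numerical_grad(f, W, h=1e-5):
    grad = np.zeros_like(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h
        f_plus = f(W)           # loss with this coordinate nudged up
        W.flat[i] = old - h
        f_minus = f(W)          # loss with this coordinate nudged down
        W.flat[i] = old         # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * h)  # centered difference
    return grad

W = np.array([1.0, -2.0, 0.5])
print(np.max(np.abs(numerical_grad(loss, W) - analytical_grad(W))))
```

If the implementation is correct, the maximum difference should be tiny (on the order of h squared for the centered formula).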

Dataset Strategies

Computing gradients over the entire dataset can be expensive. Different strategies trade off stability, speed, and noise:

| Strategy | Uses | Update Frequency | Pros | Cons |
|---|---|---|---|---|
| Batch Gradient Descent | Entire dataset | Per epoch | Stable and accurate gradients | Very slow for large datasets, high memory usage |
| Stochastic Gradient Descent (SGD) | Single sample | Per sample | Very fast updates, helps escape local minima | Noisy and unstable convergence |
| Mini-Batch Gradient Descent | Small subset of data | Per batch | Good balance between speed and stability, efficient on GPUs | Still stochastic |

Mini-batch gradient descent is the most commonly used strategy in practice.
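In code, mini-batch gradient descent samples a random subset of the data at each step. A sketch on a toy linear-regression problem (the dataset and all names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                     # toy dataset: 1000 samples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w                                     # targets from a known linear model

W = np.zeros(5)
learning_rate, batch_size = 0.1, 32

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ W - yb) / batch_size   # gradient of the mean squared error
    W -= learning_rate * grad

print(W)  # approaches the true weights
```

Each update only touches 32 of the 1000 samples, yet the noisy steps still converge to the correct parameters.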

While vanilla gradient descent updates parameters using only the current gradient, modern optimizers improve convergence by adapting how updates are applied.

Momentum

Momentum improves gradient descent by introducing a velocity term that keeps track of past gradients.

V_{t+1} = \mu V_t + \nabla_W L(W_t)

W_{t+1} = W_t - \alpha V_{t+1}

where:

  • V = velocity (running average of gradients)
  • \mu = momentum coefficient (typically 0.9-0.99)

# mu ~ 0.9; V starts as zeros with the same shape as W
V = np.zeros_like(W)
while True:
    dW = compute_gradient(W)   # gradient of the loss at the current W
    V = mu * V + dW            # accumulate a running velocity of gradients
    W -= learning_rate * V     # step along the velocity, not the raw gradient

Momentum behaves like pushing a ball downhill: it speeds up progress along consistent directions and damps oscillations across steep valleys.

RMSProp

RMSProp adapts the learning rate for each parameter using the history of squared gradients.

S_{t+1} = \rho S_t + (1 - \rho)(\nabla_W L(W_t))^2

W_{t+1} = W_t - \alpha \frac{\nabla_W L(W_t)}{\sqrt{S_{t+1}} + \epsilon}

where:

  • S = running average of squared gradients
  • \rho = decay rate (typically 0.99)
  • \epsilon = small constant for numerical stability

# decay_rate ~ 0.99; cache starts as zeros with the same shape as W
cache = np.zeros_like(W)
while True:
    dW = compute_gradient(W)
    cache = decay_rate * cache + (1 - decay_rate) * dW**2  # running average of squared gradients
    W -= learning_rate * dW / (np.sqrt(cache) + 1e-7)      # per-parameter adaptive step

RMSProp balances learning across parameters with different scales: parameters with consistently large gradients take smaller steps, while parameters with small gradients take larger steps.

Adam (Adaptive Moment Estimation)

Adam combines Momentum and RMSProp by tracking both the mean and variance of gradients.

First moment (Momentum):

M_{t+1} = \beta_1 M_t + (1 - \beta_1)\nabla_W L(W_t)

where:

  • \beta_1 \approx 0.9

Second moment (RMSProp-style):

V_{t+1} = \beta_2 V_t + (1 - \beta_2)(\nabla_W L(W_t))^2

where:

  • \beta_2 \approx 0.999

Final update:

Bias-corrected moments:

\hat{M}_{t+1} = \frac{M_{t+1}}{1 - \beta_1^{t+1}}

\hat{V}_{t+1} = \frac{V_{t+1}}{1 - \beta_2^{t+1}}

This correction accounts for the zero initialization of M_t and V_t, preventing the moment estimates from being artificially small at the start of training.
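To see the effect numerically, suppose every gradient equals some constant g. This short check (toy values, not real training) shows the raw first moment starts far below g, while the corrected estimate recovers g exactly:

```python
beta1 = 0.9
g = 4.0      # pretend every gradient is the constant g
m = 0.0      # zero-initialized first moment

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1 ** t)   # bias-corrected estimate
    # m starts at only 0.4 and crawls upward, while m_hat equals g from step one
    print(t, round(m, 4), round(m_hat, 4))
```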

W_{t+1} = W_t - \alpha \frac{\hat{M}_{t+1}}{\sqrt{\hat{V}_{t+1}} + \epsilon}

# m, v start as zeros with the same shape as W; t counts update steps
m, v, t = np.zeros_like(W), np.zeros_like(W), 0
while True:
    t += 1
    dW = compute_gradient(W)

    m = beta1 * m + (1 - beta1) * dW      # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dW**2   # second moment (RMSProp-style)

    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)

    W -= learning_rate * m_hat / (np.sqrt(v_hat) + 1e-7)

Adam maintains stable gradient directions thanks to Momentum and adapts the step size per parameter like RMSProp. It works well across many deep learning tasks and is often used as a strong default optimizer.


The classic way to compare optimizers is to visualize their trajectories on the same loss surface.

(Interactive demo: optimizer trajectories compared on a shared loss surface.)

Where Do Gradients Come From?

All optimizers above assume we can compute:

\nabla_W L(W)

For simple models this derivative can be computed manually. For neural networks, however, the loss is composed of many nested functions (layers).

Computing gradients efficiently requires a systematic method called backpropagation, which uses the chain rule over computational graphs.
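As a tiny preview, the chain rule multiplies the local derivatives of each nested function. A sketch for the composed function L = \sigma(wx)^2, with made-up values, verified against a small perturbation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, x = 0.5, 2.0

# Forward pass: evaluate the nested functions step by step.
z = w * x          # inner linear function
s = sigmoid(z)     # activation
L = s ** 2         # loss

# Backward pass: multiply local derivatives (the chain rule).
dL_ds = 2 * s
ds_dz = s * (1 - s)          # derivative of the sigmoid
dz_dw = x
dL_dw = dL_ds * ds_dz * dz_dw

# Numerical check with a small perturbation of w.
h = 1e-6
L_perturbed = sigmoid((w + h) * x) ** 2
print(dL_dw, (L_perturbed - L) / h)  # the two values agree closely
```

Backpropagation applies exactly this bookkeeping systematically, layer by layer, over the whole computational graph.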

That is the focus of the next note.
