On optimization
So far, we have defined:
- A score function that maps inputs to class scores.
- A loss function that tells us how good or bad our parameters are.
- A regularization term that helps prevent overfitting.
What remains is arguably the most important part of machine learning: how do we actually find the parameters that minimize the loss?
This process is called optimization.
Gradient Descent
The fundamental idea behind most modern deep learning training is Gradient Descent.
Suppose our model has parameters W. These parameters may include all weight matrices and bias vectors in the network. We can think of them collectively as a single large parameter vector.
Our goal is to minimize the loss:

$$W^{*} = \arg\min_{W} L(W)$$
Gradients
When the loss depends on many parameters, we need to know how the loss changes with respect to each parameter independently. We measure this using partial derivatives.
For example:
- $\frac{\partial L}{\partial W_{ij}}$ measures how the loss changes when we slightly change the element $W_{ij}$ while keeping all other parameters fixed.
Rather than writing derivatives for every element separately, we group them into the gradient:

$$\nabla_{W} L = \left( \frac{\partial L}{\partial W_{ij}} \right)$$

The gradient $\nabla_{W} L$ has the same shape as $W$ and points in the direction of steepest increase of the loss.
Since we want to minimize the loss, we update parameters by moving in the opposite direction:

$$W \leftarrow W - \eta \, \nabla_{W} L$$

where:
- $\eta$ = learning rate (step size)
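As a minimal sketch, the update rule above can be turned into a loop. The quadratic loss and its gradient below are illustrative choices (not from these notes), picked so that the true minimum is known:

```python
import numpy as np

# Illustrative loss: L(W) = ||W||^2, whose gradient is 2W.
def compute_gradient(W):
    return 2 * W

W = np.array([3.0, -2.0])  # initial parameters
learning_rate = 0.1

for step in range(100):
    dW = compute_gradient(W)   # direction of steepest increase
    W -= learning_rate * dW    # step against the gradient

print(W)  # close to the minimum at [0, 0]
```

Each iteration shrinks the parameters toward the minimizer; with a too-large learning rate the same loop would diverge instead.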
Numerical vs Analytical Gradients
The gradient can be approximated numerically or computed analytically.
- Numerical gradients approximate derivatives by measuring how the loss changes when parameters are slightly perturbed. This method is slow but simple, and is mainly used as a debugging tool.
- Analytical gradients are computed exactly using calculus. In deep learning, these are efficiently computed using backpropagation, which I will cover in the next notes.
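A common way to use the numerical gradient as a debugging tool is a gradient check: compare a centered-difference approximation against the analytical gradient. The toy loss below is an illustrative choice so the analytical gradient is known exactly:

```python
import numpy as np

def loss(W):
    # Toy loss used only for illustration: sum of squares.
    return np.sum(W ** 2)

def numerical_gradient(f, W, h=1e-5):
    """Centered-difference approximation: (f(W+h) - f(W-h)) / (2h) per element."""
    grad = np.zeros_like(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h
        fp = f(W)
        W.flat[i] = old - h
        fm = f(W)
        W.flat[i] = old  # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

W = np.array([1.0, -2.0, 0.5])
analytic = 2 * W  # exact gradient of sum of squares
numeric = numerical_gradient(loss, W)
print(np.max(np.abs(analytic - numeric)))  # should be tiny
```

Because it requires two loss evaluations per parameter, this check is far too slow for training; it only verifies that a backpropagation implementation is correct on a handful of parameters.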
Dataset Strategies
Computing gradients over the entire dataset can be expensive. Different strategies trade off stability, speed, and noise:
| Strategy | Uses | Update Frequency | Pros | Cons |
|---|---|---|---|---|
| Batch Gradient Descent | Entire dataset | Per epoch | Stable and accurate gradients | Very slow for large datasets, high memory usage |
| Stochastic Gradient Descent (SGD) | Single sample | Per sample | Very fast updates, helps escape local minima | Noisy and unstable convergence |
| Mini-Batch Gradient Descent | Small subset of data | Per batch | Good balance between speed and stability, efficient on GPUs | Still stochastic |
Mini-batch gradient descent is the most commonly used in practice.
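The mini-batch strategy can be sketched on a toy linear-regression problem. The data, batch size, and learning rate here are illustrative choices, not values from the notes:

```python
import numpy as np

# Mini-batch gradient descent on synthetic linear-regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)  # ground-truth weights [1, 2, 3, 4, 5]
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
learning_rate, batch_size = 0.1, 64

for epoch in range(20):
    order = rng.permutation(len(X))  # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on the batch
        w -= learning_rate * grad  # one update per mini-batch

print(w)  # close to true_w
```

Each epoch performs many cheap updates on small subsets instead of one expensive update on the full dataset, which is why this is the default in practice.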
While vanilla gradient descent updates parameters using only the current gradient, modern optimizers improve convergence by adapting how updates are applied.
Momentum
Momentum improves gradient descent by introducing a velocity term that keeps track of past gradients:

$$v \leftarrow \mu v + \nabla_{W} L$$
$$W \leftarrow W - \eta \, v$$

where:
- $v$ = velocity (running average of gradients)
- $\mu$ = momentum coefficient (typically 0.9-0.99)

```python
V = np.zeros_like(W)  # initialize velocity
while True:
    dW = compute_gradient(W)
    V = mu * V + dW
    W -= learning_rate * V
```

Momentum behaves like pushing a ball downhill: it speeds up progress along stable directions and reduces oscillations across steep valleys.
RMSProp
RMSProp adapts the learning rate for each parameter using the history of squared gradients:

$$c \leftarrow \rho \, c + (1 - \rho) (\nabla_{W} L)^2$$
$$W \leftarrow W - \eta \, \frac{\nabla_{W} L}{\sqrt{c} + \epsilon}$$

where:
- $c$ = running average of squared gradients (the cache)
- $\rho$ = decay rate (typically 0.99)
- $\epsilon$ = small constant for numerical stability

```python
cache = np.zeros_like(W)  # initialize the running average
while True:
    dW = compute_gradient(W)
    cache = decay_rate * cache + (1 - decay_rate) * dW**2
    W -= learning_rate * dW / (np.sqrt(cache) + 1e-7)
```

RMSProp helps balance learning across parameters with different scales: large gradients produce smaller update steps, while small gradients produce larger ones.
Adam (Adaptive Moment Estimation)
Adam combines Momentum and RMSProp by tracking both the mean and variance of gradients.

First moment (Momentum):

$$m \leftarrow \beta_1 m + (1 - \beta_1) \nabla_{W} L$$

where:
- $\beta_1$ = decay rate for the first moment (typically 0.9)

Second moment (RMSProp-style):

$$v \leftarrow \beta_2 v + (1 - \beta_2) (\nabla_{W} L)^2$$

where:
- $\beta_2$ = decay rate for the second moment (typically 0.999)

Bias-corrected moments:

$$\hat{m} = \frac{m}{1 - \beta_1^t}, \qquad \hat{v} = \frac{v}{1 - \beta_2^t}$$

This correction accounts for the zero initialization of $m$ and $v$, preventing the moment estimates from being artificially small at the start of training.

Final update:

$$W \leftarrow W - \eta \, \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

```python
m, v, t = np.zeros_like(W), np.zeros_like(W), 0  # moments and step counter
while True:
    t += 1
    dW = compute_gradient(W)
    m = beta1 * m + (1 - beta1) * dW
    v = beta2 * v + (1 - beta2) * dW**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    W -= learning_rate * m_hat / (np.sqrt(v_hat) + 1e-7)
```

Adam maintains stable gradient directions thanks to Momentum and adapts step size per parameter like RMSProp. It works well across many deep learning tasks and is often used as a strong default optimizer.
The classic way to compare optimizers is to visualize their trajectories on the same loss surface.
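Even without plotting, the comparison can be sketched numerically by running each update rule on the same elongated 2D quadratic and checking how close each ends up to the minimum. The loss surface, step sizes, and step count below are illustrative choices:

```python
import numpy as np

# Toy loss L(w) = 0.5 * (w0^2 + 10 * w1^2): a valley that is
# steep along w1 and shallow along w0. Minimum is at (0, 0).
def grad(w):
    return np.array([w[0], 10.0 * w[1]])

def run(update, steps=100):
    w, state = np.array([2.0, 2.0]), {}
    path = [w.copy()]
    for _ in range(steps):
        w = update(w, grad(w), state)
        path.append(w.copy())
    return np.array(path)

def sgd(w, g, s, lr=0.05):
    return w - lr * g

def momentum(w, g, s, lr=0.05, mu=0.9):
    s['v'] = mu * s.get('v', 0.0) + g        # accumulate velocity
    return w - lr * s['v']

def rmsprop(w, g, s, lr=0.05, decay=0.99, eps=1e-7):
    s['c'] = decay * s.get('c', 0.0) + (1 - decay) * g**2
    return w - lr * g / (np.sqrt(s['c']) + eps)

for name, upd in [('sgd', sgd), ('momentum', momentum), ('rmsprop', rmsprop)]:
    final = run(upd)[-1]
    print(name, np.linalg.norm(final))  # distance to the minimum at (0, 0)
```

The `path` arrays returned by `run` are exactly the trajectories one would plot over the loss surface contours.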
Where Do Gradients Come From?
All optimizers above assume we can compute the gradient of the loss with respect to the parameters:

$$\nabla_{W} L$$
For simple models this derivative can be computed manually. For neural networks, however, the loss is composed of many nested functions (layers).
Computing gradients efficiently requires a systematic method called backpropagation, which uses the chain rule over computational graphs.
That is the focus of the next note.