On backpropagation
This is part of my series of notes on CS231n. To really understand backpropagation you'll be better off reading the CS231n Backpropagation page. What follows are my notes based on that page, this lecture, and some private chat sessions.
A quick primer on MLPs before covering backpropagation.
A neural network is a composition of simple functions (modules) stacked together. The most basic module is an affine (fully-connected) map:

$$a = Wx + b$$
If we stack only affine maps, the whole network is still linear. To model non-linear decision boundaries we insert non-linearities (activation functions) applied elementwise:

$$h = \sigma(a), \quad \text{e.g. } \sigma(a) = \max(0, a) \ \text{(ReLU)}$$
A 2-layer MLP (one hidden layer + classifier) is:

$$s = W_2\,\sigma(W_1 x + b_1) + b_2$$
A 3-layer MLP (two hidden layers + classifier) is:

$$s = W_3\,\sigma(W_2\,\sigma(W_1 x + b_1) + b_2) + b_3$$
In this post I use single-example, column-vector notation:
- input $x \in \mathbb{R}^{D}$, hidden activation $h_1 \in \mathbb{R}^{H}$
- weights $W_1 \in \mathbb{R}^{H \times D}$, $b_1 \in \mathbb{R}^{H}$, $W_2 \in \mathbb{R}^{C \times H}$, $b_2 \in \mathbb{R}^{C}$
- class scores $s \in \mathbb{R}^{C}$
Always pay attention to the shapes. It prevents most bugs.
There are many activations in practice (sigmoid, tanh, ReLU, Leaky ReLU, ELU, ...); which to choose is mostly empirical and architecture-dependent.
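As a sanity check on the notation above, here is a minimal sketch of the 2-layer forward pass with explicit shapes; the dimensions `D=4, H=8, C=3` are arbitrary choices of mine, not from the course:

```python
import numpy as np

D, H, C = 4, 8, 3                  # input dim, hidden dim, number of classes
x = np.random.randn(D, 1)          # one example, as a column vector

W1, b1 = np.random.randn(H, D), np.zeros((H, 1))
W2, b2 = np.random.randn(C, H), np.zeros((C, 1))

a1 = W1 @ x + b1                   # affine: (H, D) @ (D, 1) + (H, 1) -> (H, 1)
h1 = np.maximum(0, a1)             # ReLU, applied elementwise
s  = W2 @ h1 + b2                  # scores: (C, H) @ (H, 1) + (C, 1) -> (C, 1)
assert s.shape == (C, 1)
```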
Backpropagation
Backpropagation is the practical way we apply the chain rule to a big composed function (a network). The key trick is that we never need to symbolically expand the full expression.
Instead, we treat the network as a computational graph of small modules.
We saw in optimization that we update parameters by computing gradients (e.g. $W \leftarrow W - \eta\,\nabla_W L$). When networks get large, deriving gradients by hand becomes painful and error-prone. Backprop + computational graphs give a systematic, reusable way to compute gradients.
Each module does two things:
- Forward: transforms inputs to outputs and caches what it will need later.
- Backward: transforms an upstream gradient into gradients for its parameters and a gradient for its input to pass to the previous module.
The backward pass is always:

$$\frac{\partial L}{\partial \text{input}} = \frac{\partial L}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial \text{input}}$$
You can think of it as: "how the loss changes w.r.t my output" times "how my output changes w.r.t my input."
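To make the module contract concrete, here is a toy sketch of my own, in the spirit of the CS231n gate examples: a multiply gate $z = xy$ that caches its inputs in forward and applies "upstream times local" in backward.

```python
class MultiplyGate:
    """Smallest possible module: z = x * y."""
    def forward(self, x, y):
        self.x, self.y = x, y        # cache inputs for the backward pass
        return x * y

    def backward(self, dout):
        # upstream gradient (dL/dz) times local gradients (dz/dx = y, dz/dy = x)
        dx = dout * self.y
        dy = dout * self.x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)          # z = -12
dx, dy = gate.backward(2.0)          # pretend dL/dz = 2: dx = -8, dy = 6
```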
Example: 2-layer MLP + softmax cross-entropy
Assume:
- $x \in \mathbb{R}^{D}$, $W_1 \in \mathbb{R}^{H \times D}$, $b_1 \in \mathbb{R}^{H}$
- $W_2 \in \mathbb{R}^{C \times H}$, $b_2 \in \mathbb{R}^{C}$
- label index $y \in \{1, \dots, C\}$ and one-hot target $t \in \{0, 1\}^{C}$
Forward pass
- Affine: $a_1 = W_1 x + b_1$
- ReLU: $h_1 = \max(0, a_1)$ (elementwise)
- Affine: $s = W_2 h_1 + b_2$
- Softmax: $p = \mathrm{softmax}(s)$
Softmax outputs probabilities that sum to 1:

$$p_k = \frac{e^{s_k}}{\sum_{j} e^{s_j}}$$
- Cross-entropy loss

For the true class:

$$L = -\log p_y$$

$p_y$ is the probability assigned to the true class by the softmax (step 4).

Equivalently, with one-hot targets:

$$L = -\sum_{k} t_k \log p_k$$

Since $t_k$ is zero for all classes except the true class $y$, it simplifies to $L = -\log p_y$.
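To get a concrete feel for softmax + cross-entropy, here is a tiny numeric example; the scores are arbitrary numbers I picked:

```python
import numpy as np

s = np.array([2.0, 1.0, 0.1])            # arbitrary class scores
p = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
print(p.round(3), p.sum())               # ~[0.659 0.242 0.099], sums to 1.0

y = 0                                    # suppose the true class is 0
print(-np.log(p[y]))                     # loss ~0.417
```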
Backward pass
(Legend for the computational-graph figure: blue = forward-pass activations (usually cached), green = ground-truth targets, orange = backward-pass gradients.)
- Cross-entropy + Softmax: $\dfrac{\partial L}{\partial s} = p - t$

For one-hot labels, this is literally "predicted probability − true probability". (You can verify it numerically; see the gradient-check sketch after this derivation.)
- Affine: $\dfrac{\partial L}{\partial W_2} = \dfrac{\partial L}{\partial s}\, h_1^{\top}$, $\quad \dfrac{\partial L}{\partial b_2} = \dfrac{\partial L}{\partial s}$, $\quad \dfrac{\partial L}{\partial h_1} = W_2^{\top} \dfrac{\partial L}{\partial s}$
- ReLU: $\dfrac{\partial L}{\partial a_1} = \dfrac{\partial L}{\partial h_1} \odot \mathbb{1}[a_1 > 0]$

ReLU passes gradient only where it was active. $\odot$ is the elementwise (Hadamard) product. At $a_1 = 0$ the derivative is undefined, and implementations typically pick 0.
- Affine: $\dfrac{\partial L}{\partial W_1} = \dfrac{\partial L}{\partial a_1}\, x^{\top}$, $\quad \dfrac{\partial L}{\partial b_1} = \dfrac{\partial L}{\partial a_1}$, $\quad \dfrac{\partial L}{\partial x} = W_1^{\top} \dfrac{\partial L}{\partial a_1}$
Notice what happened? Each layer only needed the upstream gradient from the next module and the activation values cached from its own forward pass.
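The $p - t$ result, like any hand-derived gradient, is easy to verify with a centered finite-difference check. A minimal standalone sketch (with its own softmax, using the same toy scores as above):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

s = np.array([2.0, 1.0, 0.1])
y = 0
t = np.eye(3)[y]                         # one-hot target

analytic = softmax(s) - t                # the derived dL/ds

# centered differences: (L(s + eps*e_k) - L(s - eps*e_k)) / (2*eps)
eps = 1e-5
numeric = np.zeros_like(s)
for k in range(len(s)):
    sp, sm = s.copy(), s.copy()
    sp[k] += eps
    sm[k] -= eps
    numeric[k] = (-np.log(softmax(sp)[y]) + np.log(softmax(sm)[y])) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))   # should be tiny, ~1e-10
```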
Python implementation
Below is an intentionally small 2-layer MLP + softmax cross-entropy. Each module caches what it needs during forward; backward consumes the upstream gradient and returns gradients for its parameters and input.
```python
import numpy as np

def softmax(s):
    s = s - np.max(s)              # shift for numerical stability
    exp = np.exp(s)
    return exp / np.sum(exp)

class Affine:
    # Expects column vectors: x has shape (D, 1), b has shape (H, 1),
    # so that dout @ x.T in backward is an outer product.
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None              # cached input, needed for dW in backward

    def forward(self, x):
        self.x = x
        return self.W @ x + self.b

    def backward(self, dout):
        dW = dout @ self.x.T       # dL/dW = upstream @ x^T
        db = dout                  # dL/db = upstream
        dx = self.W.T @ dout       # dL/dx = W^T @ upstream, passed to previous module
        return dW, db, dx

class ReLU:
    def __init__(self):
        self.a = None              # cached pre-activation, needed for the mask

    def forward(self, a):
        self.a = a
        return np.maximum(0, a)

    def backward(self, dout):
        return dout * (self.a > 0) # gradient flows only where ReLU was active

class SoftmaxCrossEntropy:
    def __init__(self):
        self.p = None
        self.y_onehot = None

    def forward(self, s, y):
        p = softmax(s)
        y_onehot = np.zeros_like(p)
        y_onehot[y] = 1.0
        self.p = p
        self.y_onehot = y_onehot
        return -np.log(p[y] + 1e-12)   # small epsilon guards against log(0)

    def backward(self):
        return self.p - self.y_onehot  # the "predicted minus true" gradient

class TwoLayerNet:
    def __init__(self, W1, b1, W2, b2):
        self.fc1 = Affine(W1, b1)
        self.relu = ReLU()
        self.fc2 = Affine(W2, b2)

    def forward(self, x):
        a1 = self.fc1.forward(x)
        h1 = self.relu.forward(a1)
        s = self.fc2.forward(h1)
        return s

    def backward(self, ds):
        # walk the graph in reverse, passing each module's dx upstream
        dW2, db2, dh1 = self.fc2.backward(ds)
        da1 = self.relu.backward(dh1)
        dW1, db1, dx = self.fc1.backward(da1)
        return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2, "x": dx}

class SGD:
    def __init__(self, lr=1e-2):
        self.lr = lr

    def step(self, net, grads):
        net.fc1.W -= self.lr * grads["W1"]
        net.fc1.b -= self.lr * grads["b1"]
        net.fc2.W -= self.lr * grads["W2"]
        net.fc2.b -= self.lr * grads["b2"]

def train_step(net, optim, x, y, loss):
    s = net.forward(x)        # forward: scores
    L = loss.forward(s, y)    # forward: loss
    ds = loss.backward()      # backward: dL/ds
    grads = net.backward(ds)  # backward: all parameter gradients
    optim.step(net, grads)    # parameter update
    return L
```

Note how each layer is composable: the forward pass maps inputs to outputs (and caches what it needs), and the backward pass consumes an upstream gradient and produces gradients for its inputs and parameters. The chain rule takes care of the rest.
Most training loops reduce to this train_step logic.
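As a smoke test, here is one way to wire the classes above together; the sizes and hyperparameters are arbitrary, and the column-vector convention ($x$ of shape $(D, 1)$, biases of shape $(H, 1)$ / $(C, 1)$) matters because the Affine backward relies on it:

```python
np.random.seed(0)
D, H, C = 4, 8, 3                  # arbitrary sizes for the smoke test

net = TwoLayerNet(
    W1=0.1 * np.random.randn(H, D), b1=np.zeros((H, 1)),
    W2=0.1 * np.random.randn(C, H), b2=np.zeros((C, 1)),
)
loss = SoftmaxCrossEntropy()
optim = SGD(lr=0.5)

x = np.random.randn(D, 1)          # one example, column-vector convention
y = 2                              # arbitrary target class

for i in range(20):
    L = train_step(net, optim, x, y, loss)
    if i % 5 == 0:
        print(f"step {i:2d}  loss {L.item():.4f}")
# The loss should fall toward 0 as the net memorizes this single example.
```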
From this point on, most deep learning problems follow the same general recipe: represent your data as tensors, design a computational graph that maps input tensors to output tensors, collect a dataset of input–output pairs, define a loss function that measures prediction error, and optimize that loss with gradient descent and backpropagation. By repeatedly measuring mistakes and correcting them, learning emerges. This is the foundation of deep learning.