On backpropagation

This is part of my series of notes on cs231n. To understand backpropagation you'll be better off reading the CS231n Backpropagation page. What follows are my notes based on that page, this lecture, and some private chat sessions.

A quick primer on MLPs before covering backpropagation.

A neural network is a composition of simple functions (modules) stacked together. The most basic module is an affine (fully-connected) map:

$$a = Wx + b$$

If we stack only affine maps, the whole network is still linear. To model non-linear decision boundaries we insert non-linearities (activation functions) applied elementwise:

$$h = \phi(a), \quad \text{e.g. } \phi(a) = \max(0, a) \text{ (ReLU)}$$

A 2-layer MLP (one hidden layer + classifier) is:

$$a_1 = W_1 x + b_1,\quad h_1 = \phi(a_1),\quad s = W_2 h_1 + b_2$$

A 3-layer MLP (two hidden layers + classifier) is:

$$a_1 = W_1 x + b_1,\quad h_1 = \phi(a_1),\quad a_2 = W_2 h_1 + b_2,\quad h_2 = \phi(a_2),\quad s = W_3 h_2 + b_3$$

In this post I use single-example, column-vector notation:

  • $x \in \mathbb{R}^{D\times 1}$
  • $W_1 \in \mathbb{R}^{H\times D}$, $b_1 \in \mathbb{R}^{H\times 1}$
  • $W_2 \in \mathbb{R}^{C\times H}$, $b_2 \in \mathbb{R}^{C\times 1}$
  • class scores $s \in \mathbb{R}^{C\times 1}$

Always pay attention to the shapes. It prevents most bugs.
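As a quick sanity check of those shapes, here is a minimal sketch with made-up sizes (D=4, H=3, C=2 are arbitrary, not from the post) showing how the dimensions line up through the forward pass:

```python
import numpy as np

D, H, C = 4, 3, 2  # hypothetical sizes: input dim, hidden dim, number of classes
x = np.zeros((D, 1))
W1, b1 = np.zeros((H, D)), np.zeros((H, 1))
W2, b2 = np.zeros((C, H)), np.zeros((C, 1))

a1 = W1 @ x + b1        # (H, D) @ (D, 1) + (H, 1) -> (H, 1)
h1 = np.maximum(0, a1)  # elementwise ReLU, stays (H, 1)
s  = W2 @ h1 + b2       # (C, H) @ (H, 1) + (C, 1) -> (C, 1)

assert a1.shape == (H, 1) and h1.shape == (H, 1) and s.shape == (C, 1)
```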


There are many activations in practice; which to choose is mostly empirical and architecture-dependent:

  • ReLU
  • Leaky ReLU (α=0.1)
  • ELU (α=1)
  • Sigmoid
  • GELU
  • SiLU
  • tanh
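As a sketch, here is how these activations can be written in NumPy (the function names are mine, and GELU uses its common tanh approximation rather than the exact erf form):

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def leaky_relu(a, alpha=0.1):
    return np.where(a > 0, a, alpha * a)   # small slope instead of 0 for negatives

def elu(a, alpha=1.0):
    return np.where(a > 0, a, alpha * (np.exp(a) - 1))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def silu(a):
    return a * sigmoid(a)                  # a.k.a. swish

def gelu(a):
    # tanh approximation of GELU, widely used in practice
    return 0.5 * a * (1 + np.tanh(np.sqrt(2 / np.pi) * (a + 0.044715 * a**3)))

a = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(a))   # negatives become 0, positives pass through
```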

Backpropagation

Backpropagation is the practical way we apply the chain rule to a big composed function (a network). The key trick is that we never need to symbolically expand the full expression.

Instead, we treat the network as a computational graph of small modules.

We saw in optimization that we update parameters by computing gradients (e.g. LW\frac{\partial L}{\partial W}). When networks get large, doing derivatives from scratch becomes painful and error-prone. Backprop + computational graphs give a systematic, reusable way to compute gradients.

Each module does two things:

  • Forward: transforms inputs to outputs and caches what it will need later.
  • Backward: transforms an upstream gradient into gradients for its parameters and a gradient for its input to pass to the previous module.

The backward pass is always:

$$\text{(upstream gradient)} \;\;\times\;\; \text{(local Jacobian)}$$

You can think of it as: "how the loss changes w.r.t. my output" times "how my output changes w.r.t. my input."
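To make "upstream × local" concrete, here is a scalar sketch (f and g are made-up toy functions, not from the post): for $L = f(g(x))$ with $f(u) = u^2$ and $g(x) = 3x$, the gradient at $x$ is the upstream gradient $\partial L/\partial u = 2u$ times the local derivative $\partial u/\partial x = 3$.

```python
def g(x): return 3.0 * x
def f(u): return u ** 2

x = 2.0
u = g(x)               # forward through g, cache u
L = f(u)               # forward through f

dL_du = 2.0 * u        # upstream gradient: how L changes w.r.t. f's input
du_dx = 3.0            # local derivative of g
dL_dx = dL_du * du_dx  # chain rule: upstream times local

# sanity check against a central finite difference
eps = 1e-6
numeric = (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)
assert abs(dL_dx - numeric) < 1e-4
```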

Example: 2-layer MLP + softmax cross-entropy

Assume:

  • $x \in \mathbb{R}^{D\times 1}$
  • $W_1 \in \mathbb{R}^{H\times D}$, $b_1 \in \mathbb{R}^{H\times 1}$
  • $W_2 \in \mathbb{R}^{C\times H}$, $b_2 \in \mathbb{R}^{C\times 1}$
  • label index $y \in \{1,\dots,C\}$ and one-hot $y^{(onehot)} \in \{0,1\}^{C\times 1}$

Forward pass

  1. Affine
$$(1)\quad a_1 = W_1 x + b_1 \quad \in \mathbb{R}^{H\times 1}$$
  2. ReLU
$$(2)\quad h_1 = \max(0, a_1) \quad \in \mathbb{R}^{H\times 1}$$
  3. Affine
$$(3)\quad s = W_2 h_1 + b_2 \quad \in \mathbb{R}^{C\times 1}$$
  4. Softmax
$$(4)\quad p = \text{softmax}(s) \quad \in \mathbb{R}^{C\times 1}$$

Softmax outputs probabilities that sum to 1: $\sum_{k=1}^C p_k = 1$.

  5. Cross-entropy loss

For the true class:

$$(5)\quad L = -\log p_y \quad \in \mathbb{R}$$

$p_y$ is the probability assigned to the true class $y$ by the softmax in (4).

Equivalently, with one-hot targets:

$$L = -\sum_{k=1}^C y^{(onehot)}_k \log p_k$$

Since $y^{(onehot)}_k$ is zero for every class except the true class $y$, this simplifies to $L = -\log p_y$.
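A quick numeric check of this equivalence, using a made-up probability vector:

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])   # hypothetical softmax output
y = 1                            # true class index
y_onehot = np.zeros(3)
y_onehot[y] = 1.0

L_onehot = -np.sum(y_onehot * np.log(p))   # one-hot form
L_direct = -np.log(p[y])                   # -log p_y
assert np.isclose(L_onehot, L_direct)
```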

Backward pass

  • Blue = forward pass activations (usually cached)
  • Green = ground-truth targets
  • Orange = backward pass gradients
  1. Cross-entropy + Softmax
$${\color{orange}{ds}} = \frac{\partial L}{\partial s} = {\color{deepskyblue}{p}} - {\color{forestgreen}{y^{(onehot)}}} \quad \in \mathbb{R}^{C\times 1}$$

For one-hot labels, this is literally "predicted probability − true probability".

  2. Affine: $s = W_2 h_1 + b_2$
$$\frac{\partial L}{\partial W_2} = {\color{orange}{ds}} \, {\color{deepskyblue}{h_1^T}} \quad \in \mathbb{R}^{C\times H}$$
$$\frac{\partial L}{\partial b_2} = {\color{orange}{ds}} \quad \in \mathbb{R}^{C\times 1}$$
$${\color{orange}{dh_1}} = \frac{\partial L}{\partial h_1} = {\color{deepskyblue}{W_2^T}} {\color{orange}{ds}} \quad \in \mathbb{R}^{H\times 1}$$
  3. ReLU: $h_1 = \max(0, a_1)$

ReLU passes gradient only where it was active:

$${\color{orange}{da_1}} = \frac{\partial L}{\partial a_1} = {\color{orange}{dh_1}} \odot \mathbb{1}[{\color{deepskyblue}{a_1}} > 0] \quad \in \mathbb{R}^{H\times 1}$$

$\odot$ is the elementwise (Hadamard) product. At $a_1 = 0$ the derivative is undefined; implementations typically pick 0.

  4. Affine: $a_1 = W_1 x + b_1$
$$\frac{\partial L}{\partial W_1} = {\color{orange}{da_1}} \, {\color{deepskyblue}{x^T}} \quad \in \mathbb{R}^{H\times D}$$
$$\frac{\partial L}{\partial b_1} = {\color{orange}{da_1}} \quad \in \mathbb{R}^{H\times 1}$$
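The four backward steps can be verified numerically. Here is a sketch with made-up sizes and random data (not from the post), comparing the analytic gradient for $W_1$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 4, 5, 3                  # hypothetical sizes
x = rng.standard_normal((D, 1))
y = 1                              # arbitrary true class
W1, b1 = rng.standard_normal((H, D)), rng.standard_normal((H, 1))
W2, b2 = rng.standard_normal((C, H)), rng.standard_normal((C, 1))

def loss(W1, b1, W2, b2):
    a1 = W1 @ x + b1
    h1 = np.maximum(0, a1)
    s = W2 @ h1 + b2
    p = np.exp(s - s.max()); p /= p.sum()
    return -np.log(p[y, 0])

# analytic gradients, following backward steps 1-4
a1 = W1 @ x + b1
h1 = np.maximum(0, a1)
s = W2 @ h1 + b2
p = np.exp(s - s.max()); p /= p.sum()
ds = p.copy(); ds[y] -= 1.0        # step 1: p - y_onehot
dW2, db2 = ds @ h1.T, ds           # step 2
dh1 = W2.T @ ds
da1 = dh1 * (a1 > 0)               # step 3: mask by active units
dW1, db1 = da1 @ x.T, da1          # step 4

# numerical check on a few entries of W1
eps = 1e-6
for i, j in [(0, 0), (2, 3), (4, 1)]:
    Wp, Wm = W1.copy(), W1.copy()
    Wp[i, j] += eps
    Wm[i, j] -= eps
    numeric = (loss(Wp, b1, W2, b2) - loss(Wm, b1, W2, b2)) / (2 * eps)
    assert abs(dW1[i, j] - numeric) < 1e-4
```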

Notice what happened? Each layer only needed the upstream gradient from the next module and the activations cached during its own forward pass.

Python implementation

Below is an intentionally small 2-layer MLP + softmax cross-entropy. Each module caches what it needs during forward; in backward it consumes an upstream gradient and returns the gradients it computes.

```python
import numpy as np

def softmax(s):
    s = s - np.max(s)          # shift for numerical stability
    exp = np.exp(s)
    return exp / np.sum(exp)

class Affine:
    """Fully-connected layer: out = W x + b."""
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None          # cached input for the backward pass

    def forward(self, x):
        self.x = x
        return self.W @ x + self.b

    def backward(self, dout):
        dW = dout @ self.x.T   # (out, 1) @ (1, in) -> (out, in)
        db = dout
        dx = self.W.T @ dout   # gradient passed to the previous module
        return dW, db, dx

class ReLU:
    def __init__(self):
        self.a = None          # cached pre-activation

    def forward(self, a):
        self.a = a
        return np.maximum(0, a)

    def backward(self, dout):
        return dout * (self.a > 0)   # pass gradient only where ReLU was active

class SoftmaxCrossEntropy:
    def __init__(self):
        self.p = None
        self.y_onehot = None

    def forward(self, s, y):
        p = softmax(s)
        y_onehot = np.zeros_like(p)
        y_onehot[y] = 1.0

        self.p = p
        self.y_onehot = y_onehot
        return -np.log(p[y] + 1e-12)   # epsilon guards against log(0)

    def backward(self):
        return self.p - self.y_onehot  # ds = p - y_onehot

class TwoLayerNet:
    def __init__(self, W1, b1, W2, b2):
        self.fc1 = Affine(W1, b1)
        self.relu = ReLU()
        self.fc2 = Affine(W2, b2)

    def forward(self, x):
        a1 = self.fc1.forward(x)
        h1 = self.relu.forward(a1)
        s  = self.fc2.forward(h1)
        return s

    def backward(self, ds):
        dW2, db2, dh1 = self.fc2.backward(ds)
        da1 = self.relu.backward(dh1)
        dW1, db1, dx  = self.fc1.backward(da1)
        return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2, "x": dx}

class SGD:
    def __init__(self, lr=1e-2):
        self.lr = lr

    def step(self, net, grads):
        net.fc1.W -= self.lr * grads["W1"]
        net.fc1.b -= self.lr * grads["b1"]
        net.fc2.W -= self.lr * grads["W2"]
        net.fc2.b -= self.lr * grads["b2"]

def train_step(net, optim, x, y, loss):
    s = net.forward(x)         # forward pass: scores
    L = loss.forward(s, y)     # forward pass: loss

    ds = loss.backward()       # start of the backward pass
    grads = net.backward(ds)

    optim.step(net, grads)     # gradient descent update
    return L
```

Note how each layer is composable: the forward pass maps inputs to outputs (and caches what it needs), the backward pass consumes an upstream gradient and produces gradients for its inputs and parameters. The chain rule takes care of the rest.

Most training loops reduce to this train_step logic.
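For intuition, here is a self-contained, functional sketch of that loop on a single made-up example (sizes, seed, and learning rate are arbitrary). Because the net is overfitting one sample, the loss should shrink over the iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, C = 4, 8, 3                  # hypothetical sizes
W1, b1 = 0.1 * rng.standard_normal((H, D)), np.zeros((H, 1))
W2, b2 = 0.1 * rng.standard_normal((C, H)), np.zeros((C, 1))
x = rng.standard_normal((D, 1))
y = 2                              # arbitrary true class
lr = 0.5

losses = []
for _ in range(20):
    # forward pass
    a1 = W1 @ x + b1
    h1 = np.maximum(0, a1)
    s = W2 @ h1 + b2
    p = np.exp(s - s.max()); p /= p.sum()
    losses.append(-np.log(p[y, 0]))
    # backward pass (same formulas as the derivation above)
    ds = p.copy(); ds[y] -= 1.0
    dW2, db2 = ds @ h1.T, ds
    dh1 = W2.T @ ds
    da1 = dh1 * (a1 > 0)
    dW1, db1 = da1 @ x.T, da1
    # SGD update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

assert losses[-1] < losses[0]      # loss goes down as the net fits the sample
```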




From this point on, most deep learning problems follow the same general recipe: represent your data as tensors, design a computational graph that maps input tensors to output tensors, collect a dataset of input–output pairs, define a loss function that measures prediction error, and optimize that loss using gradient descent with backpropagation. By repeatedly measuring mistakes and correcting them, learning emerges. This is the foundation of deep learning.

Nice video recap by Justin Johnson
