On loss functions
Softmax & Cross-Entropy Loss
Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$,
where $x_i$ is the $i$-th image and $y_i$ is its integer class label. The goal of a loss function is to measure how well the predicted scores match the ground-truth labels.
The loss over the dataset is defined as the average loss across all training examples:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i, W),\, y_i\big)$$

where:
- $s = f(x_i, W)$ are the predicted class scores (also called logits)
- $L_i$ is the loss for the $i$-th training example
There are many valid choices for the loss function. A very common and important one is the Softmax loss, also known as cross-entropy loss.
Softmax function
The score function produces raw class scores:

$$s = f(x_i; W, b) = W x_i + b$$

These scores are not probabilities. To convert them into probabilities, we apply the Softmax function:

$$p_k = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}$$

where:
- $s_k$ is the score for class $k$
- $p_k$ is the predicted probability for class $k$
Intuitively:
- The exponential turns each class score into a positive value ($e^{s_k} > 0$), so all outputs are positive.
- The denominator normalizes them into a valid probability distribution: all predicted probabilities sum to 1.
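As a concrete sketch (using NumPy; the function and variable names are my own), the softmax can be implemented with the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores (logits) into probabilities."""
    # Subtracting the max score before exponentiating does not change the
    # result (it cancels in the ratio) but avoids overflow for large scores.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)       # all positive
    return exps / np.sum(exps)   # normalize so the outputs sum to 1

probs = softmax(np.array([3.0, 1.0, 0.2]))
print(probs)         # approximately [0.836, 0.113, 0.051]
print(probs.sum())   # 1.0
```

Note that the highest score gets the highest probability, and the outputs form a valid distribution regardless of the scale of the inputs.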
If we have a good set of parameters $(W, b)$, the linear model with the softmax function predicts the probabilities of class labels given an input image. Great, we could call it a day. But realistically, the problem is how $W$ and $b$ are determined. As a starting point, we need an objective, i.e. something that tells us how good our parameters are compared to the ground truth.
To compare predicted probabilities with the ground truth, we represent the ground-truth label as a one-hot vector, effectively making the ground truth a probability distribution.
For example, if there are three classes and the correct class is class 1 (zero-indexed): $y = [0, 1, 0]$
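In code, a one-hot vector can be built from an integer label by indexing into an identity matrix (a NumPy sketch):

```python
import numpy as np

num_classes = 3
label = 1                       # correct class index (zero-indexed)
y = np.eye(num_classes)[label]  # row `label` of the identity matrix is one-hot
print(y)  # [0. 1. 0.]
```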
To measure how different the predicted and ground-truth probability distributions are, we use cross-entropy:

$$L_i = -\sum_{k=1}^{K} q_k \log p_k$$

which simplifies to

$$L_i = -\log p_{y_i}$$

because the ground-truth vector $q$ is one-hot encoded: every term vanishes except the one for the correct class $y_i$.
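A small sketch (NumPy, with made-up probabilities) confirming that the full cross-entropy sum and the simplified negative log-probability agree when the ground truth is one-hot:

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])  # predicted probabilities
q = np.array([0.0, 1.0, 0.0])  # one-hot ground truth, correct class = 1

# Full cross-entropy sum over all classes.
# (np.where skips the q_k = 0 terms, which would be 0 * log(p_k) anyway.)
full = -np.sum(np.where(q > 0, q * np.log(p), 0.0))

# Simplified form: the negative log-probability of the correct class.
simplified = -np.log(p[1])

print(full, simplified)  # both are approximately 0.357
```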
This means:
- The loss is large when the model assigns low probability to the correct class
- The loss is small when the model assigns high probability to the correct class
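Putting the pieces together, here is a sketch of the average softmax loss over a small batch (NumPy; the shapes, data, and parameter values are made up for illustration). With near-zero weights the predictions are close to uniform, so the loss should be close to $\log K$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.01   # 3 classes, 4 input features
b = np.zeros(3)
X = rng.normal(size=(5, 4))          # batch of 5 "images" as feature vectors
y = np.array([0, 2, 1, 1, 0])        # ground-truth labels

scores = X @ W.T + b                                       # shape (5, 3)
shifted = scores - scores.max(axis=1, keepdims=True)       # stability trick
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
losses = -np.log(probs[np.arange(5), y])                   # -log p_{y_i}
L = losses.mean()                                          # average over batch
print(L)  # close to log(3) ≈ 1.10, since the predictions are near uniform
```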
The loss is what gives us the signal: we now know how good or bad our parameters are. But how do we efficiently find the parameters that give the lowest loss? That process is optimization!