On loss functions
Softmax & Cross-Entropy Loss
Given a dataset of examples $\{(x_i, y_i)\}_{i=1}^{N}$,
where $x_i$ is the $i$-th image and $y_i$ is its integer class label. The goal of a loss function is to measure how well the predicted scores match the ground-truth labels.
The loss over the dataset is defined as the average loss across all training examples:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i\big(f(x_i, W),\, y_i\big)$$

where:
- $s = f(x_i, W)$ are the predicted class scores (also called logits)
- $L_i$ is the loss for the $i$-th training example
There are many valid choices for the loss function. A very common and important one is the Softmax loss, also known as cross-entropy loss.
Softmax function
The score function produces raw class scores:

$$s = f(x_i; W, b) = W x_i + b$$

These scores are not probabilities. To convert them into probabilities, we apply the Softmax function:

$$p_k = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}$$

where:
- $s_k$ is the score for class $k$
- $p_k$ is the predicted probability for class $k$
Intuitively:
- The exponential turns each class score into a positive value ($e^{s_k} > 0$), so all outputs are positive.
- The denominator normalizes them into a valid probability distribution: all predicted probabilities sum to 1.
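As a concrete sketch (using NumPy; the function and variable names are my own), the softmax can be implemented with the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores (logits) into probabilities."""
    # Subtracting the max score before exponentiating does not change the
    # result (it cancels in the ratio) but avoids overflow for large scores.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)       # all positive
    return exps / np.sum(exps)   # normalize so the outputs sum to 1

probs = softmax(np.array([3.0, 1.0, 0.2]))
print(probs)         # approximately [0.836, 0.113, 0.051]
print(probs.sum())   # 1.0
```

Note that the highest score gets the highest probability, and the outputs form a valid distribution regardless of the scale of the inputs.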
If we have a good set of parameters $(W, b)$, the linear model with the softmax function predicts the probabilities of class labels given an input image. Great, we could call it a day. But realistically, the problem is how $W$ and $b$ are determined. As a starting point, we need an objective, i.e. something that tells us how good our parameters are compared to the ground truth.
To compare predicted probabilities with the ground truth, we represent the ground-truth label as a one-hot vector, effectively making the ground truth a probability distribution.
For example, if there are three classes and the correct class is class 1 (zero-indexed): $y = [0, 1, 0]$
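In code, a one-hot vector can be built from an integer label by indexing into an identity matrix (a NumPy sketch):

```python
import numpy as np

num_classes = 3
label = 1                       # correct class index (zero-indexed)
y = np.eye(num_classes)[label]  # row `label` of the identity matrix is one-hot
print(y)  # [0. 1. 0.]
```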
To measure how different the predicted and ground-truth probability distributions are, we use cross-entropy:

$$L_i = -\sum_{k=1}^{K} q_k \log p_k$$

which simplifies to

$$L_i = -\log p_{y_i}$$

because the ground-truth vector $q$ is one-hot encoded: every term vanishes except the one for the correct class $y_i$.
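A small sketch (NumPy, with made-up probabilities) confirming that the full cross-entropy sum and the simplified negative log-probability agree when the ground truth is one-hot:

```python
import numpy as np

p = np.array([0.1, 0.7, 0.2])  # predicted probabilities
q = np.array([0.0, 1.0, 0.0])  # one-hot ground truth, correct class = 1

# Full cross-entropy sum over all classes.
# (np.where skips the q_k = 0 terms, which would be 0 * log(p_k) anyway.)
full = -np.sum(np.where(q > 0, q * np.log(p), 0.0))

# Simplified form: the negative log-probability of the correct class.
simplified = -np.log(p[1])

print(full, simplified)  # both are approximately 0.357
```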
This means:
- The loss is large when the model assigns low probability to the correct class
- The loss is small when the model assigns high probability to the correct class
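Putting the pieces together, here is a sketch of the average softmax loss over a small batch (NumPy; the shapes, data, and parameter values are made up for illustration). With near-zero weights the predictions are close to uniform, so the loss should be close to $\log K$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)) * 0.01   # 3 classes, 4 input features
b = np.zeros(3)
X = rng.normal(size=(5, 4))          # batch of 5 "images" as feature vectors
y = np.array([0, 2, 1, 1, 0])        # ground-truth labels

scores = X @ W.T + b                                       # shape (5, 3)
shifted = scores - scores.max(axis=1, keepdims=True)       # stability trick
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
losses = -np.log(probs[np.arange(5), y])                   # -log p_{y_i}
L = losses.mean()                                          # average over batch
print(L)  # close to log(3) ≈ 1.10, since the predictions are near uniform
```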
The loss is what gives us the signal: we now know how good or bad our parameters are. But how do we efficiently find the parameters that give the lowest loss? That process is optimization!