Loss Functions in Deep Learning — Everything You Need to Know
What are Loss Functions?
A loss function is a way of evaluating how well an algorithm models the dataset. Loss functions are mathematical functions that measure the difference between the predicted output and the true output of a deep learning model. They play a crucial role in the training process, as they help evaluate the model's performance and determine how well it is fitting the training data. The loss function's output is used as the feedback signal for adjusting the model's weights via backpropagation, in order to minimize the loss and improve the model's accuracy.

Why do we require Loss Functions?
We can’t improve what we can’t measure.
As stated above, the loss function’s output is fed back through the network, and the weights and biases are updated accordingly (this update is computed using gradient descent). Loss functions are therefore crucial for the optimization process, as they provide a metric that the optimizer can minimize in order to improve the model’s accuracy. The goal of training a deep learning model is to find the set of weights that minimizes the loss, so that the model’s predictions are as close as possible to the true outputs. By choosing an appropriate loss function, we can ensure that the model learns the right representation of the data and makes accurate predictions.
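As a minimal sketch of this feedback loop (plain NumPy, with illustrative toy data and variable names), here is how the loss and its gradient drive the weight updates of a one-feature linear model trained with gradient descent:

```python
import numpy as np

# Toy data following y = 2x
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, b = 0.0, 0.0   # parameters to be learned
lr = 0.01         # learning rate

for step in range(200):
    y_pred = w * X + b                   # forward pass
    loss = np.mean((y_pred - y) ** 2)    # MSE loss: the feedback signal
    # Gradients of the loss with respect to w and b
    dw = np.mean(2 * (y_pred - y) * X)
    db = np.mean(2 * (y_pred - y))
    # Gradient descent update: move the parameters to reduce the loss
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2), round(loss, 4))  # w approaches 2, loss approaches 0
```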

Types of Loss Functions
There are several types of loss functions used in deep learning, and the choice of loss function depends on the specific problem being solved. Here are some common loss functions and the types of problems they are used for:
Regression -
- Mean Squared Error (MSE): This is a commonly used loss function for regression problems, where the goal is to predict continuous values. MSE measures the average squared difference between the predicted and actual values.
- Mean Absolute Error (MAE): This is another commonly used loss function for regression problems, where the goal is to predict continuous values. MAE measures the average absolute difference between the predicted and actual values. It is less sensitive to outliers compared to Mean Squared Error (MSE), which is why it is often used when the data contains outliers.
- Huber Loss: This is a hybrid loss function that combines the characteristics of both mean squared error (MSE) and mean absolute error (MAE). It is less sensitive to outliers than MSE and, unlike MAE, is differentiable everywhere. Huber loss is often used in problems where there are outliers in the data, and it provides a balance between the robustness to outliers offered by MAE and the sensitivity to large errors offered by MSE.
Classification -
- Binary Cross-Entropy: This is used for binary classification problems, where the goal is to predict one of two classes (e.g., “positive” or “negative”). Binary cross-entropy measures the dissimilarity between the predicted probability distribution and the true distribution for the two classes.
- Categorical Cross-Entropy: This is used for multiclass classification problems, where the goal is to predict one of multiple classes (e.g., “dog”, “cat”, “bird”). Categorical cross-entropy measures the dissimilarity between the predicted probability distribution and the true distribution over multiple classes.
- Hinge Loss: This is used for support vector machine (SVM) models and is a commonly used loss function for training models for binary classification problems.
Auto Encoders -
- Kullback-Leibler Divergence (KL Divergence): This is a measure of the difference between two probability distributions and is often used as a loss function in variational autoencoders, which are a type of generative model.
Object Detection & Embedding -
- Focal Loss: Focal Loss is a loss function used in object detection and image segmentation problems, where the goal is to predict the presence of objects or segments within an image. Focal Loss addresses the problem of imbalanced class distributions by down-weighting well-classified examples and up-weighting misclassified examples. This helps the model to focus on the most challenging examples and improve its overall performance.
- Triplet Loss: Triplet Loss is a loss function used in metric learning problems, where the goal is to learn a metric space in which similar examples are close to each other and dissimilar examples are far apart. Triplet Loss enforces this constraint by minimizing the distance between an anchor and a similar (positive) example while maximizing the distance between the anchor and a dissimilar (negative) example. This helps the model learn a useful representation of the data that can be used for tasks such as classification and retrieval.
There are other loss functions as well, but these are some of the most commonly used ones. The choice of loss function depends on the problem being solved and the nature of the output. The loss function should align with the problem’s objectives and the nature of the output to ensure that the model is being trained to optimize the appropriate metric.
Loss Functions VS Cost Function
The terms “loss function” and “cost function” are often used interchangeably in the context of machine learning and deep learning.
A loss function measures the difference between the predicted values and the true values for a single example in the training data. The loss function provides a measure of how well the model is performing for each example in the training data, and its goal is to be minimized in order to achieve the best possible predictions.
A cost function, on the other hand, is the average of the loss function over the entire training dataset. The cost function provides a measure of how well the model is performing over the entire training dataset and its goal is to be minimized in order to achieve the best overall performance of the model.

So, in essence, the cost function is an aggregate measure of the performance of the model, while the loss function provides a measure of the performance of the model for each individual example. The cost function is calculated by summing the losses over all examples in the training data and dividing by the number of examples.
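A small NumPy sketch of this distinction (the values are illustrative): each example contributes its own loss, and the cost is their average:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

per_example_loss = (y_pred - y_true) ** 2   # loss: one value per example
cost = per_example_loss.mean()              # cost: average over the dataset

print(per_example_loss)   # [0.25 0.25 0.   1.  ]
print(cost)               # 0.375
```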
Now let us look at some of the most commonly used loss functions in more detail.
Mean Squared Error
Mean Squared Error (MSE) is a widely used loss function in regression problems, where the goal is to predict continuous values. It measures the average squared difference between the predicted values and the true values for the entire dataset.
The formula for MSE is given by:
MSE = (1/N) * Σ(y_pred - y_true)²
where N is the number of samples in the dataset, y_pred is the predicted value for a sample, and y_true is the true value for the same sample.
We take the square because squaring the errors ensures that the MSE loss is always non-negative, which helps the optimization algorithm converge to a minimum rather than a maximum. Squaring also makes the loss function differentiable, which is important for optimization algorithms like gradient descent that rely on computing the gradient of the loss function with respect to the model parameters.
Here’s an example: let’s say you are building a model to predict the price of a house based on its size, and you have a dataset with three examples (size and price in square feet and dollars, respectively):
Example 1: Size = 1,000 sq. ft., Price = 100,000 dollars
Example 2: Size = 2,000 sq. ft., Price = 200,000 dollars
Example 3: Size = 3,000 sq. ft., Price = 300,000 dollars
If your model predicts the following values:
Example 1: Price = 90,000 dollars
Example 2: Price = 200,000 dollars
Example 3: Price = 400,000 dollars
The MSE would be calculated as:
MSE = (1/3) * [(90,000 - 100,000)² + (200,000 - 200,000)² + (400,000 - 300,000)²] = (1/3) * [100,000,000 + 0 + 10,000,000,000] ≈ 3,366,666,667
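Here is a quick NumPy check of this calculation (a sketch, not tied to any particular framework):

```python
import numpy as np

y_true = np.array([100_000, 200_000, 300_000], dtype=float)
y_pred = np.array([ 90_000, 200_000, 400_000], dtype=float)

# Average of the squared errors
mse = np.mean((y_pred - y_true) ** 2)
print(f"{mse:,.0f}")   # 3,366,666,667 (in dollars squared)
```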
Regarding outliers, MSE squares the difference between the actual and predicted value (if the difference is 2 units, the squared error is 4, and so on), so errors are not penalized equally: large errors, or outliers, in the dataset have a disproportionate impact on the final MSE value, which means the fitted line can be pulled drastically towards the outliers. In some cases, this can lead to overfitting to the outliers, resulting in a model that does not generalize well to new data.
Graph of Weights VS Loss Function

It will always have only one minimum (the global minimum), and it is differentiable everywhere.
Advantages of MSE:
- Easy to understand and calculate
- Provides a clear and interpretable measure of model performance
- Differentiable
Disadvantages of MSE:
- Can be sensitive to outliers in the dataset, leading to overfitting
- The squared error can cause the optimizer to be slower in reaching convergence
In conclusion, MSE is a simple and widely used loss function, but it can be sensitive to outliers in the dataset and may result in overfitting.
Mean Absolute Error
Mean Absolute Error (MAE) is a widely used loss function in regression problems, where the goal is to predict continuous values. It measures the average absolute difference between the predicted values and the true values for the entire dataset.
The formula for MAE is given by:
MAE = (1/N) * Σ|y_pred - y_true|
where N is the number of samples in the dataset, y_pred is the predicted value for a sample, and y_true is the true value for the same sample.
Here’s an example: let’s say you are building a model to predict the price of a house based on its size, and you have a dataset with three examples (size and price in square feet and dollars, respectively):
Example 1: Size = 1,000 sq. ft., Price = 100,000 dollars
Example 2: Size = 2,000 sq. ft., Price = 200,000 dollars
Example 3: Size = 3,000 sq. ft., Price = 300,000 dollars
If your model predicts the following values:
Example 1: Price = 90,000 dollars
Example 2: Price = 200,000 dollars
Example 3: Price = 400,000 dollars
The MAE would be calculated as:
MAE = (1/3) * [|90,000 - 100,000| + |200,000 - 200,000| + |400,000 - 300,000|] = (1/3) * [10,000 + 0 + 100,000] ≈ 36,667
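Again, a quick NumPy check of the calculation (illustrative only):

```python
import numpy as np

y_true = np.array([100_000, 200_000, 300_000], dtype=float)
y_pred = np.array([ 90_000, 200_000, 400_000], dtype=float)

# Average of the absolute errors
mae = np.mean(np.abs(y_pred - y_true))
print(f"{mae:,.0f}")   # 36,667 (in dollars)
```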
Graph of Weights VS Loss Function

It will always have only one minimum, but it is not differentiable at zero error.
Regarding outliers, MAE penalizes errors in proportion to their magnitude, without squaring them. This means that large errors, or outliers, in the dataset have less of an impact on the final MAE value compared to the MSE loss function. This property makes MAE a more robust loss function, particularly in the presence of outliers in the dataset.
Advantages of MAE:
- Penalizes errors in proportion to their magnitude, without exaggerating large ones
- More robust to outliers in the dataset compared to MSE
Disadvantages of MAE:
- Not differentiable at zero error, which can complicate gradient-based optimization
- The gradient has constant magnitude, so convergence near the minimum can be slower than with MSE
Now consider a scenario in which 25% of our dataset consists of outliers. We know that MSE will give disproportionate importance to these outliers, while MAE will penalize all errors proportionally, so we want a fitted line that stays closer to the remaining 75% of the values while still penalizing large errors sensibly. This is where Huber loss comes in.
Huber Loss
The Huber loss function is a combination of the Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions. It provides a compromise between these two loss functions by being less sensitive to outliers in the dataset compared to the MSE loss function, while still penalizing large errors more heavily than the MAE loss function.
The Huber loss function is defined as follows:
L(y_pred, y_true) =
- (1/2) * (y_pred - y_true)², if |y_pred - y_true| <= delta
- delta * |y_pred - y_true| - delta²/2, if |y_pred - y_true| > delta
where y_pred is the predicted value, y_true is the true value, and delta is a user-set hyperparameter that controls the transition between the two regimes.
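Below is a minimal NumPy sketch of the Huber loss as defined above (the function name and example values are illustrative):

```python
import numpy as np

def huber_loss(y_pred, y_true, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    error = np.abs(y_pred - y_true)
    quadratic = 0.5 * error ** 2               # used when |error| <= delta
    linear = delta * error - 0.5 * delta ** 2  # used when |error| >  delta
    return np.mean(np.where(error <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.2, 2.0, 8.0])   # the last prediction is an outlier
print(huber_loss(y_pred, y_true, delta=1.0))   # the outlier is penalized only linearly
```

Note how the outlier contributes a linear penalty (4.5) rather than a squared one (12.5), which is exactly the compromise described above.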
NOTE — TO USE THE ABOVE-MENTIONED REGRESSION LOSS FUNCTIONS, THE OUTPUT ACTIVATION FUNCTION SHOULD BE LINEAR.
Binary Cross Entropy
The Binary Cross Entropy (BCE) loss function is used in binary classification problems, where the goal is to predict one of two possible outcomes (e.g. 0 or 1). The BCE loss function measures the dissimilarity between the predicted probabilities and the true labels.

The BCE loss function is defined as follows:
L(y_pred, y_true) = -y_true * log(y_pred) - (1 - y_true) * log(1 - y_pred)
where y_pred is the predicted probability of the positive class, and y_true is the true label (0 or 1).
Here’s an example: let’s say you are building a model to predict whether a customer will buy a product based on their demographic information, and you have a dataset with two examples (age, income, and purchase):
Example 1: Age = 35, Income = 50,000 dollars, Purchase = Yes
Example 2: Age = 25, Income = 30,000 dollars, Purchase = No
If your model predicts the following probabilities of a purchase:
Example 1: Purchase = 0.8
Example 2: Purchase = 0.3
The BCE loss would be calculated as:
L = [-1 * log(0.8) - 0 * log(1 - 0.8)] + [-0 * log(0.3) - 1 * log(1 - 0.3)] = 0.22 + 0.36 = 0.58, or about 0.29 averaged over the two examples
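A small NumPy sketch that reproduces these numbers (variable names are illustrative):

```python
import numpy as np

y_true = np.array([1.0, 0.0])   # Purchase = Yes, Purchase = No
y_pred = np.array([0.8, 0.3])   # predicted probability of a purchase

# Per-example binary cross entropy
bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(np.round(bce, 2))      # [0.22 0.36]
print(round(bce.mean(), 2))  # 0.29 averaged over the two examples
```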
Regarding outliers, the BCE loss function is not particularly robust to outliers, as large errors in the predicted probabilities will have a large impact on the BCE loss. This can make the BCE loss function more sensitive to outliers compared to other loss functions like the Huber loss.
Advantages of BCE:
- Easy to interpret, as the loss is a measure of the dissimilarity between the predicted probabilities and the true labels
- Well-suited for binary classification problems and also Differentiable
Disadvantages of BCE:
- Sensitive to outliers, as large errors in the predicted probabilities will have a large impact on the BCE loss
- Not well-suited for multi-class classification problems
In conclusion, the Binary Cross Entropy loss function is a good choice for binary classification problems, as it is easy to interpret and well-suited to this type of problem. However, it may be more sensitive to outliers compared to other loss functions, and it is not well-suited for multi-class classification problems. For multi-class classification problems, we use Categorical Cross Entropy.
Categorical Cross Entropy
Categorical Cross Entropy (CCE) loss is a loss function used in multi-class classification problems, where the goal is to predict one of several possible outcomes (e.g. classify an image into one of several different classes). CCE loss measures the dissimilarity between the predicted class probabilities and the true labels.
The CCE loss function is defined as follows:
L(y_pred, y_true) = -Σ over all classes j (y_true_j * log(y_pred_j))
where y_pred_j is the predicted probability of the j-th class and y_true_j is the true label (0 or 1) for the j-th class. The sum is taken over all classes.
Here’s an example: let’s say you are building a model to predict the type of animal in an image, and you have a dataset with three examples:
Example 1: Image of a cat, Label = Cat
Example 2: Image of a dog, Label = Dog
Example 3: Image of a bird, Label = Bird
If your model predicts the following class probabilities:
Example 1: Cat = 0.9, Dog = 0.05, Bird = 0.05
Example 2: Cat = 0.05, Dog = 0.9, Bird = 0.05
Example 3: Cat = 0.05, Dog = 0.05, Bird = 0.9
The CCE loss would be calculated as:
L = -(1 * log(0.9) + 0 * log(0.05) + 0 * log(0.05)) - (0 * log(0.05) + 1 * log(0.9) + 0 * log(0.05)) - (0 * log(0.05) + 0 * log(0.05) + 1 * log(0.9)) = 0.105 + 0.105 + 0.105 ≈ 0.32 summed over the three examples
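A NumPy check of this calculation, using the one-hot labels and probabilities from the example above:

```python
import numpy as np

# One-hot true labels for Cat, Dog, Bird
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]], dtype=float)
# Predicted class probabilities
y_pred = np.array([[0.90, 0.05, 0.05],
                   [0.05, 0.90, 0.05],
                   [0.05, 0.05, 0.90]])

per_example = -np.sum(y_true * np.log(y_pred), axis=1)
print(np.round(per_example, 3))     # [0.105 0.105 0.105]
print(round(per_example.sum(), 2))  # 0.32 summed over the three examples
```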
Note that it is important to one-hot encode the multi-class output, that is, to make a separate column for each output class (we will talk about this while learning about sparse categorical cross entropy), before using categorical cross entropy.
NOTE — TO USE THE BINARY CROSS ENTROPY LOSS FUNCTION, THE OUTPUT ACTIVATION FUNCTION SHOULD BE SIGMOID.
Sparse Categorical Cross Entropy
Sparse Categorical Cross Entropy (SCCE) is a variant of the Categorical Cross Entropy (CCE) loss that is commonly used in multi-class classification problems. The main difference between SCCE and CCE is that in SCCE, the target labels are represented as integer values instead of one-hot encoded vectors.
In CCE, the target labels are represented as one-hot encoded vectors, where each class is represented by a binary vector with a 1 in the position corresponding to the correct class and 0s elsewhere. For example, if we have three classes (A, B, and C), a target label of class B would be represented as [0, 1, 0].
In contrast, in SCCE, the target labels are represented as integer values, where each class is assigned a unique integer. For example, if we have three classes (A, B, and C), a target label of class A would typically be represented as 0, B as 1, and C as 2.
The formula for SCCE is the same as that of CCE. The main difference is that in SCCE the integer target label is used directly to select the predicted probability of the correct class, rather than being one-hot encoded first. As with CCE, the logits are first passed through a softmax activation to obtain the predicted class probabilities.
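A minimal NumPy sketch of this equivalence, reusing the animal example with hypothetical classes A, B, and C: the one-hot formulation (CCE) and the integer-label formulation (SCCE) yield the same per-example losses:

```python
import numpy as np

y_pred = np.array([[0.90, 0.05, 0.05],
                   [0.05, 0.90, 0.05],
                   [0.05, 0.05, 0.90]])

# Categorical cross entropy: one-hot targets
y_onehot = np.eye(3)                         # classes A, B, C as one-hot rows
cce = -np.sum(y_onehot * np.log(y_pred), axis=1)

# Sparse categorical cross entropy: integer targets (0 = A, 1 = B, 2 = C)
y_int = np.array([0, 1, 2])
scce = -np.log(y_pred[np.arange(len(y_int)), y_int])

print(np.allclose(cce, scce))   # True: both formulations give the same loss
```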
NOTE — TO USE THE ABOVE-MENTIONED LOSS FUNCTIONS, THE OUTPUT ACTIVATION FUNCTION SHOULD BE SOFTMAX.
In addition to these commonly used loss functions, it is also possible to create custom loss functions tailored to specific problems. Custom loss functions can be created by combining existing loss functions or defining new ones from scratch. The key is to ensure that the custom loss function is differentiable so that the model can be trained using gradient descent.
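As a toy illustration of this idea (the function name, the blending scheme, and the alpha hyperparameter are assumptions for illustration, not a standard loss), here is a custom loss in plain NumPy that combines MSE and MAE:

```python
import numpy as np

def combined_loss(y_pred, y_true, alpha=0.5):
    """Illustrative custom loss: a weighted blend of MSE and MAE.

    Both components are differentiable almost everywhere, so the blend
    can still be minimized with gradient descent.
    """
    mse = np.mean((y_pred - y_true) ** 2)
    mae = np.mean(np.abs(y_pred - y_true))
    return alpha * mse + (1 - alpha) * mae

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.5])
print(round(combined_loss(y_pred, y_true), 4))   # 0.1617
```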