Cross Entropy (Loss)

Cross Entropy

https://en.wikipedia.org/wiki/Cross_entropy

In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution q, rather than the true distribution p.
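Written out for discrete distributions, this definition reads (a standard formulation; the decomposition into entropy plus KL divergence is the one used in the tutorial linked under Extension below):

```latex
% Cross-entropy of the estimated distribution q relative to the true distribution p.
% It equals the entropy of p plus the extra bits incurred by coding with q instead of p
% (the KL divergence).
H(p, q) = -\sum_{x} p(x)\,\log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)
```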

Cross-Entropy Loss Function

https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e

When working on a Machine Learning or Deep Learning problem, loss/cost functions are used to optimize the model during training. The objective is almost always to minimize the loss function: the lower the loss, the better the model. Cross-entropy loss is one of the most important cost functions. It is used to optimize classification models. Understanding cross-entropy hinges on understanding the Softmax activation function. I have put up another article below to cover this prerequisite.

Softmax Activation Function — How It Actually Works

Why Softmax (as opposed to, say, standard normalization)?

math - Why use softmax as opposed to standard normalization? - Stack Overflow

I've had this question for months. It seems like we just cleverly guessed the softmax as an output function and then interpret the input to the softmax as log-probabilities. As you said, why not simply normalize all outputs by dividing by their sum? I found the answer in the Deep Learning book by Goodfellow, Bengio and Courville (2016) in section 6.2.2.

Let's say our last hidden layer gives us z as an activation. Then the softmax is defined as

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Very Short Explanation

The exp in the softmax function roughly cancels out the log in the cross-entropy loss, causing the loss to be roughly linear in z_i. This leads to a roughly constant gradient when the model is wrong, allowing it to correct itself quickly. Thus, a wrong, saturated softmax does not cause a vanishing gradient.

Short Explanation

The most popular method to train a neural network is Maximum Likelihood Estimation. We estimate the parameters theta in a way that maximizes the likelihood of the training data (of size m). Because the likelihood of the whole training dataset is a product of the likelihoods of each sample, it is easier to maximize the log-likelihood of the dataset, and thus the sum of the log-likelihoods of the samples indexed by k:

θ_ML = argmax_θ Σ_{k=1..m} log P(y^(k) | x^(k); θ)

Now, we only focus on the softmax here with z already given, so we can replace

P(y^(k) | x^(k); θ) with softmax(z^(k))_i

with i being the correct class of the kth sample. Now, we see that when we take the logarithm of the softmax, to calculate the sample's log-likelihood, we get:

log softmax(z)_i = z_i − log Σ_j exp(z_j), which for large differences in z (which is to say, when the model can make a comfortably confident prediction) roughly approximates to

z_i − max_j(z_j)

First, we see the linear component z_i here. Secondly, we can examine the behavior of max(z) for two cases:

  1. If the model is correct, then max(z) will be z_i. Thus, the log-likelihood asymptotes to zero (i.e. a likelihood of 1) with a growing difference between z_i and the other entries in z.
  2. If the model is incorrect, then max(z) will be some other z_j > z_i. So, the addition of z_i does not fully cancel out -z_j and the log-likelihood is roughly (z_i - z_j). This clearly tells the model what to do to increase the log-likelihood: increase z_i and decrease z_j.

We see that the overall log-likelihood will be dominated by samples where the model is incorrect. Also, even if the model is really incorrect, which leads to a saturated softmax, the loss function does not saturate. It is approximately linear in z_j, meaning that we have a roughly constant gradient. This allows the model to correct itself quickly. Note that this is not the case for the Mean Squared Error, for example.
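As a quick numerical check of this non-saturating behaviour, here is a minimal NumPy sketch (not from the original answer) comparing the gradient of cross-entropy-on-softmax with that of mean squared error on the same saturated, wrong prediction. The closed form p − y used below assumes the natural logarithm; with base-2 logs the gradient only scales by 1/ln 2.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # numerically stable softmax
    return e / e.sum()

def grad_ce(z, y):
    # Gradient of cross-entropy(softmax(z), y) w.r.t. z has the closed form p - y.
    return softmax(z) - y

def grad_mse(z, y):
    # Gradient of sum((softmax(z) - y)^2) w.r.t. z, via the softmax Jacobian.
    p = softmax(z)
    jac = np.diag(p) - np.outer(p, p)   # dp_k/dz_l
    return jac.T @ (2 * (p - y))

y = np.array([1.0, 0.0, 0.0])           # correct class is index 0
z = np.array([0.0, 10.0, 0.0])          # badly wrong, saturated softmax

print(grad_ce(z, y))    # ~[-1, 1, 0]: large, useful gradient despite saturation
print(grad_mse(z, y))   # ~[0, 0, 0]: gradient has vanished
```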

Long Explanation

If the softmax still seems like an arbitrary choice to you, you can take a look at the justification for using the sigmoid in logistic regression:

Why sigmoid function instead of anything else?

The softmax is the generalization of the sigmoid for multi-class problems justified analogously.
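To make that generalization concrete, here is a small NumPy sketch (an illustration, not from the linked answer) showing that for two classes with logits [z, 0] the softmax probability of the first class equals the sigmoid of z:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# For two classes with logits [z, 0], softmax(z)_0 = 1 / (1 + exp(-z)) = sigmoid(z),
# so softmax reduces to the sigmoid in the binary case.
for z in (-3.0, 0.0, 2.5):
    print(softmax(np.array([z, 0.0]))[0], sigmoid(z))  # the two values match
```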

Consider a 4-class classification task where an image is classified as either a dog, cat, horse or cheetah.

Input image source: Photo by Victor Grabarczyk on Unsplash. Diagram by author.

In the above Figure, Softmax converts logits into probabilities. The purpose of the Cross-Entropy is to take the output probabilities (P) and measure the distance from the truth values (as shown in Figure below).

For the example above, the desired output is [1,0,0,0] for the class dog, but the model outputs [0.775, 0.116, 0.039, 0.070].

The objective is to make the model output as close as possible to the desired output (truth values). During model training, the model weights are iteratively adjusted with the aim of minimizing the cross-entropy loss. This process of adjusting the weights is what defines model training, and as the model keeps training and the loss keeps getting minimized, we say that the model is learning.
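As a minimal sketch of what this looks like in practice (the input shape and layer sizes below are illustrative assumptions, not from the article), a Keras classifier for the four classes could be compiled with cross-entropy as its training loss:

```python
import tensorflow as tf

# A tiny 4-class classifier: the final softmax layer outputs probabilities
# for dog/cat/horse/cheetah.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),                      # illustrative input size
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Training iteratively adjusts the weights to minimize the cross-entropy loss.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train_onehot, epochs=10)       # labels one-hot encoded, shape (n, 4)
```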

The concept of cross-entropy traces back to the field of Information Theory, where Claude Shannon introduced the concept of entropy in 1948. Before diving into the cross-entropy cost function, let us introduce entropy.

Entropy

Entropy of a random variable X is the level of uncertainty inherent in the variable's possible outcomes.

For a probability distribution p(x) and a random variable X, entropy is defined as follows:

H(X) = − Σ_x p(x) · log₂ p(x)

Equation 1: Definition of Entropy. Note the log is calculated to base 2.

Reason for the negative sign: log(p(x)) < 0 for all p(x) in (0,1). p(x) is a probability distribution, and therefore its values must range between 0 and 1.

==> i.e., entropy is the expectation value of −log(p(x)) with respect to the distribution p of the variable x

A plot of log(x). For x values between 0 and 1, log(x) < 0 (is negative). (Source: Author).

The greater the value of entropy, H(X), the greater the uncertainty of the probability distribution; the smaller the value, the less the uncertainty.

Example

Consider the following 3 “containers” with shapes: triangles and circles

3 containers with triangle and circle shapes. (Source: Author).

Container 1: The probability of picking a triangle is 26/30 and the probability of picking a circle is 4/30. For this reason, it is fairly certain which shape will be picked (and which will not).

Container 2: The probability of picking a triangular shape is 14/30 and 16/30 otherwise. There is an almost 50–50 chance of picking any particular shape, so there is less certainty of picking a given shape than in container 1.

Container 3: A shape picked from container 3 is highly likely to be a circle. The probability of picking a circle is 29/30 and the probability of picking a triangle is 1/30. It is highly certain that the shape picked will be a circle.

Let us calculate the entropy so that we can verify our assertions about the certainty of picking a given shape.
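A short NumPy sketch of this calculation, using Equation 1 with the probabilities listed above:

```python
import numpy as np

def entropy(probs):
    # H(X) = -sum p(x) * log2 p(x), per Equation 1 (log base 2, result in bits).
    probs = np.asarray(probs)
    return -np.sum(probs * np.log2(probs))

containers = {
    "Container 1": [26/30, 4/30],
    "Container 2": [14/30, 16/30],
    "Container 3": [29/30, 1/30],
}
for name, probs in containers.items():
    print(f"{name}: H = {entropy(probs):.3f} bits")
# Container 1: H ≈ 0.567, Container 2: H ≈ 0.997, Container 3: H ≈ 0.211
```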

As expected, the entropy for the first and third containers is smaller than for the second one. This is because the probability of picking a given shape is more certain in containers 1 and 3 than in container 2.

The Cross-Entropy Loss Function is also called logarithmic loss, log loss, or logistic loss. Each predicted class probability is compared to the actual class desired output (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value. The penalty is logarithmic in nature, yielding a large score for large differences (close to 1) and a small score for small differences (tending to 0).

Cross-entropy loss is used when adjusting model weights during training. The aim is to minimize the loss, i.e., the smaller the loss, the better the model. A perfect model has a cross-entropy loss of 0.

Cross-entropy is defined as

CE = − Σ_{i=1..C} t_i · log₂(p_i)

where t_i is the truth value for class i, p_i is the predicted (Softmax) probability for class i, and C is the number of classes.

Equation 2: Mathematical definition of Cross-Entropy. Note the log is calculated to base 2.

Binary Cross-Entropy Loss

For binary classification, we have binary cross-entropy defined as

BCE = − ( t · log₂(p) + (1 − t) · log₂(1 − p) )

where t ∈ {0, 1} is the true label and p is the predicted probability of the positive class.

Equation 3: Mathematical definition of Binary Cross-Entropy.

Binary cross-entropy is often calculated as the average cross-entropy across all data examples:

BCE = − (1/N) Σ_{i=1..N} [ t_i · log₂(p_i) + (1 − t_i) · log₂(1 − p_i) ]

Equation 4
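A minimal NumPy sketch of Equation 4; the labels and predicted probabilities below are made up purely to exercise the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Average of -[t*log2(p) + (1-t)*log2(1-p)] over all examples (Equation 4).
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    per_example = -(y_true * np.log2(y_pred) + (1 - y_true) * np.log2(1 - y_pred))
    return per_example.mean()

# Hypothetical labels and predicted probabilities for four examples.
print(binary_cross_entropy([1, 0, 1, 1], [0.9, 0.2, 0.7, 0.6]))  # ≈ 0.43 bits
```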

Example

Consider the classification problem with the following Softmax probabilities (S) and the labels (T). The objective is to calculate the cross-entropy loss given this information.

Logits(S) and one-hot encoded truth label(T) with Categorical Cross-Entropy loss function used to measure the ‘distance’ between the predicted probabilities and the truth labels. (Source: Author)

The categorical cross-entropy is computed as follows
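The computation itself is not reproduced here, so below is a short NumPy sketch, assuming the same softmax probabilities and one-hot label as in the dog/cat/horse/cheetah example above:

```python
import numpy as np

# Softmax probabilities (S) and one-hot truth label (T), assumed to match the example above.
S = np.array([0.775, 0.116, 0.039, 0.070])
T = np.array([1, 0, 0, 0])

# Categorical cross-entropy with log base 2, per Equation 2.
loss = -np.sum(T * np.log2(S))
print(loss)  # ≈ 0.3677, the value referenced below
```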

Softmax is a continuously differentiable function. This makes it possible to calculate the derivative of the loss function with respect to every weight in the neural network. This property allows the model to adjust the weights accordingly to minimize the loss function (bringing the model output close to the true values).

Assume that after some iterations of model training, the model outputs the following vector of logits:

The new loss of 0.095 is less than the previous loss of 0.3677, implying that the model is learning. The process of optimization (adjusting the weights so that the output is close to the true values) continues until training is over.

Keras provides three cross-entropy loss functions: binary, categorical, and sparse categorical cross-entropy.

Categorical Cross-Entropy and Sparse Categorical Cross-Entropy

Both categorical cross-entropy and sparse categorical cross-entropy have the same loss function, as defined in Equation 2. The only difference between the two is in how the truth labels are defined.

  • Categorical cross-entropy is used when the true labels are one-hot encoded; for example, we have the following true values for a 3-class classification problem: [1,0,0], [0,1,0] and [0,0,1].
  • In sparse categorical cross-entropy, truth labels are integer encoded, for example, [1], [2] and [3] for a 3-class problem (note that Keras expects 0-based class indices, i.e. 0, 1 and 2); see the sketch below.
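A minimal sketch of the difference in Keras, reusing the predicted probabilities from the example above (note that Keras computes these losses with the natural logarithm, so the value differs from the base-2 numbers used earlier):

```python
import tensorflow as tf

y_pred = [[0.775, 0.116, 0.039, 0.070]]   # predicted probabilities from the example above

# Categorical cross-entropy expects one-hot labels...
cce = tf.keras.losses.CategoricalCrossentropy()
print(cce([[1., 0., 0., 0.]], y_pred).numpy())   # ≈ 0.255 (natural log of 0.775, negated)

# ...while sparse categorical cross-entropy expects integer class indices (0-based).
scce = tf.keras.losses.SparseCategoricalCrossentropy()
print(scce([0], y_pred).numpy())                 # same loss value, different label format
```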

I hope this article helped you understand cross-entropy loss function more clearly.

Thanks for reading :-)

Extension

a more detailed tutorial

A Gentle Introduction to Cross-Entropy for Machine Learning

including

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. What Is Cross-Entropy?
  2. Cross-Entropy Versus KL Divergence
  3. How to Calculate Cross-Entropy
    1. Two Discrete Probability Distributions
    2. Calculate Cross-Entropy Between Distributions
    3. Calculate Cross-Entropy Between a Distribution and Itself
    4. Calculate Cross-Entropy Using KL Divergence
  4. Cross-Entropy as a Loss Function
    1. Calculate Entropy for Class Labels
    2. Calculate Cross-Entropy Between Class Labels and Probabilities
    3. Calculate Cross-Entropy Using Keras
    4. Intuition for Cross-Entropy on Predicted Probabilities
  5. Cross-Entropy Versus Log Loss
    1. Log Loss is the Negative Log Likelihood
    2. Log Loss and Cross Entropy Calculate the Same Thing