[PyTorch] Cross-Entropy Loss Study Notes

凯子要面包

Last modified 2023-11-24 16:25:06

First published 2022-04-11 10:41:18

Article link: https://blog.csdn.net/weixin_44815943/article/details/124045196

Key takeaways

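When training a text classification model, first identify the type of task. For a multi-class, single-label task, use CrossEntropyLoss; for a multi-class, multi-label task, use BCEWithLogitsLoss; for a binary task, BCEWithLogitsLoss is the recommended first choice.
The elements of the input $logits$ mean different things in CrossEntropyLoss and BCEWithLogitsLoss. In CrossEntropyLoss, each element is the model's raw score for a sample on one particular class, that is, the predicted likelihood before the softmax activation and the log function have been applied. In BCEWithLogitsLoss, each element is the model's raw score for the sample belonging to that class, that is, the predicted likelihood before the sigmoid activation and the log function have been applied.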

Cross-entropy loss function

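Cross-entropy loss (Cross Entropy Loss Function) is one of the most important loss measures in machine learning. In essence it measures how far the probability distribution predicted by the model is from the true distribution: the smaller the value, the closer the predictions are to the ground truth and the better the model. Cross-entropy loss is also called Log Loss, because its computation uses the $log_e (x)$ function.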

Binary classification ( $C = 2$ )

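In a binary (0-1) classification task, the model's predicted probabilities for the positive and negative classes can be expressed with a single variable, as $p$ and $(1 - p)$ . The cross-entropy loss is then defined as follows, where $N$ is the number of samples in the batch, $i$ indexes the $i$-th sample in the batch, $y_i$ is the true label, and $p_i$ is the model's predicted probability of the positive class. The batch loss is averaged by default, which is not repeated in the rest of these notes: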
$\frac{1}{N}\sum_iL_i=\frac{1}{N}\sum_i-[y_i*log(p_i) + (1-y_i)*log(1-p_i)]$

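For the graph of $-log(x)$ on the interval $x\in(0,1]$ : as $x$ approaches 1, $-log(x)$ approaches 0; as $x$ approaches 0, $-log(x)$ approaches positive infinity.
When $y_i=1$ , the bracketed part reduces to $log(p_i)$ . So if the true label is 1, the loss approaches 0 as the predicted positive probability approaches 1, and grows without bound as it approaches 0.
When $y_i=0$ , the bracketed part reduces to $log(1-p_i)$ . So if the true label is 0, the loss grows without bound as the predicted positive probability approaches 1, and approaches 0 as it approaches 0.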
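
As a quick check of the binary formula, the following sketch evaluates the loss by hand for a small batch and compares it with `torch.nn.functional.binary_cross_entropy`. The values in `p` and `y` are made up for illustration and do not come from the original notes.

```python
import torch
import torch.nn.functional as F

# Hypothetical predicted positive-class probabilities and true labels
p = torch.tensor([0.9, 0.2, 0.6])
y = torch.tensor([1.0, 0.0, 1.0])

# Per-sample loss: -[y * log(p) + (1 - y) * log(1 - p)]
per_sample = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
manual_mean = per_sample.mean()

# Built-in implementation with the default mean reduction
builtin_mean = F.binary_cross_entropy(p, y)

print(per_sample)
print(manual_mean, builtin_mean)
```

The confident correct prediction (0.9 for a positive sample) contributes a much smaller loss than the hesitant one (0.6 for a positive sample), which is the behaviour of $-log(x)$ described above.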

Multi-class ( $C > 2$ )

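The multi-class case can be seen as an extension of the binary case. In the multi-class setting the cross-entropy loss is defined as follows, where $y_{i,c}$ is the true value of the $i$-th sample for the $c$-th class (1 means the sample belongs to class $c$, 0 means it does not) and $p_{i, c}$ is the model's predicted likelihood that the $i$-th sample belongs to class $c$:
$\frac{1}{N}\sum_iL_i=-\frac{1}{N}\sum_i\displaystyle\sum_{c=1}^C[y_{i,c}*log(p_{i,c}) + (1-y_{i,c}) *log(1-p_{i,c})]$
PyTorch's CrossEntropyLoss uses another form of the cross-entropy definition:
$\frac{1}{N}\sum_iL_i=-\frac{1}{N}\sum_i\displaystyle\sum_{c=1}^Cy_{i,c}log(p_{i,c})$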

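Classification tasks come with an easily confused pair of concepts: multi-class and multi-label. Multi-class refers to how many classes the task has; multi-label (Multi-Label Classification) means that a single sample may belong to several classes at once. In short, multi-class describes the label set of the task, while multi-label describes an individual sample.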

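Binary, single-label: there are only two classes, and each sample belongs to exactly one of them, e.g. positive or negative.
Multi-class, single-label: there are several classes ( $C > 2$ ), and each sample belongs to exactly one of them. That is, the vector formed by $y_{i,c}$ over $c$ is one-hot, with length $C$ .
Multi-class, multi-label: there are several classes ( $C > 2$ ), and each sample may belong to several of them. That is, the vector formed by $y_{i,c}$ over $c$ may not be a standard one-hot vector and may take the value 1 in several positions.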
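
The difference between the single-label and multi-label settings shows up most clearly in the target tensor. A minimal sketch, with made-up labels and `num_classes = 4` chosen only for illustration:

```python
import torch
import torch.nn.functional as F

num_classes = 4  # illustrative value, not from the original notes

# Multi-class, single-label: one class index per sample
single_label = torch.tensor([2, 0, 3])
one_hot = F.one_hot(single_label, num_classes=num_classes).float()

# Multi-class, multi-label: one multi-hot row per sample
multi_hot = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0, 1.0],
                          [1.0, 1.0, 1.0, 0.0]])

print(one_hot.sum(dim=1))    # every row sums to exactly 1
print(multi_hot.sum(dim=1))  # rows may sum to more than 1
```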

PyTorch API

torch.nn.Softmax(dim=None)

softmax is a non-linear activation function, with the mathematical expression:
$Softmax(x_i) = \frac{exp(x_i)}{\sum_jexp(x_j)} \newline j\in[0, x.size()[dim] - 1]$

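In PyTorch, softmax accepts an input tensor with any number of dimensions. The output tensor has the same size as the input tensor.
softmax applies the expression above along the specified dim. After the transformation every element lies in the interval (0, 1), and the elements along that dimension sum to 1.
In this context the softmax activation serves the cross-entropy computation: it maps the logits output by the model into (0, 1), and its output can be viewed as the likelihood before the log function is applied (called a probability when the model parameters are fixed, and a likelihood when they are not).
From the definition of cross-entropy loss, the computation does not use the likelihood directly but its logarithm. The softmax output therefore still has to go through torch.log, or torch.nn.LogSoftmax can be used directly.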
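
The choice of `dim` decides which elements are normalised together. The sketch below, using an illustrative 2 x 3 tensor, shows that the output keeps the input shape and sums to 1 only along the chosen dimension:

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0],
                  [1.0, 1.0, 1.0]])  # illustrative logits, shape (2, 3)

probs_rows = torch.softmax(x, dim=1)  # normalise across classes: each row sums to 1
probs_cols = torch.softmax(x, dim=0)  # normalise across the batch: each column sums to 1

print(probs_rows.sum(dim=1))
print(probs_cols.sum(dim=0))
```

For classification logits of shape (N, C), `dim=1` is the usual choice, since the classes of one sample should share the probability mass.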

torch.nn.LogSoftmax

It is essentially equal to torch.log(torch.nn.Softmax(dim=dim)(x)), with the mathematical expression:
$LogSoftmax(x_i) = log(\frac{exp(x_i)}{\displaystyle\sum_j exp(x_j)})$

LogSoftmax works along the specified dim of the input tensor and maps the element values into the interval (-inf, 0).

import torch

# batch_size=2, num_labels=3
logits = torch.tensor([[0.3, 0.4, 0.3], [0.7, 0.1, 0.2]], dtype=torch.float, requires_grad=True)

print(torch.log(torch.nn.Softmax(dim=1)(logits)))
# tensor([[-1.1331, -1.0331, -1.1331], [-0.7679, -1.3679, -1.2679]], grad_fn=<LogBackward0>)

print(torch.nn.LogSoftmax(dim=1)(logits))
# tensor([[-1.1331, -1.0331, -1.1331], [-0.7679, -1.3679, -1.2679]], grad_fn=<LogSoftmaxBackward0>)
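
The two routes agree on moderate logits, but they can differ on extreme ones. The following sketch uses a deliberately exaggerated pair of logits (an assumption for illustration) to show why the fused `log_softmax` is usually preferred: the two-step version underflows to a probability of 0 and then takes its log.

```python
import torch

big = torch.tensor([[1000.0, 0.0]])  # exaggerated logits, for illustration only

two_step = torch.log(torch.softmax(big, dim=1))  # softmax underflows to 0, log gives -inf
fused = torch.log_softmax(big, dim=1)            # computed in log space, stays finite

print(two_step)
print(fused)  # tensor([[0., -1000.]])
```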

torch.nn.NLLLoss

The negative log-likelihood loss (Negative Log Likelihood Loss) measures the discrepancy between predictions and ground truth. The name itself shows its connection to torch.nn.LogSoftmax.

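NLLLoss expects its input to be log-transformed likelihoods, that is, the result of applying LogSoftmax to the logits. In text classification the input tensor usually has shape=(N, C).
The target expected by NLLLoss usually has shape=(N,), and its elements must be class indices (in [0, C-1]) or ignore_index.
For datasets with an imbalanced class distribution, per-class weights can be passed through the weight argument to adjust the loss.
The per-sample loss is $l_n=-w_{y_n} * x_{n,y_n}$ , where $y_n$ is the class label of the $n$-th sample and $x_{n,y_n}$ is the model's predicted log-likelihood for the $n$-th sample on class $y_n$ . Without weights (all $w_c = 1$ ) the batch loss is $\frac{1}{N}\sum_n l_n$ ; when weight is given, the default mean reduction divides by the sum of the applied weights instead: $\frac{\sum_n l_n}{\sum_n w_{y_n}}$ .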

log_softmax_logits = torch.nn.LogSoftmax(dim=1)(logits)
target = torch.tensor([0, 2], dtype=torch.long)
print(torch.nn.NLLLoss(reduction="none")(log_softmax_logits, target))
# tensor([1.1331, 1.2679], grad_fn=<NllLossBackward0>)

The executed code shows that the negative log-likelihood loss of the first sample is simply -log_softmax_logits[0][0], and that of the second sample is -log_softmax_logits[1][2].
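
The effect of the weight argument can be sketched in the same way. The class weights below are hypothetical; the point is that, with the default mean reduction, the weighted losses are divided by the sum of the weights of the target classes rather than by the batch size.

```python
import torch

demo_log_probs = torch.log_softmax(
    torch.tensor([[0.3, 0.4, 0.3], [0.7, 0.1, 0.2]]), dim=1)
demo_target = torch.tensor([0, 2])
class_weight = torch.tensor([1.0, 1.0, 3.0])  # hypothetical weights: class 2 counts three times

per_sample = torch.nn.NLLLoss(weight=class_weight, reduction="none")(demo_log_probs, demo_target)
mean_loss = torch.nn.NLLLoss(weight=class_weight)(demo_log_probs, demo_target)

# Weighted mean: sum of weighted losses / sum of the weights that were applied
manual = per_sample.sum() / class_weight[demo_target].sum()

print(per_sample)
print(mean_loss, manual)
```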

torch.nn.CrossEntropyLoss

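CrossEntropyLoss can be seen as the combination of LogSoftmax and NLLLoss, and its constructor arguments are largely the same as those of NLLLoss. For a single sample the loss is computed as: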
$l_n = - \sum_c w_c * y_{n, c} *log( \frac {exp(x_{n, c})}{\displaystyle\sum_{i=1}^C exp(x_{n, i})})$

We can see that CrossEntropyLoss returns the same result as LogSoftmax + NLLLoss.

print(torch.nn.CrossEntropyLoss(reduction="none")(logits, target))
# tensor([1.1331, 1.2679], grad_fn=<NllLossBackward0>)

torch.nn.Sigmoid

In this context the sigmoid activation also serves the cross-entropy computation. It is applied to every element of the input tensor and is defined as:
$\frac{1}{1 + exp(-x)}$

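The output tensor has the same size as the input tensor, and its elements lie in the interval (0, 1). As $x$ approaches positive infinity, $sigmoid(x)$ approaches 1; as $x$ approaches negative infinity, $sigmoid(x)$ approaches 0; at $x = 0$ , $sigmoid(x) = 0.5$ .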

torch.nn.BCELoss

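BCELoss (Binary Cross Entropy Loss) computes the discrepancy between the model's predicted probability distribution and the true label distribution. The loss for a single sample is computed with the following formula, which matches the definition of cross-entropy: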
$l_n = -w_n*[y_n*log(x_n) + (1- y_n)*log(1-x_n)]$

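Here $w_n$ is the weight of the $n$-th sample, $x_n$ is the model's predicted likelihood of the positive class, and $y_n$ is the true label.
The $input$ and $target$ tensors of BCELoss can have any number of dimensions, as long as their shapes match; accepting inputs of arbitrary dimension is a core advantage of BCELoss. However, BCELoss requires the elements of $target$ to lie in [0.0, 1.0] and to be floating point.
In real classification scenarios the elements of $target$ are usually 0.0 or 1.0.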

# batch_size = 3, num_labels = 2
binary_logits = torch.tensor([[0.8, 0.2], [0.1, 0.9], [0.6, 0.4]], dtype=torch.float, requires_grad=True)

# float dtype
binary_target = torch.tensor([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]], dtype=torch.float)

print(torch.nn.BCELoss(reduction="none")(torch.nn.Sigmoid()(binary_logits), binary_target))
# tensor([[0.4511, 0.7781], [0.7244, 0.5212], [0.7375, 0.7130]], grad_fn=<BinaryCrossEntropyBackward0>)

# Step-by-step breakdown
# step 1: apply the sigmoid activation to the model's binary_logits
sigmoid_binary_logits = torch.nn.Sigmoid()(binary_logits)
print(sigmoid_binary_logits)
# tensor([[0.6900, 0.5498], [0.5250, 0.7109], [0.6457, 0.5987]], grad_fn=<SigmoidBackward0>)

# step 2: compute the loss from the predicted probabilities and target
# loss at [0, 0]: -[0.9 * torch.log(sigmoid_binary_logits[0,0]).item() + (1 - 0.9) * torch.log(1 - sigmoid_binary_logits[0, 0]).item()] = 0.4511
# loss at [0, 1]: -[0.1 * torch.log(sigmoid_binary_logits[0,1]).item() + (1 - 0.1) * torch.log(1 - sigmoid_binary_logits[0, 1]).item()] = 0.7781
# loss at [1, 0]: -[0.2 * torch.log(sigmoid_binary_logits[1,0]).item() + (1 - 0.2) * torch.log(1 - sigmoid_binary_logits[1, 0]).item()] = 0.7244

torch.nn.BCEWithLogitsLoss

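BCEWithLogitsLoss is equivalent to Sigmoid combined with BCELoss. WithLogits means the class expects its $input$ to be the logits output by the model, not the likelihoods obtained after sigmoid. The loss for a single sample is computed as: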
$l_n = -w_n*[y_n*log(sigmoid(x_n)) + (1- y_n)*log(1-sigmoid(x_n))]$

BCEWithLogitsLoss is commonly used in multi-label text classification. In the multi-label setting the loss for a single sample is computed as:
$l_n = \displaystyle\sum_{c=1}^C-w_{n,c}*[p_c *y_{n,c}*log(sigmoid(x_{n, c})) + (1- y_{n,c})*log(1-sigmoid(x_{n,c}))]$

$p_c$ is the weight applied to the positive term of class $c$ (the pos_weight argument). $p_c > 1$ increases recall, while $p_c < 1$ increases precision.
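
To make the role of $p_c$ concrete, the sketch below reproduces the element-wise loss by hand and compares it with `BCEWithLogitsLoss(pos_weight=...)`. The logits, targets and weights are made-up values used only for illustration.

```python
import torch
import torch.nn.functional as F

ml_logits = torch.tensor([[0.8, -0.2], [0.1, 0.9]])
ml_target = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
pos_weight = torch.tensor([3.0, 1.0])  # hypothetical: positives of class 0 count three times

weighted = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction="none")(ml_logits, ml_target)
plain = torch.nn.BCEWithLogitsLoss(reduction="none")(ml_logits, ml_target)

# -[p_c * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))], using log(1 - sigmoid(x)) = logsigmoid(-x)
manual = -(pos_weight * ml_target * F.logsigmoid(ml_logits)
           + (1 - ml_target) * F.logsigmoid(-ml_logits))

print(weighted)
print(plain)
```

Only the positive entry of class 0 changes (it is scaled by 3); negative entries and the other class are untouched, which is why a larger $p_c$ pushes the model toward predicting that class more often.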

print(torch.nn.BCEWithLogitsLoss(reduction="none")(binary_logits, binary_target))
# tensor([[0.4511, 0.7781], [0.7244, 0.5212],  [0.7375, 0.7130]], grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
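
On moderate logits the fused and two-step versions match, as the output above shows. With an extreme logit they can diverge, because sigmoid saturates to exactly 1.0 in float32 before the log is taken. A small sketch under that assumption (the value 200 is chosen only to force saturation):

```python
import torch

extreme_logit = torch.tensor([200.0])  # illustrative extreme value
negative_target = torch.tensor([0.0])

# Two-step route: sigmoid(200) rounds to 1.0, so log(1 - p) is log(0)
two_step = torch.nn.BCELoss()(torch.sigmoid(extreme_logit), negative_target)

# Fused route: works on the logit directly and returns the true loss of about 200
fused = torch.nn.BCEWithLogitsLoss()(extreme_logit, negative_target)

print(two_step, fused)
```

According to the PyTorch documentation, BCELoss clamps its log outputs to avoid an infinite loss, so the two-step value is finite but no longer reflects how wrong the prediction is. This is one practical reason to prefer BCEWithLogitsLoss.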

API summary

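torch.nn.LogSoftmax combined with torch.nn.NLLLoss is equivalent to torch.nn.CrossEntropyLoss, and torch.nn.Sigmoid combined with torch.nn.BCELoss is equivalent to torch.nn.BCEWithLogitsLoss. So torch.nn.CrossEntropyLoss and torch.nn.BCEWithLogitsLoss should be the direct choices for computing cross-entropy loss.
In CrossEntropyLoss, the elements of the $logits$ tensor can be viewed as the model's predicted likelihood for a sample on a particular class, before the activation and log functions are applied.
In BCEWithLogitsLoss, the elements of the $logits$ tensor can be viewed as the model's predicted likelihood that a sample belongs to a particular class, before the activation and log functions are applied.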

Typical classification tasks

Multi-class, single-label text sequences (C>2)

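When a text sequence has to be assigned to exactly one class and the set of classes has more than two members, the task is multi-class, single-label text classification. CrossEntropyLoss is recommended for the training loss.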

# batch_size = 2, num_labels=3
logits = torch.tensor([[0.1, 0.2, 0.7], [0.4, 0.5, 0.1]], dtype=torch.float, requires_grad=True)
target = torch.tensor([2, 1], dtype=torch.long, requires_grad=False)
print(torch.nn.CrossEntropyLoss()(logits, target)) 
# tensor(0.8569, grad_fn=<NllLossBackward0>)

# This also says that the first sample belongs only to the third class and the second sample only to the second class
target_onehot = torch.tensor([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]], dtype=torch.float, requires_grad=False)
print(torch.nn.BCEWithLogitsLoss()(logits, target_onehot))
# tensor(0.6795, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)

Note that for inputs with the same meaning, CrossEntropyLoss and BCEWithLogitsLoss give different results.

# Verify that CrossEntropyLoss and BCEWithLogitsLoss give different results
log_softmax_logits = torch.nn.LogSoftmax(dim=1)(logits)
print(log_softmax_logits)
# tensor([[-1.3679, -1.2679, -0.7679], [-1.0459, -0.9459, -1.3459]], grad_fn=<LogSoftmaxBackward0>)
# For CrossEntropyLoss
# loss of the first sample: 0 + 0 + -1 * -0.7679
# loss of the second sample: 0 + -1 * -0.9459 + 0
# mean loss of the batch: (0.7679 + 0.9459) / 2 = 0.8569
print(torch.nn.CrossEntropyLoss(reduction="none")(logits, target))
# tensor([0.7679, 0.9459], grad_fn=<NllLossBackward0>)

sigmoid_logits = torch.nn.Sigmoid()(logits)
print(sigmoid_logits)
# For BCEWithLogitsLoss, let [n, c] denote the model's loss on sample n for class c
# position [0, 0]: -(0 + 1 * torch.log(1 - sigmoid_logits[0, 0]).item()) = 0.7444
# position [0, 1]: -(0 + 1 * torch.log(1 - sigmoid_logits[0, 1]).item()) = 0.7981
# position [0, 2]: -(1 * torch.log(sigmoid_logits[0, 2]).item() + 0) = 0.4032
# mean loss of the first sample: (0.7444 + 0.7981 + 0.4032) / 3 = 0.6486
# position [1, 0]: -(0 + 1 * torch.log(1 - sigmoid_logits[1, 0]).item()) = 0.9130
# position [1, 1]: -(1 * torch.log(sigmoid_logits[1, 1]).item() + 0) = 0.4741
# position [1, 2]: -(0 + 1 * torch.log(1 - sigmoid_logits[1, 2]).item()) = 0.7444
# mean loss of the second sample: (0.9130 + 0.4741 + 0.7444) / 3 = 0.7105
# mean loss of the batch: (0.6486 + 0.7105) / 2 = 0.6795
print(torch.nn.BCEWithLogitsLoss(reduction="none")(logits, target_onehot))
# tensor([[0.7444, 0.7981, 0.4032], [0.9130, 0.4741, 0.7444]], grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)

Note that in language-model prediction, the logits can be reshaped to (batch_size * seq_len, vocab_size), so that $num\_labels=vocab\_size$ . A language model can therefore be seen as a multi-class, single-label task.
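
A minimal sketch of that reshaping, with random logits standing in for a real language model. The sizes and the padding id `pad_id` are assumptions made for the example; `ignore_index` is used so that padded positions do not contribute to the loss.

```python
import torch

batch_size, seq_len, vocab_size = 2, 5, 11  # illustrative sizes
pad_id = 0                                  # hypothetical padding token id

torch.manual_seed(0)
lm_logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
labels = torch.randint(1, vocab_size, (batch_size, seq_len))
labels[1, 3:] = pad_id  # pretend the second sequence is shorter

# Flatten (batch, seq) into one axis: every token becomes one classification sample
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=pad_id)
lm_loss = loss_fn(lm_logits.view(-1, vocab_size), labels.view(-1))
lm_loss.backward()

print(lm_loss)
print(lm_logits.grad[1, 3:].abs().sum())  # padded positions receive zero gradient
```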

Multi-class, multi-label text sequences (C>2)

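When a text sequence has to be assigned to one or several classes and the set of classes has more than two members, the task is multi-class, multi-label text classification. BCEWithLogitsLoss should be used here, and the $target$ it expects is a multi-hot variant of the one-hot form.

That is, along the class dimension, the $target$ tensor takes the value 1.0 at the positions of the classes the sample belongs to and 0.0 elsewhere.
PyTorch's CrossEntropyLoss does not fit this task: with class-index targets the API requires $logits$ of $size = (N, C)$ and $target$ of $size = (N,)$ , which assumes that each sample belongs to exactly one class. Newer PyTorch releases also accept class-probability targets of $size = (N, C)$ , but the softmax inside still treats the classes as mutually exclusive.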

output_num = 2  # assume each sample belongs to at most two classes
batch_size, num_classes = 2, 4
logits = torch.randn(batch_size, num_classes, dtype=torch.float, requires_grad=True)
# random class indices per sample; duplicates in a row simply yield a single label
target_idx = torch.randint(0, high=num_classes, size=(batch_size, output_num))
target_onehot = torch.zeros(batch_size, num_classes).float()
# write 1.0 at every listed class index to build the multi-hot target
target_onehot = torch.scatter(target_onehot, dim=1, index=target_idx, value=1.0)
loss = torch.nn.BCEWithLogitsLoss()(logits, target_onehot)
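
At prediction time, the same per-class independence means each class is decided on its own. A common approach, sketched here with made-up logits and an assumed threshold of 0.5, is to apply sigmoid and keep every class whose probability exceeds the threshold:

```python
import torch

ml_logits = torch.tensor([[2.0, -1.0, 0.3, -3.0],
                          [-0.5, 1.5, -2.0, 0.1]])  # illustrative model outputs
threshold = 0.5                                     # assumed cut-off, tune per task

probs = torch.sigmoid(ml_logits)
predicted = (probs > threshold).int()

# Class indices predicted for each sample
predicted_classes = [row.nonzero(as_tuple=True)[0].tolist() for row in predicted]
print(predicted_classes)  # [[0, 2], [1, 3]]
```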

Binary text-sequence tasks (two classes, single label)

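Given a text sequence, when the model has to decide which of two specific classes it belongs to, the task is binary classification. A binary task can use either CrossEntropyLoss or BCEWithLogitsLoss, but BCEWithLogitsLoss is recommended. When using CrossEntropyLoss, note that the size of the class dimension should be 2: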

logits = torch.tensor([[0.6, 0.4], [0.3, 0.7]], dtype=torch.float, requires_grad=True)
target = torch.tensor([0, 1], dtype=torch.long, requires_grad=False)
print(torch.nn.CrossEntropyLoss()(logits, target))
# tensor(0.5556, grad_fn=<NllLossBackward0>)

If BCEWithLogitsLoss is chosen instead, the class dimension should have size 1 (or be dropped entirely, as here), and $target$ must be floating point:

logits = torch.tensor([0.6, 0.3], dtype=torch.float, requires_grad=True)
target = torch.tensor([0, 1], dtype=torch.float, requires_grad=False)
print(torch.nn.BCEWithLogitsLoss()(logits, target))
# tensor(0.7959, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>)
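
The two formulations are closely related: for two classes, softmax over the pair of logits equals sigmoid of their difference. The sketch below checks this numerically; reducing the two-class logits to a single log-odds value is an illustration added here, not a step from the original notes.

```python
import torch

two_class_logits = torch.tensor([[0.6, 0.4], [0.3, 0.7]])
class_index = torch.tensor([0, 1])

ce = torch.nn.CrossEntropyLoss()(two_class_logits, class_index)

# Log-odds of class 1 versus class 0: softmax([a, b])[1] == sigmoid(b - a)
logit_diff = two_class_logits[:, 1] - two_class_logits[:, 0]
bce = torch.nn.BCEWithLogitsLoss()(logit_diff, class_index.float())

print(ce, bce)
```

Both losses come out equal, so a model with a single output logit and BCEWithLogitsLoss can represent the same decision as a two-logit model with CrossEntropyLoss.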