Loss Functions and Optimization Algorithms

qq_26697045

Category: Deep Learning. Tags: algorithms, pytorch, deep learning

Article link: https://blog.csdn.net/qq_26697045/article/details/118878286


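1. Loss functions
2. Binary classification losses
3. Multi-class classification losses
4. Regression losses
5. Optimization algorithms
- 5.1 Adam
- 5.2 SGD
6. References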

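1. Loss functions

The two PyTorch loss-function arguments size_average and reduce are deprecated and have been replaced by reduction. reduction takes one of three strings, none, mean, or sum: none applies no reduction and returns one loss per element, mean averages the losses, and sum adds them up. The default is mean.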
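To make the three reduction modes concrete, here is a minimal plain-Python sketch that mimics them on a list of per-element absolute errors. The helper reduce_loss is hypothetical and only illustrates the semantics; it is not part of PyTorch.

```python
def reduce_loss(per_element, reduction="mean"):
    """Mimic the three reduction modes on a flat list of per-element losses."""
    if reduction == "none":
        return list(per_element)      # keep one loss per element
    if reduction == "sum":
        return sum(per_element)
    if reduction == "mean":
        return sum(per_element) / len(per_element)
    raise ValueError(f"unknown reduction: {reduction}")


predictions = [1.0, 2.0, 3.0, 4.0]
targets = [5.0, 6.0, 8.0, 8.0]
abs_errors = [abs(p - t) for p, t in zip(predictions, targets)]

loss_none = reduce_loss(abs_errors, "none")   # [4.0, 4.0, 5.0, 4.0]
loss_sum = reduce_loss(abs_errors, "sum")     # 17.0
loss_mean = reduce_loss(abs_errors, "mean")   # 4.25
print(loss_none, loss_sum, loss_mean)
```

With none the result has the same number of entries as the input, which is why the examples in this article use reduction='none' to inspect individual losses.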

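2. Binary classification losses

Although multi-class tasks are more common, simple binary classification losses have their own use cases. In object detection, for example, if the model only needs to separate background from objects, a binary classification loss is enough.

1. The sigmoid function

The raw network output is called the logit. It is a tensor whose values lie in $(-\infty, +\infty)$. That range is too wide, so the logits are normalized into the [0, 1] interval. The sigmoid and softmax functions are the usual choices, for binary and multi-class classification respectively.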

$sigmoid(x) = \frac{\exp(x)}{1 + \exp(x)}. \tag{2.1}$
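Evaluating formula 2.1 literally overflows in floating point once x is large, because exp(x) is computed first. A common workaround, sketched here in plain Python as an illustration (not PyTorch's actual implementation), is to branch on the sign of x so that only non-positive numbers are ever exponentiated.

```python
import math


def sigmoid(x):
    """Numerically stable sigmoid: only exponentiates non-positive numbers."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)           # x < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)


values = [sigmoid(x) for x in (-1000.0, -2.0, 0.0, 2.0, 1000.0)]
print(values)
```

Both branches compute the same function; the split only avoids calling exp on a large positive argument.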

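2. BCELoss

BCELoss stands for binary cross entropy loss. Its input and target must have the same shape, and every value of input must lie in [0, 1]. The network output is therefore usually passed through the Sigmoid function to map it into [0, 1] before it is used as the input of BCELoss. With x denoting input and y denoting target, BCELoss is computed as: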

$BCELoss(x, y) = -1 \times [ y \cdot \ln(x) + (1 - y) \cdot \ln(1 - x)] = \begin{cases} -\ln(1-x)\ &y=0, \\ -\ln x &y=1. \end{cases} \tag{2.2}$

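The definition and usage of BCELoss are shown below:

import torch

def self_BCELoss(prob, target):
    # prob holds probabilities in [0, 1], not raw logits.
    loss = -1 * (target * torch.log(prob) + (1 - target) * torch.log(1 - prob))
    return loss

if __name__ == '__main__':
    prob = torch.tensor([0.9, 0.1])
    target = torch.tensor([0.0, 1.0])
    BCELoss = torch.nn.BCELoss(reduction='none')
    loss1 = BCELoss(prob, target)
    loss2 = self_BCELoss(prob, target)
    print(loss1)
    print(loss2)

prob comes from the network output. Its values must lie in [0, 1], so it is usually the result of applying softmax or sigmoid to the network output.
target is a label of 0 or 1, marking a negative or a positive sample. By formula 2.2, when the label is 0 the loss increases monotonically with prob, so minimizing it pushes the predicted value of negative samples down; when the label is 1 the loss decreases monotonically with prob, so minimizing it pushes the predicted value of positive samples up.
With reduction='none', loss.shape = prob.shape = target.shape, and loss[i] depends only on prob[i] and target[i].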
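The monotonic behaviour described above can be checked by hand with a small plain-Python sketch of formula 2.2. The function bce is a hypothetical helper. The floor of -100 on the logarithm follows the clamp that the PyTorch documentation describes for BCELoss; everything else is just the formula.

```python
import math


def bce(prob, target, log_floor=-100.0):
    """Binary cross entropy for one probability in [0, 1] and a 0/1 target."""
    # Guard against log(0) by flooring the log terms (assumed floor: -100).
    log_p = max(math.log(prob), log_floor) if prob > 0 else log_floor
    log_1mp = max(math.log(1 - prob), log_floor) if prob < 1 else log_floor
    return -(target * log_p + (1 - target) * log_1mp)


loss_negative = bce(0.9, 0)   # negative sample predicted as 0.9: about 2.3026
loss_positive = bce(0.1, 1)   # positive sample predicted as 0.1: about 2.3026
loss_saturated = bce(1.0, 0)  # completely wrong prediction hits the floor: 100.0
print(loss_negative, loss_positive, loss_saturated)
```

Both badly predicted samples get the same loss of about 2.3026, and the loss for a negative sample grows as the predicted value moves from 0.2 to 0.8.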

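3. Multi-class classification losses

3.1 Cross entropy

The cross-entropy loss CrossEntropyLoss is a common multi-class loss. It decomposes into three sub-operations: Softmax, ln, and NLLLoss.

1. The softmax function

The softmax function is defined as: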

$softmax(x_i) = \frac{\exp(x_i)}{\exp(x_1) + \exp(x_2) + \cdots + \exp(x_n)}. \tag{3.1}$

Softmax takes a vector as input, whereas Sigmoid takes a scalar. The definition and usage of Softmax are shown below:

import torch

def self_Softmax(logit):
    return torch.exp(logit) / torch.exp(logit).sum()

if __name__ == '__main__':
    logit = torch.tensor([1.0, 2.0, 3.0])
    Softmax = torch.nn.Softmax(dim=0)
    out1 = Softmax(logit)
    out2 = self_Softmax(logit)
    print(out1)
    print(out2)
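A direct translation of formula 3.1 overflows when the logits are large. The sketch below, in plain Python and under the assumption that the input is a flat list of floats, shows the usual remedy: subtract the maximum logit before exponentiating, which leaves the result unchanged because the common factor cancels.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted by the maximum for stability."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]   # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])            # roughly [0.090, 0.245, 0.665]
large = softmax([1000.0, 1001.0, 1002.0])   # same probabilities, no overflow
print(probs)
print(large)
```

Shifting all logits by the same constant gives the same probabilities, so the second call returns the same distribution as the first.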

2. LogSoftmax

LogSoftmax is equivalent to taking the natural logarithm of the Softmax output. Usage:

import torch

def self_LogSoftmax(logit):
    Softmax = torch.nn.Softmax(dim=0)
    return torch.log(Softmax(logit))

if __name__ == '__main__':
    logit = torch.tensor([1.0, 2.0, 3.0])
    LogSoftmax = torch.nn.LogSoftmax(dim=0)
    loss1 = LogSoftmax(logit)
    loss2 = self_LogSoftmax(logit)
    print(loss1)
    print(loss2)
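Taking the logarithm of a softmax that has already underflowed to 0 loses the answer. A minimal plain-Python sketch of the log-sum-exp form, which computes the same quantity without forming the probabilities first, might look like this (log_softmax here is an illustrative helper, not the library function):

```python
import math


def log_softmax(logits):
    """log(softmax(x)) computed as x - logsumexp(x), with a max shift."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return [x - log_sum_exp for x in logits]


log_probs = log_softmax([1.0, 2.0, 3.0])   # roughly [-2.408, -1.408, -0.408]
extreme = log_softmax([0.0, 1000.0])       # [-1000.0, 0.0]
print(log_probs)
print(extreme)
```

In the second call the softmax of the first entry is too small to represent, yet its logarithm, -1000, is recovered exactly.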

3. NLLLoss

NLLLoss stands for negative log likelihood loss. It simply returns the negated value of input at the index given by target, so target must be a long integer tensor and input a floating-point tensor. Usage:

import torch

if __name__ == '__main__':
    input = torch.tensor([[0.1, 0.2, 99.5], [1.3, 2.4, 0.4]], dtype=torch.float32)
    target = torch.tensor([2, 0], dtype=torch.long)
    NLLLoss = torch.nn.NLLLoss(reduction='none')
    output = NLLLoss(input, target)
    print(output)  # tensor([-99.5, -1.3]).

As the output shows, NLLLoss just picks the entry of input indexed by target and negates it.
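The indexing behaviour is simple enough to reproduce in one line of plain Python. The sketch below uses nested lists in place of tensors and a hypothetical helper nll_loss that corresponds to reduction='none'.

```python
def nll_loss(inputs, targets):
    """Return -inputs[i][targets[i]] for every row i (no reduction)."""
    return [-row[k] for row, k in zip(inputs, targets)]


inputs = [[0.1, 0.2, 99.5], [1.3, 2.4, 0.4]]
targets = [2, 0]
losses = nll_loss(inputs, targets)
print(losses)  # [-99.5, -1.3]
```

Note that nothing here takes a logarithm: the loss only becomes a negative log likelihood when the inputs are log-probabilities, for example the output of LogSoftmax.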

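4. CrossEntropyLoss

CrossEntropyLoss is the cross-entropy loss. It is equivalent to LogSoftmax (Softmax followed by ln) followed by NLLLoss, so the cross entropy is the negative log of the softmax value. Suppose there are n classes, the network output is $logit = [x_1, x_2, ..., x_k, ..., x_n]$, and the label is $target = k$. The cross-entropy loss is then

$loss(logit,\ target) = - \ln(\frac{\exp(x_k)}{\sum_{i=1}^{n}\exp(x_i)}).\tag{3.2}$

logit and target usually have more than one dimension. The example below uses a two-dimensional logit to show how the cross-entropy loss is used in PyTorch.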

import torch

if __name__ == '__main__':
    logit = torch.tensor([[9, 5, 8, 67, 11], [6, 75, 8, 4, 7]], dtype=torch.float32)
    CrossEntropyLoss = torch.nn.CrossEntropyLoss(reduction='none')
    # Target given as class indices.
    label = torch.tensor([3, 1], dtype=torch.long)
    loss1 = CrossEntropyLoss(logit, label)
    print(loss1)
    # Target given as a one-hot encoding.
    one_hot = torch.tensor([[0, 0, 0, 1, 0], [0, 1, 0, 0, 0]], dtype=torch.float32)
    loss2 = CrossEntropyLoss(logit, one_hot)
    print(loss2)
    # Equivalent computation: log of softmax followed by NLLLoss.
    NLLLoss = torch.nn.NLLLoss(reduction="none")
    loss3 = torch.log(torch.softmax(logit, dim=1))
    loss3 = NLLLoss(loss3, label)
    print(loss3)

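Targets in one-hot form require PyTorch 1.10 or later. The loss function also accepts other arguments:

ignore_index=3: the cross entropy of samples labeled with class 3 is set to 0. Since label[0]=3, loss1[0] becomes 0. This argument applies only when the target holds class indices.
weight = torch.tensor([1, 2, 3, 4, 5], dtype=torch.float): the cross entropy of class 0 is multiplied by 1, that of class 1 by 2, and so on. Since label[0]=3, loss1[0] is multiplied by weight[3].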
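The effect of ignore_index and weight can be illustrated with a plain-Python sketch of the per-sample loss (the reduction='none' case). The helper cross_entropy is hypothetical, and the logits differ from the ones used earlier in this article: those are so well separated that every loss is practically 0, which would hide the effect of the weights.

```python
import math


def cross_entropy(logits, targets, weight=None, ignore_index=-100):
    """Per-sample cross entropy for class-index targets, without reduction."""
    losses = []
    for row, k in zip(logits, targets):
        if k == ignore_index:
            losses.append(0.0)        # ignored samples contribute nothing
            continue
        shift = max(row)
        log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in row))
        w = 1.0 if weight is None else weight[k]
        losses.append(w * (log_sum_exp - row[k]))   # -w * log_softmax(row)[k]
    return losses


logits = [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 0.0, 1.0, 2.0]]
labels = [3, 1]
plain = cross_entropy(logits, labels)
ignored = cross_entropy(logits, labels, ignore_index=3)
weighted = cross_entropy(logits, labels, weight=[1.0, 2.0, 3.0, 4.0, 5.0])
print(plain, ignored, weighted)
```

The first sample has label 3, so its loss is zeroed by ignore_index=3 and scaled by weight[3]=4 in the weighted call; the second sample has label 1 and is scaled by weight[1]=2.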

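3.2 Cross entropy with label smoothing

1. How label smoothing works

Cross entropy is sensitive to outliers. Label smoothing was proposed to address this:

$y_{warm} =y_{hot}(1-\alpha) + \alpha / K. \tag{3.3}$

where K is the number of classes, $\alpha$ is a small number such as 0.1, and $y_{hot}$ is the one-hot encoding of the label. Label smoothing thus replaces the label $y_{hot}$ with $y_{warm}$.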
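As a worked instance of formula 3.3, the plain-Python sketch below smooths a five-class one-hot label with $\alpha = 0.2$. The function name smooth_labels is chosen here for illustration only.

```python
def smooth_labels(one_hot, alpha):
    """Apply formula 3.3 to a one-hot label given as a list of 0/1 values."""
    num_classes = len(one_hot)
    return [y * (1 - alpha) + alpha / num_classes for y in one_hot]


y_hot = [0, 0, 0, 1, 0]
y_warm = smooth_labels(y_hot, alpha=0.2)
print(y_warm)  # approximately [0.04, 0.04, 0.04, 0.84, 0.04]
```

The smoothed label still sums to 1: the true class keeps 0.8 + 0.04 = 0.84 of the mass and every other class receives 0.04.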

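2. Implementing label-smoothed cross entropy

When this article was written, PyTorch had no built-in label-smoothed cross entropy (version 1.10 later added a label_smoothing argument to CrossEntropyLoss), so let us first look at the TensorFlow implementation.

import tensorflow as tf

if __name__ == '__main__':
    logit = tf.convert_to_tensor([[9, 5, 8, 67, 11], [6, 75, 8, 4, 7]], dtype=tf.float32)
    target = tf.convert_to_tensor([[0, 0, 0, 1, 0], [0, 1, 0, 0, 0]])
    loss = tf.compat.v1.losses.softmax_cross_entropy(
        onehot_labels=target, logits=logit, label_smoothing=0.2, reduction=tf.compat.v1.losses.Reduction.NONE)
    print(loss)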

A PyTorch implementation of label-smoothed cross entropy is shown below.

import torch
import torch.nn.functional as F

def CELoss(input, target, label_smooth=0.2, reduction="none"):
    assert (input.size(0) == target.size(0))
    num_classes = input.size(1)
    # Convert class-index labels to a one-hot encoding.
    if input.size() != target.size():
        target = F.one_hot(target.long(), num_classes)
    if label_smooth != 0:
        target = target * (1 - label_smooth) + torch.ones_like(target) * label_smooth / num_classes
        # target = target * (1 - label_smooth) + label_smooth / num_classes
    return F.cross_entropy(input, target.float(), reduction=reduction)

if __name__ == '__main__':
    logits = torch.tensor([[9, 5, 8, 67, 11], [6, 75, 8, 4, 7]], dtype=torch.float32)
    label = torch.tensor([3, 1], dtype=torch.long)
    loss1 = CELoss(logits, label)
    one_hot = torch.tensor([[0, 0, 0, 1, 0], [0, 1, 0, 0, 0]], dtype=torch.float32)
    loss2 = CELoss(logits, one_hot)
    print(loss1)
    print(loss2)

This label-smoothed cross-entropy function accepts both kinds of labels. The assignment to target inside the label_smooth branch and the commented-out line right below it are equivalent; both implement formula 3.3. Passing a floating-point target to F.cross_entropy requires PyTorch 1.10 or later.
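To see what smoothing does to the numbers, the same computation can be carried out by hand. The plain-Python sketch below is an illustration, not the library code; the helpers soft_cross_entropy and smooth are hypothetical. It evaluates $-\sum_j y_j \ln softmax(x)_j$ for the two logit rows used above, once with hard labels and once with $\alpha = 0.2$.

```python
import math


def soft_cross_entropy(logits, soft_target):
    """Cross entropy between one row of logits and a probability vector."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return -sum(y * (x - log_sum_exp) for x, y in zip(logits, soft_target))


def smooth(label, num_classes, alpha):
    """Smoothed one-hot vector for a class index, following formula 3.3."""
    return [(1 - alpha) * (1.0 if j == label else 0.0) + alpha / num_classes
            for j in range(num_classes)]


logits = [[9.0, 5.0, 8.0, 67.0, 11.0], [6.0, 75.0, 8.0, 4.0, 7.0]]
labels = [3, 1]
hard = [soft_cross_entropy(row, smooth(k, 5, 0.0)) for row, k in zip(logits, labels)]
smoothed = [soft_cross_entropy(row, smooth(k, 5, 0.2)) for row, k in zip(logits, labels)]
print(hard)      # both practically 0: the predictions are very confident
print(smoothed)  # approximately [9.4, 11.0]
```

With hard labels these confident, correct predictions cost essentially nothing. With smoothing, each wrong class carries a target of 0.04, so the large logit gaps are penalized: 0.04 × (58 + 62 + 59 + 56) = 9.4 for the first row and 0.04 × (69 + 67 + 71 + 68) = 11.0 for the second.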

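3.3 Focal loss

Focal loss is used to mitigate the hard-sample problem.

1. How binary focal loss works

Focal loss addresses both the imbalance between positive and negative samples and the hard-sample problem. Recall formula 2.2: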

$\begin{cases} -ln(1-x) &y=0,\\ -lnx &y=1. \end{cases} \tag{2.2}$

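Whether the sample is positive (y=1) or negative (y=0), minimizing the loss drives $\ln(\cdot)$ toward 0. Formula 2.2 can be rewritten in the equivalent form:

$-\ln p_t, \quad \text{where } p_t = \begin{cases} 1-x &y=0,\\ x &y=1. \end{cases} \tag{3.4}$

Minimizing the loss therefore pushes $p_t$ toward 1. A $p_t$ close to 0 marks a hard sample, and a $p_t$ close to 1 marks an easy sample. To mitigate the hard-sample problem, the loss of hard samples can be multiplied by a larger weight and the loss of easy samples by a smaller one: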

$-\alpha_{t} (1-p_t)^\gamma \ln(p_t) = \begin{cases} -(1 - \alpha) \cdot x^\gamma \cdot \ln(1-x) &y=0,\\ -\alpha \cdot (1-x)^\gamma \cdot \ln(x) &y=1. \end{cases} \tag{3.5}$

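If $p_t$ is close to 1, the sample is easy to classify, so its loss is multiplied by $(1-p_t)^\gamma$, which is close to 0. If $p_t$ is close to 0, the sample is hard to classify, so its loss is multiplied by $(1-p_t)^\gamma$, which is close to 1. Empirically, $\gamma = 2$ works best.
Formula 3.5 also uses the coefficient $\alpha$, set to 0.25, which balances positive and negative samples. Note that positive samples are multiplied by 0.25 and negative samples by 0.75.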
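The down-weighting of easy samples is easy to check numerically. The following plain-Python sketch evaluates the binary focal loss for a single predicted probability, with positives weighted by $\alpha$ and negatives by $1 - \alpha$ as stated above. The helper binary_focal_loss is hypothetical and assumes the prediction lies strictly between 0 and 1.

```python
import math


def binary_focal_loss(x, y, alpha=0.25, gamma=2.0):
    """Focal loss for one predicted probability x in (0, 1) and a 0/1 label y."""
    p_t = x if y == 1 else 1.0 - x               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_positive = binary_focal_loss(0.9, 1)   # about 0.00026
hard_positive = binary_focal_loss(0.1, 1)   # about 0.46627
print(easy_positive, hard_positive)
```

For plain cross entropy the hard sample costs about 22 times as much as the easy one (2.3026 versus 0.1054). The factor $(1-p_t)^\gamma$ multiplies that ratio by 0.81 / 0.01 = 81, so under focal loss the hard sample dominates far more strongly.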

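2. Multi-class focal loss

In the multi-class setting every class, background included, has a label, so every class acts as a positive sample and the computation becomes simpler: take the softmax of the network prediction as $p_t$, then multiply the negative log of $p_t$ by $(1 - p_t)^\gamma$ to get the focal loss. The sigmoid-based variant used in the implementation below instead scores each class separately. Suppose there are n classes including background, the network output is $x = [x_1, x_2, ..., x_k, ..., x_n]$, and the label is $target = k$, written as a one-hot vector in the formulas. The focal loss is then computed in the following steps:

$\begin{cases} p &= sigmoid(x) \\ p_t &= p \cdot target + (1 - p) \cdot (1 - target) \\ \alpha_t &= \alpha \cdot target + (1 - \alpha) \cdot (1 - target) \\ ce\_loss &= - [(1 - target) \cdot \ln(1 - p) + target \cdot \ln(p)] \\ loss &= \alpha_t (1 - p_t) ^ \gamma \cdot ce\_loss \end{cases} \tag{3.6}$

Formula 3.6 is not complicated, because target is one-hot encoded and only takes the values 0 and 1. Focal loss treats the n-class problem as n binary problems, computes the loss of each by formula 3.5, and returns the n loss values, which are then summed or averaged as needed.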

3. Implementing focal loss

The focal loss below comes from torchvision. It is based on the sigmoid function and performs a binary classification for each class, hence the name sigmoid_focal_loss.

import torch
import torch.nn.functional as F
import torchvision

def sigmoid_focal_loss(inputs, targets, alpha=0.25, gamma=2, reduction="none"):
    ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
    p = torch.sigmoid(inputs)
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = ce_loss * ((1 - p_t) ** gamma)
    if alpha >= 0:
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    if reduction == "mean":
        loss = loss.mean()
    elif reduction == "sum":
        loss = loss.sum()
    return loss

if __name__ == '__main__':
    logit = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
    target = torch.tensor([[0, 0, 1], [0, 0, 1]], dtype=torch.float32)
    loss1 = torchvision.ops.sigmoid_focal_loss(logit, target)
    loss2 = sigmoid_focal_loss(logit, target)
    print(loss1)
    print(loss2)

Here binary_cross_entropy_with_logits computes the same quantity as BCELoss applied to p = sigmoid(inputs):

t1 = (1 - targets) * torch.log(1 - p)
t2 = targets * torch.log(p)
ce_loss = -1 * (t1 + t2)
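Computing the sigmoid first and the logarithm second breaks down for large logits, because the probability rounds to exactly 0 or 1. A widely used algebraically equivalent form works on the logit directly: max(x, 0) - x·t + ln(1 + exp(-|x|)). The plain-Python sketch below compares the two; both function names are hypothetical, and this is an illustration of the idea rather than PyTorch's source code.

```python
import math


def bce_with_logits(x, t):
    """Stable binary cross entropy on a raw logit x and a 0/1 target t."""
    return max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))


def bce_naive(x, t):
    """Same quantity via sigmoid then log; fails once sigmoid(x) rounds to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))


moderate_stable = bce_with_logits(2.0, 1.0)   # about 0.1269
moderate_naive = bce_naive(2.0, 1.0)          # same value
extreme = bce_with_logits(50.0, 0.0)          # 50.0; the naive form would need log(0)
print(moderate_stable, moderate_naive, extreme)
```

For moderate logits the two versions agree. At x = 50 the sigmoid rounds to exactly 1.0 in double precision, so the naive version would have to evaluate log(0), while the stable form returns the correct loss of 50.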

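4. Regression losses

This section covers three loss functions: L1, L2, and smooth L1. Smooth L1 combines the L1 and L2 losses, so it is the one generally used.

1. L1 loss

$L1Loss(x, y) =|x_i - y_i|.$

import torch

loss = torch.nn.L1Loss(reduction='sum')
x = torch.tensor([[1, 2], [3, 4.]])
y = torch.tensor([[5, 6], [7, 8.]])
z = loss(x, y)  # every difference is 4, so the sum is 16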

2. L2 loss

$MSELoss(x, y) = (x_i - y_i)^2.$

loss = torch.nn.MSELoss(reduction='sum')

3. Smooth L1 loss

$SmoothL1Loss(x, y) = \begin{cases} 0.5(x_i - y_i)^2\ &|x_i - y_i|<1 \\ |x_i - y_i|-0.5\ &|x_i - y_i| \geq 1 \end{cases}$

loss = torch.nn.SmoothL1Loss(reduction='sum')

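Comparing the three loss functions:

[Figure: L losses]

Minimizing the L1 loss has no closed-form solution, so it is slow to solve. The L2 loss is prone to exploding gradients when the difference is large, but it has a closed-form solution and is faster to compute. SmoothL1Loss uses the L2 loss when the difference is small (below 1) and the L1 loss when it is large, combining the advantages of both. The coefficient 0.5 and the offset 0.5 make the two pieces join continuously at a difference of 1.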
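The trade-off becomes visible when the three formulas are evaluated side by side. This plain-Python sketch applies them to a few differences and to the x and y values from the L1 example above; the function names are chosen here for illustration.

```python
def l1_loss(diff):
    return abs(diff)


def l2_loss(diff):
    return diff * diff


def smooth_l1_loss(diff):
    a = abs(diff)
    return 0.5 * a * a if a < 1 else a - 0.5


# (L1, L2, smooth L1) for a small, a boundary, and a large difference.
table = {d: (l1_loss(d), l2_loss(d), smooth_l1_loss(d)) for d in (0.5, 1.0, 4.0)}
print(table)

# The x and y from the L1 example: every difference is -4.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 6.0, 7.0, 8.0]
diffs = [a - b for a, b in zip(x, y)]
l1_sum = sum(l1_loss(d) for d in diffs)              # 16.0
l2_sum = sum(l2_loss(d) for d in diffs)              # 64.0
smooth_sum = sum(smooth_l1_loss(d) for d in diffs)   # 14.0
print(l1_sum, l2_sum, smooth_sum)
```

At a difference of 4 the L2 loss is already 16 while smooth L1 stays at 3.5, and at a difference of exactly 1 both branches of smooth L1 give 0.5, which is the continuity mentioned above.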

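5. Optimization algorithms

Adam and SGD are the two most commonly used optimization algorithms in deep learning.

import torch

# Placeholder parameter so the snippet runs; pass model.parameters() in practice.
params = [torch.nn.Parameter(torch.zeros(2))]
# Adaptive moment estimation
optimizer1 = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0, amsgrad=False)
# Stochastic gradient descent; the original signature marks lr as required, so a value is passed here
optimizer2 = torch.optim.SGD(params, lr=1e-2, momentum=0, dampening=0, weight_decay=0, nesterov=False)
# Adam with decoupled weight decay
optimizer3 = torch.optim.AdamW(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2, amsgrad=False)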

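5.1 Adam

Adaptive moment estimation: robust, but it may not find the optimal solution.

betas: coefficients of the running averages of the gradient and of its square.
eps: epsilon, a small constant added to the denominator to avoid division by zero.
amsgrad: whether to use the AMSGrad variant. AMSGrad improves Adam by adding a constraint: it keeps the maximum of the past second-moment estimates, so the effective per-parameter learning rate never increases.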
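How betas, eps, and amsgrad enter the update can be seen in a scalar sketch of one Adam step. This is a simplified plain-Python illustration without weight decay, not the PyTorch source; the function adam_step and the state dictionary are hypothetical names.

```python
import math


def adam_step(theta, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, amsgrad=False):
    """One Adam update for a single scalar parameter."""
    beta1, beta2 = betas
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad          # running mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad   # running mean of squared gradients
    v = state["v"]
    if amsgrad:
        state["v_max"] = max(state["v_max"], v)   # second moment may only grow
        v = state["v_max"]
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias correction
    v_hat = v / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0, "v_max": 0.0}
theta = 1.0
history = []
for _ in range(3):
    grad = 2 * theta                # gradient of f(theta) = theta ** 2
    theta = adam_step(theta, grad, state)
    history.append(theta)
print(history)  # roughly [0.999, 0.998, 0.997]
```

On the first step the bias-corrected moments equal the gradient and its square, so the update is close to lr times the sign of the gradient regardless of the gradient's magnitude. That scale invariance is what makes the step size adaptive.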

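5.2 SGD

Stochastic gradient descent: it tends to oscillate, which slows down the descent. Alternatives are stochastic gradient descent with momentum (SGDM) and the adaptive gradient algorithm AdaGrad.
Momentum helps the optimizer avoid getting stuck in local optima.

momentum: momentum coefficient.
dampening: dampening factor applied to the momentum.
nesterov: whether to use Nesterov momentum.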
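The roles of momentum, dampening, and nesterov can be sketched for a single scalar parameter. The plain-Python code below follows the update rule given in the PyTorch SGD documentation (the buffer starts as the first gradient, then accumulates) but omits weight decay; sgd_step and run are hypothetical helpers, not library functions.

```python
def sgd_step(theta, grad, state, lr, momentum=0.0, dampening=0.0, nesterov=False):
    """One SGD update for a single scalar parameter, optionally with momentum."""
    if momentum != 0.0:
        if state["buf"] is None:
            state["buf"] = grad      # first step: buffer starts as the raw gradient
        else:
            state["buf"] = momentum * state["buf"] + (1 - dampening) * grad
        grad = grad + momentum * state["buf"] if nesterov else state["buf"]
    return theta - lr * grad


def run(steps, lr=0.1, **kwargs):
    """Minimize f(theta) = theta ** 2 from theta = 1 and return the trajectory."""
    theta, state, path = 1.0, {"buf": None}, []
    for _ in range(steps):
        theta = sgd_step(theta, 2 * theta, state, lr, **kwargs)
        path.append(theta)
    return path


plain_path = run(2)                                   # about [0.8, 0.64]
momentum_path = run(2, momentum=0.9)                  # about [0.8, 0.46]
nesterov_path = run(1, momentum=0.9, nesterov=True)   # about [0.62]
print(plain_path, momentum_path, nesterov_path)
```

The first step is identical with and without momentum. From the second step on, the buffer adds 0.9 times the previous gradient, so the parameter moves from 0.8 to 0.46 instead of 0.64.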

6. References
