PyTorch Self-Study Notes (2): Exploring Common Activation and Loss Functions in PyTorch

JimmyTotoro

Last modified: 2025-04-17 21:09:06

First published: 2020-07-02 17:43:54

Article link: https://blog.csdn.net/duxiaodong1122/article/details/106979370

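This is the second post in the PyTorch Self-Study Notes series. It gives a brief introduction to the activation functions and loss functions commonly used in PyTorch. It draws mainly on Chapter 3 of the book Natural Language Processing with PyTorch (https://nlp-pt.apachecn.org/docs/3.html#categorical-cross-entropy-loss) (the link points to the Chinese translation, but I recommend the English edition, since the Chinese one reads like machine translation), Zhou Zhihua's "watermelon book", and the official PyTorch documentation.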

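Activation Functions

Activation functions are nonlinear functions introduced into a neural network to capture complex relationships in the data. Nonlinearity is necessary because, with a linear (identity) activation, every layer only forms a linear combination of its inputs. A composition of linear maps is itself a linear map, so a deep network with many hidden layers could represent nothing more than a network with a single layer, and the extra hidden layers might as well be removed. For multiple hidden layers to be meaningful, the activation function must be nonlinear. Commonly used activation functions include sigmoid, tanh, and ReLU.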
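As a small check of the argument above, the following sketch stacks two linear layers with no activation in between and folds them into one equivalent linear map. The layer sizes, the seed, and the variable names are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two stacked linear layers with no activation in between (sizes are arbitrary).
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

# Fold them into one equivalent linear map: W = W2 @ W1, b = W2 @ b1 + b2.
merged_weight = layer2.weight @ layer1.weight
merged_bias = layer2.weight @ layer1.bias + layer2.bias

x = torch.randn(5, 4)
stacked_out = layer2(layer1(x))
merged_out = x @ merged_weight.T + merged_bias

# Without a nonlinearity, the two-layer network equals a single linear layer.
linear_collapses = torch.allclose(stacked_out, merged_out, atol=1e-5)

# Inserting a nonlinearity between the layers breaks the equivalence.
nonlinear_out = layer2(torch.relu(layer1(x)))
differs_with_relu = not torch.allclose(nonlinear_out, merged_out, atol=1e-5)

print(linear_collapses, differs_with_relu)
```

The same folding works for any number of stacked linear layers, which is why depth only adds expressive power once a nonlinear activation sits between the layers.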

Sigmoid

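The sigmoid activation function is defined as:
$y = \frac{1}{1 + e^{-x}}$
It has the following properties: as x approaches negative infinity, y approaches 0; as x approaches positive infinity, y approaches 1; and at x = 0, y = 0.5.
In PyTorch, sigmoid is implemented as torch.sigmoid(), used as shown in the code below: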

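import torch
import matplotlib.pyplot as plt

# torch.range is deprecated; torch.arange samples [-5, 5) in steps of 0.1
x = torch.arange(-5., 5., 0.1)
y = torch.sigmoid(x)  # apply sigmoid element-wise
plt.plot(x.numpy(), y.numpy())
plt.show()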

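The code above produces the following plot:
[Figure: plot of the sigmoid function]
The plot shows the advantages and disadvantages of the sigmoid function.
Advantages:

Its output lies in (0, 1), and it is monotonic and continuous. The bounded output range keeps optimization stable and makes sigmoid suitable for an output layer.
Its derivative is easy to compute.

Disadvantages:

It saturates quickly (it produces extreme outputs for most inputs). In the saturated regions its gradient is close to zero, which leads to vanishing gradients and makes training difficult.
Its output is not zero-centered.

Because of these trade-offs, sigmoid units are rarely seen in neural networks anywhere other than the output layer.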
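To make the saturation point concrete, the sketch below uses autograd to evaluate the sigmoid's derivative at a few inputs and compares it with the closed form y(1 - y). The sample inputs are arbitrary values chosen for illustration.

```python
import torch

# Inputs near zero and deep in the saturated regions (arbitrary sample points).
x = torch.tensor([-10.0, -5.0, 0.0, 5.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)

# Summing gives a scalar, so backward() fills x.grad with dy/dx per element.
y.sum().backward()
autograd_grad = x.grad

# Closed form of the derivative: sigmoid(x) * (1 - sigmoid(x)).
analytic_grad = (y * (1 - y)).detach()

print(autograd_grad)
```

The derivative peaks at 0.25 at x = 0 and shrinks toward zero as |x| grows, so a saturated sigmoid unit passes almost no gradient back to earlier layers.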

Tanh

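The tanh activation function can be seen as an improved version of sigmoid, and the two look very similar. Here is the plot:
[Figure: plot of the tanh function]
Compared with the sigmoid plot above, tanh has the same shape; only the scale and the output range, which is (-1, 1), differ. The tanh activation function is defined as:
$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
PyTorch provides tanh as torch.tanh(). Sample code: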

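import torch
import matplotlib.pyplot as plt

# torch.range is deprecated; torch.arange samples [-5, 5) in steps of 0.1
x = torch.arange(-5., 5., 0.1)
y = torch.tanh(x)  # apply tanh element-wise
plt.plot(x.numpy(), y.numpy())
plt.show()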

Note that, unlike sigmoid, the output of tanh is zero-centered. It still saturates, however, so the vanishing gradient problem remains.
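The observation that tanh has the same shape as sigmoid up to scale and range can be made precise with the identity tanh(x) = 2 * sigmoid(2x) - 1. The sketch below checks it numerically over the same input range used for the plots.

```python
import torch

x = torch.arange(-5.0, 5.0, 0.1)

# tanh is a sigmoid stretched vertically to (-1, 1) and compressed horizontally by 2.
rescaled_sigmoid = 2 * torch.sigmoid(2 * x) - 1
same_curve = torch.allclose(torch.tanh(x), rescaled_sigmoid, atol=1e-5)

print(same_curve)
```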

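ReLU and Its Variants

ReLU is short for rectified linear unit. The name presumably comes from the fact that it is a simple linear function for positive inputs, with negative inputs rectified (clipped) to zero (this is my own guess). Its mathematical expression is: $f(x) = \max(0, x)$
The expression is easy to read: it takes the linear function f(x) = x and clips its negative outputs to 0.
In PyTorch, ReLU is available as torch.nn.ReLU():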

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

relu = nn.ReLU()  # plain ReLU has no learnable parameters
x = torch.arange(-5., 5., 0.1)  # torch.range is deprecated
y = relu(x)

plt.plot(x.numpy(), y.numpy())
plt.show()

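The code plots the ReLU curve: zero for negative inputs and the identity for positive inputs.

Compared with sigmoid and tanh, ReLU clips negative values and does not saturate for positive inputs, which mitigates the vanishing gradient problem. It introduces the dying ReLU problem instead: as training proceeds, some units in the network start to output 0 and never recover. To address this, variants such as Leaky ReLU and Parametric ReLU (PReLU) were proposed. Both can be written with the same formula:
$f(x) = \max(ax, x)$
Here a is the slope applied to negative inputs (with 0 <= a < 1). In Leaky ReLU, a is a fixed constant, typically 0.01 (the default negative_slope of PyTorch's nn.LeakyReLU); in PReLU, a is a learnable parameter. The following shows how to use PReLU in PyTorch: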

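import torch
import torch.nn as nn
import matplotlib.pyplot as plt

prelu = nn.PReLU(num_parameters=1)  # one learnable negative slope shared by all inputs
x = torch.arange(-5., 5., 0.1)  # torch.range is deprecated
y = prelu(x)

# y depends on a learnable parameter, so detach it before converting to NumPy
plt.plot(x.numpy(), y.detach().numpy())
plt.show()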

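The code produces the following plot:
[Figure: plot of the PReLU function]
The plot shows that, unlike ReLU, PReLU does not clip negative outputs completely: negative inputs still produce a nonzero output and gradient, which addresses the dying ReLU problem.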
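The difference between the two variants can be sketched side by side: nn.LeakyReLU takes its slope as a fixed hyperparameter, while nn.PReLU stores it as a parameter that receives a gradient. The sample inputs below are arbitrary, and 0.25 is the default initial slope of nn.PReLU.

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])

# Leaky ReLU: the negative slope is a fixed hyperparameter (default 0.01).
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
leaky_out = leaky_relu(x)

# PReLU: the negative slope is a learnable parameter (initialized to 0.25 by default).
prelu = nn.PReLU(num_parameters=1)
prelu_out = prelu(x)

# Because the slope is a parameter, it receives a gradient during backprop.
prelu_out.sum().backward()
slope_grad = prelu.weight.grad

print(leaky_out)
print(prelu_out)
print(prelu.weight, slope_grad)
```

An optimizer that includes prelu.parameters() would update the slope along with the rest of the model's weights.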

Softmax

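Like the sigmoid activation function, softmax squashes the output of each unit into the range 0 to 1. In addition, softmax divides each exponentiated output by the sum of all exponentiated outputs, which yields a discrete probability distribution over the n classes whose probabilities sum to 1. The softmax function is defined as:
$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$
Softmax is now commonly used as the output layer in multiclass classification. An example of its use in PyTorch: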

import torch
import torch.nn as nn

softmax = nn.Softmax(dim=1)
x_input = torch.randn(1, 3)
y_output = softmax(x_input)
print(x_input)
print(y_output)
print(y_output.size())
print(torch.sum(y_output, dim=1))

Sample output:

tensor([[-0.9071, -1.7802,  0.8074]])
tensor([[0.1434, 0.0599, 0.7967]])
torch.Size([1, 3])
tensor([1.])
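To connect the formula with the numbers in the sample output, the sketch below writes softmax out by hand and applies it to the sample input printed above. The helper manual_softmax is a hypothetical name introduced for illustration; subtracting the row maximum is a common stability trick and is an addition of this sketch, not something the formula requires.

```python
import torch

def manual_softmax(x: torch.Tensor, dim: int = 1) -> torch.Tensor:
    """Softmax written out from the formula (hypothetical helper, for illustration)."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    shifted = x - x.max(dim=dim, keepdim=True).values
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=dim, keepdim=True)

# The sample input from the output shown above, copied by hand.
x_input = torch.tensor([[-0.9071, -1.7802, 0.8074]])
y_manual = manual_softmax(x_input)
y_builtin = torch.softmax(x_input, dim=1)

print(y_manual)
print(y_builtin)
```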

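Loss Functions

Put simply, a loss function measures how far the model's predictions are from the actual values; the goal of training (parameter updates) is to reduce the value of the loss function. PyTorch implements many loss functions in its nn package. Here we cover only a few common ones: Mean Squared Error Loss, Categorical Cross-Entropy Loss, and Binary Cross-Entropy Loss. (If the concept of a loss function is unfamiliar, see this article.)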

Mean Squared Error Loss

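Mean Squared Error (MSE) Loss is the mean of the squared differences between the predictions and the targets. It is commonly used in regression problems. In PyTorch, it is implemented as torch.nn.MSELoss(), used as follows: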

import torch
import torch.nn as nn

mse_loss = nn.MSELoss()
outputs = torch.randn(3, 5, requires_grad=True)  # stand-in for model predictions
targets = torch.randn(3, 5)                      # stand-in for ground-truth values
loss = mse_loss(outputs, targets)
print(loss)

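The code above prints a value like the following (the exact number varies from run to run because the inputs are random):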

tensor(2.5741, grad_fn=<MseLossBackward>)
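The definition above can be checked against nn.MSELoss() directly. The sketch below recomputes the loss as the mean of squared differences and also compares the gradient with respect to the predictions with its closed form; the seed is an arbitrary choice that only makes the run repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
outputs = torch.randn(3, 5, requires_grad=True)
targets = torch.randn(3, 5)

# Built-in loss with the default reduction="mean".
builtin_loss = nn.MSELoss()(outputs, targets)

# The same quantity written out: mean of the squared element-wise differences.
manual_loss = ((outputs - targets) ** 2).mean()

# Gradient of the mean squared error w.r.t. the outputs: 2 * (outputs - targets) / N,
# where N is the total number of elements.
builtin_loss.backward()
expected_grad = 2 * (outputs.detach() - targets) / outputs.numel()

print(builtin_loss.item(), manual_loss.item())
```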

Categorical Cross-Entropy Loss

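Categorical Cross-Entropy Loss is commonly used in multiclass classification. It is computed as: $loss=-\sum_{c=1}^M y_c \log(p_c)$
Here M is the number of classes; y is a vector of M elements representing the true multinomial distribution over the classes for the sample. If exactly one class is correct (multiclass classification), this vector is a one-hot vector; p_c is the probability the model predicts for the sample belonging to class c. Note that cross-entropy and its formula originate in information theory, but here we simply treat it as a way to measure how different two probability distributions are.
To deepen the understanding of Categorical Cross-Entropy Loss, the book Natural Language Processing with PyTorch lists four pieces of information that determine the subtle relationship between the network's output and the loss function. They are quoted here from the original: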

First, there is a limit to how small or how large a number can be.
Second, if input to the exponential function used in the softmax formula is a negative number, the resultant is an exponentially small number, and if it’s a positive number, the resultant is an exponentially large number.
Next, the network’s output is assumed to be the vector just prior to applying the softmax function.
Finally, the log function is the inverse of the exponential function, and log(exp(x)) is just equal to x.

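A rough explanation: in multiclass classification, the model's output layer is usually a softmax layer:
$softmax(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n}e^{x_j}}$ The core of the softmax computation is the exponential function: a positive input gives an exponentially large result, and a negative input gives an exponentially small one. Since there is a limit to how large or how small a floating-point number can be, these results can overflow or underflow. To keep this from harming training and to make the computation numerically more stable, Categorical Cross-Entropy Loss works with the logarithm of the softmax output. Because the log function is the inverse of the exponential function, the log and the exponential can be combined analytically rather than evaluated one after the other.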
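The numerical issue can be reproduced in a few lines. The sketch below compares taking the log of a naively computed softmax with torch.log_softmax, which works in log space. The logits are made-up values chosen to be large enough to overflow float32.

```python
import torch

# Deliberately extreme logits (made-up values) to trigger overflow in float32.
logits = torch.tensor([[1000.0, 0.0, -1000.0]])

# Naive route: exponentiate, normalize, then take the log.
naive_exp = torch.exp(logits)  # exp(1000) overflows to inf
naive_log_probs = torch.log(naive_exp / naive_exp.sum(dim=1, keepdim=True))

# Stable route: log_softmax never forms exp(1000) explicitly.
stable_log_probs = torch.log_softmax(logits, dim=1)

print(naive_log_probs)
print(stable_log_probs)
```

The naive route ends up with nan and -inf entries, while the log-space route returns finite log-probabilities for the same input.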
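Categorical Cross-Entropy Loss is implemented in PyTorch as nn.CrossEntropyLoss(). It expects the raw, unnormalized network outputs (logits) and applies log-softmax internally, so no softmax layer should be applied beforehand. A usage example: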

import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
outputs = torch.randn(3, 5, requires_grad=True) 
targets = torch.tensor([1, 0, 3], dtype=torch.int64) 
loss = ce_loss(outputs, targets)

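The input and output shapes of nn.CrossEntropyLoss() deserve emphasis. According to the PyTorch documentation, it takes two inputs: input and target. The input has shape (N, C) (the shape differs for the higher-dimensional variant of the loss, which is not considered here; interested readers can consult the documentation). It can be read as N rows, where each row holds one instance's raw scores (logits) over all classes; these are not yet probabilities, since the loss applies softmax internally. N is the mini-batch size, i.e. the number of instances fed to the model at a time, and C is the number of classes. The target has shape (N), and each value is the actual class index of one instance (0 <= target_value <= C-1). Taking the inputs to nn.CrossEntropyLoss() in the code above as an example, the variable outputs is the input, with value: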

print(outputs)

tensor([[ 0.0740, -0.2800, -0.8952, -0.1498, -1.3669],
        [-1.7670, -1.0848,  0.0876, -0.3991, -0.1583],
        [-0.7523,  1.0118, -0.4698, -1.7559,  0.3190]], requires_grad=True) # each row holds one instance's raw scores (logits) over all classes

The variable targets is the target, with value:

print(targets)

tensor([1, 0, 3]) # the actual class is 1 for the first instance, 0 for the second, and 3 for the third

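The output is a scalar that measures the difference between the predicted and the actual class distributions, averaged over the batch (training aims to minimize this difference). For the code above, the output is: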

print(loss)

tensor(2.6130, grad_fn=<NllLossBackward>)
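As a sanity check on the value above, the sketch below recomputes the loss step by step from the outputs and targets tensors printed earlier, copied by hand and therefore rounded to four decimals. It relies on the documented behavior that cross-entropy in PyTorch is log-softmax followed by negative log-likelihood, averaged over the batch by default.

```python
import torch
import torch.nn.functional as F

# The outputs and targets printed above, copied by hand (rounded to 4 decimals).
outputs = torch.tensor([[ 0.0740, -0.2800, -0.8952, -0.1498, -1.3669],
                        [-1.7670, -1.0848,  0.0876, -0.3991, -0.1583],
                        [-0.7523,  1.0118, -0.4698, -1.7559,  0.3190]])
targets = torch.tensor([1, 0, 3])

# Step 1: turn each row of raw scores into log-probabilities.
log_probs = F.log_softmax(outputs, dim=1)

# Step 2: pick the log-probability of the true class for each instance.
true_class_log_probs = log_probs[torch.arange(3), targets]

# Step 3: negate and average over the batch (the default reduction is "mean").
manual_loss = -true_class_log_probs.mean()

builtin_loss = F.cross_entropy(outputs, targets)
print(manual_loss, builtin_loss)
```

Both values agree with the 2.6130 printed above up to the rounding of the copied inputs.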

Binary Cross-Entropy Loss

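The Categorical Cross-Entropy Loss introduced above is for multiclass classification. For binary classification, Binary Cross-Entropy (BCE) Loss is commonly used. Mathematically it is defined the same way and can be viewed as the special case of Categorical Cross-Entropy Loss with two classes.
BCE Loss is provided in PyTorch as nn.BCELoss(). Its output, like that of nn.CrossEntropyLoss(), is a scalar, but its input shapes are different, and unlike nn.CrossEntropyLoss() it expects probabilities (for example, sigmoid outputs) rather than raw scores. Take the following code as an example: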

import torch
import torch.nn as nn

bce_loss = nn.BCELoss()
sigmoid = nn.Sigmoid()

outputs = sigmoid(torch.randn(4, 1, requires_grad=True))
targets = torch.tensor([1, 0, 1, 0], dtype = torch.float32).view(4, 1)
loss = bce_loss(outputs, targets)

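The input again consists of input and target. The input has shape (N, *), where N is the mini-batch size (the number of instances fed to the model at a time), each value is the model's predicted probability that the instance belongs to the positive class, and * stands for any additional dimensions (here a single trailing dimension of size 1). The variable outputs in the code above is the input, with value: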

print(outputs)

tensor([[0.5256],
        [0.3294],
        [0.4902],
        [0.7872]], grad_fn=<SigmoidBackward>) # each row is the model's predicted probability that the instance belongs to the positive class

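The target also has shape (N, *), and each value is the actual class label of the instance (as a float, 0. or 1.). The variable targets in the code above is the target, with value: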

print(targets)

tensor([[1.],
        [0.],
        [1.],
        [0.]]) # each row is the actual class of the instance

The output is again a scalar:

print(loss)

tensor(0.8258, grad_fn=<BinaryCrossEntropyBackward>)
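The value above can be reproduced by writing the binary cross-entropy formula out by hand. The sketch below uses the probabilities and labels printed earlier, copied by hand and therefore rounded to four decimals.

```python
import torch
import torch.nn as nn

# Probabilities and labels printed above, copied by hand (rounded to 4 decimals).
probs = torch.tensor([[0.5256], [0.3294], [0.4902], [0.7872]])
targets = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

# BCE per instance: -[y * log(p) + (1 - y) * log(1 - p)], then averaged over the batch.
manual_loss = -(targets * torch.log(probs)
                + (1 - targets) * torch.log(1 - probs)).mean()

builtin_loss = nn.BCELoss()(probs, targets)
print(manual_loss, builtin_loss)
```

When the model produces raw scores rather than probabilities, nn.BCEWithLogitsLoss() is an alternative worth considering: it applies the sigmoid internally, which tends to be more stable numerically than a separate nn.Sigmoid() followed by nn.BCELoss().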

Conclusion

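The complete code is linked here. This post summarizes the activation and loss functions commonly used in PyTorch, but it does not describe the parameters of each method; readers who want those details will need to consult the documentation.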