Machine Learning Notes: Activation Functions

UQI-LIUWJ

Last modified: 2024-07-10 16:42:27


First published: 2021-09-09 18:54:56

Article link: https://blog.csdn.net/qq_40206371/article/details/120207049


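1 Overview of activation functions

Activation function: a function applied to a neuron's input signal that transforms it, either linearly or (in practice, almost always) nonlinearly.

2 Why activation functions need to be nonlinear

Without an activation function, the input to every layer is a linear function of the previous layer's output. Composing linear maps gives another linear map, so no matter how many layers the network has, its output is still a linear function of the input, which is equivalent to having no hidden layers at all. The approximation power of such a network is very limited.

For this reason we use nonlinear functions as activations. A deep network then has far more expressive power: its output is no longer a linear combination of the inputs, and with enough hidden units it can approximate almost any function.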
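
The collapse of stacked linear layers can be checked numerically. The following is a minimal plain-Python sketch; the weight matrices W1 and W2 and the input x are arbitrary illustrative values, not taken from any real model. It compares a two-layer network without activation against the single matrix W2·W1, and then shows that putting a ReLU in between breaks the equivalence.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

# Two "layers" without activation (illustrative weights, no biases).
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, -2.0, 0.5]]                       # 3 hidden units -> 1 output

x = [0.7, -1.3]
two_layers = matvec(W2, matvec(W1, x))        # applied layer by layer
one_layer = matvec(matmul(W2, W1), x)         # single collapsed matrix W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # nonlinearity in between

print(two_layers, one_layer, with_relu)
```

The first two results agree up to rounding, while the third differs, because the ReLU zeroes out one hidden unit for this particular input.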

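3 Examples of activation functions

3.1 softmax

Typically used in the final output layer to normalize the outputs.

After softmax, every unit takes a value between 0 and 1 and the values sum to 1, which is the basic requirement for interpreting the transformed outputs as probabilities.

For multi-class classification, the class with the largest probability is chosen as the final prediction.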

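3.1.1 Drawbacks of softmax

Softmax is smooth and differentiable everywhere, but it saturates.
When an input is much smaller than the largest input, its probability (a_j) is approximately 0. Every partial derivative of a_j contains a factor of a_j, so the gradient through that unit is close to 0 regardless of the other inputs, and the corresponding weights barely update during backpropagation.

import torch
m = torch.nn.Softmax(dim=1)  # normalize along dim 1 (the class dimension)
x = torch.randn(2, 3)        # 2 samples, 3 classes
output = m(x)                # each row sums to 1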
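
To make the formula and the saturation behaviour concrete, here is a minimal plain-Python sketch of softmax and of one row of its Jacobian, d a_j / d z_i = a_j (δ_ij - a_i). The helper names and the logits are illustrative assumptions. Subtracting max(z) before exponentiating is a standard stability trick that does not change the result.

```python
import math

def softmax(z):
    """Softmax of a list of logits, shifted by max(z) for numerical stability."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian_row(a, j):
    """Partial derivatives d a_j / d z_i = a_j * (delta_ij - a_i) for all i."""
    return [a[j] * ((1.0 if i == j else 0.0) - a[i]) for i in range(len(a))]

logits = [2.0, 1.0, -8.0]                    # the last logit is far below the others
probs = softmax(logits)
predicted_class = probs.index(max(probs))    # class with the largest probability
grad_last = softmax_jacobian_row(probs, 2)   # gradients of the near-zero probability

print(probs, predicted_class, grad_last)
```

The probability of the last class is tiny, and so is every entry of its gradient row.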

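3.2 tanh

Like sigmoid, tanh squashes (-∞, ∞) into a bounded interval.

The difference is the range: sigmoid maps to (0, 1), while tanh maps to (-1, 1).

In typical binary classification problems, tanh is used in the hidden layers and sigmoid in the output layer, but this is not a fixed rule and should be adjusted to the problem at hand.

import torch
m = torch.nn.Tanh()
x = torch.randn(2)
output = m(x)                # element-wise, values in (-1, 1)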
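
The similarity between the two functions is in fact exact: tanh is a sigmoid that has been rescaled and shifted, tanh(x) = 2·sigmoid(2x) - 1. A small plain-Python check of that identity (the sample points are arbitrary):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a rescaled, shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

samples = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_error = max(abs(tanh_via_sigmoid(x) - math.tanh(x)) for x in samples)
print(max_error)
```

The printed error is at the level of floating-point rounding.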

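3.3 sigmoid

Squashes every input into (0, 1), so it can be used for binary classification, where σ(x) is interpreted as the probability of one class.

$\sigma(x)=\frac{1}{1+e^{-x}}$

import torch
m = torch.nn.Sigmoid()
x = torch.randn(2)
output = m(x)                # element-wise, values in (0, 1)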

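3.3.1 Advantages of sigmoid

The output range is 0 to 1, which effectively normalizes each neuron's output.
Well suited to models whose output is a predicted probability.
The gradient is smooth, which avoids "jumps" in the output values.
The function is differentiable, so the slope of the sigmoid curve can be computed at any point.

3.3.2 Drawbacks of sigmoid

Prone to vanishing gradients: the derivative σ(x)(1 - σ(x)) is at most 0.25 and approaches 0 for inputs of large magnitude.
The output is not zero-centered, which makes weight updates less efficient.
It requires an exponential, which is comparatively slow to compute.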
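
The vanishing-gradient tendency follows directly from the derivative σ'(x) = σ(x)(1 - σ(x)). The sketch below, in plain Python, evaluates it at its peak and in the saturated region, and then computes a best-case bound on the factor that ten stacked sigmoids contribute to a backpropagated gradient. The depth of 10 is an arbitrary illustrative choice, and the weights are deliberately ignored.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)          # the derivative is largest at x = 0
saturated = sigmoid_grad(10.0)    # far from 0 the derivative is tiny

# Best case for the activation factors alone across `depth` sigmoid layers.
depth = 10
best_case_factor = peak ** depth

print(peak, saturated, best_case_factor)
```

Even when every unit sits at its most favourable point, the activations alone scale the gradient by less than one millionth after ten layers.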

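3.3.3 Swish activation (SiLU)

y = x * sigmoid(x)

This is somewhat similar to the sigmoid gate in an LSTM: the sigmoid term acts as a gate on x itself.

import torch
m = torch.nn.SiLU()
x = torch.randn(2)
output = m(x)                # x * sigmoid(x), element-wise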

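3.3.4 h-swish

The sigmoid inside Swish requires an exponential, which is expensive, so Swish is a poor fit for networks deployed on mobile devices.

h-swish replaces the sigmoid with the piecewise-linear approximation ReLU6(x+3)/6, giving h-swish(x) = x * ReLU6(x+3)/6.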
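
How close the approximation is can be checked directly. The following plain-Python sketch implements both functions from their formulas and measures the largest gap on a grid over [-6, 6]. The grid is an arbitrary choice for illustration.

```python
import math

def relu6(x):
    return min(max(0.0, x), 6.0)

def swish(x):
    return x / (1.0 + math.exp(-x))      # x * sigmoid(x)

def h_swish(x):
    return x * relu6(x + 3.0) / 6.0      # sigmoid replaced by ReLU6(x + 3) / 6

grid = [i / 10.0 for i in range(-60, 61)]            # x from -6.0 to 6.0
max_gap = max(abs(h_swish(x) - swish(x)) for x in grid)
print(max_gap)
```

h-swish is exactly 0 for x ≤ -3 and exactly x for x ≥ 3. In between it tracks Swish using only comparisons, additions and multiplications.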

3.3.5 Mish

Mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x).

import torch
m = torch.nn.Mish()
x = torch.randn(2)
output = m(x)

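3.4 ReLU (rectified linear unit)

Cheap to compute: ReLU(x) = max(0, x).
Helps with vanishing and exploding gradients: the slope is exactly 1 for positive inputs, so the activation itself neither shrinks nor amplifies the gradient there (the weights still can).

Units whose value is below 0 output 0 after the activation, so for a given input they can be removed from the network, leaving a thinner linear network.

This does not mean that a network with ReLU is a linear model. Which units are active changes with the input, so the set of edges that are effectively connected, and therefore the linear function being applied, differs from input to input.

import torch
m = torch.nn.ReLU()
x = torch.randn(2)
output = m(x)                # negative entries become 0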
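
The point that the effective linear network changes with the input can be seen with a tiny hidden layer. This is a plain-Python sketch; the weights, biases and inputs are made-up values chosen only so that the two inputs activate different units.

```python
def relu(v):
    return [max(0.0, vi) for vi in v]

def hidden_pre_activations(x):
    """Pre-activations of a 3-unit hidden layer (illustrative weights and biases)."""
    W = [[1.0, -1.0], [-2.0, 0.5], [0.5, 1.0]]
    b = [0.0, 0.5, -1.0]
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def active_units(x):
    """Indices of the hidden units that ReLU lets through for input x."""
    return [i for i, z in enumerate(hidden_pre_activations(x)) if z > 0]

hidden_a = relu(hidden_pre_activations([1.0, 0.0]))
pattern_a = active_units([1.0, 0.0])
pattern_b = active_units([-1.0, 2.0])
print(hidden_a, pattern_a, pattern_b)
```

For the first input only unit 0 survives, for the second only units 1 and 2, so each input is processed by a different thinned-out linear network.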

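3.4.1 Advantages of ReLU

For positive inputs there is no gradient saturation.
Much faster to compute: ReLU involves only a comparison and a linear pass-through, so it is cheaper than sigmoid and tanh, which need exponentials.

3.4.2 Drawbacks of ReLU

The dead ReLU problem.
- For negative inputs ReLU outputs 0. This is not a problem in the forward pass, but in the backward pass the gradient for a negative input is exactly zero, so a unit whose input is always negative stops updating.
- One remedy is to use leaky ReLU or parametric ReLU.
The output of ReLU is not zero-centered.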

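3.4.3 Leaky ReLU (LReLU)

Leaky ReLU addresses the zero-gradient problem for negative inputs by giving them a very small linear component (0.01x) instead of 0.
The leak extends the range of ReLU; the slope a is usually around 0.01.
The range of Leaky ReLU is (-∞, ∞).

import torch
m = torch.nn.LeakyReLU(0.01)  # negative_slope a = 0.01
x = torch.randn(2)
output = m(x)

In theory, Leaky ReLU has all the advantages of ReLU and does not suffer from dead units, but in practice it has not been shown that Leaky ReLU is always better than ReLU.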
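
The difference in the backward pass can be written out by hand. This is a minimal plain-Python sketch of the two derivatives; taking the derivative at exactly x = 0 to be the negative-side value is a convention assumed here.

```python
def relu_grad(x):
    """Derivative of max(0, x); taken as 0 at x = 0 by convention."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, a=0.01):
    return x if x > 0 else a * x

def leaky_relu_grad(x, a=0.01):
    return 1.0 if x > 0 else a

x = -2.0
relu_g = relu_grad(x)          # no learning signal for a negative input
leaky_g = leaky_relu_grad(x)   # small but nonzero
leaky_out = leaky_relu(x)
print(relu_g, leaky_g, leaky_out)
```

With ReLU the gradient for a negative input is exactly 0, whereas Leaky ReLU still passes back a fraction a of it, so the unit can recover.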

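3.4.4 Parametric ReLU (PReLU)

Same form as Leaky ReLU, except that the negative slope is a learnable parameter rather than a fixed constant.

import torch
m = torch.nn.PReLU(num_parameters=1)  # one slope shared by all channels
x = torch.randn(2)
output = m(x)

num_parameters must be either 1 or the number of channels of the input.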

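3.4.5 ELU

$ELU(x)=\begin{cases} x, & x>0\\ \alpha(e^x-1), & x \le 0 \end{cases}$

No dead ReLU problem; the mean output is close to 0, so the activations are roughly zero-centered.
For large negative inputs ELU saturates to a negative value (-α), which reduces the variation and information propagated forward.

The downside is that it is more expensive to compute. As with Leaky ReLU, although it is better than ReLU in theory, there is currently no solid evidence that ELU is always better than ReLU in practice.

Keyword arguments shown in parentheses are the module's default values (this also applies to the other snippets in these notes). The main exceptions are dim in the softmax-family modules and the two arguments of Threshold, which are choices made for the example.

import torch
m = torch.nn.ELU(alpha=1.0)
x = torch.randn(2)
output = m(x)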
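
The two properties claimed for ELU (saturation to -α and a nonzero gradient for negative inputs) can be verified from the formula. This is a minimal plain-Python sketch; the function names are illustrative, and math.expm1 is used for exp(x) - 1 because it is more accurate near 0.

```python
import math

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * math.expm1(x)      # alpha * (exp(x) - 1)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

positive = elu(2.0)              # identity for positive inputs
deep_negative = elu(-20.0)       # approaches -alpha
grad_negative = elu_grad(-1.0)   # nonzero, so the unit keeps learning
print(positive, deep_negative, grad_negative)
```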

3.4.6 ReLU6

ReLU6(x) = min(max(0, x), 6), i.e. ReLU with the output capped at 6.

import torch
m = torch.nn.ReLU6()
x = torch.randn(2)
output = m(x)

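3.4.7 RReLU

A randomized leaky ReLU: during training the negative slope α is sampled from U(lower, upper) independently at each position; in evaluation mode it is fixed at (lower + upper) / 2.

import torch
m = torch.nn.RReLU(lower=0.125, upper=0.3333333333333333)
x = torch.randn(2)
output = m(x)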

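3.4.8 SELU

SELU is an ELU multiplied by a fixed scale: for x > 0 the output is scale * x, and for x ≤ 0 it is scale * α(exp(x) - 1). The constants are fixed rather than tunable: α ≈ 1.6733 and scale ≈ 1.0507.

import torch
m = torch.nn.SELU()
x = torch.randn(2)
output = m(x)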
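
For reference, the formula can be written out in plain Python as below. ALPHA and SCALE are the standard SELU constants, truncated to double precision; the function name is illustrative.

```python
import math

ALPHA = 1.6732632423543772   # standard SELU alpha
SCALE = 1.0507009873554805   # standard SELU scale

def selu(x):
    """scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    return SCALE * (x if x > 0 else ALPHA * math.expm1(x))

positive = selu(1.0)            # slope is SCALE, not 1, on the positive side
lower_bound = -SCALE * ALPHA    # limit as x -> -infinity
deep_negative = selu(-30.0)
print(positive, lower_bound, deep_negative)
```

Unlike ELU, the positive branch has a slope slightly greater than 1, and the negative branch saturates at about -1.7581.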

3.4.9 CELU

For x > 0 the output is x; for x ≤ 0 it is α(exp(x/α) - 1).

import torch
m = torch.nn.CELU(alpha=1.0)
x = torch.randn(2)
output = m(x)

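3.4.10 GELU

GELU(x) = x * Φ(x), where Φ is the cumulative distribution function of the standard normal distribution.

import torch
m = torch.nn.GELU()
x = torch.randn(2)
output = m(x)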
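
GELU multiplies x by the standard normal CDF Φ(x), which can be expressed with the error function. The plain-Python sketch below implements that exact form next to the widely used tanh approximation and measures how far apart they are on a grid over [-5, 5]. The grid and the function names are illustrative choices.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 10.0 for i in range(-50, 51)]            # x from -5.0 to 5.0
max_gap = max(abs(gelu(x) - gelu_tanh(x)) for x in grid)
print(gelu(1.0), max_gap)
```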

3.4.11 SiLU

The same function as Swish in section 3.3.3: y = x * sigmoid(x), available as torch.nn.SiLU().

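3.5 Maxout

A maxout unit outputs the maximum of several linear units. ReLU is a special case of maxout: it is as if one of the compared units were a virtual neuron whose output is always 0, and the comparison decides which value is kept.

The number of pieces in the resulting piecewise-linear function depends on how many outputs are compared at once.

3.5.1 Training maxout

For each input, a different set of edges is connected, so what gets trained also differs from input to input: each step updates only the parameters that are currently connected.

Because different inputs select different connections, every weight does end up being trained. (Max pooling in a CNN is trained in the same way.)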
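
Both ideas, ReLU as a two-piece maxout and only the winning piece being connected for a given input, can be sketched for a scalar input in plain Python. The (weight, bias) pairs below are made-up values for illustration.

```python
def maxout(x, pieces):
    """Maxout over affine pieces w * x + b; returns (value, index of winning piece)."""
    values = [w * x + b for w, b in pieces]
    winner = max(range(len(values)), key=values.__getitem__)
    return values[winner], winner

# ReLU as maxout with two pieces: the identity and a "virtual" unit stuck at 0.
relu_pieces = [(1.0, 0.0), (0.0, 0.0)]
# Three pieces give a piecewise-linear function with up to three segments.
three_pieces = [(-1.0, -1.0), (0.0, 0.0), (2.0, -2.0)]

relu_pos = maxout(3.0, relu_pieces)
relu_neg = maxout(-3.0, relu_pieces)
winners = [maxout(x, three_pieces)[1] for x in (-4.0, 0.5, 4.0)]
print(relu_pos, relu_neg, winners)
```

The winner index shows which piece would receive the gradient for that input. Across the three sample inputs each piece wins once, which is why all weights get trained over a varied dataset.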

3.6 Mish

The same function as in section 3.3.5.

import torch
m = torch.nn.Mish()
x = torch.randn(2)
output = m(x)

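3.7 HardShrink

Hardshrink(x) = x if |x| > λ, and 0 otherwise.

import torch
m = torch.nn.Hardshrink(lambd=0.5)
x = torch.randn(2)
output = m(x)

3.7.1 SoftShrink

Softshrink(x) = x - λ if x > λ, x + λ if x < -λ, and 0 otherwise.

import torch
m = torch.nn.Softshrink(lambd=0.5)
x = torch.randn(2)
output = m(x)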

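3.8 HardSigmoid

A piecewise-linear approximation of sigmoid: 0 for x ≤ -3, 1 for x ≥ 3, and x/6 + 1/2 in between (equivalently ReLU6(x+3)/6).

import torch
m = torch.nn.Hardsigmoid()
x = torch.randn(2)
output = m(x)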

3.9 HardTanh

Clips x to the interval [min_val, max_val] and is the identity inside it.

import torch
m = torch.nn.Hardtanh(min_val=-1.0, max_val=1.0)
x = torch.randn(2)
output = m(x)

3.10 HardSwish

Hardswish(x) = x * ReLU6(x+3)/6, i.e. the h-swish of section 3.3.4.

import torch
m = torch.nn.Hardswish()
x = torch.randn(2)
output = m(x)

3.11 LogSigmoid

LogSigmoid(x) = log(1 / (1 + exp(-x))), the logarithm of the sigmoid.

import torch
m = torch.nn.LogSigmoid()
x = torch.randn(2)
output = m(x)

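3.12 SoftPlus

A smooth approximation of ReLU: Softplus(x) = (1/β) * log(1 + exp(β * x)).

When β * input > threshold, SoftPlus switches to the linear function Softplus(x) = x, which keeps the computation numerically stable.

import torch
m = torch.nn.Softplus(beta=1, threshold=20)
x = torch.randn(2)
output = m(x)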
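
The role of threshold is easier to see when the formula is written out. The following is a plain-Python sketch of the behaviour described above, not PyTorch's actual implementation. It uses the same default values for beta and threshold as the snippet.

```python
import math

def softplus(x, beta=1.0, threshold=20.0):
    """(1/beta) * log(1 + exp(beta * x)); linear once beta * x exceeds threshold."""
    if beta * x > threshold:
        return x
    return math.log1p(math.exp(beta * x)) / beta

at_zero = softplus(0.0)                          # equals log(2)
large = softplus(25.0)                           # linear branch: input returned unchanged
gap_to_relu = softplus(10.0) - max(0.0, 10.0)    # already very close to ReLU
print(at_zero, large, gap_to_relu)
```

Well before the threshold is reached, softplus is already almost indistinguishable from x, so switching to the linear branch loses essentially nothing while avoiding very large exponentials.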

3.13 SoftSign

Softsign(x) = x / (1 + |x|)

import torch
m = torch.nn.Softsign()
x = torch.randn(2)
output = m(x)

3.14 TanhShrink

Tanhshrink(x) = x - tanh(x)

import torch
m = torch.nn.Tanhshrink()
x = torch.randn(2)
output = m(x)

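3.15 Threshold

The output is x if x > threshold, and value otherwise.

import torch
m = torch.nn.Threshold(0.1, 20)
# first argument: threshold; second argument: the replacement value
x = torch.randn(2)
output = m(x)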

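3.16 Softmin

Softmin(x) = Softmax(-x). The elements again lie in [0, 1] and sum to 1, but the smallest input receives the largest probability.

import torch
m = torch.nn.Softmin(dim=1)
x = torch.randn(2, 3)
output = m(x)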

3.17 LogSoftmax

LogSoftmax(x) = log(Softmax(x)).

import torch
m = torch.nn.LogSoftmax(dim=1)
x = torch.randn(2, 3)
output = m(x)                # log-probabilities, all ≤ 0
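
Taking the logarithm of softmax in one fused step, rather than calling log on the softmax output, matters for numerical stability. The plain-Python sketch below uses the log-sum-exp trick: log softmax(z)_i = z_i - logsumexp(z), with the maximum subtracted before exponentiating. The logits are deliberately huge illustrative values for which a naive exp would overflow.

```python
import math

def log_softmax(z):
    """log(softmax(z)) computed as z_i - logsumexp(z), shifted by max(z)."""
    shift = max(z)
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in z))
    return [v - log_sum for v in z]

logits = [1000.0, 1001.0, 1002.0]   # math.exp(1000.0) alone would overflow
log_probs = log_softmax(logits)
total_prob = sum(math.exp(lp) for lp in log_probs)
print(log_probs, total_prob)
```

The resulting log-probabilities are all non-positive, and exponentiating them recovers probabilities that sum to 1.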

