【Machine Learning】18. The Softmax Function

KiraFenvy

Last modified 2022-10-25 20:34:10

Category: Machine Learning. Tags: machine learning, python, deep learning

First published 2022-10-25 18:20:10

Article link: https://blog.csdn.net/m0_51371693/article/details/127516921

The Softmax Function

1. Imports
2. The softmax function
- 2.1 Overview of the algorithm
- 2.2 Loss function
3. TensorFlow
4. Numerical stability of softmax
5. Quiz

1. Imports

import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from IPython.display import display, Markdown, Latex
from sklearn.datasets import make_blobs
%matplotlib widget
from matplotlib.widgets import Slider
from lab_utils_common import dlc
from lab_utils_softmax import plt_softmax
import logging
logging.getLogger("tensorflow").setLevel(logging.ERROR)
tf.autograph.set_verbosity(0)

2. The softmax function

2.1 Overview of the algorithm

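In both softmax regression and neural networks with a softmax output, N outputs are generated and one of them is selected as the predicted category. In both cases a vector $\mathbf{z}$ is produced by a linear function and then passed to the softmax function. The softmax function converts $\mathbf{z}$ into a probability distribution as described below. After applying softmax, each output lies between 0 and 1 and the outputs sum to 1, so they can be interpreted as probabilities. Larger inputs correspond to larger output probabilities. Because softmax uses an exponential form, it stretches large differences between values even further.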
The softmax function can be written:
$a_j = \frac{e^{z_j}}{ \sum_{k=1}^{N}{e^{z_k} }} \tag{1}$
where $z_j$ is the output value of the $j$-th node and $N$ is the number of output nodes, that is, the number of classes. The output $\mathbf{a}$ is a vector of length N, so for softmax regression, you could also write:

$\mathbf{a}(x)=\begin{bmatrix}P(y=1|\mathbf{x};\mathbf{w},b)\\ \vdots\\ P(y=N|\mathbf{x};\mathbf{w},b)\end{bmatrix}=\frac{1}{\sum_{k=1}^N e^{z_k}}\begin{bmatrix}e^{z_1}\\ \vdots\\ e^{z_N}\end{bmatrix}$

The output is a vector of probabilities, one for each possible value of y. A NumPy implementation:

def my_softmax(z):
    ez = np.exp(z)              # element-wise exponential
    sm = ez/np.sum(ez)          # normalize so the outputs sum to 1
    return sm

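A few points to note:

The exponential in the numerator of softmax magnifies small differences in the values.
The output values sum to 1.
Softmax spans all of the outputs. A change in `z0`, for example, changes the values of `a0` through `a3`. Compare this with other activations such as ReLU or Sigmoid, which have a single input and a single output.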
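
The points above can be checked numerically. The following is a minimal sketch that repeats the `my_softmax` definition so it runs on its own; the input values are arbitrary and chosen only for illustration.

```python
import numpy as np

def my_softmax(z):
    ez = np.exp(z)          # element-wise exponential
    return ez / np.sum(ez)

z = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative values, not from the lab
a = my_softmax(z)
print(a, a.sum())                     # the outputs sum to 1

# Changing only z[0] changes every output, because all outputs share the denominator.
z_changed = z.copy()
z_changed[0] = 3.0
a_changed = my_softmax(z_changed)
print(a_changed)
```

A ReLU or sigmoid applied to the same vector would change only the first output, since each element is transformed independently.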

2.2 Loss function

When softmax is used as the activation of the output nodes, the cross-entropy loss is generally used as the loss function.

[Figure: logistic regression compared with softmax]
The cross-entropy loss:
$\begin{equation} L(\mathbf{a},y)=\begin{cases} -log(a_1), & \text{if $y=1$}.\\ &\vdots\\ -log(a_N), & \text{if $y=N$} \end{cases} \tag{3} \end{equation}$
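where y is the target category for this example and $\mathbf{a}$ is the output of the softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to 1.

Note: in this course, the loss refers to a single example, while the cost covers all examples.

Note that in (3) above, only the line that corresponds to the target contributes to the loss; the other lines are zero. To write the cost equation we need an "indicator function" that is 1 when the index matches the target and 0 otherwise.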

$\mathbf{1}\{y == n\} = \begin{cases} 1, & \text{if $y==n$}.\\ 0, & \text{otherwise}. \end{cases}$
Now the cost is:
$\begin{align} J(\mathbf{w},b) = - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{N} 1\left\{y^{(i)} == j\right\} \log \frac{e^{z^{(i)}_j}}{\sum_{k=1}^N e^{z^{(i)}_k} }\right] \tag{4} \end{align}$
Where $m$ is the number of examples and $N$ is the number of outputs. The factor $\frac{1}{m}$ makes this the average of all the losses.
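
As a rough illustration of the loss (3) and the cost described above, here is a NumPy sketch. The helper names (`softmax_rows`, `loss_single`, `cost`) are hypothetical and not part of the lab. The code assumes 0-based integer labels (0 to N-1), as TensorFlow does, whereas the equations number the classes 1 to N, and it averages the losses over the m examples.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax for a batch of logits Z with shape (m, N)."""
    ez = np.exp(Z)
    return ez / ez.sum(axis=1, keepdims=True)

def loss_single(a, y):
    """Loss (3) for one example: -log of the probability of the target class."""
    return -np.log(a[y])

def cost(Z, y):
    """Average of the per-example losses over all m examples."""
    A = softmax_rows(Z)
    m = Z.shape[0]
    # A[np.arange(m), y] picks a_y in each row; this plays the role of the indicator function
    return -np.mean(np.log(A[np.arange(m), y]))

Z = np.array([[2.0, 1.0, 0.1],
              [0.5, 2.5, 0.3]])      # illustrative logits
y = np.array([0, 1])                 # 0-based target classes
print(cost(Z, y))
```

This version does not guard against overflow; the numerically stable form is the subject of section 4.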

3. TensorFlow

Generating the data

# make  dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0,random_state=30)

3.1 The Obvious organization

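The model below is implemented with softmax as the activation of the final Dense layer.

The loss function is specified separately in the `compile` directive.

The loss function is `SparseCategoricalCrossentropy`, the loss described in (3) above. In this model the softmax takes place in the last layer, and the loss function takes in the softmax output, which is a vector of probabilities.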

model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'softmax')    # < softmax activation here
    ]
)
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

model.fit(
    X_train,y_train,
    epochs=10
)

Because the softmax is integrated into the output layer, the output is a vector of probabilities.

Prediction:

p_nonpreferred = model.predict(X_train)
print(p_nonpreferred[:2])
print("largest value", np.max(p_nonpreferred), "smallest value", np.min(p_nonpreferred))

3.2 Preferred

3.2.1 Overview

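More stable and accurate results can be obtained if the softmax and the loss are combined during training. This is enabled by the "preferred" organization shown here.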

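In the preferred organization, the final layer has a linear activation (equivalent to no activation function). For historical reasons, the outputs in this form are referred to as logits. The loss function takes an additional argument: from_logits=True. This informs the loss function that the softmax operation should be included in the loss calculation, which allows for an optimized implementation.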

preferred_model = Sequential(
    [ 
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(4, activation = 'linear')   #<-- Note
    ]
)
preferred_model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  #<-- Note
    optimizer=tf.keras.optimizers.Adam(0.001),  # Adam, a gradient-descent-based optimization algorithm
)

preferred_model.fit(
    X_train,y_train,
    epochs=10
)

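3.2.2 Handling the output

Note that in the preferred model the outputs are not probabilities; they can range from large negative numbers to large positive numbers. When a prediction is expected to be a probability, the output must be sent through a softmax.

Let's look at the outputs of the preferred model: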

p_preferred = preferred_model.predict(X_train)
print(f"two example output vectors:\n {p_preferred[:2]}")
print("largest value", np.max(p_preferred), "smallest value", np.min(p_preferred))

two example output vectors:
 [[-2.94 -2.33  2.86 -1.25]
 [ 1.5  -4.28 -7.08 -7.93]]
largest value 8.857447 smallest value -13.404879

If the desired output is a set of probabilities, the output should be processed by a softmax.

sm_preferred = tf.nn.softmax(p_preferred).numpy()
print(f"two example output vectors:\n {sm_preferred[:2]}")
print("largest value", np.max(sm_preferred), "smallest value", np.min(sm_preferred))

two example output vectors:
 [[2.97e-03 5.46e-03 9.75e-01 1.62e-02]
 [9.97e-01 3.08e-03 1.86e-04 8.00e-05]]
largest value 0.99999774 smallest value 1.0387312e-07

To select the most likely category, the softmax is not required. The index of the largest output can be found with np.argmax().

for i in range(5):
    print( f"{p_preferred[i]}, category: {np.argmax(p_preferred[i])}")

[-2.94 -2.33  2.86 -1.25], category: 2
[ 1.5  -4.28 -7.08 -7.93], category: 0
[ 1.02 -2.93 -5.43 -6.26], category: 0
[-2.19  3.48 -1.81 -2.91], category: 1
[-2.32 -6.31  3.67 -4.91], category: 2

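The argmax function:
- y = f(t) is the usual function notation: given a value t, f(t) assigns a value to y.
- y = max f(t) means that y is the largest output among all values of f(t).
- y = argmax f(t) means that y is the argument t at which f(t) produces its largest output. For example:
Suppose there is a function f(t) where t ranges over {0,1,2}, with f(t=0) = 10, f(t=1) = 20 and f(t=2) = 7. The corresponding values of y are:
- y = max f(t) = 20
- y = argmax f(t) = 1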
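
The same example can be written in NumPy. This is a small sketch: the array `f` holds the three values from the example above, and `logits` reuses the first output vector printed earlier. Because the exponential is monotonic, the argmax of the logits matches the argmax of the softmax probabilities, which is why the softmax can be skipped when only the category is needed.

```python
import numpy as np

f = np.array([10, 20, 7])       # f(t) for t = 0, 1, 2
print(np.max(f))                # 20, the largest output
print(np.argmax(f))             # 1, the t that produces it

# softmax preserves the ordering of its inputs, so the argmax is unchanged
logits = np.array([-2.94, -2.33, 2.86, -1.25])
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(np.argmax(logits) == np.argmax(probs))
```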

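3.3 SparseCategoricalCrossentropy or CategoricalCrossentropy

TensorFlow has two possible formats for target values, and the choice of loss determines which one is expected.

SparseCategoricalCrossentropy: expects the target to be an integer corresponding to the index. For example, if there are 10 potential target values, y is between 0 and 9.
CategoricalCrossentropy: expects the target of an example to be one-hot encoded, where the value at the target index is 1 and the other N-1 entries are 0. An example with 10 potential target values, where the target is 2, would be [0,0,1,0,0,0,0,0,0,0].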
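
To make the two target formats concrete, here is a NumPy sketch that converts between them. The variable names are illustrative only, and the conversion uses plain NumPy indexing rather than any TensorFlow utility.

```python
import numpy as np

num_classes = 10
y_sparse = np.array([2, 0, 9])                 # integer targets, the sparse format
y_onehot = np.eye(num_classes)[y_sparse]       # one-hot rows, the categorical format
print(y_onehot[0])

# np.argmax along each row recovers the integer form
y_back = np.argmax(y_onehot, axis=1)
print(y_back)
```

The sparse format stores one integer per example instead of N values, which is one reason it is convenient when the labels are already class indices.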

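4. Numerical stability of softmax

4.1 The problem

When softmax is used as the activation of the output nodes, cross-entropy is generally used as the loss function. The softmax computation can easily overflow when the output-node values are large, and overflow can also occur when computing the cross-entropy. For numerical stability, TensorFlow provides a unified interface that implements softmax and the cross-entropy loss together and handles the numerical instability. When using TensorFlow, this combined interface is generally recommended over applying softmax and the cross-entropy loss separately.

The input to softmax is the output of a linear layer, $z_j = \mathbf{w_j} \cdot \mathbf{x}^{(i)}+b$. These values can be large, and the first step of softmax computes $e^{z_j}$, which can overflow if the number is too large.
For example: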

for z in [500,600,700,800]:
    ez = np.exp(z)
    zs = "{" + f"{z}" + "}"
    print(f"e^{zs} = {ez:0.2e}")
    
e^{500} = 1.40e+217
e^{600} = 3.77e+260
e^{700} = 1.01e+304
e^{800} = inf

Calling the my_softmax function written earlier overflows in the same way, so the entry for 800 comes out as nan:

z_tmp = np.array([[500,600,700,800]])
my_softmax(z_tmp)

4.2 The fix

Numerical stability can be improved by reducing the size of the exponent.
Recall
$e^{a + b} = e^ae^b$
If $b$ had the opposite sign of $a$, this would reduce the size of the exponent. Specifically, if you multiplied the softmax by a fraction:
$a_j = \frac{e^{z_j}}{ \sum_{i=1}^{N}{e^{z_i} }} \frac{e^{-b}}{ {e^{-b}}}$
the exponent would be reduced and the value of the softmax would not change. If $b$ in $e^b$ were the largest value of the $z_j$'s, $max_j(\mathbf{z})$, the exponent would be reduced to its smallest value.
$\begin{align} a_j &= \frac{e^{z_j}}{ \sum_{i=1}^{N}{e^{z_i} }} \frac{e^{-max_j(\mathbf{z})}}{ {e^{-max_j(\mathbf{z})}}} \\ &= \frac{e^{z_j-max_j(\mathbf{z})}}{ \sum_{i=1}^{N}{e^{z_i-max_j(\mathbf{z})} }} \end{align}$
It is customary to write $C=max_j(\mathbf{z})$, since the equation holds for any constant C.

$a_j = \frac{e^{z_j-C}}{ \sum_{i=1}^{N}{e^{z_i-C} }} \quad\quad\text{where}\quad C=max_j(\mathbf{z})\tag{5}$
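
The claim that (5) holds for any constant C can be checked numerically. This is a small sketch with illustrative inputs, kept small so that the unshifted version does not overflow; the `softmax` helper is defined here only for this check.

```python
import numpy as np

def softmax(z):
    ez = np.exp(z)
    return ez / ez.sum()

z = np.array([1.0, 2.0, 3.0, 4.0])    # small illustrative values, so nothing overflows
for C in [0.0, 2.5, np.max(z)]:
    print(C, softmax(z - C))          # same probabilities for every C

shifted = softmax(z - np.max(z))
```

Choosing C as the maximum makes the largest exponent exactly 0, so no term in the sum can overflow.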

If we look at our troublesome example where $\mathbf{z}$ contains 500,600,700,800, $C=max_j(\mathbf{z})=800$

$\mathbf{a}(x)=\dfrac{1}{e^{500-800}+e^{600-800}+e^{700-800}+e^{800-800}}\begin{bmatrix}e^{500-800}\\ e^{600-800}\\ e^{700-800}\\ e^{800-800}\end{bmatrix}=\begin{bmatrix}5.15e-131\\ 1.38e-87\\ 3.72e-44\\ 1.0\end{bmatrix}$

Softmax with improved numerical stability:

def my_softmax_ns(z):
    """Softmax with improved numerical stability."""
    bigz = np.max(z)
    ez = np.exp(z-bigz)              # minimize exponent
    sm = ez/np.sum(ez)
    return sm

Usage:

z_tmp = np.array([500.,600,700,800])
print(tf.nn.softmax(z_tmp).numpy(), "\n", my_softmax_ns(z_tmp))

[5.15e-131 1.38e-087 3.72e-044 1.00e+000] 
 [5.15e-131 1.38e-087 3.72e-044 1.00e+000]
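
`my_softmax_ns` takes a single maximum and a single sum over its whole input, which suits one vector of logits. For a batch where each row is one example, the maximum and the sum would need to be taken per row. The following is a sketch of such a variant; the name `softmax_rows_ns` is hypothetical and not part of the lab.

```python
import numpy as np

def softmax_rows_ns(Z):
    """Numerically stable softmax applied to each row of Z, shape (m, N)."""
    Z = np.asarray(Z, dtype=float)
    C = Z.max(axis=1, keepdims=True)      # one max per row
    ez = np.exp(Z - C)                    # largest exponent in each row is 0
    return ez / ez.sum(axis=1, keepdims=True)

Z = np.array([[500., 600., 700., 800.],
              [  1.,   2.,   3.,   4.]])
A = softmax_rows_ns(Z)
print(A)
print(A.sum(axis=1))                      # each row sums to 1
```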

4.3 Stability of the cross-entropy loss

The loss function associated with Softmax, the cross-entropy loss, is repeated here:
$\begin{equation} L(\mathbf{a},y)=\begin{cases} -log(a_1), & \text{if $y=1$}.\\ &\vdots\\ -log(a_N), & \text{if $y=N$} \end{cases} \end{equation}$

Where y is the target category for this example and $\mathbf{a}$ is the output of a softmax function. In particular, the values in $\mathbf{a}$ are probabilities that sum to one.
Let’s consider a case where the target is two ( $y = 2$ ) and just look at the loss for that case. This will result in the loss being:
$L(\mathbf{a})= -log(a_2)$

Recall that $a_2$ is the output of the softmax function described above, so this can be written:
$L(\mathbf{z})= -log\left(\frac{e^{z_2}}{ \sum_{i=1}^{N}{e^{z_i} }}\right) \tag{6}$
This can be optimized. However, to make those optimizations, the softmax and the loss must be calculated together, as shown in the "preferred" Tensorflow implementation you saw above.

Start from (6) above, the loss for the case of y=2.
Since $log(\frac{a}{b}) = log(a) - log(b)$, (6) can be rewritten:
$L(\mathbf{z})= -\left[log(e^{z_2}) - log \sum_{i=1}^{N}{e^{z_i} }\right] \tag{7}$
The first term can be simplified to just $z_2$ :
$L(\mathbf{z})= -\left[z_2 - log( \sum_{i=1}^{N}{e^{z_i} })\right] = \underbrace{log \sum_{i=1}^{N}{e^{z_i} }}_\text{logsumexp()} -z_2 \tag{8}$
It turns out that the $log \sum_{i=1}^{N}{e^{z_i} }$ term in the above equation is used so often that many libraries provide an implementation. In Tensorflow this is tf.math.reduce_logsumexp(). An issue with this sum is that the exponential in the sum could overflow if $z_i$ is large. To fix this, we might like to subtract $max_j(\mathbf{z})$ from the exponent as we did above, but this will require a bit of work:
$\begin{align} log \sum_{i=1}^{N}{e^{z_i} } &= log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}) + max_j(\mathbf{z}))}} \tag{9}\\ &= log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}))} e^{max_j(\mathbf{z})}} \\ &= log(e^{max_j(\mathbf{z})}) + log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}))}} \\ &= max_j(\mathbf{z}) + log \sum_{i=1}^{N}{e^{(z_i - max_j(\mathbf{z}))}} \end{align}$
Now, the exponential is less likely to overflow. It is customary to say $C=max_j(\mathbf{z})$ since the equation would be correct with any constant C. We can now write the loss equation:

$L(\mathbf{z})= C+ log( \sum_{i=1}^{N}{e^{z_i-C} }) -z_2 \;\;\;\text{where } C=max_j(\mathbf{z}) \tag{10}$
This is a computationally simpler, more stable version of the loss. The above is for an example where the target is y=2, but it generalizes to any target.
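
Equation (10) translates into a few lines of NumPy. This is a sketch under stated assumptions: the function name `stable_loss` is hypothetical, and the target `y` is a 0-based index, whereas the derivation above numbers the classes from 1.

```python
import numpy as np

def stable_loss(z, y):
    """Equation (10): C + log(sum(exp(z - C))) - z[y], with C = max(z)."""
    z = np.asarray(z, dtype=float)
    C = np.max(z)
    return C + np.log(np.sum(np.exp(z - C))) - z[y]

z = np.array([500., 600., 700., 800.])
print(stable_loss(z, 3))    # target is the largest logit, loss is close to 0
print(stable_loss(z, 2))    # target is 100 below the largest logit, loss is close to 100

# For small logits the stable form agrees with the direct form -log(softmax(z)[y])
z_small = np.array([1.0, 2.0, 3.0, 4.0])
direct = -np.log(np.exp(z_small[1]) / np.sum(np.exp(z_small)))
print(np.isclose(direct, stable_loss(z_small, 1)))
```

Computing the softmax first and then taking its log would fail on the first input, because $e^{800}$ overflows before the log is ever applied.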

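5. Quiz

Note that in the second approach the final output layer uses a linear activation (equivalent to no activation).

Use the Adam optimizer.

In a convolutional neural network, a single node reuses multiple input values.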
