Deep Learning Notes: A Detailed Explanation of Softmax Regression

Purepisces

Published 2024-07-21 07:37:29

Article link: https://blog.csdn.net/weixin_53765658/article/details/140582445

Feel free to star my Machine Learning Blog: https://github.com/purepisces/Wenqing-Machine_Learning_Blog. If you have starred it and have questions, you are welcome to contact me at any time. Thank you!

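Softmax Regression

Softmax regression, also called multinomial logistic regression, is a generalization of logistic regression to multi-class classification. Logistic regression handles binary classification, whereas softmax regression handles problems with more than two classes.

Key points of softmax regression:

Multi-class classification: softmax regression is used when the dependent variable can take more than two classes, for example classifying the digits 0-9 in the MNIST dataset.
Generalization of logistic regression: logistic regression predicts the probabilities of two classes, whereas softmax regression predicts the probabilities of all possible classes.
Softmax function: the core of softmax regression is the softmax function, which converts raw prediction scores (logits) into probabilities. It is defined as: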

$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$

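where $\mathbf{z}$ is the input vector of raw scores (logits), $K$ is the number of classes, and $\sigma(\mathbf{z})_i$ is the probability that the input belongs to class $i$.
Loss function: softmax regression uses the cross-entropy loss, which measures how well a classifier's predicted probabilities (values between 0 and 1) match the true labels. The cross-entropy loss for a single sample is: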

$\ell_{\mathrm{softmax}}(\mathbf{z}, y) = -\log(\sigma(\mathbf{z})_y)$

where $y$ is the true class label.
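
As a minimal sketch of the two definitions above, the following computes the softmax probabilities and the cross-entropy loss for one sample. The logit values and the label are made up for illustration, and classes are 0-indexed as is usual in NumPy.

```python
import numpy as np

def softmax(z):
    """Map a 1D vector of logits to a probability vector."""
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for K = 3 classes; the true class is index 0
z = np.array([2.0, 1.0, 0.1])
y = 0

p = softmax(z)
loss = -np.log(p[y])  # cross-entropy loss for this single sample

print("probabilities:", p)
print("sum of probabilities:", p.sum())
print("loss:", loss)
```

The probabilities are all positive and sum to 1, and the loss shrinks toward 0 as the probability assigned to the true class approaches 1.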

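Differences from linear regression and logistic regression:

Linear regression: used to predict a continuous dependent variable. It fits a linear relationship between the independent variables and the dependent variable.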

$h(\mathbf{x}) = \mathbf{x}^\top \mathbf{\theta}$
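Logistic regression: used for binary classification. It passes a linear prediction through the logistic (sigmoid) function to model the probability of a binary outcome.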

$P(y=1|\mathbf{x}) = \sigma(\mathbf{x}^\top \mathbf{\theta}) = \frac{1}{1 + e^{-\mathbf{x}^\top \mathbf{\theta}}}$
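Softmax regression: extends logistic regression to multiple classes. It uses the softmax function to convert the linear predictions into a probability for each class.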

$P(y=i|\mathbf{x}) = \sigma(\mathbf{z})_i = \frac{e^{\mathbf{x}^\top \mathbf{\theta}_i}}{\sum_{j=1}^{K} e^{\mathbf{x}^\top \mathbf{\theta}_j}}$

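In short, softmax regression is logistic regression designed for multi-class classification. It uses the softmax function to predict the probability of each class, which allows an instance to be assigned to one of several classes.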
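
The relationship between the two models can be checked numerically. With $K = 2$, dividing the numerator and denominator of the softmax by $e^{\mathbf{x}^\top \mathbf{\theta}_1}$ gives a sigmoid applied to $\mathbf{x}^\top (\mathbf{\theta}_1 - \mathbf{\theta}_0)$. The sketch below assumes illustrative values for the input and for the two parameter vectors (named theta_0 and theta_1 here, with 0-indexed classes).

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def softmax(z):
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical input and per-class parameter vectors (illustrative values)
x = np.array([0.5, -1.2, 2.0])
theta_0 = np.array([0.1, 0.4, -0.3])
theta_1 = np.array([-0.2, 0.7, 0.5])

# Softmax regression with K = 2: one logit per class
logits = np.array([x @ theta_0, x @ theta_1])
p_softmax = softmax(logits)[1]

# Logistic regression using the difference of the two parameter vectors
p_logistic = sigmoid(x @ (theta_1 - theta_0))

print(p_softmax, p_logistic)
```

Both expressions give the same probability for class 1, which is why two-class softmax regression and logistic regression describe the same family of models.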

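Softmax (a.k.a. cross-entropy) loss:

Implement the softmax (a.k.a. cross-entropy) loss in the softmax_loss() function in the src/simple_ml.py file. Recall (hopefully this is review, but we will also cover it in the lecture on September 1) that for a multi-class output that can take on values $y \in \{1,\ldots,k\}$, the softmax loss takes as input a vector of logits $z \in \mathbb{R}^k$ and the true class $y \in \{1,\ldots,k\}$, and returns a loss defined by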

$\begin{equation} \ell_{\mathrm{softmax}}(z, y) = \log\sum_{i=1}^k \exp z_i - z_y. \end{equation}$

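Note that, as described in its docstring, softmax_loss() takes a 2D array of logits (i.e. the $k$-dimensional logits for a batch of different samples) together with the corresponding 1D array of true labels, and should output the average softmax loss over the entire batch. To do this correctly, you should not use any loops, but do all the computation with numpy vectorized operations (to set expectations, our reference solution consists of a single line of code).

Note that for "real" implementations of the softmax loss you would want to scale the logits to prevent numerical overflow, but we won't worry about that here (the rest of the assignment will work fine even if you don't worry about this).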

import numpy as np


def softmax_loss(Z, y):
    """ Return softmax loss.  Note that for the purposes of this assignment,
    you don't need to worry about "nicely" scaling the numerical properties
    of the log-sum-exp computation, but can just compute this directly.

    Args:
        Z (np.ndarray[np.float32]): 2D numpy array of shape
            (batch_size, num_classes), containing the logit predictions for
            each class.
        y (np.ndarray[np.uint8]): 1D numpy array of shape (batch_size, )
            containing the true label of each example.

    Returns:
        Average softmax loss over the sample.
    """
    ### BEGIN YOUR CODE
    # Per-sample loss: log(sum_i exp(z_i)) - z_y

    # Compute the log of the sum of exponentials of logits for each sample
    log_sum_exp = np.log(np.sum(np.exp(Z), axis = 1))
    # Extract the logits corresponding to the true class for each sample
    # np.arange(Z.shape[0]) generates array [0, 1, 2, ..., batch_size-1]
    # Z[np.arange(Z.shape[0]), y] = Z[[row_indices], [col_indices]]
    # This selects the logits Z[i, y[i]] for each i which is each row
    correct_class_logits = Z[np.arange(Z.shape[0]), y]
    losses = log_sum_exp - correct_class_logits
    return np.mean(losses)
    ### END YOUR CODE

Example:

import numpy as np

# Logits for a batch of 3 samples and 4 classes
Z = np.array([[2.0, 1.0, 0.1, 0.5],
              [1.5, 2.1, 0.2, 0.7],
              [1.1, 1.8, 0.3, 0.4]])

# True labels for the 3 samples
y = np.array([0, 1, 2])

# np.arange(Z.shape[0]) creates an array [0, 1, 2]
row_indices = np.arange(Z.shape[0])
print("Row indices:", row_indices)  # Output: [0 1 2]

# y is [0, 1, 2]
print("True class labels:", y)  # Output: [0 1 2]

# Advanced indexing: Z[np.arange(Z.shape[0]), y] selects Z[0, 0], Z[1, 1], Z[2, 2]
correct_class_logits = Z[row_indices, y]

print("Correct class logits:", correct_class_logits)
# Output: Correct class logits: [2.  2.1 0.3]
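
The note above about scaling the logits can be made concrete with the standard log-sum-exp shift: subtracting each row's maximum before exponentiating leaves the loss unchanged but keeps np.exp from overflowing. The function name softmax_loss_stable is hypothetical and is not part of the assignment; this is only a sketch of the idea.

```python
import numpy as np

def softmax_loss_stable(Z, y):
    """Average softmax loss with the row-wise max subtracted before exp.

    Hypothetical helper, not part of the assignment code.
    """
    Z = np.asarray(Z, dtype=np.float64)
    row_max = Z.max(axis=1, keepdims=True)
    # Identity: log sum_i exp(z_i) = m + log sum_i exp(z_i - m)
    log_sum_exp = row_max[:, 0] + np.log(np.exp(Z - row_max).sum(axis=1))
    return np.mean(log_sum_exp - Z[np.arange(Z.shape[0]), y])

# Illustrative logits large enough that np.exp(Z) would overflow to inf
Z_big = np.array([[1000.0, 999.0, 998.0],
                  [-5.0, 0.0, 5.0]])
y_big = np.array([0, 2])
loss_big = softmax_loss_stable(Z_big, y_big)
print(loss_big)
```

For moderate logits the result matches the direct computation; for large logits the direct computation would produce inf while the shifted version stays finite.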

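Mathematical derivation

General formula (one-hot label vector $Y$, sum over the $k$ classes):

$H(Y, P) = -\sum_{i=1}^k Y_i \log(P_i) = H(Y, \sigma(z)) = -\sum\limits_{i=1}^k Y_i \log(\sigma(z)_i)$

Formula for a single training sample with true class $y$:

$H(Y, \sigma(z)) = -\log(\sigma(z)_y) = -\log\left( \frac{\exp(z_y)}{\sum\limits_{j=1}^k \exp(z_j)} \right)$

Simplified formula for a single training sample:

$H(Y, \sigma(z)) = -z_y + \log\left( \sum\limits_{j=1}^k \exp(z_j) \right)$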

Softmax function

The softmax function converts logits (raw scores) into probabilities. For a logits vector $z$ of length $k$, the softmax function $\sigma(z)$ is defined as:

$\sigma(z)_i = \frac{\exp(z_i)}{\sum\limits_{j=1}^k \exp(z_j)}$

where $i = 1, \ldots, k$.

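Cross-entropy loss

The cross-entropy loss measures the discrepancy between the true labels and the predicted probabilities. For a true label vector $Y$ (one-hot encoded) and a predicted probability vector $P$ (the output of the softmax function), the cross-entropy loss $H(Y, P)$ is defined as:

$H(Y, P) = -\sum_{i=1}^k Y_i \log(P_i)$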

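The connection between softmax and cross-entropy

When the softmax function is used as the last layer of a neural network for multi-class classification, the predicted probability vector $P$ is defined as:

$P_i = \sigma(z)_i = \frac{\exp(z_i)}{\sum\limits_{j=1}^k \exp(z_j)}$

The cross-entropy loss becomes:

$H(Y, \sigma(z)) = -\sum_{i=1}^k Y_i \log(\sigma(z)_i)$

For a single training sample whose true class is $y$, $Y$ is a one-hot encoded vector with $Y_y = 1$ and $Y_i = 0$ for $i \neq y$. Every term of the sum except the one for class $y$ therefore vanishes, and the cross-entropy loss simplifies to:

$H(Y, \sigma(z)) = -\log(\sigma(z)_y) = -\log\left( \frac{\exp(z_y)}{\sum\limits_{j=1}^k \exp(z_j)} \right)$

Using the properties of the logarithm, this can be rewritten as:

$H(Y, \sigma(z)) = -\left( \log(\exp(z_y)) - \log\left( \sum\limits_{j=1}^k \exp(z_j) \right) \right)$

$H(Y, \sigma(z)) = -z_y + \log\left( \sum\limits_{j=1}^k \exp(z_j) \right)$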
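
The chain of equalities above can be verified numerically. The sketch below uses made-up logits for $k = 4$ classes and a 0-indexed true class, and evaluates the full cross-entropy sum, the single-term form, and the simplified log-sum-exp form.

```python
import numpy as np

# Hypothetical logits for k = 4 classes; true class index y = 2 (0-indexed)
z = np.array([1.1, 1.8, 0.3, 0.4])
y = 2

# One-hot label vector Y and softmax probabilities P
Y = np.zeros_like(z)
Y[y] = 1.0
P = np.exp(z) / np.exp(z).sum()

cross_entropy = -np.sum(Y * np.log(P))          # H(Y, P)
neg_log_prob = -np.log(P[y])                    # -log sigma(z)_y
simplified = -z[y] + np.log(np.exp(z).sum())    # -z_y + log sum_j exp(z_j)

print(cross_entropy, neg_log_prob, simplified)
```

All three values agree up to floating-point rounding, which is what allows the loss to be computed from the logits directly without building the one-hot vector.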

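Stochastic gradient descent for softmax regression

In this question you will implement stochastic gradient descent (SGD) for (linear) softmax regression. In other words, as discussed in the lecture on September 1, we will consider a hypothesis function that maps $n$-dimensional inputs to $k$-dimensional logits via the function:

$\begin{equation} h(x) = \Theta^T x \end{equation}$

where $x \in \mathbb{R}^n$ is the input and $\Theta \in \mathbb{R}^{n \times k}$ are the model parameters. Given a dataset $\{(x^{(i)} \in \mathbb{R}^n, y^{(i)} \in \{1,\ldots,k\})\}$ for $i=1,\ldots,m$, the optimization problem associated with softmax regression is:

$\begin{equation} \operatorname*{minimize}_{\Theta} \; \frac{1}{m} \sum_{i=1}^m \ell_{\mathrm{softmax}}(\Theta^T x^{(i)}, y^{(i)}) \end{equation}$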

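Recall from class that the gradient of the linear softmax objective is:

$\begin{equation} \nabla_\Theta \ell_{\mathrm{softmax}}(\Theta^T x, y) = x (z - e_y)^T \end{equation}$

where

$\begin{equation} z = \frac{\exp(\Theta^T x)}{1^T \exp(\Theta^T x)} \equiv \operatorname{normalize}(\exp(\Theta^T x)) \end{equation}$

(i.e., $z$ is just the normalized softmax probabilities), and $e_y$ denotes the $y$-th unit basis vector, i.e. a vector of all zeros with a one in the $y$-th position.

We can also write this in the more compact notation discussed in class. Namely, if we let $X \in \mathbb{R}^{m \times n}$ denote a design matrix of some $m$ inputs (either the entire dataset or a minibatch), $y \in \{1,\ldots,k\}^m$ the corresponding vector of labels, and overload $\ell_{\mathrm{softmax}}$ to refer to the average softmax loss, then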

$\begin{equation} \nabla_\Theta \ell_{\mathrm{softmax}}(X \Theta, y) = \frac{1}{m} X^T (Z - I_y) \end{equation}$

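where

$\begin{equation} Z = \operatorname{normalize}(\exp(X \Theta)) \quad \text{(normalization applied row-wise)} \end{equation}$

denotes the matrix of softmax probabilities computed from the logits $X \Theta$, and $I_y \in \mathbb{R}^{m \times k}$ is the concatenation of the one-hot bases for the labels in $y$.

Using these gradients, implement the softmax_regression_epoch() function, which runs a single epoch (one pass over the dataset) of SGD using the specified learning rate / step size lr and minibatch size batch. As described in the docstring, your function should modify the theta array in place. After implementing it, run the tests.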

Code:

def softmax_regression_epoch(X, y, theta, lr = 0.1, batch=100):
    """ Run a single epoch of SGD for softmax regression on the data, using
    the step size lr and specified batch size.  This function should modify the
    theta matrix in place, and you should iterate through batches in X _without_
    randomizing the order.

    Args:
        X (np.ndarray[np.float32]): 2D input array of size
            (num_examples x input_dim).
        y (np.ndarray[np.uint8]): 1D class label array of size (num_examples,)
        theta (np.ndarray[np.float32]): 2D array of softmax regression
            parameters, of shape (input_dim, num_classes)
        lr (float): step size (learning rate) for SGD
        batch (int): size of SGD minibatch

    Returns:
        None
    """
    ### BEGIN YOUR CODE
    num_examples = X.shape[0]
    num_classes = theta.shape[1]
 
    for start in range(0, num_examples, batch):
        end = min(start + batch, num_examples)
        X_batch = X[start:end]
        y_batch = y[start:end]

        # Compute the logits
        logits = X_batch @ theta

        # Compute the softmax probabilities
        exp_logits = np.exp(logits)
        probabilities = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

        # Create a one-hot encoded matrix of the true labels
        I_y = np.zeros_like(probabilities)
        I_y[np.arange(y_batch.size), y_batch] = 1

        # Compute the gradient
        gradient = X_batch.T @ (probabilities - I_y) / y_batch.size

        # Update the parameters
        theta -= lr * gradient
   
    ### END YOUR CODE
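
To see the update rule in action, here is a small sketch on synthetic data. The helper names sgd_epoch and mean_loss, the Gaussian-blob data, and the hyperparameters are all assumptions made for illustration; sgd_epoch is a compact restatement of the same minibatch update, not the assignment's reference solution.

```python
import numpy as np

def sgd_epoch(X, y, theta, lr=0.1, batch=100):
    """Compact restatement of the minibatch update; modifies theta in place."""
    for start in range(0, X.shape[0], batch):
        Xb, yb = X[start:start + batch], y[start:start + batch]
        P = np.exp(Xb @ theta)
        P /= P.sum(axis=1, keepdims=True)      # row-wise softmax probabilities
        P[np.arange(yb.size), yb] -= 1.0       # P - I_y
        theta -= lr * (Xb.T @ P) / yb.size

def mean_loss(X, y, theta):
    """Average softmax loss of the linear model X @ theta."""
    Z = X @ theta
    return np.mean(np.log(np.exp(Z).sum(axis=1)) - Z[np.arange(y.size), y])

# Synthetic data (an assumption for illustration): 3 Gaussian blobs in 2D
rng = np.random.default_rng(0)
centers = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0]])
y = rng.integers(0, 3, size=300)
X = centers[y] + 0.5 * rng.standard_normal((300, 2))

theta = np.zeros((2, 3))
loss_before = mean_loss(X, y, theta)   # equals log(3) when theta is all zeros
for _ in range(5):
    sgd_epoch(X, y, theta, lr=0.1, batch=50)
loss_after = mean_loss(X, y, theta)
print(loss_before, loss_after)
```

With theta initialized to zeros every class gets probability 1/3, so the starting loss is $\log 3$; after a few epochs the average loss is lower.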

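Mathematical derivation (gradient of the softmax loss with respect to the parameters)

Softmax function and loss

Given logits $z$ and a true label $y$, the softmax function and the corresponding loss are defined as follows:

Softmax function:

$\sigma(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$

We denote the softmax output (probability) for class $i$ by $p_i = \sigma(z_i)$.

Softmax loss:

The softmax loss for the true class $y$ is:
$\ell_{\mathrm{softmax}}(z, y) = \log \left( \sum_{i=1}^k \exp(z_i) \right) - z_y$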
Gradient derivation

To derive the gradient of the softmax loss with respect to the parameters $\Theta$, we proceed in the following steps:

Gradient of the loss with respect to the logits:

We need to compute $\frac{\partial \ell_{\mathrm{softmax}}}{\partial z_i}$ for each $i$.

For $i = y$ (the true class):

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial z_y} = \frac{\partial}{\partial z_y} \left( \log \left( \sum_{j=1}^k \exp(z_j) \right) - z_y \right) = \frac{\exp(z_y)}{\sum_{j=1}^k \exp(z_j)} - 1 = p_y - 1$

For $i \neq y$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial z_i} = \frac{\partial}{\partial z_i} \log \left( \sum_{j=1}^k \exp(z_j) \right) = \frac{\exp(z_i)}{\sum_{j=1}^k \exp(z_j)} = p_i$

Combining these, we get:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial z_i} = p_i - \delta_{iy}$

where $\delta_{iy}$ is the Kronecker delta, which is 1 if $i = y$ and 0 otherwise.

Gradient with respect to the parameters $\Theta$:

Using the chain rule, the gradient of the loss with respect to $\Theta$ is:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta} = \frac{\partial \ell_{\mathrm{softmax}}}{\partial z} \cdot \frac{\partial z}{\partial \Theta}$

We know that $z = \Theta^T x$, i.e. $z_i = \sum_l \Theta_{li} x_l$, so:

$\frac{\partial z_i}{\partial \Theta_{jk}} = \frac{\partial \left( \sum_l \Theta_{li} x_l \right)}{\partial \Theta_{jk}} = x_j \delta_{ik}$

Therefore, for a single input $x$, the gradient is:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{jk}} = \sum_i \frac{\partial \ell_{\mathrm{softmax}}}{\partial z_i} \frac{\partial z_i}{\partial \Theta_{jk}} = \sum_i (p_i - \delta_{iy}) x_j \delta_{ik}$

Only the $i = k$ term survives, so this simplifies to:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{jk}} = x_j (p_k - \delta_{ky})$

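Matrix form

In matrix form, this becomes:
$\nabla_{\Theta} \ell_{\mathrm{softmax}}(\Theta^T x, y) = x (z - e_y)^T$

where $z = \sigma(\Theta^T x)$ is the vector of softmax probabilities and $e_y$ is the one-hot encoded vector of the true class $y$.

The Kronecker delta $\delta_{ij}$ is a function of two variables (usually integers) that equals 1 if the two variables are equal and 0 otherwise. It is named after the German mathematician Leopold Kronecker. Mathematically it is defined as:

$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$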
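
One way to gain confidence in the derived gradient is to compare it with a finite-difference approximation. The sketch below assumes random illustrative values ($n = 4$ features, $k = 3$ classes, a 0-indexed label) and hypothetical helper names loss and analytic_grad.

```python
import numpy as np

def loss(theta, x, y):
    """Softmax loss for one sample: log sum_i exp(z_i) - z_y with z = theta^T x."""
    z = theta.T @ x
    return np.log(np.exp(z).sum()) - z[y]

def analytic_grad(theta, x, y):
    """Gradient x (p - e_y)^T, an (n, k) matrix."""
    z = theta.T @ x
    p = np.exp(z) / np.exp(z).sum()
    e_y = np.zeros_like(p)
    e_y[y] = 1.0
    return np.outer(x, p - e_y)

# Illustrative random values: n = 4 features, k = 3 classes, 0-indexed label
rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 3))
x = rng.standard_normal(4)
y = 1

# Central finite differences, one entry of theta at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    for k in range(theta.shape[1]):
        step = np.zeros_like(theta)
        step[j, k] = eps
        numeric[j, k] = (loss(theta + step, x, y) - loss(theta - step, x, y)) / (2 * eps)

max_diff = np.abs(numeric - analytic_grad(theta, x, y)).max()
print("max abs difference:", max_diff)
```

The largest entry-wise difference is tiny compared with the gradient entries themselves, which is consistent with the formula $x (z - e_y)^T$.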

An example of $\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta}$

Consider a simple example:

$n = 2$ features,
$k = 3$ classes.

The input vector $x$ and the parameter matrix $\Theta$ are given by:

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$

$\Theta = \begin{pmatrix} \Theta_{11} & \Theta_{12} & \Theta_{13} \\ \Theta_{21} & \Theta_{22} & \Theta_{23} \end{pmatrix}$

Assume the true class $y$ is 2 (i.e. the second class).

Computing the logits $z$

First, compute the logits $z$:

$z = \Theta^T x = \begin{pmatrix} \Theta_{11} & \Theta_{21} \\ \Theta_{12} & \Theta_{22} \\ \Theta_{13} & \Theta_{23} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \Theta_{11} x_1 + \Theta_{21} x_2 \\ \Theta_{12} x_1 + \Theta_{22} x_2 \\ \Theta_{13} x_1 + \Theta_{23} x_2 \end{pmatrix}$

Computing the softmax probabilities $\sigma(z)$

Next, compute the softmax probabilities:

$\sigma(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^k \exp(z_j)}$

Let:

$z_1 = \Theta_{11} x_1 + \Theta_{21} x_2, \quad z_2 = \Theta_{12} x_1 + \Theta_{22} x_2, \quad z_3 = \Theta_{13} x_1 + \Theta_{23} x_2$

Then the softmax probabilities are:

$\sigma(z_1) = \frac{\exp(z_1)}{\exp(z_1) + \exp(z_2) + \exp(z_3)}$

$\sigma(z_2) = \frac{\exp(z_2)}{\exp(z_1) + \exp(z_2) + \exp(z_3)}$

$\sigma(z_3) = \frac{\exp(z_3)}{\exp(z_1) + \exp(z_2) + \exp(z_3)}$

Computing $\delta_{ky}$

Since the true class is $y = 2$ (the second class), the one-hot encoded vector $e_y$ is:

$e_y = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$

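The partial derivatives $\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{jk}}$

We want to compute the partial derivative with respect to each element $\Theta_{jk}$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{jk}} = x_j (\sigma(z_k) - \delta_{ky})$

Let us write each of them out explicitly as symbolic expressions:

For $j = 1$, $k = 1$: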

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{11}} = x_1 (\sigma(z_1) - \delta_{1y}) = x_1 \left(\frac{\exp(z_1)}{\exp(z_1) + \exp(z_2) + \exp(z_3)} - 0\right) = x_1 \left(\frac{\exp(z_1)}{\sum_{j=1}^3 \exp(z_j)}\right)$

For $j = 1$, $k = 2$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{12}} = x_1 (\sigma(z_2) - \delta_{2y}) = x_1 \left(\frac{\exp(z_2)}{\exp(z_1) + \exp(z_2) + \exp(z_3)} - 1\right) = x_1 \left(\frac{\exp(z_2)}{\sum_{j=1}^3 \exp(z_j)} - 1\right)$

For $j = 1$, $k = 3$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{13}} = x_1 (\sigma(z_3) - \delta_{3y}) = x_1 \left(\frac{\exp(z_3)}{\exp(z_1) + \exp(z_2) + \exp(z_3)} - 0\right) = x_1 \left(\frac{\exp(z_3)}{\sum_{j=1}^3 \exp(z_j)}\right)$

For $j = 2$, $k = 1$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{21}} = x_2 (\sigma(z_1) - \delta_{1y}) = x_2 \left(\frac{\exp(z_1)}{\exp(z_1) + \exp(z_2) + \exp(z_3)} - 0\right) = x_2 \left(\frac{\exp(z_1)}{\sum_{j=1}^3 \exp(z_j)}\right)$

For $j = 2$, $k = 2$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{22}} = x_2 (\sigma(z_2) - \delta_{2y}) = x_2 \left(\frac{\exp(z_2)}{\exp(z_1) + \exp(z_2) + \exp(z_3)} - 1\right) = x_2 \left(\frac{\exp(z_2)}{\sum_{j=1}^3 \exp(z_j)} - 1\right)$

For $j = 2$, $k = 3$:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{23}} = x_2 (\sigma(z_3) - \delta_{3y}) = x_2 \left(\frac{\exp(z_3)}{\exp(z_1) + \exp(z_2) + \exp(z_3)} - 0\right) = x_2 \left(\frac{\exp(z_3)}{\sum_{j=1}^3 \exp(z_j)}\right)$

Gradient summary

Putting these together, the gradient with respect to each element of $\Theta$ is:

$\frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta} = \begin{pmatrix} \frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{11}} & \frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{12}} & \frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{13}} \\ \frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{21}} & \frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{22}} & \frac{\partial \ell_{\mathrm{softmax}}}{\partial \Theta_{23}} \end{pmatrix} = \begin{pmatrix} x_1 \left(\frac{\exp(z_1)}{\sum_{j=1}^3 \exp(z_j)}\right) & x_1 \left(\frac{\exp(z_2)}{\sum_{j=1}^3 \exp(z_j)} - 1\right) & x_1 \left(\frac{\exp(z_3)}{\sum_{j=1}^3 \exp(z_j)}\right) \\ x_2 \left(\frac{\exp(z_1)}{\sum_{j=1}^3 \exp(z_j)}\right) & x_2 \left(\frac{\exp(z_2)}{\sum_{j=1}^3 \exp(z_j)} - 1\right) & x_2 \left(\frac{\exp(z_3)}{\sum_{j=1}^3 \exp(z_j)}\right) \end{pmatrix}$

Matrix dimensions example

Let us illustrate this with an example.

Weight matrix $\Theta$:

$\Theta = \begin{pmatrix} \Theta_{11} & \Theta_{12} & \Theta_{13} \\ \Theta_{21} & \Theta_{22} & \Theta_{23} \end{pmatrix}$

Here $\Theta$ is a $2 \times 3$ matrix, for a model with 2 features and 3 classes.

Input vector $x$:

$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$

Softmax probabilities $\sigma(z)$:

$\sigma(z) = \begin{pmatrix} \sigma(z_1) \\ \sigma(z_2) \\ \sigma(z_3) \end{pmatrix}$

This is a $3 \times 1$ vector.

One-hot encoded true class $e_y$:

If the true class $y$ is 2 (i.e. the second class), then:

$e_y = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$

Gradient matrix $\nabla_\Theta \ell_{\mathrm{softmax}}$:

The gradient matrix is:

$\nabla_\Theta \ell_{\mathrm{softmax}} = x (\sigma(z) - e_y)^T$

Computing the outer product $x (\sigma(z) - e_y)^T$:

$\sigma(z) - e_y = \begin{pmatrix} \sigma(z_1) \\ \sigma(z_2) \\ \sigma(z_3) \end{pmatrix} - \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = \begin{pmatrix} \sigma(z_1) \\ \sigma(z_2) - 1 \\ \sigma(z_3) \end{pmatrix}$

The outer product $x (\sigma(z) - e_y)^T$ is:

$x (\sigma(z) - e_y)^T = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \begin{pmatrix} \sigma(z_1) & \sigma(z_2) - 1 & \sigma(z_3) \end{pmatrix} = \begin{pmatrix} x_1 \sigma(z_1) & x_1 (\sigma(z_2) - 1) & x_1 \sigma(z_3) \\ x_2 \sigma(z_1) & x_2 (\sigma(z_2) - 1) & x_2 \sigma(z_3) \end{pmatrix}$

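Updating the weight matrix $\Theta$

Using the gradient matrix, we update the weight matrix $\Theta$ as follows:

$\Theta := \Theta - \eta \nabla_\Theta \ell_{\mathrm{softmax}}$

For our example, if the learning rate $\eta$ is 0.01, the updated weights are: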

$\Theta = \Theta - 0.01 \begin{pmatrix} x_1 \sigma(z_1) & x_1 (\sigma(z_2) - 1) & x_1 \sigma(z_3) \\ x_2 \sigma(z_1) & x_2 (\sigma(z_2) - 1) & x_2 \sigma(z_3) \end{pmatrix}$
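
The symbolic example can be instantiated with numbers. The values of x and Theta below are assumptions chosen purely for illustration; the true class is the second one (index 1 in 0-indexed NumPy) and the learning rate is the 0.01 used above.

```python
import numpy as np

# Illustrative values (assumptions): n = 2 features, k = 3 classes
x = np.array([1.0, 2.0])
Theta = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])
eta = 0.01     # learning rate from the example

z = Theta.T @ x                        # logits, shape (3,)
sigma = np.exp(z) / np.exp(z).sum()    # softmax probabilities
e_y = np.array([0.0, 1.0, 0.0])        # one-hot vector for the second class

grad = np.outer(x, sigma - e_y)        # x (sigma(z) - e_y)^T, shape (2, 3)
Theta_new = Theta - eta * grad

print("gradient:\n", grad)
print("updated Theta:\n", Theta_new)
```

Each row of the gradient sums to zero because the softmax probabilities sum to 1. With the positive inputs chosen here, the column for the true class is negative, so the update increases the weights feeding that class's logit.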

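Conclusion

The gradient matrix $\nabla_\Theta \ell_{\mathrm{softmax}}$ is a separate object from the weight matrix $\Theta$, although it has the same shape. It gives the direction and magnitude of the update needed to reduce the loss during training. Over the course of training, the weight matrix $\Theta$ is updated iteratively using gradient descent or another optimization algorithm.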

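Gradient for a single sample

Recall the gradient for a single sample:

$\nabla_\Theta \ell_{\mathrm{softmax}}(\Theta^T x, y) = x (z - e_y)^T$

where:

$x$ is the input vector of the single sample.
$z$ is the vector of softmax probabilities for that input.
$e_y$ is the one-hot encoded vector of the true class $y$.

Extending to multiple samples (a batch)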

$\nabla_\Theta \ell_{\mathrm{softmax}}(X \Theta, y) = \frac{1}{m} X^T (Z - I_y)$

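When we have multiple samples, we can represent the inputs as a matrix $X$ whose $i$-th row is the input vector $x_i$ of the $i$-th sample. Likewise, we can represent the true labels as a matrix $I_y$ whose $i$-th row is the one-hot encoded vector $e_{y_i}$ of the $i$-th label.

The matrix $Z$ is the matrix of softmax probabilities for all samples, where each row contains the softmax probabilities of the corresponding sample.

Notation and dimensions

Let us define the matrices and their dimensions:

$X \in \mathbb{R}^{m \times n}$: the design matrix, containing $m$ input vectors with $n$ features each.
$\Theta \in \mathbb{R}^{n \times k}$: the weight matrix, mapping $n$ features to $k$ classes.
$Z \in \mathbb{R}^{m \times k}$: the matrix of softmax probabilities for $m$ samples and $k$ classes.
$I_y \in \mathbb{R}^{m \times k}$: the one-hot encoded matrix of true labels for $m$ samples and $k$ classes.

Computing the gradient

The gradient of the average softmax loss over $m$ samples can be computed by summing the per-sample gradients and dividing by $m$: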

$\nabla_\Theta \ell_{\mathrm{softmax}} = \frac{1}{m} \sum_{i=1}^m x_i (z_i - e_{y_i})^T$

In matrix form, this can be written as:

$\nabla_\Theta \ell_{\mathrm{softmax}}(X \Theta, y) = \frac{1}{m} X^T (Z - I_y)$

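Why this formula works

Matrix $X$:
Each row of $X$ is an input vector $x_i$.
Matrix $Z$:
Each row of $Z$ is the vector of softmax probabilities $z_i$ for the corresponding input $x_i$.
Matrix $I_y$:
Each row of $I_y$ is the one-hot encoded vector $e_{y_i}$ of the true class $y_i$.
Matrix multiplication:
$X^T (Z - I_y)$ computes the sum of the outer products $x_i (z_i - e_{y_i})^T$ over all samples in the batch. Dividing by $m$ gives the average gradient.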
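
The equivalence between the matrix product and the sum of per-sample outer products can be checked directly. The batch below is random and purely illustrative ($m = 5$, $n = 4$, $k = 3$, 0-indexed labels).

```python
import numpy as np

# Illustrative random batch: m = 5 samples, n = 4 features, k = 3 classes
rng = np.random.default_rng(0)
m, n, k = 5, 4, 3
X = rng.standard_normal((m, n))
Theta = rng.standard_normal((n, k))
y = rng.integers(0, k, size=m)

# Row-wise softmax probabilities Z and one-hot label matrix I_y
expo = np.exp(X @ Theta)
Z = expo / expo.sum(axis=1, keepdims=True)
I_y = np.zeros((m, k))
I_y[np.arange(m), y] = 1.0

# Matrix form of the average gradient
grad_matrix = X.T @ (Z - I_y) / m

# Explicit average of the per-sample outer products x_i (z_i - e_{y_i})^T
grad_loop = sum(np.outer(X[i], Z[i] - I_y[i]) for i in range(m)) / m

print(np.allclose(grad_matrix, grad_loop))
```

The two results agree, and the matrix form avoids the Python loop, which is why the vectorized version is the one used in the epoch function.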

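Training MNIST with softmax regression

Although it is not part of the tests, now that you have written this code you can also try training a full MNIST linear classifier using SGD. For this you can use the train_softmax() function in the src/simple_ml.py file (we have already written this function for you, so you do not need to write it yourself, but you can take a look to see what it is doing).

You can see how it works using the following code. For reference, as seen below, our implementation runs in about 3 seconds on Colab and achieves a 7.97% error rate.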

def loss_err(h,y):
    """ Helper function to compute both loss and error"""
    # h: (np.ndarray[np.float32]): 2D numpy array of shape (batch_size x num_classes), containing the logit predictions for each class.
    return softmax_loss(h,y), np.mean(h.argmax(axis=1) != y)


def train_softmax(X_tr, y_tr, X_te, y_te, epochs=10, lr=0.5, batch=100,
                  cpp=False):
    """ Example function to fully train a softmax regression classifier """
    # X_tr.shape[1]: the number of features in the training data
    # y_tr.max()+1 : the number of classes
    # weight matrix theta's shape (number of features x number of classes)
    theta = np.zeros((X_tr.shape[1], y_tr.max()+1), dtype=np.float32)
    print("| Epoch | Train Loss | Train Err | Test Loss | Test Err |")
    for epoch in range(epochs):
        if not cpp:
            softmax_regression_epoch(X_tr, y_tr, theta, lr=lr, batch=batch)
        else:
            softmax_regression_epoch_cpp(X_tr, y_tr, theta, lr=lr, batch=batch)
        # Computes the loss and error for the entire training dataset
        # X_tr @ theta ((num_examples x number of features)@(number of features x number of classes)) = (num_examples x num_classes)
        train_loss, train_err = loss_err(X_tr @ theta, y_tr)
        test_loss, test_err = loss_err(X_te @ theta, y_te)
        print("|  {:>4} |    {:.5f} |   {:.5f} |   {:.5f} |  {:.5f} |"\
              .format(epoch, train_loss, train_err, test_loss, test_err))

Explaining np.mean(h.argmax(axis=1) != y)

import numpy as np

# Predicted logits for 5 samples and 3 classes
h = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6],
              [0.7, 0.2, 0.1],
              [0.4, 0.4, 0.2],
              [0.1, 0.6, 0.3]])

# True class labels
y = np.array([1, 2, 0, 1, 2])

# Predicted classes
predicted_classes = h.argmax(axis=1)
print("Predicted classes:", predicted_classes)
# Output: [1 2 0 0 1]

# Comparison of predicted classes with true classes
comparison = predicted_classes != y
print("Comparison (predicted != true):", comparison)
# Output: [False False False  True  True]

# Calculate the mean (classification error rate)
error_rate = np.mean(comparison)
print("Error rate:", error_rate)
# Output: 0.4 since (0 + 0 + 0 + 1 + 1) / 5 = 2 / 5 = 0.4

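Softmax regression as a linear classifier

Let us walk through a concrete example to show why softmax regression (multinomial logistic regression) is considered a linear classifier. We will use a simple two-dimensional dataset so that it is easy to visualize.

Example setup

Suppose we have a dataset with two features $X_1$ and $X_2$ and three classes (A, B and C). We denote the input feature vector by $\mathbf{x} = [X_1, X_2]$.

The softmax regression model

Linear transformation

We have a weight matrix $\Theta$ of shape $(2, 3)$, where each column holds the weights for one class.

Suppose: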

$\Theta = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}$

Logit computation

The logit for each class is computed as follows:

$\mathbf{x} \Theta = [X_1, X_2] \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} = [Z_1, Z_2, Z_3]$

where:

$Z_1 = \theta_{11} X_1 + \theta_{21} X_2$

$Z_2 = \theta_{12} X_1 + \theta_{22} X_2$

$Z_3 = \theta_{13} X_1 + \theta_{23} X_2$

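Softmax function

The softmax function converts these logits into probabilities:

$P(y = i \mid \mathbf{x}) = \frac{e^{Z_i}}{e^{Z_1} + e^{Z_2} + e^{Z_3}}$

Decision boundaries

A decision boundary is the set of points where the classifier is indifferent between two classes, that is, where their logits (and hence their probabilities) are equal. For example:

The boundary between class A and class B is $Z_1 = Z_2$: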

$\theta_{11} X_1 + \theta_{21} X_2 = \theta_{12} X_1 + \theta_{22} X_2$

Rearranging, we get:

$(\theta_{11} - \theta_{12}) X_1 + (\theta_{21} - \theta_{22}) X_2 = 0$

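This is a linear equation describing a straight line in the $X_1$-$X_2$ plane (through the origin, since this model has no bias term).
The boundary between class B and class C is: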

$(\theta_{12} - \theta_{13}) X_1 + (\theta_{22} - \theta_{23}) X_2 = 0$
The boundary between class A and class C is:

$(\theta_{11} - \theta_{13}) X_1 + (\theta_{21} - \theta_{23}) X_2 = 0$
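
A small numeric sketch of the A/B boundary follows. The weight matrix is hypothetical and chosen so that the boundary is easy to read off; predict is an illustrative helper that returns the index of the largest logit (0 for A, 1 for B, 2 for C).

```python
import numpy as np

# Hypothetical weight matrix of shape (2, 3); columns are classes A, B, C
Theta = np.array([[1.0, -1.0, 0.0],
                  [0.5, 0.5, -1.0]])

def predict(points):
    """Predicted class index for each row of points (argmax of the logits)."""
    return (points @ Theta).argmax(axis=1)

# Normal vector of the A/B boundary: (theta_11 - theta_12, theta_21 - theta_22)
w_ab = Theta[:, 0] - Theta[:, 1]      # here (2, 0), so the boundary is X_1 = 0

# A point on the boundary has equal logits for A and B
on_boundary = np.array([0.0, 3.0])
Z = on_boundary @ Theta
print("Z_1 - Z_2 on the boundary:", Z[0] - Z[1])

# Points on either side of the line, in a region where class C does not win
left = np.array([[-1.0, 3.0]])
right = np.array([[1.0, 3.0]])
print("left:", predict(left), "right:", predict(right))
```

Because the softmax is monotonic in each logit, the class with the largest logit is also the class with the largest probability, so the predicted class changes exactly where two linear functions of $\mathbf{x}$ cross. That is the sense in which the classifier is linear.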

Reference:

CMU 10-714 Deep Learning Systems
