[MindSpore] [Deep Learning Step by Step] (1-2) Softmax Regression

武陵入

Last modified 2022-10-15 23:09:33

First published 2022-10-15 00:05:25

Article link: https://blog.csdn.net/muronglengjing/article/details/127329856

Softmax Regression

Softmax

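We all know that a probability lies between 0 and 1. In Part 1, however, the linear function takes values over an unbounded range when its domain is not restricted. Is there a function that can compress that range into the interval from 0 to 1? Let us first look at the Sigmoid function.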

$Sigmoid(x)=\frac1{1+e^{-x}}=\frac{e^x}{1+e^x}$

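Clearly, this function never exceeds 1 and never falls below 0, so we can treat its value as a probability in binary classification. The Softmax function is the corresponding probability function for multiclass classification; its expression is shown below.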

$Softmax(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}}$
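The link between the two formulas can be checked numerically. The sketch below (the helper names `sigmoid` and `softmax_1d` are chosen for illustration) compares Sigmoid(x) with the first Softmax output for the two scores (x, 0); the two should agree, because $\frac{e^x}{e^x+e^0}=\frac{e^x}{1+e^x}$.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax_1d(z):
    """Softmax of a 1-D score vector."""
    e = np.exp(z - np.max(z))  # shifting by the max does not change the result
    return e / e.sum()

xs = np.array([-3.0, 0.0, 2.5])
# Probability of the first class when the two scores are (x, 0)
two_class = np.array([softmax_1d(np.array([x, 0.0]))[0] for x in xs])

print(sigmoid(xs))
print(two_class)
```

Both printed vectors should match, and every value stays strictly between 0 and 1.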

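We can compute the Softmax of a matrix with the code below. The scores are laid out with one class per row and one sample per column, which is the layout produced by the network in the next section, so the normalization runs along axis 0. To avoid overflow, we first subtract each column's maximum from that column; by the properties of Softmax, the result is unchanged. To avoid a division-by-zero error, we add a very small number to the sum.

import numpy as np

def softmax(x):
    # x has shape (num_classes, num_samples); work on a shifted copy so the caller's array is untouched
    x = x - np.max(x, axis=0, keepdims=True)
    x_exp = np.exp(x)
    # sum over the class axis; the small constant guards against division by zero
    partition = np.sum(x_exp, axis=0, keepdims=True) + 1e-10
    return x_exp / partition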
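To see why the maximum is subtracted, here is a minimal sketch that compares a direct translation of the formula with a shifted version. It assumes one sample per column; the names `naive_softmax` and `stable_softmax` are hypothetical and exist only for this comparison.

```python
import numpy as np

def naive_softmax(z):
    """Direct translation of the formula; overflows for large scores."""
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def stable_softmax(z):
    """Subtract each column's maximum before exponentiating."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

# Three classes (rows), two samples (columns); the second column is the first shifted by 999
z = np.array([[1.0, 1000.0],
              [2.0, 1001.0],
              [3.0, 1002.0]])

with np.errstate(over='ignore', invalid='ignore'):
    naive = naive_softmax(z)   # exp(1000) overflows to inf, so the second column becomes nan
stable = stable_softmax(z)     # both columns are finite

print(naive)
print(stable)
```

Since the second column differs from the first only by a constant shift, the stable version returns the same probabilities for both columns.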

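Fully connected network

Building on Part 1, we apply some processing to the output of the linear network, namely

$y = f_{activation}(WX^T+B)$

which gives a fully connected network. Here $f_{activation}(X)$ is called the activation function. It can be linear or nonlinear; a nonlinear activation turns the originally linear output into a nonlinear one, so the network can fit more kinds of functions.

If ReLU is chosen as the activation function, the network can be called a fully connected neural network. Here we choose Softmax as the activation function, so what we get is just an ordinary nonlinear network: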

$Z=WX^T+B$

$Y = Softmax(Z)$

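def softmax_net(x, w, b):
    # x: (batch, num_input), w: (num_input, num_output), b: (num_output, 1)
    # z has one row per class and one column per sample: (num_output, batch)
    z = np.matmul(w.T, x.T) + b
    return softmax(z)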
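The shapes involved in this forward pass are easy to get wrong, so here is a small self-contained sketch with toy sizes (4 samples, 6 inputs, 3 classes, all chosen only for illustration). It assumes that Softmax is taken over the class axis, so each column of the output is one sample's probability distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_input, num_output = 4, 6, 3              # toy sizes, chosen for illustration

x = rng.random((batch, num_input))                  # one sample per row
w = rng.normal(0, 0.01, (num_input, num_output))    # weights
b = np.zeros((num_output, 1))                       # bias, broadcast over the batch

z = w.T @ x.T + b                                   # scores, shape (num_output, batch)
e = np.exp(z - z.max(axis=0, keepdims=True))
y_hat = e / e.sum(axis=0, keepdims=True)            # Softmax over the class axis

print(z.shape)            # (3, 4)
print(y_hat.sum(axis=0))  # every column sums to 1
```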

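Cross-entropy loss function

For multiclass problems we generally use the cross-entropy loss, mainly because its logarithm pairs naturally with exponential-type functions such as Softmax. Its expression is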

$loss(\hat y, y)=-\sum_{j=1}^ny_j\ln \hat y_j$

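def cross_entropy(y_hat, y):
    # y_hat: (num_classes, batch) Softmax outputs; y: (1, batch) integer labels
    # pick the predicted probability of the correct class for every sample
    o = y_hat[y[0], range(y_hat.shape[1])]
    # the small constant keeps log() finite when a probability is 0
    return -np.log(o + 1e-5)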
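The indexing trick used above is equivalent to evaluating the full sum with one-hot targets. The following sketch, with made-up probabilities for 3 classes and 2 samples, computes the loss both ways.

```python
import numpy as np

# Softmax outputs for 3 classes (rows) and 2 samples (columns); values are made up
y_hat = np.array([[0.7, 0.1],
                  [0.2, 0.3],
                  [0.1, 0.6]])
labels = np.array([0, 2])                 # correct class of each sample

# One-hot targets with the same (classes, samples) layout
one_hot = np.zeros_like(y_hat)
one_hot[labels, np.arange(labels.size)] = 1.0

full_sum = -(one_hot * np.log(y_hat)).sum(axis=0)        # the formula, term by term
picked = -np.log(y_hat[labels, np.arange(labels.size)])  # only the correct entry

print(full_sum)
print(picked)
```

Both results equal $-\ln 0.7$ and $-\ln 0.6$, because every term with $y_j=0$ drops out of the sum.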

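Gradient descent

Because we use the cross-entropy loss with one-hot encoding, when $i$ is the correct class we have $y_i=1$ and $y_j=0$ for $j\neq i$, so only the term for the correct class remains and the loss reduces to $loss(\hat y, y)=-\ln \hat y_i$. Taking the gradient with respect to $\hat y_i$ gives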

$\frac{\partial loss}{\partial \hat y_i}=-\frac{1}{\hat y_i}$

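Let the sum of the exponentials of all incorrect classes be $\sum_{j \neq i} e^{z_j}=A$. By the Softmax formula, $\hat y_i=\frac{e^{z_i}}{e^{z_i}+A}$. Differentiating with respect to $z_i$ gives the gradient for the correct class: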

$\frac{\partial \hat y_i}{\partial z_i}=\frac{e^{z_i}(e^{z_i}+A)-e^{z_i}e^{z_i}}{(e^{z_i}+A)^2}=\frac{e^{z_i}}{e^{z_i}+A}\frac{A}{e^{z_i}+A}=\hat y_i (1- \hat y_i)$

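Next we differentiate with respect to the score $z_k$ of an incorrect class $k$. Let the sum of all other exponentials, including that of the correct class, be $\sum_{j \neq k} e^{z_j}=B$, so that $\hat y_i=\frac{e^{z_i}}{B+e^{z_k}}$. Differentiating in the same way gives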

$\frac{\partial \hat y_i}{\partial z_k}=-\frac{e^{z_i}}{B+e^{z_k}}\frac{e^{z_k}}{B+e^{z_k}}=-\hat y_i \hat y_k$

By the chain rule we obtain the gradient of the cross-entropy loss with respect to the Softmax input $z$. The result follows a simple pattern: for the correct class, subtract one from the Softmax output, and leave the other entries unchanged. We denote the result by $\hat y'$.

$\frac{\partial loss}{\partial z_i}=\frac{\partial loss}{\partial \hat y_i}\frac{\partial \hat y_i}{\partial z_i}=\hat y_i-1$

$\frac{\partial loss}{\partial z_k}=\frac{\partial loss}{\partial \hat y_i}\frac{\partial \hat y_i}{\partial z_k}=\hat y_k$
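The result of this derivation can be sanity-checked with finite differences. The sketch below uses an arbitrary score vector and an arbitrary correct class (both made up for the check) and compares the analytic gradient, the Softmax output with 1 subtracted at the correct class, against a numerical estimate.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(z, i):
    """Cross-entropy of Softmax(z) against correct class i."""
    return -np.log(softmax_1d(z)[i])

z = np.array([0.5, -1.2, 2.0, 0.3])   # arbitrary scores
i = 2                                  # arbitrary correct class

# Analytic gradient: Softmax output with 1 subtracted at the correct class
analytic = softmax_1d(z)
analytic[i] -= 1.0

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (loss(z + step, i) - loss(z - step, i)) / (2 * eps)

print(analytic)
print(numeric)
```

The two vectors should agree to several decimal places, and the entries of the analytic gradient sum to zero because the Softmax outputs sum to 1.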

Finally, the gradients of the cross-entropy loss with respect to the weights and bias follow easily.

$\frac{\partial z}{\partial w}=x, \frac{\partial z}{\partial b}=1$

$\frac{\partial loss}{\partial w}=\hat y' x, \frac{\partial loss}{\partial b}=\hat y'$

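Following the derivation above, we use numpy to compute the gradients of the cross-entropy loss (through Softmax) with respect to the weights and bias, and use them to update both.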

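def softmax_sgd(y_hat, x, y, w, b, lr, batch_size):
    # gradient of the cross-entropy loss w.r.t. z: subtract 1 at the correct class
    # (this modifies y_hat in place)
    y_hat[y[0], range(y_hat.shape[1])] -= 1

    # x has shape (1, batch, num_input), so the product is (1, num_output, num_input)
    grad_w = np.matmul(y_hat, x).squeeze(axis=0).T
    new_w = w - lr * grad_w / batch_size

    # sum the per-sample bias gradients over the batch
    grad_b = np.sum(y_hat, axis=1, keepdims=True)
    new_b = b - lr * grad_b / batch_size

    return new_w, new_b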
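The matrix product in the weight gradient is a compact way of summing the per-sample gradients $\hat y' x$ over the batch. A minimal sketch with random stand-in values (toy sizes, and a plain 2-D `x` without the extra leading axis, both assumptions made for illustration) shows the equivalence.

```python
import numpy as np

rng = np.random.default_rng(1)
num_output, batch, num_input = 3, 5, 4          # toy sizes

y_prime = rng.normal(size=(num_output, batch))  # stands in for the Softmax output minus the one-hot labels
x = rng.random((batch, num_input))              # one sample per row

# Vectorized: one matrix product over the whole batch
grad_w_matmul = (y_prime @ x).T                 # shape (num_input, num_output)

# Explicit: accumulate the per-sample outer products
grad_w_loop = np.zeros((num_input, num_output))
for n in range(batch):
    grad_w_loop += np.outer(x[n], y_prime[:, n])

# Bias gradient: sum over the batch axis, shape (num_output, 1)
grad_b = y_prime.sum(axis=1, keepdims=True)

print(np.allclose(grad_w_matmul, grad_w_loop))
```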

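Evaluating accuracy

We take the index of the largest Softmax output for each sample as the predicted label, so we only need to compare it with the true labels to count how many samples are classified correctly.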

def evaluate_accuracy(y_hat, y):
    # the predicted label is the row index of the largest Softmax output in each column
    return np.sum(np.argmax(y_hat, axis=0) == y)
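As a worked example, the sketch below applies the same argmax comparison to made-up outputs for 3 classes and 4 samples, with one sample per column.

```python
import numpy as np

# Made-up Softmax outputs: 3 classes (rows), 4 samples (columns)
y_hat = np.array([[0.7, 0.1, 0.2, 0.5],
                  [0.2, 0.3, 0.5, 0.4],
                  [0.1, 0.6, 0.3, 0.1]])
labels = np.array([0, 2, 2, 0])

predicted = np.argmax(y_hat, axis=0)          # [0, 2, 1, 0]
num_right = int(np.sum(predicted == labels))  # the third sample is wrong, so 3
accuracy = num_right / labels.size            # 0.75

print(predicted, num_right, accuracy)
```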

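The MNIST dataset

Now we test on a real dataset.

mindvision is a toolkit for mindspore and can be installed with pip install mindvision.

MNIST is a dataset of handwritten digits. We can obtain it through Mnist in mindvision.classification.dataset; the following two calls download it to the specified location.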

    from mindvision.classification.dataset import Mnist

    Mnist(path='/shareData/mindspore-dataset/Mnist',
          split="train", download=True).download_dataset()
    Mnist(path='/shareData/mindspore-dataset/Mnist',
          split="test", download=True).download_dataset()

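Then, following the approach described in the post "Reading MNIST data with NumPy" (linked in the docstring below), we read the data and convert it to numpy arrays with the following code (this code lives in util.datasets).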

import os
import struct

import numpy as np

def load_mnist(path, split='train', reshape=False):
    """
    reference:
    https://blog.csdn.net/justidle/article/details/103146658
    """
    labels_path = os.path.join(path, f'{split}-labels-idx1-ubyte')
    images_path = os.path.join(path, f'{split}-images-idx3-ubyte')
    with open(labels_path, 'rb') as lb_path:
        # 8-byte header: magic number and number of labels (big-endian)
        magic, n = struct.unpack('>II', lb_path.read(8))
        labels = np.fromfile(lb_path, dtype=np.uint8)

    with open(images_path, 'rb') as img_path:
        # 16-byte header: magic number, number of images, rows, columns
        magic, num, rows, cols = struct.unpack('>IIII', img_path.read(16))
        if reshape:
            images = np.fromfile(img_path, dtype=np.uint8).reshape(len(labels), 28, 28, 1)
        else:
            images = np.fromfile(img_path, dtype=np.uint8).reshape(len(labels), 28 * 28)
    return images, labels
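The header parsing can be tried without downloading anything. The sketch below builds a tiny label file in memory (a made-up file with 5 labels, using the label-file magic number 2049) and decodes it with the same `struct.unpack` format; `np.frombuffer` stands in for `np.fromfile` because the data lives in a bytes object rather than on disk.

```python
import struct
import numpy as np

# A tiny in-memory label file: magic number, item count, then one byte per label
raw = struct.pack('>II', 2049, 5) + bytes([7, 2, 1, 0, 4])

# '>II' reads two big-endian unsigned 32-bit integers from the 8-byte header
magic, n = struct.unpack('>II', raw[:8])
# The labels start right after the header
labels = np.frombuffer(raw, dtype=np.uint8, offset=8)

print(magic, n, labels)
```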

Then the data can be loaded with the following code.

    features_t, labels_t = load_mnist('/shareData/mindspore-dataset/Mnist/train')
    features_v, labels_v = load_mnist('/shareData/mindspore-dataset/Mnist/test', split='t10k')

To display the data, we can use the Image module from PIL, i.e. from PIL import Image, and then show the images with the following code.

    for i in range(10):
        im = features_v[i, :].reshape(28, 28)
        im = Image.fromarray(im)
        im.show()

We normalize the image data, rescaling it from the range 0-255 to the range 0-1.

    features_t = features_t.astype(np.float64) / 255
    features_v = features_v.astype(np.float64) / 255

Batching

The batching here is essentially the same as in Part 1; the only difference is how the indexing is done.

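def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    np.random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        # index array of shape (1, batch): the leading axis is kept because the rest of
        # the code expects x of shape (1, batch, 784) and y of shape (1, batch)
        batch_indices = np.array(indices[i:min(i + batch_size, num_examples)])[np.newaxis, :]
        yield features[batch_indices, :], labels[batch_indices]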
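For comparison, here is a minimal sketch of a shuffled mini-batch iterator on toy data. Unlike the function above, this hypothetical variant (`iterate_minibatches`) yields plain arrays without the extra leading axis; the point is only to show that shuffled batches cover every sample exactly once and that the last batch may be smaller.

```python
import numpy as np

def iterate_minibatches(batch_size, features, labels, rng):
    """Yield shuffled (features, labels) mini-batches that cover every sample once."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]   # slicing clips at the end automatically
        yield features[idx], labels[idx]

rng = np.random.default_rng(0)
features = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features each
labels = np.arange(10)                    # label n belongs to sample n

batches = list(iterate_minibatches(4, features, labels, rng))
sizes = [len(y) for _, y in batches]      # [4, 4, 2]
seen = np.sort(np.concatenate([y for _, y in batches]))

print(sizes, seen)
```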

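Training procedure

First, we set the model's hyperparameters. (Pay attention to the learning rate: one that is too high prevents the model from learning. When I set it to 0.1 earlier, the accuracy barely changed and the loss actually kept rising, which left me baffled.)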

    batch_size = 256
    lr = 0.0001
    epochs = 10

    net = softmax_net
    loss = cross_entropy
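The general effect of a learning rate that is too large can be illustrated on a much simpler problem. The sketch below is not the Softmax model: it minimizes the toy function $f(w)=w^2$ (an assumption made purely for illustration) with two different learning rates.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small = gradient_descent(lr=0.1)   # each step multiplies w by 0.8, so w shrinks toward 0
large = gradient_descent(lr=1.1)   # each step multiplies w by -1.2, so |w| keeps growing

print(small, large)
```

With the small rate the iterate approaches the minimum; with the large one every step overshoots and the distance from the minimum grows. Which rates are too large depends on the loss and the scale of the data, so the threshold for this toy function says nothing about the right value for MNIST.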

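Next, we initialize the weight and bias matrices.

    num_input, num_output = 28 * 28, 10  # 784 pixels per image, 10 digit classes
    w = np.random.normal(0, 0.01, (num_input, num_output))
    b = np.zeros((num_output, 1))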

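Finally, the training code. It is similar to Part 1; the only difference is that the accuracy is accumulated inside the inner loop, which avoids recomputing the same values and wasting time.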

    for epoch in range(epochs):
        train_loss, right, total = 0, 0, 0
        for x, y in data_iter(batch_size, features_t, labels_t):
            # x is (1, batch, 784) and y is (1, batch); drop the leading axis for the forward pass
            y_hat_t = net(x.squeeze(axis=0), w, b)
            train_loss += loss(y_hat_t, y).sum()

            # accumulate the accuracy statistics batch by batch
            right += evaluate_accuracy(y_hat_t, y)
            total += y.shape[1]

            # softmax_sgd modifies y_hat_t in place, so it must come last
            w, b = softmax_sgd(y_hat_t, x, y, w, b, lr, batch_size)

        train_acc = right / total
        y_hat_v = net(features_v, w, b)
        valid_acc = evaluate_accuracy(y_hat_v, labels_v) / len(labels_v)

        print(
            f'In epoch {epoch + 1}, loss is {float(train_loss / len(labels_t)):f}, train accuracy is {train_acc}, valid accuracy is {valid_acc}')
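The whole procedure can also be tried end to end without MNIST. The following is a minimal sketch under stated assumptions, not the training script above: it uses synthetic 2-D data (three well-separated clusters), plain 2-D batches without the extra leading axis, Softmax taken over the class axis, and hypothetical helper names (`forward`, `mean_loss`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: three well-separated 2-D clusters, 100 points each (for illustration only)
centers = np.array([[-4.0, -4.0], [4.0, -4.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 100)
features = centers[labels] + 0.5 * rng.normal(size=(300, 2))

def forward(x, w, b):
    """Linear scores followed by Softmax over the class axis; output is (classes, samples)."""
    z = w.T @ x.T + b
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def mean_loss(y_hat, y):
    """Average cross-entropy of the correct-class probabilities."""
    return -np.log(y_hat[y, np.arange(y.size)] + 1e-5).mean()

w = np.zeros((2, 3))
b = np.zeros((3, 1))
lr, epochs, batch_size = 0.02, 20, 32

first_loss = mean_loss(forward(features, w, b), labels)
for epoch in range(epochs):
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        x, y = features[idx], labels[idx]
        grad = forward(x, w, b)
        grad[y, np.arange(y.size)] -= 1.0            # Softmax output minus one-hot labels
        w -= lr * (grad @ x).T / len(idx)            # average weight gradient over the batch
        b -= lr * grad.sum(axis=1, keepdims=True) / len(idx)

y_hat = forward(features, w, b)
last_loss = mean_loss(y_hat, labels)
accuracy = np.mean(np.argmax(y_hat, axis=0) == labels)
print(f'loss {first_loss:.3f} -> {last_loss:.3f}, accuracy {accuracy:.3f}')
```

Because the clusters are far apart, the loss should fall from its starting value of about $\ln 3$ and the accuracy should end up high. MNIST is a much harder problem, so these numbers do not carry over.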

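Results

The output below shows that the accuracy ends up at around 65% after 10 epochs, which is not very high. The loss stays close to $\ln 256 \approx 5.545$ throughout, which is consistent with Softmax outputs that were normalized across the 256 samples of a batch rather than across the 10 classes (in which case the loss would start near $\ln 10 \approx 2.303$).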

In epoch 1, loss is 5.538449, train accuracy is 0.10166666666666667, valid accuracy is 0.133
In epoch 2, loss is 5.513865, train accuracy is 0.18388333333333334, valid accuracy is 0.2307
In epoch 3, loss is 5.489963, train accuracy is 0.2880333333333333, valid accuracy is 0.3373
In epoch 4, loss is 5.466770, train accuracy is 0.38366666666666666, valid accuracy is 0.4284
In epoch 5, loss is 5.444432, train accuracy is 0.46321666666666667, valid accuracy is 0.4989
In epoch 6, loss is 5.422886, train accuracy is 0.5247, valid accuracy is 0.5511
In epoch 7, loss is 5.402007, train accuracy is 0.5696333333333333, valid accuracy is 0.5903
In epoch 8, loss is 5.381965, train accuracy is 0.6034666666666667, valid accuracy is 0.6209
In epoch 9, loss is 5.362727, train accuracy is 0.62905, valid accuracy is 0.6408
In epoch 10, loss is 5.344362, train accuracy is 0.6508333333333334, valid accuracy is 0.6578