Deep Learning from Scratch (4): Error Backpropagation

  • These are reading notes for Deep Learning from Scratch: Theory and Implementation with Python (《深度学习入门 – 基于 Python 的理论与实现》)

Computational Graphs

Solving Problems with Computational Graphs

  • Problem 1: Taro buys 2 apples at 100 yen each at the supermarket, and the consumption tax is 10%. Compute the total payment.

Local Computation

  • A key feature of a computational graph is that the final result is obtained by propagating "local computations". In other words, each node only performs the computation directly related to it and does not need to consider the whole graph (a tiny sketch follows).
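As a tiny sketch of this idea, Problem 1 splits into two local computations; each node only sees its own inputs (the numbers come from the problem statement, the decomposition into two nodes is for illustration):

apple_price, apple_num, tax = 100, 2, 1.1

subtotal = apple_price * apple_num   # node 1: only sees the price and the count -> 200
payment = subtotal * tax             # node 2: only sees the subtotal and the tax -> 220
print(payment)                       # 220 (up to floating-point error)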

Backpropagation

(figure omitted)

Addition Node

(figure omitted: the addition node passes the upstream derivative through unchanged, since $\frac{\partial (x+y)}{\partial x}=\frac{\partial (x+y)}{\partial y}=1$)

Multiplication Node

(figure omitted)

  • Note that the backward pass of a multiplication node needs the input values from the forward pass. Therefore, when implementing a multiplication node, the forward inputs must be stored (see the sketch below).
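A minimal sketch of such a multiplication layer (it follows the book's MulLayer idea; the attribute and variable names are my own rendering):

class MulLayer:
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        # store the forward inputs; backward needs them
        self.x = x
        self.y = y
        return x * y

    def backward(self, dout):
        dx = dout * self.y   # d(xy)/dx = y
        dy = dout * self.x   # d(xy)/dy = x
        return dx, dy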

Branch Node

(figure omitted)

  • The backward pass of a branch node is the sum of the gradients coming from upstream. Denote the two branches by $a$ and $b$, where $a = b = x$; then $\frac{\partial L}{\partial x}=\frac{\partial L}{\partial a}\frac{\partial a}{\partial x}+\frac{\partial L}{\partial b}\frac{\partial b}{\partial x}=\frac{\partial L}{\partial a}+\frac{\partial L}{\partial b}$.

Repeat Node

  • A branch node has two branches, but it can be generalized to $N$ branches (copies); this is called a Repeat node.
    (figure omitted)
# Sketch of the Repeat node's forward/backward pass
import numpy as np

D, N = 8, 7
x = np.random.randn(1, D)               # input
y = np.repeat(x, N, axis=0)             # forward pass

dy = np.random.randn(N, D)              # assumed upstream gradient
dx = np.sum(dy, axis=0, keepdims=True)  # backward pass

Sum Node

(figure omitted)

# Sketch of the Sum node's forward/backward pass
import numpy as np

D, N = 8, 7
x = np.random.randn(N, D)               # input
y = np.sum(x, axis=0, keepdims=True)    # forward pass

dy = np.random.randn(1, D)              # assumed upstream gradient
dx = np.repeat(dy, N, axis=0)           # backward pass

The Chain Rule and Computational Graphs

  • For example, $z=(x+y)^{2}$ is composed of two expressions:
    $$z=t^{2}, \qquad t=x+y$$
    $$\frac{\partial z}{\partial t}=2t, \qquad \frac{\partial t}{\partial x}=1$$
    The derivative we ultimately want, $\frac{\partial z}{\partial x}$, is the product of these local derivatives, giving Eq. (5.3):
    $$\frac{\partial z}{\partial x}=\frac{\partial z}{\partial t}\frac{\partial t}{\partial x}=2t\cdot 1=2(x+y) \qquad (5.3)$$
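As a quick sanity check (my own example, not from the book), the analytic result $2(x+y)$ can be compared against a central-difference approximation:

import numpy as np

def z_func(x, y):
    return (x + y) ** 2

x, y, h = 3.0, 4.0, 1e-4
numerical = (z_func(x + h, y) - z_func(x - h, y)) / (2 * h)  # numerical dz/dx
analytic = 2 * (x + y)                                       # chain-rule result, Eq. (5.3)
print(numerical, analytic)                                   # both are approximately 14.0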

(figures omitted)

Backpropagation on a Computational Graph

  • The biggest reason for using computational graphs is that derivatives can be computed efficiently via backpropagation.
    • Example: find "the derivative of the payment amount with respect to the price of an apple" from Problem 1. Backpropagation is drawn with arrows (thick lines) pointing opposite to the forward direction; it propagates "local derivatives", whose values are written under the arrows. From the result we can see that this derivative is 2.2 (reproduced in the sketch below).
      (figure omitted)
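A minimal sketch that reproduces this result with the MulLayer sketched above (layer and variable names are my own):

mul_apple_layer = MulLayer()
mul_tax_layer = MulLayer()

# forward: 100 yen/apple * 2 apples, then * 1.1 for the consumption tax
apple_price = mul_apple_layer.forward(100, 2)       # 200
payment = mul_tax_layer.forward(apple_price, 1.1)   # 220

# backward: start from d(payment)/d(payment) = 1
dapple_price, dtax = mul_tax_layer.backward(1)
dapple, dapple_num = mul_apple_layer.backward(dapple_price)
print(dapple)   # 2.2 -- the derivative of the payment w.r.t. the apple price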

Implementing the Activation Function Layers

Relu

(figures omitted)


import numpy as np

class Relu:
    def __init__(self):
        self.mask = None            # remembers where the forward input was negative

    def forward(self, x):
        self.mask = x < 0
        out = x.copy()              # copy so the caller's array is not modified in place
        out[self.mask] = 0
        return out

    def backward(self, dout):
        dout[self.mask] = 0         # gradient is blocked where the forward input was negative
        return dout

if __name__ == "__main__":
    layer = Relu()
    x = np.random.randn(3, 3)
    print('x:', x, sep='\n')
    
    save = x.copy()
    out = layer.forward(save)
    print('out:', out, sep='\n')
    
    dout = layer.backward(np.ones_like(save))
    print('dout:', dout, sep='\n')

Output:

x:
[[ 0.08621289  1.20328454  1.81030439]
 [-1.31113673 -0.11453987  0.88408891]
 [ 0.14068574 -0.479992   -1.73015689]]
out:
[[0.08621289 1.20328454 1.81030439]
 [0.         0.         0.88408891]
 [0.14068574 0.         0.        ]]
dout:
[[1. 1. 1.]
 [0. 0. 1.]
 [1. 0. 0.]]

Sigmoid

(figure omitted)

  • Moreover, $\frac{\partial L}{\partial y} y^{2} \exp(-x)$ can be simplified further:
    $$\begin{aligned} \frac{\partial L}{\partial y} y^{2} \exp (-x) &=\frac{\partial L}{\partial y} \frac{1}{(1+\exp (-x))^{2}} \exp (-x) \\ &=\frac{\partial L}{\partial y} \frac{1}{1+\exp (-x)} \frac{\exp (-x)}{1+\exp (-x)} \\ &=\frac{\partial L}{\partial y} y(1-y) \end{aligned}$$
    Therefore, the backward pass of the Sigmoid layer can be computed from the forward output alone.

class Sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = 1 / (1 + np.exp(-x))
        self.out = out.copy()
        return out

    def backward(self, dout):
        return dout * self.out * (1 - self.out)

Tanh

(figure omitted)
$$\begin{aligned} y=\tanh(x)&=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} \\ &=\frac{2e^{x}}{e^{x}+e^{-x}}-1 \\ &=\frac{2}{1+e^{-2x}}-1 \\ \therefore\ e^{-2x}&=\frac{1-y}{1+y} \end{aligned}$$
$$\frac{\partial y}{\partial x}=\frac{2\cdot 2\,e^{-2x}}{(1+e^{-2x})^{2}}=\frac{4e^{-2x}}{(1+e^{-2x})^{2}}=\frac{4\,\frac{1-y}{1+y}}{\frac{4}{(1+y)^{2}}}=(1-y)(1+y)=1-y^{2}$$


class Tanh:
    def __init__(self):
        self.out = None

    def forward(self, x):
        out = np.tanh(x)
        self.out = out.copy()
        return out

    def backward(self, dout):
        # using the result derived above: dy/dx = 1 - y^2
        return dout * (1 - self.out ** 2)

Implementing the Affine / Softmax Layers

Affine

We derive the backward pass using the batch version of the Affine layer as an example:

  • Let the batch size be $N$, the number of neurons in the previous layer be $a$, and the number of neurons in this layer be $b$. Let $X$ be the input, $W$ this layer's weights, and $B$ this layer's bias, so that $Y = X \cdot W + B$.
    • $x_{ij}$ is the $j$-th input of the $i$-th sample
    • $w_{ij}$ is the weight connecting the $i$-th neuron of the previous layer to the $j$-th neuron of this layer
    • $y_{ij}$ is the $j$-th output of the $i$-th sample
    • $b_{i}$ is the bias of the $i$-th neuron

  • $$\begin{aligned} \frac{\partial L}{\partial w_{ij}} &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot \frac{\partial y_{kj}}{\partial w_{ij}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{km} w_{mj} + b_{j}\right)}{\partial w_{ij}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{kj}} \cdot x_{ki} \\ \therefore\ \frac{\partial L}{\partial W} &= X^{T} \cdot \frac{\partial L}{\partial Y} \end{aligned}$$

After the derivation we can sanity-check the shapes: $(a,b)=(N,a)^{T}(N,b)$.

$$\begin{aligned} \frac{\partial L}{\partial x_{ij}} &= \sum_{k=1}^{b} \frac{\partial L}{\partial y_{ik}} \cdot \frac{\partial y_{ik}}{\partial x_{ij}} \\ &= \sum_{k=1}^{b} \frac{\partial L}{\partial y_{ik}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{im} w_{mk} + b_{k}\right)}{\partial x_{ij}} \\ &= \sum_{k=1}^{b} \frac{\partial L}{\partial y_{ik}} \cdot w_{jk} \\ \therefore\ \frac{\partial L}{\partial X} &= \frac{\partial L}{\partial Y} \cdot W^{T} \end{aligned}$$
$$\begin{aligned} \frac{\partial L}{\partial b_{i}} &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \cdot \frac{\partial y_{ki}}{\partial b_{i}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \cdot \frac{\partial \left(\sum_{m=1}^{a} x_{km} w_{mi} + b_{i}\right)}{\partial b_{i}} \\ &= \sum_{k=1}^{N} \frac{\partial L}{\partial y_{ki}} \\ \therefore\ \frac{\partial L}{\partial B} &= \text{sum of } \frac{\partial L}{\partial Y} \text{ over axis } 0 \end{aligned}$$


If you just want to memorize these results, you can reconstruct them directly from the matrix shapes (see the sketch below).
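A small sketch of this shape-based check (my own illustration, using the $N$, $a$, $b$ notation from above):

import numpy as np

N, a, b = 4, 3, 5                 # batch size, previous-layer width, this-layer width
X = np.random.randn(N, a)
W = np.random.randn(a, b)
dY = np.random.randn(N, b)        # assumed upstream gradient dL/dY

dW = X.T @ dY                     # (a, N) @ (N, b) -> (a, b), same shape as W
dX = dY @ W.T                     # (N, b) @ (b, a) -> (N, a), same shape as X
dB = dY.sum(axis=0)               # (b,), same shape as B
print(dW.shape, dX.shape, dB.shape)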

(figures omitted)


class Affine:
    def __init__(self, w, b):
        self.w = w
        self.b = b
        self.x = None
        self.original_x_shape = None
        self.dw = None
        self.db = None

    def forward(self, x):
        # reshape tensor inputs (e.g. images) into a 2-D matrix for the fully connected computation
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)

        self.x = x
        return np.dot(self.x, self.w) + self.b

    def backward(self, dout):
        self.dw = np.dot(self.x.T, dout)                              # dL/dW = X^T . dL/dY
        self.db = dout if dout.ndim == 1 else np.sum(dout, axis=0)    # dL/dB = sum over the batch axis
        dx = np.dot(dout, self.w.T)                                   # dL/dX = dL/dY . W^T
        return dx.reshape(*self.original_x_shape)                     # restore the original input shape

Softmax-with-Loss (Cross-Entropy Loss) Layer

Forward Pass

(figures omitted)

Backward Pass

(figures omitted)

  • If a branch splits off during the forward pass, the backward values along those branches are summed during the backward pass. Taking the "/" node in the upper-right corner as an example (a worked completion of this step follows the figure):
    $$\frac{\partial L}{\partial S} = \frac{\partial L}{\partial y_1} \cdot \frac{\partial y_1}{\partial \frac{1}{S}} \cdot \frac{\partial \frac{1}{S}}{\partial S} + \frac{\partial L}{\partial y_2} \cdot \frac{\partial y_2}{\partial \frac{1}{S}} \cdot \frac{\partial \frac{1}{S}}{\partial S} + \frac{\partial L}{\partial y_3} \cdot \frac{\partial y_3}{\partial \frac{1}{S}} \cdot \frac{\partial \frac{1}{S}}{\partial S}$$
    (figure omitted)
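Filling in the local derivatives of this expression (my own completion of the step): since $y_i = e^{a_i}\cdot\frac{1}{S}$, we have $\frac{\partial y_i}{\partial \frac{1}{S}} = e^{a_i}$ and $\frac{\partial \frac{1}{S}}{\partial S} = -\frac{1}{S^{2}}$, so
$$\frac{\partial L}{\partial S}=-\frac{1}{S^{2}}\sum_{i=1}^{3}\frac{\partial L}{\partial y_{i}}\,e^{a_{i}}=-\frac{1}{S^{2}}\sum_{i=1}^{3}\left(-\frac{t_{i}}{y_{i}}\right)e^{a_{i}}=\frac{1}{S}\sum_{i=1}^{3}t_{i}=\frac{1}{S},$$
using $e^{a_{i}}/y_{i}=S$ and the fact that the one-hot labels satisfy $\sum_{i} t_{i} = 1$.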

Derivation Without the Computational Graph

  • To make the derivation easier to write, first introduce the Kronecker delta:
    $$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
    Now the derivation itself. Let $S=\sum_m e^{a_m}$, so that $y_p = \frac{e^{a_p}}{S}$, and
    $$\frac{\partial L}{\partial y_p} = -\frac{t_p}{y_p}$$
    $$\begin{aligned} \therefore\ \frac{\partial L}{\partial a_k} &= \sum_p \frac{\partial L}{\partial y_p}\frac{\partial y_p}{\partial a_k} \\ &= \sum_p -\frac{t_p}{y_p}\,\frac{S e^{a_p}\delta_{pk} - e^{a_p}\frac{\partial S}{\partial a_k}}{S^{2}} \end{aligned}$$
    $$\because\ \frac{\partial S}{\partial a_k} = \frac{\partial \sum_m e^{a_m}}{\partial a_k} = e^{a_k}$$
    $$\begin{aligned} \therefore\ \frac{\partial L}{\partial a_k} &= \sum_p -\frac{t_p}{e^{a_p}}\,\frac{S e^{a_p}\delta_{pk} - e^{a_p} e^{a_k}}{S} \\ &= \sum_p -\frac{t_p}{e^{a_p}}\left(e^{a_p}\delta_{pk} - \frac{e^{a_p} e^{a_k}}{S}\right) \\ &= \sum_p\left(-t_p\delta_{pk} + \frac{t_p e^{a_k}}{S}\right) \\ &= -t_k + \sum_p t_p y_k \\ &= -t_k + y_k \end{aligned}$$

  • Using the cross-entropy error as the loss function for the softmax function, backpropagation yields the "clean" result $(y_1 - t_1,\ y_2 - t_2,\ y_3 - t_3)$. This clean result is no accident: the cross-entropy error was designed precisely so that this happens.
  • For the same reason, regression problems use the "identity function" in the output layer together with the "sum-of-squares error" as the loss function; with this combination, backpropagation likewise yields $(y_1 - t_1,\ y_2 - t_2,\ y_3 - t_3)$ (a one-line check follows).
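A one-line check of the regression case (my own addition): with the identity function as the output layer, $y_k = a_k$, and the sum-of-squares error
$$L=\frac{1}{2}\sum_{k}(y_{k}-t_{k})^{2}\ \Rightarrow\ \frac{\partial L}{\partial a_{k}}=\frac{\partial L}{\partial y_{k}}=y_{k}-t_{k}$$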

# softmax and cross_entropy_error are assumed to be the functions from the previous notes
# (in the book's source they live in common/functions.py)
class SoftmaxWithLoss:
    def __init__(self):
        self.y = None   # softmax output
        self.t = None   # labels

    def forward(self, x, t):
        self.t = t.copy()
        self.y = softmax(x)
        return cross_entropy_error(self.y, t)

    def backward(self, dout=1):
        batch_size = self.t.shape[0]

        if self.t.size == self.y.size:  # labels are one-hot vectors
            dx = self.y - self.t
        else:                           # labels are class indices
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1

        # dividing by batch_size so that the gradient passed upstream is the per-sample error
        return dx / batch_size
  • Note: the derivative $\frac{\partial L}{\partial Y}$ produced by the loss function should be divided by batch_size, so that what is passed to the preceding layers is the per-sample error!

Gradient Check

  • Numerical differentiation is simple to implement, so it rarely goes wrong, whereas an implementation of error backpropagation is complex and error-prone. Therefore, it is common to compare the gradients from numerical differentiation with those from backpropagation to confirm that the backpropagation implementation is correct.
    • One way to compare them is a relative error based on the Euclidean norm (written out below); the denominator normalizes the value so it is neither too large nor too small. If the value comes out large, compare the numerically computed gradient and the backpropagated gradient element by element to find which parameter's gradient went wrong.

(figure omitted)
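Written out, the relative error used as diff2 in the code below is (this matches the np.linalg.norm expression in gradient_check):
$$\mathrm{diff} = \frac{\lVert g_{\text{numerical}} - g_{\text{backprop}} \rVert_{2}}{\lVert g_{\text{numerical}} \rVert_{2} + \lVert g_{\text{backprop}} \rVert_{2}}$$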

  • Doesn't work with dropout (dropout randomly drops neurons, which makes the loss hard to compute deterministically)
  • Run at random initialization, and perhaps again after some training (there is a small chance that the gradient check only passes while $w$ and $b$ are close to 0, and the backpropagated gradients become wrong once $w$ and $b$ move away from 0 during training; so the check can be repeated after the network has trained for a while)

import numpy as np

def gradient_check(net, x_batch, t_batch):
    grad_numerical = net.numerical_gradient(x_batch, t_batch)
    grad_backprop = net.gradient(x_batch, t_batch)

    for key in grad_numerical.keys():
        print(key, ':')

        # mean absolute difference (the book's criterion)
        diff1 = np.mean(np.abs(grad_numerical[key] - grad_backprop[key]))
        print('diff1:', diff1)

        # relative error based on the Euclidean norm
        # diff2 ~ 1e-7 -> correct
        # diff2 > 1e-5 -> please check again!
        # diff2 > 1e-3 -> concerned
        diff2 = (np.linalg.norm(grad_numerical[key] - grad_backprop[key], 2)
                 / (np.linalg.norm(grad_numerical[key], 2) + np.linalg.norm(grad_backprop[key], 2)))
        print('diff2:', diff2)
  • Here diff1 follows the method used in the book, while diff2 uses the Euclidean-norm relative error; the thresholds for judging it are given in the comments in the code above.

Reimplementing the Two-Layer Network by Assembling the Layers

import sys
file_path = __file__.replace('\\', '/')
dir_path = file_path[: file_path.rfind('/')] # directory containing this file
pardir_path = dir_path[: dir_path.rfind('/')]
sys.path.append(pardir_path) # add the parent directory to the Python module search path

import numpy as np
from func.gradient import numerical_gradient, gradient_check
from layer.activation import Relu, Affine, SoftmaxWithLoss, Sigmoid
import matplotlib.pyplot as plt
from collections import OrderedDict

class TwoLayerNet:
    """
    2 Fully Connected layers
    softmax with cross entropy error
    """
    def __init__(self, input_size, hidden_size, output_size, weight_init_std=0.01):
        self.params = {}
        self.params['w1'] = np.random.randn(input_size, hidden_size) * weight_init_std
        self.params['b1'] = np.zeros(hidden_size)
        self.params['w2'] = np.random.randn(hidden_size, output_size) * weight_init_std
        self.params['b2'] = np.zeros(output_size)

        self.layers = OrderedDict()
        self.layers['affine1'] = Affine(self.params['w1'], self.params['b1'])
        self.layers['relu1'] = Relu()
        self.layers['affine2'] = Affine(self.params['w2'], self.params['b2'])
        
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.lastLayer.forward(y, t)

    def accuracy(self, x, t):
        y = self.predict(x)
        y = y.argmax(axis=1)

        if t.ndim != 1:
            t = t.argmax(axis=1)

        accuracy = np.sum(y == t) / x.shape[0]
        return accuracy

    def numerical_gradient(self, x, t):
        loss = lambda w: self.loss(x, t)

        grads = {}
        grads['w1'] = numerical_gradient(loss, self.params['w1'])
        grads['b1'] = numerical_gradient(loss, self.params['b1'])
        grads['w2'] = numerical_gradient(loss, self.params['w2'])
        grads['b2'] = numerical_gradient(loss, self.params['b2'])

        return grads

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        for layer_name in reversed(self.layers):
            dout = self.layers[layer_name].backward(dout)

        grads = {}
        grads['w1'] = self.layers['affine1'].dw
        grads['b1'] = self.layers['affine1'].db
        grads['w2'] = self.layers['affine2'].dw
        grads['b2'] = self.layers['affine2'].db

        return grads


if __name__ == '__main__':
    from dataset.mnist import load_mnist
    import pickle
    import os

    (x_train, t_train),  (x_test, t_test) = load_mnist(normalize=True, flatten=True, one_hot_label=True)
    
    # hyper parameters
    lr = 0.1
    batch_size = 100
    iters_num = 10000

    # setting
    train_flag = 0 # train (1) or only evaluate (0)
    pretrain_flag = 0 # load the parameters saved by the previous training run
    gradcheck_flag = 1 # run a gradient check on the trained network
    
    pkl_file_name = dir_path + '/two_layer_net.pkl'
    train_size = x_train.shape[0]
    train_loss_list = []
    train_acc_list = []
    test_acc_list = []
    best_acc = 0

    iter_per_epoch = max(int(train_size / batch_size), 1)

    net = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

    if (pretrain_flag == 1 or train_flag == 0) and os.path.exists(pkl_file_name):
        with open(pkl_file_name, 'rb') as f:
            net = pickle.load(f)
            print('params loaded!')

    if train_flag == 1:
        print('start training!')
        for i in range(iters_num):
            # sample a mini-batch
            batch_mask = np.random.choice(train_size, batch_size)
            x_batch = x_train[batch_mask]
            t_batch = t_train[batch_mask]

            # compute the gradients
            # grads_numerical = net.numerical_gradient(x_batch, t_batch)
            grads = net.gradient(x_batch, t_batch)

            # update the parameters
            for key in ('w1', 'b1', 'w2', 'b2'):
                net.params[key] -= lr * grads[key]
            
            train_loss_list.append(net.loss(x_batch, t_batch))

            # record the learning progress once per epoch
            if i % iter_per_epoch == 0:
                train_acc_list.append(net.accuracy(x_train, t_train))
                test_acc_list.append(net.accuracy(x_test, t_test))
                print("train acc, test acc | ", train_acc_list[-1], ", ", test_acc_list[-1])

                if test_acc_list[-1] > best_acc:
                    best_acc = test_acc_list[-1]
                    with open(pkl_file_name, 'wb') as f:
                        pickle.dump(net, f)
                        print('net params saved!')

        # plot the accuracy curves
        fig, axis = plt.subplots(1, 1)

        x = np.arange(len(train_acc_list))
        axis.plot(x, train_acc_list, 'r', label='train acc')
        axis.plot(x, test_acc_list, 'g--', label='test acc')
        
        markers = {'train': 'o', 'test': 's'}
        axis.set_xlabel("epochs")
        axis.set_ylabel("accuracy")
        axis.set_ylim(0, 1.0)
        axis.legend(loc='best')
        plt.show()
    else:
        if gradcheck_flag == 1:
            gradient_check(net, x_train[:3], t_train[:3])
        print(net.accuracy(x_train[:], t_train[:]))
  • First run the gradient check by setting gradcheck_flag = 1 and train_flag = 0; the output is:
w1 :
diff1: 2.6115099710177576e-11
diff2: 5.083025533280819e-08
b1 :
diff1: 2.2425837345792317e-10
diff2: 5.04159103684074e-08
w2 :
diff1: 1.3242105311984443e-10
diff2: 4.724560085528282e-08
b2 :
diff1: 2.712055815786953e-10
diff2: 5.095936065001308e-08
  • The backpropagated gradients look correct, so let's move on to training the network: set train_flag = 1. Output:
start training!
train acc, test acc |  0.10571666666666667 ,  0.1042
net params saved!
train acc, test acc |  0.9058833333333334 ,  0.9077
net params saved!
train acc, test acc |  0.9251666666666667 ,  0.9275
net params saved!
train acc, test acc |  0.9367166666666666 ,  0.9353
net params saved!
train acc, test acc |  0.9477166666666667 ,  0.9447
net params saved!
train acc, test acc |  0.9528833333333333 ,  0.9509
net params saved!
train acc, test acc |  0.9583 ,  0.9555
net params saved!
train acc, test acc |  0.9610333333333333 ,  0.9578
net params saved!
train acc, test acc |  0.9665 ,  0.9623
net params saved!
train acc, test acc |  0.9686333333333333 ,  0.9647
net params saved!
train acc, test acc |  0.9702333333333333 ,  0.9659
net params saved!
train acc, test acc |  0.9720833333333333 ,  0.9679
net params saved!
train acc, test acc |  0.9742166666666666 ,  0.9671
train acc, test acc |  0.9735666666666667 ,  0.9682
net params saved!
train acc, test acc |  0.9770833333333333 ,  0.9709
net params saved!
train acc, test acc |  0.9751 ,  0.9678
train acc, test acc |  0.9789166666666667 ,  0.9717
net params saved!

(figure omitted: train/test accuracy vs. epochs)
