The General Case of Backpropagation

Assumptions

  • Use the cross-entropy loss function
  • Use the sigmoid activation function
  • The output layer uses the softmax function

Network Structure

  • Fully connected; the output layer uses softmax instead of sigmoid

Goals

  • Derive the four backpropagation equations under the above assumptions
  • Implement handwritten digit recognition in Python (two hidden layers: 192 neurons in the first hidden layer, 30 in the second)
    [Figure: network structure diagram]
    Here, $w^{l}_{ij}$ denotes the weight from the $j$-th neuron in layer $l$ to the $i$-th neuron in layer $l+1$; $w^{l}_{ij}$ can be read as the corresponding element of $(W^{l})^{T}$. $b^{l}_{i}$ denotes the bias from layer $l$ to the $i$-th neuron in layer $l+1$ (see the shape-check sketch below).
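As a concrete illustration of this convention, here is a minimal NumPy sketch (the layer sizes are arbitrary placeholders) checking that $W^{l}\in\mathbb{R}^{d_{l}\times d_{l+1}}$ and $z^{l+1}=(W^{l})^{T}h^{l}+b^{l}$:

import numpy as np

# Minimal shape check of the convention: W^l has shape (d_l, d_{l+1}),
# so the pre-activation of layer l+1 is z^{l+1} = (W^l)^T h^l + b^l.
d_l, d_l1 = 3, 2                  # arbitrary example layer sizes
W = np.random.randn(d_l, d_l1)    # W^l; W[j, i] corresponds to w^l_{ij}
b = np.random.randn(d_l1, 1)      # b^l
h = np.random.randn(d_l, 1)       # h^l, activations of layer l
z_next = W.T @ h + b              # z^{l+1}
assert z_next.shape == (d_l1, 1)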

Derivation of the BP Equations

  • Derivation approach: start from a single neuron $j$ and then extend to all neurons in the layer
  • Introduce an intermediate variable $\delta^{l}_{j}=\frac{\partial E}{\partial z^{l}_{j}}$, the error of the $j$-th neuron in layer $l$, where $E$ denotes the loss function.

BP1

Let layer $L$ be the output layer of the network. By the softmax function, the output of the $j$-th neuron in layer $L$ is
$$h^{L}_{j}=\frac{e^{z^{L}_{j}}}{\sum_{k=1}^{d_{L}}e^{z^{L}_{k}}}\tag{1}$$
Since the loss function is the cross-entropy loss,
$$E=-\sum_{j=1}^{d_{L}}y_{j}\ln h^{L}_{j}\tag{2}$$
Since the label $y$ is one-hot ($y_{i}=1$ and all other entries are 0),
$$E=-\ln h^{L}_{i}\tag{3}$$
Further, combining Eq. (1) and Eq. (3) gives
$$E=-\ln h^{L}_{i}=-\ln\frac{e^{z^{L}_{i}}}{\sum_{j=1}^{d_{L}}e^{z^{L}_{j}}}=-z^{L}_{i}+\ln\sum_{j=1}^{d_{L}}e^{z^{L}_{j}}\tag{4}$$
We now evaluate $\delta^{L}_{j}=\frac{\partial E}{\partial z^{L}_{j}}$ by cases. When $j=i$,
$$\delta^{L}_{i}=\frac{\partial E}{\partial z^{L}_{i}}=-1+h^{L}_{i}\tag{5}$$
When $j\neq i$,
$$\delta^{L}_{j}=\frac{\partial E}{\partial z^{L}_{j}}=0+h^{L}_{j}\tag{6}$$
Combining Eq. (5) and Eq. (6) gives
$$\delta^{L}_{j}=\frac{\partial E}{\partial z^{L}_{j}}=h^{L}_{j}-y_{j}\tag{7}$$
Extending to matrix form yields BP1:
$$\delta^{L}=\frac{\partial E}{\partial z^{L}}=h^{L}-y,\qquad \delta^{L}\in\mathbb{R}^{d_{L}\times1}\tag{8}$$
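BP1 can be sanity-checked numerically: for a random logit vector $z^{L}$ and a one-hot label $y$, the analytic gradient $h^{L}-y$ should agree with a finite-difference estimate of $\partial E/\partial z^{L}$. A minimal sketch (the output size, label index, and step size are arbitrary choices):

import numpy as np

def softmax_vec(z):
    # Numerically stable softmax for a 1-D vector
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_L = 10
z = rng.standard_normal(d_L)             # logits z^L
y = np.zeros(d_L)
y[3] = 1.0                               # one-hot label (index 3 chosen arbitrarily)

def E(z):
    # Cross-entropy loss of softmax(z) against the one-hot label y
    return -np.sum(y * np.log(softmax_vec(z)))

analytic = softmax_vec(z) - y            # BP1: h^L - y
eps = 1e-6
numeric = np.array([(E(z + eps * np.eye(d_L)[k]) - E(z - eps * np.eye(d_L)[k])) / (2 * eps)
                    for k in range(d_L)])
print(np.max(np.abs(analytic - numeric)))   # should be on the order of 1e-9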

BP2

BP2 is the most critical step of the backpropagation equations. It looks for the connection between adjacent layers, i.e., it derives the relationship between $\delta^{l+1}$ and $\delta^{l}$. Since $\delta^{l+1}$ and $\delta^{l}$ are by definition the derivatives of $E$ with respect to $z^{l+1}$ and $z^{l}$, this ultimately reduces to finding the relationship between $z^{l+1}$ and $z^{l}$.
The $j$-th neuron $z^{l+1}_{j}$ in layer $l+1$ receives weighted inputs from every neuron in layer $l$ (a many-to-one relationship), i.e.,
$$z^{l+1}_{j}=\sum_{k=1}^{d_{l}}w^{l}_{jk}h^{l}_{k}+b^{l}_{j}\tag{9}$$

In layer $l$, the only difference between $z^{l}_{j}$ and $h^{l}_{j}$ of the $j$-th neuron is a sigmoid function:
$$h^{l}_{j}=\sigma(z^{l}_{j})\tag{10}$$
Combining Eq. (9) and Eq. (10) gives the relationship between $z^{l}$ and $z^{l+1}$:
$$z^{l+1}_{j}=\sum_{k=1}^{d_{l}}w^{l}_{jk}\sigma(z^{l}_{k})+b^{l}_{j}\tag{11}$$
Note that the $j$-th neuron in layer $l$ sends weighted outputs to every neuron in layer $l+1$ (a one-to-many relationship), so
$$\delta^{l}_{j}=\frac{\partial E}{\partial z^{l}_{j}}=\sum_{k=1}^{d_{l+1}}\frac{\partial E}{\partial z^{l+1}_{k}}\frac{\partial z^{l+1}_{k}}{\partial z^{l}_{j}}=\sum_{k=1}^{d_{l+1}}\delta^{l+1}_{k}w^{l}_{kj}\sigma'(z^{l}_{j})\tag{12}$$

Extending to matrix form, where $\delta^{l+1}\in\mathbb{R}^{d_{l+1}\times1}$, $W^{l}\in\mathbb{R}^{d_{l}\times d_{l+1}}$, and $z^{l}\in\mathbb{R}^{d_{l}\times1}$, yields BP2:
$$\delta^{l}=\frac{\partial E}{\partial z^{l}}=\sigma'(z^{l})\odot(W^{l}\delta^{l+1}),\qquad \delta^{l}\in\mathbb{R}^{d_{l}\times1}\tag{13}$$
($\odot$ denotes the Hadamard product, i.e., element-wise multiplication.)
BP2 is the trickiest equation and its subscripts are easy to confuse; the key is to understand the many-to-many relationship between adjacent layers.
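In code, BP2 is a single line: the error of layer $l+1$ is mapped back through $W^{l}$ and gated element-wise by $\sigma'(z^{l})$. A minimal sketch, with arbitrary example layer sizes and a $\delta^{l+1}$ assumed to be already known:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

d_l, d_l1 = 5, 4                          # arbitrary sizes of layers l and l+1
W = np.random.randn(d_l, d_l1)            # W^l, shape (d_l, d_{l+1})
z_l = np.random.randn(d_l, 1)             # z^l
delta_next = np.random.randn(d_l1, 1)     # delta^{l+1}, assumed already known

sigma_prime = sigmoid(z_l) * (1 - sigmoid(z_l))   # sigma'(z^l)
delta_l = sigma_prime * (W @ delta_next)          # BP2 (Hadamard product)
assert delta_l.shape == (d_l, 1)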

BP3

With BP2 in hand, we can now derive $\frac{\partial E}{\partial b^{l-1}_{j}}$ (written as $\frac{\partial E}{\partial b^{l-1}_{j}}$ rather than $\frac{\partial E}{\partial b^{l}_{j}}$ only to line up with the indexing in BP2); this is a parameter that actually gets updated during backpropagation.
From Eq. (9) above,
$$z^{l+1}_{j}=\sum_{k=1}^{d_{l}}w^{l}_{jk}h^{l}_{k}+b^{l}_{j}\tag{9}$$
a simple shift of indices gives
$$z^{l}_{j}=\sum_{k=1}^{d_{l-1}}w^{l-1}_{jk}h^{l-1}_{k}+b^{l-1}_{j}\tag{14}$$
It follows that
$$\frac{\partial E}{\partial b^{l-1}_{j}}=\frac{\partial E}{\partial z^{l}_{j}}\frac{\partial z^{l}_{j}}{\partial b^{l-1}_{j}}=\delta^{l}_{j}$$
Extending to matrix form yields BP3:
$$\frac{\partial E}{\partial b^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l-1}}=\delta^{l}$$

BP4

BP4 derives $\frac{\partial E}{\partial w^{l-1}_{jk}}$ from Eq. (14):
$$z^{l}_{j}=\sum_{k=1}^{d_{l-1}}w^{l-1}_{jk}h^{l-1}_{k}+b^{l-1}_{j}\tag{14}$$
For each individual parameter $w^{l-1}_{jk}$,
$$\frac{\partial E}{\partial w^{l-1}_{jk}}=\frac{\partial E}{\partial z^{l}_{j}}\frac{\partial z^{l}_{j}}{\partial w^{l-1}_{jk}}=\delta^{l}_{j}h^{l-1}_{k}$$
Extending to matrix form, with $\delta^{l}\in\mathbb{R}^{d_{l}\times1}$ and $h^{l-1}\in\mathbb{R}^{d_{l-1}\times1}$, yields BP4:
$$\frac{\partial E}{\partial W^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l-1}}=h^{l-1}(\delta^{l})^{T},\qquad \frac{\partial E}{\partial W^{l-1}}\in\mathbb{R}^{d_{l-1}\times d_{l}}$$
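Given $\delta^{l}$ and the previous layer's activation $h^{l-1}$, BP3 and BP4 are direct to compute: the bias gradient is $\delta^{l}$ itself and the weight gradient is the outer product $h^{l-1}(\delta^{l})^{T}$. A minimal sketch with arbitrary layer sizes:

import numpy as np

d_prev, d_l = 6, 4                     # arbitrary sizes of layers l-1 and l
h_prev = np.random.randn(d_prev, 1)    # h^{l-1}
delta_l = np.random.randn(d_l, 1)      # delta^l, assumed already computed via BP1/BP2

grad_b = delta_l                       # BP3: dE/db^{l-1} = delta^l
grad_W = h_prev @ delta_l.T            # BP4: dE/dW^{l-1} = h^{l-1} (delta^l)^T
assert grad_W.shape == (d_prev, d_l)   # matches W^{l-1} in R^{d_{l-1} x d_l}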

The BP Equations

  • BP1: $\delta^{L}=\frac{\partial E}{\partial z^{L}}=h^{L}-y,\quad \delta^{L}\in\mathbb{R}^{d_{L}\times1}$
  • BP2: $\delta^{l}=\frac{\partial E}{\partial z^{l}}=\sigma'(z^{l})\odot(W^{l}\delta^{l+1}),\quad \delta^{l}\in\mathbb{R}^{d_{l}\times1}$
  • BP3: $\frac{\partial E}{\partial b^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial b^{l-1}}=\delta^{l}$
  • BP4: $\frac{\partial E}{\partial W^{l-1}}=\frac{\partial E}{\partial z^{l}}\frac{\partial z^{l}}{\partial W^{l-1}}=h^{l-1}(\delta^{l})^{T},\quad \frac{\partial E}{\partial W^{l-1}}\in\mathbb{R}^{d_{l-1}\times d_{l}}$
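To check that the four equations fit together, a finite-difference test on a tiny one-hidden-layer network is useful: the analytic gradients from BP1–BP4 should agree with numerical gradients of the loss. A minimal sketch (the layer sizes 4/3/2, the label, and the step size are arbitrary choices, not the network used later):

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def softmax_vec(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d0, d1, d2 = 4, 3, 2                               # input / hidden / output sizes
W1, b1 = rng.standard_normal((d0, d1)), rng.standard_normal((d1, 1))
W2, b2 = rng.standard_normal((d1, d2)), rng.standard_normal((d2, 1))
x = rng.standard_normal((d0, 1))
y = np.array([[1.0], [0.0]])                       # one-hot label

def loss(W1, b1, W2, b2):
    h1 = sigmoid(W1.T @ x + b1)
    out = softmax_vec(W2.T @ h1 + b2)
    return float(-(y * np.log(out)).sum())

# Analytic gradients via BP1-BP4
z1 = W1.T @ x + b1
h1 = sigmoid(z1)
z2 = W2.T @ h1 + b2
out = softmax_vec(z2)
delta2 = out - y                                   # BP1
delta1 = (h1 * (1 - h1)) * (W2 @ delta2)           # BP2
gW2, gb2 = h1 @ delta2.T, delta2                   # BP4, BP3 (output layer)
gW1, gb1 = x @ delta1.T, delta1                    # BP4, BP3 (hidden layer)

# Compare one weight entry, W1[0, 0], against a central finite difference
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (loss(W1p, b1, W2, b2) - loss(W1m, b1, W2, b2)) / (2 * eps)
print(gW1[0, 0], numeric)                          # should agree to ~6 decimal places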

Questions

  • If the output layer uses the sigmoid function with a mean-squared-error loss, how do the BP equations change?
    Answer: only BP1 changes (see the sketch below).
  • If ReLU is used as the activation function instead, how do the BP equations change?
    Answer: only BP2 changes (see the sketch below).
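A sketch of how the equations change under those two swaps (everything else stays as derived above). For a sigmoid output layer with the MSE loss $E=\frac{1}{2}\sum_{j}(h^{L}_{j}-y_{j})^{2}$, BP1 becomes
$$\delta^{L}=(h^{L}-y)\odot\sigma'(z^{L})$$
while BP2–BP4 keep their form. For ReLU hidden activations, only the derivative factor in BP2 changes:
$$\delta^{l}=\mathrm{ReLU}'(z^{l})\odot(W^{l}\delta^{l+1}),\qquad \mathrm{ReLU}'(z)=\begin{cases}1, & z>0\\ 0, & z\le 0\end{cases}$$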

Code Implementation

  • Python implementation of handwritten digit recognition (two hidden layers: 192 neurons in the first hidden layer, 30 in the second)
  • The input image has $28\times28=784$ pixels
import _pickle as cPickle
import gzip

import numpy as np


def softmax(x, axis=0):
    # Maximum along the given axis
    row_max = x.max(axis=axis)

    # Subtract the maximum before exponentiating; otherwise exp(x) can overflow to inf
    row_max = row_max.reshape(-1, 1)
    x = x - row_max

    # Exponentiate and normalize
    x_exp = np.exp(x)
    x_sum = np.sum(x_exp, axis=axis, keepdims=True)
    s = x_exp / x_sum
    return s


def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))


def sigmoid_derivation(x):
    result = sigmoid(x)
    return result * (1 - result)


class NeuralNetwork:
    """Fully connected network for MNIST"""

    def __init__(self, input_layer=784, hidden_layer1=192, hidden_layer2=30, output_layer=10,
                 learning_rate=0.1):
        """Initialize the network structure"""
        # Number of input-layer nodes
        self.input_layer = input_layer
        # Number of nodes in the first hidden layer
        self.hidden_layer1 = hidden_layer1
        # Number of nodes in the second hidden layer
        self.hidden_layer2 = hidden_layer2
        # Number of output-layer nodes
        self.output_layer = output_layer
        # Activation function and its derivative
        self.activation = sigmoid
        self.activation_derivation = sigmoid_derivation
        # Learning rate
        self.learning_rate = learning_rate

        """Initialize the weight and bias matrices"""
        # Input layer to first hidden layer
        self.W1 = np.random.randn(self.input_layer, self.hidden_layer1)
        self.b1 = np.random.randn(self.hidden_layer1, 1)
        # First hidden layer to second hidden layer
        self.W2 = np.random.randn(self.hidden_layer1, self.hidden_layer2)
        self.b2 = np.random.randn(self.hidden_layer2, 1)
        # Second hidden layer to output layer
        self.W3 = np.random.randn(self.hidden_layer2, self.output_layer)
        self.b3 = np.random.randn(self.output_layer, 1)

    """前向传播"""

    def forward(self, x):
        z2 = np.dot(self.W1.T, x) + self.b1
        h2 = self.activation(z2)
        z3 = np.dot(self.W2.T, h2) + self.b2
        h3 = self.activation(z3)
        z4 = np.dot(self.W3.T, h3) + self.b3
        out = softmax(z4)
        return z2, h2, z3, h3, z4, out

    """训练网络"""

    def train(self, data, epochs=1):
        X, Y = data
        count = len(Y)
        # One-hot encoding of Y
        new_Y = np.zeros((count, 10))
        # Training loop
        for epoch in range(epochs):
            for i in range(count):
                new_Y[i][Y[i]] = 1.0
                # Reshape X[i] and new_Y[i] into column vectors
                x = np.array(X[i], ndmin=2).T
                target = np.array(new_Y[i], ndmin=2).T
                """Forward propagation"""
                z2, h2, z3, h3, z4, out = self.forward(x)
                """Compute the loss"""
                loss = (-np.log(out) * target).sum()
                print('loss:' + str(loss))
                # if loss < 1e-3:
                #     break
                """Backpropagation: update W and b according to the BP equations"""
                # BP1
                delta_z4 = out - target
                # Propagate the error back through W3 before W3 is updated (BP2 needs the original weights)
                delta_h3 = np.dot(self.W3, delta_z4)
                # Update second hidden layer -> output layer: z4 = W3.T @ h3 + b3
                delta_W3 = np.dot(h3, delta_z4.T)
                self.W3 -= self.learning_rate * delta_W3
                delta_b3 = delta_z4
                self.b3 -= self.learning_rate * delta_b3
                # Second hidden layer: z3 = W2.T @ h2 + b2, h3 = sigmoid(z3)
                delta_z3 = np.multiply(self.activation_derivation(z3), delta_h3)
                # Propagate the error back through W2 before W2 is updated
                delta_h2 = np.dot(self.W2, delta_z3)
                delta_W2 = np.dot(h2, delta_z3.T)
                self.W2 -= self.learning_rate * delta_W2
                delta_b2 = delta_z3
                self.b2 -= self.learning_rate * delta_b2
                # First hidden layer: z2 = W1.T @ x + b1, h2 = sigmoid(z2)
                delta_z2 = np.multiply(self.activation_derivation(z2), delta_h2)
                delta_W1 = np.dot(x, delta_z2.T)
                self.W1 -= self.learning_rate * delta_W1
                delta_b1 = delta_z2
                self.b1 -= self.learning_rate * delta_b1

    """测试"""

    def test(self, data):
        X, Y = data
        count = len(Y)
        # Predict and count correct results
        res = 0
        for i in range(0, count):
            x = np.array(X[i], ndmin=2).T
            z2, h2, z3, h3, z4, out = self.forward(x)
            # Take the index of the largest output as the prediction
            predict = np.argmax(out)
            # print(predict, Y[i])
            if int(predict) == int(Y[i]):
                res += 1
        rating = res / count * 100
        print("correct rating: %.2f" % rating + '%')


if __name__ == "__main__":
    # Instantiate the neural network
    net = NeuralNetwork()
    # Load the MNIST data
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    net.train(training_data)
    net.test(test_data)

Logistic Regression

  • Handwritten digit recognition with 10 one-vs-rest classifiers (one per digit)
import _pickle as cPickle
import gzip

import numpy as np
import torch
from torch import nn

nets = []


class LogisticRegression(nn.Module):
    def __init__(self, input_layer=784, hidden_layer1=192, hidden_layer2=30, output_layer=1):
        super(LogisticRegression, self).__init__()
        self.fc1 = nn.Linear(input_layer, hidden_layer1)
        self.fc2 = nn.Linear(hidden_layer1, hidden_layer2)
        self.fc3 = nn.Linear(hidden_layer2, output_layer)
        self.activation = nn.Sigmoid()

    def forward(self, x):
        z2 = self.fc1(x)
        h2 = self.activation(z2)
        z3 = self.fc2(h2)
        h3 = self.activation(z3)
        z4 = self.fc3(h3)
        out = self.activation(z4)
        return out


def train(data, epochs=1):
    X, Y = data
    count = len(Y)
    for i in range(10):
        net = LogisticRegression()
        # Define the loss function and the optimizer
        criterion = nn.BCELoss()
        optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)
        tmp_Y = np.where(Y == i, np.ones_like(Y), np.zeros_like(Y))
        for epoch in range(epochs):
            for j in range(count):
                x = torch.from_numpy(X[j])
                y = torch.from_numpy(tmp_Y[j:j + 1].astype(np.float32))
                out = net(x)
                loss = criterion(out, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                print("loss:{:.4f}".format(loss))
        nets.append(net)


def test(data):
    X, Y = data
    count = len(Y)
    res = 0
    with torch.no_grad():
        for n in range(count):
            out = 0
            index = 0
            print(Y[n])
            for i in range(len(nets)):
                x = torch.from_numpy(X[n])
                x_out = nets[i](x)
                print('{} {}'.format(i, x_out.item()))
                if x_out > out:
                    out = x_out
                    index = i
            if index == Y[n]:
                res += 1
    print(res / count)


if __name__ == "__main__":
    # Networks are instantiated inside train()
    # Load the MNIST data
    f = gzip.open('../data/mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    train(training_data)
    test(test_data)
