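Machine Learning | A Step-by-Step Derivation of Gradient Updates for a Fully Connected Neural Network, with a Hand-Written Python Implementation for Multi-Class Classification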

圆姜

Last modified 2024-05-18 00:03:34

First published 2024-02-21 21:03:45

Article link: https://blog.csdn.net/ningia/article/details/136216428

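The code in this post is a lightly modified version of the code from the post "Python: Iris classification with a BP neural network" (the changes include the activation functions and the loss function). To make it easier to follow, the derivation of the gradients is worked out alongside the code.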

step1 Load and explore the dataset

from sklearn.datasets import load_iris
iris_dataset=load_iris()
# The iris dataset has 150 samples, each with 4 features
iris_dataset['data'].shape
-------------------------------------
(150, 4)

# The 4 features are:
iris_dataset['feature_names']
-------------------------------------
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

# The label array is:
iris_dataset['target']
-------------------------------------
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# The iris dataset contains three classes:
iris_dataset['target_names']
-------------------------------------
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

step2 Split into training and test sets

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(iris_dataset['data'],iris_dataset['target'],random_state=0)
print("X_train shape:{}".format(X_train.shape))
print("y_train shape:{}".format(y_train.shape))
print("X_test shape:{}".format(X_test.shape))
print("y_test shape:{}".format(y_test.shape))
---------------------------------------------
X_train shape:(112, 4)
y_train shape:(112,)
X_test shape:(38, 4)
y_test shape:(38,)

step3 Build the neural network model

import pandas as pd
import numpy as np
import datetime

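We build a neural network with one hidden layer of size 10. The input layer takes 4 features and the output layer has 3 classes.
The classes are one-hot encoded: (1,0,0) is the first class, (0,1,0) the second, and (0,0,1) the third.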
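
The one-hot encoding described above can be sketched with NumPy as follows. This is only an illustration, not the code this post uses for training; the helper name `one_hot` is hypothetical, and it stores one sample per column so that the result has the 3×m layout used for Y later on.

```python
import numpy as np

def one_hot(labels, n_classes=3):
    """Return an (n_classes, m) matrix whose j-th column encodes labels[j].

    `one_hot` is a hypothetical helper used for illustration only.
    """
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot code of class k
    return np.eye(n_classes)[labels].T

Y_demo = one_hot([0, 1, 2, 1])
print(Y_demo)
```

Each column of `Y_demo` contains exactly one 1, in the row of the corresponding class.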

step3.1 Initialize the parameters

def initialize_parameters(n_x, n_h, n_y):
    '''
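    n_x: dimension of the input layer
    n_h: dimension of the hidden layer
    n_y: dimension of the output layer
    '''
    np.random.seed(2)
    
    # Weight and bias matrices
    w1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros(shape=(n_h, 1))
    w2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros(shape=(n_y, 1))
    
    # Store the parameters in a dictionary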
    parameters = {'w1': w1, 'b1': b1, 'w2': w2, 'b2': b2}

    return parameters

To keep the formulas compact, h below stands for n_h in the code (the dimension of the hidden layer), $w^1$ stands for w1 in the code, and so on.

$w^1 = \begin{pmatrix} w^1_{11} & \cdots & w^1_{14} \\ \vdots & &\vdots \\ w^1_{h1} & \cdots & w^1_{h4} \end{pmatrix}_{h\times 4}$ $b^1 = \begin{pmatrix} b^1_1\\ \vdots\\ b^1_h \end{pmatrix}_{h\times1}$

$w^2 = \begin{pmatrix} w^2_{11} & \cdots & w^2_{1h} \\ w^2_{21} & \cdots&w^2_{2h} \\ w^2_{31} & \cdots & w^2_{3h} \end{pmatrix}_{3\times h}$ $b^2 = \begin{pmatrix} b^2_1\\ b^2_2\\ b^2_3 \end{pmatrix}_{3\times1}$

step3.2 Forward propagation

def forward_propagation(X,parameters):
    w1 = parameters['w1']
    b1 = parameters['b1']
    w2 = parameters['w2']
    b2 = parameters['b2']
    
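    # Compute the output a2
    # np.dot is matrix multiplication. np.dot(w1, X) has shape (n_h, m) and b1 has shape (n_h, 1),
    # so NumPy broadcasting adds b1 to every column
    z1 = np.dot(w1,X)+b1
    a1 = np.tanh(z1)     # tanh is the activation function of the first layer
    z2 = np.dot(w2, a1) + b2
    # softmax is the activation function of the second layer (the usual choice for the output layer in multi-class tasks)
    a2 = np.exp(z2)/np.sum(np.exp(z2), axis=0)
    
    # Store the intermediate values in a dictionary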
    cache = {'z1': z1, 'a1': a1, 'z2': z2, 'a2': a2}
    
    return a2, cache

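As noted above, the iris dataset has 150 samples in total. In the formulas below, m is the number of samples (columns) in the matrix passed to the function, so m=112 for the training set from step2. The argument of forward_propagation satisfies $X\in\mathbb{R}^{4\times m}$ .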

$z^1 = w^1X+b^1 = \begin{pmatrix} z^1_{11}& z^1_{12}& \cdots & z^1_{1m}\\ z^1_{21}& z^1_{22} & \cdots & z^1_{2m} \\ \vdots& \vdots & &\vdots \\ z^1_{h1}& z^1_{h2} & \cdots & z^1_{hm} \end{pmatrix}_{h\times m}$

(Note that the + in front of $b^1$ means that $b^1$ is added to each column in turn.)
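
The column-wise addition can be checked on a toy example. The sketch below uses made-up shapes and values (3 hidden units, 2 features, 4 samples) and compares NumPy broadcasting with an explicit loop over the columns.

```python
import numpy as np

w1 = np.arange(6, dtype=float).reshape(3, 2)   # toy weights, shape (h=3, 2 features)
X = np.ones((2, 4))                            # 4 toy samples stored as columns
b1 = np.array([[10.0], [20.0], [30.0]])        # bias, shape (3, 1)

z1 = np.dot(w1, X) + b1                        # b1 is broadcast across the 4 columns

# The same result, adding b1 to each column explicitly
z1_loop = np.dot(w1, X)
for j in range(z1_loop.shape[1]):
    z1_loop[:, j] += b1[:, 0]

print(z1.shape)
print(np.array_equal(z1, z1_loop))
```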

$a^1 = \begin{pmatrix} a^1_{11}& a^1_{12}& \cdots & a^1_{1m}\\ a^1_{21}& a^1_{22} & \cdots & a^1_{2m} \\ \vdots& \vdots & &\vdots \\ a^1_{h1}& a^1_{h2} & \cdots & a^1_{hm} \end{pmatrix}_{h\times m} \\{~~~}= \begin{pmatrix} tanh(z^1_{11})& tanh(z^1_{12})& \cdots & tanh(z^1_{1m})\\ tanh(z^1_{21})& tanh(z^1_{22}) & \cdots & tanh(z^1_{2m}) \\ \vdots& \vdots & &\vdots \\ tanh(z^1_{h1})& tanh(z^1_{h2}) & \cdots & tanh(z^1_{hm}) \end{pmatrix}_{h\times m}$

$z^2 = w^2a^1+b^2 = \begin{pmatrix} z^2_{11}& z^2_{12}& \cdots & z^2_{1m}\\ z^2_{21}& z^2_{22}& \cdots & z^2_{2m} \\ z^2_{31}& z^2_{32}& \cdots & z^2_{3m} \end{pmatrix}_{3\times m}$

$a^2 = \begin{pmatrix} a^2_{11}& a^2_{12}& \cdots & a^2_{1m}\\ a^2_{21}& a^2_{22}& \cdots & a^2_{2m} \\ a^2_{31}& a^2_{32}& \cdots & a^2_{3m} \end{pmatrix}_{3\times m}$ , where $a^2_{ij} = \frac{exp(z^2_{ij})}{\sum^{3}_{k=1} exp(z^2_{kj})}$
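
One practical caveat with the softmax formula is that np.exp can overflow when the entries of $z^2$ are large. A common remedy is to subtract the column maximum before exponentiating; the common factor cancels in the ratio, so the result is unchanged. The sketch below is not part of this post's code, and the name `softmax_columns` is hypothetical.

```python
import numpy as np

def softmax_columns(z):
    """Column-wise softmax for a (classes, m) array (hypothetical helper)."""
    shifted = z - np.max(z, axis=0, keepdims=True)  # largest entry per column becomes 0
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

# On moderate inputs the shifted version agrees with the direct formula
z_small = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
naive = np.exp(z_small) / np.sum(np.exp(z_small), axis=0)
print(np.allclose(softmax_columns(z_small), naive))

# On large inputs the shifted version still returns finite probabilities
z_large = np.array([[1000.0], [1001.0], [1002.0]])
print(softmax_columns(z_large))
```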

step3.3 Compute the loss function

def compute_cost(a2, Y, parameters):
    m = Y.shape[1]      # the number of columns of Y is the number of samples
    
    # Use cross-entropy as the loss function
    cost = -np.sum(np.multiply(np.log(a2),Y))/m # np.multiply is elementwise multiplication

    return cost

$Y = \begin{pmatrix} y_{11}& y_{12}& \cdots & y_{1m}\\ y_{21}& y_{22}& \cdots & y_{2m} \\ y_{31}& y_{32}& \cdots & y_{3m} \end{pmatrix}_{3\times m}$

The cross-entropy loss for multi-class classification is:

${L} = -\frac{1}{m}\sum_{j=1}^{m}\sum_{i=1}^{3}y_{ij}\ln a^2_{ij}$
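
As a small worked example of this loss, the sketch below evaluates it on two made-up samples. Because each column of Y is one-hot, only the predicted probability of the true class contributes to the sum.

```python
import numpy as np

# Two toy samples (columns) with three classes (rows); the values are made up
a2 = np.array([[0.7, 0.1],
               [0.2, 0.8],
               [0.1, 0.1]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
m = Y.shape[1]

cost = -np.sum(Y * np.log(a2)) / m

# The same value computed from the true-class probabilities only
by_hand = -(np.log(0.7) + np.log(0.8)) / 2
print(cost, by_hand)
```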

step3.4 Compute the gradients

def backward_propagation(parameters, cache, X, Y):
    m = Y.shape[1]

    w2 = parameters['w2']

    a1 = cache['a1']
    a2 = cache['a2']

    # Backpropagation: compute dw1, db1, dw2, db2
    dz2 = (1 / m) *( a2 - Y)
    dw2 = np.dot(dz2, a1.T)
    db2 = np.sum(dz2, axis=1, keepdims=True)
    dz1 = np.multiply(np.dot(w2.T, dz2), 1 - np.power(a1, 2))
    dw1 = np.dot(dz1, X.T)
    db1 = np.sum(dz1, axis=1, keepdims=True)

    grads = {'dw1': dw1, 'db1': db1, 'dw2': dw2, 'db2': db2}

    return grads

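Following the CSDN blog post on the sigmoid, tanh and softmax functions and their derivatives (sigmoid函数、tanh函数、softmax函数及求导), and using the fact that each column of $Y$ is one-hot and therefore sums to 1, combining the softmax derivative with the cross-entropy loss gives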

$\frac{\partial L}{\partial z^2_{ij}} =\frac{1}{m}(a^2_{ij}-y_{ij})\\\Rightarrow \frac{\partial L}{\partial z^2} = \frac{1}{m}(a^2-Y) = \frac{1}{m}\begin{pmatrix} a^2_{11}-y_{11}& \cdots & a^2_{1m}-y_{1m}\\ a^2_{21}-y_{21}& \cdots & a^2_{2m}-y_{2m} \\ a^2_{31}-y_{31}& \cdots & a^2_{3m}-y_{3m} \end{pmatrix}_{3\times m}$
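
The formula above can be sanity-checked numerically. The sketch below compares $\frac{1}{m}(a^2-Y)$ with central finite differences of the loss on random toy data; the helper names `softmax_columns` and `loss_from_z2` are hypothetical and the data is generated only for this check.

```python
import numpy as np

def softmax_columns(z):
    """Column-wise softmax (hypothetical helper)."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def loss_from_z2(z2, Y):
    """Cross-entropy loss as a function of the pre-softmax values z2."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(softmax_columns(z2))) / m

rng = np.random.default_rng(0)
z2 = rng.normal(size=(3, 5))                   # 3 classes, 5 toy samples
Y = np.eye(3)[rng.integers(0, 3, size=5)].T    # random one-hot targets, shape (3, 5)
m = Y.shape[1]

analytic = (softmax_columns(z2) - Y) / m

# Central finite differences, one entry of z2 at a time
numeric = np.zeros_like(z2)
eps = 1e-6
for i in range(z2.shape[0]):
    for j in range(z2.shape[1]):
        step = np.zeros_like(z2)
        step[i, j] = eps
        numeric[i, j] = (loss_from_z2(z2 + step, Y) - loss_from_z2(z2 - step, Y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

The printed maximum difference should be tiny, limited only by floating-point error in the finite differences.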

Recall that

$z^2 = w^2a^1+b^2 = \begin{pmatrix} z^2_{11}& z^2_{12}& \cdots & z^2_{1m}\\ z^2_{21}& z^2_{22}& \cdots & z^2_{2m} \\ z^2_{31}& z^2_{32}& \cdots & z^2_{3m} \end{pmatrix}_{3\times m} \\{~~~~~~~~~~~~~~~~~~~}= \begin{pmatrix} w^2_{11}a^1_{11} + \cdots+w^2_{1h}a^1_{h1}+b^2_{11}& w^2_{11}a^1_{12} + \cdots+w^2_{1h}a^1_{h2}+b^2_{11}& \cdots &w^2_{11}a^1_{1m} + \cdots+w^2_{1h}a^1_{hm}+b^2_{11}\\ w^2_{21}a^1_{11} + \cdots+w^2_{2h}a^1_{h1}+b^2_{21}& w^2_{21}a^1_{12} + \cdots+w^2_{2h}a^1_{h2}+b^2_{21}& \cdots &w^2_{21}a^1_{1m} + \cdots+w^2_{2h}a^1_{hm}+b^2_{21}\\ w^2_{31}a^1_{11} + \cdots+w^2_{3h}a^1_{h1}+b^2_{31}& w^2_{31}a^1_{12} + \cdots+w^2_{3h}a^1_{h2}+b^2_{31}& \cdots &w^2_{31}a^1_{1m} + \cdots+w^2_{3h}a^1_{hm}+b^2_{31} \end{pmatrix}$

Therefore

$\frac{\partial L }{\partial w^2_{11}} = \frac{\partial L }{\partial z^2_{11}}\frac{\partial z^2_{11} }{\partial w^2_{11}} + \frac{\partial L }{\partial z^2_{12}}\frac{\partial z^2_{12} }{\partial w^2_{11}} + \cdots + \frac{\partial L }{\partial z^2_{1m}}\frac{\partial z^2_{1m} }{\partial w^2_{11}}\\{~~~~~~~~~}=\frac{1}{m}\bigl(\begin{smallmatrix} a^2_{11}-y_{11},& \cdots,& a^2_{1m}-y_{1m}\end{smallmatrix}\bigr)\begin{pmatrix} a^1_{11}\\ \vdots\\ a^1_{1m} \end{pmatrix}$

$\Rightarrow \frac{\partial L }{\partial w^2} = \frac{1}{m}\begin{pmatrix} a^2_{11}-y_{11}& \cdots & a^2_{1m}-y_{1m}\\ a^2_{21}-y_{21}& \cdots & a^2_{2m}-y_{2m} \\ a^2_{31}-y_{31}& \cdots & a^2_{3m}-y_{3m} \end{pmatrix}_{3\times m}\begin{pmatrix} a^1_{11}& a^1_{21}& \cdots & a^1_{h1}\\ a^1_{12}& a^1_{22} & \cdots & a^1_{h2} \\ \vdots& \vdots & &\vdots \\ a^1_{1m}& a^1_{2m} & \cdots & a^1_{hm} \end{pmatrix}_{m\times h} \\{~~~~~~~~}=\frac{\partial L }{\partial z^2}\cdot(a^1)^T$

Similarly,

$\frac{\partial L }{\partial b^2_{11}} = \frac{\partial L }{\partial z^2_{11}}\frac{\partial z^2_{11} }{\partial b^2_{11}} + \frac{\partial L }{\partial z^2_{12}}\frac{\partial z^2_{12} }{\partial b^2_{11}} + \cdots + \frac{\partial L }{\partial z^2_{1m}}\frac{\partial z^2_{1m} }{\partial b^2_{11}}\\{~~~~~~~~~}=\frac{1}{m}\bigl(\begin{smallmatrix} a^2_{11}-y_{11},& \cdots,& a^2_{1m}-y_{1m}\end{smallmatrix}\bigr)\begin{pmatrix} 1\\ \vdots\\ 1 \end{pmatrix}_{m\times 1}$

$\Rightarrow \frac{\partial L }{\partial b^2} = \frac{\partial L }{\partial z^2}\begin{pmatrix} 1\\ \vdots\\ 1 \end{pmatrix}_{m\times 1}$

Likewise,

$\frac{\partial L }{\partial a^1_{11}} = \frac{\partial L }{\partial z^2_{11}}\frac{\partial z^2_{11} }{\partial a^1_{11}} + \frac{\partial L }{\partial z^2_{21}}\frac{\partial z^2_{21} }{\partial a^1_{11}} + \frac{\partial L }{\partial z^2_{31}}\frac{\partial z^2_{31} }{\partial a^1_{11}}\\{~~~~~~~~~}=\bigl(\begin{smallmatrix} w^2_{11},& w^2_{21},& w^2_{31}\end{smallmatrix}\bigr)\begin{pmatrix} \frac{\partial L }{\partial z^2_{11}} \\ \frac{\partial L }{\partial z^2_{21}}\\ \frac{\partial L }{\partial z^2_{31}} \end{pmatrix}$

$\Rightarrow \frac{\partial L }{\partial a^1_{ij}} =\bigl(\begin{smallmatrix} w^2_{1i},& w^2_{2i},& w^2_{3i}\end{smallmatrix}\bigr)\begin{pmatrix} \frac{\partial L }{\partial z^2_{1j}} \\ \frac{\partial L }{\partial z^2_{2j}}\\ \frac{\partial L }{\partial z^2_{3j}} \end{pmatrix}$

$\Rightarrow \frac{\partial L}{\partial a^1}= \begin{pmatrix} w^2_{11} & w^2_{21}& w^2_{31} \\ \vdots& & \vdots \\ w^2_{1h} & w^2_{2h}& w^2_{3h} \end{pmatrix}_{h\times 3}\begin{pmatrix} \frac{\partial L}{\partial z^2_{11}} & \cdots& \frac{\partial L}{\partial z^2_{1m}} \\ \frac{\partial L}{\partial z^2_{21}} & \cdots& \frac{\partial L}{\partial z^2_{2m}} \\ \frac{\partial L}{\partial z^2_{31}} & \cdots& \frac{\partial L}{\partial z^2_{3m}} \end{pmatrix}_{3\times m} \\{~~~~}= (w^2)^T \cdot \frac{\partial L}{\partial z^2}$

Next, using the derivative of tanh from the CSDN blog post on the sigmoid, tanh and softmax functions and their derivatives (sigmoid函数、tanh函数、softmax函数及求导), namely $\frac{d}{dz}tanh(z) = 1-tanh^2(z)$,

together with $a^1 = \begin{pmatrix} a^1_{11}& a^1_{12}& \cdots & a^1_{1m}\\ a^1_{21}& a^1_{22} & \cdots & a^1_{2m} \\ \vdots& \vdots & &\vdots \\ a^1_{h1}& a^1_{h2} & \cdots & a^1_{hm} \end{pmatrix}_{h\times m} \\{~~~}= \begin{pmatrix} tanh(z^1_{11})& tanh(z^1_{12})& \cdots & tanh(z^1_{1m})\\ tanh(z^1_{21})& tanh(z^1_{22}) & \cdots & tanh(z^1_{2m}) \\ \vdots& \vdots & &\vdots \\ tanh(z^1_{h1})& tanh(z^1_{h2}) & \cdots & tanh(z^1_{hm}) \end{pmatrix}_{h\times m}$

we obtain

$\frac{\partial a^1}{\partial z^1} = \begin{pmatrix} 1-(a^1_{11})^2& 1-(a^1_{12})^2 & \cdots & 1-(a^1_{1m})^2 \\ 1-(a^1_{21})^2& 1-(a^1_{22})^2 & \cdots & 1-(a^1_{2m})^2 \\ \vdots & \vdots & & \vdots\\ 1-(a^1_{h1})^2& 1-(a^1_{h2})^2 & \cdots & 1-(a^1_{hm})^2 \end{pmatrix}_{h\times m} \\{~~~~}= \begin{pmatrix} 1 &\cdots & 1\\ \vdots& &\vdots \\ 1& \cdots & 1 \end{pmatrix}_{h\times m} - (a^1)^2$

Hence, with the square in $(a^1)^2$ taken elementwise,

$\frac{\partial L }{\partial z^1} =\frac{\partial L }{\partial a^1} \odot \frac{\partial a^1 }{\partial z^1}$ ( $\odot$ denotes the Hadamard product, i.e. elementwise multiplication of matrices)

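The formulas for $\frac{\partial L}{\partial w^1}$ and $\frac{\partial L}{\partial b^1}$ follow by the same argument as for $\frac{\partial L}{\partial w^2}$ and $\frac{\partial L}{\partial b^2}$ , with $X$ in place of $a^1$ and $\frac{\partial L}{\partial z^1}$ in place of $\frac{\partial L}{\partial z^2}$ : $\frac{\partial L}{\partial w^1} = \frac{\partial L}{\partial z^1}\cdot X^T$ and $\frac{\partial L}{\partial b^1} = \frac{\partial L}{\partial z^1}\begin{pmatrix} 1\\ \vdots\\ 1 \end{pmatrix}_{m\times 1}$ , which are dw1 and db1 in the code. The details are omitted.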
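
To double-check the whole derivation, the four analytic gradients can be compared against central finite differences of the loss. The sketch below re-implements the forward and backward formulas in compact form on random toy data; the function names, the toy sizes and the step `eps` are assumptions made for this check and are not part of the post's code.

```python
import numpy as np

def forward(params, X):
    """Forward pass: tanh hidden layer followed by a column-wise softmax."""
    z1 = params['w1'] @ X + params['b1']
    a1 = np.tanh(z1)
    z2 = params['w2'] @ a1 + params['b2']
    e = np.exp(z2 - z2.max(axis=0, keepdims=True))
    return a1, e / e.sum(axis=0, keepdims=True)

def loss(params, X, Y):
    """Cross-entropy loss averaged over the m columns of Y."""
    _, a2 = forward(params, X)
    return -np.sum(Y * np.log(a2)) / Y.shape[1]

def analytic_grads(params, X, Y):
    """Gradients from the formulas derived in this section."""
    m = Y.shape[1]
    a1, a2 = forward(params, X)
    dz2 = (a2 - Y) / m
    dz1 = (params['w2'].T @ dz2) * (1 - a1 ** 2)
    return {'w1': dz1 @ X.T, 'b1': dz1.sum(axis=1, keepdims=True),
            'w2': dz2 @ a1.T, 'b2': dz2.sum(axis=1, keepdims=True)}

def numeric_grad(params, name, X, Y, eps=1e-6):
    """Central finite differences with respect to every entry of params[name]."""
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(grad.shape):
        old = params[name][idx]
        params[name][idx] = old + eps
        plus = loss(params, X, Y)
        params[name][idx] = old - eps
        minus = loss(params, X, Y)
        params[name][idx] = old          # restore the original value
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
n_x, n_h, n_y, m = 4, 5, 3, 6            # toy sizes chosen for the check
params = {'w1': rng.normal(size=(n_h, n_x)), 'b1': rng.normal(size=(n_h, 1)),
          'w2': rng.normal(size=(n_y, n_h)), 'b2': rng.normal(size=(n_y, 1))}
X = rng.normal(size=(n_x, m))
Y = np.eye(n_y)[rng.integers(0, n_y, size=m)].T

grads = analytic_grads(params, X, Y)
max_errors = {k: float(np.max(np.abs(grads[k] - numeric_grad(params, k, X, Y))))
              for k in params}
print(max_errors)
```

If the derivation is right, every entry of `max_errors` is close to zero; a large value for one key points to a mistake in the corresponding formula.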

step3.5 Update the parameters

def update_parameters(parameters, grads, learning_rate=0.4):
    w1 = parameters['w1']
    b1 = parameters['b1']
    w2 = parameters['w2']
    b2 = parameters['b2']

    dw1 = grads['dw1']
    db1 = grads['db1']
    dw2 = grads['dw2']
    db2 = grads['db2']

    # Update the parameters
    w1 = w1 - dw1 * learning_rate
    b1 = b1 - db1 * learning_rate
    w2 = w2 - dw2 * learning_rate
    b2 = b2 - db2 * learning_rate

    parameters = {'w1': w1, 'b1': b1, 'w2': w2, 'b2': b2}

    return parameters

step3.6 Model evaluation

def predict(parameters, x_test, y_test):
    w1 = parameters['w1']
    b1 = parameters['b1']
    w2 = parameters['w2']
    b2 = parameters['b2']
    
    z1 = np.dot(w1, x_test) + b1
    a1 = np.tanh(z1)
    z2 = np.dot(w2, a1) + b2
    a2 = np.exp(z2)/np.sum(np.exp(z2), axis=0)
    
    # Dimensions of the result
    n_rows = y_test.shape[0]
    n_cols = y_test.shape[1]
    
    # Storage for the predictions
    output = np.empty(shape=(n_rows, n_cols), dtype=int)
    
    for i in range(n_rows):
        for j in range(n_cols):
            if a2[i][j] > 0.5:
                output[i][j] = 1
            else:
                output[i][j] = 0
   
    print('Predictions:')
    print(output)
    print('Ground truth:')
    print(y_test)
    
    count = 0
    for k in range(0, n_cols):
        if output[0][k] == y_test[0][k] and output[1][k] == y_test[1][k] and output[2][k] == y_test[2][k]:
            count = count + 1
        else:
            print(k)  # index of a misclassified sample

    acc = count / int(y_test.shape[1]) * 100
    print('Accuracy: %.2f%%' % acc)
    
    return output
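
Note that the 0.5 threshold in predict leaves a column of all zeros whenever no class probability exceeds 0.5. A common alternative is to pick the class with the largest probability using np.argmax. The sketch below illustrates this on made-up probabilities; `predict_labels` and `accuracy` are hypothetical helpers and are not what produced the results reported in this post.

```python
import numpy as np

def predict_labels(a2):
    """Return the index of the most probable class for each column of a2."""
    return np.argmax(a2, axis=0)

def accuracy(a2, Y):
    """Fraction of columns whose predicted class matches the one-hot target Y."""
    return float(np.mean(predict_labels(a2) == np.argmax(Y, axis=0)))

# Made-up softmax outputs for 3 samples; in the first column no probability exceeds 0.5
a2_demo = np.array([[0.45, 0.1, 0.2],
                    [0.35, 0.7, 0.3],
                    [0.20, 0.2, 0.5]])
Y_demo = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [0, 0, 0]])
print(predict_labels(a2_demo))
print(accuracy(a2_demo, Y_demo))
```

With argmax, the first sample is still assigned to class 0 even though its largest probability is only 0.45.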

step3.7 Assemble the neural network

def nn_model(X, Y, n_h, n_input, n_output, num_iterations=10000, print_cost=False):
    np.random.seed(3)

    n_x = n_input           # number of input-layer nodes
    n_y = n_output          # number of output-layer nodes

    # 1. Initialize the parameters
    parameters = initialize_parameters(n_x, n_h, n_y)

    # Gradient descent loop
    for i in range(0, num_iterations):
        # 2. Forward propagation
        a2, cache = forward_propagation(X, parameters)
        # 3. Compute the cost function
        cost = compute_cost(a2, Y, parameters)
        # 4. Backpropagation
        grads = backward_propagation(parameters, cache, X, Y)
        # 5. Update the parameters
        parameters = update_parameters(parameters, grads)

        # Print the cost and the parameter values
        if print_cost: 
            print('Iteration %i, cost: %f' % (i, cost))
            print(parameters)

    return parameters

step4 Train the model

# Prepare the training data (transpose X and one-hot encode the labels)
train_sample_sum = len(y_train)

X_train = X_train.T

Y_train = np.zeros(shape=(train_sample_sum, 3))
for i in range(train_sample_sum):
    Y_train[i][y_train[i]] = 1 
    
Y_train = Y_train.T

# Prepare the test data (transpose X and one-hot encode the labels)
test_sample_sum = len(y_test)

X_test = X_test.T

Y_test = np.zeros(shape=(test_sample_sum, 3))
for i in range(test_sample_sum):
    Y_test[i][y_test[i]] = 1 
    
Y_test = Y_test.T

# Start training
start_time = datetime.datetime.now()
# 4 input nodes, 10 hidden nodes, 3 output nodes, 1000 iterations
parameters = nn_model(X_train,
                      Y_train,
                      n_h=10,
                      n_input=4,
                      n_output=3,
                      num_iterations=1000,
                      print_cost=True)

end_time = datetime.datetime.now()
print("Elapsed: " + str((end_time - start_time).seconds) + 's' + str(round((end_time - start_time).microseconds / 1000)) + 'ms')

result = predict(parameters, X_test, Y_test)
--------------------------------------------

Predictions:
[[0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0
  1 0]
 [0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1
  0 0]
 [1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0
  0 1]]
Ground truth:
[[0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1.
  0. 1. 1. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 0.]]
31
37
Accuracy: 94.74%