Deep Learning: Derivation of the Backpropagation Algorithm

Preface

I previously wrote about single-layer feedforward neural networks, but the derivation there was specific to the sigmoid function. In this post, the backpropagation algorithm is derived using matrix-vector calculus.


Notation

| Symbol | Meaning |
| --- | --- |
| $S_{in}^i$ | Input to the neurons of layer $i$; if the layer has $n$ neurons, $S_{in}^i$ is an $n \times 1$ vector |
| $S_{out}^i$ | Output of the neurons of layer $i$; if the layer has $n$ neurons, $S_{out}^i$ is an $n \times 1$ vector |
| $W^i$ | Weight matrix of layer $i$; if layer $i-1$ has $m$ neurons and layer $i$ has $n$ neurons, $W^i$ is an $n \times m$ matrix |
| $B^i$ | Bias vector of layer $i$; if the layer has $n$ neurons, $B^i$ is an $n \times 1$ vector |
| $cost$ | Value of the loss function |

Let $x$ denote $\begin{bmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{bmatrix}$. The elementwise activation of layer $i$, written $f^i(x)$, denotes $\begin{bmatrix} f(x_1)\\ f(x_2)\\ \vdots\\ f(x_n) \end{bmatrix}$, where $f(x)$ is the activation function, and $(f^i(x))'$ denotes $\begin{bmatrix} \frac{\partial f(x_1)}{\partial x_1}\\ \frac{\partial f(x_2)}{\partial x_2}\\ \vdots\\ \frac{\partial f(x_n)}{\partial x_n} \end{bmatrix}$.
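As a concrete illustration of the elementwise notation (a minimal NumPy sketch; the sigmoid choice here is only an example, since the derivation in this post is activation-agnostic):

```python
import numpy as np

def f(x):
    # Example activation: sigmoid, applied elementwise to a column vector
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    # Elementwise derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
    s = f(x)
    return s * (1.0 - s)

x = np.array([[0.0], [1.0], [-1.0]])  # an n*1 column vector with n = 3
print(f(x).shape)        # (3, 1): f^i(x) is again an n*1 vector
print(f_prime(x).shape)  # (3, 1): (f^i(x))' is also an n*1 vector
```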


With the notation above, for layer $i$ we have
$$
\begin{aligned}
S_{out}^{i-1}&=f^{i-1}(S_{in}^{i-1})\\
S_{in}^i&=W^iS_{out}^{i-1}+B^i
\end{aligned}
$$
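One step of this forward recurrence can be sketched in NumPy as follows (the dimensions and the sigmoid activation are illustrative assumptions, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer i-1 has m = 3 neurons, layer i has n = 2 neurons
m, n = 3, 2
S_in_prev = rng.standard_normal((m, 1))  # S_in^{i-1}: m*1
W = rng.standard_normal((n, m))          # W^i: n*m
B = rng.standard_normal((n, 1))          # B^i: n*1

S_out_prev = sigmoid(S_in_prev)          # S_out^{i-1} = f^{i-1}(S_in^{i-1})
S_in = W @ S_out_prev + B                # S_in^i = W^i S_out^{i-1} + B^i
print(S_in.shape)                        # (2, 1)
```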


The chain rule for the derivative of a scalar with respect to a vector

For an $n$-layer feedforward network, we have
$$cost \leftarrow S_{in}^n \leftarrow S_{in}^{n-1} \leftarrow \cdots \leftarrow S_{in}^1$$
where each left arrow denotes a mapping; for a feedforward network this mapping is
$$S_{in}^{i+1}=W^{i+1}f^{i}(S_{in}^{i})+B^{i+1}$$
The mapping from the last layer to the loss depends on the type of loss function (e.g. mean squared error, cross-entropy). On top of these mappings, the chain rule for the derivative of a scalar with respect to a vector gives
$$
\begin{aligned}
\frac{\partial cost}{\partial S_{in}^i}
&=\left(\frac{\partial S_{in}^n}{\partial S_{in}^{n-1}}\,\frac{\partial S_{in}^{n-1}}{\partial S_{in}^{n-2}}\cdots\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T\frac{\partial cost}{\partial S_{in}^n}\\
&=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T\cdots\left(\frac{\partial S_{in}^{n-1}}{\partial S_{in}^{n-2}}\right)^T\left(\frac{\partial S_{in}^n}{\partial S_{in}^{n-1}}\right)^T\frac{\partial cost}{\partial S_{in}^n}\\
&=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T\cdots\left(\frac{\partial S_{in}^{n-1}}{\partial S_{in}^{n-2}}\right)^T\frac{\partial cost}{\partial S_{in}^{n-1}}\\
&=\cdots\\
&=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T\frac{\partial cost}{\partial S_{in}^{i+1}}
\end{aligned}
$$


A common vector-by-vector derivative formula

For $Y=AX+B$, where $Y$, $X$, and $B$ are vectors and $A$ is a matrix, in numerator layout we have $\frac{\partial Y}{\partial X}=A$.
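This identity is easy to check numerically (a minimal sketch; in numerator layout the Jacobian has shape (dim $Y$, dim $X$)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
A = rng.standard_normal((n, m))
B = rng.standard_normal((n, 1))
X = rng.standard_normal((m, 1))

def Y(X):
    return A @ X + B

# Finite-difference Jacobian dY/dX in numerator layout: entry (j, k) = dY_j / dX_k
eps = 1e-6
J = np.zeros((n, m))
for k in range(m):
    dX = np.zeros((m, 1))
    dX[k, 0] = eps
    J[:, k] = ((Y(X + dX) - Y(X - dX)) / (2 * eps)).ravel()

print(np.allclose(J, A, atol=1e-6))  # True: dY/dX = A
```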


Derivation of the backpropagation algorithm

Suppose we have an $n$-layer feedforward network. The gradient at layer $i$ is
$$
\begin{aligned}
\frac{\partial cost}{\partial S_{in}^i}
&=\left(\frac{\partial S_{in}^{i+1}}{\partial S_{in}^{i}}\right)^T\frac{\partial cost}{\partial S_{in}^{i+1}}\\
&=(W^{i+1})^T\frac{\partial cost}{\partial S_{in}^{i+1}}\odot (f^{i}(S_{in}^i))'
\end{aligned}\tag{1}
$$
Here $\odot$ is the Hadamard product: elementwise multiplication of matrices or vectors of the same shape. To see why the last step holds, suppose layer $i$ has $n$ neurons; then
$$
\begin{aligned}
(W^{i+1})^T\frac{\partial cost}{\partial S_{in}^{i+1}}\odot (f^{i}(S_{in}^i))'
&=\left(\frac{\partial S_{in}^{i+1}}{\partial f(S_{in}^{i})}\right)^T\frac{\partial cost}{\partial S_{in}^{i+1}}\odot (f^{i}(S_{in}^i))'\\
&=\frac{\partial cost}{\partial f(S_{in}^{i})}\odot (f^{i}(S_{in}^i))'\\
&=\begin{bmatrix} \frac{\partial cost}{\partial f((S_{in}^{i})_1)}\\ \frac{\partial cost}{\partial f((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial cost}{\partial f((S_{in}^{i})_n)} \end{bmatrix}\odot (f^{i}(S_{in}^i))'\\
&=\begin{bmatrix} \frac{\partial cost}{\partial f((S_{in}^{i})_1)}\\ \frac{\partial cost}{\partial f((S_{in}^{i})_2)}\\ \vdots\\ \frac{\partial cost}{\partial f((S_{in}^{i})_n)} \end{bmatrix}\odot \begin{bmatrix} \frac{\partial f((S_{in}^{i})_1)}{\partial (S_{in}^{i})_1}\\ \frac{\partial f((S_{in}^{i})_2)}{\partial (S_{in}^{i})_2}\\ \vdots\\ \frac{\partial f((S_{in}^{i})_n)}{\partial (S_{in}^{i})_n} \end{bmatrix}\\
&=\begin{bmatrix} \frac{\partial cost}{\partial f((S_{in}^{i})_1)}\frac{\partial f((S_{in}^{i})_1)}{\partial (S_{in}^{i})_1}\\ \frac{\partial cost}{\partial f((S_{in}^{i})_2)}\frac{\partial f((S_{in}^{i})_2)}{\partial (S_{in}^{i})_2}\\ \vdots\\ \frac{\partial cost}{\partial f((S_{in}^{i})_n)}\frac{\partial f((S_{in}^{i})_n)}{\partial (S_{in}^{i})_n} \end{bmatrix}
=\begin{bmatrix} \frac{\partial cost}{\partial (S_{in}^{i})_1}\\ \frac{\partial cost}{\partial (S_{in}^{i})_2}\\ \vdots\\ \frac{\partial cost}{\partial (S_{in}^{i})_n} \end{bmatrix}
=\frac{\partial cost}{\partial S_{in}^i}
\end{aligned}
$$
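Equation (1) can be verified numerically on a toy layer: compute $(W^{i+1})^T \delta^{i+1} \odot (f^i(S_{in}^i))'$ and compare it with a finite-difference gradient of the cost with respect to $S_{in}^i$. This is a sketch under assumed dimensions, a sigmoid activation, and a squared-error cost attached directly after layer $i+1$:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_i, n_next = 4, 3                            # neurons in layers i and i+1
W_next = rng.standard_normal((n_next, n_i))   # W^{i+1}
B_next = rng.standard_normal((n_next, 1))     # B^{i+1}
target = rng.standard_normal((n_next, 1))

def cost_of(S_in_i):
    # S_in^{i+1} = W^{i+1} f^i(S_in^i) + B^{i+1};  cost = 0.5 ||S_in^{i+1} - target||^2
    S_in_next = W_next @ sigmoid(S_in_i) + B_next
    return 0.5 * np.sum((S_in_next - target) ** 2)

S_in_i = rng.standard_normal((n_i, 1))

# delta^{i+1} = d cost / d S_in^{i+1} for this squared-error cost
delta_next = W_next @ sigmoid(S_in_i) + B_next - target

# Equation (1): delta^i = (W^{i+1})^T delta^{i+1} ⊙ (f^i(S_in^i))'
s = sigmoid(S_in_i)
delta_i = (W_next.T @ delta_next) * (s * (1 - s))

# Central-difference gradient for comparison
eps = 1e-6
num = np.zeros_like(S_in_i)
for k in range(n_i):
    d = np.zeros_like(S_in_i)
    d[k, 0] = eps
    num[k, 0] = (cost_of(S_in_i + d) - cost_of(S_in_i - d)) / (2 * eps)

print(np.allclose(delta_i, num, atol=1e-6))  # True
```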
Next come the gradients used for the weight updates. Once the gradient at layer $i$ has been derived, the gradients with respect to the weights and biases follow from the definition of the matrix derivative:
$$\frac{\partial cost}{\partial W^i}=\frac{\partial cost}{\partial S_{in}^i}(S_{out}^{i-1})^T\tag{2}$$
$$\frac{\partial cost}{\partial B^i}=\frac{\partial cost}{\partial S_{in}^i}\tag{3}$$
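Equations (2) and (3) can likewise be checked on a toy layer (a sketch; the dimensions, sigmoid input, and squared-error cost are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, n = 3, 2                                        # layer i-1 has m neurons, layer i has n
S_out_prev = sigmoid(rng.standard_normal((m, 1)))  # S_out^{i-1}
W = rng.standard_normal((n, m))                    # W^i
B = rng.standard_normal((n, 1))                    # B^i
target = rng.standard_normal((n, 1))

def cost_of(W, B):
    S_in = W @ S_out_prev + B                      # S_in^i
    return 0.5 * np.sum((S_in - target) ** 2)

# delta^i = d cost / d S_in^i for this squared-error cost
delta = W @ S_out_prev + B - target

dW = delta @ S_out_prev.T                          # Equation (2)
dB = delta                                         # Equation (3)

# Central-difference check of one weight entry and one bias entry
eps = 1e-6
E = np.zeros_like(W); E[0, 1] = eps
num_w = (cost_of(W + E, B) - cost_of(W - E, B)) / (2 * eps)
e = np.zeros_like(B); e[1, 0] = eps
num_b = (cost_of(W, B + e) - cost_of(W, B - e)) / (2 * eps)

print(np.isclose(dW[0, 1], num_w), np.isclose(dB[1, 0], num_b))  # True True
```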
$\frac{\partial cost}{\partial S_{in}^n}$ must be worked out by hand from the definition of the matrix derivative, since it depends on the chosen loss function. Once it is known, the gradients of all parameters follow from Equations (1), (2), and (3). For the definition-based approach to matrix differentiation, see the linked post.

Appendix: formula summary and example code

Backpropagation is a common algorithm for training neural networks. It computes the gradient of the loss function with respect to each parameter and updates the parameters by gradient descent. Below is a summary of the formulas followed by example code.

1. Formula summary

Define the loss function L, computed from the difference between the network output and the true labels. Suppose the network has several hidden layers, each with parameters W and b.

1.1 Forward pass. With input x, writing a[l] for the output of layer l and g() for the activation function:

```
a[0] = x
z[l] = W[l] * a[l-1] + b[l]
a[l] = g(z[l])
```

1.2 Backward pass. For an L-layer network, compute the gradient of the loss with respect to each parameter, then update by gradient descent:

```
dz[L] = dL/da[L] * g'(z[L])              # output layer
dz[l] = (W[l+1]^T * dz[l+1]) * g'(z[l])  # hidden layers
dW[l] = dz[l] * a[l-1]^T                 # parameter gradients
db[l] = dz[l]
W[l] = W[l] - learning_rate * dW[l]      # updates
b[l] = b[l] - learning_rate * db[l]
```

where dL/da[L] is the derivative of the loss with respect to the output-layer activation and g'() is the derivative of the activation function.

2. Example code

```python
import numpy as np

# Network shape: three hidden layers (input_size, X_train, Y_train are placeholders)
input_size = 4
hidden_layers = [10, 20, 30]
output_size = 2

layers_dims = [input_size] + hidden_layers + [output_size]
L = len(layers_dims) - 1  # number of weight layers

# Initialize parameters
def initialize_parameters():
    parameters = {}
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

# Forward propagation; each cache stores (z[l], a[l-1])
def forward_propagation(X, parameters):
    caches = []
    A = X
    for l in range(1, L):
        A_prev = A
        Z = np.dot(parameters['W' + str(l)], A_prev) + parameters['b' + str(l)]
        A = sigmoid(Z)
        caches.append((Z, A_prev))
    Z = np.dot(parameters['W' + str(L)], A) + parameters['b' + str(L)]
    AL = softmax(Z)
    caches.append((Z, A))
    return AL, caches

# Backward propagation (softmax output with cross-entropy loss, so dz[L] = AL - Y)
def backward_propagation(AL, Y, caches, parameters):
    grads = {}
    m = AL.shape[1]
    dZ = AL - Y
    grads['dW' + str(L)] = np.dot(dZ, caches[L-1][1].T) / m
    grads['db' + str(L)] = np.sum(dZ, axis=1, keepdims=True) / m
    for l in reversed(range(1, L)):
        dA = np.dot(parameters['W' + str(l+1)].T, dZ)
        dZ = dA * sigmoid_derivative(caches[l-1][0])
        grads['dW' + str(l)] = np.dot(dZ, caches[l-1][1].T) / m
        grads['db' + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
    return grads

# Parameter update
def update_parameters(parameters, grads, learning_rate):
    for l in range(1, L + 1):
        parameters['W' + str(l)] -= learning_rate * grads['dW' + str(l)]
        parameters['b' + str(l)] -= learning_rate * grads['db' + str(l)]
    return parameters

# Train the model
def train_model(X, Y, learning_rate, num_iterations):
    parameters = initialize_parameters()
    for i in range(num_iterations):
        AL, caches = forward_propagation(X, parameters)
        cost = compute_cost(AL, Y)
        grads = backward_propagation(AL, Y, caches, parameters)
        parameters = update_parameters(parameters, grads, learning_rate)
        if i % 100 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
    return parameters

# Usage example
# parameters = train_model(X_train, Y_train, learning_rate=0.01, num_iterations=1000)
```

This is a simple example of the backpropagation algorithm; the sigmoid(), softmax(), sigmoid_derivative(), and compute_cost() functions need to be implemented to suit the task, and input_size, X_train, and Y_train are placeholders for the actual data.
