Part 1: Backpropagation in a Simple Two-Layer Neural Network
The code below, taken from the PyTorch tutorial, is a NumPy implementation of a two-layer fully connected neural network with ReLU activation. It includes the forward pass, the backward pass that computes the gradients, and the weight-update step:
# -*- coding: utf-8 -*-
import numpy as np
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10
# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)
learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
Here we are mainly interested in the backpropagation step; the core code is:
h = x.dot(w1) # 64 x 100
h_relu = np.maximum(h, 0) # 64 x 100
y_pred = h_relu.dot(w2) # 64 x 10
loss = np.square(y_pred - y).sum() # scalar
grad_y_pred = 2.0 * (y_pred - y) # 64 x 10
grad_w2 = h_relu.T.dot(grad_y_pred) # 100 x 10
grad_h_relu = grad_y_pred.dot(w2.T) # 64 x 100
grad_h = grad_h_relu.copy() # 64 x 100
grad_h[h < 0] = 0 # 64 x 100
grad_w1 = x.T.dot(grad_h) # 1000 x 100
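Before deriving each of these gradients analytically, note that they can all be checked numerically with central differences. Below is a minimal sketch of such a gradient check, under my own choices (tiny dimensions so the entry-by-entry loop runs fast, and helper names like loss_fn and numerical_grad that are not from the tutorial):

import numpy as np

np.random.seed(0)
N, D_in, H, D_out = 4, 5, 6, 3  # tiny sizes so the entry-by-entry check is fast

x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

def loss_fn():
    # Same forward pass as above; reads x, y, w1, w2 from the enclosing scope
    h_relu = np.maximum(x.dot(w1), 0)
    y_pred = h_relu.dot(w2)
    return np.square(y_pred - y).sum()

# Analytic gradients, exactly as in the backprop code above
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)
grad_y_pred = 2.0 * (y_pred - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h_relu = grad_y_pred.dot(w2.T)
grad_h = grad_h_relu.copy()
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numerical_grad(w, eps=1e-6):
    # Central differences: perturb each entry of w in place, re-run the forward pass
    num = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = w[idx]
        w[idx] = old + eps
        loss_plus = loss_fn()
        w[idx] = old - eps
        loss_minus = loss_fn()
        w[idx] = old  # restore the original entry
        num[idx] = (loss_plus - loss_minus) / (2 * eps)
        it.iternext()
    return num

for name, w, g in [('w1', w1, grad_w1), ('w2', w2, grad_w2)]:
    diff = np.abs(numerical_grad(w) - g).max()
    print(name, 'max abs difference:', diff)

If the analytic gradients are correct, the printed differences should be tiny (on the order of floating-point noise); a large discrepancy would point to a bug in the backprop code.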
0. Analyzing the relationships between variables
First, draw the dependency graph. The variables are related as follows:
$$
\begin{aligned}
h &= x \cdot w_1 \\
h\_relu &= \mathrm{ReLU}(h) \\
y\_pred &= h\_relu \cdot w_2 \\
loss &= \sum (y\_pred - y)^2 = \mathrm{sse}(y\_pred)
\end{aligned}
$$
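Read bottom-up, this chain says the loss is a single composition, loss = sse(ReLU(x·w1)·w2), and backpropagation is just the chain rule applied to that composition from the outside in, as the following subsections derive step by step. As a sketch, the whole forward pass collapses into one function (loss_fn is an illustrative name, not from the tutorial):

import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

def loss_fn(w1, w2):
    # loss = sse(ReLU(x . w1) . w2): the entire forward pass as one composition
    return np.square(np.maximum(x.dot(w1), 0).dot(w2) - y).sum()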
1. grad_y_pred: derivative of a scalar-valued function with respect to a matrix
$$\frac{\partial\, loss}{\partial\, y\_pred} = \frac{\partial \sum_{ij} (y\_pred_{ij} - y_{ij})^2}{\partial\, y\_pred}$$
Here loss is a scalar and y_pred is a matrix. By the rule for differentiating a scalar-valued function with respect to a matrix (see Part 2 of this article), and renaming the summation indices to kl so they do not clash with the entry ij being differentiated, we have:
$$
\begin{aligned}
\left(\frac{\partial\, loss}{\partial\, y\_pred}\right)_{ij}
&= \frac{\partial\, loss}{\partial\, y\_pred_{ij}} \\
&= \frac{\partial \sum_{kl} (y\_pred_{kl} - y_{kl})^2}{\partial\, y\_pred_{ij}} \\
&= \frac{\partial\, (y\_pred_{ij} - y_{ij})^2}{\partial\, y\_pred_{ij}} \\
&= 2\,(y\_pred_{ij} - y_{ij})
\end{aligned}
$$
Therefore:
$$grad\_y\_pred = \frac{\partial\, loss}{\partial\, y\_pred} = 2 \cdot (y\_pred - y)$$
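This elementwise formula is easy to confirm numerically: perturb one entry of y_pred and watch how the sum of squared errors changes. A minimal sketch (the array sizes and the index choice are arbitrary, picked purely for illustration):

import numpy as np

np.random.seed(0)
y = np.random.randn(4, 3)
y_pred = np.random.randn(4, 3)
eps = 1e-6

def sse(p):
    return np.square(p - y).sum()

# Central difference of the loss with respect to the (i, j) entry of y_pred
i, j = 1, 2
p_plus, p_minus = y_pred.copy(), y_pred.copy()
p_plus[i, j] += eps
p_minus[i, j] -= eps
numeric = (sse(p_plus) - sse(p_minus)) / (2 * eps)

# Analytic value from the formula just derived
analytic = 2.0 * (y_pred[i, j] - y[i, j])
print(numeric, analytic)  # the two values should agree closely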
2. grad_w2: derivative of a linear transformation
Let $f(Y): \mathbb{R}^{m \times p} \to \mathbb{R}$ be a scalar-valued function, and let $X \mapsto Y = AX + B: \mathbb{R}^{n \times p} \to \mathbb{R}^{m \times p}$ be a linear map, where $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{m \times p}$. Then: