Andrew Ng's "Neural Networks and Deep Learning": Building a Deep Neural Network
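In the previous tutorial we trained a two-layer neural network (with a single hidden layer). In this article we will learn to build a deep neural network with any number of layers and implement every function needed to do so.
Skills you will have after this article:
∙ Use non-linear units such as ReLU to improve the model
∙ Build a deeper neural network (with more than one hidden layer)
∙ Implement an easy-to-use neural network class
Materials used in this article: [Click to download], extraction code: hwwc. Download the materials before you start and unzip them into the same directory as your code. Make sure that directory contains dnn_utils.py, testCases.py and lr_utils.py.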
【Notation】:
∙ The superscript $[l]$ denotes a quantity associated with the $l^{th}$ layer.
- Example: $a^{[L]}$ is the activation of the $L^{th}$ layer. $W^{[L]}$ and $b^{[L]}$ are the parameters of the $L^{th}$ layer.
∙ The superscript $(i)$ denotes a quantity associated with the $i^{th}$ example.
- Example: $x^{(i)}$ is the $i^{th}$ training example.
∙ The subscript $i$ denotes the $i^{th}$ entry of a vector.
- Example: $a_{i}^{[l]}$ is the $i^{th}$ entry of the activations of the $l^{th}$ layer.
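1 Packages
Before we start, we need to import a few packages:
import numpy as np
import h5py
import matplotlib.pyplot as plt
import testCases # from the materials package
from dnn_utils import sigmoid, sigmoid_backward, relu, relu_backward # from the materials package
import lr_utils # from the materials package
To reproduce the results shown here, set the random seed.
np.random.seed(1)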
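2 Outline of the deep neural network
To build a deep neural network we need to implement several helper functions. They will be used in the next article, "Deep Neural Network Application: Image Classification", to build a two-layer neural network and an L-layer neural network.
The steps are as follows:
∙ Initialize the parameters of a two-layer network and of an $L$-layer network.
∙ Implement the forward propagation module (shown in purple in the figure below).
- Complete the LINEAR part of a layer's forward propagation step (producing $Z^{[l]}$).
- Use the provided ACTIVATION functions (relu / sigmoid).
- Combine the previous two steps into a new [LINEAR -> ACTIVATION] forward function.
- Stack the [LINEAR -> RELU] forward function L-1 times (layers 1 through L-1) and append a [LINEAR -> SIGMOID] at the end (the final layer). This gives a new L_model_forward function.
∙ Compute the loss.
∙ Implement the backward propagation module (shown in red in the figure below).
- Complete the LINEAR part of a layer's backward propagation step.
- Use the provided gradients of the ACTIVATION functions (relu_backward / sigmoid_backward).
- Combine the previous two steps into a new [LINEAR -> ACTIVATION] backward function.
- Stack [LINEAR -> RELU] backward L-1 times and add [LINEAR -> SIGMOID] backward in a new L_model_backward function.
∙ Finally, update the parameters.
【Note】: Every forward function has a corresponding backward function. This is why each step of the forward propagation module stores some values in a cache: the backward propagation module uses the cached values to compute the gradients.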
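3 Initialization
We first write two helper functions that initialize the model parameters. The first initializes the parameters of the two-layer model. The second generalizes the initialization to an $L$-layer model.
3.1 Parameter initialization for a two-layer network
【Instructions】:
∙ The model structure is: LINEAR -> RELU -> LINEAR -> SIGMOID.
∙ Initialize the weight matrices randomly. Make sure the shapes are correct and use np.random.randn(shape) * 0.01.
∙ Initialize the biases to zero. Use np.zeros(shape).
【Code】: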
# GRADED FUNCTION: initialize_parameters
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    Returns:
    parameters -- python dictionary containing your parameters:
        W1 -- weight matrix of shape (n_h, n_x)
        b1 -- bias vector of shape (n_h, 1)
        W2 -- weight matrix of shape (n_y, n_h)
        b2 -- bias vector of shape (n_y, 1)
    """
    np.random.seed(1)
    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))
    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))
    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}
    return parameters
With the initialization in place, let's test it:
【Test】:
print("==============Testing initialize_parameters==============")
parameters = initialize_parameters(3, 2, 1)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
【Result】:
==============Testing initialize_parameters==============
W1 = [[ 0.01624345 -0.00611756 -0.00528172]
[-0.01072969 0.00865408 -0.02301539]]
b1 = [[0.]
[0.]]
W2 = [[ 0.01744812 -0.00761207]]
b2 = [[0.]]
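3.2 Parameter initialization for an L-layer network
Initialization for a deeper L-layer network is more involved because there are many more weight matrices and bias vectors. When completing initialize_parameters_deep, make sure the dimensions of adjacent layers match. Recall that $n^{[l]}$ is the number of units in layer $l$. For example, if the input $X$ has shape $(12288, 209)$ (that is, $m = 209$ examples), then $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$, $b^{[l]}$ has shape $(n^{[l]}, 1)$, and $Z^{[l]}$ and $A^{[l]}$ have shape $(n^{[l]}, 209)$, with $n^{[0]} = 12288$.
When we compute $WX + b$ in Python, broadcasting is used: $b$ has a single column, and it is added to every column of $WX$.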
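The shape bookkeeping and the broadcasting step can be checked with a few lines of NumPy. This is only an illustrative sketch: the hidden-layer size of 20 is an arbitrary assumption, not a value taken from this article, and the random data merely stands in for real inputs.

```python
import numpy as np

# Assumed sizes for illustration: 12288 input features, 209 examples,
# and a hypothetical first hidden layer with 20 units.
n_x, m, n_1 = 12288, 209, 20

rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))            # shape (12288, 209)
W1 = rng.standard_normal((n_1, n_x)) * 0.01  # shape (20, 12288)
b1 = np.zeros((n_1, 1))                      # shape (20, 1)

# b1 has a single column; broadcasting adds it to each of the 209 columns.
Z1 = np.dot(W1, X) + b1
print(Z1.shape)  # (20, 209)

# A tiny case that makes the broadcast visible
WX = np.array([[1.0, 2.0],
               [3.0, 4.0]])
b = np.array([[10.0],
              [20.0]])
small = WX + b
print(small)  # rows become [11, 12] and [23, 24]
```

Because the bias is stored as a column vector of shape (n, 1), the same code works for any number of examples.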
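【Instructions】:
∙ The model structure is [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID. In other words, the first $L-1$ layers use ReLU as the activation function and the last layer uses a sigmoid activation for the output.
∙ Initialize the weight matrices randomly. Use np.random.randn(shape) * 0.01.
∙ Initialize the biases to zero. Use np.zeros(shape).
∙ We store $n^{[l]}$, the number of units in each layer, in the variable layer_dims. For example, the layer_dims of the "2D data classification model" from the previous article is [2,4,1]: each example has 2 features, the single hidden layer has 4 hidden units, and the output layer has 1 output unit. So W1 has shape (4,2), b1 has shape (4,1), W2 has shape (1,4) and b2 has shape (1,1). We now apply the same idea to $L$ layers.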
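To make the dimension rule concrete, the following sketch lists the parameter shapes implied by a layer_dims list. The helper name parameter_shapes is hypothetical and exists only for this illustration.

```python
def parameter_shapes(layer_dims):
    """Return {name: shape} for W1..WL and b1..bL implied by layer_dims."""
    shapes = {}
    for l in range(1, len(layer_dims)):
        # Wl maps layer l-1 (columns) to layer l (rows); bl is a column vector.
        shapes["W" + str(l)] = (layer_dims[l], layer_dims[l - 1])
        shapes["b" + str(l)] = (layer_dims[l], 1)
    return shapes


shapes = parameter_shapes([2, 4, 1])
print(shapes)
# {'W1': (4, 2), 'b1': (4, 1), 'W2': (1, 4), 'b2': (1, 1)}
```

The number of rows of each weight matrix equals the number of columns of the next one, which is what lets the layers chain together.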
【Code】:
# GRADED FUNCTION: initialize_parameters_deep
def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
        Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
        bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)  # number of layers in the network, counting the input layer
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        assert (parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert (parameters['b' + str(l)].shape == (layer_dims[l], 1))
    return parameters
Let's test it:
【Test】:
# Test initialize_parameters_deep
print("==============Testing initialize_parameters_deep==============")
layers_dims = [5, 4, 3]
parameters = initialize_parameters_deep(layers_dims)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
【Result】:
==============Testing initialize_parameters_deep==============
W1 = [[ 0.01788628 0.0043651 0.00096497 -0.01863493 -0.00277388]
[-0.00354759 -0.00082741 -0.00627001 -0.00043818 -0.00477218]
[-0.01313865 0.00884622 0.00881318 0.01709573 0.00050034]
[-0.00404677 -0.0054536 -0.01546477 0.00982367 -0.01101068]]
b1 = [[0.]
[0.]
[0.]
[0.]]
W2 = [[-0.01185047 -0.0020565 0.01486148 0.00236716]
[-0.01023785 -0.00712993 0.00625245 -0.00160513]
[-0.00768836 -0.00230031 0.00745056 0.01976111]]
b2 = [[0.]
[0.]
[0.]]
We now have parameter initialization functions for both the two-layer and the multi-layer network. Next we build the forward propagation functions.
4 Forward propagation module
We first implement some basic functions that the model will use later. Complete the following three functions in order:
∙ LINEAR
∙ LINEAR -> ACTIVATION, where the activation function is ReLU or Sigmoid.
∙ [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID (the whole model).
4.1 Linear forward
The linear forward module (vectorized over all examples) computes $$Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$$ where $A^{[0]} = X$.
The linear part of forward propagation is computed as follows:
【Code】:
# GRADED FUNCTION: linear_forward
def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.
    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a tuple (A, W, b), stored for computing the backward pass efficiently
    """
    Z = np.dot(W, A) + b  # b is broadcast across the columns (examples)
    assert (Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    return Z, cache
Let's test the linear part:
【Test】:
# Test linear_forward
print("==============Testing linear_forward==============")
A, W, b = testCases.linear_forward_test_case()
Z, linear_cache = linear_forward(A, W, b)
print("A = " + str(A))
print("W = " + str(W))
print("b = " + str(b))
print("Z = " + str(Z))
【Result】:
==============Testing linear_forward==============
A = [[ 1.62434536 -0.61175641]
[-0.52817175 -1.07296862]
[ 0.86540763 -2.3015387 ]]
W = [[ 1.74481176 -0.7612069 0.3190391 ]]
b = [[-0.24937038]]
Z = [[ 3.26295337 -1.23429987]]
4.2 Linear-activation forward
We will use two activation functions:
∙ Sigmoid: $\sigma(Z) = \sigma(WA + b) = \frac{1}{1 + e^{-(WA + b)}}$. This function returns two values: the activation value "A" and a "cache" containing "Z" (which is what we feed into the corresponding backward function to compute the gradient). Call it as follows:
A, activation_cache = sigmoid(Z)
∙ ReLU: $A = RELU(Z) = \max(0, Z)$. This function also returns two values: the activation value "A" and a "cache" containing "Z" (which is what we feed into the corresponding backward function to compute the gradient). Call it as follows:
A, activation_cache = relu(Z)
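dnn_utils.py is not reproduced in this article. As a rough guide, helpers with the interface described above could look like the sketch below. This is an assumption based on the description (return the activation together with a cache holding Z), not the actual contents of the file.

```python
import numpy as np


def sigmoid(Z):
    """Assumed sketch: element-wise sigmoid; the cache is simply Z."""
    A = 1 / (1 + np.exp(-Z))
    cache = Z
    return A, cache


def relu(Z):
    """Assumed sketch: element-wise max(0, Z); the cache is simply Z."""
    A = np.maximum(0, Z)
    cache = Z
    return A, cache


Z = np.array([[-1.0, 0.0, 2.0]])
A_sig, sig_cache = sigmoid(Z)
A_relu, relu_cache = relu(Z)
print(A_sig)   # the middle entry is 0.5 because sigmoid(0) = 0.5
print(A_relu)  # [[0. 0. 2.]]
```

Returning Z as the cache is enough for the backward pass, since both derivatives can be computed from Z alone.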
For convenience, we combine the two functions (linear and activation) into a single function (LINEAR -> ACTIVATION). That is, we implement one function that performs the LINEAR forward step followed by the ACTIVATION forward step.
【Instructions】: Implement the forward propagation of the LINEAR -> ACTIVATION layer. The mathematical expression is $$A^{[l]} = g^{[l]}\left(Z^{[l]}\right) = g^{[l]}\left(W^{[l]}A^{[l-1]} + b^{[l]}\right)$$ where the activation "g" can be sigmoid() or relu().
【Code】:
# GRADED FUNCTION: linear_activation_forward
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer
    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a tuple (linear_cache, activation_cache),
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)
    return A, cache
【Test】:
# Test linear_activation_forward
print("==============Testing linear_activation_forward==============")
A_prev, W, b = testCases.linear_activation_forward_test_case()
print("A_prev = " + str(A_prev))
print("W = " + str(W))
print("b = " + str(b))
A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation="sigmoid")
print("sigmoid, A = " + str(A))
A, linear_activation_cache = linear_activation_forward(A_prev, W, b, activation="relu")
print("ReLU, A = " + str(A))
【Result】:
==============Testing linear_activation_forward==============
A_prev = [[-0.41675785 -0.05626683]
[-2.1361961 1.64027081]
[-1.79343559 -0.84174737]]
W = [[ 0.50288142 -1.24528809 -1.05795222]]
b = [[-0.90900761]]
sigmoid, A = [[0.96890023 0.11013289]]
ReLU, A = [[3.43896131 0. ]]
【Note】: In deep learning, the "[LINEAR->ACTIVATION]" computation is counted as a single layer of the neural network, not two.
4.3 L-layer model
The forward functions needed by the two-layer model are done, so what does forward propagation look like for a multi-layer network? We build it from the two functions above. To make the L-layer network convenient to implement, we need a function that repeats linear_activation_forward with RELU L-1 times and then follows it with one linear_activation_forward with SIGMOID. The resulting structure is [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID.
In the code below, the variable AL denotes $$A^{[L]} = \sigma\left(Z^{[L]}\right) = \sigma\left(W^{[L]}A^{[L-1]} + b^{[L]}\right)$$ which is sometimes also called Yhat, i.e. $\hat{Y}$.
【Code】:
# GRADED FUNCTION: L_model_forward
def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
        every cache of linear_activation_forward() with "relu" (there are L-1 of them, indexed from 0 to L-2)
        the cache of linear_activation_forward() with "sigmoid" (there is one, indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers in the neural network
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        A, cache = linear_activation_forward(A_prev, parameters['W' + str(l)], parameters['b' + str(l)],
                                             activation="relu")
        caches.append(cache)
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    AL, cache = linear_activation_forward(A, parameters['W' + str(L)], parameters['b' + str(L)], activation="sigmoid")
    caches.append(cache)
    assert (AL.shape == (1, X.shape[1]))
    return AL, caches
【Test】:
# Test L_model_forward
print("==============Testing L_model_forward==============")
X, parameters = testCases.L_model_forward_test_case()
AL, caches = L_model_forward(X, parameters)
print("X = " + str(X))
print("parameters = " + str(parameters))
print("AL = " + str(AL))
print("length of caches = " + str(len(caches)))
print("caches = " + str(caches))
【Result】:
==============Testing L_model_forward==============
X = [[ 1.62434536 -0.61175641]
[-0.52817175 -1.07296862]
[ 0.86540763 -2.3015387 ]
[ 1.74481176 -0.7612069 ]]
parameters = {'W1': array([[ 0.3190391 , -0.24937038, 1.46210794, -2.06014071],
[-0.3224172 , -0.38405435, 1.13376944, -1.09989127],
[-0.17242821, -0.87785842, 0.04221375, 0.58281521]]), 'b1': array([[-1.10061918],
[ 1.14472371],
[ 0.90159072]]), 'W2': array([[ 0.50249434, 0.90085595, -0.68372786]]), 'b2': array([[-0.12289023]])}
AL = [[0.17007265 0.2524272 ]]
length of caches = 2
caches = [((array([[ 1.62434536, -0.61175641],
[-0.52817175, -1.07296862],
[ 0.86540763, -2.3015387 ],
[ 1.74481176, -0.7612069 ]]), array([[ 0.3190391 , -0.24937038, 1.46210794, -2.06014071],
[-0.3224172 , -0.38405435, 1.13376944, -1.09989127],
[-0.17242821, -0.87785842, 0.04221375, 0.58281521]]), array([[-1.10061918],
[ 1.14472371],
[ 0.90159072]])), array([[-2.77991749, -2.82513147],
[-0.11407702, -0.01812665],
[ 2.13860272, 1.40818979]])), ((array([[0. , 0. ],
[0. , 0. ],
[2.13860272, 1.40818979]]), array([[ 0.50249434, 0.90085595, -0.68372786]]), array([[-0.12289023]])), array([[-1.58511248, -1.08570881]]))]
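We now have a complete forward propagation module: it takes the input $X$ and outputs a row vector $A^{[L]}$ containing the predictions. It also records all intermediate values in "caches", which the backward pass will use to compute the gradients. With $A^{[L]}$ we can compute the cost of the predictions.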
5 Cost function
With the forward propagation of both models finished, we need to compute the cost (loss) to check whether the model is actually learning. The cross-entropy cost $J$ is computed with the following formula: $$J = -\frac{1}{m}\sum_{i=1}^{m}\left( y^{(i)}\log\left(a^{[L](i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{[L](i)}\right) \right)$$
【Code】:
# GRADED FUNCTION: compute_cost
def compute_cost(AL, Y):
    """
    Implement the cross-entropy cost function J defined above.
    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)
    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]
    # Compute loss from aL and y.
    cost = -1 / m * np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL), axis=1, keepdims=True)
    cost = np.squeeze(cost)  # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert (cost.shape == ())
    return cost
【Test】:
# Test compute_cost
print("==============Testing compute_cost==============")
Y, AL = testCases.compute_cost_test_case()
print("Y = " + str(Y))
print("AL = " + str(AL))
print("cost = " + str(compute_cost(AL, Y)))
【Result】:
==============Testing compute_cost==============
Y = [[1 1 1]]
AL = [[0.8 0.9 0.4]]
cost = 0.41493159961539694
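The cost value above can be reproduced by hand: since every label equals 1, only the log(AL) term contributes. The sketch below does that, and also shows one possible safeguard, clipping AL away from 0 and 1 so that np.log never receives 0. The function name compute_cost_clipped and the eps value are assumptions added for illustration; they are not part of compute_cost as given in this article.

```python
import numpy as np

Y = np.array([[1, 1, 1]])
AL = np.array([[0.8, 0.9, 0.4]])

# With all labels equal to 1, J = -(log 0.8 + log 0.9 + log 0.4) / 3.
manual = -(np.log(0.8) + np.log(0.9) + np.log(0.4)) / 3
print(manual)  # approximately 0.4149


def compute_cost_clipped(AL, Y, eps=1e-12):
    """Hypothetical variant: cross-entropy cost with AL clipped to [eps, 1 - eps]."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)


clipped = compute_cost_clipped(AL, Y)
print(clipped)

# Saturated predictions would give log(0) without clipping; here the cost stays finite.
saturated = compute_cost_clipped(np.array([[1.0, 0.0]]), np.array([[1, 0]]))
print(saturated)
```

Clipping changes the cost only negligibly for ordinary predictions, but it keeps training from producing nan when an output saturates.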
6 Backward propagation module
Backward propagation computes the gradients of the loss function with respect to the parameters. Let's look at how forward and backward propagation fit together:
With some calculus background, we can use the chain rule to derive the derivative of the loss with respect to $z^{[1]}$ in a two-layer network: $$dz^{[1]} = \frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} \tag{1}$$
To compute the gradient $dW^{[1]}$, extend equation (1) by one more factor: $$dW^{[1]} = dz^{[1]} \cdot \frac{\partial z^{[1]}}{\partial W^{[1]}} \tag{2}$$
Likewise, to compute the gradient $db^{[1]}$, extend equation (1) by one more factor: $$db^{[1]} = dz^{[1]} \cdot \frac{\partial z^{[1]}}{\partial b^{[1]}} \tag{3}$$
Each gradient is obtained by propagating derivatives from the output back toward the input, which is why the procedure is called backward propagation.
As with forward propagation, we build backward propagation in three steps:
∙ LINEAR backward
∙ LINEAR -> ACTIVATION backward, where the activation step uses the derivative of ReLU or sigmoid
∙ [LINEAR -> RELU] × (L-1) -> LINEAR -> SIGMOID backward (the whole model)
6.1 Linear backward
For layer $l$, the linear part is $Z^{[l]} = W^{[l]}A^{[l-1]} + b^{[l]}$.
Suppose the derivative $dZ^{[l]} = \frac{\partial L}{\partial Z^{[l]}}$ has already been computed. From the input $dZ^{[l]}$ we need to compute three outputs: $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$. The formulas are: $$dW^{[l]} = \frac{\partial L}{\partial W^{[l]}} = \frac{1}{m} dZ^{[l]} A^{[l-1]T} \tag{4}$$ $$db^{[l]} = \frac{\partial L}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m} dZ^{[l](i)} \tag{5}$$ $$dA^{[l-1]} = \frac{\partial L}{\partial A^{[l-1]}} = W^{[l]T} dZ^{[l]} \tag{6}$$
【Code】:
# GRADED FUNCTION: linear_backward
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)
    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = 1 / m * np.dot(dZ, A_prev.T)               # equation (4)
    db = 1 / m * np.sum(dZ, axis=1, keepdims=True)  # equation (5)
    dA_prev = np.dot(W.T, dZ)                       # equation (6)
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db
【Test】:
# Test linear_backward
print("==============Testing linear_backward==============")
dZ, linear_cache = testCases.linear_backward_test_case()
dA_prev, dW, db = linear_backward(dZ, linear_cache)
print("dA_prev = " + str(dA_prev))
print("dW = " + str(dW))
print("db = " + str(db))
【Result】:
==============Testing linear_backward==============
dA_prev = [[ 0.51822968 -0.19517421]
[-0.40506361 0.15255393]
[ 2.37496825 -0.89445391]]
dW = [[-0.10076895 1.40685096 1.64992505]]
db = [[0.50629448]]
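Formulas (4) and (5) can be sanity-checked numerically with finite differences. The sketch below is self-contained, so it repeats a condensed linear_backward. The surrogate cost J, the fixed matrix G that plays the role of dZ, and the random test data are all assumptions made for this check, not material from the course.

```python
import numpy as np


def linear_backward(dZ, cache):
    """Condensed copy of the function above, repeated so the sketch runs on its own."""
    A_prev, W, b = cache
    m = A_prev.shape[1]
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    return dA_prev, dW, db


rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
G = rng.standard_normal((2, 5))  # plays the role of dZ
m = A_prev.shape[1]


def J(W_, b_):
    # Surrogate cost: its gradient with respect to Z is G / m,
    # which matches the 1/m convention of formulas (4) and (5).
    Z = np.dot(W_, A_prev) + b_
    return np.sum(G * Z) / m


_, dW, db = linear_backward(G, (A_prev, W, b))

eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        dW_num[i, j] = (J(W + E, b) - J(W - E, b)) / (2 * eps)

db_num = np.zeros_like(b)
for i in range(b.shape[0]):
    E = np.zeros_like(b)
    E[i, 0] = eps
    db_num[i, 0] = (J(W, b + E) - J(W, b - E)) / (2 * eps)

max_err_W = np.max(np.abs(dW - dW_num))
max_err_b = np.max(np.abs(db - db_num))
print(max_err_W, max_err_b)  # both should be tiny (rounding error only)
```

Note that formula (6) carries no 1/m factor, so dA_prev corresponds to the gradient of m times this surrogate cost.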
6.2 Linear-activation backward
To help you implement linear_activation_backward, two backward functions are provided:
∙ sigmoid_backward: implements the backward propagation of a SIGMOID unit. Call it as follows:
dZ = sigmoid_backward(dA, activation_cache)
∙ relu_backward: implements the backward propagation of a RELU unit. Call it as follows:
dZ = relu_backward(dA, activation_cache)
If $g(\cdot)$ is the activation function, then sigmoid_backward and relu_backward compute $$dZ^{[l]} = dA^{[l]} \ast g'\left(Z^{[l]}\right)$$ where $\ast$ denotes element-wise multiplication.
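Since dnn_utils.py is not shown here, the following sketch illustrates how backward helpers with this interface could apply the formula for each activation. It is an assumption based on the description above (the cache holds Z), not the actual contents of the file.

```python
import numpy as np


def sigmoid_backward(dA, cache):
    """Assumed sketch: dZ = dA * s * (1 - s), where s = sigmoid(Z)."""
    Z = cache
    s = 1 / (1 + np.exp(-Z))
    return dA * s * (1 - s)


def relu_backward(dA, cache):
    """Assumed sketch: dZ = dA where Z > 0, and 0 elsewhere."""
    Z = cache
    dZ = np.array(dA, copy=True)  # copy so the caller's dA is not modified
    dZ[Z <= 0] = 0
    return dZ


dA = np.array([[1.0, 2.0, 3.0]])
Z = np.array([[-1.0, 0.0, 2.0]])
dZ_relu = relu_backward(dA, Z)
dZ_sig = sigmoid_backward(dA, Z)
print(dZ_relu)  # [[0. 0. 3.]]
print(dZ_sig)   # the middle entry is 2 * 0.25 = 0.5
```

Treating the ReLU derivative at exactly Z = 0 as 0 is a convention; the derivative is undefined at that single point.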
【Code】:
# GRADED FUNCTION: linear_activation_backward
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
    return dA_prev, dW, db
【Test】:
# Test linear_activation_backward
print("==============Testing linear_activation_backward==============")
AL, linear_activation_cache = testCases.linear_activation_backward_test_case()
dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation="sigmoid")
print("sigmoid:")
print("dA_prev = " + str(dA_prev))
print("dW = " + str(dW))
print("db = " + str(db) + "\n")
dA_prev, dW, db = linear_activation_backward(AL, linear_activation_cache, activation="relu")
print("relu:")
print("dA_prev = " + str(dA_prev))
print("dW = " + str(dW))
print("db = " + str(db))
【Result】:
==============Testing linear_activation_backward==============
sigmoid:
dA_prev = [[ 0.11017994 0.01105339]
[ 0.09466817 0.00949723]
[-0.05743092 -0.00576154]]
dW = [[ 0.10266786 0.09778551 -0.01968084]]
db = [[-0.05729622]]
relu:
dA_prev = [[ 0.44090989 0. ]
[ 0.37883606 0. ]
[-0.2298228 0. ]]
dW = [[ 0.44513824 0.37371418 -0.10478989]]
db = [[-0.20837892]]
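6.3 L-layer model backward
Now you will implement the backward propagation function for the whole network. Recall that when L_model_forward ran, every iteration stored a cache containing (A, W, b and Z). The backward propagation module uses these variables to compute the gradients. So in the L_model_backward function we start from layer $L$ and iterate backward through all the hidden layers. At each step we use the cached values of layer $l$ to backpropagate through layer $l$. The figure below illustrates the backward pass.
For the output layer we have $A^{[L]} = \sigma\left(Z^{[L]}\right)$, so we first need to compute dAL, which can be done with the following code:
dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
We can then continue the backward pass with this post-activation gradient dAL. As the figure shows, dAL is fed into the LINEAR -> SIGMOID backward function you implemented (which uses the cached values stored by L_model_forward). After that, a for loop applies the LINEAR -> RELU backward function to all the other layers. Each dA, dW and db is stored in the grads dictionary.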
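For reference, the expression for dAL comes from differentiating the per-example cross-entropy loss with respect to the output activation. Combined with the sigmoid derivative $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$, it yields the familiar simplification for the output layer. The derivation below is a standard result and uses only the symbols already defined in this article:

```latex
\mathcal{L}(a, y) = -\bigl( y \log a + (1 - y) \log (1 - a) \bigr)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial a} = -\left( \frac{y}{a} - \frac{1 - y}{1 - a} \right)

dZ^{[L]} = dA^{[L]} \ast \sigma'\!\left( Z^{[L]} \right)
         = -\left( \frac{Y}{A^{[L]}} - \frac{1 - Y}{1 - A^{[L]}} \right) \ast A^{[L]} \left( 1 - A^{[L]} \right)
         = A^{[L]} - Y
```

The first line matches the dAL code term by term; the second shows that the sigmoid output layer effectively backpropagates the prediction error $A^{[L]} - Y$.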
【Code】:
# GRADED FUNCTION: L_model_backward
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
        every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
        the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    Returns:
    grads -- A dictionary with the gradients
        grads["dA" + str(l)] = ...
        grads["dW" + str(l)] = ...
        grads["db" + str(l)] = ...
    Note: as written, grads["dA" + str(l)] holds the gradient with respect to the
    input of layer l (that is, A of layer l-1), so grads["dA1"] has the shape of X.
    """
    grads = {}
    L = len(caches)  # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)  # after this line, Y is the same shape as AL
    # Initializing the backpropagation
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # Lth layer (SIGMOID -> LINEAR) gradients.
    # Inputs: "AL, Y, caches".
    # Outputs: "grads["dAL"], grads["dWL"], grads["dbL"]
    current_cache = caches[L - 1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, current_cache, activation="sigmoid")
    for l in reversed(range(L - 1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 2)], caches".
        # Outputs: "grads["dA" + str(l + 1)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l + 2)],
                                                                    current_cache,
                                                                    activation="relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads
【Test】:
# Test L_model_backward
print("==============Testing L_model_backward==============")
AL, Y_assess, caches = testCases.L_model_backward_test_case()
grads = L_model_backward(AL, Y_assess, caches)
print("dW1 = " + str(grads["dW1"]))
print("db1 = " + str(grads["db1"]))
print("dA1 = " + str(grads["dA1"]))
【Result】:
==============Testing L_model_backward==============
dW1 = [[0.41010002 0.07807203 0.13798444 0.10502167]
[0. 0. 0. 0. ]
[0.05283652 0.01005865 0.01777766 0.0135308 ]]
db1 = [[-0.22007063]
[ 0. ]
[-0.02835349]]
dA1 = [[ 0. 0.52257901]
[ 0. -0.3269206 ]
[ 0. -0.32070404]
[ 0. -0.74079187]]
6.4 Update parameters
Finally, we use gradient descent to update the model parameters: $$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]} \tag{7}$$ $$b^{[l]} = b^{[l]} - \alpha \, db^{[l]} \tag{8}$$ where $\alpha$ is the learning rate. After computing the updated parameters, store them in the parameters dictionary.
【Code】:
# GRADED FUNCTION: update_parameters
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of L_model_backward
    learning_rate -- step size alpha used in equations (7) and (8)
    Returns:
    parameters -- python dictionary containing your updated parameters
        parameters["W" + str(l)] = ...
        parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2  # number of layers in the neural network
    # Update rule for each parameter. Use a for loop.
    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]
    return parameters
【Test】:
# Test update_parameters
print("==============Testing update_parameters==============")
parameters, grads = testCases.update_parameters_test_case()
parameters = update_parameters(parameters, grads, 0.1)
print("W1 = " + str(parameters["W1"]))
print("b1 = " + str(parameters["b1"]))
print("W2 = " + str(parameters["W2"]))
print("b2 = " + str(parameters["b2"]))
【Result】:
==============Testing update_parameters==============
W1 = [[-0.59562069 -0.09991781 -2.14584584 1.82662008]
[-1.76569676 -0.80627147 0.51115557 -1.18258802]
[-1.0535704 -0.86128581 0.68284052 2.20374577]]
b1 = [[-0.04659241]
[-1.28888275]
[ 0.53405496]]
W2 = [[-0.55569196 0.0354055 1.32964895]]
b2 = [[-0.84610769]]
With that, we have built all the functions needed for a deep neural network.
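To see how the pieces fit together, here is a compact, self-contained training loop. It is a sketch under stated assumptions rather than the code of the next article: the short function names (init, forward, cost, backward, update) are hypothetical condensed re-implementations of the helpers above, the toy dataset is synthetic, the clipping inside cost is an added safeguard, and the layer sizes, learning rate and iteration count are arbitrary.

```python
import numpy as np


def init(layer_dims, seed=3):
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params


def forward(X, params):
    # [LINEAR -> RELU] x (L-1) -> LINEAR -> SIGMOID, caching (A_prev, W, Z) per layer
    L = len(params) // 2
    caches, A = [], X
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = np.dot(W, A) + b
        caches.append((A, W, Z))
        A = np.maximum(0, Z) if l < L else 1 / (1 + np.exp(-Z))
    return A, caches


def cost(AL, Y):
    AL = np.clip(AL, 1e-12, 1 - 1e-12)  # added safeguard against log(0)
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))


def backward(AL, Y, caches):
    grads, m, L = {}, Y.shape[1], len(caches)
    dZ = AL - Y  # output layer: dAL * sigmoid'(Z) simplifies to AL - Y
    for l in range(L, 0, -1):
        A_prev, W, _ = caches[l - 1]
        grads["dW" + str(l)] = np.dot(dZ, A_prev.T) / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = np.dot(W.T, dZ)
            Z_prev = caches[l - 2][2]
            dZ = dA_prev * (Z_prev > 0)  # ReLU derivative of the previous layer
    return grads


def update(params, grads, lr):
    for l in range(1, len(params) // 2 + 1):
        params["W" + str(l)] -= lr * grads["dW" + str(l)]
        params["b" + str(l)] -= lr * grads["db" + str(l)]
    return params


# Synthetic toy data: label is 1 when the two features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)

params = init([2, 4, 1])
costs = []
for i in range(500):
    AL, caches = forward(X, params)
    costs.append(cost(AL, Y))
    grads = backward(AL, Y, caches)
    params = update(params, grads, lr=0.5)

print("first cost:", costs[0])
print("last cost:", costs[-1])
```

Each iteration runs the same four steps described in the outline: forward pass, cost, backward pass, parameter update.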