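Multi-layer fully connected neural networks
Notation for multi-layer neural networks
When counting the layers of a network, only the hidden layers and the output layer are counted. The network in the figure above therefore has 4 layers: letting L denote the number of layers, L = 4.
The input layer is conventionally called "layer 0". Since arrays in most programming languages are indexed from 0, this naming keeps the notation uniform.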
Every layer l (from 1 to L) has two parameters, W and b. To tell the layers apart they are written $W^{[l]}$ and $b^{[l]}$. Both are matrices, with the following shapes:
$$W^{[l]}: (n^{[l]}, n^{[l-1]})$$
$$b^{[l]}: (n^{[l]}, 1)$$
Here $n^{[l]}$ is the number of neurons in layer l. This is where "layer 0" becomes useful: $W^{[1]}$ has shape $(n^{[1]}, n^{[0]})$, and $n^{[0]}$ is exactly the number of neurons in the input layer, that is, the number of features of x.
As described earlier, each layer produces a linear output $Z^{[l]}$ and an activation $A^{[l]}$, obtained by passing $Z^{[l]}$ through a nonlinear activation function. Their shapes are:
$$Z^{[l]}, A^{[l]}: (n^{[l]}, m)$$
where m is the number of input samples. Gradient descent also needs the partial derivatives $dW^{[l]}$ and $db^{[l]}$, which have the same shapes as the corresponding parameters $W^{[l]}$ and $b^{[l]}$:
$$dW^{[l]}: (n^{[l]}, n^{[l-1]})$$
$$db^{[l]}: (n^{[l]}, 1)$$
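The shape rules above can be tabulated mechanically. The following is a minimal sketch, not part of the course code: the helper name `layer_shapes` and the layer sizes are made up for illustration.

```python
def layer_shapes(layer_dims, m):
    """Return the expected shapes of W, b, Z and A for every layer l = 1..L.

    layer_dims[0] is n^[0] (the number of input features), so the list has
    L + 1 entries. m is the number of samples.
    """
    shapes = {}
    for l in range(1, len(layer_dims)):
        shapes[l] = {
            "W": (layer_dims[l], layer_dims[l - 1]),
            "b": (layer_dims[l], 1),
            "Z": (layer_dims[l], m),
            "A": (layer_dims[l], m),
        }
    return shapes


# Hypothetical sizes: 5 input features, hidden layers of 4, 3 and 2 units,
# one output unit (so L = 4), and a batch of 10 samples.
shapes = layer_shapes([5, 4, 3, 2, 1], m=10)
for l, s in shapes.items():
    print(l, s)
```

dW and db are not listed separately because they always share the shapes of W and b.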
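General formulas for a multi-layer neural network
For a multi-layer network we want one general formula that applies to every layer l, so that each layer can be treated as a module, or building block. In code this means abstracting a layer into a single reusable function; calling that function repeatedly then describes the whole model.
The diagram below shows one such module (lowercase a and z indicate that a single sample is being processed):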
[Figure: schematic of a single layer module, ./multi-layer/2.jpg]
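During forward propagation, layer l receives the previous layer's output $a^{[l-1]}$ as its input. Using its own $W^{[l]}$, $b^{[l]}$ and activation function, it computes its output $a^{[l]}$, which is passed on to the next layer.
Along the way it caches $z^{[l]}$ and $a^{[l]}$, which are needed to compute derivatives during backpropagation.
When forward propagation finishes, the cost function J is computed; the cost is the starting point of backpropagation.
During backpropagation, layer l receives the derivative $da^{[l]}$ from layer l+1 and computes the derivatives of its own parameters, $dW^{[l]}$ and $db^{[l]}$, which gradient descent uses to update them. It also computes $da^{[l-1]}$, which layer l-1 needs in order to compute its own gradients.
The formulas are as follows.
Forward propagation: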
$$Z^{[l]}_{n^{[l]}\times m} = W^{[l]}_{n^{[l]}\times n^{[l-1]}} A^{[l-1]}_{n^{[l-1]}\times m} + b^{[l]}_{n^{[l]}\times 1}$$
$$A^{[l]} = g(Z^{[l]})$$
Computing the cost function:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)\right) \tag{7}$$
Backpropagation:
$$dZ^{[l]}_{n^{[l]}\times m} = {\partial J \over \partial Z^{[l]}} = dA^{[l]}_{n^{[l]}\times m} \odot g'^{[l]}(Z^{[l]})_{n^{[l]}\times m}$$
$$dW^{[l]}_{(n^{[l]}\times n^{[l-1]})} = {\partial J \over \partial W^{[l]}} = {1\over m} dZ^{[l]}_{(n^{[l]}\times m)} A^{[l-1]T}_{(m\times n^{[l-1]})}$$
$$db^{[l]}_{(n^{[l]}\times 1)} = {\partial J \over \partial b^{[l]}} = {1\over m}\sum^{m}_{i=1} dZ^{[l](i)}$$
$$dA^{[l-1]}_{(n^{[l-1]}\times m)} = {\partial J\over \partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]}\times n^{[l]})} dZ^{[l]}_{(n^{[l]}\times m)}$$
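One way to gain confidence in the backward formulas is a numerical gradient check. The sketch below is an illustration only and makes several assumptions that are not in the original text: a single sigmoid output layer, small made-up sizes, a helper named `cost`, and a step size `eps`. It compares the analytic dW and db with central-difference estimates of the cost in equation (7).

```python
import numpy as np

def cost(W, b, A_prev, Y):
    """Cross-entropy cost of a single sigmoid layer, as in equation (7)."""
    Z = W @ A_prev + b
    A = 1.0 / (1.0 + np.exp(-Z))
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)

np.random.seed(0)
n_prev, m = 3, 5                       # hypothetical sizes
A_prev = np.random.randn(n_prev, m)
W = np.random.randn(1, n_prev)
b = np.random.randn(1, 1)
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])

# Analytic gradients from the backward formulas.
Z = W @ A_prev + b
A = 1.0 / (1.0 + np.exp(-Z))
dA = -(Y / A - (1 - Y) / (1 - A))
dZ = dA * A * (1 - A)                  # g'(Z) = A (1 - A) for the sigmoid
dW = dZ @ A_prev.T / m
db = np.sum(dZ, axis=1, keepdims=True) / m

# Numerical gradients by central differences.
eps = 1e-6
dW_num = np.zeros_like(W)
for j in range(n_prev):
    step = np.zeros_like(W)
    step[0, j] = eps
    dW_num[0, j] = (cost(W + step, b, A_prev, Y) - cost(W - step, b, A_prev, Y)) / (2 * eps)
db_num = (cost(W, b + eps, A_prev, Y) - cost(W, b - eps, A_prev, Y)) / (2 * eps)

print("max |dW - dW_num| =", np.max(np.abs(dW - dW_num)))
print("|db - db_num|     =", abs(db[0, 0] - db_num))
```

Both differences should be tiny compared with the gradients themselves; a large gap usually points to a shape or sign mistake in the backward pass.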
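Code for a multi-layer fully connected neural network
The code below comes from a programming assignment in Andrew Ng's course; this section analyzes it.
Parameter initialization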
import numpy as np

# GRADED FUNCTION: initialize_parameters_deep
def initialize_parameters_deep(layer_dims):
    """
    Arguments:
    layer_dims -- python array (list) containing the dimensions of each layer in our network
    Returns:
    parameters -- python dictionary containing your parameters "W1", "b1", ..., "WL", "bL":
                    Wl -- weight matrix of shape (layer_dims[l], layer_dims[l-1])
                    bl -- bias vector of shape (layer_dims[l], 1)
    """
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)  # length of layer_dims: number of layers + 1, because layer 0 is included
    for l in range(1, L):
        ### START CODE HERE ### (≈ 2 lines of code)
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        ### END CODE HERE ###
        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l-1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))
    return parameters
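Input of initialize_parameters_deep
The input is a list named layer_dims. Its length is L+1: layer_dims[0] is the size of layer 0 (the number of input features), and layer_dims[1] through layer_dims[L] are the sizes of layers 1 through L.
Output of initialize_parameters_deep
The output is a dictionary named parameters; $W^{[3]}$ is accessed as parameters["W3"], and likewise for the other entries.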
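Analysis of initialize_parameters_deep
The parameters are initialized to the following shapes:
$$W^{[l]}: (n^{[l]}, n^{[l-1]})$$
$$b^{[l]}: (n^{[l]}, 1)$$
W must not be initialized to all zeros. As explained in the course, every neuron in a layer would then compute the same output and receive the same gradient, so the neurons would never become different from one another. The loop initializes $W^{[1]}, b^{[1]}$ through $W^{[L]}, b^{[L]}$; layer 0 has no W or b.
The interesting part is the layer_dims list. It has L+1 entries: layer_dims[0] is the size of layer 0 (the input features), layer L is the output layer, and the layers in between are hidden layers whose sizes can be chosen freely. The variable L in the program (call it L' to avoid confusion) therefore equals the number of layers plus 1. Because range(1, L') excludes L', the loop runs exactly from layer 1 to layer L.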
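The effect of zero initialization can be seen on a tiny example. This is a minimal sketch under assumptions that are not in the original: a two-layer [LINEAR->RELU]->[LINEAR->SIGMOID] network with zero biases, a made-up helper `grads_for_init`, made-up data, and the relu derivative taken as 0 at Z = 0.

```python
import numpy as np

def grads_for_init(W1, W2, X, Y):
    """Weight gradients of a [LINEAR->RELU]->[LINEAR->SIGMOID] net with zero biases."""
    m = X.shape[1]
    Z1 = W1 @ X
    A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    dZ2 = A2 - Y                       # sigmoid output combined with cross-entropy
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)      # relu derivative, taken as 0 at Z1 == 0
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

np.random.seed(1)
X = np.random.randn(3, 6)              # hypothetical data: 3 features, 6 samples
Y = np.array([[1, 0, 1, 0, 1, 1]], dtype=float)

dW1_zero, dW2_zero = grads_for_init(np.zeros((4, 3)), np.zeros((1, 4)), X, Y)
dW1_rand, dW2_rand = grads_for_init(np.random.randn(4, 3) * 0.01,
                                    np.random.randn(1, 4) * 0.01, X, Y)

print("zero init:   any nonzero weight gradient?", bool(dW1_zero.any() or dW2_zero.any()))
print("random init: any nonzero weight gradient?", bool(dW1_rand.any() or dW2_rand.any()))
```

With all-zero weights every weight gradient in this sketch is exactly zero, so gradient descent can only move the output bias. Small random values, as in np.random.randn(...) * 0.01, give the weights nonzero gradients to start from.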
Forward propagation
Forward propagation is split into two parts. The first part implements forward propagation for a single "module"; the second part calls the first part's functions repeatedly to propagate through the whole network.
Part 1: forward propagation for a single generic module
Part 1 implements the following two formulas, using two functions.
$$Z^{[l]}_{n^{[l]}\times m} = W^{[l]}_{n^{[l]}\times n^{[l-1]}} A^{[l-1]}_{n^{[l-1]}\times m} + b^{[l]}_{n^{[l]}\times 1}$$
$$A^{[l]} = g(Z^{[l]})$$
The linear_forward function implements the first formula
# GRADED FUNCTION: linear_forward
def linear_forward(A, W, b):
    """
    Implement the linear part of a layer's forward propagation.
    Arguments:
    A -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    Returns:
    Z -- the input of the activation function, also called pre-activation parameter
    cache -- a python tuple containing "A", "W" and "b" ; stored for computing the backward pass efficiently
    """
    ### START CODE HERE ### (≈ 1 line of code)
    Z = np.dot(W, A) + b
    ### END CODE HERE ###
    assert(Z.shape == (W.shape[0], A.shape[1]))
    cache = (A, W, b)
    return Z, cache
The function takes the three arguments $A^{[l-1]}$, $W^{[l]}$ and $b^{[l]}$, returns the linear output $Z^{[l]}$, and also stores the three arguments in a cache.
linear_activation_forward implements both formulas
def linear_activation_forward(A_prev, W, b, activation):
    """
    Implement the forward propagation for the LINEAR->ACTIVATION layer
    Arguments:
    A_prev -- activations from previous layer (or input data): (size of previous layer, number of examples)
    W -- weights matrix: numpy array of shape (size of current layer, size of previous layer)
    b -- bias vector, numpy array of shape (size of the current layer, 1)
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    Returns:
    A -- the output of the activation function, also called the post-activation value
    cache -- a python tuple containing "linear_cache" and "activation_cache";
             stored for computing the backward pass efficiently
    """
    if activation == "sigmoid":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = sigmoid(Z)
        ### END CODE HERE ###
    elif activation == "relu":
        # Inputs: "A_prev, W, b". Outputs: "A, activation_cache".
        ### START CODE HERE ### (≈ 2 lines of code)
        Z, linear_cache = linear_forward(A_prev, W, b)
        A, activation_cache = relu(Z)
        ### END CODE HERE ###
    else:
        # Fail early instead of reaching the assert with A undefined.
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)
    return A, cache
This function takes the same three inputs $A^{[l-1]}$, $W^{[l]}$ and $b^{[l]}$, plus the name of the activation function, which selects the operation applied to the linear output $Z^{[l]}$.
Note that relu and sigmoid return not only $A^{[l]}$ but also a cache holding $Z^{[l]}$. Together with the cache from linear_forward(), four quantities are cached for layer l: $A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$ and $Z^{[l]}$.
Note the structure of cache: cache[0] = linear_cache = ($A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$), in the order used by linear_forward(), and cache[1] = activation_cache = $Z^{[l]}$.
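The helpers sigmoid and relu are supplied by the assignment and are not shown in this post. The following is a plausible sketch of what they do, written from the description above (return the activation together with a cache holding Z); the actual course helpers may differ in detail.

```python
import numpy as np

def sigmoid(Z):
    """Return A = 1 / (1 + exp(-Z)) and a cache holding Z."""
    A = 1.0 / (1.0 + np.exp(-Z))
    return A, Z

def relu(Z):
    """Return A = max(0, Z) elementwise and a cache holding Z."""
    A = np.maximum(0, Z)
    return A, Z

Z = np.array([[-2.0, 0.0, 3.0]])
A_sig, cache_sig = sigmoid(Z)
A_relu, cache_relu = relu(Z)
print(A_sig)
print(A_relu)
```

Returning Z as the cache is what lets the backward pass evaluate $g'(Z^{[l]})$ without recomputing the linear step.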
Part 2: combining L layers into a network
Part 2 implements forward propagation for the whole model by repeatedly calling the Part 1 function.
def L_model_forward(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation
    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- output of initialize_parameters_deep()
    Returns:
    AL -- last post-activation value
    caches -- list of caches containing:
                every cache of linear_activation_forward() (there are L of them, indexed from 0 to L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2                  # number of layers in the neural network
    # Implement [LINEAR -> RELU]*(L-1). Add "cache" to the "caches" list.
    for l in range(1, L):
        A_prev = A
        ### START CODE HERE ### (≈ 2 lines of code)
        A, cache = linear_activation_forward(A_prev, parameters["W" + str(l)], parameters["b" + str(l)], 'relu')
        caches.append(cache)
        ### END CODE HERE ###
    # Implement LINEAR -> SIGMOID. Add "cache" to the "caches" list.
    ### START CODE HERE ### (≈ 2 lines of code)
    AL, cache = linear_activation_forward(A, parameters["W" + str(L)], parameters["b" + str(L)], 'sigmoid')
    caches.append(cache)
    ### END CODE HERE ###
    assert(AL.shape == (1, X.shape[1]))
    return AL, caches
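Input of L_model_forward
The function takes two arguments: the training data X, a matrix of shape $(n^{[0]}, m)$, and parameters, which describes the structure of the whole model. All hidden layers are assumed to use relu and the output layer sigmoid.
Output of L_model_forward
The function returns the final result of forward propagation together with the caches of every layer collected along the way, L caches in total (L is the number of layers in the model).
Analysis of L_model_forward
First, L here is no longer the number of layers plus 1 but the number of layers itself, because parameters holds one W and one b per layer. The for loop therefore handles L-1 layers, namely all the hidden layers, and passes 'relu' inside the loop. The output layer is handled separately afterwards with 'sigmoid'.
After every loop iteration, and after the separate output-layer step, the cache is appended to caches. caches is therefore a list of length L (indices 0 to L-1). Each element is a tuple (linear_cache, activation_cache) that together holds four quantities: $A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$ and $Z^{[l]}$.
Concretely, as described above: caches[l-1] = cache for layer l, where cache[0] = linear_cache = ($A^{[l-1]}$, $W^{[l]}$, $b^{[l]}$) and cache[1] = activation_cache = $Z^{[l]}$.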
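To make the structure of caches concrete, here is a compact, self-contained sketch of the same forward pass. It is a condensed illustration rather than the course code: the function name `forward`, the layer sizes and the data are assumptions.

```python
import numpy as np

def forward(X, parameters):
    """Compact [LINEAR->RELU]*(L-1) -> LINEAR->SIGMOID forward pass."""
    caches = []
    A = X
    L = len(parameters) // 2                      # one W and one b per layer
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b
        linear_cache, activation_cache = (A, W, b), Z
        # relu for hidden layers, sigmoid for the output layer
        A = np.maximum(0, Z) if l < L else 1.0 / (1.0 + np.exp(-Z))
        caches.append((linear_cache, activation_cache))
    return A, caches

np.random.seed(0)
layer_dims = [4, 3, 2, 1]                         # hypothetical sizes, so L = 3
parameters = {}
for l in range(1, len(layer_dims)):
    parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))

X = np.random.randn(4, 5)                         # 5 samples
AL, caches = forward(X, parameters)
print(AL.shape, len(caches))
for i, (linear_cache, Z) in enumerate(caches):
    A_prev, W, b = linear_cache
    print("caches[%d] -> layer %d:" % (i, i + 1), A_prev.shape, W.shape, b.shape, Z.shape)
```

The printed shapes show the off-by-one between the list index and the layer number: caches[0] belongs to layer 1 and its A_prev is the input X itself.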
Computing the cost function
The problem is assumed to be binary classification, and the cross-entropy is used to measure how good the model is:
$$J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \left(y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right)\right) \tag{7}$$
The Python code is very simple: it takes the predictions AL for the m samples and the true labels Y, and measures the error between them.
def compute_cost(AL, Y):
    """
    Implement the cost function defined by equation (7).
    Arguments:
    AL -- probability vector corresponding to your label predictions, shape (1, number of examples)
    Y -- true "label" vector (for example: containing 0 if non-cat, 1 if cat), shape (1, number of examples)
    Returns:
    cost -- cross-entropy cost
    """
    m = Y.shape[1]
    # Compute loss from aL and y.
    ### START CODE HERE ### (≈ 1 lines of code)
    cost = -(1/m) * np.sum(Y*np.log(AL) + (1-Y)*np.log(1-AL), axis=1, keepdims=True)
    ### END CODE HERE ###
    cost = np.squeeze(cost)      # To make sure your cost's shape is what we expect (e.g. this turns [[17]] into 17).
    assert(cost.shape == ())
    return cost
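As a small worked instance of equation (7), the cost can be evaluated by hand for a few samples. The predictions and labels below are made up for illustration.

```python
import math

# Hypothetical predictions a^[L](i) and labels y^(i) for m = 3 samples.
AL = [0.8, 0.9, 0.4]
Y = [1, 1, 0]

m = len(Y)
# For y = 1 only log(a) contributes; for y = 0 only log(1 - a) contributes.
cost = -sum(y * math.log(a) + (1 - y) * math.log(1 - a) for a, y in zip(AL, Y)) / m
print(round(cost, 4))  # 0.2798
```

Here the sum is log(0.8) + log(0.9) + log(0.6), so confident correct predictions contribute little and the less certain third sample dominates the cost.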
Backpropagation
The backpropagation code implements the following four formulas:
$$dZ^{[l]}_{n^{[l]}\times m} = {\partial J \over \partial Z^{[l]}} = dA^{[l]}_{n^{[l]}\times m} \odot g'^{[l]}(Z^{[l]})_{n^{[l]}\times m}$$
$$dW^{[l]}_{(n^{[l]}\times n^{[l-1]})} = {\partial J \over \partial W^{[l]}} = {1\over m} dZ^{[l]}_{(n^{[l]}\times m)} A^{[l-1]T}_{(m\times n^{[l-1]})}$$
$$db^{[l]}_{(n^{[l]}\times 1)} = {\partial J \over \partial b^{[l]}} = {1\over m}\sum^{m}_{i=1} dZ^{[l](i)}$$
$$dA^{[l-1]}_{(n^{[l-1]}\times m)} = {\partial J\over \partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]}\times n^{[l]})} dZ^{[l]}_{(n^{[l]}\times m)}$$
As with forward propagation, the work is split into two parts. Part 1 implements the four formulas above, turning one layer into a module. Part 2 combines the modules into backpropagation for the complete network.
Part 1
Part 1 is split into two functions, mirroring forward propagation. The last three formulas correspond to the linear "WX+b" part of forward propagation; the first formula corresponds to the nonlinear step that passes the linear output through the activation function.
The linear_backward function implements the last three formulas
def linear_backward(dZ, cache):
    """
    Implement the linear portion of backward propagation for a single layer (layer l)
    Arguments:
    dZ -- Gradient of the cost with respect to the linear output (of current layer l)
    cache -- tuple of values (A_prev, W, b) coming from the forward propagation in the current layer
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    A_prev, W, b = cache
    m = A_prev.shape[1]
    ### START CODE HERE ### (≈ 3 lines of code)
    dW = 1/m * np.dot(dZ, A_prev.T)
    db = 1/m * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    ### END CODE HERE ###
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db
The three formulas below require three quantities: $dZ^{[l]}$, $A^{[l-1]}$ and $W^{[l]}$. The last two are passed to the function inside the cache tuple.
$$dW^{[l]}_{(n^{[l]}\times n^{[l-1]})} = {\partial J \over \partial W^{[l]}} = {1\over m} dZ^{[l]}_{(n^{[l]}\times m)} A^{[l-1]T}_{(m\times n^{[l-1]})}$$
$$db^{[l]}_{(n^{[l]}\times 1)} = {\partial J \over \partial b^{[l]}} = {1\over m}\sum^{m}_{i=1} dZ^{[l](i)}$$
$$dA^{[l-1]}_{(n^{[l-1]}\times m)} = {\partial J\over \partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]}\times n^{[l]})} dZ^{[l]}_{(n^{[l]}\times m)}$$
linear_activation_backward calls the function above to implement the full module
def linear_activation_backward(dA, cache, activation):
    """
    Implement the backward propagation for the LINEAR->ACTIVATION layer.
    Arguments:
    dA -- post-activation gradient for current layer l
    cache -- tuple of values (linear_cache, activation_cache) we store for computing backward propagation efficiently
    activation -- the activation to be used in this layer, stored as a text string: "sigmoid" or "relu"
    Returns:
    dA_prev -- Gradient of the cost with respect to the activation (of the previous layer l-1), same shape as A_prev
    dW -- Gradient of the cost with respect to W (current layer l), same shape as W
    db -- Gradient of the cost with respect to b (current layer l), same shape as b
    """
    linear_cache, activation_cache = cache
    if activation == "relu":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = relu_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
    elif activation == "sigmoid":
        ### START CODE HERE ### (≈ 2 lines of code)
        dZ = sigmoid_backward(dA, activation_cache)
        dA_prev, dW, db = linear_backward(dZ, linear_cache)
        ### END CODE HERE ###
    else:
        # Fail early instead of returning undefined gradients.
        raise ValueError("activation must be 'sigmoid' or 'relu'")
    return dA_prev, dW, db
Input: $dA^{[l]}$ of layer l and the quantities needed to compute the gradients (not repeated here; see the structure of cache described earlier).
Output: $dW^{[l]}$, $db^{[l]}$ and $dA^{[l-1]}$.
Note that the formula
$$dZ^{[l]}_{n^{[l]}\times m} = {\partial J \over \partial Z^{[l]}} = dA^{[l]}_{n^{[l]}\times m} \odot g'^{[l]}(Z^{[l]})_{n^{[l]}\times m}$$
is already implemented by the provided relu_backward() and sigmoid_backward(), which saves some work. They are called with $dA^{[l]}$ and the activation_cache of layer l.
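relu_backward() and sigmoid_backward() come with the assignment and are not listed in this post. The sketch below shows one way they could be written from the dZ formula above, assuming activation_cache holds Z; the course's own helpers may differ in detail.

```python
import numpy as np

def relu_backward(dA, activation_cache):
    """dZ = dA * g'(Z) for relu, where g'(Z) is 1 if Z > 0 and 0 otherwise."""
    Z = activation_cache
    dZ = np.array(dA, copy=True)   # copy so the caller's dA is left untouched
    dZ[Z <= 0] = 0
    return dZ

def sigmoid_backward(dA, activation_cache):
    """dZ = dA * g'(Z) for sigmoid, where g'(Z) = s (1 - s) and s = sigmoid(Z)."""
    Z = activation_cache
    s = 1.0 / (1.0 + np.exp(-Z))
    return dA * s * (1 - s)

Z = np.array([[-1.0, 0.0, 2.0]])
dA = np.array([[0.5, 0.5, 0.5]])
dZ_relu = relu_backward(dA, Z)
dZ_sig = sigmoid_backward(dA, Z)
print(dZ_relu)
print(dZ_sig)
```

In this sketch the relu derivative at exactly Z = 0 is taken to be 0, which is a common convention rather than a mathematical necessity.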
Part 2
def L_model_backward(AL, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group
    Arguments:
    AL -- probability vector, output of the forward propagation (L_model_forward())
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of caches containing:
                every cache of linear_activation_forward() with "relu" (it's caches[l], for l in range(L-1) i.e l = 0...L-2)
                the cache of linear_activation_forward() with "sigmoid" (it's caches[L-1])
    Returns:
    grads -- A dictionary with the gradients
             grads["dA" + str(l)] = ...
             grads["dW" + str(l)] = ...
             grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches) # the number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape) # after this line, Y is the same shape as AL
    # Initializing the backpropagation
    ### START CODE HERE ### (1 line of code)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    ### END CODE HERE ###
    # Lth layer (SIGMOID -> LINEAR) gradients. Inputs: "dAL, current_cache". Outputs: "grads["dAL-1"], grads["dWL"], grads["dbL"]
    ### START CODE HERE ### (approx. 2 lines)
    current_cache = caches[L-1]
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = linear_activation_backward(dAL, current_cache, 'sigmoid')
    ### END CODE HERE ###
    # Loop from l=L-2 to l=0
    for l in reversed(range(L-1)):
        # lth layer: (RELU -> LINEAR) gradients.
        # Inputs: "grads["dA" + str(l + 1)], current_cache". Outputs: "grads["dA" + str(l)] , grads["dW" + str(l + 1)] , grads["db" + str(l + 1)]
        ### START CODE HERE ### (approx. 5 lines)
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = linear_activation_backward(grads["dA" + str(l+1)], current_cache, 'relu')
        grads["dA" + str(l)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
        ### END CODE HERE ###
    return grads
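Input: the predictions AL for the m samples, the labels Y of the m samples, and the caches saved during forward propagation.
Output: the grads dictionary, which holds every dW, db and dA; grads["dW3"] = $dW^{[3]}$, and likewise for the others.
Analysis (a few points to note):
- L = len(caches). Here L is the number of layers, not the number of layers plus 1; caches[0] through caches[L-1] hold the caches of layers 1 through L.
- $dA^{[L]}$ of the output layer cannot be obtained from the formula below, because there is no layer L+1 to supply $dZ^{[L+1]}$. It is computed directly from the cost function instead, as dAL in the code.
$$dA^{[l-1]}_{(n^{[l-1]}\times m)} = {\partial J\over \partial A^{[l-1]}} = W^{[l]T}_{(n^{[l-1]}\times n^{[l]})} dZ^{[l]}_{(n^{[l]}\times m)}$$
- Backpropagation runs from layer L down to layer 1. The output layer is handled before the loop, so the loop uses reversed(range(L-1)) to cover layers L-1 down to 1.
- In for l in reversed(range(L-1)), l is an index into caches, so it corresponds to layer l+1.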
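The index bookkeeping in these notes is easy to get wrong, so it can help to print it. This short sketch only traces which cache index and which grads keys each step of the backward pass touches; L = 4 is a made-up value and no gradients are computed.

```python
L = 4  # hypothetical number of layers

# The output layer (layer L) is handled before the loop with caches[L-1].
steps = [(L - 1, L, "sigmoid")]
for l in reversed(range(L - 1)):
    # caches[l] belongs to layer l + 1, which uses relu.
    steps.append((l, l + 1, "relu"))

for cache_index, layer, activation in steps:
    print("caches[%d] -> layer %d (%s): takes dA of layer %d, writes dA%d, dW%d, db%d"
          % (cache_index, layer, activation, layer, layer - 1, layer, layer))
```

The last step writes dA0, the gradient with respect to the input X, which the parameter update never uses.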
Part 3: updating the parameters
def update_parameters(parameters, grads, learning_rate):
    """
    Update parameters using gradient descent
    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients, output of L_model_backward
    learning_rate -- step size of the gradient descent update
    Returns:
    parameters -- python dictionary containing your updated parameters
                  parameters["W" + str(l)] = ...
                  parameters["b" + str(l)] = ...
    """
    L = len(parameters) // 2 # number of layers in the neural network
    # Update rule for each parameter. Use a for loop.
    ### START CODE HERE ### (≈ 3 lines of code)
    for l in range(L):
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l+1)]
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l+1)]
    ### END CODE HERE ###
    return parameters
Input: all W and b, stored in the dictionary parameters; grads, which holds the gradient of every W and b; and the learning rate.
Output: all W and b after the update.
Procedure (λ denotes the learning rate, learning_rate in the code):
$$W^{[l]} \leftarrow W^{[l]} - \lambda\, dW^{[l]}$$
$$b^{[l]} \leftarrow b^{[l]} - \lambda\, db^{[l]}$$
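A single update step on made-up numbers shows the rule in action. The values of the parameters, gradients and learning rate below are hypothetical.

```python
import numpy as np

learning_rate = 0.1                        # hypothetical value of lambda
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.2, -0.4]]), "db1": np.array([[1.0]])}

L = len(parameters) // 2                   # one W and one b per layer
for l in range(1, L + 1):
    parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
    parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]

print(parameters["W1"], parameters["b1"])
```

Each entry moves against its gradient: W1 goes from (1.0, -2.0) to approximately (0.98, -1.96) and b1 from 0.5 to approximately 0.4.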
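Code summary
We start with the training set X, Y and a hand-chosen list layer_dims giving the number of nodes in each layer. The sizes of layer 0 and layer L are fixed by the data; the sizes in between can be chosen freely. The number of layers is L = len(layer_dims) - 1.
First, initialize the parameters of the network:
parameters = initialize_parameters_deep(layer_dims)
Then run forward propagation to obtain the output and the caches:
AL, caches = L_model_forward(X, parameters)
Compute the cost function:
cost = compute_cost(AL, Y)
Run backpropagation and update the parameters:
grads = L_model_backward(AL, Y, caches)
parameters = update_parameters(parameters, grads, learning_rate)
Backpropagation needs the values of W and b. This code does not read them from parameters; instead, linear_forward stores W and b in caches during the forward pass, and backpropagation takes them from there.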
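The steps in this summary are normally repeated in a loop. The following self-contained sketch condenses the whole pipeline into short functions and trains on toy data. It is an illustration under stated assumptions, not the course's model function: the names `init`, `forward`, `backward` and `update`, the toy dataset, the layer sizes, the learning rate and the iteration count are all made up.

```python
import numpy as np

def init(layer_dims, seed=3):
    rng = np.random.RandomState(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = rng.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

def forward(X, params):
    caches, A, L = [], X, len(params) // 2
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        caches.append(((A, W, b), Z))
        A = np.maximum(0, Z) if l < L else 1.0 / (1.0 + np.exp(-Z))
    return A, caches

def compute_cost(AL, Y):
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m)

def backward(AL, Y, caches):
    grads, L, m = {}, len(caches), AL.shape[1]
    dA = -(Y / AL - (1 - Y) / (1 - AL))
    for l in reversed(range(L)):               # caches[l] belongs to layer l + 1
        (A_prev, W, b), Z = caches[l]
        if l == L - 1:
            dZ = dA * AL * (1 - AL)            # sigmoid derivative (output layer)
        else:
            dZ = dA * (Z > 0)                  # relu derivative (hidden layers)
        grads["dW" + str(l + 1)] = dZ @ A_prev.T / m
        grads["db" + str(l + 1)] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = W.T @ dZ
    return grads

def update(params, grads, learning_rate):
    for l in range(1, len(params) // 2 + 1):
        params["W" + str(l)] = params["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        params["b" + str(l)] = params["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return params

# Hypothetical toy data: 2 features, label 1 when x1 + x2 > 0.3.
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(2, 200))
Y = (X[0:1] + X[1:2] > 0.3).astype(float)

params = init([2, 4, 1])
costs = []
for i in range(200):
    AL, caches = forward(X, params)
    costs.append(compute_cost(AL, Y))
    grads = backward(AL, Y, caches)
    params = update(params, grads, learning_rate=0.5)

print("first cost: %.4f, last cost: %.4f" % (costs[0], costs[-1]))
```

The first cost is close to log 2, because the small initial weights make every prediction roughly 0.5; later costs should be lower as the parameters adapt. The sketch does not guard against log(0), which a more careful version would handle by clipping AL.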