Initialization
parameters – python dictionary containing your parameters “W1”, “b1”, …, “WL”, “bL”:
- W1 – weight matrix of shape (layers_dims[1], layers_dims[0])
- b1 – bias vector of shape (layers_dims[1], 1)
- …
- WL – weight matrix of shape (layers_dims[L], layers_dims[L-1])
- bL – bias vector of shape (layers_dims[L], 1)
Zero initialization
import numpy as np  # used by every code block below

# GRADED FUNCTION: initialize_parameters_zeros
def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)  # number of layers in the network
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
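As a quick usage check (my own sketch, not part of the graded notebook), calling the function on a small layer specification such as [3, 2, 1] shows the expected shapes and also why zero initialization fails: every unit in a layer starts identical, so symmetry is never broken.

# Quick usage sketch (not from the notebook): inspect the zero-initialized parameters.
params = initialize_parameters_zeros([3, 2, 1])
print(params["W1"].shape, params["b1"].shape)   # (2, 3) (2, 1)
print(params["W2"].shape, params["b2"].shape)   # (1, 2) (1, 1)
# Both rows of W1 are identical (all zeros), so both hidden units compute the same
# output and receive the same gradient update: the network cannot break symmetry.
print(params["W1"])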
Random initialization
# GRADED FUNCTION: initialize_parameters_random
def initialize_parameters_random(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)  # integer representing the number of layers
    for l in range(1, L):
        # *10 gives deliberately large weights, to show what goes wrong with large initialization
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
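As a rough illustration (my own sketch, not assignment code), weights scaled by 10 produce large pre-activations, so a final sigmoid saturates near 0 or 1, which is exactly the problem this initializer is meant to demonstrate:

# Illustration (not from the notebook): large initial weights saturate a sigmoid output.
np.random.seed(3)
X = np.random.randn(3, 5)              # 5 toy examples with 3 features each
W = np.random.randn(1, 3) * 10         # same *10 scaling as above
Z = np.dot(W, X)
A = 1 / (1 + np.exp(-Z))               # sigmoid
print(np.round(A, 4))                  # values hug 0 or 1, so gradients are tiny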
He initialization
He initialization is intended for nonlinear activation functions such as ReLU and PReLU.
The weights $W^{[l]}$ of each layer are drawn from a Gaussian distribution with mean 0 and standard deviation $\sqrt{\frac{2}{n^{[l-1]}}}$ (i.e. variance $\frac{2}{n^{[l-1]}}$, where $n^{[l-1]}$ is the number of units in the previous layer), which keeps the variance of every layer's inputs on the same scale:
W^{[l]} = \text{random} * \sqrt{\frac{2}{\text{layers\_dims}[l-1]}}
# GRADED FUNCTION: initialize_parameters_he
def initialize_parameters_he(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2 / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
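A quick sanity check (my own sketch): the empirical standard deviation of the He-initialized weights should be close to $\sqrt{2/\text{layers\_dims}[l-1]}$.

# Sanity check (not from the notebook): compare the empirical std of W1 with sqrt(2/n).
params = initialize_parameters_he([500, 300, 1])
print(params["W1"].std())              # empirical standard deviation
print(np.sqrt(2 / 500))                # theoretical value, about 0.0632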
Regularization
L2 Regularization
From:
J = -\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)}
To:
J_{regularized} = \small \underbrace{-\frac{1}{m} \sum\limits_{i = 1}^{m} \large{(}\small y^{(i)}\log\left(a^{[L](i)}\right) + (1-y^{(i)})\log\left(1- a^{[L](i)}\right) \large{)} }_\text{cross-entropy cost} + \underbrace{\frac{1}{m} \frac{\lambda}{2} \sum\limits_l\sum\limits_k\sum\limits_j W_{k,j}^{[l]2} }_\text{L2 regularization cost}
To calculate $\sum\limits_k\sum\limits_j W_{k,j}^{[l]2}$, use:
np.sum(np.square(Wl))
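For example (a tiny check of my own):

# Tiny check: sum of the squared entries of a weight matrix.
W = np.array([[1., 2.], [3., 4.]])
print(np.sum(np.square(W)))            # 1 + 4 + 9 + 16 = 30.0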
compute_cost_with_regularization
# GRADED FUNCTION: compute_cost_with_regularization
def compute_cost_with_regularization(A3, Y, parameters, lambd):
    m = Y.shape[1]
    W1 = parameters["W1"]
    W2 = parameters["W2"]
    W3 = parameters["W3"]
    cross_entropy_cost = compute_cost(A3, Y)  # cross-entropy part of the cost
    L2_regularization_cost = (1 / m) * (lambd / 2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))
    cost = cross_entropy_cost + L2_regularization_cost
    return cost
backward_propagation_with_regularization
\frac{d}{dW} \left( \frac{1}{2}\frac{\lambda}{m} W^2 \right) = \frac{\lambda}{m} W
# GRADED FUNCTION: backward_propagation_with_regularization
def backward_propagation_with_regularization(X, Y, cache, lambd):
    m = X.shape[1]
    (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache
    dZ3 = A3 - Y  # gradient of the sigmoid/cross-entropy output layer
    ### START CODE HERE ### (approx. 1 line)
    dW3 = 1. / m * np.dot(dZ3, A2.T) + lambd / m * W3
    ### END CODE HERE ###
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    # Every layer except the last uses ReLU; the ReLU gradient is 1 only where x > 0, otherwise 0
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW2 = 1. / m * np.dot(dZ2, A1.T) + lambd / m * W2
    ### END CODE HERE ###
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    dA1 = np.dot(W2.T, dZ2)
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    ### START CODE HERE ### (approx. 1 line)
    dW1 = 1. / m * np.dot(dZ1, X.T) + lambd / m * W1
    ### END CODE HERE ###
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
Notes:
- The value of $\lambda$ is a hyperparameter that you can tune using a dev set.
- L2 regularization makes the decision boundary smoother. If $\lambda$ is too large, it is also possible to "oversmooth", resulting in a model with high bias.
L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the square values of the weights in the cost function, you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
What L2 regularization implies:
- Cost computation: a regularization term is added to the cost.
- Backpropagation function: there are extra terms in the gradients with respect to the weight matrices.
- Weights end up smaller ("weight decay"): weights are pushed to smaller values (see the sketch after this list).
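To see why this is called weight decay, the sketch below (my own illustration, not assignment code) shows that one gradient-descent step with the extra $\frac{\lambda}{m} W$ term is the same as first shrinking W by a factor $(1 - \alpha\frac{\lambda}{m})$ and then applying the unregularized gradient:

# Illustration (not from the notebook): the L2 gradient term acts as weight decay.
np.random.seed(0)
m, lambd, learning_rate = 50, 0.7, 0.1
W = np.random.randn(4, 3)
dW_unreg = np.random.randn(4, 3)       # stand-in for the gradient without the L2 term

step_with_l2_term = W - learning_rate * (dW_unreg + lambd / m * W)
step_as_decay = (1 - learning_rate * lambd / m) * W - learning_rate * dW_unreg
print(np.allclose(step_with_l2_term, step_as_decay))   # True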
Dropout
Notes:
- A common mistake when using dropout is to use it both in training and testing. You should use dropout (randomly eliminate nodes) only during training.
- Deep learning frameworks such as tensorflow, PaddlePaddle, keras, or caffe come with a dropout layer implementation. Don't stress; you will soon learn some of these frameworks.
What you should remember about dropout:
- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then on average half the nodes are shut down, so the output would be scaled by 0.5 since only the remaining half contribute to the solution. Dividing by 0.5 is equivalent to multiplying by 2, so the output keeps the same expected value. You can check that this works even when keep_prob takes values other than 0.5 (a small numerical check follows this list).
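Here is the small numerical check mentioned above (my own sketch, not assignment code): with inverted dropout, dividing by keep_prob keeps the mean of the activations roughly unchanged.

# Illustration (not from the notebook): inverted dropout preserves the expected activation value.
np.random.seed(1)
keep_prob = 0.5
A = np.random.rand(1000, 1000)                 # stand-in activations
D = (np.random.rand(*A.shape) < keep_prob)     # dropout mask
A_drop = (A * D) / keep_prob                   # shut down + rescale
print(A.mean(), A_drop.mean())                 # the two means are close (~0.5)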
forward_propagation_with_dropout
You would like to shut down some neurons in the first and second layers. To do that, carry out the following 4 steps:
- In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$ using np.random.rand() to randomly get numbers between 0 and 1. Here you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1]} d^{[1]} ... d^{[1]}]$ of the same dimension as $A^{[1]}$.
- By thresholding appropriately, set each entry of $D^{[1]}$ to 0 with probability 1-keep_prob and to 1 with probability keep_prob. Hint: X = (X < 0.5) sets every entry of X that is less than 0.5 to 1 (True) and every other entry to 0 (False); note that 0 and 1 are respectively equivalent to False and True.
- Set $A^{[1]}$ to $A^{[1]} \times D^{[1]}$ (you are shutting down some neurons). You can think of $D^{[1]}$ as a mask, so that when it is multiplied with another matrix it "shuts down" some of the values.
- Divide $A^{[1]}$ by keep_prob. By doing this you ensure that the cost still has the same expected value as without dropout. (This technique is also called inverted dropout.)
# GRADED FUNCTION: forward_propagation_with_dropout
def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
    np.random.seed(1)
    # retrieve parameters
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]
    W3 = parameters["W3"]
    b3 = parameters["b3"]
    # LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    ### START CODE HERE ### (approx. 4 lines)
    # Steps 1-4 below correspond to the Steps 1-4 described above.
    # Step 1: initialize matrix D1 = np.random.rand(..., ...)
    D1 = np.random.rand(A1.shape[0], A1.shape[1])
    # Step 2: convert entries of D1 to 0 or 1 (using keep_prob as the threshold);
    # comparing each entry against keep_prob yields a boolean matrix (True/False) of the same shape
    D1 = (D1 < keep_prob)
    # Step 3: shut down some neurons of A1
    A1 *= D1
    # Step 4: scale the value of neurons that haven't been shut down
    A1 = np.divide(A1, keep_prob)
    ### END CODE HERE ###
    Z2 = np.dot(W2, A1) + b2
    A2 = relu(Z2)
    ### START CODE HERE ### (approx. 4 lines)
    # Step 1: initialize matrix D2 = np.random.rand(..., ...)
    D2 = np.random.rand(A2.shape[0], A2.shape[1])
    # Step 2: convert entries of D2 to 0 or 1 (using keep_prob as the threshold)
    D2 = (D2 < keep_prob)
    # Step 3: shut down some neurons of A2
    A2 *= D2
    # Step 4: scale the value of neurons that haven't been shut down
    A2 = np.divide(A2, keep_prob)
    ### END CODE HERE ###
    Z3 = np.dot(W3, A2) + b3
    A3 = sigmoid(Z3)
    cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)
    return A3, cache
backward_propagation_with_dropout
Dropout during backpropagation is actually quite simple; you have to carry out 2 steps:
- During forward propagation you shut down some neurons by applying the mask $D^{[1]}$ to A1. In backpropagation you have to shut down the same neurons by reapplying the same mask $D^{[1]}$ to dA1.
- During forward propagation you divided A1 by keep_prob. In backpropagation you therefore have to divide dA1 by keep_prob again (from a calculus point of view, if $A^{[1]}$ is scaled by keep_prob, then its derivative $dA^{[1]}$ is scaled by the same keep_prob).
# GRADED FUNCTION: backward_propagation_with_dropout
def backward_propagation_with_dropout(X, Y, cache, keep_prob):
    m = X.shape[1]
    (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache
    dZ3 = A3 - Y
    dW3 = 1. / m * np.dot(dZ3, A2.T)
    db3 = 1. / m * np.sum(dZ3, axis=1, keepdims=True)
    dA2 = np.dot(W3.T, dZ3)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D2 to shut down the same neurons as during the forward propagation
    dA2 = dA2 * D2
    # Step 2: Scale the value of neurons that haven't been shut down
    dA2 = dA2 / keep_prob
    ### END CODE HERE ###
    dZ2 = np.multiply(dA2, np.int64(A2 > 0))
    dW2 = 1. / m * np.dot(dZ2, A1.T)
    db2 = 1. / m * np.sum(dZ2, axis=1, keepdims=True)
    dA1 = np.dot(W2.T, dZ2)
    ### START CODE HERE ### (≈ 2 lines of code)
    # Step 1: Apply mask D1 to shut down the same neurons as during the forward propagation
    dA1 = dA1 * D1
    # Step 2: Scale the value of neurons that haven't been shut down
    dA1 = dA1 / keep_prob
    ### END CODE HERE ###
    dZ1 = np.multiply(dA1, np.int64(A1 > 0))
    dW1 = 1. / m * np.dot(dZ1, X.T)
    db1 = 1. / m * np.sum(dZ1, axis=1, keepdims=True)
    gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                 "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                 "dZ1": dZ1, "dW1": dW1, "db1": db1}
    return gradients
Gradient Checking
N-dimensional gradient checking
- The forward propagation function forward_propagation_n and the backward propagation function backward_propagation_n are provided.
\frac{\partial J}{\partial \theta} = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}
After shifting a parameter (a w or a b) by a tiny epsilon, you obtain a new value of J, and the definition of the derivative then gives an approximation of the gradient with respect to that parameter (a minimal 1-D example follows).
Since this has to be computed for every single parameter w and b, gradient checking is very slow.
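Here is the minimal 1-D example (my own sketch, not assignment code): for $J(\theta) = \theta^2$ the true gradient is $2\theta$, and the centered difference recovers it.

# Minimal 1-D illustration (not from the notebook): centered difference vs. analytic gradient.
theta, epsilon = 3.0, 1e-7
J = lambda t: t ** 2
grad = 2 * theta                                                   # analytic gradient
gradapprox = (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon)
print(grad, gradapprox)                                            # both are about 6.0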
For each index i in num_parameters:
- Compute J_plus[i]:
  - Set $\theta^{+}$ to np.copy(parameters_values)
  - Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
  - Compute $J^{+}_i$ using forward_propagation_n(x, y, vector_to_dictionary($\theta^{+}$))
- Compute J_minus[i]: do the same thing with $\theta^{-}$
- Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$
Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to parameter_values[i]. You can now compare this gradapprox vector to the gradients vector from backpropagation. Compute:
difference = \frac {\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2 }
# GRADED FUNCTION: gradient_check_n
def gradient_check_n(parameters, gradients, X, Y, epsilon = 1e-7):
    """
    Checks if backward_propagation_n computes correctly the gradient of the cost output by forward_propagation_n

    Arguments:
    parameters -- python dictionary containing your parameters "W1", "b1", "W2", "b2", "W3", "b3"
    gradients -- output of backward_propagation_n, contains gradients of the cost with respect to the parameters
    X -- input datapoint, of shape (input size, 1)
    Y -- true "label"
    epsilon -- tiny shift to the input to compute the approximated gradient with formula (1)

    Returns:
    difference -- difference (2) between the approximated gradient and the backward propagation gradient
    """
    # Set-up variables
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    J_plus = np.zeros((num_parameters, 1))
    J_minus = np.zeros((num_parameters, 1))
    gradapprox = np.zeros((num_parameters, 1))

    # Compute gradapprox
    # After shifting each parameter by epsilon, the new cost together with the definition of the
    # derivative gives an approximation of the gradient of J with respect to that parameter
    for i in range(num_parameters):
        # Compute J_plus[i]. Inputs: "parameters_values, epsilon". Output = "J_plus[i]".
        # "_" is used because the function outputs two values but we only care about the first one
        ### START CODE HERE ### (approx. 3 lines)
        theta_plus = np.copy(parameters_values)
        theta_plus[i] += epsilon
        J_plus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_plus))
        ### END CODE HERE ###
        # Compute J_minus[i]. Inputs: "parameters_values, epsilon". Output = "J_minus[i]".
        ### START CODE HERE ### (approx. 3 lines)
        theta_minus = np.copy(parameters_values)
        theta_minus[i] -= epsilon
        J_minus[i], _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_minus))
        ### END CODE HERE ###
        # Compute gradapprox[i]
        ### START CODE HERE ### (approx. 1 line)
        gradapprox[i] = (J_plus[i] - J_minus[i]) / (2 * epsilon)
        ### END CODE HERE ###

    # Compare gradapprox to backward propagation gradients by computing the difference.
    ### START CODE HERE ### (approx. 1 line)
    numerator = np.linalg.norm(grad - gradapprox)                   # np.linalg.norm computes the L2 norm
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    difference = numerator / denominator
    ### END CODE HERE ###

    if difference > 1e-7:
        print("\033[93m" + "There is a mistake in the backward propagation! difference = " + str(difference) + "\033[0m")
    else:
        print("\033[92m" + "Your backward propagation works perfectly fine! difference = " + str(difference) + "\033[0m")

    return difference
- The author deliberately introduced bugs into backward_propagation_n (hint: check dW2 and db1).
Notes:
- Gradient checking is slow! Approximating the gradient with $\frac{\partial J}{\partial \theta} \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2 \varepsilon}$ is computationally costly. For this reason, we don't run gradient checking at every iteration during training; run it just a few times to check that the gradient is correct.
- Gradient checking, at least as presented here, doesn't work with dropout. You would usually run gradient checking without dropout to make sure your backprop is correct, and only then add dropout. 😃
Concrete thresholds for the difference metric:
For the difference measure above, you need a threshold for deciding whether backpropagation passes the gradient check. Commonly used thresholds are:
- 1e-7: a common strict standard, allowing a maximum relative error on the order of one part in ten million. If the difference is below this value, the gradients computed by backpropagation are usually considered close enough to the numerical approximation to be acceptable.
- 1e-5: a slightly looser standard, allowing a relative error on the order of one part in a hundred thousand. If compute is limited or the model is complex, a somewhat higher tolerance may be acceptable.
- 1e-2: for some very large networks, or when compute is extremely constrained, an even looser threshold may be used. Note, however, that a difference this large may indicate a significant error in backpropagation and should be treated with caution.
When choosing a threshold, weigh the required precision against the computational cost and the complexity of the model. Gradient checking is normally used only for debugging during early model development; once backpropagation has been confirmed correct, stop running it, since it is too expensive to use continuously during regular training. In practice, most developers pick a relatively strict threshold (such as 1e-7 or 1e-5) to ensure that backpropagation is highly accurate. If the difference exceeds the chosen threshold, look for potential errors in the backpropagation implementation and rerun the gradient check after fixing them (a small helper that encodes these thresholds is sketched below).
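As a convenience, these thresholds can be folded into a tiny helper (a hypothetical sketch of my own, not part of the assignment):

# Hypothetical helper (not from the notebook): interpret the gradient-check difference.
def interpret_difference(difference):
    if difference < 1e-7:
        return "backward propagation very likely correct"
    if difference < 1e-5:
        return "probably fine, but worth a second look"
    if difference < 1e-2:
        return "suspicious: inspect the backward pass"
    return "almost certainly a bug in backward propagation"

print(interpret_difference(1.2e-8))    # -> backward propagation very likely correct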