Parameter initialization
- Zero initialization
- Random initialization
- He initialization
Try "He initialization", named for the first author of He et al., 2015. (If you have heard of "Xavier initialization", this is similar, except that Xavier initialization scales the weights $W^{[l]}$ by `sqrt(1./layers_dims[l-1])`, whereas He initialization uses `sqrt(2./layers_dims[l-1])`.)
Hint: This function is similar to the previous `initialize_parameters_random(...)`. The only difference is that instead of multiplying `np.random.randn(..,..)` by 10, you multiply it by $\sqrt{\frac{2}{\text{dimension of the previous layer}}}$, which is what He initialization recommends for layers with a ReLU activation.
- For information to flow well through the network, the variance of each layer's output should be kept as equal as possible.
- By default, the variance depends only on the number of inputs: $var(w_i) = 1 / n_i$
- FillerParameter_VarianceNorm_FAN_OUT: the variance depends only on the number of outputs: $var(w_i) = 1 / n_{i+1}$
- FillerParameter_VarianceNorm_AVERAGE: the variance depends on both the number of inputs and the number of outputs: $var(w_i) = 2 / (n_i + n_{i+1})$
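The He initialization hint and the three variance options above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference solution: the function names, the `seed` argument, the `mode` strings, and the use of `np.random.default_rng` are assumptions made for this example.

```python
import numpy as np

def initialize_parameters_he(layers_dims, seed=3):
    """Return W1..WL and b1..bL with He-scaled weights and zero biases.

    layers_dims[l] is the number of units in layer l (index 0 is the input).
    """
    rng = np.random.default_rng(seed)  # assumption: any seeded generator works here
    parameters = {}
    for l in range(1, len(layers_dims)):
        fan_in = layers_dims[l - 1]
        # Standard normal samples scaled by sqrt(2 / dimension of the previous layer).
        parameters["W" + str(l)] = (
            rng.standard_normal((layers_dims[l], fan_in)) * np.sqrt(2.0 / fan_in)
        )
        parameters["b" + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters

def xavier_variance(n_in, n_out, mode="fan_in"):
    """Target weight variance for the three variance-norm options listed above."""
    if mode == "fan_in":      # default: inputs only
        return 1.0 / n_in
    if mode == "fan_out":     # outputs only
        return 1.0 / n_out
    if mode == "average":     # inputs and outputs
        return 2.0 / (n_in + n_out)
    raise ValueError("unknown mode: " + mode)

params = initialize_parameters_he([1000, 500, 1])
print(params["W1"].shape, params["W1"].var())
```

With a wide previous layer, the empirical variance of `W1` should land close to `2 / layers_dims[0]`.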
Regularization
- L2 Regularization
- Dropout
You would like to shut down some neurons in the first and second layers. To do that, you are going to carry out 4 steps:
- In lecture, we discussed creating a variable $d^{[1]}$ with the same shape as $a^{[1]}$ using `np.random.rand()` to randomly get numbers between 0 and 1. Here, you will use a vectorized implementation, so create a random matrix $D^{[1]} = [d^{[1](1)} d^{[1](2)} ... d^{[1](m)}]$ of the same dimension as $A^{[1]}$.
- Set each entry of $D^{[1]}$ to 0 with probability (`1-keep_prob`) or 1 with probability (`keep_prob`) by thresholding the values in $D^{[1]}$ appropriately. Hint: to set each entry of a matrix X to 1 (if the entry is less than 0.5) or 0 (otherwise), you would do `X = (X < 0.5)`. Note that 0 and 1 are respectively equivalent to False and True.
- Set $A^{[1]}$ to $A^{[1]} * D^{[1]}$. (You are shutting down some neurons.) You can think of $D^{[1]}$ as a mask, so that when it is multiplied with another matrix, it shuts down some of the values.
- Divide $A^{[1]}$ by `keep_prob`. By doing this you ensure that the cost still has the same expected value as without dropout. (This technique is also called inverted dropout.)
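The four steps above can be sketched for a single layer as follows. The helper name `dropout_forward`, the explicit `rng` argument, and the all-ones activation matrix are assumptions chosen for illustration, not part of the original assignment.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A; return the result and the mask."""
    D = rng.random(A.shape)   # Step 1: uniform random numbers in [0, 1), same shape as A
    D = D < keep_prob         # Step 2: True (1) with probability keep_prob, else False (0)
    A = A * D                 # Step 3: shut down the masked neurons
    A = A / keep_prob         # Step 4: rescale so the expected value is unchanged
    return A, D

rng = np.random.default_rng(1)
A1 = np.ones((4, 100000))     # hypothetical activations, all equal to 1
A1_drop, D1 = dropout_forward(A1, keep_prob=0.8, rng=rng)
print(D1.mean(), A1_drop.mean())
```

The mask `D1` is returned because backpropagation needs to apply the same mask (and the same division by `keep_prob`) to the corresponding gradient.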
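The basic idea of inverted dropout is that, in every training iteration, each neuron is kept with probability `keep_prob` (that is, shut down with probability $1 - keep\_prob$). In NumPy this is implemented as:
U1 = (np.random.rand(*H1.shape) < p) / p
This produces a mask, which is then multiplied with the neurons' output activations. Here `numpy.random.rand` returns an array drawn from the uniform distribution on [0, 1), while `numpy.random.randn` returns an array drawn from the standard normal distribution.
np.random.rand(*H1.shape) < p
yields a Boolean array whose entries are True where the random value is less than p and False otherwise.
- Why divide by $p$ afterwards? Andrew Ng explains in the course that this keeps the expected value of a neuron's output activation the same as without dropout. Using a little probability theory: suppose a neuron's output activation is $a$. Without dropout, its expected output is $a$. With dropout, the neuron is either kept or shut down, so it can be treated as a discrete random variable following a 0-1 (Bernoulli) distribution, and its expected output becomes $p * a + (1 - p) * 0 = pa$. To keep the expectation the same as without dropout, the output must be divided by $p$.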
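- The traditional dropout used in AlexNet does not divide the output activations by $p$ during training, so their expected value is $pa$. At test time dropout is not applied and all neurons are kept, so the expected output is $a$. To make the expected output at test time match that of training (otherwise the trained model cannot be evaluated correctly), the test-time activations must be multiplied by $p$, so that their expected value stays at $pa$.
- Traditional dropout and inverted dropout differ in implementation details, but mathematically their regularization effect is the same. So why does everyone use inverted dropout now? There are two reasons:
  - Test-time performance matters, especially for a deployed product. The model is already trained and only inference is executed, so for users, faster inference means a better experience. Inverted dropout moves the step that keeps the expectations consistent into the training phase, which removes a step at test time and speeds up inference.
  - `keep_prob` is a hyperparameter that may need tuning. With inverted dropout, changing `keep_prob` only requires changing the training code. The test-time inference code does not use `keep_prob` and needs no modification, which lowers the chance of introducing bugs.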
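The expected-value argument above can be checked numerically. The sketch below compares traditional and inverted dropout for one neuron whose activation is always `a = 3`. The constant activation, the sample count, and the variable names are assumptions chosen only to make the averages easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                        # keep probability
a = np.full(200000, 3.0)       # one neuron observed over many training iterations

mask = rng.random(a.shape) < p # True where the neuron is kept

# Traditional dropout: no rescaling during training, so E[output] = p * a.
train_vanilla = a * mask
test_vanilla = a * p           # test time must multiply by p to match training

# Inverted dropout: divide by p during training, so E[output] = a.
train_inverted = a * mask / p
test_inverted = a              # test time uses the activations unchanged

print(train_vanilla.mean(), test_vanilla.mean())
print(train_inverted.mean(), test_inverted.mean())
```

In both variants the training and test averages agree with each other; only the inverted variant keeps `p` out of the test-time path.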
Batch Normalization
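- In a deep neural network, the distribution of the input to each nonlinearity ($x = WU + B$, where $U$ is the layer input) gradually shifts as the network gets deeper or as training proceeds. Training typically converges slowly because the overall distribution drifts toward the upper and lower ends of the nonlinearity's range (for a sigmoid, this means $WU + B$ takes large negative or positive values), which makes the gradients of the lower layers vanish during backpropagation. This is the fundamental reason why deep networks converge more and more slowly. BN uses normalization to force the distribution of this input, for every neuron in every layer, back to a standardized distribution with mean 0 and variance 1. In other words, it pulls an increasingly skewed distribution back to a standard one, so that the input falls in the region where the nonlinearity is sensitive to its input. A small change in the input then produces a larger change in the loss, which means larger gradients, no vanishing-gradient problem, and faster convergence, so training speeds up considerably.
- For each hidden neuron, the input distribution that is drifting toward the saturated extremes of the nonlinearity is forced back to a standardized distribution with mean 0 and variance 1, so that the input of the nonlinearity falls in the region where it is sensitive to its input. This avoids the vanishing-gradient problem.
- After BN, most activation values fall in the linear region of the nonlinearity, where the derivative is far from the saturated region, which speeds up convergence.
- After the transformation, the activation x of a neuron has mean 0 and variance 1. The goal is to pull values toward the linear region of the subsequent nonlinearity, which increases the derivative, improves the flow of information in backpropagation, and speeds up convergence. However, this reduces the expressive power of the network. To prevent this, each neuron gets two additional parameters (scale and shift), which are learned during training and applied to the normalized activation to undo the transformation where needed, restoring the network's expressive power.
- ① It greatly increases training speed and accelerates convergence. ② It can also improve classification results; one explanation is that it acts as a regularizer against overfitting, similar to Dropout, so comparable results can be reached without Dropout. ③ Hyperparameter tuning also becomes much simpler: the requirements on initialization are lower, and larger learning rates can be used.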
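The normalize-then-scale-and-shift idea described above can be sketched as a training-time forward pass. This is a simplified illustration under stated assumptions: the function name `batchnorm_forward`, the `(batch, features)` layout, and the `eps` value are choices made here, and the running statistics that inference would need are omitted.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of x over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # mean 0, variance close to 1
    return gamma * x_hat + beta            # learned scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
# Hypothetical pre-activations with a shifted mean and a wide spread.
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))

out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
out_scaled = batchnorm_forward(x, gamma=2.0 * np.ones(4), beta=np.ones(4))
print(out.mean(axis=0), out.var(axis=0))
```

With `gamma = 1` and `beta = 0` the output is simply the standardized input; other values let the network move the distribution away from mean 0 and variance 1 when that helps.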
Gradient checking
For each i in num_parameters:
- To compute `J_plus[i]`:
  - Set $\theta^{+}$ to `np.copy(parameters_values)`
  - Set $\theta^{+}_i$ to $\theta^{+}_i + \varepsilon$
  - Calculate $J^{+}_i$ using `forward_propagation_n(x, y, vector_to_dictionary(`$\theta^{+}$`))`.
- To compute `J_minus[i]`: do the same thing with $\theta^{-}$
- Compute $gradapprox[i] = \frac{J^{+}_i - J^{-}_i}{2 \varepsilon}$
Thus, you get a vector gradapprox, where gradapprox[i] is an approximation of the gradient with respect to `parameters_values[i]`. You can now compare this gradapprox vector to the gradients vector from backpropagation. Just like for the 1D case (Steps 1', 2', 3'), compute:
$$ difference = \frac{\| grad - gradapprox \|_2}{\| grad \|_2 + \| gradapprox \|_2} $$
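The procedure above can be sketched on a toy problem. To keep the example self-contained, a hypothetical cost `J = sum(theta^3)` and its analytic gradient stand in for `forward_propagation_n` and backpropagation; the function names and the value of `epsilon` are assumptions for illustration.

```python
import numpy as np

def cost(theta):
    """Hypothetical stand-in for forward propagation: J = sum(theta^3)."""
    return np.sum(theta ** 3)

def cost_grad(theta):
    """Analytic gradient of the cost above, standing in for backpropagation."""
    return 3 * theta ** 2

def gradient_check(theta, grad, epsilon=1e-7):
    """Return the relative difference between grad and a two-sided numerical estimate."""
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = np.copy(theta)
        theta_plus[i] += epsilon           # nudge parameter i up
        theta_minus = np.copy(theta)
        theta_minus[i] -= epsilon          # nudge parameter i down
        gradapprox[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * epsilon)
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

theta = np.array([1.0, -2.0, 0.5])
difference = gradient_check(theta, cost_grad(theta))       # correct gradient
difference_bad = gradient_check(theta, 2 * theta ** 2)     # deliberately wrong gradient
print(difference, difference_bad)
```

A correct gradient should give a very small `difference`, while the deliberately wrong one gives a value that is orders of magnitude larger.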