Deep Learning Network Optimization: Parameter Initialization
The impact of initialization
When training a neural network with gradient descent, the parameters must first be initialized. For the weight matrices W and the bias vectors b, the choice of initialization strategy has a large impact on both the training result and the training efficiency. A good initialization can:
- Speed up the convergence of gradient descent
- Increase the odds of gradient descent converging to a lower training (and generalization) error
Three initialization schemes:
- Zero initialization
  - Fails to "break symmetry"
  - Every neuron in a given layer computes the same output and receives the same gradient, so they all learn the same thing (a short demonstration follows the code below)
    import numpy as np

    # Assumes layers_dims = [n_x, n_h1, ..., n_y] and L = len(layers_dims)
    parameters = {}
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
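To make the symmetry failure concrete, here is a small sketch (my own illustration, not part of the original notes) that runs one forward/backward pass of a hypothetical 2-4-1 tanh/sigmoid network initialized to zeros. The gradient on W1 comes out identical for every hidden unit (in fact all zeros), so the units can never differentiate:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical demo setup: a tiny 2-4-1 network, zero-initialized
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2, 5))                 # 5 training examples
    Y = (rng.random((1, 5)) > 0.5).astype(float)    # binary labels

    W1, b1 = np.zeros((4, 2)), np.zeros((4, 1))
    W2, b2 = np.zeros((1, 4)), np.zeros((1, 1))

    # Forward pass
    A1 = np.tanh(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)

    # Backward pass for the binary cross-entropy loss
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)              # tanh'(z) = 1 - tanh(z)^2
    dW1 = dZ1 @ X.T / m

    print(dW1)  # every row is identical (here all zeros): no unit can break away

Because A1 and W2 are zero, dW1 and dW2 are exactly zero: only b2 ever receives a nonzero gradient, and the hidden units stay interchangeable forever.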
- Random initialization
  - Succeeds in "breaking symmetry": different hidden units can learn different things
  - Initializing weights to very large random values does not work well
  - Poor initialization can lead to vanishing/exploding gradients, which slows optimization (a quick check follows the code below)
    import numpy as np

    parameters = {}
    for l in range(1, L):
        # The factor of 10 deliberately makes the weights very large to expose
        # the problem; in practice a small factor such as 0.01 is used instead
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
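To see why very large random weights hurt, the snippet below (a check I added; the layer sizes and inputs are made up) feeds unit-variance inputs through a sigmoid layer at two weight scales and measures how many activations land in the flat, near-saturated regions where the gradient is almost zero:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    X = rng.standard_normal((3, 1000))   # 1000 unit-variance inputs

    for scale in (10.0, 0.01):           # large vs. small weight scale
        W = rng.standard_normal((5, 3)) * scale
        A = sigmoid(W @ X)
        # Activations below 0.01 or above 0.99 sit where sigmoid'(z) ~ 0,
        # so gradients backpropagated through them nearly vanish
        saturated = np.mean((A < 0.01) | (A > 0.99))
        print(f"scale={scale:>5}: {saturated:.0%} of activations saturated")

With scale 10 the bulk of activations saturate, which is exactly the vanishing-gradient symptom; with scale 0.01 essentially none do.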
- Custom initialization: He initialization (named for the first author of He et al., 2015)
  - Scales the random weights of layer l by sqrt(2 / layers_dims[l-1]), which suits layers with ReLU activations (a variance check follows the code below)
    import numpy as np

    parameters = {}
    for l in range(1, L):  # same convention as above: L = len(layers_dims)
        # sqrt(2 / fan_in) keeps the variance of ReLU activations roughly
        # constant from layer to layer
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l - 1]) * np.sqrt(2. / layers_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
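The point of the sqrt(2 / fan_in) factor is variance preservation. The experiment below (my own, with made-up layer sizes) pushes unit-variance data through nine ReLU layers and compares He scaling against a flat 0.01 scale:

    import numpy as np

    rng = np.random.default_rng(2)
    layers_dims = [100] * 10             # hypothetical stack: nine 100-unit ReLU layers
    X = rng.standard_normal((100, 500))  # 500 unit-variance input examples

    for name, scale in [("He sqrt(2/fan_in)", None), ("flat 0.01", 0.01)]:
        A = X.copy()
        for l in range(1, len(layers_dims)):
            fan_in = layers_dims[l - 1]
            s = np.sqrt(2.0 / fan_in) if scale is None else scale
            W = rng.standard_normal((layers_dims[l], fan_in)) * s
            A = np.maximum(0, W @ A)     # ReLU
        print(f"{name}: activation std after 9 layers = {A.std():.6f}")

He scaling keeps the activation spread roughly constant through the whole stack, while the flat scale shrinks it toward zero layer by layer, which is the activation-side view of vanishing gradients.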
Summary:
- Different initializations lead to different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Don't initialize to values that are too large
- He initialization works well for networks with ReLU activations.
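As a wrap-up, the three schemes can be folded into one helper. The function below is my own consolidation of the snippets above (the name initialize_parameters and the 0.01 factor for the "random" mode are my choices, not from the original notes):

    import numpy as np

    def initialize_parameters(layers_dims, method="he", seed=3):
        """Build {'W1': ..., 'b1': ..., ...} for the given layer sizes."""
        rng = np.random.default_rng(seed)
        parameters = {}
        for l in range(1, len(layers_dims)):
            n_out, n_in = layers_dims[l], layers_dims[l - 1]
            if method == "zeros":                     # fails to break symmetry
                W = np.zeros((n_out, n_in))
            elif method == "random":                  # small random Gaussian values
                W = rng.standard_normal((n_out, n_in)) * 0.01
            elif method == "he":                      # He et al., 2015, for ReLU
                W = rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
            else:
                raise ValueError(f"unknown method: {method!r}")
            parameters['W' + str(l)] = W
            parameters['b' + str(l)] = np.zeros((n_out, 1))
        return parameters

    params = initialize_parameters([2, 4, 1], method="he")
    print({k: v.shape for k, v in params.items()})  # {'W1': (4, 2), 'b1': (4, 1), ...}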