1. Fixed data matrix dimensions
X = (number of features, number of samples m)
Y = (1, number of samples m)
W[l] = (n[l], n[l-1])
b[l] = (n[l], 1)
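A minimal NumPy sketch of these shape rules, useful as a sanity check before training. The layer sizes, the `tanh` activation, and the helper name `init_params` are illustrative assumptions, not taken from the assignment.

```python
import numpy as np

def init_params(layer_dims, seed=0):
    """Create W[l] of shape (n[l], n[l-1]) and b[l] of shape (n[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

layer_dims = [4, 5, 3, 1]          # n[0] = 4 features, then 5, 3 and 1 units (arbitrary)
m = 7                              # number of samples (arbitrary)
X = np.random.default_rng(1).standard_normal((layer_dims[0], m))
params = init_params(layer_dims)

A = X
for l in range(1, len(layer_dims)):
    Z = params[f"W{l}"] @ A + params[f"b{l}"]   # b[l] broadcasts across the m columns
    assert Z.shape == (layer_dims[l], m)        # every Z[l] is (n[l], m)
    A = np.tanh(Z)

print(A.shape)   # same shape as Y: one row, one column per sample
```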
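2. How to prevent vanishing or exploding gradients
Assignment 2 of 1-4 initializes the deep neural network with the method taught in 2-1: to prevent vanishing or exploding gradients, the weights are scaled down at initialization according to n[l-1], the number of units in the previous layer (dividing by the square root of n[l-1]), so that the resulting z does not change too much. For specific activation functions, researchers have worked out the corresponding best values:
The last one is also called Xavier initialization.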
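As a sketch of how such scale factors are applied, the snippet below assumes the two commonly quoted choices: sqrt(2 / n[l-1]) for ReLU (He initialization) and sqrt(1 / n[l-1]) for tanh (Xavier initialization). The function name `initialize` and the `method` argument are hypothetical.

```python
import numpy as np

def initialize(layer_dims, method="he", seed=0):
    """Draw W[l] from a standard normal and rescale it by a factor based on n[l-1]."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        if method == "he":            # commonly paired with ReLU
            scale = np.sqrt(2.0 / n_prev)
        elif method == "xavier":      # commonly paired with tanh
            scale = np.sqrt(1.0 / n_prev)
        else:
            raise ValueError(f"unknown method: {method}")
        params[f"W{l}"] = rng.standard_normal((n_cur, n_prev)) * scale
        params[f"b{l}"] = np.zeros((n_cur, 1))
    return params

he_params = initialize([1000, 500, 1], method="he")
xavier_params = initialize([1000, 500, 1], method="xavier")
print(he_params["W1"].std(), xavier_params["W1"].std())
```

The printed standard deviations should land near sqrt(2/1000) and sqrt(1/1000) respectively, since the first layer has 1000 inputs.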
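The 2-1 programming exercise tried zero initialization, random initialization, and He initialization (the second one in the figure above), with these results: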
Model | Train accuracy | Problem/Comment |
---|---|---|
3-layer NN with zeros initialization | 50% | fails to break symmetry |
3-layer NN with large random initialization | 83% | too large weights |
3-layer NN with He initialization | 99% | recommended method |
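He initialization is recommended.
To support this conclusion, however, the exercise multiplies the randomly initialized W by 10. Without that factor of 10, random initialization starts from a higher cost but still converges quickly and reaches high accuracy. He initialization is still the recommended choice, because it effectively guards against vanishing and exploding gradients.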
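To make the scale argument concrete, here is a small experiment that pushes random inputs through a stack of ReLU layers and reports the mean absolute activation at the end. The width, depth, and function name are arbitrary illustration values, not the assignment's 3-layer network, so the numbers only indicate the trend.

```python
import numpy as np

def final_activation_scale(weight_scale, width=100, depth=10, m=200, seed=0):
    """Forward random data through `depth` ReLU layers; return mean |activation|."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((width, m))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_scale
        A = np.maximum(0, W @ A)      # ReLU, biases left at zero
    return float(np.mean(np.abs(A)))

width = 100
he_scale = final_activation_scale(np.sqrt(2.0 / width))   # He initialization
plain_scale = final_activation_scale(1.0)                 # plain standard normal
times10_scale = final_activation_scale(10.0)              # standard normal times 10

print(he_scale, plain_scale, times10_scale)
```

In this wider and deeper setting the He-scaled activations stay at a moderate size, while the unscaled and times-10 weights make them grow by orders of magnitude, which is the behaviour that leads to exploding gradients.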
He initialization:
Random initialization:
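3. Regularization and dropout
With L2 regularization, the final parameters come out smaller than without it. Smaller weights are taken to mean a simpler model, which helps prevent overfitting.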
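A minimal sketch of why the weights shrink, assuming the usual formulation where the cost gains a penalty (λ / 2m) · Σ‖W‖² and each gradient gains (λ / m) · W. The function names and the numbers are illustrative only.

```python
import numpy as np

def l2_cost(base_cost, weights, lambd, m):
    """Add the L2 penalty (lambd / (2m)) * sum of squared weights to the base cost."""
    penalty = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return base_cost + penalty

def l2_grad(dW_unreg, W, lambd, m):
    """Add the derivative of the penalty, (lambd / m) * W, to the unregularized gradient."""
    return dW_unreg + (lambd / m) * W

W = np.array([[1.0, 2.0], [3.0, 4.0]])
lambd, m, learning_rate = 0.1, 5, 0.5

cost = l2_cost(0.5, [W], lambd, m)            # 0.5 + (0.1 / 10) * 30
dW = l2_grad(np.zeros_like(W), W, lambd, m)   # data gradient set to zero to isolate the penalty
W_next = W - learning_rate * dW               # equals W * (1 - learning_rate * lambd / m)
```

With the data gradient zeroed out, each step multiplies W by a factor slightly below 1, which is why L2 regularization is also called weight decay.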
Notes on dropout:
1. Dropout is a regularization technique.
2. You only use dropout during training. Don’t use dropout (randomly eliminate nodes) during test time.
3. Apply dropout both during forward and backward propagation.
4. During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob takes values other than 0.5. The purpose of this step is to keep the expected activation of each layer unchanged.
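The four notes above can be sketched as inverted dropout in NumPy. The function names and the all-ones test activations are assumptions made for illustration; the same mask `D` is reused in the backward pass, and nothing is dropped at test time.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Zero each unit with probability 1 - keep_prob, then rescale by 1 / keep_prob."""
    D = rng.random(A.shape) < keep_prob   # boolean mask of units that stay on
    A_drop = A * D / keep_prob            # rescaling keeps the expected activation unchanged
    return A_drop, D

def dropout_backward(dA, D, keep_prob):
    """Apply the same mask and the same scaling to the incoming gradient."""
    return dA * D / keep_prob

rng = np.random.default_rng(0)
A = np.ones((50, 2000))                   # stand-in activations with mean exactly 1

A_half, D_half = dropout_forward(A, 0.5, rng)
A_most, D_most = dropout_forward(A, 0.8, rng)
dA_half = dropout_backward(np.ones_like(A), D_half, 0.5)

print(A.mean(), A_half.mean(), A_most.mean())
```

All three printed means should be close to 1, for keep_prob 0.5 and 0.8 alike, which is the expected-value property described in note 4.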
model | train accuracy | test accuracy |
---|---|---|
3-layer NN without regularization | 95% | 91.5% |
3-layer NN with L2-regularization | 94% | 93% |
3-layer NN with dropout | 93% | 95% |
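As the table shows, regularization lowers training accuracy but raises test accuracy. This is because regularization simplifies the model. Since test accuracy is what we care about most, performance improves after regularization.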
4. Optimization algorithms
1. Gradient descent with momentum
How do you choose β?
The larger the momentum β is, the smoother the update because the more we take the past gradients into account. But if β is too big, it could also smooth out the updates too much.
Common values for β range from 0.8 to 0.999. If you don’t feel inclined to tune this, β=0.9 is often a reasonable default.
Tuning the optimal β for your model might need trying several values to see what works best in terms of reducing the value of the cost function J.
The momentum update formula is actually:
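A minimal sketch, assuming the commonly taught form v := β·v + (1 − β)·dW followed by W := W − α·v. The function name `momentum_step`, the toy cost J(W) = 0.5·W², and the hyperparameter values are illustrative, not taken from the exercise.

```python
import numpy as np

def momentum_step(W, dW, v, beta=0.9, learning_rate=0.1):
    """One update: v is an exponentially weighted average of past gradients."""
    v = beta * v + (1 - beta) * dW
    W = W - learning_rate * v
    return W, v

# Toy problem: J(W) = 0.5 * W**2, so the gradient dW equals W itself.
W = np.array([5.0])
v = np.zeros_like(W)

W_first, v_first = momentum_step(W, W, v)   # first step, starting from v = 0

W_run, v_run = W.copy(), v.copy()
for _ in range(500):
    W_run, v_run = momentum_step(W_run, W_run, v_run)

print(W_first, v_first, W_run)
```

A larger β averages over more past gradients, so each step reacts more slowly to the current gradient; on this toy cost the iterate still settles toward the minimum at 0.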