2020-8-5 Andrew Ng - Improving Deep Neural Networks - Week 2 Optimization Algorithms (Quiz)

Reference link

1. Which notation would you use to denote the 3rd layer's activations when the input is the 7th example from the 8th mini-batch?

  • a^{[8]{3}(7)}
  • a^{[8]{7}(3)}
  • a^{[3]{8}(7)} (Correct)
  • a^{[3]{7}(8)}

[i]{j}(k) superscript means i-th layer, j-th minibatch, k-th example

See the linked notes for the mini-batch notation definitions.
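To make the notation concrete, here is a minimal sketch, assuming examples are stored as columns of X and mini-batches are consecutive column slices of size 64 (both assumptions for illustration only), of how the {t} index maps to array slicing:

```python
import numpy as np

# Minimal sketch: examples are columns of X; X^{t} is the t-th consecutive slice.
np.random.seed(0)
n_x, m, batch_size = 4, 1024, 64
X = np.random.randn(n_x, m)

def minibatch(X, t, batch_size=64):
    """Return X^{t}, the t-th mini-batch (1-indexed, as in the lecture notation)."""
    return X[:, (t - 1) * batch_size : t * batch_size]

X_8 = minibatch(X, t=8)   # X^{8}, the 8th mini-batch
# a^{[3]{8}(7)} would then be the activations of layer 3 computed by forward
# propagation on X_8, taken at column index 6 (the 7th example of that mini-batch).
```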

===============================================================

2. Which of these statements about mini-batch gradient descent do you agree with?

  • You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
  • Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
  • One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent. (Correct)

See the linked notes: with batch gradient descent, one pass through the training set lets you take only a single gradient descent step, whereas with mini-batch gradient descent (e.g. 5000 mini-batches), one pass through the training set lets you take 5000 gradient descent steps. Mini-batch gradient descent runs faster than batch gradient descent, so almost everyone studying deep learning uses it when training on huge datasets.

Vectorization is not for computing several mini-batches at the same time.
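A minimal sketch of this point, assuming a toy linear model with squared-error cost and examples stored as columns of X (the model and hyperparameters are illustrative assumptions): the loop over mini-batches stays an explicit for-loop, and only the computation inside each mini-batch is vectorized.

```python
import numpy as np

# One epoch of mini-batch gradient descent on a toy linear regression problem.
np.random.seed(1)
n_x, m, batch_size, lr = 3, 640, 64, 0.1
X = np.random.randn(n_x, m)
y = np.array([[1.0, -2.0, 0.5]]) @ X + 0.01 * np.random.randn(1, m)
w, b = np.zeros((1, n_x)), 0.0

# Explicit for-loop over mini-batches: each iteration uses only one mini-batch,
# but the computation *within* that mini-batch is fully vectorized.
for t in range(m // batch_size):
    Xt = X[:, t * batch_size : (t + 1) * batch_size]
    yt = y[:, t * batch_size : (t + 1) * batch_size]
    y_hat = w @ Xt + b                 # vectorized forward pass on this mini-batch
    dz = (y_hat - yt) / batch_size
    w -= lr * (dz @ Xt.T)              # one gradient step per mini-batch
    b -= lr * dz.sum()

# One pass (epoch) therefore performs m / batch_size = 10 gradient steps,
# versus a single step for batch gradient descent.
```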

================================================================

3. Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

  • If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch. (Correct)
  • If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
  • If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
  • If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress. (Correct)

If the mini-batch size is m, it is equivalent to not splitting the training set at all. If the size is 1, the parameters are updated after every single example, which loses the advantage of vectorization and makes the parameter updates noisier and more random, so training is less efficient. A partitioning sketch follows.
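A minimal sketch of the usual compromise, assuming examples are stored as columns (the function name and size of 64 are illustrative choices, not anything fixed by the quiz): shuffle the training set once, then partition it into mini-batches of a size somewhere between 1 and m.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m columns of (X, Y) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    mini_batches = []
    for k in range(0, m, batch_size):   # the last mini-batch may be smaller than batch_size
        mini_batches.append((X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size]))
    return mini_batches
```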

=================================================================

4. Suppose your learning algorithm's cost J, plotted as a function of the number of iterations, looks like this:
(Figure: cost J plotted against iteration number; the curve oscillates from iteration to iteration but trends downward overall.)

Which of the following is correct?

  • Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
  • Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
  • If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
  • If you're using mini-batch gradient descent, this looks acceptable. But if you're using batch gradient descent, something is wrong. (Correct)

See the linked notes: the cost curve J for mini-batch gradient descent heads downward but with more noise. It does not matter that J fails to decrease on every single iteration; what matters is that the overall trend is downward.

There will be some oscillations when you're using mini-batch gradient descent, since some mini-batches may contain noisy examples. Batch gradient descent, however, always guarantees a lower J on each iteration until it reaches the optimum.
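The oscillation can be seen directly: at a fixed parameter value, the cost measured on individual mini-batches scatters around the full-batch cost, so a learning curve plotted per mini-batch will jitter even when training is healthy. The toy data and squared-error cost below are assumptions for illustration only.

```python
import numpy as np

# Compare the full-batch cost with the per-mini-batch costs at one fixed parameter w.
np.random.seed(3)
m, batch_size = 640, 64
x = np.random.randn(1, m)
y = 2.0 * x + np.random.randn(1, m)     # noisy labels
w = 1.5                                  # some fixed (not yet optimal) parameter

full_cost = np.mean((w * x - y) ** 2) / 2
batch_costs = [np.mean((w * x[:, k:k + batch_size] - y[:, k:k + batch_size]) ** 2) / 2
               for k in range(0, m, batch_size)]
print(full_cost, min(batch_costs), max(batch_costs))   # mini-batch costs jitter around full_cost
```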

==================================================================

5. Suppose the temperature in Casablanca over the first three days of January is the same:

Jan 1st: θ_1 = 10

Jan 2nd: θ_2 = 10

Say you use an exponentially weighted average with β = 0.5 to track the temperature: v_0 = 0, v_t = β v_{t-1} + (1 - β) θ_t.
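A minimal sketch of this recursion for the two days given; the bias-corrected value v_t / (1 - β^t) is included as an assumption, since the remainder of the question is cut off above.

```python
# Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t.
beta = 0.5
thetas = [10, 10]          # Jan 1st and Jan 2nd temperatures
v = 0.0                    # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)    # bias correction (assumed to be the quantity asked for)
    print(f"day {t}: v = {v}, bias-corrected v = {v_corrected}")
# day 1: v = 5.0, bias-corrected v = 10.0
# day 2: v = 7.5, bias-corrected v = 10.0
```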
