2020-8-5 Andrew Ng - Improving Deep Neural Networks - Week 2 Optimization Algorithms (Quiz)

Reference link

1. Which notation would you use to denote the 3rd layer's activations when the input is the 7th example from the 8th mini-batch?

  • a^{[8]{3}(7)}
  • a^{[8]{7}(3)}
  • a^{[3]{8}(7)} (Correct)
  • a^{[3]{7}(8)}

[i]{j}(k) superscript means i-th layer, j-th minibatch, k-th example

See the linked notes for the mini-batch notation definitions.
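To make the notation concrete, here is a minimal sketch, assuming examples are stored as columns of X and mini-batches are consecutive column slices of size 64 (both assumptions for illustration only), of how the {t} index maps to array slicing:

```python
import numpy as np

# Minimal sketch: examples are columns of X; X^{t} is the t-th consecutive slice.
np.random.seed(0)
n_x, m, batch_size = 4, 1024, 64
X = np.random.randn(n_x, m)

def minibatch(X, t, batch_size=64):
    """Return X^{t}, the t-th mini-batch (1-indexed, as in the lecture notation)."""
    return X[:, (t - 1) * batch_size : t * batch_size]

X_8 = minibatch(X, t=8)   # X^{8}, the 8th mini-batch
# a^{[3]{8}(7)} would then be the activations of layer 3 computed by forward
# propagation on X_8, taken at column index 6 (the 7th example of that mini-batch).
```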

===============================================================

2. Which of these statements about mini-batch gradient descent do you agree with?

  • You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
  • Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
  • One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent. (Correct)

See the linked notes: with batch gradient descent, one pass through the training set lets you take only a single gradient descent step, whereas with mini-batch gradient descent (e.g. 5000 mini-batches), one pass through the training set lets you take 5000 gradient descent steps. Mini-batch gradient descent runs faster than batch gradient descent, so almost everyone studying deep learning uses it when training on huge datasets.

Vectorization is not for computing several mini-batches at the same time.
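A minimal sketch of this point, assuming a toy linear model with squared-error cost and examples stored as columns of X (the model and hyperparameters are illustrative assumptions): the loop over mini-batches stays an explicit for-loop, and only the computation inside each mini-batch is vectorized.

```python
import numpy as np

# One epoch of mini-batch gradient descent on a toy linear regression problem.
np.random.seed(1)
n_x, m, batch_size, lr = 3, 640, 64, 0.1
X = np.random.randn(n_x, m)
y = np.array([[1.0, -2.0, 0.5]]) @ X + 0.01 * np.random.randn(1, m)
w, b = np.zeros((1, n_x)), 0.0

# Explicit for-loop over mini-batches: each iteration uses only one mini-batch,
# but the computation *within* that mini-batch is fully vectorized.
for t in range(m // batch_size):
    Xt = X[:, t * batch_size : (t + 1) * batch_size]
    yt = y[:, t * batch_size : (t + 1) * batch_size]
    y_hat = w @ Xt + b                 # vectorized forward pass on this mini-batch
    dz = (y_hat - yt) / batch_size
    w -= lr * (dz @ Xt.T)              # one gradient step per mini-batch
    b -= lr * dz.sum()

# One pass (epoch) therefore performs m / batch_size = 10 gradient steps,
# versus a single step for batch gradient descent.
```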

================================================================

3. Why is the best mini-batch size usually not 1 and not m, but instead something in-between?

  • If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch. (Correct)
  • If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
  • If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
  • If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress. (Correct)

If the mini-batch size is m, it is equivalent to not splitting the training set at all. If the size is 1, the parameters are updated after every single example, which loses the advantage of vectorization and makes the parameter updates noisier and more random, so training is less efficient. A partitioning sketch follows.
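A minimal sketch of the usual compromise, assuming examples are stored as columns (the function name and size of 64 are illustrative choices, not anything fixed by the quiz): shuffle the training set once, then partition it into mini-batches of a size somewhere between 1 and m.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """Shuffle the m columns of (X, Y) and split them into mini-batches."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    mini_batches = []
    for k in range(0, m, batch_size):   # the last mini-batch may be smaller than batch_size
        mini_batches.append((X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size]))
    return mini_batches
```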

=================================================================

4. Suppose your learning algorithm's cost J, plotted as a function of the number of iterations, looks like this:
(Figure: cost J plotted against iteration number; the curve oscillates from iteration to iteration but trends downward overall.)

Which of the following is correct?

  • Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
  • Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
  • If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
  • If you're using mini-batch gradient descent, this looks acceptable. But if you're using batch gradient descent, something is wrong. (Correct)

See the linked notes: the cost curve J for mini-batch gradient descent heads downward but with more noise. It does not matter that J fails to decrease on every single iteration; what matters is that the overall trend is downward.

There will be some oscillations when you're using mini-batch gradient descent, since some mini-batches may contain noisy examples. Batch gradient descent, however, always guarantees a lower J on each iteration until it reaches the optimum.
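The oscillation can be seen directly: at a fixed parameter value, the cost measured on individual mini-batches scatters around the full-batch cost, so a learning curve plotted per mini-batch will jitter even when training is healthy. The toy data and squared-error cost below are assumptions for illustration only.

```python
import numpy as np

# Compare the full-batch cost with the per-mini-batch costs at one fixed parameter w.
np.random.seed(3)
m, batch_size = 640, 64
x = np.random.randn(1, m)
y = 2.0 * x + np.random.randn(1, m)     # noisy labels
w = 1.5                                  # some fixed (not yet optimal) parameter

full_cost = np.mean((w * x - y) ** 2) / 2
batch_costs = [np.mean((w * x[:, k:k + batch_size] - y[:, k:k + batch_size]) ** 2) / 2
               for k in range(0, m, batch_size)]
print(full_cost, min(batch_costs), max(batch_costs))   # mini-batch costs jitter around full_cost
```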

==================================================================

5. Suppose the temperature in Casablanca over the first three days of January is the same:

Jan 1st: θ_1 = 10

Jan 2nd: θ_2 = 10

Say you use an exponentially weighted average with β = 0.5 to track the temperature: v_0 = 0, v_t = β v_{t-1} + (1 - β) θ_t.
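A minimal sketch of this recursion for the two days given; the bias-corrected value v_t / (1 - β^t) is included as an assumption, since the remainder of the question is cut off above.

```python
# Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t.
beta = 0.5
thetas = [10, 10]          # Jan 1st and Jan 2nd temperatures
v = 0.0                    # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)    # bias correction (assumed to be the quantity asked for)
    print(f"day {t}: v = {v}, bias-corrected v = {v_corrected}")
# day 1: v = 5.0, bias-corrected v = 10.0
# day 2: v = 7.5, bias-corrected v = 10.0
```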
