[Andrew Ng Deep Learning] 02_week2_quiz: Optimization Algorithms

(1)Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th mini-batch?
[A] $a^{[3]\{7\}(8)}$
[B] $a^{[8]\{7\}(3)}$
[C] $a^{[8]\{3\}(7)}$
[D] $a^{[3]\{8\}(7)}$

Answer: D
Explanation: the superscript in square brackets [ ] denotes the layer, the superscript in curly braces { } denotes the mini-batch, and the superscript in parentheses ( ) denotes the training example.

(2)Which of these statements about mini-batch gradient descent do you agree with?
[A]One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
[B] Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
[C] You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).

Answer: A
Explanation:
[B] When the mini-batch size equals the size of the training set, mini-batch gradient descent and batch gradient descent take the same time per epoch, so B is wrong.
[C] Mini-batch gradient descent requires an explicit for-loop over the different mini-batches, so C is wrong (see the sketch below).
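
As a concrete picture of that for-loop, here is a minimal sketch of one epoch of mini-batch gradient descent; the parameter vector `w` and the gradient function `grad` are hypothetical placeholders, not code from the course.

```python
import numpy as np

def run_one_epoch(X, Y, w, grad, alpha=0.01, batch_size=64):
    """One epoch of mini-batch gradient descent. X has shape (n_features, m)."""
    m = X.shape[1]
    perm = np.random.permutation(m)             # shuffle once per epoch
    X, Y = X[:, perm], Y[:, perm]
    for start in range(0, m, batch_size):       # explicit loop over mini-batches
        X_batch = X[:, start:start + batch_size]
        Y_batch = Y[:, start:start + batch_size]
        w = w - alpha * grad(w, X_batch, Y_batch)  # cheap update after each mini-batch
    return w
```

Each pass through the loop makes one fast parameter update, which is why a single mini-batch iteration is faster than a full batch iteration (option A).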


(3)Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
[A]If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
[B]If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
[C]If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
[D]If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.

Answer: C, D
Explanation: If the mini-batch size is 1, you lose the benefit of vectorization across the examples in a mini-batch, since you process only one example at a time. If the mini-batch size is m, mini-batch gradient descent is just batch gradient descent, which must process the entire training set before making any progress. A rough timing comparison of the vectorization point is sketched below.
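
The following toy comparison (synthetic shapes, illustrative only) contrasts one vectorized forward computation over a mini-batch with a Python loop over single examples:

```python
import time
import numpy as np

W = np.random.randn(128, 1000)
X = np.random.randn(1000, 512)              # a mini-batch of 512 examples

t0 = time.time()
Z_vec = W @ X                               # vectorized across all examples at once
t_vec = time.time() - t0

t0 = time.time()
Z_loop = np.empty((128, 512))
for i in range(512):                        # mini-batch size 1: no vectorization across examples
    Z_loop[:, i] = W @ X[:, i]
t_loop = time.time() - t0

print(f"vectorized: {t_vec:.4f}s   per-example loop: {t_loop:.4f}s")
```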

(4)Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
[Figure: cost J plotted against the number of iterations; the curve oscillates rather than decreasing monotonically]
Which of the following do you agree with?
[A] Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
[B] Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
[C] If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
[D] If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.

Answer: D
Explanation: With batch gradient descent (and a reasonable learning rate), the cost should decrease monotonically on every iteration. With mini-batch gradient descent it need not, because each mini-batch is only a noisy sample of the training set, so some oscillation in the cost is expected.

(5)Suppose the temperature in Casablanca over the first two days of January are the same:
Jan 1st: $\theta_1 = 10$ ℃
Jan 2nd: $\theta_2 = 10$ ℃
(We used Fahrenheit in lecture, so we will use Celsius here in honor of the metric world.)
Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value you compute with bias correction, what are these values?
[A] $v_2 = 7.5$, $v_2^{corrected} = 10$
[B] $v_2 = 10$, $v_2^{corrected} = 10$
[C] $v_2 = 7.5$, $v_2^{corrected} = 7.5$
[D] $v_2 = 10$, $v_2^{corrected} = 7.5$

Answer: A
Explanation:
$v_1 = \beta v_0 + (1-\beta)\theta_1 = 0.5 \times 0 + (1-0.5) \times 10 = 5$
$v_2 = \beta v_1 + (1-\beta)\theta_2 = 0.5 \times 5 + (1-0.5) \times 10 = 7.5$
$v_1^{corrected} = \frac{v_1}{1-\beta^1} = \frac{5}{1-0.5^1} = 10$
$v_2^{corrected} = \frac{v_2}{1-\beta^2} = \frac{7.5}{1-0.5^2} = 10$
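
A few lines of Python reproduce these numbers (a minimal check, nothing course-specific):

```python
# Verify the exponentially weighted average and its bias correction above.
beta = 0.5
temps = [10.0, 10.0]                # theta_1, theta_2

v = 0.0                             # v_0 = 0
for t, theta in enumerate(temps, start=1):
    v = beta * v + (1 - beta) * theta
    v_corrected = v / (1 - beta ** t)
    print(f"day {t}: v = {v}, v_corrected = {v_corrected}")
# day 1: v = 5.0, v_corrected = 10.0
# day 2: v = 7.5, v_corrected = 10.0
```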


(6)Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
[A] $\alpha = \frac{1}{1+2t}\alpha_0$
[B] $\alpha = e^t \alpha_0$
[C] $\alpha = 0.95^t \alpha_0$
[D] $\alpha = \frac{1}{\sqrt{t}}\alpha_0$

Answer: B
Explanation: $\alpha = e^t \alpha_0$ grows monotonically with the epoch number t, so the learning rate would explode instead of decaying.
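
For intuition, this short snippet (with an arbitrary $\alpha_0 = 0.2$, purely illustrative) prints the first few epochs of the three valid schemes:

```python
import numpy as np

alpha0 = 0.2                                    # arbitrary initial learning rate
for t in range(1, 6):
    a_inverse = alpha0 / (1 + 2 * t)            # option A
    a_exp_dec = alpha0 * 0.95 ** t              # option C
    a_sqrt    = alpha0 / np.sqrt(t)             # option D
    print(f"t={t}: A={a_inverse:.4f}  C={a_exp_dec:.4f}  D={a_sqrt:.4f}")
# All three shrink alpha as t grows; alpha = e^t * alpha0 (option B) would grow without bound.
```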

(7)You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature: $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What would happen to your red curve as you vary $\beta$? (Check the two that apply)
[Figure: London daily temperatures with the exponentially weighted average ($\beta = 0.9$) shown as a red line]

[A] Decreasing $\beta$ will shift the red line slightly to the right.
[B] Increasing $\beta$ will shift the red line slightly to the right.
[C] Decreasing $\beta$ will create more oscillation within the red line.
[D] Increasing $\beta$ will create more oscillation within the red line.

Answer: B, C
Explanation: A larger $\beta$ averages over more past days, so the curve becomes smoother but adapts more slowly and shifts to the right; a smaller $\beta$ tracks the latest temperatures more closely and therefore oscillates more. In the lecture figure, the green line corresponds to $\beta = 0.98$ and the yellow line to $\beta = 0.5$.
[Figure: the same temperature data with exponentially weighted averages for $\beta = 0.98$ (green) and $\beta = 0.5$ (yellow)]
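
The effect is easy to reproduce on synthetic data (the series below is made up, just for illustration):

```python
import numpy as np

def ewa(series, beta):
    """Exponentially weighted average v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0."""
    v, out = 0.0, []
    for theta in series:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(0)
theta = 15 + 10 * np.sin(np.linspace(0, 3, 120)) + rng.normal(0, 2, size=120)

smooth_high = ewa(theta, beta=0.98)   # smoother, but lags behind the data (shifted right)
smooth_low  = ewa(theta, beta=0.5)    # follows the data closely, oscillates more
```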


(8)Consider this figure:
[Figure: contour plot of a cost function with three gradient-descent trajectories labeled (1), (2), and (3), showing different amounts of oscillation]
These plots were generated with gradient descent; with gradient descent with momentum ($\beta = 0.5$); and with gradient descent with momentum ($\beta = 0.9$). Which curve corresponds to which algorithm?
[A] (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent.
[B] (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent, (3) is gradient descent with momentum (large $\beta$).
[C] (1) is gradient descent, (2) is gradient descent with momentum (large $\beta$), (3) is gradient descent with momentum (small $\beta$).
[D] (1) is gradient descent, (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent with momentum (large $\beta$).

Answer: D
Explanation:
$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$
$W = W - \alpha v_{dW}$
The larger the momentum $\beta$ is, the more the update depends on past gradients, so the trajectory is smoother and oscillates less.
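
As a reference, here is a minimal sketch of one momentum update; the gradient function `grad` is a hypothetical placeholder:

```python
def momentum_step(W, v_dW, grad, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum for a single parameter array W."""
    dW = grad(W)                            # gradient on the current (mini-)batch
    v_dW = beta * v_dW + (1 - beta) * dW    # exponentially weighted average of gradients
    W = W - alpha * v_dW                    # update with the smoothed gradient
    return W, v_dW
# With beta = 0 this reduces to plain gradient descent; a larger beta damps oscillations.
```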


(9)Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $J(W^{[1]}, b^{[1]}, \ldots, W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
[A]Try initializing all the weights to zero.
[B]Try better random initialization for the weights.
[C]Try using Adam.
[D]Try mini-batch gradient descent.
[E] Try tuning the learning rate $\alpha$.

Answer: B, C, D, E
Explanation: Initializing all the weights to zero keeps the hidden units symmetric, so the network cannot learn and A does not help; the remaining options can all speed up or improve the optimization.

(10)Which of the following statements about Adam is False?
[A] The learning rate hyperparameter $\alpha$ in Adam usually needs to be tuned.
[B]Adam combines the advantages of RMSProp and momentum.
[C]Adam should be used with batch gradient computations, not with mini-batches.
[D] We usually use "default" values for the hyperparameters $\beta_1$, $\beta_2$ and $\epsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).

Answer: C
Explanation: Adam is typically used with mini-batch gradient descent; it is not restricted to full-batch computations, so C is false.
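
To show how Adam combines momentum and RMSProp with the default hyperparameters above, here is a minimal sketch of one update (parameter names and the gradient `dW` are illustrative, not the course's implementation):

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array W at time step t (t >= 1)."""
    v = beta1 * v + (1 - beta1) * dW          # momentum-style moving average of gradients
    s = beta2 * s + (1 - beta2) * dW ** 2     # RMSProp-style moving average of squared gradients
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```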
