(1)Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th mini-batch?
[A] $a^{[3]\{7\}(8)}$
[B] $a^{[8]\{7\}(3)}$
[C] $a^{[8]\{3\}(7)}$
[D] $a^{[3]\{8\}(7)}$
Answer: D
Explanation: The square brackets [] denote the layer, the curly braces {} denote the mini-batch, and the parentheses () denote the training example.
(2)Which of these statements about mini-batch gradient descent do you agree with?
[A]One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
[B] Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
[C] You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
Answer: A
Explanation:
[B] One epoch of mini-batch gradient descent processes the same training data as one epoch of batch gradient descent (in the limiting case where the mini-batch size equals the dataset size, the two are identical), so it is not faster per epoch. Hence B is wrong.
[C] Mini-batch gradient descent still requires an explicit for-loop over the mini-batches; it does not process them all at once. Hence C is wrong.
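For reference, a minimal NumPy sketch (not from the quiz or course code) of this loop structure, using plain logistic regression for brevity: the computation inside each mini-batch is vectorized across its examples, but an explicit for-loop still walks over the mini-batches one at a time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_epoch(X, Y, w, b, mini_batch_size=64, learning_rate=0.1):
    """X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}, w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    permutation = np.random.permutation(m)             # shuffle before partitioning
    X, Y = X[:, permutation], Y[:, permutation]

    for k in range(0, m, mini_batch_size):              # explicit for-loop over mini-batches
        Xb, Yb = X[:, k:k + mini_batch_size], Y[:, k:k + mini_batch_size]
        mb = Xb.shape[1]
        A = sigmoid(w.T @ Xb + b)                       # forward pass, vectorized over the batch
        dZ = A - Yb                                     # backward pass for logistic regression
        dw = (Xb @ dZ.T) / mb
        db = np.sum(dZ) / mb
        w, b = w - learning_rate * dw, b - learning_rate * db
    return w, b

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 1000))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)
w, b = np.zeros((5, 1)), 0.0
for epoch in range(10):
    w, b = train_one_epoch(X, Y, w, b)
```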
(3)Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
[A]If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
[B]If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
[C]If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
[D]If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
Answer: C, D
Explanation: If the mini-batch size is 1 (stochastic gradient descent), you lose the speed-up from vectorizing across the examples in a mini-batch. If the mini-batch size is m, mini-batch gradient descent reduces to batch gradient descent, which must process the entire training set before making any progress.
(4)Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
Which of the following do you agree with?
[A]Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
[B]Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
[C]If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
[D]If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
Answer: D
Explanation: With batch gradient descent, the cost must decrease monotonically on every iteration. With mini-batch gradient descent it need not, because individual mini-batches contain noisy examples, so the curve can oscillate somewhat.
(5)Suppose the temperature in Casablanca over the first two days of January is the same:
Jan 1st: $\theta_1 = 10$℃
Jan 2nd: $\theta_2 = 10$℃
(We used Fahrenheit in lecture, so we will use Celsius here in honor of the metric world.)
Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value computed with bias correction, what are these values?
[A] $v_2 = 7.5$, $v_2^{corrected} = 10$
[B] $v_2 = 10$, $v_2^{corrected} = 10$
[C] $v_2 = 7.5$, $v_2^{corrected} = 7.5$
[D] $v_2 = 10$, $v_2^{corrected} = 7.5$
Answer: A
Explanation:
$v_1 = \beta v_0 + (1-\beta)\theta_1 = 0.5 \times 0 + (1-0.5) \times 10 = 5$
$v_2 = \beta v_1 + (1-\beta)\theta_2 = 0.5 \times 5 + (1-0.5) \times 10 = 7.5$
$v_1^{corrected} = \frac{v_1}{1-\beta^1} = \frac{5}{1-0.5^1} = 10$
$v_2^{corrected} = \frac{v_2}{1-\beta^2} = \frac{7.5}{1-0.5^2} = 10$
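A short Python sketch of this calculation (an illustrative addition, not part of the original solution), computing the exponentially weighted average with and without bias correction:

```python
beta = 0.5
thetas = [10.0, 10.0]                 # temperatures on Jan 1st and Jan 2nd (Celsius)

v = 0.0                               # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta           # v_t without bias correction
    v_corrected = v / (1 - beta ** t)           # bias-corrected estimate
    print(f"day {t}: v = {v}, v_corrected = {v_corrected}")

# day 1: v = 5.0, v_corrected = 10.0
# day 2: v = 7.5, v_corrected = 10.0
```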
(6)Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
[A] $\alpha = \frac{1}{1 + 2t}\alpha_0$
[B] $\alpha = e^t \alpha_0$
[C] $\alpha = 0.95^t \alpha_0$
[D] $\alpha = \frac{1}{\sqrt{t}}\alpha_0$
Answer: B
Explanation: $\alpha = e^t \alpha_0$ increases monotonically with the epoch number t, so the learning rate grows instead of decaying; the other three schemes all shrink $\alpha$ over time.
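A small NumPy sketch (illustrative only; alpha0 = 0.2 and the epoch range are arbitrary) comparing how the four schedules behave as the epoch number t grows:

```python
import numpy as np

alpha0 = 0.2
t = np.arange(1, 11)                                  # epoch number

schedules = {
    "A: alpha0 / (1 + 2t)": alpha0 / (1 + 2 * t),
    "B: alpha0 * e^t":      alpha0 * np.exp(t),       # grows explosively -> not a decay scheme
    "C: alpha0 * 0.95^t":   alpha0 * 0.95 ** t,
    "D: alpha0 / sqrt(t)":  alpha0 / np.sqrt(t),
}

for name, alphas in schedules.items():
    print(f"{name}: {np.round(alphas[:4], 3)}")
```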
(7)You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature:
$v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What would happen to your red curve as you vary $\beta$? (Check the two that apply)
[A] Decreasing $\beta$ will shift the red line slightly to the right.
[B] Increasing $\beta$ will shift the red line slightly to the right.
[C] Decreasing $\beta$ will create more oscillation within the red line.
[D] Increasing $\beta$ will create more oscillation within the red line.
Answer: B, C
Explanation: As shown in the lecture figure, the green line corresponds to $\beta = 0.98$ and the yellow line to $\beta = 0.5$. A larger $\beta$ averages over more past days, so the curve is smoother but adapts more slowly and shifts slightly to the right; a smaller $\beta$ tracks the latest temperatures more closely and therefore oscillates more.
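A small sketch of this effect (illustrative only; the synthetic temperature series is made up): a larger $\beta$ gives a smoother but more delayed curve, a smaller $\beta$ a noisier one that tracks the raw data closely.

```python
import numpy as np

def ewa(thetas, beta):
    """Exponentially weighted average of a sequence, without bias correction."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(0)
days = np.arange(365)
temps = 10 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, 365)  # noisy yearly cycle

smooth = ewa(temps, beta=0.98)   # very smooth, lags behind (shifted to the right)
medium = ewa(temps, beta=0.9)    # the "red line" from the question
noisy  = ewa(temps, beta=0.5)    # oscillates, follows the raw temperatures closely
```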
(8)Consider this figure:
These plots were generated with gradient descent; with gradient descent with momentum ($\beta = 0.5$); and gradient descent with momentum ($\beta = 0.9$). Which curve corresponds to which algorithm?
[A] (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent.
[B] (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent, (3) is gradient descent with momentum (large $\beta$).
[C] (1) is gradient descent, (2) is gradient descent with momentum (large $\beta$), (3) is gradient descent with momentum (small $\beta$).
[D] (1) is gradient descent, (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent with momentum (large $\beta$).
Answer: D
Explanation:
$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$
$W = W - \alpha v_{dW}$
The larger $\beta$ is, the more the update depends on the previous state, and the smaller the oscillation. The larger the momentum $\beta$ is, the smoother the update, because we take more of the past gradients into account.
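A minimal NumPy sketch of this momentum update (an illustrative addition; the random dW values stand in for mini-batch gradients and are not from the course):

```python
import numpy as np

def momentum_update(W, dW, v_dW, beta=0.9, alpha=0.01):
    v_dW = beta * v_dW + (1 - beta) * dW    # exponentially weighted average of the gradients
    W = W - alpha * v_dW                    # the step uses the smoothed gradient
    return W, v_dW

W = np.zeros((3, 2))
v_dW = np.zeros_like(W)                     # v is initialized to zero
for step in range(100):
    dW = np.random.randn(3, 2)              # stand-in for a noisy mini-batch gradient
    W, v_dW = momentum_update(W, dW, v_dW)  # larger beta -> smoother, less oscillatory steps
```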
(9)Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
[A]Try initializing all the weights to zero.
[B]Try better random initialization for the weights.
[C]Try using Adam.
[D]Try mini-batch gradient descent.
[E]Try tuning the learning rate $\alpha$.
Answer: B, C, D, E
Explanation: Initializing all the weights to zero keeps the hidden units symmetric, so gradient descent cannot make progress; better random initialization, Adam, mini-batch gradient descent, and tuning $\alpha$ can all speed up optimization.
(10)Which of the following statements about Adam is False?
[A]The learning rate hyperparameter $\alpha$ in Adam usually needs to be tuned.
[B]Adam combines the advantages of RMSProp and momentum.
[C]Adam should be used with batch gradient computations, not with mini-batches.
[D]We usually use “default” values for the hyperparameters $\beta_1$, $\beta_2$ and $\epsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).
Answer: C
Explanation: Adam is normally used together with mini-batch gradient descent; it is not restricted to full-batch gradient computations, so C is false.
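For reference, a minimal NumPy sketch of the Adam update for one parameter matrix, using the default hyperparameters quoted in option [D] (the random dW values are stand-ins for mini-batch gradients, not course code):

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW               # momentum-like first moment
    s = beta2 * s + (1 - beta2) * dW ** 2          # RMSProp-like second moment
    v_hat = v / (1 - beta1 ** t)                   # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W = np.zeros((3, 2))
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 101):                            # t counts update steps, starting at 1
    dW = np.random.randn(3, 2)                     # stand-in for a mini-batch gradient
    W, v, s = adam_update(W, dW, v, s, t)
```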