(1)Which notation would you use to denote the 3rd layer’s activations when the input is the 7th example from the 8th mini-batch?
[A] $a^{[3]\{7\}(8)}$
[B] $a^{[8]\{7\}(3)}$
[C] $a^{[8]\{3\}(7)}$
[D] $a^{[3]\{8\}(7)}$
Answer: D
Explanation: The square brackets [] denote the layer, the curly braces {} denote the mini-batch, and the parentheses () denote the training example.
(2)Which of these statements about mini-batch gradient descent do you agree with?
[A]One iteration of mini-batch gradient descent (computing on a single mini-batch) is faster than one iteration of batch gradient descent.
[B] Training one epoch (one pass through the training set) using mini-batch gradient descent is faster than training one epoch using batch gradient descent.
[C] You should implement mini-batch gradient descent without an explicit for-loop over different mini-batches, so that the algorithm processes all mini-batches at the same time (vectorization).
Answer: A
Explanation:
[B] One epoch of mini-batch gradient descent processes the same training data as one epoch of batch gradient descent (in the limiting case where the mini-batch size equals the dataset size, the two are identical), so it is not faster per epoch. Hence B is wrong.
[C] Mini-batch gradient descent still requires an explicit for-loop over the mini-batches; it does not process them all at once. Hence C is wrong.
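For reference, a minimal NumPy sketch (not from the quiz or course code) of this loop structure, using plain logistic regression for brevity: the computation inside each mini-batch is vectorized across its examples, but an explicit for-loop still walks over the mini-batches one at a time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_epoch(X, Y, w, b, mini_batch_size=64, learning_rate=0.1):
    """X: (n_x, m) inputs, Y: (1, m) labels in {0, 1}, w: (n_x, 1), b: scalar."""
    m = X.shape[1]
    permutation = np.random.permutation(m)             # shuffle before partitioning
    X, Y = X[:, permutation], Y[:, permutation]

    for k in range(0, m, mini_batch_size):              # explicit for-loop over mini-batches
        Xb, Yb = X[:, k:k + mini_batch_size], Y[:, k:k + mini_batch_size]
        mb = Xb.shape[1]
        A = sigmoid(w.T @ Xb + b)                       # forward pass, vectorized over the batch
        dZ = A - Yb                                     # backward pass for logistic regression
        dw = (Xb @ dZ.T) / mb
        db = np.sum(dZ) / mb
        w, b = w - learning_rate * dw, b - learning_rate * db
    return w, b

# Example usage on synthetic data
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 1000))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)
w, b = np.zeros((5, 1)), 0.0
for epoch in range(10):
    w, b = train_one_epoch(X, Y, w, b)
```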
(3)Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
[A]If the mini-batch size is 1, you end up having to process the entire training set before making any progress.
[B]If the mini-batch size is m, you end up with stochastic gradient descent, which is usually slower than mini-batch gradient descent.
[C]If the mini-batch size is 1, you lose the benefits of vectorization across examples in the mini-batch.
[D]If the mini-batch size is m, you end up with batch gradient descent, which has to process the whole training set before making progress.
Answer: C, D
Explanation: If the mini-batch size is 1 (stochastic gradient descent), you lose the speed-up from vectorizing across the examples in a mini-batch. If the mini-batch size is m, mini-batch gradient descent reduces to batch gradient descent, which must process the entire training set before making any progress.
(4)Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like this:
Which of the following do you agree with?
[A]Whether you’re using batch gradient descent or mini-batch gradient descent, something is wrong.
[B]Whether you’re using batch gradient descent or mini-batch gradient descent, this looks acceptable.
[C]If you’re using mini-batch gradient descent, something is wrong. But if you’re using batch gradient descent, this looks acceptable.
[D]If you’re using mini-batch gradient descent, this looks acceptable. But if you’re using batch gradient descent, something is wrong.
Answer: D
Explanation: With batch gradient descent, the cost must decrease monotonically on every iteration. With mini-batch gradient descent it need not, because individual mini-batches contain noisy examples, so the curve can oscillate somewhat.
(5)Suppose the temperature in Casablanca over the first two days of January is the same:
Jan 1st: $\theta_1 = 10$℃
Jan 2nd: $\theta_2 = 10$℃
(We used Fahrenheit in lecture, so we will use Celsius here in honor of the metric world.)
Say you use an exponentially weighted average with $\beta = 0.5$ to track the temperature: $v_0 = 0$, $v_t = \beta v_{t-1} + (1-\beta)\theta_t$. If $v_2$ is the value computed after day 2 without bias correction, and $v_2^{corrected}$ is the value computed with bias correction, what are these values?
[A] $v_2 = 7.5$, $v_2^{corrected} = 10$
[B] $v_2 = 10$, $v_2^{corrected} = 10$
[C] $v_2 = 7.5$, $v_2^{corrected} = 7.5$
[D] $v_2 = 10$, $v_2^{corrected} = 7.5$
Answer: A
Explanation:
$v_1 = \beta v_0 + (1-\beta)\theta_1 = 0.5 \times 0 + (1-0.5) \times 10 = 5$
$v_2 = \beta v_1 + (1-\beta)\theta_2 = 0.5 \times 5 + (1-0.5) \times 10 = 7.5$
$v_1^{corrected} = \frac{v_1}{1-\beta^1} = \frac{5}{1-0.5^1} = 10$
$v_2^{corrected} = \frac{v_2}{1-\beta^2} = \frac{7.5}{1-0.5^2} = 10$
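A short Python sketch of this calculation (an illustrative addition, not part of the original solution), computing the exponentially weighted average with and without bias correction:

```python
beta = 0.5
thetas = [10.0, 10.0]                 # temperatures on Jan 1st and Jan 2nd (Celsius)

v = 0.0                               # v_0 = 0
for t, theta in enumerate(thetas, start=1):
    v = beta * v + (1 - beta) * theta           # v_t without bias correction
    v_corrected = v / (1 - beta ** t)           # bias-corrected estimate
    print(f"day {t}: v = {v}, v_corrected = {v_corrected}")

# day 1: v = 5.0, v_corrected = 10.0
# day 2: v = 7.5, v_corrected = 10.0
```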
(6)Which of these is NOT a good learning rate decay scheme? Here, t is the epoch number.
[A] $\alpha = \frac{1}{1 + 2t}\alpha_0$
[B] $\alpha = e^t \alpha_0$
[C] $\alpha = 0.95^t \alpha_0$
[D] $\alpha = \frac{1}{\sqrt{t}}\alpha_0$
Answer: B
Explanation: $\alpha = e^t \alpha_0$ increases monotonically with the epoch number t, so the learning rate grows instead of decaying; the other three schemes all shrink $\alpha$ over time.
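A small NumPy sketch (illustrative only; alpha0 = 0.2 and the epoch range are arbitrary) comparing how the four schedules behave as the epoch number t grows:

```python
import numpy as np

alpha0 = 0.2
t = np.arange(1, 11)                                  # epoch number

schedules = {
    "A: alpha0 / (1 + 2t)": alpha0 / (1 + 2 * t),
    "B: alpha0 * e^t":      alpha0 * np.exp(t),       # grows explosively -> not a decay scheme
    "C: alpha0 * 0.95^t":   alpha0 * 0.95 ** t,
    "D: alpha0 / sqrt(t)":  alpha0 / np.sqrt(t),
}

for name, alphas in schedules.items():
    print(f"{name}: {np.round(alphas[:4], 3)}")
```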
(7)You use an exponentially weighted average on the London temperature dataset. You use the following to track the temperature:
$v_t = \beta v_{t-1} + (1-\beta)\theta_t$. The red line below was computed using $\beta = 0.9$. What would happen to your red curve as you vary $\beta$? (Check the two that apply)
[A] Decreasing $\beta$ will shift the red line slightly to the right.
[B] Increasing $\beta$ will shift the red line slightly to the right.
[C] Decreasing $\beta$ will create more oscillation within the red line.
[D] Increasing $\beta$ will create more oscillation within the red line.
Answer: B, C
Explanation: As shown in the lecture figure, the green line corresponds to $\beta = 0.98$ and the yellow line to $\beta = 0.5$. A larger $\beta$ averages over more past days, so the curve is smoother but adapts more slowly and shifts slightly to the right; a smaller $\beta$ tracks the latest temperatures more closely and therefore oscillates more.
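A small sketch of this effect (illustrative only; the synthetic temperature series is made up): a larger $\beta$ gives a smoother but more delayed curve, a smaller $\beta$ a noisier one that tracks the raw data closely.

```python
import numpy as np

def ewa(thetas, beta):
    """Exponentially weighted average of a sequence, without bias correction."""
    v, out = 0.0, []
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return np.array(out)

rng = np.random.default_rng(0)
days = np.arange(365)
temps = 10 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, 365)  # noisy yearly cycle

smooth = ewa(temps, beta=0.98)   # very smooth, lags behind (shifted to the right)
medium = ewa(temps, beta=0.9)    # the "red line" from the question
noisy  = ewa(temps, beta=0.5)    # oscillates, follows the raw temperatures closely
```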
(8)Consider this figure:
These plots were generated with gradient descent; with gradient descent with momentum ($\beta = 0.5$); and gradient descent with momentum ($\beta = 0.9$). Which curve corresponds to which algorithm?
[A] (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent.
[B] (1) is gradient descent with momentum (small $\beta$), (2) is gradient descent, (3) is gradient descent with momentum (large $\beta$).
[C] (1) is gradient descent, (2) is gradient descent with momentum (large $\beta$), (3) is gradient descent with momentum (small $\beta$).
[D] (1) is gradient descent, (2) is gradient descent with momentum (small $\beta$), (3) is gradient descent with momentum (large $\beta$).
Answer: D
Explanation:
$v_{dW} = \beta v_{dW} + (1-\beta)\,dW$
$W = W - \alpha v_{dW}$
The larger $\beta$ is, the more the update depends on the previous state, and the smaller the oscillation. The larger the momentum $\beta$ is, the smoother the update, because we take more of the past gradients into account.
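A minimal NumPy sketch of this momentum update (an illustrative addition; the random dW values stand in for mini-batch gradients and are not from the course):

```python
import numpy as np

def momentum_update(W, dW, v_dW, beta=0.9, alpha=0.01):
    v_dW = beta * v_dW + (1 - beta) * dW    # exponentially weighted average of the gradients
    W = W - alpha * v_dW                    # the step uses the smoothed gradient
    return W, v_dW

W = np.zeros((3, 2))
v_dW = np.zeros_like(W)                     # v is initialized to zero
for step in range(100):
    dW = np.random.randn(3, 2)              # stand-in for a noisy mini-batch gradient
    W, v_dW = momentum_update(W, dW, v_dW)  # larger beta -> smoother, less oscillatory steps
```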
(9)Suppose batch gradient descent in a deep network is taking excessively long to find a value of the parameters that achieves a small value for the cost function $J(W^{[1]}, b^{[1]}, ..., W^{[L]}, b^{[L]})$. Which of the following techniques could help find parameter values that attain a small value for J? (Check all that apply)
[A]Try initializing all the weights to zero.
[B]Try better random initialization for the weights.
[C]Try using Adam.
[D]Try mini-batch gradient descent.
[E]Try tuning the learning rate $\alpha$.
Answer: B, C, D, E
Explanation: Initializing all the weights to zero keeps the hidden units symmetric, so gradient descent cannot make progress; better random initialization, Adam, mini-batch gradient descent, and tuning $\alpha$ can all speed up optimization.
(10)Which of the following statements about Adam is False?
[A]The learning rate hyperparameter $\alpha$ in Adam usually needs to be tuned.
[B]Adam combines the advantages of RMSProp and momentum.
[C]Adam should be used with batch gradient computations, not with mini-batches.
[D]We usually use “default” values for the hyperparameters $\beta_1$, $\beta_2$ and $\epsilon$ in Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$).
Answer: C
Explanation: Adam is normally used together with mini-batch gradient descent; it is not restricted to full-batch gradient computations, so C is false.
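For reference, a minimal NumPy sketch of the Adam update for one parameter matrix, using the default hyperparameters quoted in option [D] (the random dW values are stand-ins for mini-batch gradients, not course code):

```python
import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dW               # momentum-like first moment
    s = beta2 * s + (1 - beta2) * dW ** 2          # RMSProp-like second moment
    v_hat = v / (1 - beta1 ** t)                   # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

W = np.zeros((3, 2))
v, s = np.zeros_like(W), np.zeros_like(W)
for t in range(1, 101):                            # t counts update steps, starting at 1
    dW = np.random.randn(3, 2)                     # stand-in for a mini-batch gradient
    W, v, s = adam_update(W, dW, v, s, t)
```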