(1) Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j$-th word in the $i$-th training example?
[A] $x^{(i)<j>}$
[B] $x^{<i>(j)}$
[C] $x^{(j)<i>}$
[D] $x^{<j>(i)}$
Answer: A
(2) Consider this RNN:
This specific type of architecture is appropriate when:
[A] $T_x = T_y$
[B] $T_x < T_y$
[C] $T_x > T_y$
[D] $T_x = 1$
Answer: A
Explanation: as shown in the figure, the input and output sequences have equal length ($T_x = T_y$).
(3) To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply)
[A]Speech recognition (input an audio clip and output a transcript)
[B]Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)
[C]Image classification (input an image and output a label)
[D]Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)
Answer: B, D
Explanation: both tasks read an entire input sequence (audio or text) and emit a single label, which is exactly the many-to-one pattern; see the sketch below.
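As a concrete illustration of the many-to-one shape, here is a minimal NumPy sketch; every name and dimension in it (Wax, Waa, Wya, the toy sizes) is invented for the example. The RNN reads every time step but produces one prediction only after the last:

```python
import numpy as np

def many_to_one_forward(x_seq, Wax, Waa, Wya, ba, by):
    """Run a vanilla RNN over a whole sequence and emit ONE output at the end.

    x_seq: list of per-time-step input vectors, each of shape (n_x, 1).
    Returns a single prediction of shape (n_y, 1).
    """
    a = np.zeros((Waa.shape[0], 1))            # initial hidden state a<0>
    for x_t in x_seq:                          # consume the entire sequence...
        a = np.tanh(Wax @ x_t + Waa @ a + ba)
    z = Wya @ a + by                           # ...then predict once, from a<Tx>
    return 1 / (1 + np.exp(-z))                # sigmoid for a 0/1 label

# Toy run: 3 time steps, 4-dim inputs, 5-dim hidden state, 1 output unit.
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal((4, 1)) for _ in range(3)]
Wax, Waa = rng.standard_normal((5, 4)), rng.standard_normal((5, 5))
Wya, ba, by = rng.standard_normal((1, 5)), np.zeros((5, 1)), np.zeros((1, 1))
print(many_to_one_forward(x_seq, Wax, Waa, Wya, ba, by))  # e.g. a sentiment score
```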
(4) You are training this RNN language model.
At the $t$-th time step, what is the RNN doing? Choose the best answer.
[A] Estimating $P(y^{<1>}, y^{<2>}, \ldots, y^{<t-1>})$
[B] Estimating $P(y^{<t>})$
[C] Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \ldots, y^{<t-1>})$
[D] Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \ldots, y^{<t>})$
Answer: C
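For the record, answer C is just the chain-rule factorization that an RNN language model implements, one factor per time step:

$$
P\left(y^{<1>}, \ldots, y^{<T_y>}\right) = \prod_{t=1}^{T_y} P\left(y^{<t>} \mid y^{<1>}, \ldots, y^{<t-1>}\right)
$$

so the softmax output at step $t$ estimates exactly the conditional $P(y^{<t>} \mid y^{<1>}, \ldots, y^{<t-1>})$.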
(5) You have finished training a language model RNN and are using it to sample random sentences, as follows:
What are you doing at each time step $t$?
[A] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $y^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time-step.
[B] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $y^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time-step.
[C] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $y^{<t>}$. (ii) Then pass the selected word to the next time-step.
[D] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $y^{<t>}$. (ii) Then pass the selected word to the next time-step.
Answer: D
Explanation: sampling from the output distribution (rather than always taking the argmax) is what makes the sentences random, and the sampled word becomes the input to the next time-step; see the sketch below.
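A minimal sampling loop, sketched in NumPy; the toy sizes, the weight names (Wax, Waa, Wya), and treating index 0 as <EOS> are all assumptions for the example, not the course's starter code. It shows the two halves of answer D: sample with np.random.choice instead of taking the argmax, then feed the sampled word back as the next input.

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, vocab_size, eos_idx = 16, 50, 0           # toy sizes; index 0 stands in for <EOS>
Wax = rng.standard_normal((n_a, vocab_size)) * 0.1
Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wya = rng.standard_normal((vocab_size, n_a)) * 0.1

def rnn_step(x, a):
    """One vanilla-RNN time step: softmax probabilities over the vocab, plus a_next."""
    a_next = np.tanh(Wax @ x + Waa @ a)
    z = Wya @ a_next
    probs = np.exp(z - z.max())
    return probs / probs.sum(), a_next

def sample_sentence(max_len=20):
    a, x, words = np.zeros(n_a), np.zeros(vocab_size), []
    for _ in range(max_len):
        probs, a = rnn_step(x, a)
        idx = rng.choice(vocab_size, p=probs)  # (i) randomly sample, don't argmax
        if idx == eos_idx:
            break
        words.append(int(idx))
        x = np.zeros(vocab_size)
        x[idx] = 1                             # (ii) pass the SELECTED word onward
    return words

print(sample_sentence())                       # e.g. a list of sampled word indices
```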
(6) You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
[A]Vanishing gradient problem.
[B]Exploding gradient problem.
[C]ReLU activation function g(.) used to compute g(z), where z is too large.
[D]Sigmoid activation function g(.) used to compute g(z), where z is too large.
Answer: B
Explanation: exploding gradients overflow the weights to NaN, whereas vanishing gradients merely slow learning down; the standard remedy is gradient clipping, sketched below.
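A minimal gradient-clipping sketch, assuming gradients live in a dict of NumPy arrays and using a made-up threshold of 5.0: clipping each element into [-max_value, max_value] keeps one exploding step from overflowing to NaN.

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip every gradient array element-wise into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

# A blown-up gradient is pulled back into a safe range before the update step.
grads = {"dWaa": np.array([[1e9, -2.0], [0.5, -1e9]])}
print(clip_gradients(grads)["dWaa"])  # [[ 5.  -2. ]  [ 0.5 -5. ]]
```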
(7) Suppose you are training an LSTM. You have a 10,000-word vocabulary, and are using an LSTM with 100-dimensional activations $a^{<t>}$. What is the dimension of $\Gamma_u$ at each time step?
[A] 1
[B] 100
[C] 300
[D] 10000
Answer: B
Explanation: $\Gamma_u$ has the same dimension as the hidden activation $a^{<t>}$, so it is 100-dimensional; the shape check below makes this concrete.
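A quick shape check in NumPy; the stacked-input convention $[a^{<t-1>}, x^{<t>}]$ follows the course, but the weight names and toy values here are made up for the example:

```python
import numpy as np

n_a, n_x = 100, 10000                       # hidden size, one-hot vocabulary size
rng = np.random.default_rng(0)
Wu = rng.standard_normal((n_a, n_a + n_x)) * 0.01
bu = np.zeros(n_a)

a_prev = rng.standard_normal(n_a)           # a<t-1>, shape (100,)
x_t = np.zeros(n_x)
x_t[42] = 1                                 # x<t>, a one-hot word vector

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

gamma_u = sigmoid(Wu @ np.concatenate([a_prev, x_t]) + bu)
print(gamma_u.shape)                        # (100,) -- same as a<t>, hence answer B
```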
(8) Here are the update equations for the GRU.
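The equations appeared as an image in the original post. For reference, the full-GRU updates in the course's notation (reconstructed here, so treat them as a reference rather than the exact figure) are:

$$
\begin{aligned}
\tilde{c}^{<t>} &= \tanh\left(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u[c^{<t-1>}, x^{<t>}] + b_u\right) \\
\Gamma_r &= \sigma\left(W_r[c^{<t-1>}, x^{<t>}] + b_r\right) \\
c^{<t>} &= \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>} \\
a^{<t>} &= c^{<t>}
\end{aligned}
$$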
Alice proposes to simplify the GRU by always removing the $\Gamma_u$, i.e., setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing the $\Gamma_r$, i.e., setting $\Gamma_r = 1$ always. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
[A] Alice’s model (removing $\Gamma_u$), because if $\Gamma_r \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
[B] Alice’s model (removing $\Gamma_u$), because if $\Gamma_r \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
[C] Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
[D] Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
Answer: C
Explanation: to keep the gradient from vanishing, $c^{<t>}$ should depend as directly as possible on $c^{<t-1>}$. With $\Gamma_u \approx 0$, the update $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$ essentially copies $c^{<t-1>}$ forward, much like the skip connection in a residual network.
(9) Here are the equations for the GRU and the LSTM:
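Again the original showed an image; the GRU updates are listed under question (8), and the LSTM updates in the course's notation (likewise reconstructed as a reference) are:

$$
\begin{aligned}
\tilde{c}^{<t>} &= \tanh\left(W_c[a^{<t-1>}, x^{<t>}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u[a^{<t-1>}, x^{<t>}] + b_u\right) \\
\Gamma_f &= \sigma\left(W_f[a^{<t-1>}, x^{<t>}] + b_f\right) \\
\Gamma_o &= \sigma\left(W_o[a^{<t-1>}, x^{<t>}] + b_o\right) \\
c^{<t>} &= \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>} \\
a^{<t>} &= \Gamma_o * \tanh\left(c^{<t>}\right)
\end{aligned}
$$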
From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to _______ and _______ in the GRU. What should go in the blanks?
[A] $\Gamma_u$ and $1 - \Gamma_u$
[B] $\Gamma_u$ and $\Gamma_r$
[C] $1 - \Gamma_u$ and $\Gamma_u$
[D] $\Gamma_r$ and $\Gamma_u$
Answer: A
Explanation: the GRU's memory update $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$ matches the LSTM's $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$ term by term, with $1 - \Gamma_u$ playing the forget gate's role.
(10) You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence $x^{<1>}, \ldots, x^{<365>}$. You’ve also collected data on your dog’s mood, which you represent as $y^{<1>}, \ldots, y^{<365>}$. You’d like to build a model to map from $x \rightarrow y$. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
[A]Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
[B]Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
[C] Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<1>}, \ldots, x^{<t>}$, but not on $x^{<t+1>}, \ldots, x^{<365>}$.
[D] Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<t>}$, and not other days’ weather.
Answer: C
Explanation: the mood depends only on the current and past few days’ weather, so predicting $y^{<t>}$ never requires the future inputs $x^{<t+1>}, \ldots, x^{<365>}$; a unidirectional RNN is sufficient.