(1) Suppose your training examples are sentences (sequences of words). Which of the following refers to the $j$-th word in the $i$-th training example?
[A] $x^{(i)<j>}$
[B] $x^{<i>(j)}$
[C] $x^{(j)<i>}$
[D] $x^{<j>(i)}$
Answer: A
(2) Consider this RNN:
This specific type of architecture is appropriate when:
[A] $T_x = T_y$
[B] $T_x < T_y$
[C] $T_x > T_y$
[D] $T_x = 1$
Answer: A
Explanation: as shown in the figure, the input and output sequences have equal length ($T_x = T_y$).
(3) To which of these tasks would you apply a many-to-one RNN architecture? (Check all that apply)
[A]Speech recognition (input an audio clip and output a transcript)
[B]Sentiment classification (input a piece of text and output a 0/1 to denote positive or negative sentiment)
[C]Image classification (input an image and output a label)
[D]Gender recognition from speech (input an audio clip and output a label indicating the speaker’s gender)
Answer: B, D
Explanation: both tasks read an entire input sequence (audio or text) and emit a single label, which is exactly the many-to-one pattern; see the sketch below.
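As a concrete illustration of the many-to-one shape, here is a minimal NumPy sketch; every name and dimension in it (Wax, Waa, Wya, the toy sizes) is invented for the example. The RNN reads every time step but produces one prediction only after the last:

```python
import numpy as np

def many_to_one_forward(x_seq, Wax, Waa, Wya, ba, by):
    """Run a vanilla RNN over a whole sequence and emit ONE output at the end.

    x_seq: list of per-time-step input vectors, each of shape (n_x, 1).
    Returns a single prediction of shape (n_y, 1).
    """
    a = np.zeros((Waa.shape[0], 1))            # initial hidden state a<0>
    for x_t in x_seq:                          # consume the entire sequence...
        a = np.tanh(Wax @ x_t + Waa @ a + ba)
    z = Wya @ a + by                           # ...then predict once, from a<Tx>
    return 1 / (1 + np.exp(-z))                # sigmoid for a 0/1 label

# Toy run: 3 time steps, 4-dim inputs, 5-dim hidden state, 1 output unit.
rng = np.random.default_rng(0)
x_seq = [rng.standard_normal((4, 1)) for _ in range(3)]
Wax, Waa = rng.standard_normal((5, 4)), rng.standard_normal((5, 5))
Wya, ba, by = rng.standard_normal((1, 5)), np.zeros((5, 1)), np.zeros((1, 1))
print(many_to_one_forward(x_seq, Wax, Waa, Wya, ba, by))  # e.g. a sentiment score
```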
(4) You are training this RNN language model.
At the $t$-th time step, what is the RNN doing? Choose the best answer.
[A] Estimating $P(y^{<1>}, y^{<2>}, \ldots, y^{<t-1>})$
[B] Estimating $P(y^{<t>})$
[C] Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \ldots, y^{<t-1>})$
[D] Estimating $P(y^{<t>} \mid y^{<1>}, y^{<2>}, \ldots, y^{<t>})$
Answer: C
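For the record, answer C is just the chain-rule factorization that an RNN language model implements, one factor per time step:

$$
P\left(y^{<1>}, \ldots, y^{<T_y>}\right) = \prod_{t=1}^{T_y} P\left(y^{<t>} \mid y^{<1>}, \ldots, y^{<t-1>}\right)
$$

so the softmax output at step $t$ estimates exactly the conditional $P(y^{<t>} \mid y^{<1>}, \ldots, y^{<t-1>})$.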
(5) You have finished training a language model RNN and are using it to sample random sentences, as follows:
What are you doing at each time step $t$?
[A] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $y^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time-step.
[B] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $y^{<t>}$. (ii) Then pass the ground-truth word from the training set to the next time-step.
[C] (i) Use the probabilities output by the RNN to pick the highest probability word for that time-step as $y^{<t>}$. (ii) Then pass the selected word to the next time-step.
[D] (i) Use the probabilities output by the RNN to randomly sample a chosen word for that time-step as $y^{<t>}$. (ii) Then pass the selected word to the next time-step.
Answer: D
Explanation: sampling from the output distribution (rather than always taking the argmax) is what makes the sentences random, and the sampled word becomes the input to the next time-step; see the sketch below.
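A minimal sampling loop, sketched in NumPy; the toy sizes, the weight names (Wax, Waa, Wya), and treating index 0 as <EOS> are all assumptions for the example, not the course's starter code. It shows the two halves of answer D: sample with np.random.choice instead of taking the argmax, then feed the sampled word back as the next input.

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, vocab_size, eos_idx = 16, 50, 0           # toy sizes; index 0 stands in for <EOS>
Wax = rng.standard_normal((n_a, vocab_size)) * 0.1
Waa = rng.standard_normal((n_a, n_a)) * 0.1
Wya = rng.standard_normal((vocab_size, n_a)) * 0.1

def rnn_step(x, a):
    """One vanilla-RNN time step: softmax probabilities over the vocab, plus a_next."""
    a_next = np.tanh(Wax @ x + Waa @ a)
    z = Wya @ a_next
    probs = np.exp(z - z.max())
    return probs / probs.sum(), a_next

def sample_sentence(max_len=20):
    a, x, words = np.zeros(n_a), np.zeros(vocab_size), []
    for _ in range(max_len):
        probs, a = rnn_step(x, a)
        idx = rng.choice(vocab_size, p=probs)  # (i) randomly sample, don't argmax
        if idx == eos_idx:
            break
        words.append(int(idx))
        x = np.zeros(vocab_size)
        x[idx] = 1                             # (ii) pass the SELECTED word onward
    return words

print(sample_sentence())                       # e.g. a list of sampled word indices
```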
(6) You are training an RNN, and find that your weights and activations are all taking on the value of NaN (“Not a Number”). Which of these is the most likely cause of this problem?
[A]Vanishing gradient problem.
[B]Exploding gradient problem.
[C]ReLU activation function g(.) used to compute g(z), where z is too large.
[D]Sigmoid activation function g(.) used to compute g(z), where z is too large.
Answer: B
Explanation: exploding gradients overflow the weights to NaN, whereas vanishing gradients merely slow learning down; the standard remedy is gradient clipping, sketched below.
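A minimal gradient-clipping sketch, assuming gradients live in a dict of NumPy arrays and using a made-up threshold of 5.0: clipping each element into [-max_value, max_value] keeps one exploding step from overflowing to NaN.

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip every gradient array element-wise into [-max_value, max_value]."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

# A blown-up gradient is pulled back into a safe range before the update step.
grads = {"dWaa": np.array([[1e9, -2.0], [0.5, -1e9]])}
print(clip_gradients(grads)["dWaa"])  # [[ 5.  -2. ]  [ 0.5 -5. ]]
```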
(7) Suppose you are training an LSTM. You have a 10,000-word vocabulary, and are using an LSTM with 100-dimensional activations $a^{<t>}$. What is the dimension of $\Gamma_u$ at each time step?
[A] 1
[B] 100
[C] 300
[D] 10000
Answer: B
Explanation: $\Gamma_u$ has the same dimension as the hidden activation $a^{<t>}$, so it is 100-dimensional; the shape check below makes this concrete.
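A quick shape check in NumPy; the stacked-input convention $[a^{<t-1>}, x^{<t>}]$ follows the course, but the weight names and toy values here are made up for the example:

```python
import numpy as np

n_a, n_x = 100, 10000                       # hidden size, one-hot vocabulary size
rng = np.random.default_rng(0)
Wu = rng.standard_normal((n_a, n_a + n_x)) * 0.01
bu = np.zeros(n_a)

a_prev = rng.standard_normal(n_a)           # a<t-1>, shape (100,)
x_t = np.zeros(n_x)
x_t[42] = 1                                 # x<t>, a one-hot word vector

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

gamma_u = sigmoid(Wu @ np.concatenate([a_prev, x_t]) + bu)
print(gamma_u.shape)                        # (100,) -- same as a<t>, hence answer B
```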
(8) Here are the update equations for the GRU.
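The equations appeared as an image in the original post. For reference, the full-GRU updates in the course's notation (reconstructed here, so treat them as a reference rather than the exact figure) are:

$$
\begin{aligned}
\tilde{c}^{<t>} &= \tanh\left(W_c[\Gamma_r * c^{<t-1>}, x^{<t>}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u[c^{<t-1>}, x^{<t>}] + b_u\right) \\
\Gamma_r &= \sigma\left(W_r[c^{<t-1>}, x^{<t>}] + b_r\right) \\
c^{<t>} &= \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>} \\
a^{<t>} &= c^{<t>}
\end{aligned}
$$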
Alice proposes to simplify the GRU by always removing the $\Gamma_u$, i.e., setting $\Gamma_u = 1$. Betty proposes to simplify the GRU by removing the $\Gamma_r$, i.e., setting $\Gamma_r = 1$ always. Which of these models is more likely to work without vanishing gradient problems even when trained on very long input sequences?
[A] Alice’s model (removing $\Gamma_u$), because if $\Gamma_r \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
[B] Alice’s model (removing $\Gamma_u$), because if $\Gamma_r \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
[C] Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 0$ for a timestep, the gradient can propagate back through that timestep without much decay.
[D] Betty’s model (removing $\Gamma_r$), because if $\Gamma_u \approx 1$ for a timestep, the gradient can propagate back through that timestep without much decay.
Answer: C
Explanation: to keep the gradient from vanishing, $c^{<t>}$ should depend as directly as possible on $c^{<t-1>}$. With $\Gamma_u \approx 0$, the update $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$ essentially copies $c^{<t-1>}$ forward, much like the skip connection in a residual network.
(9) Here are the equations for the GRU and the LSTM:
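Again the original showed an image; the GRU updates are listed under question (8), and the LSTM updates in the course's notation (likewise reconstructed as a reference) are:

$$
\begin{aligned}
\tilde{c}^{<t>} &= \tanh\left(W_c[a^{<t-1>}, x^{<t>}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u[a^{<t-1>}, x^{<t>}] + b_u\right) \\
\Gamma_f &= \sigma\left(W_f[a^{<t-1>}, x^{<t>}] + b_f\right) \\
\Gamma_o &= \sigma\left(W_o[a^{<t-1>}, x^{<t>}] + b_o\right) \\
c^{<t>} &= \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>} \\
a^{<t>} &= \Gamma_o * \tanh\left(c^{<t>}\right)
\end{aligned}
$$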
From these, we can see that the Update Gate and Forget Gate in the LSTM play a role similar to _______ and _______ in the GRU. What should go in the blanks?
[A] $\Gamma_u$ and $1 - \Gamma_u$
[B] $\Gamma_u$ and $\Gamma_r$
[C] $1 - \Gamma_u$ and $\Gamma_u$
[D] $\Gamma_r$ and $\Gamma_u$
Answer: A
Explanation: the GRU's memory update $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$ matches the LSTM's $c^{<t>} = \Gamma_u * \tilde{c}^{<t>} + \Gamma_f * c^{<t-1>}$ term by term, with $1 - \Gamma_u$ playing the forget gate's role.
(10) You have a pet dog whose mood is heavily dependent on the current and past few days’ weather. You’ve collected data for the past 365 days on the weather, which you represent as a sequence $x^{<1>}, \ldots, x^{<365>}$. You’ve also collected data on your dog’s mood, which you represent as $y^{<1>}, \ldots, y^{<365>}$. You’d like to build a model to map from $x \rightarrow y$. Should you use a Unidirectional RNN or Bidirectional RNN for this problem?
[A]Bidirectional RNN, because this allows the prediction of mood on day t to take into account more information.
[B]Bidirectional RNN, because this allows backpropagation to compute more accurate gradients.
[C] Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<1>}, \ldots, x^{<t>}$, but not on $x^{<t+1>}, \ldots, x^{<365>}$.
[D] Unidirectional RNN, because the value of $y^{<t>}$ depends only on $x^{<t>}$, and not other days’ weather.
Answer: C
Explanation: the mood depends only on the current and past few days’ weather, so predicting $y^{<t>}$ never requires the future inputs $x^{<t+1>}, \ldots, x^{<365>}$; a unidirectional RNN is sufficient.