【中英】【吴恩达课后测验】Course 5 - 序列模型 - 第三周测验 - 序列模型与注意力机制

何宽

于 2019-05-12 20:17:50 发布

阅读量7.8k

点赞数 13

分类专栏：吴恩达的课后作业

本文链接：https://blog.csdn.net/u013733326/article/details/90144255

版权

吴恩达的课后作业专栏收录该内容

32 篇文章

订阅专栏

【中英】【吴恩达课后测验】Course 5 - 序列模型 - 第三周测验 - 序列模型与注意力机制

上一篇：【课程5 - 第二周编程作业】※※※※※ 【回到目录】※※※※※下一篇：【待撰写-课程5 - 第三周编程作业】

想一想使用如下的编码-解码模型来进行机器翻译：

这个模型是“条件语言模型”,编码器部分(绿色显示)的意义是建模中输入句子x的概率
- 正确
- 错误
在集束搜索中，如果增加集束宽度 $b$ ，以下哪一项是正确的？
- 集束搜索将运行的更慢。
- 集束搜索将使用更多的内存。
- 集束搜索通常将找到更好地解决方案（比如：在最大化概率 $P (y ∣ x$ )上做的更好）。
- 集束搜索将在更少的步骤后收敛。
在机器翻译中，如果我们在不使用句子归一化的情况下使用集束搜索，那么算法会输出过短的译文。
- 正确
- 错误
假设你正在构建一个能够让语音片段 $x$ 转为译文 $y$ 的基于RNN模型的语音识别系统，你的程序使用了集束搜索来试着找寻最大的 $P (y ∣ x)$ 的值 $y$ 。在开发集样本中，给定一个输入音频，你的程序会输出译文 $\hat{y} =$ “I’m building an A Eye system in Silly con Valley.”，人工翻译为 $y^* =$ “I’m building an AI system in Silicon Valley.”

在你的模型中,

$P(\hat{y} \mid x) = 1.09*10^{-7}$

$P(y^* \mid x) = 7.21*10^{-8}$

那么，你会增加集束宽度 $B$ 来帮助修正这个样本吗？
- 不会，因为 $P(y^* \mid x) \leq P(\hat{y} \mid x)$ 说明了这个锅要丢给RNN，不能让搜索算法背锅。
- 不会，因为 $P(y^* \mid x) \leq P(\hat{y} \mid x)$ 说明了这个锅要丢给搜索算法，凭什么让RNN背锅？
- 会的，因为 $P(y^* \mid x) \leq P(\hat{y} \mid x)$ 说明了都是RNN的错，咱不能冤枉搜索算法。
- 会的，因为 $P(y^* \mid x) \leq P(\hat{y} \mid x)$ 说明了千错万错都是搜索算法的错，可不能惩罚RNN啊~
博主注：皮这一下好开心~(～￣▽￣)～
接着使用第4题那里的样本，假设你花了几周的时间来研究你的算法，现在你发现，对于绝大多数让算法出错的例子而言， $P(y^* \mid x) \leq P(\hat{y} \mid x)$ ，这表明你应该将注意力集中在改进搜索算法上，对吗？
- 嗯嗯~
- 不对
回想一下机器翻译的模型：

除此之外，还有个公式 $a^{<t,t'>} = \frac{\text{exp}(e^{<t,t'>})}{\sum^{T_x}_{t'=1}\text{exp}(e^{<t,t'>})}$

下面关于 $\alpha^{<t,t’>}$ 的选项那个（些）是正确的？
- 对于网络中与输出 $y^{<t>}$ 高度相关的 $\alpha^{<t'>}$ 而言，我们通常希望 $\alpha^{<t,t'>}$ 的值更大。（请注意上标）
- 对于网络中与输出 $y^{<t>}$ 高度相关的 $\alpha^{<t>}$ 而言，我们通常希望 $\alpha^{<t,t'>}$ 的值更大。（请注意上标）
- $\sum_{t} \alpha^{<t,t'>} = 1$ (注意是和除以t.)
- $\sum_{t'} \alpha^{<t,t'>}=1$ (注意是和除以t′.)
网络通过学习的值 $e^{<t,t'>}$ 来学习在哪里关注“关注点”，这个值是用一个小的神经网络的计算出来的：

这个神经网络的输入中，我们不能将 $s^{<t>}$ 替换为 $s^{<t-1>}$ 。这是因为 $s^{<t>}$ 依赖于 $\alpha^{<t,t'>}$ ，而 $\alpha^{<t,t'>}$ 又依赖于 $e^{<t,t'>}$ ；所以在我们需要评估这个网络时，我们还没有计算出 $s^{t}$ 。
- 正确
- 错误
与题1中的编码-解码模型（没有使用注意力机制）相比，我们希望有注意力机制的模型在下面的情况下有着最大的优势：
- 输入序列的长度 $T_x$ 比较大。
- 输入序列的长度 $T_x$ 比较小。

9.在CTC模型下，不使用"空白"字符（_）分割的相同字符串将会被折叠。那么在CTC模型下，以下字符串将会被折叠成什么样子？__c_oo_o_kk___b_ooooo__oo__kkk

cokbok
- cookbook
- cook book
- coookkboooooookkk

在触发词检测中, $x^{<t>}$ 是:
- 时间 $t$ 时的音频特征（就像是频谱特征一样）。
- 第 $t$ 个输入字，其被表示为一个独热向量或者一个字嵌入。
- 是否在第 $t$ 时刻说出了触发词。
- 是否有人在第 $t$ 时刻说完了触发词。

Sequence models & Attention mechanism

Consider using this encoder-decoder model for machine translation.

This model is a “conditional language model” in the sense that the encoder portion (shown in green) is modeling the probability of the input sentence $x$ .
- [x] True
- [ ] False

In beam search, if you increase the beam width BB, which of the following would you expect to be true? Check all that apply.
- Beam search will run more slowly.
- Beam search will use up more memory.
- Beam search will generally find better solutions (i.e. do a better job maximizing P(y \mid x)P(y∣x))
- Beam search will converge after fewer steps.

In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.
- True
- False

Suppose you are building a speech recognition system, which uses an RNN model to map from audio clip $x$ to a text transcript $y$ . Your algorithm uses beam search to try to find the value of $y$ that maximizes $\mid x)$ .
On a dev set example, given an input audio clip, your algorithm outputs the transcript $\hat{y} =$ “I’m building an A Eye system in Silly con Valley.”, whereas a human gives a much superior transcript $y^* =$ “I’m building an AI system in Silicon Valley.”.
According to your model,
$P(\hat{y} \mid x) = 1.09*10^{-7}$
$P(y^∗ \mid x) = 7.21∗10^{−8}$
Would you expect increasing the beam width B to help correct this example?
- No, because $P(y^∗ \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
- No, because $P(y^∗ \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
- Yes, because $P(y^∗ \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
- Yes, because $P(y^∗ \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.

Continuing the example from Q4, suppose you work on your algorithm for a few more weeks, and now find that for the vast majority of examples on which your algorithm makes a mistake, $P(y^∗ \mid x) > P(\hat{y} \mid x)$ . This suggest you should focus your attention on improving the search algorithm.
- True
- False

Consider the attention model for machine translation.

Further, here is the formula for $\alpha^{<t,t′>}$ .

$a^{<t,t'>} = \frac{\text{exp}(e^{<t,t'>})}{\sum^{T_x}_{t'=1}\text{exp}(e^{<t,t'>})}$

Which of the following statements about $\alpha^{<t,t′>}$ are true? Check all that apply.

We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t>}$ . (Note the indices in the superscripts.)
We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t>}$ that are highly relevant to the value the network should output for $y^{<t'>}$ . (Note the indices in the superscripts.)
$\sum_{t} \alpha^{<t,t'>}=1$ (Note the summation is over $t$ .)
$\sum_{t'} \alpha^{<t,t'>}=1$ (Note the summation is over $t^{'}$ .)

The network learns where to “pay attention” by learning the values e<t,t′>, which are computed using a small neural network:
We can’t replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network. This is because $s^{<t>}$ depends on $\alpha^{<t,t′>}$ which in turn depends on $e^{<t,t′>}$ ; so at the time we need to evalute this network, we haven’t computed $s^{<t>}$ yet.
- True
- False

Compared to the encoder-decoder model shown in Question 1 of this quiz (which does not use an attention mechanism), we expect the attention model to have the greatest advantage when:
- The input sequence length $T_x$ is large.
- The input sequence length $T_x$ is small.

Under the CTC model, identical repeated characters not separated by the “blank” character (_) are collapsed. Under the CTC model, what does the following string collapse to? __c_oo_o_kk___b_ooooo__oo__kkk
- cokbok
- cookbook
- cook book
- coookkboooooookkk

In trigger word detection, $x^{<t>}$ is:
- Features of the audio (such as spectrogram features) at time $t$ .
- The $t$ -th input word, represented as either a one-hot vector or a word embedding.
- Whether the trigger word is being said at time $t$ .
- Whether someone has just finished saying the trigger word at time $t$ .