[Chinese & English] [Andrew Ng Course Quiz] Course 5 - Sequence Models - Week 3 Quiz - Sequence Models & Attention Mechanism


Previous: [Course 5 - Week 2 Programming Assignment] ※※※※※ [Back to Contents] ※※※※※ Next: [To be written - Course 5 - Week 3 Programming Assignment]

  1. Consider using the following encoder-decoder model for machine translation:
    (figure: encoder-decoder model)
    This model is a "conditional language model" in the sense that the encoder portion (shown in green) models the probability of the input sentence $x$.

    • True
    • False
  2. In beam search, if you increase the beam width $B$, which of the following would you expect to be true? (See the beam-search sketch after this quiz.)

    • Beam search will run more slowly.
    • Beam search will use more memory.
    • Beam search will generally find better solutions (i.e., do a better job maximizing $P(y \mid x)$).
    • Beam search will converge after fewer steps.
  3. In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations. (See the length-normalization sketch after this quiz.)

    • True
    • False
  4. Suppose you are building an RNN-based speech recognition system that maps an audio clip $x$ to a transcript $y$. Your algorithm uses beam search to try to find the value of $y$ that maximizes $P(y \mid x)$. On a dev set example, given an input audio clip, your algorithm outputs the transcript $\hat{y} =$ "I'm building an A Eye system in Silly con Valley.", whereas a human gives the transcript $y^* =$ "I'm building an AI system in Silicon Valley."

    According to your model,

    $P(\hat{y} \mid x) = 1.09 \times 10^{-7}$

    $P(y^* \mid x) = 7.21 \times 10^{-8}$

    So, would you increase the beam width $B$ to help correct this example? (See the error-attribution sketch after this quiz.)

    • No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ means this one is on the RNN; we can't pin it on the search algorithm.

    • No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ means this one is on the search algorithm; why should the RNN take the blame?

    • Yes, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ means it is all the RNN's fault; we mustn't wrong the search algorithm.

    • Yes, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ means the search algorithm is entirely to blame; don't go punishing the RNN~

    Blogger's note: being cheeky here was great fun~ (~ ̄▽ ̄)~

  5. Continuing with the example from Question 4, suppose you work on your algorithm for a few more weeks and now find that, for the vast majority of examples on which your algorithm makes a mistake, $P(y^* \mid x) > P(\hat{y} \mid x)$. Does this suggest you should focus your attention on improving the search algorithm?

    • Yep~
    • Nope
  6. Recall the attention model for machine translation:
    (figure: attention model for machine translation)
    In addition, here is the formula for $\alpha^{<t,t'>}$: $\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x} \exp(e^{<t,t'>})}$

    Which of the following statements about $\alpha^{<t,t'>}$ are correct? (See the attention-softmax sketch after this quiz.)

    • For the activations $a^{<t'>}$ that are highly relevant to the network's output $y^{<t>}$, we generally expect the value of $\alpha^{<t,t'>}$ to be larger. (Note the indices in the superscripts.)
    • For the activations $a^{<t>}$ that are highly relevant to the network's output $y^{<t'>}$, we generally expect the value of $\alpha^{<t,t'>}$ to be larger. (Note the indices in the superscripts.)
    • $\sum_{t} \alpha^{<t,t'>} = 1$ (Note the summation is over $t$.)
    • $\sum_{t'} \alpha^{<t,t'>} = 1$ (Note the summation is over $t'$.)
  7. The network learns where to "pay attention" by learning the values $e^{<t,t'>}$, which are computed by a small neural network. (See the attention-scorer sketch after this quiz.)

    Among the inputs to this neural network, we cannot replace $s^{<t-1>}$ with $s^{<t>}$. This is because $s^{<t>}$ depends on $\alpha^{<t,t'>}$, which in turn depends on $e^{<t,t'>}$; so at the time we need to evaluate this network, we have not yet computed $s^{<t>}$.

    • True
    • False
  8. Compared with the encoder-decoder model in Question 1 (which does not use an attention mechanism), we expect the model with attention to have the greatest advantage when:

    • The input sequence length $T_x$ is large.
    • The input sequence length $T_x$ is small.

  9. Under the CTC model, identical repeated characters not separated by the "blank" character (_) are collapsed. Under the CTC model, what does the following string collapse to? __c_oo_o_kk___b_ooooo__oo__kkk (See the CTC-collapse sketch after this quiz.)

    • cokbok
    • cookbook
    • cook book
    • coookkboooooookkk
  10. In trigger word detection, $x^{<t>}$ is (see the labeling sketch after this quiz):

    • Features of the audio (such as spectrogram features) at time $t$.
    • The $t$-th input word, represented as either a one-hot vector or a word embedding.
    • Whether the trigger word is being said at time $t$.
    • Whether someone has just finished saying the trigger word at time $t$.
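
A few toy Python sketches for the questions above follow. They are illustrative only, written for this post with made-up names, sizes, and values; they are not code from the course.

For Question 2, here is one step of beam search: a larger beam width $B$ means more candidates are expanded and kept at every step, so decoding runs more slowly and uses more memory, but the search is less likely to miss a high-probability $y$.

```python
import numpy as np

def beam_search_step(beams, next_log_probs, B):
    """Expand every beam by every token, then keep only the B best candidates.
    beams: list of (sequence, cumulative log-probability) pairs.
    next_log_probs(seq): vector of log P(next token | x, seq) from the model."""
    candidates = []
    for seq, logp in beams:
        for token, lp in enumerate(next_log_probs(seq)):
            candidates.append((seq + [token], logp + lp))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:B]          # work per step grows with B: slower, more memory

# Toy usage with a hypothetical 5-token vocabulary and random "model" scores.
rng = np.random.default_rng(0)
toy_model = lambda seq: np.log(rng.dirichlet(np.ones(5)))
beams = [([], 0.0)]
for _ in range(3):                 # decode 3 tokens
    beams = beam_search_step(beams, toy_model, B=2)
print(beams)                       # the 2 highest-scoring length-3 sequences
```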
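
For Question 3, a minimal illustration of length normalization: every extra token multiplies the probability by a factor below 1, so the unnormalized objective prefers shorter outputs; dividing the log-probability by (a power of) the output length removes that bias. The per-token probability of 0.5 is an arbitrary toy value.

```python
import numpy as np

def log_score(log_probs, alpha=1.0):
    """Length-normalized objective: (1 / Ty**alpha) * sum_t log P(y^<t> | x, y^<1>..y^<t-1>).
    alpha = 0 recovers the raw, unnormalized log-probability."""
    return sum(log_probs) / (len(log_probs) ** alpha)

# Two candidates with identical per-token probability (0.5) but different lengths.
short = [np.log(0.5)] * 4
long_ = [np.log(0.5)] * 10

print(log_score(short, alpha=0.0), log_score(long_, alpha=0.0))  # about -2.77 vs -6.93: the short one wins
print(log_score(short, alpha=1.0), log_score(long_, alpha=1.0))  # about -0.69 vs -0.69: the length bias is gone
```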
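
For Questions 4 and 5, the error-analysis rule compares how the model itself scores the human transcript $y^*$ and the search output $\hat{y}$: if $P(y^* \mid x) > P(\hat{y} \mid x)$, beam search failed to find a sequence the RNN actually prefers; otherwise the RNN is the weak link, and a larger beam width would not help. A hypothetical helper:

```python
def blame(p_y_star, p_y_hat):
    """Attribute a dev-set error to beam search or to the RNN, given
    P(y*|x) for the human transcript and P(y_hat|x) for the algorithm's output."""
    if p_y_star > p_y_hat:
        return "beam search"   # the RNN prefers y*, but the search never found it
    return "RNN"               # the RNN itself scores the wrong output higher

print(blame(p_y_star=7.21e-8, p_y_hat=1.09e-7))  # -> "RNN", matching the numbers in Question 4
```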
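
For Question 6, the attention weights are a softmax of the scores $e^{<t,t'>}$ over the input positions $t'$, so for every output step $t$ they sum to 1 over $t'$ (not over $t$). A quick check with made-up scores:

```python
import numpy as np

def attention_weights(e):
    """e has shape (Ty, Tx): one score e^<t,t'> per (output step t, input step t').
    The softmax runs along the t' axis, matching alpha^<t,t'> = exp(e^<t,t'>) / sum_{t'} exp(e^<t,t'>)."""
    e = e - e.max(axis=1, keepdims=True)           # subtract the row max for numerical stability
    exp_e = np.exp(e)
    return exp_e / exp_e.sum(axis=1, keepdims=True)

alpha = attention_weights(np.random.randn(5, 7))   # toy sizes: Ty = 5, Tx = 7
print(alpha.sum(axis=1))                           # every row is ~1.0: the sum over t' equals 1
```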
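
For Question 7, one possible form of the small network that produces $e^{<t,t'>}$ (an additive-attention style sketch; the layer sizes and weight names are invented here). The point of the question shows up in the signature: the scorer consumes the previous decoder state $s^{<t-1>}$ together with an encoder activation $a^{<t'>}$, because $s^{<t>}$ does not exist until after the attention weights have been used.

```python
import numpy as np

def attention_score(s_prev, a_tprime, W_s, W_a, v):
    """Compute the scalar score e^<t,t'> from s^<t-1> and a^<t'> using one tanh hidden layer."""
    hidden = np.tanh(W_s @ s_prev + W_a @ a_tprime)
    return float(v @ hidden)

# Toy shapes: decoder state of size 8, encoder activation of size 6, hidden layer of size 10.
rng = np.random.default_rng(1)
W_s = rng.standard_normal((10, 8))
W_a = rng.standard_normal((10, 6))
v = rng.standard_normal(10)
print(attention_score(rng.standard_normal(8), rng.standard_normal(6), W_s, W_a, v))
```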
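
For Question 9, the CTC collapsing rule in code: first merge runs of identical characters, then drop the blank character. Applied to the string from the question, it yields "cookbook".

```python
from itertools import groupby

def ctc_collapse(s, blank="_"):
    """Collapse repeated characters not separated by the blank, then remove the blanks."""
    merged = (ch for ch, _ in groupby(s))   # "oo" -> "o", "kk" -> "k", "__" -> "_"
    return "".join(ch for ch in merged if ch != blank)

print(ctc_collapse("__c_oo_o_kk___b_ooooo__oo__kkk"))  # -> "cookbook"
```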
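
For Question 10, a sketch of how training data for trigger word detection is commonly laid out: $x^{<t>}$ is a vector of audio features (for example one spectrogram slice) at time step $t$, and the target $y^{<t>}$ is typically set to 1 for a short window right after the trigger word has been said. The step counts and window length below are made-up toy values.

```python
import numpy as np

def make_labels(Ty, trigger_end_steps, ones_after=50):
    """Return y of length Ty with a 1 for `ones_after` steps after each time the trigger word ends."""
    y = np.zeros(Ty, dtype=int)
    for t_end in trigger_end_steps:
        y[t_end : t_end + ones_after] = 1
    return y

x = np.random.randn(1375, 101)                      # toy spectrogram: 1375 time steps x 101 frequency bins
y = make_labels(Ty=1375, trigger_end_steps=[700])   # the trigger word finishes at step 700
print(x.shape, int(y.sum()))                        # (1375, 101) 50
```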

Sequence models & Attention mechanism

  1. Consider using this encoder-decoder model for machine translation.

    This model is a "conditional language model" in the sense that the encoder portion (shown in green) is modeling the probability of the input sentence $x$.
    • True
    • False

  2. In beam search, if you increase the beam width $B$, which of the following would you expect to be true? Check all that apply.
    • Beam search will run more slowly.
    • Beam search will use up more memory.
    • Beam search will generally find better solutions (i.e. do a better job maximizing $P(y \mid x)$).
    • Beam search will converge after fewer steps.

  3. In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.
    • True
    • False

  4. Suppose you are building a speech recognition system, which uses an RNN model to map from audio clip $x$ to a text transcript $y$. Your algorithm uses beam search to try to find the value of $y$ that maximizes $P(y \mid x)$.
    On a dev set example, given an input audio clip, your algorithm outputs the transcript $\hat{y} =$ "I'm building an A Eye system in Silly con Valley.", whereas a human gives a much superior transcript $y^* =$ "I'm building an AI system in Silicon Valley.".
    According to your model,
    $P(\hat{y} \mid x) = 1.09 \times 10^{-7}$
    $P(y^* \mid x) = 7.21 \times 10^{-8}$
    Would you expect increasing the beam width $B$ to help correct this example?

    • No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
    • No, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
    • Yes, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
    • Yes, because $P(y^* \mid x) \leq P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.

  5. Continuing the example from Q4, suppose you work on your algorithm for a few more weeks, and now find that for the vast majority of examples on which your algorithm makes a mistake, $P(y^* \mid x) > P(\hat{y} \mid x)$. This suggests you should focus your attention on improving the search algorithm.
    • True
    • False

  6. Consider the attention model for machine translation.

    Further, here is the formula for $\alpha^{<t,t'>}$:

    $\alpha^{<t,t'>} = \frac{\exp(e^{<t,t'>})}{\sum_{t'=1}^{T_x} \exp(e^{<t,t'>})}$

    Which of the following statements about $\alpha^{<t,t'>}$ are true? Check all that apply.

    • We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t>}$. (Note the indices in the superscripts.)
    • We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t>}$ that are highly relevant to the value the network should output for $y^{<t'>}$. (Note the indices in the superscripts.)
    • $\sum_{t} \alpha^{<t,t'>} = 1$ (Note the summation is over $t$.)
    • $\sum_{t'} \alpha^{<t,t'>} = 1$ (Note the summation is over $t'$.)
  7. The network learns where to "pay attention" by learning the values $e^{<t,t'>}$, which are computed using a small neural network:
    We can't replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network. This is because $s^{<t>}$ depends on $\alpha^{<t,t'>}$, which in turn depends on $e^{<t,t'>}$; so at the time we need to evaluate this network, we haven't computed $s^{<t>}$ yet.

    • True
    • False

  8. Compared to the encoder-decoder model shown in Question 1 of this quiz (which does not use an attention mechanism), we expect the attention model to have the greatest advantage when:
    • The input sequence length $T_x$ is large.
    • The input sequence length $T_x$ is small.

  9. Under the CTC model, identical repeated characters not separated by the "blank" character (_) are collapsed. Under the CTC model, what does the following string collapse to? __c_oo_o_kk___b_ooooo__oo__kkk
    • cokbok
    • cookbook
    • cook book
    • coookkboooooookkk

  10. In trigger word detection, $x^{<t>}$ is:
    • Features of the audio (such as spectrogram features) at time $t$.
    • The $t$-th input word, represented as either a one-hot vector or a word embedding.
    • Whether the trigger word is being said at time $t$.
    • Whether someone has just finished saying the trigger word at time $t$.
