Deep Learning Course 5, Week 3 (Sequence Models & Attention Mechanism) — Quiz Solutions

  1. Consider using this encoder-decoder model for machine translation.
    [Figure: encoder-decoder model for machine translation]
    True/False: This model is a "conditional language model" in the sense that the decoder portion (shown in green) is modeling the probability of the input sentence $x$.
  • True
  • False

Explanation: The encoder-decoder model for machine translation models the probability of the output sentence $y$ conditioned on the input sentence $x$, so the statement, which claims the decoder models the probability of the input sentence $x$, is false. In the figure, the encoder portion is shown in green, while the decoder portion is shown in purple.
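For reference, the decoder factorizes the conditional probability of the output sentence one token at a time (notation as in the lectures):

```latex
% Conditional language model implemented by the decoder:
% the probability of the output sentence y given the input sentence x
% factorizes over output time steps.
P\!\left(y^{<1>}, \dots, y^{<T_y>} \mid x\right)
  = \prod_{t=1}^{T_y} P\!\left(y^{<t>} \mid x,\, y^{<1>}, \dots, y^{<t-1>}\right)
```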

  2. In beam search, if you increase the beam width B, which of the following would you expect to be true?
  • Beam search will converge after fewer steps.
  • Beam search will use up less memory.
  • Beam search will generally find better solutions (i.e. do a better job maximizing $P(y \mid x)$).
  • Beam search will run more quickly.

Explanation: As the beam width increases, beam search runs more slowly, uses more memory, and converges after more steps, but it generally finds better solutions, because a larger beam keeps more candidate partial translations alive at each step.
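A minimal beam-search sketch over a generic next-token scorer (the function name `log_prob_next` and the vocabulary handling are illustrative assumptions, not the course's code). It shows how the beam width B controls how many partial hypotheses are kept, and hence memory use and run time:

```python
import heapq

def beam_search(log_prob_next, vocab, B=3, max_len=20, eos="<eos>"):
    """Keep the B highest-scoring partial sequences at every step.

    log_prob_next(prefix, word) -> log P(word | x, prefix) is assumed to be
    provided by the decoder RNN; this sketch only shows the search itself.
    """
    beams = [(0.0, [])]          # (cumulative log-probability, prefix)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for word in vocab:
                candidates.append((score + log_prob_next(prefix, word),
                                   prefix + [word]))
        # A larger B keeps more candidates alive: slower and more memory,
        # but less likely to prune away the best overall sequence.
        beams = heapq.nlargest(B, candidates, key=lambda c: c[0])
        completed += [b for b in beams if b[1][-1] == eos]
        beams = [b for b in beams if b[1][-1] != eos]
        if not beams:
            break
    return max(completed + beams, key=lambda c: c[0])
```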

  3. In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.
  • True
  • False
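The answer follows from the objective: every additional word multiplies in another probability less than 1, so unnormalized beam search favors short outputs. The length-normalized objective from the lectures (with an optional softening exponent $\alpha$, commonly around 0.7) divides by the output length:

```latex
% Unnormalized objective (prefers short y, since each factor is < 1):
\arg\max_{y} \; \sum_{t=1}^{T_y} \log P\!\left(y^{<t>} \mid x,\, y^{<1>}, \dots, y^{<t-1>}\right)

% Length-normalized objective used with beam search:
\arg\max_{y} \; \frac{1}{T_y^{\alpha}} \sum_{t=1}^{T_y} \log P\!\left(y^{<t>} \mid x,\, y^{<1>}, \dots, y^{<t-1>}\right)
```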
  4. Suppose you are building a speech recognition system, which uses an RNN model to map from audio clip $x$ to a text transcript $y$. Your algorithm uses beam search to try to find the value of $y$ that maximizes $P(y \mid x)$.
    On a dev set example, given an input audio clip, your algorithm outputs the transcript $\hat{y}$ = "I'm building an A Eye system in Silly con Valley.", whereas a human gives a much superior transcript $y^*$ = "I'm building an AI system in Silicon Valley." According to your model,
    $P(\hat{y} \mid x) = 1.09 \times 10^{-7}$
    $P(y^* \mid x) = 7.21 \times 10^{-8}$
    Would you expect increasing the beam width B to help correct this example?
  • Yes, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
  • Yes, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
  • No, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
  • No, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
  5. Continuing the example from Q4, suppose you work on your algorithm for a few more weeks, and now find that for the vast majority of examples on which your algorithm makes a mistake, $P(y^* \mid x) > P(\hat{y} \mid x)$. This suggests you should focus your attention on improving the search algorithm.
  • True
  • False
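A sketch of the error-attribution rule from the lectures, applied per mistaken dev-set example (the function and variable names are illustrative): when the model assigns the human transcript $y^*$ a higher probability than the beam-search output $\hat{y}$, the search is at fault and a larger beam width may help; when $P(y^* \mid x) \le P(\hat{y} \mid x)$, as in Q4, the RNN itself is at fault and increasing B is not expected to help.

```python
def attribute_errors(examples):
    """Count whether beam search or the RNN is at fault on mistaken dev examples.

    Each example is a tuple (p_human, p_beam) holding the model's probabilities
    P(y* | x) and P(y_hat | x) for a case where y_hat != y*.
    """
    search_at_fault = sum(p_human > p_beam for p_human, p_beam in examples)
    rnn_at_fault = len(examples) - search_at_fault
    if search_at_fault > rnn_at_fault:
        return "Focus on the search: try a larger beam width B."
    return "Focus on the RNN model itself (more data, better architecture, etc.)."

# Q4's single example: P(y*|x) = 7.21e-8 <= P(y_hat|x) = 1.09e-7,
# so the RNN, not beam search, is responsible for that mistake.
print(attribute_errors([(7.21e-8, 1.09e-7)]))
```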
  6. Consider the attention model for machine translation.
    [Figure: attention model for machine translation]
    Which of the following statements about $\alpha^{<t,t'>}$ are true? Check all that apply.
  • $\sum_{t'} \alpha^{<t,t'>} = 1$
  • We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t'>}$. (Note the indices in the superscripts.)
  • $\alpha^{<t,t'>}$ is equal to the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$.
  • $\sum_{t'} \alpha^{<t,t'>} = 0$
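The first statement holds by construction: in the lectures the attention weights are a softmax over the alignment scores $e^{<t,t'>}$, so they are non-negative and sum to 1 over $t'$:

```latex
% Attention weights are a softmax over the alignment scores e^{<t,t'>}:
\alpha^{<t,t'>} = \frac{\exp\!\left(e^{<t,t'>}\right)}
                       {\sum_{t''=1}^{T_x} \exp\!\left(e^{<t,t''>}\right)},
\qquad
\text{context: } c^{<t>} = \sum_{t'=1}^{T_x} \alpha^{<t,t'>} \, a^{<t'>}
```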
  7. The network learns where to "pay attention" by learning the values $e^{<t,t'>}$, which are computed using a small neural network:
    We can replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network because $s^{<t>}$ is independent of $\alpha^{<t,t'>}$ and $e^{<t,t'>}$.
  • True
  • False
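A NumPy sketch of the small alignment network from the lectures (layer sizes and variable names are illustrative assumptions). Note that it takes the decoder's previous hidden state $s^{<t-1>}$, not $s^{<t>}$: $s^{<t>}$ is only computed after the attention weights, so it is not independent of them and the statement above is false.

```python
import numpy as np

def attention_weights(s_prev, a, W1, b1, w2, b2):
    """Compute e<t,t'> for every encoder position t', then softmax to alpha<t,t'>.

    s_prev : decoder hidden state s^{<t-1>}, shape (n_s,)
    a      : encoder (BRNN) activations a^{<t'>}, shape (T_x, n_a)
    The one-hidden-layer network below mirrors the "small neural network"
    from the lectures; the exact sizes are illustrative.
    """
    T_x = a.shape[0]
    # Concatenate s^{<t-1>} with each a^{<t'>} and pass through a small MLP.
    inputs = np.concatenate([np.repeat(s_prev[None, :], T_x, axis=0), a], axis=1)
    hidden = np.tanh(inputs @ W1 + b1)   # shape (T_x, n_hidden)
    e = hidden @ w2 + b2                 # shape (T_x,), one score e<t,t'> per t'
    # Softmax over t' gives the attention weights alpha^{<t,t'>}.
    alpha = np.exp(e - e.max())
    return alpha / alpha.sum()
```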
  8. Compared to the encoder-decoder model shown in Question 1 of this quiz (which does not use an attention mechanism), we expect the attention model to have the least advantage when:
  • The input sequence length $T_x$ is small.
  • The input sequence length $T_x$ is large.

Explanation: The plain encoder-decoder model works quite well on short sentences, where a fixed-length encoding can summarize the input adequately. The real advantage of the attention model appears when the input sentence is long.

  9. Under the CTC model, identical repeated characters not separated by the "blank" character (_) are collapsed. Under the CTC model, what does the following string collapse to?

__c_oo_o_kk___b_ooooo__oo__kkk

  • cokbok
  • cookbook
  • coookkboooooookkk
  • cook book
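A sketch of the CTC collapsing rule (the blank symbol and function name are illustrative): first merge runs of identical characters, which are only kept distinct when a blank separates them, then remove the blanks.

```python
from itertools import groupby

def ctc_collapse(s, blank="_"):
    """Collapse a CTC output string: merge repeated characters, then drop blanks."""
    # groupby merges consecutive identical characters into a single one;
    # repeats separated by a blank survive as separate characters.
    merged = "".join(ch for ch, _ in groupby(s))
    return merged.replace(blank, "")

# "__c_oo_o_kk___b_ooooo__oo__kkk" -> "cookbook"
print(ctc_collapse("__c_oo_o_kk___b_ooooo__oo__kkk"))
```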
  10. In trigger word detection, $x^{<t>}$ is:
  • Whether someone has just finished saying the trigger word at time $t$.
  • Features of the audio (such as spectrogram features) at time $t$.
  • Whether the trigger word is being said at time $t$.
  • The $t$-th input word, represented as either a one-hot vector or a word embedding.
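For concreteness, a minimal sketch of computing spectrogram features from a raw audio clip with SciPy (the synthetic audio, sampling rate, and window settings are illustrative assumptions); each column of the spectrogram is one input vector $x^{<t>}$:

```python
import numpy as np
from scipy.signal import spectrogram

# Illustrative: 10 seconds of synthetic audio at 16 kHz stands in for a real clip.
fs = 16000
audio = np.random.randn(10 * fs).astype(np.float32)

# Each column of Sxx (one per time frame) is a feature vector x^{<t>}
# fed to the trigger-word detection RNN.
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=400, noverlap=200)
print(Sxx.shape)  # (n_freq_bins, T_x) -- frequency features over time frames
```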