- Consider using this encoder-decoder model for machine translation.
True/False: This model is a “conditional language model” in the sense that the decoder portion (shown in green) is modeling the probability of the input sentence $x$.
- True
- False
Explanation: The encoder-decoder model for machine translation models the probability of the output sentence $y$ conditioned on the input sentence $x$. The encoder portion is shown in green, while the decoder portion is shown in purple.
- In beam search, if you increase the beam width B, which of the following would you expect to be true?
- Beam search will converge after fewer steps.
- Beam search will use up less memory.
- Beam search will generally find better solutions (i.e. do a better job maximizing $P(y \mid x)$).
- Beam search will run more quickly.
Explanation: As the beam width increases, beam search runs more slowly, uses up more memory, and converges after more steps, but generally finds better solutions.
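The width/quality trade-off can be seen on a toy example. The sketch below is a minimal beam search over a made-up two-step conditional model (the probabilities in `COND` are invented purely for illustration): greedy search ($B=1$) commits to the locally best first word and misses the globally best sequence, while $B=2$ finds it.

```python
import math

# Toy conditional model: P(word | prefix). These probabilities are
# made up purely to illustrate the effect of the beam width B.
COND = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def beam_search(B, steps=2):
    """Keep the B highest log-probability partial sequences at each step."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, lp in beams:
            for word, p in COND[prefix].items():
                candidates.append((prefix + (word,), lp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0]

seq1, lp1 = beam_search(B=1)  # greedy commits to "a" first: P = 0.6 * 0.5 = 0.30
seq2, lp2 = beam_search(B=2)  # keeps "b" alive and finds ("b", "x"): P = 0.36
```

A larger `B` can only widen the set of candidates considered at each step, which is why quality tends to improve at the cost of speed and memory.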
- In machine translation, if we carry out beam search without using sentence normalization, the algorithm will tend to output overly short translations.
- True
- False
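Why unnormalized beam search favors short outputs can be shown numerically: every extra word multiplies in another factor less than 1, so summed log-probabilities penalize length. A minimal sketch (the two candidate word-probability lists are hypothetical):

```python
import math

# Hypothetical per-word probabilities for two candidate translations.
short = [0.5, 0.5]               # 2-word candidate:  P = 0.25
long_ = [0.6, 0.6, 0.6, 0.6]     # 4-word candidate:  P = 0.1296

def raw_score(probs):
    """Unnormalized objective: sum of log-probabilities."""
    return sum(math.log(p) for p in probs)

def normalized_score(probs, alpha=1.0):
    """Length-normalized objective: divide by T_y^alpha so longer
    outputs are not penalized merely for having more factors < 1."""
    return raw_score(probs) / (len(probs) ** alpha)

print(raw_score(short) > raw_score(long_))                 # short candidate wins
print(normalized_score(short) < normalized_score(long_))   # long candidate wins
```

Without normalization the shorter candidate wins even though its per-word probabilities are lower; dividing by the output length removes that bias.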
- Suppose you are building a speech recognition system, which uses an RNN model to map from audio clip $x$ to a text transcript $y$. Your algorithm uses beam search to try to find the value of $y$ that maximizes $P(y \mid x)$. On a dev set example, given an input audio clip, your algorithm outputs the transcript $\hat{y}$ = “I’m building an A Eye system in Silly con Valley.”, whereas a human gives a much superior transcript $y^*$ = “I’m building an AI system in Silicon Valley.” According to your model,
$P(\hat{y} \mid x) = 1.09 \times 10^{-7}$
$P(y^* \mid x) = 7.21 \times 10^{-8}$
Would you expect increasing the beam width B to help correct this example?
- Yes, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
- Yes, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
- No, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the search algorithm rather than to the RNN.
- No, because $P(y^* \mid x) \le P(\hat{y} \mid x)$ indicates the error should be attributed to the RNN rather than to the search algorithm.
- Continuing the example from Q4, suppose you work on your algorithm for a few more weeks, and now find that for the vast majority of examples on which your algorithm makes a mistake, $P(y^* \mid x) > P(\hat{y} \mid x)$. This suggests you should focus your attention on improving the search algorithm.
- True
- False
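The attribution rule in Q4 and Q5 is mechanical enough to script. A minimal sketch (the probability values below are hypothetical dev-set mistakes, not from the quiz): for each error, compare the model's probability for the human transcript $y^*$ against its probability for the beam-search output $\hat{y}$.

```python
# Hypothetical dev-set error cases: the model's probability for the
# human transcript y* and for the algorithm's output y^ on each mistake.
errors = [
    {"p_ystar": 7.21e-8, "p_yhat": 1.09e-7},  # y* <= y^: RNN at fault
    {"p_ystar": 3.0e-6,  "p_yhat": 1.0e-6},   # y* >  y^: search at fault
    {"p_ystar": 5.5e-7,  "p_yhat": 2.0e-7},   # y* >  y^: search at fault
]

# P(y*|x) > P(y^|x) means beam search failed to find the sequence the
# model itself prefers, so the search (e.g. the beam width) is at fault.
search_faults = sum(e["p_ystar"] > e["p_yhat"] for e in errors)
rnn_faults = len(errors) - search_faults

focus = "search" if search_faults > rnn_faults else "RNN"
print(focus)
```

Counting which case dominates across the error set tells you whether to spend time on the search algorithm or on the RNN model itself.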
- Consider the attention model for machine translation.
Which of the following statements about $\alpha^{<t,t'>}$ are true? Check all that apply.
- $\sum_{t'} \alpha^{<t,t'>} = 1$
- We expect $\alpha^{<t,t'>}$ to be generally larger for values of $a^{<t'>}$ that are highly relevant to the value the network should output for $y^{<t'>}$. (Note the indices in the superscripts.)
- $\alpha^{<t,t'>}$ is equal to the amount of attention $y^{<t>}$ should pay to $a^{<t'>}$.
- $\sum_{t'} \alpha^{<t,t'>} = 0$
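The first option holds because the attention weights for a fixed output step $t$ are produced by a softmax over the alignment scores $e^{<t,t'>}$, so they are non-negative and sum to 1 by construction. A minimal sketch (the score values are arbitrary):

```python
import math

def attention_weights(e):
    """Softmax over alignment scores e^<t,t'> for one output step t,
    producing attention weights alpha^<t,t'> that sum to 1."""
    m = max(e)                              # subtract max for numerical stability
    exps = [math.exp(v - m) for v in e]
    total = sum(exps)
    return [v / total for v in exps]

alpha = attention_weights([2.0, 0.5, -1.0])  # arbitrary scores for 3 input steps
print(sum(alpha))  # 1.0 up to floating-point rounding
```

Higher scores get exponentially more weight, but the normalization guarantees $\sum_{t'} \alpha^{<t,t'>} = 1$ regardless of the score values.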
- The network learns where to “pay attention” by learning the values $e^{<t,t'>}$, which are computed using a small neural network:
We can replace $s^{<t-1>}$ with $s^{<t>}$ as an input to this neural network because $s^{<t>}$ is independent of $\alpha^{<t,t'>}$ and $e^{<t,t'>}$.
- True
- False
- Compared to the encoder-decoder model shown in Question 1 of this quiz (which does not use an attention mechanism), we expect the attention model to have the least advantage when:
- The input sequence length $T_x$ is small.
- The input sequence length $T_x$ is large.
Explanation: The encoder-decoder model works quite well with short sentences. The true advantage of the attention model appears when the input sentence is long.
- Under the CTC model, identical repeated characters not separated by the “blank” character (_) are collapsed. What does the following string collapse to?
__c_oo_o_kk___b_ooooo__oo__kkk
- cokbok
- cookbook
- coookkboooooookkk
- cook book
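The CTC collapsing rule (merge repeated characters, then drop blanks) is simple to sketch in code, assuming `_` denotes the blank character as in the question:

```python
def ctc_collapse(s, blank="_"):
    """Collapse a CTC output string: merge runs of identical repeated
    characters, then remove the blank characters."""
    out = []
    prev = None
    for ch in s:
        if ch != prev:       # keep only the first character of a repeated run
            if ch != blank:  # drop blanks after de-duplication
                out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("__c_oo_o_kk___b_ooooo__oo__kkk"))  # cookbook
```

Note the order matters: repeats are merged before blanks are removed, which is why `oo_o` collapses to `oo` (two o's separated by a blank survive) rather than a single `o`.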
- In trigger word detection, $x^{<t>}$ is:
- Whether someone has just finished saying the trigger word at time $t$.
- Features of the audio (such as spectrogram features) at time $t$.
- Whether the trigger word is being said at time $t$.
- The $t$-th input word, represented as either a one-hot vector or a word embedding.