Andrew Ng Course Quiz - Course 5: Sequence Models - Week 2 Quiz - Natural Language Processing and Word Embeddings
-
Suppose you learn a word embedding for a vocabulary of 10,000 words. Then the embedding vectors should be 10,000-dimensional, so as to capture the full range of variation and meaning in those words.
- 【 】 True
- 【★】 False
-
What is t-SNE?
- 【★】 A non-linear dimensionality reduction technique.
- 【 】 A linear transformation that allows us to solve analogies on word vectors.
- 【 】 A supervised learning algorithm for learning word embeddings.
- 【 】 An open-source sequence modeling library.
-
Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing whether a short snippet of text expresses happiness, using a small training set.

x (input text) | y (happy?)
---|---
I'm feeling wonderful today! | 1
I'm bummed my cat is ill. | 0
Really enjoying this! | 1

Then even if the word "ecstatic" does not appear in your small training set, your RNN might reasonably be expected to recognize "I'm ecstatic" as deserving a label $y = 1$.
- 【★】 True
- 【 】 False
-
Which of these equations do you think should hold for a good word embedding? (Check all that apply)
- 【★】 $e_{boy} - e_{girl} \approx e_{brother} - e_{sister}$
- 【 】 $e_{boy} - e_{girl} \approx e_{sister} - e_{brother}$
- 【★】 $e_{boy} - e_{brother} \approx e_{girl} - e_{sister}$
- 【 】 $e_{boy} - e_{brother} \approx e_{sister} - e_{girl}$
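The intuition behind the two correct equations can be checked with a tiny numeric sketch. The 4-dimensional vectors below are made-up toy values (not real embeddings, and not from the course) in which one coordinate roughly encodes gender and another the boy/brother relation:

```python
# Toy 4-d "embeddings" (hypothetical values): coordinate 0 encodes gender,
# coordinates 1-2 encode the child/sibling distinction.
e = {
    "boy":     [ 1.0, 0.9, 0.1, 0.0],
    "girl":    [-1.0, 0.9, 0.1, 0.0],
    "brother": [ 1.0, 0.1, 0.9, 0.0],
    "sister":  [-1.0, 0.1, 0.9, 0.0],
}

def diff(a, b):
    """Component-wise difference e_a - e_b."""
    return [x - y for x, y in zip(e[a], e[b])]

# Both differences isolate the same gender direction:
print(diff("boy", "girl"))        # [2.0, 0.0, 0.0, 0.0]
print(diff("brother", "sister"))  # [2.0, 0.0, 0.0, 0.0]
```

With these toy values, $e_{boy} - e_{girl}$ equals $e_{brother} - e_{sister}$ exactly, and likewise $e_{boy} - e_{brother}$ equals $e_{girl} - e_{sister}$; in real embeddings the equality is only approximate.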
-
Let $E$ be an embedding matrix, and let $e_{1234}$ be a one-hot vector corresponding to word 1234. To get the embedding of word 1234, why don't we simply call $E * e_{1234}$ in Python?
- 【★】 It is computationally wasteful.
- 【 】 The correct formula is $E^T * e_{1234}$.
- 【 】 This doesn't handle unknown words (&lt;UNK&gt;).
- 【 】 None of the above: calling $E * e_{1234}$ as described above is the best approach.
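A scaled-down sketch of why the product is wasteful (10,000-word vocabulary but only 50 embedding dimensions so it runs quickly; the matrix values are toy numbers): the matrix-vector product performs vocabulary-size multiply-adds per output component just to copy out one column, while direct indexing reads the column straight away. In practice you would use an indexing or embedding-lookup operation instead of a multiplication.

```python
# Scaled-down sketch: E is (dim x vocab_size); multiplying by a one-hot
# vector does dim * vocab_size multiply-adds just to select one column.
vocab_size, dim = 10000, 50

# Toy embedding matrix: E[k][v] = component k of word v's embedding.
E = [[(k * vocab_size + v) * 1e-6 for v in range(vocab_size)]
     for k in range(dim)]

word = 1234
one_hot = [1.0 if v == word else 0.0 for v in range(vocab_size)]

# Wasteful: full matrix-vector product (500,000 multiply-adds here).
via_matmul = [sum(E[k][v] * one_hot[v] for v in range(vocab_size))
              for k in range(dim)]

# Cheap: just read out column 1234 (50 reads).
via_lookup = [E[k][word] for k in range(dim)]

assert via_matmul == via_lookup  # same result, vastly different cost
```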
-
When learning word embeddings, we create an artificial task of estimating $P(target \mid context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.
- 【★】 True
- 【 】 False
-
In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.
- 【★】 $c$ and $t$ are chosen to be nearby words.
- 【 】 $c$ is the one word that comes immediately before $t$.
- 【 】 $c$ is the sequence of all the words in the sentence before $t$.
- 【 】 $c$ is a sequence of several words immediately before $t$.
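The "nearby words" choice can be sketched as a windowed pair generator. This is a simplified illustration (the function name and window size are assumptions; real word2vec implementations additionally subsample frequent words and sample pairs rather than enumerating them all):

```python
# Sketch of skip-gram (context, target) pair selection: for each context
# word c, every word t within a +/- window positions counts as a target.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, c in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((c, tokens[j]))  # (context, target)
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence)
for c, t in pairs[:4]:
    print(c, "->", t)
# the -> quick
# the -> brown
# quick -> the
# quick -> brown
```

Note that "fox" is never a target for the context "the": it lies three positions away, outside the window.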
-
Suppose you have a 10,000-word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:

$$P(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$

Which of these statements are correct? (Check all that apply)
- 【★】 $\theta_t$ and $e_c$ are both 500-dimensional vectors.
- 【 】 $\theta_t$ and $e_c$ are both 10,000-dimensional vectors.
- 【★】 $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
- 【 】 After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.
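A scaled-down sketch of this softmax (50-dimensional vectors and a 100-word vocabulary instead of 500 and 10,000, with toy parameter values; in the real model both $\theta$ and $e$ would be learned by gradient descent or Adam rather than fixed):

```python
import math

# Toy parameters: theta[t] and e_c live in the same d-dimensional space,
# mirroring the fact that both are 500-dimensional in the quiz's setup.
dim, vocab = 50, 100
theta = [[math.sin(t * dim + k) for k in range(dim)] for t in range(vocab)]
e_c = [math.cos(k) for k in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# P(t|c) = exp(theta_t . e_c) / sum_{t'} exp(theta_{t'} . e_c)
scores = [math.exp(dot(theta[t], e_c)) for t in range(vocab)]
Z = sum(scores)
p = [s / Z for s in scores]

assert abs(sum(p) - 1.0) < 1e-9  # a valid probability distribution
```

Note the practical point behind this formula: the denominator sums over the entire vocabulary, which is why full-softmax word2vec is expensive and tricks like negative sampling or a hierarchical softmax are used.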
-
Suppose you have a 10,000-word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:

$$\min \sum_{i=1}^{10{,}000} \sum_{j=1}^{10{,}000} f(X_{ij}) \left(\theta_i^T e_j + b_i + b'_j - \log X_{ij}\right)^2$$

Which of these statements are correct? (Check all that apply)
- 【 】 $\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
- 【★】 $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
- 【★】 $X_{ij}$ is the number of times word $i$ appears in the context of word $j$.
- 【★】 The weighting function $f(\cdot)$ must satisfy $f(0) = 0$.

Note: $f(0) = 0$ is required so that terms with $X_{ij} = 0$, where $\log X_{ij}$ is undefined, drop out of the sum; the weighting function also helps prevent learning from being dominated by extremely common word pairs.
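A tiny sketch of the objective for a 3-word vocabulary (toy co-occurrence counts; $\theta$ and $e$ initialized randomly, as the correct option says, and the commonly used weighting $f(x) = (x/x_{max})^\alpha$ capped at 1 as an illustrative choice):

```python
import math
import random

random.seed(0)

V, d = 3, 2  # tiny vocabulary and embedding size for illustration
X = [[10, 2, 0],   # X[i][j]: co-occurrence count of word i in context of j
     [ 2, 5, 1],
     [ 0, 1, 4]]

# Random initialization (initializing to 0 would make all words identical).
theta = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
e     = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
b  = [0.0] * V   # b_i
b2 = [0.0] * V   # b'_j

def f(x, x_max=100, alpha=0.75):
    """Weighting function with f(0) = 0, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss():
    total = 0.0
    for i in range(V):
        for j in range(V):
            if X[i][j] == 0:
                continue  # f(0) = 0: these terms contribute nothing,
                          # avoiding the undefined log(0)
            inner = sum(theta[i][k] * e[j][k] for k in range(d))
            total += f(X[i][j]) * (inner + b[i] + b2[j] - math.log(X[i][j])) ** 2
    return total

assert f(0) == 0.0
print(glove_loss())
```

The `continue` on zero counts is exactly what $f(0) = 0$ buys: pairs that never co-occur are skipped rather than producing $\log 0$.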
-
You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
- 【★】 $m_1 >> m_2$
- 【 】 $m_1 << m_2$