Andrew Ng Course Quiz - Course 5: Sequence Models - Week 2 Quiz - Natural Language Processing and Word Embeddings
-
Suppose you learn a word embedding for a vocabulary of 10,000 words. Then the embedding vectors should be 10,000-dimensional, so as to capture the full range of variation and meaning in those words.
- 【 】 True
- 【★】 False
-
What is t-SNE?
- 【★】 A non-linear dimensionality reduction technique.
- 【 】 A linear transformation that allows us to solve analogies on word vectors.
- 【 】 A supervised learning algorithm for learning word embeddings.
- 【 】 An open-source sequence modeling library.
-
Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing whether a short snippet of text expresses happiness, using a small training set.

x (input text) | y (happy?)
---|---
I'm feeling wonderful today! | 1
I'm bummed my cat is ill. | 0
Really enjoying this! | 1

Then even if the word "ecstatic" does not appear in your small training set, your RNN might reasonably be expected to recognize "I'm ecstatic" as deserving a label $y = 1$.
- 【★】 True
- 【 】 False
-
Which of these equations do you think should hold for a good word embedding? (Check all that apply)
- 【★】 $e_{boy} - e_{girl} \approx e_{brother} - e_{sister}$
- 【 】 $e_{boy} - e_{girl} \approx e_{sister} - e_{brother}$
- 【★】 $e_{boy} - e_{brother} \approx e_{girl} - e_{sister}$
- 【 】 $e_{boy} - e_{brother} \approx e_{sister} - e_{girl}$
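The intuition behind the two correct equations can be checked with a tiny numeric sketch. The 4-dimensional vectors below are made-up toy values (not real embeddings, and not from the course) in which one coordinate roughly encodes gender and another the boy/brother relation:

```python
# Toy 4-d "embeddings" (hypothetical values): coordinate 0 encodes gender,
# coordinates 1-2 encode the child/sibling distinction.
e = {
    "boy":     [ 1.0, 0.9, 0.1, 0.0],
    "girl":    [-1.0, 0.9, 0.1, 0.0],
    "brother": [ 1.0, 0.1, 0.9, 0.0],
    "sister":  [-1.0, 0.1, 0.9, 0.0],
}

def diff(a, b):
    """Component-wise difference e_a - e_b."""
    return [x - y for x, y in zip(e[a], e[b])]

# Both differences isolate the same gender direction:
print(diff("boy", "girl"))        # [2.0, 0.0, 0.0, 0.0]
print(diff("brother", "sister"))  # [2.0, 0.0, 0.0, 0.0]
```

With these toy values, $e_{boy} - e_{girl}$ equals $e_{brother} - e_{sister}$ exactly, and likewise $e_{boy} - e_{brother}$ equals $e_{girl} - e_{sister}$; in real embeddings the equality is only approximate.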
-
Let $E$ be an embedding matrix, and let $e_{1234}$ be a one-hot vector corresponding to word 1234. To get the embedding of word 1234, why don't we simply call $E * e_{1234}$ in Python?
- 【★】 It is computationally wasteful.
- 【 】 The correct formula is $E^T * e_{1234}$.
- 【 】 This doesn't handle unknown words (&lt;UNK&gt;).
- 【 】 None of the above: calling $E * e_{1234}$ as described above is the best approach.
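A scaled-down sketch of why the product is wasteful (10,000-word vocabulary but only 50 embedding dimensions so it runs quickly; the matrix values are toy numbers): the matrix-vector product performs vocabulary-size multiply-adds per output component just to copy out one column, while direct indexing reads the column straight away. In practice you would use an indexing or embedding-lookup operation instead of a multiplication.

```python
# Scaled-down sketch: E is (dim x vocab_size); multiplying by a one-hot
# vector does dim * vocab_size multiply-adds just to select one column.
vocab_size, dim = 10000, 50

# Toy embedding matrix: E[k][v] = component k of word v's embedding.
E = [[(k * vocab_size + v) * 1e-6 for v in range(vocab_size)]
     for k in range(dim)]

word = 1234
one_hot = [1.0 if v == word else 0.0 for v in range(vocab_size)]

# Wasteful: full matrix-vector product (500,000 multiply-adds here).
via_matmul = [sum(E[k][v] * one_hot[v] for v in range(vocab_size))
              for k in range(dim)]

# Cheap: just read out column 1234 (50 reads).
via_lookup = [E[k][word] for k in range(dim)]

assert via_matmul == via_lookup  # same result, vastly different cost
```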
-
When learning word embeddings, we create an artificial task of estimating $P(target \mid context)$. It is okay if we do poorly on this artificial prediction task; the more important by-product of this task is that we learn a useful set of word embeddings.
- 【★】 True
- 【 】 False
-
In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer.
- 【★】 $c$ and $t$ are chosen to be nearby words.
- 【 】 $c$ is the one word that comes immediately before $t$.
- 【 】 $c$ is the sequence of all the words in the sentence before $t$.
- 【 】 $c$ is a sequence of several words immediately before $t$.
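The "nearby words" choice can be sketched as a windowed pair generator. This is a simplified illustration (the function name and window size are assumptions; real word2vec implementations additionally subsample frequent words and sample pairs rather than enumerating them all):

```python
# Sketch of skip-gram (context, target) pair selection: for each context
# word c, every word t within a +/- window positions counts as a target.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, c in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((c, tokens[j]))  # (context, target)
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence)
for c, t in pairs[:4]:
    print(c, "->", t)
# the -> quick
# the -> brown
# quick -> the
# quick -> brown
```

Note that "fox" is never a target for the context "the": it lies three positions away, outside the window.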
-
Suppose you have a 10,000-word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function:

$$P(t \mid c) = \frac{e^{\theta_t^T e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^T e_c}}$$

Which of these statements are correct? (Check all that apply)
- 【★】 $\theta_t$ and $e_c$ are both 500-dimensional vectors.
- 【 】 $\theta_t$ and $e_c$ are both 10,000-dimensional vectors.
- 【★】 $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
- 【 】 After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.
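A scaled-down sketch of this softmax (50-dimensional vectors and a 100-word vocabulary instead of 500 and 10,000, with toy parameter values; in the real model both $\theta$ and $e$ would be learned by gradient descent or Adam rather than fixed):

```python
import math

# Toy parameters: theta[t] and e_c live in the same d-dimensional space,
# mirroring the fact that both are 500-dimensional in the quiz's setup.
dim, vocab = 50, 100
theta = [[math.sin(t * dim + k) for k in range(dim)] for t in range(vocab)]
e_c = [math.cos(k) for k in range(dim)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# P(t|c) = exp(theta_t . e_c) / sum_{t'} exp(theta_{t'} . e_c)
scores = [math.exp(dot(theta[t], e_c)) for t in range(vocab)]
Z = sum(scores)
p = [s / Z for s in scores]

assert abs(sum(p) - 1.0) < 1e-9  # a valid probability distribution
```

Note the practical point behind this formula: the denominator sums over the entire vocabulary, which is why full-softmax word2vec is expensive and tricks like negative sampling or a hierarchical softmax are used.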
-
Suppose you have a 10,000-word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective:

$$\min \sum_{i=1}^{10{,}000} \sum_{j=1}^{10{,}000} f(X_{ij}) \left(\theta_i^T e_j + b_i + b'_j - \log X_{ij}\right)^2$$

Which of these statements are correct? (Check all that apply)
- 【 】 $\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
- 【★】 $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
- 【★】 $X_{ij}$ is the number of times word $i$ appears in the context of word $j$.
- 【★】 The weighting function $f(\cdot)$ must satisfy $f(0) = 0$.

Note: $f(0) = 0$ is required so that terms with $X_{ij} = 0$, where $\log X_{ij}$ is undefined, drop out of the sum; the weighting function also helps prevent learning from being dominated by extremely common word pairs.
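A tiny sketch of the objective for a 3-word vocabulary (toy co-occurrence counts; $\theta$ and $e$ initialized randomly, as the correct option says, and the commonly used weighting $f(x) = (x/x_{max})^\alpha$ capped at 1 as an illustrative choice):

```python
import math
import random

random.seed(0)

V, d = 3, 2  # tiny vocabulary and embedding size for illustration
X = [[10, 2, 0],   # X[i][j]: co-occurrence count of word i in context of j
     [ 2, 5, 1],
     [ 0, 1, 4]]

# Random initialization (initializing to 0 would make all words identical).
theta = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
e     = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]
b  = [0.0] * V   # b_i
b2 = [0.0] * V   # b'_j

def f(x, x_max=100, alpha=0.75):
    """Weighting function with f(0) = 0, capped at 1 for frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss():
    total = 0.0
    for i in range(V):
        for j in range(V):
            if X[i][j] == 0:
                continue  # f(0) = 0: these terms contribute nothing,
                          # avoiding the undefined log(0)
            inner = sum(theta[i][k] * e[j][k] for k in range(d))
            total += f(X[i][j]) * (inner + b[i] + b2[j] - math.log(X[i][j])) ** 2
    return total

assert f(0) == 0.0
print(glove_loss())
```

The `continue` on zero counts is exactly what $f(0) = 0$ buys: pairs that never co-occur are skipped rather than producing $\log 0$.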
-
You have trained word embeddings using a text dataset of $m_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $m_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
- 【★】 $m_1 >> m_2$
- 【 】 $m_1 << m_2$