Natural Language Processing & Word Embeddings
- True/False: Suppose you learn a word embedding for a vocabulary of 20000 words. Then the embedding vectors could be 1000 dimensional, so as to capture the full range of variation and meaning in those words.
- False
- True
Explanation: The dimension of word vectors is usually smaller than the size of the vocabulary; the most common sizes for word vectors range between 50 and 1000.
- True/False: t-SNE is a linear transformation that allows us to solve analogies on word vectors.
- False
- True
Explanation: t-SNE is a non-linear dimensionality reduction technique (see the sketch below).
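To make this concrete, here is a minimal visualization sketch, assuming scikit-learn is available and using random noise as a placeholder for real word vectors: t-SNE is handy for plotting embeddings in 2-D, but because its mapping is non-linear it does not preserve the vector arithmetic used to solve analogies.

```python
# Minimal sketch: t-SNE for 2-D visualization of word vectors (placeholder data).
# Assumes scikit-learn; `embeddings` is random noise standing in for a real
# (vocab_size, d) matrix of learned word vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 300))      # placeholder for real word vectors

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(embeddings)    # (200, 2) coordinates, for plotting only
print(points_2d.shape)
```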
- Suppose you download a pre-trained word embedding which has been trained on a huge corpus of text. You then use this word embedding to train an RNN for a language task of recognizing if someone is happy from a short snippet of text, using a small training set.
Then even if the word “ecstatic” does not appear in your small training set, your RNN might reasonably be expected to recognize “I’m ecstatic” as deserving a label $y = 1$.
- False
- True
Explanation: Word vectors empower your model with an incredible ability to generalize. The vector for “ecstatic” carries a positive/happy connotation, which will probably lead your model to classify the sentence as a “1” (improved generalization).
- Which of these equations do you think should hold for a good word embedding? (Check all that apply; an analogy sketch follows the options.)
- $e_{man} - e_{uncle} \approx e_{woman} - e_{aunt}$
- $e_{man} - e_{woman} \approx e_{uncle} - e_{aunt}$
- $e_{man} - e_{woman} \approx e_{aunt} - e_{uncle}$
- $e_{man} - e_{aunt} \approx e_{woman} - e_{uncle}$
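For the analogy question above, here is a minimal sketch with made-up toy 2-D vectors (not real embeddings): if $e_{man} - e_{woman} \approx e_{uncle} - e_{aunt}$ holds, then $e_{uncle} - e_{man} + e_{woman}$ should be closest to $e_{aunt}$, which is how analogies are solved with vector arithmetic and cosine similarity.

```python
# Minimal analogy sketch with toy 2-D "embeddings" (made-up values, not trained).
# If e_man - e_woman ≈ e_uncle - e_aunt, then e_uncle - e_man + e_woman ≈ e_aunt.
import numpy as np

emb = {
    "man":   np.array([1.0, 0.2]),
    "woman": np.array([1.0, 0.9]),
    "uncle": np.array([2.0, 0.2]),
    "aunt":  np.array([2.0, 0.9]),
    "king":  np.array([1.5, 0.1]),   # distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["uncle"] - emb["man"] + emb["woman"]          # should land near e_aunt
best = max((w for w in emb if w not in {"uncle", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)  # -> "aunt"
```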
- Let $A$ be an embedding matrix, and let $o_{4567}$ be a one-hot vector corresponding to word 4567. Then to get the embedding of word 4567, why don’t we call $A * o_{4567}$ in Python?
- It is computationally wasteful.
- The correct formula is $A^T * o_{4567}$
- None of the answers are correct: calling the Python snippet as described above is fine.
- This doesn’t handle unknown words.
Explanation: Multiplying the full embedding matrix by a one-hot vector is extremely inefficient; in practice you simply look up the corresponding column of $A$ (see the sketch below).
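As a sketch of the point (NumPy, with made-up sizes), multiplying by a one-hot vector performs a full matrix-vector product whose terms are almost all zero, while direct indexing retrieves the same embedding:

```python
# Minimal sketch (toy sizes): embedding lookup via a one-hot matrix-vector
# product vs. a direct column read. Both give the same vector; the product
# wastes O(emb_dim * vocab_size) multiply-adds on zeros.
import numpy as np

vocab_size, emb_dim = 10000, 300
A = np.random.randn(emb_dim, vocab_size)   # embedding matrix: one column per word

o_4567 = np.zeros(vocab_size)
o_4567[4567] = 1.0

e_slow = A @ o_4567      # full matrix-vector product
e_fast = A[:, 4567]      # direct lookup of the same column

assert np.allclose(e_slow, e_fast)
```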
- When learning word embeddings, words are automatically generated along with the surrounding words.
- True
- False
Explanation: We pick a given word and try to predict its surrounding words, or vice versa.
- In the word2vec algorithm, you estimate $P(t \mid c)$, where $t$ is the target word and $c$ is a context word. How are $t$ and $c$ chosen from the training set? Pick the best answer. (A sampling sketch follows the options.)
- $c$ is the sequence of all the words in the sentence before $t$
- $c$ and $t$ are chosen to be nearby words.
- $c$ is a sequence of several words immediately before $t$
- $c$ is the one word that comes immediately before $t$
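The following is a minimal sketch of the “nearby words” sampling described in the question above; the sentence and the ±2 window are made-up illustrations of how a (context, target) pair might be drawn.

```python
# Minimal sketch of skip-gram style (context, target) sampling: pick a context
# word, then pick a target uniformly from a small window around it. The
# sentence and window size are made-up for illustration.
import random

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 2

def sample_pair(words, window):
    c_idx = random.randrange(len(words))                  # position of the context word
    offsets = [o for o in range(-window, window + 1)
               if o != 0 and 0 <= c_idx + o < len(words)]
    t_idx = c_idx + random.choice(offsets)                # a nearby target word
    return words[c_idx], words[t_idx]

random.seed(0)
print([sample_pair(sentence, window) for _ in range(3)])
```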
- Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The word2vec model uses the following softmax function (a numerical sketch follows the options):
$$P(t \mid c) = \frac{e^{\theta_t^{T} e_c}}{\sum_{t'=1}^{10000} e^{\theta_{t'}^{T} e_c}}$$
Which of these statements are correct? Check all that apply.
- $\theta_t$ and $e_c$ are both trained with an optimization algorithm such as Adam or gradient descent.
- $\theta_t$ and $e_c$ are both 500 dimensional vectors.
- After training, we should expect $\theta_t$ to be very close to $e_c$ when $t$ and $c$ are the same word.
- $\theta_t$ and $e_c$ are both 10000 dimensional vectors.
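Here is a minimal numerical sketch of that softmax, with randomly initialized parameters standing in for trained values; it illustrates that $\theta_t$ and $e_c$ are both 500-dimensional and that $P(t \mid c)$ is a softmax over $\theta_t^T e_c$ across the 10000-word vocabulary.

```python
# Minimal sketch of the word2vec softmax P(t|c) = exp(theta_t . e_c) / sum_t' exp(theta_t' . e_c).
# Theta and E are randomly initialized stand-ins; in training both are updated
# with gradient descent or Adam.
import numpy as np

vocab_size, emb_dim = 10000, 500
rng = np.random.default_rng(0)
Theta = rng.normal(scale=0.01, size=(vocab_size, emb_dim))   # one theta_t per target word
E = rng.normal(scale=0.01, size=(vocab_size, emb_dim))       # one e_c per context word

c = 4567                                  # index of the context word
logits = Theta @ E[c]                     # theta_t^T e_c for every target word t
probs = np.exp(logits - logits.max())     # subtract max for numerical stability
probs /= probs.sum()                      # P(t | c) over the whole vocabulary
print(probs.shape, probs.sum())           # (10000,) 1.0
```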
- Suppose you have a 10000 word vocabulary, and are learning 500-dimensional word embeddings. The GloVe model minimizes this objective (a numerical sketch follows the options):
$$\min \sum_{i=1}^{10000} \sum_{j=1}^{10000} f(X_{ij}) \left(\theta_i^{T} e_j + b_i + b'_j - \log X_{ij}\right)^2$$
Which of these statements are correct? Check all that apply.
- $\theta_i$ and $e_j$ should be initialized to 0 at the beginning of training.
- $\theta_i$ and $e_j$ should be initialized randomly at the beginning of training.
- $X_{ij}$ is the number of times word $j$ appears in the context of word $i$.
- Theoretically, the weighting function $f(\cdot)$ must satisfy $f(0) = 0$.
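Below is a minimal sketch of evaluating that objective once (NumPy, toy sizes), using the commonly cited weighting $f(x) = \min((x/x_{max})^{0.75}, 1)$ as an assumed example; since $f(0) = 0$, pairs that never co-occur contribute nothing and the $\log X_{ij}$ term never has to be evaluated for them.

```python
# Minimal sketch of the GloVe objective
#   sum_ij f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2
# with a weighting f that satisfies f(0) = 0, so zero-count pairs drop out.
# Sizes and counts are toy values for illustration only.
import numpy as np

vocab_size, emb_dim = 50, 10
rng = np.random.default_rng(0)
X = rng.poisson(0.5, size=(vocab_size, vocab_size)).astype(float)  # toy co-occurrence counts
Theta = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))
b = np.zeros(vocab_size)         # b_i
b_prime = np.zeros(vocab_size)   # b'_j

def f(x, x_max=100.0, alpha=0.75):
    return np.minimum((x / x_max) ** alpha, 1.0)   # f(0) = 0

mask = X > 0
log_X = np.where(mask, np.log(np.where(mask, X, 1.0)), 0.0)  # safe log; unused where X_ij = 0
residual = Theta @ E.T + b[:, None] + b_prime[None, :] - log_X
loss = np.sum(f(X) * residual**2)   # f(0) = 0 removes the X_ij = 0 terms
print(loss)
```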
- You have trained word embeddings using a text dataset of $t_1$ words. You are considering using these word embeddings for a language task, for which you have a separate labeled dataset of $t_2$ words. Keeping in mind that using word embeddings is a form of transfer learning, under which of these circumstances would you expect the word embeddings to be helpful?
- When $t_1$ is equal to $t_2$
- When $t_1$ is smaller than $t_2$
- When $t_1$ is larger than $t_2$
Explanation: Transferring embeddings is most helpful for a new task with a smaller training set, i.e., when $t_1$ is larger than $t_2$.