CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part 1
Authors: Francois Chaubard et al.
This set of notes first introduces the GloVe model for training word vectors. It then extends our discussion of word vectors (interchangeably called word embeddings) by examining how they can be evaluated intrinsically and extrinsically. Along the way, we discuss word analogies as an example of an intrinsic evaluation technique and how it can be used to tune word embedding methods. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly, we motivate artificial neural networks as a class of models for natural language processing tasks.
1. Global Vectors for Word Representation
1.1 Comparison with Previous Methods
So far, we have looked at two main classes of methods for finding word embeddings. The first set are count-based and rely on matrix factorization. While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as analogy, indicating a sub-optimal vector space structure. The other set of methods are shallow window-based (e.g. the skip-gram and CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of global co-occurrence statistics.
In comparison, GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful sub-structure. It shows state-of-the-art performance on the word analogy task, and outperforms other current methods on several word similarity tasks.
1.2 Co-occurrence Matrix
Let $X$ denote the word-word co-occurrence matrix, where $X_{ij}$ indicates the number of times word $j$ occurs in the context of word $i$. Let $X_i = \sum_k X_{ik}$ be the number of times any word $k$ appears in the context of word $i$. Finally, let

$$P_{ij} = P(w_j \mid w_i) = \frac{X_{ij}}{X_i}$$

be the probability of word $j$ appearing in the context of word $i$.
Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
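The single-pass construction described above can be sketched as follows. This is a minimal illustration, not the reference implementation; the toy corpus, the symmetric window, and the function name `build_cooccurrence` are assumptions for demonstration.

```python
# Minimal sketch: build the co-occurrence matrix X in one pass over the corpus,
# using a symmetric context window (the window size here is illustrative).
from collections import defaultdict

def build_cooccurrence(corpus, window=2):
    """Count X[i][j]: how often word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[word][sentence[ctx_pos]] += 1
    return X

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
X = build_cooccurrence(corpus, window=1)

# From the counts we can recover P(j | i) = X_ij / X_i,
# e.g. the probability of "sat" appearing in the context of "cat":
X_cat = sum(X["cat"].values())          # X_i = sum_k X_ik
p_sat_given_cat = X["cat"]["sat"] / X_cat
```

Note that with a window of 1, "cat" co-occurs once with "the" and once with "sat", so $P(\text{sat} \mid \text{cat}) = 1/2$.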
1.3 Least Squares Objective
Recall that for the skip-gram model, we use softmax to compute the probability of word $j$ appearing in the context of word $i$:

$$Q_{ij} = \frac{\exp(\vec{u}_j^\top \vec{v}_i)}{\sum_{w=1}^{W} \exp(\vec{u}_w^\top \vec{v}_i)}$$
Training proceeds in an online, stochastic fashion, but the implied global cross-entropy loss can be calculated as:

$$J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$$
As the same words $i$ and $j$ can appear multiple times in the corpus, it is more efficient to first group together the same values for $i$ and $j$:

$$J = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$
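The grouped cross-entropy loss $J = -\sum_{ij} X_{ij} \log Q_{ij}$, with $Q_{ij}$ a softmax over inner products, can be computed as below. The vocabulary size, embedding dimension, and random toy counts are illustrative assumptions.

```python
# Minimal sketch: global cross-entropy loss weighted by co-occurrence counts.
import numpy as np

rng = np.random.default_rng(1)
W, d = 4, 3                                          # toy vocab size, embedding dim
X = rng.integers(0, 3, size=(W, W)).astype(float)    # toy co-occurrence counts X_ij
U = 0.1 * rng.standard_normal((W, d))                # context ("output") vectors u_j
V = 0.1 * rng.standard_normal((W, d))                # center ("input") vectors v_i

scores = V @ U.T                                     # scores[i, j] = u_j . v_i
Q = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
J = -np.sum(X * np.log(Q))                           # grouped cross-entropy loss
```

The row-wise normalization in `Q` is exactly the expensive summation over the vocabulary that motivates the least squares objective below.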
where the co-occurring frequencies are given by the co-occurrence matrix $X$. One significant drawback of the cross-entropy loss is that it requires the distribution $Q$ to be properly normalized, which involves an expensive summation over the entire vocabulary. Instead, we use a least squares objective in which the normalization factors in $P$ and $Q$ are discarded:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^2$$

where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(\vec{u}_j^\top \vec{v}_i)$ are the unnormalized distributions.
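The unnormalized least squares objective, with $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(\vec{u}_j^\top \vec{v}_i)$, can be sketched as follows; the toy sizes, random counts, and the function name `glove_least_squares` are illustrative assumptions.

```python
# Minimal sketch: weighted least squares objective with normalization discarded,
# J_hat = sum_ij X_i * (X_ij - exp(u_j . v_i))^2.
import numpy as np

rng = np.random.default_rng(0)
W, d = 5, 3                                          # toy vocab size, embedding dim
X = rng.integers(0, 4, size=(W, W)).astype(float)    # toy co-occurrence counts X_ij
U = 0.1 * rng.standard_normal((W, d))                # context ("output") vectors u_j
V = 0.1 * rng.standard_normal((W, d))                # center ("input") vectors v_i

def glove_least_squares(X, U, V):
    X_i = X.sum(axis=1, keepdims=True)               # X_i = sum_k X_ik, per row
    Q_hat = np.exp(V @ U.T)                          # Q_hat_ij = exp(u_j . v_i)
    return np.sum(X_i * (X - Q_hat) ** 2)            # weighted squared error

loss = glove_least_squares(X, U, V)
```

Unlike the cross-entropy loss, no summation over the vocabulary is needed to normalize $\hat{Q}$; each term depends only on a single inner product.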
1.4 Conclusion
In conclusion, the GloVe model efficiently leverages global statistical information by training only on the nonzero elements of a word-word co-occurrence matrix, and produces a vector space with meaningful sub-structure. It consistently outperforms word2vec on the word analogy task, given the same corpus, vocabulary, window size, and training time. It achieves better results faster, and also obtains the best results irrespective of speed.
2. Evaluation of Word Vectors
(To be added.)