CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part 1
Authors: Francois Chaubard et al.
This set of notes first introduces the GloVe model for training word vectors. It then extends our discussion of word vectors (interchangeably called word embeddings) by examining how they can be evaluated intrinsically and extrinsically. Along the way, we discuss word analogies as an example of an intrinsic evaluation technique and how it can be used to tune word embedding methods. We then discuss training model weights/parameters and word vectors for extrinsic tasks. Lastly, we motivate artificial neural networks as a class of models for natural language processing tasks.
1. Global Vectors for Word Representation
1.1 Comparison with Previous Methods
So far, we have looked at two main classes of methods for finding word embeddings. The first set are count-based and rely on matrix factorization. While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as analogy, indicating a sub-optimal vector space structure. The other set of methods are shallow window-based (e.g. the skip-gram and CBOW models), which learn word embeddings by making predictions in local context windows. These models demonstrate the capacity to capture complex linguistic patterns beyond word similarity, but fail to make use of global co-occurrence statistics.
In comparison, GloVe consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful sub-structure. It shows state-of-the-art performance on the word analogy task, and outperforms other current methods on several word similarity tasks.
1.2 Co-occurrence Matrix
Let $X$ denote the word-word co-occurrence matrix, where $X_{ij}$ indicates the number of times word $j$ occurs in the context of word $i$. Let $X_i = \sum_k X_{ik}$ be the number of times any word $k$ appears in the context of word $i$. Finally, let

$$P_{ij} = P(w_j \mid w_i) = \frac{X_{ij}}{X_i}$$

be the probability of word $j$ appearing in the context of word $i$.
Populating this matrix requires a single pass through the entire corpus to collect the statistics. For large corpora, this pass can be computationally expensive, but it is a one-time up-front cost.
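The single-pass construction described above can be sketched as follows. This is a minimal illustration, not the reference implementation; the toy corpus, the symmetric window, and the function name `build_cooccurrence` are assumptions for demonstration.

```python
# Minimal sketch: build the co-occurrence matrix X in one pass over the corpus,
# using a symmetric context window (the window size here is illustrative).
from collections import defaultdict

def build_cooccurrence(corpus, window=2):
    """Count X[i][j]: how often word j appears within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[word][sentence[ctx_pos]] += 1
    return X

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
X = build_cooccurrence(corpus, window=1)

# From the counts we can recover P(j | i) = X_ij / X_i,
# e.g. the probability of "sat" appearing in the context of "cat":
X_cat = sum(X["cat"].values())          # X_i = sum_k X_ik
p_sat_given_cat = X["cat"]["sat"] / X_cat
```

Note that with a window of 1, "cat" co-occurs once with "the" and once with "sat", so $P(\text{sat} \mid \text{cat}) = 1/2$.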
1.3 Least Squares Objective
Recall that for the skip-gram model, we use softmax to compute the probability of word $j$ appearing in the context of word $i$:

$$Q_{ij} = \frac{\exp(\vec{u}_j^\top \vec{v}_i)}{\sum_{w=1}^{W} \exp(\vec{u}_w^\top \vec{v}_i)}$$
Training proceeds in an online, stochastic fashion, but the implied global cross-entropy loss can be calculated as:

$$J = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q_{ij}$$
As the same words $i$ and $j$ can appear multiple times in the corpus, it is more efficient to first group together the same values for $i$ and $j$:

$$J = -\sum_{i=1}^{W} \sum_{j=1}^{W} X_{ij} \log Q_{ij}$$
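The grouped cross-entropy loss $J = -\sum_{ij} X_{ij} \log Q_{ij}$, with $Q_{ij}$ a softmax over inner products, can be computed as below. The vocabulary size, embedding dimension, and random toy counts are illustrative assumptions.

```python
# Minimal sketch: global cross-entropy loss weighted by co-occurrence counts.
import numpy as np

rng = np.random.default_rng(1)
W, d = 4, 3                                          # toy vocab size, embedding dim
X = rng.integers(0, 3, size=(W, W)).astype(float)    # toy co-occurrence counts X_ij
U = 0.1 * rng.standard_normal((W, d))                # context ("output") vectors u_j
V = 0.1 * rng.standard_normal((W, d))                # center ("input") vectors v_i

scores = V @ U.T                                     # scores[i, j] = u_j . v_i
Q = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
J = -np.sum(X * np.log(Q))                           # grouped cross-entropy loss
```

The row-wise normalization in `Q` is exactly the expensive summation over the vocabulary that motivates the least squares objective below.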
where the co-occurring frequencies are given by the co-occurrence matrix $X$. One significant drawback of the cross-entropy loss is that it requires the distribution $Q$ to be properly normalized, which involves an expensive summation over the entire vocabulary. Instead, we use a least squares objective in which the normalization factors in $P$ and $Q$ are discarded:

$$\hat{J} = \sum_{i=1}^{W} \sum_{j=1}^{W} X_i \left(\hat{P}_{ij} - \hat{Q}_{ij}\right)^2$$

where $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(\vec{u}_j^\top \vec{v}_i)$ are the unnormalized distributions.
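The unnormalized least squares objective, with $\hat{P}_{ij} = X_{ij}$ and $\hat{Q}_{ij} = \exp(\vec{u}_j^\top \vec{v}_i)$, can be sketched as follows; the toy sizes, random counts, and the function name `glove_least_squares` are illustrative assumptions.

```python
# Minimal sketch: weighted least squares objective with normalization discarded,
# J_hat = sum_ij X_i * (X_ij - exp(u_j . v_i))^2.
import numpy as np

rng = np.random.default_rng(0)
W, d = 5, 3                                          # toy vocab size, embedding dim
X = rng.integers(0, 4, size=(W, W)).astype(float)    # toy co-occurrence counts X_ij
U = 0.1 * rng.standard_normal((W, d))                # context ("output") vectors u_j
V = 0.1 * rng.standard_normal((W, d))                # center ("input") vectors v_i

def glove_least_squares(X, U, V):
    X_i = X.sum(axis=1, keepdims=True)               # X_i = sum_k X_ik, per row
    Q_hat = np.exp(V @ U.T)                          # Q_hat_ij = exp(u_j . v_i)
    return np.sum(X_i * (X - Q_hat) ** 2)            # weighted squared error

loss = glove_least_squares(X, U, V)
```

Unlike the cross-entropy loss, no summation over the vocabulary is needed to normalize $\hat{Q}$; each term depends only on a single inner product.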
1.4 Conclusion
In conclusion, the GloVe model efficiently leverages global statistical information by training only on the nonzero elements of a word-word co-occurrence matrix, and produces a vector space with meaningful sub-structure. It consistently outperforms word2vec on the word analogy task, given the same corpus, vocabulary, window size, and training time. It achieves better results faster, and also obtains the best results irrespective of speed.
2. Evaluation of Word Vectors
(To be added.)