Making sense of word2vec

Reposted from: http://rare-technologies.com/making-sense-of-word2vec/

Reposter's note: this is from the Gensim author's blog and is mainly a comparison of word2vec and GloVe. Unlike Licstar's post, it does not spend much time on the models themselves. What interested me most is the final part on Levy & Goldberg's exploration of simple matrix factorization strategies (SVD and the like) reaching results no worse than word2vec's; Ronan Collobert has a similar study in which a PCA variant achieved embedding quality on par with deep learning.

One year ago, Tomáš Mikolov (together with his colleagues at Google) made some ripples by releasing word2vec, an unsupervised algorithm for learning the meaning behind words. In this blog post, I’ll evaluate some extensions that have appeared over the year.

In case you missed the buzz, word2vec was widely featured as a member of the “new wave” of machine learning algorithms based on neural networks, commonly referred to as deep learning (though word2vec itself is rather shallow). Using large amounts of unannotated plain text, word2vec learns relationships between words automatically. The output is one vector per word, with remarkable linear relationships that allow us to do things like vec(“king”) – vec(“man”) + vec(“woman”) =~ vec(“queen”), or vec(“Montreal Canadiens”) – vec(“Montreal”) + vec(“Toronto”) resembling the vector for “Toronto Maple Leafs”.

Apparently crows are good at that stuff, too: Crows Can Understand Analogies.

Check out my online word2vec demo and the blog series on optimizing word2vec in Python for more background.

So, what’s changed?

For one, Tomáš Mikolov no longer works for Google :-)

More relevantly, there was a lovely piece of research done by the good people at Stanford: Jeffrey Pennington, Richard Socher and Christopher Manning. They explicitly identified the objective that word2vec optimizes through its async stochastic gradient backpropagation algorithm, and neatly connected it to the well-established field of matrix factorizations.

And in case you’ve never heard of those — in short, word2vec ultimately learns word vectors and word context vectors. These can be viewed as two 2D matrices (of floats), of size #words x #dim each. Their method, GloVe (Global Vectors), identifies a matrix which, when factorized using the particular SGD algorithm of word2vec, yields exactly these two matrices. So where word2vec was a bit hazy about what’s going on underneath, GloVe explicitly names the “objective” matrix, identifies the factorization, and provides some intuitive justification as to why this should give us working similarities.
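
To make the “two matrices” picture concrete, here is a tiny sketch of my own (an illustration of the factorization view, not GloVe’s actual objective or code; the sizes are made up):

import numpy as np

num_words, num_dims = 12, 5                              # tiny illustrative sizes
word_vectors = np.random.rand(num_words, num_dims)       # one row per word
context_vectors = np.random.rand(num_words, num_dims)    # one row per context word

# the factorization view: dot products of word and context vectors are trained to
# approximate a #words x #words matrix derived from co-occurrence counts
reconstruction = word_vectors.dot(context_vectors.T)     # shape: #words x #words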

Very nice and clear paper, go read it if you haven’t!

For example, if we have the following nine preprocessed sentences, and set window=5, the co-occurrence matrix looks like this:

# nine input sentences
texts = [['human', 'interface', 'computer'],
  ['survey', 'user', 'computer', 'system', 'response', 'time'],
  ['eps', 'user', 'interface', 'system'],
  ['system', 'human', 'system', 'eps'],
  ['user', 'response', 'time'],
  ['trees'],
  ['graph', 'trees'],
  ['graph', 'minors', 'trees'],
  ['graph', 'minors', 'survey']]
 
# word-word co-occurrence matrix, with context window size of 5
[[ 0 1 1 1 1 1 1 1 0 0 0 0 ]
  [ 1 0 1 0 0 2 0 0 1 0 0 0 ]
  [ 1 1 0 0 0 1 0 1 1 0 0 0 ]
  [ 1 0 0 0 1 1 2 2 0 0 0 0 ]
  [ 1 0 0 1 0 1 1 1 0 0 1 1 ]
  [ 1 2 1 1 1 2 1 2 3 0 0 0 ]
  [ 1 0 0 2 1 1 0 2 0 0 0 0 ]
  [ 1 0 1 2 1 2 2 0 1 0 0 0 ]
  [ 0 1 1 0 0 3 0 1 0 0 0 0 ]
  [ 0 0 0 0 0 0 0 0 0 0 2 1 ]
  [ 0 0 0 0 1 0 0 0 0 2 0 2 ]
  [ 0 0 0 0 1 0 0 0 0 1 2 0 ]]
# (rows/columns represent words:
# "computer human interface response survey system time user eps trees graph minors",
# in that order)

Note how the matrix is very sparse and symmetrical; the implementation we’ll use below takes advantage of both these properties to train GloVe more efficiently.

The GloVe algorithm then transforms such raw integer counts into a matrix where the co-occurrences are weighted based on their distance within the window (word pairs farther apart get less co-occurrence weight):

# same row/column names as above
[[ 0.    0.5   1.    0.5   0.5   1.    0.33  1.    0.    0.    0.    0.  ]
  [ 0.    0.    1.    0.    0.    2.    0.    0.    0.5   0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    1.    0.    1.    0.5   0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.25  1.    2.    1.33  0.    0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    0.33  0.2   1.    0.    0.    0.5   1.  ]
  [ 0.    0.    0.    0.    0.    0.    0.5   1.    1.67  0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.75  0.    0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.    1.    0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    1.5   1.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]

then takes a log, and factorizes this matrix to produce the final word vectors.
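
To make that counting step concrete, here is a little sketch of my own (not the glove library’s actual code) that accumulates distance-weighted co-occurrences from the texts list above:

from collections import defaultdict

def weighted_cooccurrences(texts, window=5):
    # each co-occurring pair within the window contributes 1/distance;
    # each pair is stored once (sorted), since the matrix is symmetric
    counts = defaultdict(float)
    for text in texts:
        for pos, word in enumerate(text):
            for other_pos in range(pos + 1, min(pos + window + 1, len(text))):
                pair = tuple(sorted((word, text[other_pos])))
                counts[pair] += 1.0 / (other_pos - pos)
    return counts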

This was really exciting news — it means the plain softmax word2vec essentially reduces to counting how many times words occur together, with some scaling thrown in. Technically, this is just a glorified cooccurrence_counts[word, other_word]++ in a loop, followed by any of the standard matrix factorization algorithms, both of which are well understood processes with efficient implementations.

GloVe vs word2vec

Oddly, the evaluation section of the GloVe paper didn’t match the quality of the rest. It had serious flaws in how the experiments compared GloVe to other methods. Several people called the authors out on the weirdness, most lucidly the Levy & Goldberg research duo from Bar-Ilan University — check out their “apples to apples” blog post for a bit of academic drama. To summarize: when evaluated properly, paying attention to parameter settings, GloVe doesn’t really seem to outperform the original word2vec, let alone by the 11% claimed in the GloVe paper.

Luckily, Maciej Kula implemented GloVe in Python, using Cython for performance. Using his neat implementation, we can try to make sense of the performance and accuracy ourselves.

Code to train GloVe in Python:

from gensim import utils, corpora, matutils, models
import glove

# Restrict dictionary to the 30k most common words.
wiki = models.word2vec.LineSentence('/data/shootout/title_tokens.txt.gz')
id2word = corpora.Dictionary(wiki)
id2word.filter_extremes(keep_n=30000)
word2id = dict((word, id) for id, word in id2word.iteritems())

# Filter all wiki documents to contain only those 30k words.
filter_text = lambda text: [word for word in text if word in word2id]
filtered_wiki = lambda: (filter_text(text) for text in wiki)  # generator

# Get the word co-occurrence matrix -- needs lots of RAM!!
cooccur = glove.Corpus()
cooccur.fit(filtered_wiki(), window=10)

# and train the GloVe model itself, using 10 epochs
model_glove = glove.Glove(no_components=600, learning_rate=0.05)
model_glove.fit(cooccur.matrix, epochs=10)
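
If I recall Maciej’s glove package API correctly — treat the exact method names here as an assumption on my part — the trained model can then be queried for nearest neighbours once the word-to-id mapping is attached:

# assumption: add_dictionary() and most_similar() as in Maciej Kula's glove package
model_glove.add_dictionary(cooccur.dictionary)      # attach the word <-> id mapping
print(model_glove.most_similar('graph', number=5))  # nearest neighbours by similarity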

And similarly for training word2vec:

model_word2vec = models.Word2Vec(size=600, window=10)
model_word2vec.build_vocab(filtered_wiki())
model_word2vec.train(filtered_wiki())
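
As a quick sanity check (my own addition, not part of the benchmark below), the trained gensim model can be queried for the analogy mentioned at the start, vocabulary permitting:

# king - man + woman should land near "queen"
print(model_word2vec.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))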

The reason why we restricted the vocabulary to only 30,000 words is that Maciej’s implementation of GloVe requires memory quadratic in the number of words: it keeps that sparse matrix of all word x word co-occurrences in RAM. In contrast, the gensim word2vec implementation is happy with linear memory, so millions of words are not a problem there. This is not an intrinsic limitation of GloVe though; with a different implementation, the co-occurrence matrix could be assembled out-of-core (Map/Reduce seems ideal for the job), and the factorization could just stream over it with constant memory too, in a more gensim-like fashion.

Results for 600 dims, context window of 10, 1.9B words of EN Wikipedia.
algorithm | accuracy on the word analogy task | wallclock time | peak RAM [MB]
I/O only (iterating over wiki with sum(len(text) for text in filtered_wiki())) | N/A | 3m | 25
GloVe, 10 epochs, learning rate 0.05 | 67.1% | 4h12m | 9,414
GloVe, 100 epochs, learning rate 0.05 | 67.3% | 18h39m | 9,452
word2vec, hierarchical skipgram, 1 epoch | 57.4% | 3h10m | 266
word2vec, negative sampling with 10 samples, 1 epoch | 68.3% | 8h38m | 628
word2vec, pre-trained GoogleNews model released by Tomáš Mikolov, 300 dims, 3,000,000 vocabulary | 55.3% | ? | ?

Basically, where GloVe precomputes the large word x word co-occurrence matrix in memory and then quickly factorizes it, word2vec sweeps through the sentences in an online fashion, handling each co-occurrence separately. So, there is a tradeoff between taking more memory (GloVe) vs. taking longer to train (word2vec). Also, once computed, GloVe can re-use the co-occurrence matrix to quickly factorize with any dimensionality, whereas word2vec has to be trained from scratch after changing its embedding dimensionality.

Note that both implementations are fairly optimized, running on 8 threads (on an 8 core machine), using the exact same input corpus, text preprocessing, vocabulary and evaluation code, so that the numbers are directly comparable. Code here.

SPPMI and SVD

In a manner analogous to GloVe, Levy and Goldberg (the same researchers mentioned above) analyzed the objective function of word2vec with negative sampling. That’s the one that performed best in the table above, so I decided to check it out too.

Again, they manage to derive a beautifully simple connection to matrix factorization. This time, the word x context objective “source” matrix is computed differently from GloVe’s. Each matrix cell, corresponding to word w and context word c, is computed as max(0.0, PMI(w, c) - log(k)), where k is the number of negative samples in word2vec (for example, k=10). PMI is the standard pointwise mutual information — if we use the notation that word w and context c occurred together \#wc times in the training corpus, then PMI(w, c) = \log \frac{\#wc \cdot \sum_{w,c}\#wc}{\sum_c\#wc \cdot \sum_w\#wc} (no smoothing).

The funky “SPPMI” name simply reflects that we’re subtracting \log(k) from PMI (“shifting”) and that we’re taking the max(0.0, SPMI) (“positive”; should be non-negative, really). So, Shifted Positive Pointwise Mutual Information.
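
Here is a minimal sketch of that transformation as I understand it (assuming a dense numpy array of raw word-context counts; this is my own illustration, not Levy & Goldberg’s code):

import numpy as np

def sppmi(cooccur_counts, k=1):
    # Shifted Positive PMI from a dense co-occurrence count matrix
    counts = np.asarray(cooccur_counts, dtype=np.float64)
    total = counts.sum()                                 # sum over all (w, c) pairs
    word_totals = counts.sum(axis=1, keepdims=True)      # #w for each word
    context_totals = counts.sum(axis=0, keepdims=True)   # #c for each context
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log(counts * total / (word_totals * context_totals))
    pmi[~np.isfinite(pmi)] = 0.0                         # zero counts give log(0); clear them
    return np.maximum(0.0, pmi - np.log(k))              # shift by log(k), then clip at zero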

For example, for the same nine texts we used above and k=1, the SPPMI matrix looks like this:

[[ 0.    0.83  0.83  0.49  0.49  0.    0.49  0.13  0.    0.    0.    0.  ]
  [ 0.83  0.    1.16  0.    0.    0.83  0.    0.    0.98  0.    0.    0.  ]
  [ 0.83  1.16  0.    0.    0.    0.13  0.    0.47  0.98  0.    0.    0.  ]
  [ 0.49  0.    0.    0.    0.49  0.    1.18  0.83  0.    0.    0.    0.  ]
  [ 0.49  0.    0.    0.49  0.    0.    0.49  0.13  0.    0.    0.83  1.05 ]
  [ 0.    0.83  0.13  0.    0.    0.    0.    0.13  1.05  0.    0.    0.  ]
  [ 0.49  0.    0.    1.18  0.49  0.    0.    0.83  0.    0.    0.    0.  ]
  [ 0.13  0.    0.47  0.83  0.13  0.13  0.83  0.    0.29  0.    0.    0.  ]
  [ 0.    0.98  0.98  0.    0.    1.05  0.    0.29  0.    0.    0.    0.  ]
  [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    2.37  1.9 ]
  [ 0.    0.    0.    0.    0.83  0.    0.    0.    0.    2.37  0.    2.08 ]
  [ 0.    0.    0.    0.    1.05  0.    0.    0.    0.    1.9   2.08  0.  ]]

No neural network training, no parameter tuning, we can directly take rows of this SPPMI matrix to be the word vectors. Very fast and simple. How does raw SPPMI compare to word2vec’s and GloVe’s factorizations though?

Comparison on 600 dims, context window 10, 1.9B words of EN Wikipedia.
algorithm | accuracy on the analogy task | wallclock time | peak RAM [MB]
word2vec, negative sampling k=10, 1 epoch | 68.3% | 8h38m | 628
GloVe, learning rate 0.05, 10 epochs | 67.1% | 4h12m | 9,414
SPPMI, k=1 | 48.7% | 50m | 3,433
SPPMI, k=10 | 30.3% | 50m | 3,429
SPPMI-SVD, k=1 | 39.4% | 1h23m | 3,426
SPPMI-SVD, k=10 | 3.8% | 1h23m | 3,444

The SPPMI-SVD method simply factorizes the sparse SPPMI matrix using Singular Value Decomposition (SVD), rather than the gradient descent methods of word2vec/GloVe, and uses the (dense) left singular vectors as the final word embeddings. SVD is a fast, scalable method with a straightforward geometric interpretation, and it performed very well in the NIPS experiments of Levy & Goldberg, who suggested SPPMI-SVD.
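
A sketch of that last step, using scipy’s sparse SVD (again my own illustration of the idea, not the exact pipeline from the paper):

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def sppmi_svd_vectors(sppmi_matrix, dim=600):
    # factorize the sparse SPPMI matrix; dim must be smaller than the vocabulary size
    sparse_sppmi = csr_matrix(sppmi_matrix)
    u, s, vt = svds(sparse_sppmi, k=dim)
    # use the left singular vectors as embeddings; scaling them by the
    # square roots of the singular values s is another common variant
    return u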

In the table above, the quality of both SPPMI and SPPMI-SVD models is atrocious, especially for higher values of k (more “shift”). I’m not sure why this is; I’ll try to get the original implementation of Levy & Goldberg to compare.

Also, I originally tried to produce this table with 1,000 dims, rather than 600. But the GloVe implementation started failing, producing word vectors with NaNs in them, and weird <1% accuracies when I tried to combat that by decreasing its learning rate. Maciej is still working on that one, so if you're thinking of using GloVe in production, beware. EDIT: Successfully resolved; Maciej’s GloVe handles that fine now.

To make experiments easier, I wrote and published a script that takes the parsed English Wikipedia and computes accuracy on the analogy task, using each of these different algorithms in turn. You can get it from GitHub and experiment with the various methods yourself, trying them on your own data / application.

What does that all mean?

Playing with Wikipedia is fun, but usually clients require more concrete insights from us.

How do we tweak word2vec to better model what we want?

How to tune word2vec model quality on a specific task (which is, in all likelihood, not “word analogies”)?

I’ll postpone that until the next post. Suffice to say that the performance depends on tuning the methods’ internal parameters, in non-obvious ways. The Bar Ilan powerhouse of Levy & Goldberg wrote a full paper on that, exploring the various parameter combinations (dynamic vs. fixed context window, subsampling, negative distribution smoothing, taking context vectors into account as well….). Their paper is under review now — I’ll post a link here as soon as it becomes public.

In the meanwhile, there has been some tentative research into the area of word2vec error analysis and tuning. Check out this web demo (by Levy & Goldberg again, from this paper), for investigating which contexts get activated for different words. This can lead to visual insights into which co-occurrences are responsible for a particular class of errors.

TL;DR: the word2vec implementation is still fine and state-of-the-art, you can continue using it :-)
