Training Word2Vec Model on English Wikipedia by Gensim


Update: I found another translated write-up, "Word2Vec Experiments on Chinese and English Wikipedia Corpora", which also covers the procedure for Chinese Wikipedia.

After learning about word2vec and GloVe, a natural next step is to train a model on a large corpus, and for that task English Wikipedia is an ideal choice. After googling related keywords such as "word2vec wikipedia" and "gensim word2vec wikipedia", I found a discussion in the gensim Google group, "training word2vec on full Wikipedia", which describes a workable approach. I noticed alternatives such as wiki2vec, but I think plain word2vec is simpler and more efficient.
I downloaded the data dump of English Wikipedia (dated 2015-03-01, about 11 GB):
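A minimal download sketch (my addition; the URL below is the standard Wikimedia "latest" dump, whereas the snapshot I actually used was 2015-03-01):

# Download sketch (assumption: standard Wikimedia dumps URL; the "latest"
# dump will differ from the 2015-03-01 snapshot used in this post).
import urllib  # Python 2; on Python 3 use urllib.request.urlretrieve

url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
urllib.urlretrieve(url, "enwiki-latest-pages-articles.xml.bz2")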
First, we need to convert the Wikipedia XML dump into plain text. Copy the code below from process_wiki.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Usage: python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
Convert a Wikipedia XML dump into plain text, one article per line.
"""

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    # get_texts() yields one article at a time as a list of tokens
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Note one difference here:
wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
The original was: wiki = WikiCorpus(inp, dictionary={})
We set lemmatize to False so that the Pattern library is not used, because lemmatization slows processing down severely.
Run "python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text" and we get:

2015-03-07 15:08:39,181: INFO: running process_enwiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
2015-03-07 15:11:12,860: INFO: Saved 10000 articles
2015-03-07 15:13:25,369: INFO: Saved 20000 articles
2015-03-07 15:15:19,771: INFO: Saved 30000 articles
2015-03-07 15:16:58,424: INFO: Saved 40000 articles
2015-03-07 15:18:12,374: INFO: Saved 50000 articles
2015-03-07 15:19:03,213: INFO: Saved 60000 articles
2015-03-07 15:19:47,656: INFO: Saved 70000 articles
2015-03-07 15:20:29,135: INFO: Saved 80000 articles
2015-03-07 15:22:02,365: INFO: Saved 90000 articles
2015-03-07 15:23:40,141: INFO: Saved 100000 articles
.....
2015-03-07 19:33:16,549: INFO: Saved 3700000 articles
2015-03-07 19:33:49,493: INFO: Saved 3710000 articles
2015-03-07 19:34:23,442: INFO: Saved 3720000 articles
2015-03-07 19:34:57,984: INFO: Saved 3730000 articles
2015-03-07 19:35:31,976: INFO: Saved 3740000 articles
2015-03-07 19:36:05,790: INFO: Saved 3750000 articles
2015-03-07 19:36:32,392: INFO: finished iterating over Wikipedia corpus of 3758076 documents with 2018886604 positions (total 15271374 articles, 2075130438 positions before pruning articles shorter than 50 words)
2015-03-07 19:36:32,394: INFO: Finished Saved 3758076 articles

After about 5 hours on my MacBook Pro (4-core CPU and 16 GB of RAM), we get a 12 GB wiki.en.text, one article per line with punctuation discarded, like this:

anarchism is collection of movements and ideologies that hold the state to be undesirable unnecessary or harmful these movements advocate some form of stateless society instead often based on self governed voluntary institutions or non hierarchical free associations although anti statism is central to anarchism as political philosophy anarchism also entails rejection of and often hierarchical organisation in general as an anti dogmatic philosophy anarchism draws on many currents of thought and strategy anarchism does not offer fixed body of doctrine from single particular world view instead fluxing and flowing as philosophy there are many types and traditions of anarchism not all of which are mutually exclusive anarchist schools of thought can differ fundamentally supporting anything from extreme individualism to complete collectivism strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications anarchism is usually considered radical left wing ideology and much of anarchist economics and anarchist legal philosophy reflect anti authoritarian interpretations of communism collectivism syndicalism mutualism or participatory economics etymology and terminology the term anarchism is compound word composed from the word anarchy and the suffix ism themselves derived respectively from the greek anarchy from anarchos meaning one without rulers from the privative prefix ἀν an without and archos leader ruler cf archon or arkhē authority sovereignty realm magistracy and the suffix or ismos isma from the verbal infinitive suffix...

The second step is to train a word2vec model from this text. You could train models on files like text8 with the original word2vec binary, but that seems too slow here. As in that discussion, we use gensim's word2vec implementation to train the English Wikipedia model. Copy the code below from train_word2vec_model.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Usage: python train_word2vec_model.py wiki.en.text wiki.en.word2vec.model
Train a word2vec model on a one-sentence-per-line corpus.
"""

import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    # 400-dimensional vectors, 5-word context window, ignore words with
    # fewer than 5 occurrences, one worker per CPU core
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
            workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)

    model.save(outp)

执行"python train_word2vec_model.py wiki.en.text wiki.en.word2vec.model":

2015-03-07 20:11:47,796 : INFO :  running train_word2vec_model.py wiki.en.text wiki.en.word2vec.model
2015-03-07 20:11:47,801 : INFO :  collecting all words and their counts
2015-03-07 20:11:47,823 : INFO :  PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-07 20:12:09,816 : INFO :  PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-07 20:12:29,920 : INFO :  PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-07 20:12:45,654 : INFO :  PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-07 20:13:02,623 : INFO :  PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-07 20:13:13,613 : INFO :  PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-07 20:13:20,383 : INFO :  PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-07 20:13:25,511 : INFO :  PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-07 20:13:30,756 : INFO :  PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-07 20:13:42,144 : INFO :  PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-07 20:13:54,513 : INFO :  PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-07 20:36:02,246 : INFO :  PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-07 20:36:04,786 : INFO :  PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-07 20:36:07,423 : INFO :  PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-07 20:36:10,115 : INFO :  PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-07 20:36:12,595 : INFO :  PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-07 20:36:15,120 : INFO :  PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-07 20:36:17,057 : INFO :  collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-07 20:36:22,710 : INFO :  total 1969354 word types after removing those with count<5
2015-03-07 20:36:22,710 : INFO :  constructing a huffman tree from 1969354 words
2015-03-07 20:38:20,767 : INFO :  built huffman tree with maximum node depth 29
2015-03-07 20:38:23,219 : INFO :  resetting layer weights
2015-03-07 20:39:18,277 : INFO :  training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-07 20:39:33,141 : INFO :  PROGRESS: at 0.01% words, alpha 0.02500, 18766 words/s
2015-03-07 20:39:34,874 : INFO :  PROGRESS: at 0.05% words, alpha 0.02500, 56782 words/s
2015-03-07 20:39:35,886 : INFO :  PROGRESS: at 0.07% words, alpha 0.02500, 76206 words/s
2015-03-07 20:39:41,163 : INFO :  PROGRESS: at 0.08% words, alpha 0.02499, 66533 words/s
2015-03-07 20:39:43,442 : INFO :  PROGRESS: at 0.09% words, alpha 0.02500, 70345 words/s
2015-03-07 20:39:47,604 : INFO :  PROGRESS: at 0.11% words, alpha 0.02498, 77893 words/s
......
2015-03-08 02:33:26,624 : INFO :  PROGRESS: at 99.19% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:27,976 : INFO :  PROGRESS: at 99.20% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:29,097 : INFO :  PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:30,465 : INFO :  PROGRESS: at 99.21% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:31,768 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93813 words/s
2015-03-08 02:33:32,839 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:32,839 : INFO :  PROGRESS: at 99.22% words, alpha 0.00020, 93814 words/s
2015-03-08 02:33:33,535 : INFO :  reached the end of input; waiting to finish 8 outstanding jobs
2015-03-08 02:33:33,939 : INFO :  PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:34,998 : INFO :  PROGRESS: at 99.23% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,127 : INFO :  PROGRESS: at 99.24% words, alpha 0.00019, 93815 words/s
2015-03-08 02:33:36,961 : INFO :  training on 1994415728 words took 21258.7s, 93817 words/s
2015-03-08 02:33:36,996 : INFO :  precomputing L2-norms of word weight vectors
2015-03-08 02:33:58,490 : INFO :  saving Word2Vec object under wiki.en.word2vec.model, separately None
2015-03-08 02:33:58,666 : INFO :  not storing attribute syn0norm
2015-03-08 02:33:58,666 : INFO :  storing numpy array 'syn0' to wiki.en.word2vec.model.syn0.npy

After about 7 hours we get the English Wikipedia model "wiki.en.word2vec.model", but something in the model looks odd:

In [1]: import gensim
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.word2vec.model")
 
In [3]: model.most_similar("queen")
...python2.7/site-packages/gensim/models/word2vec.py:827: RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
Out[3]: 
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]

As gensim's author Radim Řehůřek explained, I think the problem comes from the following:
Thanks h3im.
Both numbers are identical, so there’s no problem with the dictionary/input.
I had another idea — inside the cython code, the maximum sentence length is clipped to 1,000 words. Any sentence longer than that will only consider the first 1,000 words.
In your case, you’re storing entire documents as a single sentence (1 wiki doc = 1 sentence). So this restriction may be kicking in.
Can you try increasing `DEF MAX_SENTENCE_LEN = 1000` to 10k for example, in word2vec_inner.pyx?
Or, alternatively, split documents into sentences, so each sentence is < 1,000 words long. Let me know, Radim
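For the second suggestion, a minimal sketch of what the splitting could look like (my own illustration, not from the thread; the file names are assumptions):

# Split each one-article line of wiki.en.text into chunks of at most
# 1,000 tokens, so that no "sentence" exceeds MAX_SENTENCE_LEN.
MAX_LEN = 1000

with open("wiki.en.text") as fin, open("wiki.en.split.text", "w") as fout:
    for line in fin:
        words = line.split()
        for i in range(0, len(words), MAX_LEN):
            fout.write(" ".join(words[i:i + MAX_LEN]) + "\n")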


But I found that MAX_SENTENCE_LEN is already set to 10k in my gensim (version 0.10.3). I then tried a small text file, "wiki.en.10w", consisting of the first 100,000 lines of wiki.en.text (see the sketch after the session below for one way to cut it), trained a word2vec model "wiki.en.10w.model" with train_word2vec_model.py, and found that everything was OK:

In [1]: import gensim
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.10w.model")
 
In [3]: model.most_similar("queen")
Out[3]: 
[(u'princess', 0.5976558327674866),
 (u'elizabeth', 0.591829776763916),
 (u'consort', 0.5514105558395386),
 (u'drottningens', 0.5454206466674805),
 (u'regnant', 0.5419434309005737),
 (u'f\xf6delsedag', 0.5259706974029541),
 (u'saovabha', 0.5250850915908813),
 (u'margrethe', 0.5195728540420532),
 (u'mary', 0.5035395622253418),
 (u'armgard', 0.5028442144393921)]
 
In [4]: model.most_similar("man")
Out[4]: 
[(u'woman', 0.6305292844772339),
 (u'boy', 0.5495858788490295),
 (u'girl', 0.5382533073425293),
 (u'bespectacled', 0.44303444027900696),
 (u'eutychus', 0.43531811237335205),
 (u'coochie', 0.42641448974609375),
 (u'soldier', 0.4228038191795349),
 (u'hater', 0.4212420582771301),
 (u'mannish', 0.4139400124549866),
 (u'bellybutton', 0.4139178991317749)]
 
In [5]: model.similarity("man", "woman")
Out[5]: 0.63052930788363182
 
In [6]: model.similarity("girl", "woman")
Out[6]: 0.59083314898425321
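
For reference, the wiki.en.10w subset above is just the first 100,000 lines of wiki.en.text; a trivial sketch of one way to produce it (a shell "head -n 100000" works equally well):

# Write the first 100,000 lines of the full corpus to wiki.en.10w.
with open("wiki.en.text") as fin, open("wiki.en.10w", "w") as fout:
    for i, line in enumerate(fin):
        if i >= 100000:
            break
        fout.write(line)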

I decided to also save the model in the original word2vec text format for debugging, so I modified train_word2vec_model.py as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Usage: python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
Train a word2vec model and save it both as a gensim model and in the
original word2vec text format.
"""

import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print globals()['__doc__'] % locals()
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
            workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    #model.init_sims(replace=True)
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)

Then run "python train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector":

2015-03-09 22:48:29,588: INFO: running train_word2vec_model.py wiki.en.text wiki.en.text.model wiki.en.text.vector
2015-03-09 22:48:29,593: INFO: collecting all words and their counts
2015-03-09 22:48:29,607: INFO: PROGRESS: at sentence #0, processed 0 words and 0 word types
2015-03-09 22:48:50,686: INFO: PROGRESS: at sentence #10000, processed 29353579 words and 430650 word types
2015-03-09 22:49:08,476: INFO: PROGRESS: at sentence #20000, processed 54695775 words and 610833 word types
2015-03-09 22:49:22,985: INFO: PROGRESS: at sentence #30000, processed 75344844 words and 742274 word types
2015-03-09 22:49:35,607: INFO: PROGRESS: at sentence #40000, processed 93430415 words and 859131 word types
2015-03-09 22:49:44,125: INFO: PROGRESS: at sentence #50000, processed 106057188 words and 935606 word types
2015-03-09 22:49:49,185: INFO: PROGRESS: at sentence #60000, processed 114319016 words and 952771 word types
2015-03-09 22:49:53,316: INFO: PROGRESS: at sentence #70000, processed 121263134 words and 969526 word types
2015-03-09 22:49:57,268: INFO: PROGRESS: at sentence #80000, processed 127773799 words and 984130 word types
2015-03-09 22:50:07,593: INFO: PROGRESS: at sentence #90000, processed 142688762 words and 1062932 word types
2015-03-09 22:50:19,162: INFO: PROGRESS: at sentence #100000, processed 159550824 words and 1157644 word types
......
2015-03-09 23:11:52,977: INFO: PROGRESS: at sentence #3700000, processed 1999452503 words and 7990138 word types
2015-03-09 23:11:55,367: INFO: PROGRESS: at sentence #3710000, processed 2002777270 words and 8002903 word types
2015-03-09 23:11:57,842: INFO: PROGRESS: at sentence #3720000, processed 2006213923 words and 8019620 word types
2015-03-09 23:12:00,439: INFO: PROGRESS: at sentence #3730000, processed 2009762733 words and 8035408 word types
2015-03-09 23:12:02,793: INFO: PROGRESS: at sentence #3740000, processed 2013066196 words and 8045218 word types
2015-03-09 23:12:05,178: INFO: PROGRESS: at sentence #3750000, processed 2016363087 words and 8057784 word types
2015-03-09 23:12:07,013: INFO: collected 8069236 word types from a corpus of 2018886604 words and 3758076 sentences
2015-03-09 23:12:12,230: INFO: total 1969354 word types after removing those with count<5
2015-03-09 23:12:12,230: INFO: constructing a huffman tree from 1969354 words
2015-03-09 23:14:07,415: INFO: built huffman tree with maximum node depth 29
2015-03-09 23:14:09,790: INFO: resetting layer weights
2015-03-09 23:15:04,506: INFO: training model with 4 workers on 1969354 vocabulary and 400 features, using 'skipgram'=1 'hierarchical softmax'=1 'subsample'=0 and 'negative sampling'=0
2015-03-09 23:15:19,112: INFO: PROGRESS: at 0.01% words, alpha 0.02500, 19098 words/s
2015-03-09 23:15:20,224: INFO: PROGRESS: at 0.03% words, alpha 0.02500, 37671 words/s
2015-03-09 23:15:22,305: INFO: PROGRESS: at 0.07% words, alpha 0.02500, 75393 words/s
2015-03-09 23:15:27,712: INFO: PROGRESS: at 0.08% words, alpha 0.02499, 65618 words/s
2015-03-09 23:15:29,452: INFO: PROGRESS: at 0.09% words, alpha 0.02500, 70966 words/s
2015-03-09 23:15:34,032: INFO: PROGRESS: at 0.11% words, alpha 0.02498, 77369 words/s
2015-03-09 23:15:37,249: INFO: PROGRESS: at 0.12% words, alpha 0.02498, 74935 words/s
2015-03-09 23:15:40,618: INFO: PROGRESS: at 0.14% words, alpha 0.02498, 75399 words/s
2015-03-09 23:15:42,301: INFO: PROGRESS: at 0.16% words, alpha 0.02497, 86029 words/s
2015-03-09 23:15:46,283: INFO: PROGRESS: at 0.17% words, alpha 0.02497, 83033 words/s
2015-03-09 23:15:48,374: INFO: PROGRESS: at 0.18% words, alpha 0.02497, 83370 words/s
2015-03-09 23:15:51,398: INFO: PROGRESS: at 0.19% words, alpha 0.02496, 82794 words/s
2015-03-09 23:15:55,069: INFO: PROGRESS: at 0.21% words, alpha 0.02496, 83753 words/s
2015-03-09 23:15:57,718: INFO: PROGRESS: at 0.23% words, alpha 0.02496, 85031 words/s
2015-03-09 23:16:00,106: INFO: PROGRESS: at 0.24% words, alpha 0.02495, 86567 words/s
2015-03-09 23:16:05,523: INFO: PROGRESS: at 0.26% words, alpha 0.02495, 84850 words/s
2015-03-09 23:16:06,596: INFO: PROGRESS: at 0.27% words, alpha 0.02495, 87926 words/s
2015-03-09 23:16:09,500: INFO: PROGRESS: at 0.29% words, alpha 0.02494, 88618 words/s
2015-03-09 23:16:10,714: INFO: PROGRESS: at 0.30% words, alpha 0.02494, 91023 words/s
2015-03-09 23:16:18,467: INFO: PROGRESS: at 0.32% words, alpha 0.02494, 85960 words/s
2015-03-09 23:16:19,547: INFO: PROGRESS: at 0.33% words, alpha 0.02493, 89140 words/s
2015-03-09 23:16:23,500: INFO: PROGRESS: at 0.36% words, alpha 0.02493, 92026 words/s
2015-03-09 23:16:29,738: INFO: PROGRESS: at 0.37% words, alpha 0.02491, 88180 words/s
2015-03-09 23:16:32,000: INFO: PROGRESS: at 0.40% words, alpha 0.02492, 92734 words/s
2015-03-09 23:16:34,392: INFO: PROGRESS: at 0.42% words, alpha 0.02491, 93300 words/s
2015-03-09 23:16:41,018: INFO: PROGRESS: at 0.43% words, alpha 0.02490, 89727 words/s
.......
2015-03-10 05:03:31,849: INFO: PROGRESS: at 99.20% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:32,901: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:34,296: INFO: PROGRESS: at 99.21% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:35,635: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95349 words/s
2015-03-10 05:03:36,730: INFO: PROGRESS: at 99.22% words, alpha 0.00020, 95350 words/s
2015-03-10 05:03:37,489: INFO: reached the end of input; waiting to finish 8 outstanding jobs
2015-03-10 05:03:37,908: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:39,028: INFO: PROGRESS: at 99.23% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,127: INFO: PROGRESS: at 99.24% words, alpha 0.00019, 95350 words/s
2015-03-10 05:03:40,910: INFO: training on 1994415728 words took 20916.4s, 95352 words/s
2015-03-10 05:03:41,058: INFO: saving Word2Vec object under wiki.en.text.model, separately None
2015-03-10 05:03:41,209: INFO: not storing attribute syn0norm
2015-03-10 05:03:41,209: INFO: storing numpy array 'syn0' to wiki.en.text.model.syn0.npy
2015-03-10 05:04:35,199: INFO: storing numpy array 'syn1' to wiki.en.text.model.syn1.npy
2015-03-10 05:11:25,400: INFO: storing 1969354x400 projection weights into wiki.en.text.vector

After another 7 hours or so, we get the word2vec model in text format, wiki.en.text.vector:

1969354 400
the 0.129255 0.015725 0.049174 -0.016438 -0.018912 0.032752 0.079885 0.033669 -0.077722 -0.025709 0.012775 0.044153 0.134307 0.070499 -0.002243 0.105198 -0.016832 -0.028631 -0.124312 -0.123064 -0.116838 0.051181 -0.096058 -0.049734 0.017380 -0.101221 0.058945 0.013669 -0.012755 0.061053 0.061813 0.083655 -0.069382 -0.069868 0.066529 -0.037156 -0.072935 -0.009470 0.037412 -0.004406 0.047011 0.005033 -0.066270 -0.031815 0.023160 -0.080117 0.172918 0.065486 -0.072161 0.062875 0.019939 -0.048380 0.198152 -0.098525 0.023434 0.079439 0.045150 -0.079479 -0.051441 -0.021556 -0.024981 -0.045291 0.040284 -0.082500 0.014618 -0.071998 0.031887 0.043916 0.115783 -0.174898 0.086603 -0.023124 0.007293 -0.066576 -0.164817 -0.081223 0.058412 0.000132 0.064160 0.055848 0.029776 -0.103420 -0.007541 -0.031742 0.082533 -0.061760 -0.038961 0.001754 -0.023977 0.069616 0.095920 0.017136 0.067126 -0.111310 0.053632 0.017633 -0.003875 -0.005236 0.063151 0.039729 -0.039158 0.001415 0.021754 -0.012540 0.015070 -0.062636 -0.013605 -0.031770 0.005296 -0.078119 -0.069303 -0.080634 -0.058377 0.024398 -0.028173 0.026353 0.088662 0.018755 -0.113538 0.055538 -0.086012 -0.027708 -0.028788 0.017759 0.029293 0.047674 -0.106734 -0.134380 0.048605 -0.089583 0.029426 0.030552 0.141916 -0.022653 0.017204 -0.036059 0.061045 -0.000077 -0.076579 0.066747 0.060884 -0.072817 -0.051195 0.017663 0.043462 0.027486 -0.040694 0.025904 -0.075665 -0.000057 -0.076601 0.006704 -0.078985 -0.027770 -0.038087 0.097482 -0.001861 0.003741 -0.010897 0.042828 -0.037804 0.041546 -0.018394 -0.092459 0.010917 -0.004262 -0.113903 -0.037155 0.066674 0.096078 -0.114286 0.027908 -0.003139 -0.007529 -0.076928 0.025825 -0.090934 -0.013763 -0.057434 0.071827 -0.031783 -0.052096 0.107292 0.001864 -0.020808 0.043721 -0.024951 -0.046789 0.092858 0.037771 -0.006570 0.018282 -0.013571 -0.069215 0.019530 -0.080015 -0.078925 0.003094 0.044550 -0.046577 0.004945 -0.010885 -0.098681 0.044861 0.001618 -0.077582 -0.013834 0.024985 0.008541 -0.011861 0.023718 -0.018038 0.004162 -0.005827 -0.036836 0.081241 -0.028473 0.043937 0.005622 -0.004714 -0.029995 0.002236 -0.044635 -0.100051 0.006926 0.012636 -0.132891 -0.097755 -0.118586 0.038355 -0.034691 0.027983 0.074292 0.075199 0.033331 0.067474 -0.023996 0.024614 -0.039520 -0.110454 0.046004 -0.047849 0.023945 -0.022695 -0.053563 0.035277 0.011309 0.044326 0.026382 0.043251 0.004535 0.112228 0.022841 -0.068083 -0.122575 -0.053305 -0.005031 -0.078522 -0.044147 0.083576 0.005531 -0.063187 -0.032841 -0.067989 0.111359 0.125724 0.074154 0.040301 0.082240 0.015494 -0.066648 0.091087 0.095067 -0.059386 0.003256 -0.006734 -0.058248 0.020567 -0.006784 -0.017885 0.146956 -0.014679 -0.019453 -0.009875 -0.031508 0.002070 -0.002830 0.060321 0.056237 -0.080740 0.017465 0.016851 -0.067723 -0.061582 0.028104 0.067970 -0.024162 0.027407 0.075006 0.084483 -0.011534 0.129151 -0.072387 0.083424 -0.009501 0.041553 0.016603 0.002965 -0.027677 -0.110295 0.033986 0.028290 0.049621 0.001125 -0.018187 -0.001404 -0.024074 0.025322 -0.023594 -0.076071 0.107616 0.091381 -0.116943 0.109416 -0.045990 0.024346 0.152548 -0.010692 0.120887 -0.012670 -0.044978 -0.050880 -0.012535 -0.080475 0.036055 -0.050770 0.040417 -0.030957 -0.013680 0.001236 0.010180 -0.040136 -0.118249 0.017540 0.107725 -0.118492 -0.032438 -0.009072 -0.081345 -0.022384 0.045453 -0.008754 -0.098392 -0.113199 0.023589 0.017172 0.108523 -0.029611 0.041029 0.005958 0.010155 -0.036815 0.073110 -0.048424 -0.029022 -0.016711 -0.126587 0.045923 0.018589 0.113195 -0.002896 -0.051350 -0.007355 
0.012278 0.093481 0.093676 -0.145230 -0.068279 -0.068407 0.008837 -0.012186 -0.136079 0.087961 0.041402 -0.058727 0.003030 0.008455 -0.062826 -0.139834 -0.014068 -0.115521 -0.117215 0.093502 0.026607 0.095726 -0.016339 0.033879 -0.022889 0.023565 0.028705
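
This is the standard word2vec text format: a header line with the vocabulary size and vector dimensionality, then one word followed by its 400 components per line. A minimal reader sketch, in case you want to parse it without gensim (my addition, assuming the layout shown above):

# Parse the word2vec text format directly into a word -> vector dict.
import numpy as np

with open("wiki.en.text.vector") as f:
    vocab_size, dim = map(int, f.readline().split())
    vectors = {}
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:dim + 1], dtype=np.float32)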


We test it in IPython like this; note that wiki.en.text.vector is about 7 GB, so loading it takes quite a long time:

In [2]: import gensim
 
In [3]: model = gensim.models.Word2Vec.load_word2vec_format("wiki.en.text.vector", binary=False)
 
In [4]: model.most_similar("queen")
Out[4]: 
[(u'princess', 0.5760838389396667),
 (u'hyoui', 0.5671186447143555),
 (u'janggyung', 0.5598698854446411),
 (u'king', 0.5556215047836304),
 (u'dollallolla', 0.5540223121643066),
 (u'loranella', 0.5522741079330444),
 (u'ramphaiphanni', 0.5310937166213989),
 (u'jeheon', 0.5298476219177246),
 (u'soheon', 0.5243583917617798),
 (u'coronation', 0.5217245221138)]
 
In [5]: model.most_similar("man")
Out[5]: 
[(u'woman', 0.7120707035064697),
 (u'girl', 0.58659827709198),
 (u'handsome', 0.5637181997299194),
 (u'boy', 0.5425317287445068),
 (u'villager', 0.5084836483001709),
 (u'mustachioed', 0.49287813901901245),
 (u'mcgucket', 0.48355430364608765),
 (u'spider', 0.4804879426956177),
 (u'policeman', 0.4780033826828003),
 (u'stranger', 0.4750771224498749)]
 
In [6]: model.most_similar("woman")
Out[6]: 
[(u'man', 0.7120705842971802),
 (u'girl', 0.6736541986465454),
 (u'prostitute', 0.5765659809112549),
 (u'divorcee', 0.5429972410202026),
 (u'person', 0.5276163816452026),
 (u'schoolgirl', 0.5102938413619995),
 (u'housewife', 0.48748138546943665),
 (u'lover', 0.4858251214027405),
 (u'handsome', 0.4773051142692566),
 (u'boy', 0.47445783019065857)]
 
In [8]: model.similarity("woman", "man")
Out[8]: 0.71207063453821218
 
In [10]: model.doesnt_match("breakfast cereal dinner lunch".split())
Out[10]: 'cereal'
 
In [11]: model.similarity("woman", "girl")
Out[11]: 0.67365416785207421
 
In [13]: model.most_similar("frog")
Out[13]: 
[(u'toad', 0.6868536472320557),
 (u'barycragus', 0.6607867479324341),
 (u'grylio', 0.626731276512146),
 (u'heckscheri', 0.6208407878875732),
 (u'clamitans', 0.6150864362716675),
 (u'coplandi', 0.612680196762085),
 (u'pseudacris', 0.6108512878417969),
 (u'litoria', 0.6084023714065552),
 (u'raniformis', 0.6044802665710449),
 (u'watjulumensis', 0.6043726205825806)]

Everything looks fine, but when I load the numpy-backed model, I still run into the "RuntimeWarning: invalid value encountered in divide" problem:

In [1]: import gensim 
 
In [2]: model = gensim.models.Word2Vec.load("wiki.en.text.model")
 
In [3]: model.most_similar("man")
... RuntimeWarning: invalid value encountered in divide
  self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
 
Out[3]: 
[(u'ahsns', nan),
 (u'ny\xedl', nan),
 (u'indradeo', nan),
 (u'jaimovich', nan),
 (u'addlepate', nan),
 (u'jagello', nan),
 (u'festenburg', nan),
 (u'picatic', nan),
 (u'tolosanum', nan),
 (u'mithoo', nan)]
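
One way to narrow this down would be to check which vocabulary rows actually contain NaN/inf values before most_similar() normalises them; a diagnostic sketch (my addition, using the gensim 0.10.x attributes syn0 and index2word):

# Locate vocabulary rows whose vectors contain NaN/inf; these are what
# make the L2-normalisation in most_similar() produce nan similarities.
import numpy as np
import gensim

model = gensim.models.Word2Vec.load("wiki.en.text.model")
bad = np.where(~np.isfinite(model.syn0).all(axis=1))[0]
print "rows with NaN/inf:", len(bad)
for idx in bad[:10]:
    print model.index2word[idx]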

If you have read this far and can solve the problem I ran into, please point it out. Many thanks.
