1. Installation on Linux
Python environment: gensim word2vec
Dependencies: NumPy and SciPy. Install these two libraries first.
On Ubuntu:
sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
Once those are installed, install gensim:
pip install gensim
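To confirm everything imports cleanly, a quick check (a minimal sanity test; this assumes python is the interpreter the packages were installed for):
python -c "import numpy, scipy, gensim; print gensim.__version__"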
2. With the prerequisites in place, on to the main part.
Chinese data (1.3 GB) download:
https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
English data (11 GB) download:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- Convert the XML wiki dump to plain text with: python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text (for the Chinese dump, substitute zhwiki-latest-pages-articles.xml.bz2 and wiki.zh.text)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print "Usage: python process_wiki.py <wiki-dump.xml.bz2> <output.txt>"
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    # WikiCorpus parses the compressed XML dump and yields each article
    # as a list of tokens, with the wiki markup already stripped
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished saving " + str(i) + " articles")
The result is stored in wiki.en.text (wiki.zh.text for the Chinese dump): one article per line, with the words separated by spaces.
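A quick way to sanity-check the conversion is to look at the first few tokens of the first article (a minimal sketch; the file name assumes the Chinese dump was converted to wiki.zh.text):
# print the first 20 tokens of the first article in the converted corpus
with open("wiki.zh.text") as f:
    tokens = f.readline().split()
print " ".join(tokens[:20])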
3. The converted wiki.zh.text still mixes Simplified and Traditional characters, so we use opencc to normalize everything to Simplified. Command:
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
opencc needs to be installed first:
sudo apt-get install opencc
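There is also a Python wrapper for OpenCC if you prefer to do the conversion in code. A sketch, assuming the opencc-python-reimplemented package (pip install opencc-python-reimplemented); note its configuration names like 't2s' replace the older zht2zhs.ini style:
# -*- coding: utf-8 -*-
# sketch: Traditional -> Simplified conversion in Python, line by line
from opencc import OpenCC

cc = OpenCC('t2s')  # 't2s' = Traditional to Simplified
with open("wiki.zh.text") as fin, open("wiki.zh.text.jian", "w") as fout:
    for line in fin:
        fout.write(cc.convert(line.decode("utf-8")).encode("utf-8"))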
4. Segment the normalized wiki.zh.text.jian into words.
We use jieba for segmentation (jieba.py):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba
import jieba.analyse
import jieba.posseg as pseg

def cut_words(sentence):
    # segment one line into words and join them with spaces
    return " ".join(jieba.cut(sentence)).encode('utf-8')

f = open("/home/xuanwei/工作/word2Vec/wiki.zh.text.jian")
target = open("/home/xuanwei/工作/word2Vec/wiki.zh.text.jian.seg", 'a+')
print 'open files:'
# readlines(100000) reads batches of lines totalling roughly 100000 bytes,
# so the whole corpus never has to fit in memory at once
line = f.readlines(100000)
num_n = 0
while line:
    num_n += 1
    after_cut = map(cut_words, line)
    target.writelines(after_cut)
    print 'saved %d batches' % num_n
    line = f.readlines(100000)
f.close()
target.close()
This step takes quite a while...
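For a sense of what the segmenter produces, a minimal example (the exact split depends on jieba's dictionary version):
# -*- coding: utf-8 -*-
import jieba

# one sentence in, space-separated tokens out,
# typically something like: 我 爱 自然语言 处理
print " ".join(jieba.cut(u"我爱自然语言处理"))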
5. Now the key step: training.
Run: python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector
The code (train_word2vec_model.py):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys
import multiprocessing

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 4:
        print "Usage: python train_word2vec_model.py <corpus> <model-out> <vectors-out>"
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]

    # LineSentence streams the segmented corpus one line (one article) at a
    # time; size is the vector dimensionality, window the context width,
    # min_count drops rare words, workers parallelizes training
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # save the full model, plus the vectors in word2vec's plain-text format
    model.save(outp1)
    model.save_word2vec_format(outp2, binary=False)
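The hyperparameters above (400-dimensional vectors, a window of 5, discarding words seen fewer than 5 times) are reasonable defaults and worth experimenting with. gensim also supports the skip-gram architecture via sg=1 (the default is CBOW); a sketch of the same call with skip-gram, which trains more slowly but often does better on rare words:
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# same training call, but skip-gram (sg=1) instead of the default CBOW
model = Word2Vec(LineSentence("wiki.zh.text.jian.seg"), size=400, window=5,
                 min_count=5, sg=1, workers=multiprocessing.cpu_count())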
6. Finally, let's load the model and try it out:
In [1]: import gensim
In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
In [3]: model.most_similar(u"足球")
Out[3]:
[(u'\u8054\u8d5b', 0.6553816199302673),
(u'\u7532\u7ea7', 0.6530429720878601),
(u'\u7bee\u7403', 0.5967546701431274),
(u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
(u'\u4e59\u7ea7', 0.5840631723403931),
(u'\u8db3\u7403\u961f', 0.5560152530670166),
(u'\u4e9a\u8db3\u8054', 0.5308005809783936),
(u'allsvenskan', 0.5249762535095215),
(u'\u4ee3\u8868\u961f', 0.5214947462081909),
(u'\u7532\u7ec4', 0.5177896022796631)]
In [4]: result = model.most_similar(u"足球")
In [5]: for e in result:
   ....:     print e[0], e[1]
   ....:
联赛 0.65538161993
甲级 0.653042972088
篮球 0.596754670143
俱乐部 0.587228953838
乙级 0.58406317234
足球队 0.556015253067
亚足联 0.530800580978
allsvenskan 0.52497625351
代表队 0.521494746208
甲组 0.51778960228
That completes the word2vec training: word2vec maps each word to a vector that captures, among other things, how related words are to each other.
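Two other handy queries on the model are pairwise similarity and odd-one-out (a sketch; the similarity value here agrees with the 0.5967... score for 篮球 in the most_similar output above, and the doesnt_match result will depend on the trained model):
In [6]: model.similarity(u"足球", u"篮球")
Out[6]: 0.5967546701431274

In [7]: model.doesnt_match(u"足球 篮球 苹果".split())   # expect 苹果 (apple) as the odd one out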