wiki_word2vec_python experiment


1. Install the Python version of gensim word2vec on Linux:

Dependencies: NumPy and SciPy.

First, install these two libraries. On Ubuntu:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
Once those are installed, install gensim:

pip install gensim
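
To confirm the install worked, a quick check (my addition, not part of the original steps) is to import gensim and print its version:

python -c "import gensim; print(gensim.__version__)"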
2. With the preliminaries done, we move on to the key part:

Chinese data (1.3 GB) download:

https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

English data (11 GB) download:

https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
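
Either dump can be fetched from the command line, for example with wget (-c resumes an interrupted download):

wget -c https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2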

  1. Convert the XML wiki data to text format with the command: python process_wiki.py enwiki-latest-pages-articles.xml.bz2 wiki.en.text
process_wiki.py, the Python processing code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
 
import logging
import os.path
import sys
 
from gensim.corpora import WikiCorpus
 
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
 
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
 
    # check and process input arguments
    if len(sys.argv) < 3:
        print "Usage: python process_wiki.py <input_xml.bz2> <output_text>"
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
 
    output = open(outp, 'w')
    # WikiCorpus strips the wiki markup and yields each article as a list of tokens
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        # write one article per line, tokens separated by single spaces
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
 
    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

The result is saved in wiki.en.text with the format: one article per line, with the article's tokens separated by spaces.
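
As a quick sanity check (my own, not part of the original write-up), you can print the opening characters of the first article:

with open("wiki.en.text") as f:
    print f.readline()[:300]   # the first tokens of the first article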

2. But the text converted this way from the Chinese dump, wiki.zh.text, mixes Simplified and Traditional Chinese, so we use opencc to normalize it all to Simplified. Command:

opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini

opencc must be installed first:

sudo apt-get install opencc
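
Note: zht2zhs.ini is the configuration name used by the older opencc 0.x that Ubuntu packaged at the time; if you end up with a newer opencc 1.x instead, the equivalent Traditional-to-Simplified conversion uses a JSON config:

opencc -i wiki.zh.text -o wiki.zh.text.jian -c t2s.json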

3. Segment the converted wiki.zh.text.jian into words.

We segment with jieba. The script (saved here as seg_jieba.py; don't name the file jieba.py, or "import jieba" would import the script itself instead of the library):

#!/usr/bin/env python
#-*- coding:utf-8 -*-
import jieba
import jieba.analyse
import jieba.posseg as pseg
def cut_words(sentence):
    # jieba.cut yields unicode tokens; join them with spaces and re-encode as UTF-8
    return " ".join(jieba.cut(sentence)).encode('utf-8')
f = open("/home/xuanwei/工作/word2Vec/wiki.zh.text.jian")
target = open("/home/xuanwei/工作/word2Vec/wiki.zh.text.jian.seg", 'a+')
print 'open files:'
# readlines(100000) reads roughly 100000 bytes' worth of complete lines per call,
# so the corpus is segmented chunk by chunk instead of being loaded all at once
line = f.readlines(100000)
num_n = 0
while line:
    curr = []
    num_n += 1
    for online in line:
        curr.append(online)
    # segment every line in the current chunk and append the results
    after_cut = map(cut_words, curr)
    target.writelines(after_cut)
    print 'saved chunk %d' % num_n
    line = f.readlines(100000)
	
f.close()
target.close()
This step takes quite a long time...
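
To get a feel for what the segmenter does before committing to the full corpus, here is a tiny made-up example (the exact split depends on jieba's dictionary):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import jieba

# a single sentence, segmented the same way cut_words does it;
# the output looks something like: 我 爱 自然 语言 处理
print " ".join(jieba.cut(u"我爱自然语言处理"))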

4. Now we finally reach the most critical step: training.

Run: python train_word2vec_model.py wiki.zh.text.jian.seg wiki.zh.text.model wiki.zh.text.vector

Straight to the code (train_word2vec_model.py):

#!/usr/bin/env python
#-*- coding:utf-8 -*-
import logging
import os.path
import sys
import multiprocessing
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
    if len(sys.argv) < 4:
        print "Usage: python train_word2vec_model.py <seg_text> <model_out> <vector_out>"
        sys.exit(1)
    inp, outp1, outp2 = sys.argv[1:4]
    # train 400-dimensional vectors with a 5-word context window, dropping words
    # seen fewer than 5 times, using one worker thread per CPU core;
    # LineSentence streams the segmented corpus one line (= one article) at a time
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
    # save the full model for later reloading or continued training...
    model.save(outp1)
    # ...and the word vectors alone in the plain-text word2vec format
    model.save_word2vec_format(outp2, binary=False)
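
For later use, either output can be loaded back; with the pre-1.0 gensim API assumed throughout this post, that looks like:

from gensim.models import Word2Vec

# reload the full model (supports continued training)
model = Word2Vec.load("wiki.zh.text.model")
# or load just the exported plain-text vectors
model = Word2Vec.load_word2vec_format("wiki.zh.text.vector", binary=False)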
5. Finally, let's take a look at the model and try it out (Out[3] below shows Python 2's repr of unicode strings, hence the \u escapes; the In [5] loop prints the readable words):

In [1]: import gensim
 
In [2]: model = gensim.models.Word2Vec.load("wiki.zh.text.model")
 
In [3]: model.most_similar(u"足球")
Out[3]: 
[(u'\u8054\u8d5b', 0.6553816199302673),
 (u'\u7532\u7ea7', 0.6530429720878601),
 (u'\u7bee\u7403', 0.5967546701431274),
 (u'\u4ff1\u4e50\u90e8', 0.5872289538383484),
 (u'\u4e59\u7ea7', 0.5840631723403931),
 (u'\u8db3\u7403\u961f', 0.5560152530670166),
 (u'\u4e9a\u8db3\u8054', 0.5308005809783936),
 (u'allsvenskan', 0.5249762535095215),
 (u'\u4ee3\u8868\u961f', 0.5214947462081909),
 (u'\u7532\u7ec4', 0.5177896022796631)]
 
In [4]: result = model.most_similar(u"足球")
 
In [5]: for e in result:
   ....:     print e[0], e[1]
   ....:
联赛 0.65538161993
甲级 0.653042972088
篮球 0.596754670143
俱乐部 0.587228953838
乙级 0.58406317234
足球队 0.556015253067
亚足联 0.530800580978
allsvenskan 0.52497625351
代表队 0.521494746208
甲组 0.51778960228
That completes the word2vec training. word2vec maps each word to a vector, and those vectors carry information such as how related words are to one another.
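
To make that concrete, two more illustrative calls on the loaded model (same old-style gensim API as above; the printed numbers will of course depend on your training run):

# cosine similarity between two words
print model.similarity(u"足球", u"篮球")

# the raw vector learned for a word; its length matches size=400 from training
vec = model[u"足球"]
print len(vec)   # 400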

References:

1. 我爱自然语言处理 (52nlp): http://www.52nlp.cn/%E4%B8%AD%E8%8B%B1%E6%96%87%E7%BB%B4%E5%9F%BA%E7%99%BE%E7%A7%91%E8%AF%AD%E6%96%99%E4%B8%8A%E7%9A%84word2vec%E5%AE%9E%E9%AA%8C

2. http://codesky.me/archives/ubuntu-python-jieba-word2vec-wiki-tutol.wind


