gensim--word2vec
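1. Install the gensim package. Installing gensim only requires the command pip install gensim; wait a few minutes for it to finish.
2. Train Word2vec. First download a corpus:
Chinese Wikipedia corpus: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
English Wikipedia corpus: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Text8 corpus: http://mattmahoney.net/dc/text8.zip
3. Write the program (training here uses the text8.zip corpus):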
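from gensim.models import word2vec
import gensim
import logging
# Log training progress; the message field must be written as %(message)s
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Load the corpus
sentences = word2vec.Text8Corpus('text8.txt')
"""text8.txt is used here because the code kept raising UnicodeDecodeError when run,
so the file was converted to UTF-8 directly in Notepad.
I am not sure this is the right fix; the goal for now is to get things working, and I will look into it later."""
# size is the dimensionality of the word vectors (gensim 4.0 and later rename it to vector_size)
model = word2vec.Word2Vec(sentences, size=200)
# Save the trained model to disk
model.save('text.model')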
4. When the run finishes, a text.model file is generated in the same directory as text8.txt.
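The text8 file is one long line of space-separated words with no sentence boundaries, which is why a dedicated reader such as Text8Corpus is used: it cuts the word stream into fixed-length word lists that serve as "sentences" (gensim exposes this as the max_sentence_length parameter). The sketch below only illustrates that chunking idea in plain Python. The helper name chunk_tokens and the toy text are hypothetical, and this is not gensim's actual implementation.

```python
from itertools import islice

def chunk_tokens(tokens, max_sentence_length=10000):
    """Yield successive lists of at most max_sentence_length tokens."""
    iterator = iter(tokens)
    while True:
        chunk = list(islice(iterator, max_sentence_length))
        if not chunk:
            return
        yield chunk

# Toy stand-in for the single-line corpus file (hypothetical content).
toy_text = "anarchism originated as a term of abuse first used against"
sentences = list(chunk_tokens(toy_text.split(), max_sentence_length=4))
print(sentences)
```

Each yielded list plays the role of one training sentence, so the chunk length bounds how far a context window can reach.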
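5. Next, use the model to try some similarity computations (in gensim 4.0 and later these query methods are called on model.wv, for example model.wv.most_similar):
(1)model.most_similar(positive=['woman','man','kiss','love'],negative=['girl'],topn=5)
# find words close to woman, man, kiss, and love but not close to girl
Output: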
[('me',0.6618683338165283),
('lover', 0.6543294191360474),
('bride', 0.6529062986373901),
('lady', 0.6498624086380005),
('grace',0.6455775499343872)]
(2)model.similarity('woman','girl') # compute the similarity of two words
Output: 0.7331330069237494
(3)model.most_similar(['father','girl'],['boy'],topn=3) # find the corresponding relation: boy is to father as girl is to ?
Output:
[('mother',0.7856869697570801),
('grandmother', 0.722622275352478),
('wife',0.7174333930015564)]
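To make the positive/negative query less of a black box, here is a minimal sketch of the idea on hand-made 2-dimensional vectors: normalize each input vector, add the positive ones, subtract the negative ones, and rank the remaining words by cosine similarity to the result. The function toy_most_similar and the toy vectors are hypothetical illustrations, not gensim's code and not values from the trained model.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def toy_most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to the signed mean of the query vectors."""
    signed = [unit(vectors[w]) for w in positive] + [-unit(vectors[w]) for w in negative]
    query = unit(np.mean(signed, axis=0))
    exclude = set(positive) | set(negative)  # query words are never returned
    scores = [(w, float(np.dot(unit(v), query)))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embedding: axis 0 ~ gender, axis 1 ~ adulthood.
toy_vectors = {
    "boy": np.array([1.0, 0.0]),
    "girl": np.array([-1.0, 0.0]),
    "father": np.array([1.0, 1.0]),
    "mother": np.array([-1.0, 1.0]),
    "uncle": np.array([1.0, 0.8]),
    "table": np.array([0.0, -1.0]),
}
result = toy_most_similar(toy_vectors, ["father", "girl"], ["boy"], topn=3)
print(result[0][0])
```

On these toy vectors the top-ranked word is mother, mirroring the result above from the trained model.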
(4)model.doesnt_match("banana pear peach red".split()) # find the word that does not match the others
Output: 'red'
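One way to picture the odd-one-out query: average the unit-length vectors of all the words, then pick the word whose vector is least similar to that average. The sketch below does this on made-up 2-dimensional vectors. The name toy_doesnt_match and the vector values are hypothetical, and the snippet is only an approximation of the idea, not gensim's source.

```python
import numpy as np

def toy_doesnt_match(vectors, words):
    """Return the word whose unit vector is least similar to the group mean."""
    unit_vectors = np.array([vectors[w] / np.linalg.norm(vectors[w]) for w in words])
    mean = unit_vectors.mean(axis=0)
    mean = mean / np.linalg.norm(mean)
    similarities = unit_vectors @ mean  # cosine similarity of each word to the mean
    return words[int(np.argmin(similarities))]

# Hypothetical toy embedding: axis 0 ~ "fruit-ness", axis 1 ~ "color-ness".
toy_vectors = {
    "banana": np.array([1.0, 0.1]),
    "pear": np.array([0.9, 0.2]),
    "peach": np.array([1.0, 0.3]),
    "red": np.array([0.1, 1.0]),
}
odd_one = toy_doesnt_match(toy_vectors, "banana pear peach red".split())
print(odd_one)
```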
(5)model['red'] # the feature vector of red
Output:
array([-0.04533739,-0.6966845 , 0.4287293 , ...,-0.62464094,
-0.77876353, 0.32641527],dtype=float32)
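Note:
model.similarity(w1, w2) computes the cosine similarity between the vectors of the two words w1 and w2, using numpy's dot function:
import numpy as np
np.dot(gensim.matutils.unitvec(model[w1]),gensim.matutils.unitvec(model[w2]))
The documentation for gensim.matutils.unitvec reads: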
gensim.matutils.unitvec(vec, norm='l2')
Scale a vector to unit length. The only exception is the zero vector, which is returned back unchanged.
Output will be in the same format as input (i.e., gensim vector=>gensim vector, or np array=>np array, scipy.sparse=>scipy.sparse).
In other words, the vectors of w1 and w2 are first normalized to unit length, and then their dot product is taken.
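The reason this works is that the dot product of two unit-length vectors equals the cosine of the angle between them. A minimal numpy sketch, assuming plain dense arrays; the local unitvec below is a simplified stand-in for gensim.matutils.unitvec, and the example vectors are made up:

```python
import numpy as np

def unitvec(vec):
    """Scale a dense vector to unit L2 length; a zero vector is returned unchanged."""
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm

def cosine_similarity(a, b):
    """Dot product of the unit vectors, i.e. the cosine of the angle between a and b."""
    return float(np.dot(unitvec(a), unitvec(b)))

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, different length
c = np.array([-4.0, 3.0])  # perpendicular to a

same_direction = cosine_similarity(a, b)
perpendicular = cosine_similarity(a, c)
print(same_direction, perpendicular)
```

Because both vectors are normalized first, the score depends only on direction: a and b differ in length yet score about 1, while the perpendicular pair scores about 0.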