1. Corpus Processing
import jieba
jieba.suggest_freq('沙瑞金', True)
# keep proper names like 沙瑞金 from being split by the segmenter
...
with open("./in_the_name_of_people.txt", encoding="utf-8") as file:
    doc = file.read()
doc_cut = jieba.cut(doc)
res = " ".join(doc_cut)
with open("./cutcut.txt", 'w', encoding="utf-8") as wr:
    wr.write(res)
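The segmented text written above still contains punctuation tokens, which the next section says to avoid. A minimal pure-Python filter, as a sketch (the clean_tokens helper is our own name, not part of jieba), that drops punctuation-only tokens before writing the corpus:

```python
import re

# Tokens made up entirely of punctuation or whitespace carry no
# semantic signal for word2vec, so drop them before writing the corpus.
_PUNCT_ONLY = re.compile(r"^[\W_]+$", re.UNICODE)

def clean_tokens(tokens):
    """Keep only tokens containing at least one word character."""
    return [t for t in tokens if t.strip() and not _PUNCT_ONLY.match(t)]

print(clean_tokens(["沙瑞金", ",", "说", "。", "\n", "好"]))
# → ['沙瑞金', '说', '好']
```

Applied here, you would pass the output of jieba.cut through clean_tokens before joining with spaces.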
2. Model Training
sentences is best provided as a nested list of token lists, e.g. [['a', 'b', 'c', ...], ['c', 'd', ...], [...], ...]. Avoid letting punctuation marks end up in the training data.
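As a sketch of building that nested-list form (assuming sentences are delimited by Chinese end-of-sentence punctuation; to_sentences is a hypothetical helper, not a gensim function):

```python
import re

def to_sentences(segmented_text):
    """Split space-separated segmented text into a nested list of
    token lists, one inner list per sentence, dropping punctuation."""
    sentences = []
    for raw in re.split(r"[。!?!?]", segmented_text):
        # keep only tokens that contain at least one word character
        tokens = [t for t in raw.split() if re.search(r"\w", t)]
        if tokens:
            sentences.append(tokens)
    return sentences

print(to_sentences("沙瑞金 来 了 。 高育良 笑 了 !"))
# → [['沙瑞金', '来', '了'], ['高育良', '笑', '了']]
```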
The Word2Vec constructor's full API is in the gensim documentation; the parameters you usually set yourself are sg (choose CBOW or Skip-Gram), hs (training algorithm: hierarchical softmax or negative sampling), min_count (drop low-frequency words), window (context window size), and size (word-vector dimensionality; renamed vector_size in gensim 4.x).
import logging
from gensim.models import word2vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.LineSentence('./cutcut.txt')
# LineSentence expects one sentence per line, with tokens already separated by spaces
model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
# saving and loading the model
model.save("./people.model")
model = word2vec.Word2Vec.load("./people.model")
3. Testing the Model
Now put the trained word vectors to work with a few queries.
print(model.wv['沙瑞金'])
# print the raw word vector for 沙瑞金
print(model.wv.similarity('沙瑞金', '高育良'))
# cosine similarity between the two words' vectors
print(model.wv.similar_by_word('高育良', topn=2))
# the 2 words closest to 高育良 in the vector space
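wv.similarity is just the cosine similarity between the two stored vectors. A small NumPy sketch of what that computes (cosine is our own helper name, not a gensim function):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])  # parallel to v1
print(cosine(v1, v2))           # parallel vectors score ≈ 1.0
```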