Training an English word2vec word-vector model with gensim
Environment
- python==3.7
- gensim==4.0.1
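Documentation
Official gensim documentation: https://radimrehurek.com/gensim/models/word2vec.html
Preprocessing
Convert all text to lowercase, remove punctuation and symbols, remove stopwords, and tokenize.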
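The four preprocessing steps can be sketched with the standard library alone. This is a minimal illustration, not the pipeline used in this post: the function name `preprocess` and the tiny `EN_STOPWORDS` set are assumptions made for the example, and a real run would use a fuller stopword list.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# a real pipeline would use a fuller list.
EN_STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in"}


def preprocess(text, remove_stopwords=True):
    """Lowercase, drop punctuation/symbols, tokenize, optionally drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase ASCII letter or whitespace
    # with a space. Note that this also removes digits.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in EN_STOPWORDS]
    return tokens


tokens = preprocess("I LOVE you, and you like the Apple!")
print(tokens)
```

Keeping the steps in one function makes it easy to reuse the same preprocessing when writing a corpus file or inside a custom iterator.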
Building sentences
The resulting format looks like this:
[
    ["i", "love", "you"],
    ["you", "like", "apple"],
]
This can also be an iterable that yields one token list at a time rather than a list held in memory.
1. LineSentence
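To use LineSentence, first preprocess the data and write it to a txt file: one sentence per line, words separated by spaces, special characters and punctuation removed, everything lowercased. LineSentence returns an iterable over that file.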
from gensim.models.word2vec import LineSentence
sentences = LineSentence("../data/word2vec/corpus.txt")
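As a rough sketch of the file layout described above, the snippet below writes already-tokenized sentences to a text file, one sentence per line with tokens separated by spaces, and then reads them back by splitting on whitespace. The sentences and the temporary directory are placeholders for illustration; a file written this way could then be passed to LineSentence.

```python
import os
import tempfile

# Hypothetical, already-tokenized sentences used only for illustration.
tokenized = [
    ["i", "love", "you"],
    ["you", "like", "apple"],
]

# Write one sentence per line, tokens separated by single spaces.
corpus_dir = tempfile.mkdtemp()
corpus_path = os.path.join(corpus_dir, "corpus.txt")
with open(corpus_path, "w", encoding="utf-8") as f:
    for tokens in tokenized:
        f.write(" ".join(tokens) + "\n")

# Read the file back the way a line-based reader would:
# split each line on whitespace.
with open(corpus_path, encoding="utf-8") as f:
    round_trip = [line.split() for line in f]

print(round_trip)
```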
2. PathLineSentences
PathLineSentences differs from LineSentence in that it iterates over every file in a directory.
from gensim.models.word2vec import PathLineSentences
sentences_path = PathLineSentences("../data/word2vec")
3. Custom iterator
Preprocessing can be done inside the iterator.
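from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
import os

# en_stopwords is not defined in the original post; gensim's built-in English
# stopword list is used here as an assumption. Any set of lowercase words works.
en_stopwords = STOPWORDS


class Sentences(object):
    """
    Iterable that yields sentences in the format gensim expects.
    Preprocessing can be done inside this class.
    [
        ["i", "love", "you"],
        ["you", "like", "apple"],
    ]
    """

    def __init__(self, folder_path, remove_stopwords=True):
        self.folder_path = folder_path
        self.remove_stopwords = remove_stopwords

    def __iter__(self):
        for file_name in os.listdir(self.folder_path):
            # Open each file in a with-block so it is closed after reading.
            with open(os.path.join(self.folder_path, file_name), encoding="utf-8") as f:
                for line in f:
                    # simple_preprocess lowercases and tokenizes the line.
                    content = simple_preprocess(line)
                    if self.remove_stopwords:
                        content = [x for x in content if x not in en_stopwords]
                    yield content

    def __str__(self):
        return "Iterable that yields tokenized sentences"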
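The reason for wrapping the loop in a class with `__iter__`, rather than passing a bare generator, is that training reads the corpus more than once: one pass to build the vocabulary and further passes for the training epochs. The stdlib-only sketch below illustrates the difference; the names `RestartableSentences` and `one_shot` are hypothetical and exist only for this comparison.

```python
class RestartableSentences:
    """Iterable whose __iter__ starts a fresh pass every time it is called."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.lower().split()


def one_shot(lines):
    """Generator function: the returned generator can be consumed only once."""
    for line in lines:
        yield line.lower().split()


lines = ["I love you", "You like apple"]

restartable = RestartableSentences(lines)
first_pass = list(restartable)
second_pass = list(restartable)  # a second pass sees the full corpus again

gen = one_shot(lines)
gen_first = list(gen)
gen_second = list(gen)  # the generator is already exhausted, so this is empty

print(len(first_pass), len(second_pass), len(gen_first), len(gen_second))
```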
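Model training
from gensim.models import Word2Vec
sentences_my = Sentences("../data/word2vec")
# min_count=1 keeps every word, sg=1 selects skip-gram, hs=1 enables hierarchical softmax
model = Word2Vec(sentences=sentences_my, min_count=1, sg=1, hs=1)
model.save("w2v.bin")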
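# Word vectors (one row per vocabulary word)
# model.wv.vectors
# Get the vector for a single word
# model.wv.get_vector("love")
# Load a saved model
# model = Word2Vec.load("w2v.bin")
# Evaluation (requires: from gensim.test.utils import datapath)
# print(model.wv.evaluate_word_analogies(datapath('questions-words.txt')))
# print(model.wv.evaluate_word_pairs(datapath('wordsim353.tsv')))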