Word Vectors

The theory behind word2vec is already covered in detail elsewhere online, so this post sticks to code. It shows two ways to turn words into vectors:
1. One-hot encoding
2. word2vec

1. One-hot encoding
from sklearn.preprocessing import LabelEncoder

# Note: LabelEncoder maps each word to an integer label, not a true one-hot
# vector; chain it with sklearn's OneHotEncoder if you need binary vectors.
label_encoder = LabelEncoder()
# A plain list works as input just as well as a pandas column
word_vector = label_encoder.fit_transform(df['column_name'].values)
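A runnable sketch of the idea (the DataFrame contents and column name here are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data standing in for the real column
df = pd.DataFrame({'word': ['cat', 'dog', 'cat', 'bird']})

# LabelEncoder: each word becomes an integer (classes sorted alphabetically)
labels = LabelEncoder().fit_transform(df['word'].values)
print(labels)  # [1 2 1 0]  (bird=0, cat=1, dog=2)

# OneHotEncoder: each word becomes a binary indicator vector
onehot = OneHotEncoder().fit_transform(df[['word']]).toarray()
print(onehot.shape)  # (4, 3): 4 rows, 3 distinct words
```

Note that the integer labels imply an ordering (bird < cat < dog) that a model may exploit spuriously, which is why true one-hot vectors are usually preferred for nominal features.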
2、word2vec
import numpy as np
import gensim

# Word2Vec expects tokenized sentences (a list of token lists), not raw strings
X_train = [['word1', 'word2'], ['word3', 'word4']]
# The trained model maps each word to a vector of length `size`
# (note: gensim 4.x renamed this parameter to `vector_size`)
word2vec = gensim.models.Word2Vec(X_train, min_count=2, window=5, size=30)
def sent2vec(words):
    # words: a list of tokens
    vectors = []
    for w in words:
        try:
            vectors.append(word2vec.wv[w])
        except KeyError:
            # Skip words not in the vocabulary
            continue
    # At this point the array is (n, size), where n is the number of
    # in-vocabulary words
    vectors = np.array(vectors)
    # Collapse (n, size) -> (size,): sum each column, then L2-normalize
    v = vectors.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
References:
https://www.kesci.com/home/project/5cbd99578c90d7002c81b52c