1. Training word2vec
# Train the word vectors
import gensim
import json
import pickle
import numpy as np

w2v_data = []
for i in train_df['string'].values:
    w2v_data.append(i.split())
for i in test_df['string'].values:
    w2v_data.append(i.split())
# sentences is list-of-lists data; note: size/iter are the gensim < 4.0 parameter names
model_word2vec = gensim.models.Word2Vec(sentences=w2v_data, size=100, window=5, min_count=1, workers=8, sg=0, iter=5)
wv = model_word2vec.wv
vocab_list = wv.index2word
word_idx_dict = {}
for idx, word in enumerate(vocab_list):
    word_idx_dict[word] = idx + 1  # index 0 is reserved for padding
vectors_arr = wv.vectors
vectors_arr = np.concatenate((np.zeros(100)[np.newaxis, :], vectors_arr), axis=0)  # the vector at position 0 represents padding
f_vectors = open('./word_seg_vectors_arr.pkl', 'wb')  # used as the Embedding layer's weights when training a neural network
pickle.dump(vectors_arr, f_vectors)
f_vectors.close()
with open(r'word2idx_vec.json', 'w') as f:  # save the word -> index mapping
    json.dump(word_idx_dict, f)
model_word2vec.save('word2vec.model')  # save the model file
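A minimal sketch of how the saved word-to-index map and the zero padding row are typically consumed (toy vocabulary and random vectors here, standing in for the real artifacts above; `sentence_to_ids` is a hypothetical helper): sentences become index sequences padded with 0, and those indices select rows of vectors_arr in an Embedding layer.

```python
import numpy as np

# Toy stand-ins for the artifacts saved above (assumed values, not real data)
word_idx_dict = {'apple': 1, 'banana': 2, 'cherry': 3}  # index 0 is reserved for padding
vectors_arr = np.vstack([np.zeros(4), np.random.rand(3, 4)])  # row 0 = padding vector

def sentence_to_ids(sentence, word_idx_dict, max_len):
    """Map tokens to indices (unknown words dropped), then pad with 0 up to max_len."""
    ids = [word_idx_dict[w] for w in sentence.split() if w in word_idx_dict]
    return ids[:max_len] + [0] * (max_len - len(ids))

ids = sentence_to_ids('apple banana durian', word_idx_dict, max_len=5)
# 'durian' is out of vocabulary, so ids == [1, 2, 0, 0, 0]
embedded = vectors_arr[ids]  # what an Embedding layer lookup would return, shape (5, 4)
```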
2. Using the word vectors
2.1 Averaging multiple word vectors
word2vec_mean = [np.mean(model_word2vec[filter(lambda x: x in model_word2vec.wv.vocab.keys(), i.split())], axis=0) for i in text_data]
word2vec_mean = np.array(word2vec_mean)
First, be clear that model_word2vec can look up vectors for several words at once: the result has shape n*dim, where n is the number of words and dim is the vector dimension. For example, model_word2vec['apple', 'banana'] with a vector dimension of 100 yields a 2*100 array.
Using the higher-order function filter(f, x), for a single sentence we keep only in-vocabulary words: filter(lambda x: x in model_word2vec.wv.vocab.keys(), i.split())
Since the lookup above returns the vectors of several words, averaging them gives the sentence vector: np.mean(model_word2vec[filter(lambda x: x in model_word2vec.wv.vocab.keys(), i.split())], axis=0)
Since the input is a list of sentences, just wrap this in a list comprehension: [np.mean(model_word2vec[filter(lambda x: x in model_word2vec.wv.vocab.keys(), i.split())], axis=0) for i in text_data]
After summarizing, visualize each step concretely in your head.
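The filter-then-average step can be visualized with plain NumPy (a toy word-to-vector table stands in for model_word2vec here; the names and 4-dimensional vectors are illustrative assumptions):

```python
import numpy as np

# Toy word -> vector table standing in for the trained model (dim = 4 here, 100 above)
wv = {'apple': np.array([1., 0., 0., 0.]),
      'banana': np.array([0., 1., 0., 0.])}

text_data = ['apple banana', 'apple durian']  # 'durian' is out of vocabulary

# For each sentence: keep in-vocabulary words, stack their vectors, average along axis 0
word2vec_mean = np.array([
    np.mean([wv[w] for w in sent.split() if w in wv], axis=0)
    for sent in text_data
])
# Row 0 averages two vectors; row 1 keeps only the single in-vocabulary word
```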
3. Other notes
3.1 Viewing the words in a word2vec model:
model.wv.vocab
Since gensim 1.0, model.wv.vocab replaced model.vocab, so using model.vocab raises: AttributeError: 'Word2Vec' object has no attribute 'vocab'
3.2 unable to import 'smart_open.gcs', disabling that module
pip install smart_open==1.10.0