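1. Introduction to word2vec
(1) word2vec is a toolkit released by several Google researchers. It uses a neural network to learn a representation of each word in a continuous vector space.
(2) It contains two main models: continuous bag of words (CBOW) and skip-gram. CBOW predicts a word from its surrounding context words, while skip-gram predicts the context words from the center word.
(3) It offers two main methods for efficient training: negative sampling and hierarchical softmax.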
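
To make the difference between the two models concrete, here is a minimal sketch of how training examples could be built from one tokenized sentence. The function names skipgram_pairs and cbow_examples are hypothetical (they are not part of word2vec or gensim), and the sketch uses a fixed window, whereas real implementations typically also randomize the window size and subsample frequent words.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context_words, target) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:  # skip one-word sentences, which have no context
            examples.append((context, target))
    return examples


tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context grouped together.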
2. Implementation
In Python you need gensim, which can be installed directly with pip install gensim. Training is then a single call to gensim.models.Word2Vec().
import gensim
import pandas as pd
import numpy as np
vector_size = 100
def sentence2list(sentence):
    # Split a whitespace-separated, pre-segmented sentence into a list of tokens.
    return sentence.strip().split()

# Load the data
print("Reading data...")
train_data = pd.read_csv('./new_data/train_set.csv')
test_data = pd.read_csv('./new_data/test_set.csv')
# Drop the columns that are not used for training the word vectors.
train_data.drop(columns=['article', 'id'], inplace=True)
test_data.drop(columns=['article'], inplace=True)
print("Data read complete.")

# Prepare the data
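print("Preparing data...")
sentences_train = list(train_data.loc[:, 'word_seg'].apply(sentence2list))
sentences_test = list(test_data.loc[:, 'word_seg'].apply(sentence2list))
# Combine the train and test sentences into one training corpus.
sentences = sentences_train + sentences_test
print("Data preparation complete!")

print("Starting training...")
# gensim 4.x renamed size -> vector_size and iter -> epochs; on gensim 3.x use the old names.
# sg=0 selects CBOW (sg=1 would select skip-gram).
model = gensim.models.Word2Vec(sentences=sentences, vector_size=vector_size, window=5, min_count=5, workers=8, sg=0, epochs=5)
print("Training complete!")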
References:
1. https://www.zhihu.com/topic/19886836/top-answers
2. https://github.com/Heitao5200/DGB/blob/master/feature/feature_code/train_word2vec.py