Hyperparameters
train_word2vec(sentence_matrix, vocabulary_inv,
               num_features=300, min_word_count=1, context=10)
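The snippets below use names such as word2vec, np, exists, and split without showing their imports; a minimal set, assuming the usual gensim-based setup, would be:
import os
from os.path import exists, split
import numpy as np
from gensim.models import word2vec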
The function above involves five parameters:
- sentence_matrix: an integer matrix with one row per sentence; shorter sentences are padded to the maximum length, and each word is replaced by its index in the vocabulary (see the toy example after this list)
- vocabulary_inv: a dict {int: str} that maps each index back to the word it represents
- num_features: the dimensionality of the word vectors; 100-300 is generally recommended
- min_word_count: the minimum number of times a word must appear to be included in the vocabulary; higher values help reduce model size
- context: the context window size, i.e., how many surrounding words are used in training
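As a toy illustration (hypothetical data, with 0 as the pad token's index), the two inputs might look like this:
# Hypothetical toy inputs: two padded sentences of length 4
sentence_matrix = [[1, 2, 3, 0],
                   [1, 4, 3, 0]]
vocabulary_inv = {0: '<PAD/>', 1: 'i', 2: 'like', 3: 'cats', 4: 'love'}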
To call word2vec.Word2Vec we still need to set two more hyperparameters, both of which are fixed inside this function:
num_workers = 2       # Number of threads to run in parallel
downsampling = 1e-3   # Threshold for downsampling very frequent words
The reason vocabulary_inv is needed is that word2vec.Word2Vec trains on sentences made of words, not on an integer matrix.
# Convert each row of indices back into a list of words
sentences = [[vocabulary_inv[index] for index in sen] for sen in sentence_matrix]
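With the toy inputs above, sentences becomes [['i', 'like', 'cats', '<PAD/>'], ['i', 'love', 'cats', '<PAD/>']]; the pad token is simply treated as another word.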
Then the word vectors can be pre-trained with word2vec.Word2Vec:
# Initialize and train the model
embedding_model = word2vec.Word2Vec(sentences, size=num_features,
                                    window=context, min_count=min_word_count,
                                    sample=downsampling, workers=num_workers)
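Note that the size keyword is gensim's pre-4.0 API; in gensim >= 4.0 it was renamed to vector_size, so the equivalent call there would be:
# Equivalent call for gensim >= 4.0, where `size` became `vector_size`
embedding_model = word2vec.Word2Vec(sentences, vector_size=num_features,
                                    window=context, min_count=min_word_count,
                                    sample=downsampling, workers=num_workers)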
After training, save the pre-trained model:
if not exists(model_dir):
    os.mkdir(model_dir)  # create the model directory
print('Saving Word2Vec model \'%s\'' % split(model_name)[-1])  # print the model filename
embedding_model.save(model_name)  # save the model under model_name
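model_dir and model_name are not defined in this excerpt; they are set earlier in the function. A plausible sketch (hypothetical directory and naming scheme that encodes the hyperparameters so different configurations don't overwrite each other):
model_dir = 'models'  # hypothetical directory name
# hypothetical naming scheme: embed the hyperparameters in the filename
model_name = os.path.join(model_dir, '%dfeatures_%dminwords_%dcontext'
                          % (num_features, min_word_count, context))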
If such a model has already been trained before, it can simply be loaded:
if exists(model_name):
    embedding_model = word2vec.Word2Vec.load(model_name)
    print('Load existing Word2Vec model \'%s\'' % split(model_name)[-1])
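Putting the two branches together, a load-or-train skeleton (a sketch assembled from the snippets above) would be:
if exists(model_name):
    # Reuse the cached model
    embedding_model = word2vec.Word2Vec.load(model_name)
else:
    # Train from scratch and cache the result
    embedding_model = word2vec.Word2Vec(sentences, size=num_features,
                                        window=context, min_count=min_word_count,
                                        sample=downsampling, workers=num_workers)
    if not exists(model_dir):
        os.mkdir(model_dir)
    embedding_model.save(model_name)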
Whether a model was just trained or an existing one was loaded, the end goal is the trained word vectors.
# Randomly initialize vectors for words unknown to the model
embedding_weights = {key: embedding_model.wv[word] if word in embedding_model.wv
                          else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
                     for key, word in vocabulary_inv.items()}
return embedding_weights
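A typical next step (a sketch, assuming the vocabulary indices run contiguously from 0 to len(vocabulary_inv) - 1) is to stack the returned dict into a single matrix, e.g. for initializing an Embedding layer:
# Stack the per-index vectors into a (vocab_size, num_features) matrix
embedding_matrix = np.vstack([embedding_weights[i]
                              for i in range(len(vocabulary_inv))])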