Hyperparameters
train_word2vec(sentence_matrix, vocabulary_inv,
               num_features=300, min_word_count=1, context=10)
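The snippets below use names such as word2vec, np, exists, and split without showing their imports; a minimal set, assuming the usual gensim-based setup, would be:
import os
from os.path import exists, split
import numpy as np
from gensim.models import word2vec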
The function above involves five parameters:
- sentence_matrix: an integer matrix with one row per sentence; shorter sentences are padded to the maximum length, and each word is replaced by its index in the vocabulary (see the toy example after this list)
- vocabulary_inv: a dict {int: str} that maps each index back to the word it represents
- num_features: the dimensionality of the word vectors; 100-300 is generally recommended
- min_word_count: the minimum number of times a word must appear to be included in the vocabulary; higher values help reduce model size
- context: the context window size, i.e., how many surrounding words are used in training
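As a toy illustration (hypothetical data, with 0 as the pad token's index), the two inputs might look like this:
# Hypothetical toy inputs: two padded sentences of length 4
sentence_matrix = [[1, 2, 3, 0],
                   [1, 4, 3, 0]]
vocabulary_inv = {0: '<PAD/>', 1: 'i', 2: 'like', 3: 'cats', 4: 'love'}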
To call word2vec.Word2Vec we still need to set two more hyperparameters, both of which are fixed inside this function:
num_workers = 2       # Number of threads to run in parallel
downsampling = 1e-3   # Threshold for downsampling very frequent words
The reason vocabulary_inv is needed is that word2vec.Word2Vec trains on sentences made of words, not on an integer matrix.
# Convert each row of indices back into a list of words
sentences = [[vocabulary_inv[index] for index in sen] for sen in sentence_matrix]
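With the toy inputs above, sentences becomes [['i', 'like', 'cats', '<PAD/>'], ['i', 'love', 'cats', '<PAD/>']]; the pad token is simply treated as another word.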
Then the word vectors can be pre-trained with word2vec.Word2Vec:
# Initialize and train the model
embedding_model = word2vec.Word2Vec(sentences, size=num_features,
                                    window=context, min_count=min_word_count,
                                    sample=downsampling, workers=num_workers)
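Note that the size keyword is gensim's pre-4.0 API; in gensim >= 4.0 it was renamed to vector_size, so the equivalent call there would be:
# Equivalent call for gensim >= 4.0, where `size` became `vector_size`
embedding_model = word2vec.Word2Vec(sentences, vector_size=num_features,
                                    window=context, min_count=min_word_count,
                                    sample=downsampling, workers=num_workers)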
After training, save the pre-trained model:
if not exists(model_dir):
    os.mkdir(model_dir)  # create the model directory
print('Saving Word2Vec model \'%s\'' % split(model_name)[-1])  # print the model filename
embedding_model.save(model_name)  # save the model under model_name
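model_dir and model_name are not defined in this excerpt; they are set earlier in the function. A plausible sketch (hypothetical directory and naming scheme that encodes the hyperparameters so different configurations don't overwrite each other):
model_dir = 'models'  # hypothetical directory name
# hypothetical naming scheme: embed the hyperparameters in the filename
model_name = os.path.join(model_dir, '%dfeatures_%dminwords_%dcontext'
                          % (num_features, min_word_count, context))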
If such a model has already been trained before, it can simply be loaded:
if exists(model_name):
    embedding_model = word2vec.Word2Vec.load(model_name)
    print('Load existing Word2Vec model \'%s\'' % split(model_name)[-1])
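Putting the two branches together, a load-or-train skeleton (a sketch assembled from the snippets above) would be:
if exists(model_name):
    # Reuse the cached model
    embedding_model = word2vec.Word2Vec.load(model_name)
else:
    # Train from scratch and cache the result
    embedding_model = word2vec.Word2Vec(sentences, size=num_features,
                                        window=context, min_count=min_word_count,
                                        sample=downsampling, workers=num_workers)
    if not exists(model_dir):
        os.mkdir(model_dir)
    embedding_model.save(model_name)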
Whether a model was just trained or an existing one was loaded, the end goal is the trained word vectors.
# Randomly initialize vectors for words unknown to the model
embedding_weights = {key: embedding_model.wv[word] if word in embedding_model.wv
                          else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
                     for key, word in vocabulary_inv.items()}
return embedding_weights
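A typical next step (a sketch, assuming the vocabulary indices run contiguously from 0 to len(vocabulary_inv) - 1) is to stack the returned dict into a single matrix, e.g. for initializing an Embedding layer:
# Stack the per-index vectors into a (vocab_size, num_features) matrix
embedding_matrix = np.vstack([embedding_weights[i]
                              for i in range(len(vocabulary_inv))])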