train_word2vec in Convolutional Neural Networks for Sentence Classification

HyperParameters

train_word2vec(sentence_matrix, vocabulary_inv,
               num_features=300, min_word_count=1, context=10)

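The call above takes five parameters:

  • sentence_matrix: an integer matrix with one row per sentence. Shorter sentences are padded to a common length, and each word is replaced by its value (index) in the vocabulary.
  • vocabulary_inv: a dict{int: str} that maps each index back to the word it stands for.
  • num_features: the dimensionality of the word vectors; 100-300 is the usual recommendation.
  • min_word_count: the minimum number of times a word must appear to be included in the samples. High values help reduce model size.
  • context: the context window size, i.e. how many words (the window size, in words) are used in each training step.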
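To make the two data structures concrete, here is a minimal sketch of how a `sentence_matrix` and a `vocabulary_inv` could be built from tokenized sentences. The padding token `<PAD/>`, the toy sentences, and the first-seen index order are assumptions made for illustration; the original preprocessing code is not shown in this post.

```python
# Hypothetical padding token and toy data, used only for illustration.
PAD = "<PAD/>"
raw_sentences = [["the", "movie", "was", "great"], ["terrible", "plot"]]

# Pad every sentence to the length of the longest one.
max_len = max(len(s) for s in raw_sentences)
padded = [s + [PAD] * (max_len - len(s)) for s in raw_sentences]

# Assign indices in first-seen order, with the padding token at index 0.
vocabulary = {}
for word in [PAD] + [w for s in padded for w in s]:
    vocabulary.setdefault(word, len(vocabulary))

# Inverse mapping: index -> word.
vocabulary_inv = {index: word for word, index in vocabulary.items()}

# One row of indices per sentence.
sentence_matrix = [[vocabulary[w] for w in s] for s in padded]

# Mapping the indices back recovers the padded word lists.
sentences = [[vocabulary_inv[i] for i in row] for row in sentence_matrix]
```

Here the second row of `sentence_matrix` ends in two padding indices, and `sentences` is identical to `padded`, which is the round trip that `train_word2vec` relies on.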

To call word2vec.Word2Vec we need to set two more hyperparameters. Both are fixed inside this function:

num_workers = 2 # Number of threads to run in parallel
downsampling = 1e-3 # Threshold to downsample frequent words
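As a rough illustration of what the downsampling threshold does, the sketch below evaluates the subsampling formula from the word2vec paper (Mikolov et al., 2013), where a word with relative corpus frequency f is discarded with probability 1 - sqrt(t / f). The function name `discard_probability` is hypothetical, and actual implementations (including gensim) may use a slightly different expression, so treat the numbers as indicative only.

```python
import math

def discard_probability(word_frequency, threshold=1e-3):
    """Paper-style subsampling: P(discard) = 1 - sqrt(t / f), clipped at 0.

    word_frequency is the word's share of all tokens in the corpus (0 < f <= 1).
    """
    return max(0.0, 1.0 - math.sqrt(threshold / word_frequency))

# A very frequent word (4% of all tokens) is dropped most of the time,
# while a word at or below the threshold frequency is never dropped.
p_frequent = discard_probability(0.04)
p_rare = discard_probability(1e-4)
```

With a threshold of 1e-3, only words that make up more than 0.1% of the corpus are affected under this formula.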

vocabulary_inv is needed because word2vec.Word2Vec trains on sentences (lists of words), not on an integer matrix.

# Convert each row of indices back into a list of words
sentences = [[vocabulary_inv[index] for index in sen] for sen in sentence_matrix]

word2vec.Word2Vec can then be used to pre-train the word vectors.

# Initialize and train the model
# Note: gensim >= 4.0 renamed the `size` argument to `vector_size`
embedding_model = word2vec.Word2Vec(sentences, size=num_features,
                                    window=context, min_count=min_word_count,
                                    sample=downsampling, workers=num_workers)

After training, save the pre-trained model:

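# Assumes: import os; from os.path import exists, split
if not exists(model_dir):
    os.mkdir(model_dir)  # create the model directory
print('Saving Word2Vec model \'%s\'' % split(model_name)[-1])  # print the model's file name
embedding_model.save(model_name)  # save the model to the path model_name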

If such a model has already been trained, it can be loaded directly instead of being trained again:

if exists(model_name):
    embedding_model = word2vec.Word2Vec.load(model_name)
    print('Load existing Word2Vec model \'%s\'' % split(model_name)[-1])
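The save and load snippets together form a "train once, reuse afterwards" pattern. The sketch below shows that control flow in isolation. To stay self-contained it uses `pickle` and a dummy training function in place of gensim's `save` and `Word2Vec.load`; `load_or_train`, `fake_train`, and the file name are hypothetical names introduced only for this illustration.

```python
import os
import pickle
import tempfile

def load_or_train(model_path, train_fn):
    """Return (model, status): load from model_path if it exists, else train and save."""
    if os.path.exists(model_path):
        with open(model_path, "rb") as f:
            return pickle.load(f), "loaded"
    model = train_fn()
    # Create the model directory if it is missing, then persist the model.
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    return model, "trained"

calls = []

def fake_train():
    # Stand-in for the expensive Word2Vec training step.
    calls.append(1)
    return {"hello": [0.1, 0.2]}

model_path = os.path.join(tempfile.mkdtemp(), "models", "toy_model.pkl")
first_model, first_status = load_or_train(model_path, fake_train)
second_model, second_status = load_or_train(model_path, fake_train)
```

The second call finds the file written by the first one, so the training function runs only once.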

Whether the model was just trained or loaded from disk, the end goal is the trained word vectors.

# Words unknown to the model get a randomly initialized vector
embedding_weights = {key: embedding_model.wv[word] if word in embedding_model.wv
                        else np.random.uniform(-0.25, 0.25, embedding_model.vector_size)
                     for key, word in vocabulary_inv.items()}
return embedding_weights
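The fallback logic in that dict comprehension can be tried without a trained model. In the sketch below, a plain dict named `trained_vectors` stands in for `embedding_model.wv`, Python lists stand in for NumPy arrays, and the vocabulary and vector size are made up; all of these are assumptions for illustration.

```python
import random

random.seed(0)   # fixed seed so the sketch is repeatable
vector_size = 4  # hypothetical; the function above uses num_features

# Stand-in for embedding_model.wv: only two words have trained vectors.
trained_vectors = {"movie": [0.1] * vector_size, "great": [0.2] * vector_size}
vocabulary_inv = {0: "<PAD/>", 1: "movie", 2: "great", 3: "zyzzyva"}

def random_vector(size, low=-0.25, high=0.25):
    """Uniform random vector, mirroring np.random.uniform(-0.25, 0.25, size)."""
    return [random.uniform(low, high) for _ in range(size)]

# Known words keep their trained vector; every other index gets a random one.
embedding_weights = {
    key: trained_vectors[word] if word in trained_vectors else random_vector(vector_size)
    for key, word in vocabulary_inv.items()
}
```

Every index in `vocabulary_inv` ends up with a vector of the same length, so the result can later be arranged into a single embedding matrix indexed by word id.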