https://blog.csdn.net/lilong117194/article/details/82849054
https://www.jiqizhixin.com/articles/2018-05-15-10
https://zhuanlan.zhihu.com/p/40016964
According to Andrew Ng's video: for a small corpus, CBOW is recommended; for a large corpus, use skip-gram.
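The two architectures differ in how training examples are built from a sliding window: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its whole context. A minimal pure-Python toy (not gensim internals, just an illustration of the pair construction):

```python
def skipgram_pairs(tokens, window=2):
    # skip-gram: one (center, context) pair per context word
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBOW: one (context_list, center) example per position
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

sent = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(sent, window=1))
print(cbow_examples(sent, window=1))
```

Note that skip-gram generates one training example per (center, context) pair, which is one intuition for why it can squeeze more out of a small corpus, while CBOW averages the context and trains faster on large corpora.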
word2vec training parameters:
min_count: minimum word frequency
min_count is for pruning the internal dictionary: words that occur fewer than min_count times are dropped from the vocabulary. The default value is min_count=5.
model = gensim.models.Word2Vec(sentences, min_count=10)
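A small pure-Python sketch of what min_count pruning amounts to (illustrative only, not gensim's actual implementation):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    # Count every token, then keep only words seen at least min_count times.
    counts = Counter(w for sent in sentences for w in sent)
    return {w: c for w, c in counts.items() if c >= min_count}

sentences = [["cat", "sat"], ["cat", "ran"], ["dog", "sat"]]
print(build_vocab(sentences, min_count=2))  # rare words "ran" and "dog" are pruned
```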
size: word vector dimensionality (renamed to vector_size in gensim 4.x)
Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
workers: number of worker threads (default 3), used for training parallelization, to speed up training:
The workers parameter only has an effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).
Memory: memory usage
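Per the gensim tutorial's rule of thumb, the model keeps roughly three matrices, each vocab_size × size single-precision (4-byte) floats, so memory can be estimated up front. A quick sketch of that arithmetic:

```python
def estimated_bytes(vocab_size, size, n_matrices=3, bytes_per_float=4):
    # Rule of thumb from the gensim tutorial: ~3 matrices of
    # vocab_size x size single-precision floats.
    return vocab_size * size * bytes_per_float * n_matrices

# e.g. 100,000 vocabulary words with 200-dimensional vectors:
gb = estimated_bytes(100_000, 200) / 1024**3
print(f"~{gb:.2f} GB")
```

This is why min_count matters for memory as well as quality: pruning the vocabulary directly shrinks every matrix.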
Evaluating
model.accuracy('./datasets/questions-words.txt')
Signature: accuracy(questions, restrict_vocab=30000, most_similar=None, case_insensitive=True)
(In gensim 4.x, accuracy was replaced by KeyedVectors.evaluate_word_analogies.)
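The questions-words evaluation poses analogies like "man : king :: woman : ?" and checks whether vector arithmetic recovers the answer. A pure-Python toy of that test with hand-crafted 2-D vectors (illustrative only; real evaluations use the trained embeddings):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(vectors, a, b, c):
    # Find d such that vec(b) - vec(a) + vec(c) is closest to vec(d),
    # excluding the three query words themselves (as the analogy test does).
    target = [bb - aa + cc for aa, bb, cc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Tiny hand-made toy vectors (illustrative only):
vecs = {
    "man":   [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king":  [3.0, 0.1],
    "queen": [3.0, 1.1],
    "apple": [0.0, 3.0],
}
print(solve_analogy(vecs, "man", "king", "woman"))  # expected: "queen"
```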
Online training / Resuming training
Training loss computation (compute_loss=True/False)
The parameter compute_loss can be used to toggle computation of loss while training the Word2Vec model. The computed loss is stored in the model attribute running_training_loss and can be retrieved with the method get_latest_training_loss.
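A minimal pure-Python sketch of that pattern (a toy trainer, not gensim itself; the attribute and method names merely mirror the ones the gensim docs describe):

```python
class ToyTrainer:
    # Toy illustration of the compute_loss pattern: accumulate a running
    # loss only when tracking was requested at construction time.
    def __init__(self, compute_loss=False):
        self.compute_loss = compute_loss
        self.running_training_loss = 0.0

    def train_step(self, loss):
        if self.compute_loss:
            self.running_training_loss += loss

    def get_latest_training_loss(self):
        return self.running_training_loss

trainer = ToyTrainer(compute_loss=True)
for step_loss in [1.0, 0.5, 0.5]:
    trainer.train_step(step_loss)
print(trainer.get_latest_training_loss())  # 2.0
```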
For details, see:
https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.accuracy
These pages are clearly written; just search within them.