Options:
Parameters for training:
-train <file>
Use text data from <file> to train the model
-output <file>
Use <file> to save the resulting word vectors / word clusters
-size <int>
Set size of word vectors; default is 100
-window <int>
Set max skip length between words; default is 5
-sample <float>
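Subsampling threshold: the more frequent a word is in the training data, the more likely each of its occurrences is to be dropped. This applies with both -hs and -negative.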
Set threshold for occurrence of words. Those that appear with higher frequency in the training data
will be randomly down-sampled; default is 1e-3, useful range is (0, 1e-5)
-hs <int>
Use Hierarchical Softmax; default is 0 (not used)
-negative <int>
Number of negative examples; default is 5, common values are 3 - 10 (0 = not used)
-threads <int>
Use <int> threads (default 12)
-iter <int>
Run more training iterations (default 5)
-min-count <int>
Minimum word frequency.
This will discard words that appear fewer than <int> times; default is 5
-alpha <float>
Learning rate (gradient descent step size).
Set the starting learning rate; default is 0.025 for skip-gram and 0.05 for CBOW
-classes <int>
Number of clusters.
Output word classes rather than word vectors; default number of classes is 0 (vectors are written)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-binary <int>
Save the resulting vectors in binary mode; default is 0 (off)
-save-vocab <file>
The vocabulary will be saved to <file>
-read-vocab <file>
The vocabulary will be read from <file>, not constructed from the training data
-cbow <int>
Use the continuous bag of words model; default is 1 (use 0 for skip-gram model)
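The effect of the -sample threshold listed above is easier to see with numbers. The sketch below assumes the keep-probability expression used in the reference word2vec.c, (sqrt(f/t) + 1) * t/f, where f is the word's share of all training tokens and t is the -sample value. keep_prob is a hypothetical helper written for illustration, not part of word2vec.

```shell
# keep_prob COUNT TOTAL SAMPLE
# Prints the probability that a single occurrence of a word is kept.
#   COUNT  = occurrences of the word in the corpus
#   TOTAL  = total number of training tokens
#   SAMPLE = value passed to -sample
# Assumed formula (from word2vec.c): (sqrt(f/t) + 1) * t/f, capped at 1.
keep_prob() {
  awk -v cn="$1" -v total="$2" -v t="$3" 'BEGIN {
    f = cn / total
    p = (sqrt(f / t) + 1) * t / f
    if (p > 1) p = 1
    printf "%.4f\n", p
  }'
}

keep_prob 50000 1000000 1e-3   # word making up 5% of tokens: prints 0.1614
keep_prob 500 1000000 1e-3     # word making up 0.05% of tokens: prints 1.0000
```

Under this assumption, a word that accounts for 5% of the corpus keeps only about one occurrence in six, while a word below the threshold is never dropped.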
Examples:
//-negative + cbow + .txt
nohup ./word2vec -train data.txt -output vec.txt -size 200 -window 5 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 1 -iter 3 &
//-hs + cbow + .bin
nohup ./word2vec -train data.txt -output vec.bin -size 150 -window 5 -negative 0 -hs 1 -binary 1 -cbow 1 -iter 20 &
//kmeans
./word2vec -train out.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
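The first example above writes vec.txt in text mode (-binary 0). In that mode the first line holds the vocabulary size and the vector dimension, and each following line holds a word followed by its vector components. A minimal sketch for sanity-checking such a file is shown here; vec_header and check_vec_file are hypothetical helper names, not part of word2vec.

```shell
# vec_header FILE
# Prints the vocabulary size and vector dimension from the first line.
vec_header() {
  head -n 1 "$1" | awk '{ printf "words=%s dim=%s\n", $1, $2 }'
}

# check_vec_file FILE
# Reports every vector line whose field count is not dim + 1
# (one word plus dim numbers). Exit status is non-zero if any line is bad.
check_vec_file() {
  awk 'NR == 1 { dim = $2; next }
       NF != dim + 1 {
         printf "line %d: expected %d fields, got %d\n", NR, dim + 1, NF
         bad = 1
       }
       END { exit bad }' "$1"
}
```

For example, running vec_header vec.txt after the first command should report dim=200, matching -size 200. These helpers do not apply to files written with -binary 1.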
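Parameters that work reasonably well for a 2 GB corpus:
-negative 5 -window 5 -cbow 0 -size 150 -iter 15 -hs 0 (all other options at their defaults)
-hs 0 means hierarchical softmax is not used, so negative sampling is used instead; -negative defaults to 5 (5 negative words are sampled).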
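The -iter value also determines how the learning rate set by -alpha decays, since -alpha is only the starting rate. The sketch below assumes the linear schedule used in the reference word2vec.c: the rate falls in proportion to the fraction of (iter * training words) already processed, with a floor at 0.0001 times the starting value. current_alpha is a hypothetical helper name used only for illustration.

```shell
# current_alpha START SEEN ITER TRAIN_WORDS
# Prints the learning rate after SEEN words have been processed in total.
#   START       = value passed to -alpha
#   ITER        = value passed to -iter
#   TRAIN_WORDS = number of training tokens in the corpus
# Assumed schedule: START * (1 - SEEN / (ITER * TRAIN_WORDS + 1)),
# never lower than START * 0.0001.
current_alpha() {
  awk -v a0="$1" -v seen="$2" -v iter="$3" -v n="$4" 'BEGIN {
    a = a0 * (1 - seen / (iter * n + 1))
    if (a < a0 * 0.0001) a = a0 * 0.0001
    printf "%.7f\n", a
  }'
}

current_alpha 0.025 0 5 1000000         # start of training: prints 0.0250000
current_alpha 0.025 2500000 5 1000000   # halfway through: prints 0.0125000
current_alpha 0.025 5000000 5 1000000   # end of training: prints 0.0000025
```

Under this assumption, raising -iter stretches the same decay over more passes, so the rate stays higher for longer in absolute terms.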
Clustering example:
nohup ./word2vec -train out.txt -output classes.txt -cbow 0 -size 200 -window 5 -hs 0 -sample 1e-3 -threads 12 -classes 200 -iter 15 &
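With -classes set, the output file contains one "word class_id" pair per line instead of vectors. A minimal sketch for inspecting such a file follows; words_in_class and class_sizes are hypothetical helper names, and classes.txt is the output name used in the commands above.

```shell
# words_in_class FILE ID
# Prints every word assigned to cluster ID.
words_in_class() {
  awk -v id="$2" '$2 == id { print $1 }' "$1"
}

# class_sizes FILE
# Prints "class_id count" for every cluster, ordered by class id.
class_sizes() {
  awk '{ n[$2]++ } END { for (c in n) print c, n[c] }' "$1" | sort -n
}
```

For example, words_in_class classes.txt 7 lists the members of cluster 7. To get the whole file grouped by cluster, sort -k2 -n classes.txt > classes.sorted.txt is enough.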