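Google Code word2vec toolkit
The freshly downloaded toolkit contains the following files; you still need to run make in the trunk directory to compile it
Purpose of each file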
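word2vec.c: trains the CBOW/skip-gram model
distance.c: treats each word as a point in the vector space and computes the distance between points (it reports the cosine similarity between a query word and the other words)
word-analogy.c: analogy task (vector("Paris") - vector("France") + vector("UK") --> vector("London"))
word2phrase.c: phrase detection (for example "New York" is a phrase, and treating it as two separate words is clearly inappropriate)
compute-accuracy.c: quantitatively reports the accuracy of the word vectors on a word/phrase analogy dataset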
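To make the "distance between points" idea behind distance.c concrete, here is a minimal sketch of ranking words by cosine similarity. The three-dimensional vectors and the function names are made up for illustration; real vectors come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "king": [0.9, 0.1, 0.3],
    "queen": [0.8, 0.2, 0.4],
    "apple": [0.1, 0.9, 0.0],
}


def nearest(word, vectors):
    """Rank every other word by cosine similarity to `word`, best first."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, other in nearest("king", toy_vectors):
        print(f"{other}\t{score:.4f}")
```

A value close to 1 means the two words point in nearly the same direction; a value near 0 means they are unrelated in this space.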
----------------------------------------
demo-word.sh: trains word vectors [word2vec, distance]
demo-analogy.sh: finds analogous words [word2vec, word-analogy]
demo-classes.sh: k-means word clustering [word2vec] (each output line is a word followed by its class id)
demo-phrases.sh: finds phrases [word2phrase, word2vec, distance]
demo-word-accuracy.sh: computes analogy accuracy for words [word2vec, compute-accuracy]
demo-phrase-accuracy.sh: computes analogy accuracy for phrases [word2phrase, word2vec, compute-accuracy]
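Code and dataset download
Training
(1) First, take a look at the training data, text8
It is a block of plain English text. A Chinese corpus would have to be word-segmented first, because word2vec splits tokens on whitespace.
(2) Train word vectors with word2vec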
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./distance vectors.bin
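If you want to inspect vectors.bin outside the C tools, the sketch below reads the file written with -binary 1. It assumes the layout produced by word2vec.c: a text header "vocab_size dimension", then for each word the word itself, a space, the raw 32-bit floats, and a newline. It also assumes little-endian floats (true on x86). The function names are hypothetical, and the writer exists only so the example can round-trip a toy file.

```python
import os
import struct
import tempfile


def write_word2vec_binary(path, vectors):
    """Write {word: floats} in the assumed layout (helper for the demo only)."""
    dim = len(next(iter(vectors.values())))
    with open(path, "wb") as f:
        f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
        for word, vec in vectors.items():
            f.write(word.encode("utf-8") + b" ")
            f.write(struct.pack(f"<{dim}f", *vec))
            f.write(b"\n")


def read_word2vec_binary(path):
    """Return {word: tuple of floats} from a word2vec binary vector file."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch in (b" ", b""):  # a space ends the word; b"" is EOF
                    break
                if ch != b"\n":  # skip the newline left by the previous entry
                    chars.append(ch)
            payload = f.read(4 * dim)
            if len(payload) < 4 * dim:  # truncated file
                break
            word = b"".join(chars).decode("utf-8", errors="replace")
            vectors[word] = struct.unpack(f"<{dim}f", payload)
    return vectors


if __name__ == "__main__":
    toy = {"cat": (1.0, 2.0, 3.0), "dog": (0.5, -1.0, 0.25)}
    path = os.path.join(tempfile.mkdtemp(), "toy-vectors.bin")
    write_word2vec_binary(path, toy)
    print(read_word2vec_binary(path))
```

For the real file, call read_word2vec_binary("vectors.bin"); with -size 200 every vector should have 200 components.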
(3) Find analogous words
time ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
./word-analogy vectors.bin
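The analogy lookup can be sketched as follows: given three words a, b and c, form b - a + c on length-normalized vectors and return the remaining word whose vector is closest to the result. This mirrors what word-analogy.c does as far as I understand it; the toy vectors and function names are invented for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def analogy(a, b, c, vectors, topn=1):
    """Answer "a is to b as c is to ?" by nearest neighbour of b - a + c."""
    unit = {w: normalize(v) for w, v in vectors.items()}
    target = normalize(
        [y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])]
    )
    scored = [
        (sum(p * q for p, q in zip(target, vec)), w)
        for w, vec in unit.items()
        if w not in (a, b, c)  # the three query words are excluded
    ]
    scored.sort(reverse=True)
    return scored[:topn]


# Hypothetical toy vectors: axis 0/1 encode the country, axis 2 "capital-ness".
toy_vectors = {
    "france": [1.0, 0.0, 0.0],
    "paris": [1.0, 0.0, 1.0],
    "uk": [0.0, 1.0, 0.0],
    "london": [0.0, 1.0, 1.0],
    "banana": [0.5, -0.5, 0.7],
}

if __name__ == "__main__":
    print(analogy("france", "paris", "uk", toy_vectors))
```

With these toy vectors the query "france paris uk" should rank "london" first, since it is the only remaining word that combines the "uk" direction with the "capital" direction.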
(4) Word clustering
time ./word2vec -train text8 -output classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 500
sort classes.txt -k 2 -n > classes.sorted.txt
cat classes.sorted.txt | grep '135\|231\|444' | shuf | less
Each output line has the form: word class-id
(5) Phrase detection
sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" < news.2012.en.shuffled | tr -c "A-Za-z'_ \n" " " > news.2012.en.shuffled-norm0
time ./word2phrase -train news.2012.en.shuffled-norm0 -output news.2012.en.shuffled-norm0-phrase0 -threshold 200 -debug 2
time ./word2phrase -train news.2012.en.shuffled-norm0-phrase0 -output news.2012.en.shuffled-norm0-phrase1 -threshold 100 -debug 2
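As an alternative to sort and grep, the word/class-id pairs can be grouped in a few lines. This is a small sketch that assumes each line of classes.txt is a word and an integer class id separated by whitespace; the sample lines and the function name are made up.

```python
from collections import defaultdict


def group_by_class(lines):
    """Map each class id to the list of words assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        word, class_id = parts
        clusters[int(class_id)].append(word)
    return dict(clusters)


# Hypothetical sample lines in the "word class-id" format.
sample = ["paris 135", "london 135", "apple 231", "banana 231", "monday 444"]

if __name__ == "__main__":
    for class_id, words in sorted(group_by_class(sample).items()):
        print(class_id, " ".join(words))
```

For the real output, pass an open file instead of the sample list, for example group_by_class(open("classes.txt")).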
tr A-Z a-z < news.2012.en.shuffled-norm0-phrase1 > news.2012.en.shuffled-norm1-phrase1
time ./word2vec -train news.2012.en.shuffled-norm1-phrase1 -output vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
./distance vectors-phrase.bin
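The two word2phrase passes in step (5) decide whether to join adjacent words by comparing a bigram score with -threshold. The sketch below shows that score as I understand it from word2phrase.c: (count(ab) - min_count) / (count(a) * count(b)) * total_words, with a default min_count of 5. Treat the formula and default as assumptions to check against the source; all counts here are invented.

```python
def phrase_score(count_a, count_b, count_ab, total_words, min_count=5):
    """Bigram score; higher means the pair co-occurs more than chance suggests."""
    if count_a < min_count or count_b < min_count:
        return 0.0  # words that are too rare never form a phrase
    return (count_ab - min_count) / count_a / count_b * total_words


def is_phrase(count_a, count_b, count_ab, total_words, threshold):
    """True when the pair would be joined with an underscore (e.g. new_york)."""
    return phrase_score(count_a, count_b, count_ab, total_words) > threshold


if __name__ == "__main__":
    total = 1_000_000  # hypothetical corpus size in words
    # "new york": both words are moderately frequent and mostly occur together.
    print(phrase_score(1000, 500, 450, total))
    # "of the": a frequent pair, but the individual words are far more frequent.
    print(phrase_score(30000, 60000, 8000, total))
```

Running a first pass with a higher threshold (200) and a second with a lower one (100) lets the second pass join an already merged bigram with a neighboring word, which is how phrases longer than two words can form.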