GitHub repository:
https://github.com/stanfordnlp/GloVe/tree/master/src
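GloVe is a model with much the same purpose as word2vec. It combines sentence-level (local context) information with global corpus statistics, aiming for better representations of both semantics and syntax. Below we look at the GloVe model purely from a usage perspective.
Model goal: represent words as vectors such that the vectors capture as much semantic and syntactic information as possible.
- Input: a corpus
- Output: word vectors
- Method overview: first build a word-word co-occurrence matrix from the corpus, then learn the word vectors from that matrix with the GloVe model.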
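The first step of the method overview can be illustrated with a minimal Python sketch. It is not the code of the cooccur tool: the function name build_cooccurrence, the tiny example sentence, and the choice to drop out-of-vocabulary tokens before windowing are assumptions made for illustration. The 1/d weighting of pairs that are d words apart follows the description in the GloVe paper.

```python
from collections import Counter, defaultdict


def build_cooccurrence(tokens, window_size=2, min_count=1):
    """Return (vocab, cooccur) for a list of tokens.

    vocab maps word -> integer id (most frequent word gets id 0);
    cooccur maps (id_i, id_j) -> weighted co-occurrence count.
    """
    counts = Counter(tokens)
    # Keep words that occur at least min_count times, most frequent first.
    kept = [w for w, c in counts.most_common() if c >= min_count]
    vocab = {w: i for i, w in enumerate(kept)}

    # Assumption of this sketch: out-of-vocabulary tokens are dropped
    # before windowing, so they do not add to the distance between words.
    ids = [vocab[t] for t in tokens if t in vocab]

    cooccur = defaultdict(float)
    for pos, center in enumerate(ids):
        # Look back at most window_size tokens and count each pair
        # in both directions, which gives a symmetric matrix.
        for dist in range(1, window_size + 1):
            if pos - dist < 0:
                break
            context = ids[pos - dist]
            weight = 1.0 / dist  # pairs d words apart contribute 1/d
            cooccur[(center, context)] += weight
            cooccur[(context, center)] += weight
    return vocab, dict(cooccur)


tokens = "the cat sat on the mat".split()
vocab, cooccur = build_cooccurrence(tokens, window_size=2)
```

The result is a sparse dictionary rather than a dense matrix, because most word pairs in a real corpus never co-occur.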
To train your own GloVe vectors, first prepare your corpus as a single text file with all words separated by a single space. If your corpus has multiple documents, simply concatenate the documents with a single space. If your documents are particularly short, padding the gap between documents with e.g. 5 "dummy" words may produce better vectors. Once you have created your corpus, you can train GloVe vectors using the following 4 tools. An example is included in demo.sh, which you can modify as necessary.
The four main tools in this package are:
1) vocab_count
This tool requires an input corpus that already consists of whitespace-separated tokens; use something like the Stanford Tokenizer on raw text first. It constructs unigram counts from the corpus, and optionally thresholds the resulting vocabulary based on total vocabulary size or minimum frequency count.
2) cooccur
Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.
3) shuffle
Shuffles the binary file of cooccurrence statistics produced by cooccur. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.
4) glove
Trains the GloVe model on the specified cooccurrence data, which will typically be the output of the shuffle tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running ./build/glove.
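The corpus preparation described above (one text file, single-space separation, optional dummy padding between short documents) and the counting done by vocab_count can be sketched in Python as follows. This is only an illustration under assumptions: the helper names build_corpus and count_vocab, the dummy token "<pad>", and the sample documents are hypothetical and do not come from the GloVe package.

```python
from collections import Counter


def build_corpus(documents, pad_tokens=0, dummy="<pad>"):
    """Join documents into one single-space-separated string.

    Runs of whitespace inside each document are collapsed to one space.
    If pad_tokens > 0, that many dummy tokens are placed between documents.
    """
    cleaned = [" ".join(doc.split()) for doc in documents]
    cleaned = [doc for doc in cleaned if doc]  # drop empty documents
    if pad_tokens > 0:
        separator = " " + " ".join([dummy] * pad_tokens) + " "
    else:
        separator = " "
    return separator.join(cleaned)


def count_vocab(corpus, min_count=1, max_vocab=None):
    """Return (word, count) pairs, most frequent first.

    Mirrors the two thresholds mentioned for vocab_count:
    a minimum frequency and a maximum vocabulary size.
    """
    counts = Counter(corpus.split())
    items = [(w, c) for w, c in counts.most_common() if c >= min_count]
    return items[:max_vocab] if max_vocab else items


docs = ["The  cat sat.\n", "A dog\tbarked .", ""]
corpus = build_corpus(docs, pad_tokens=2)
vocab = count_vocab(corpus, min_count=1)
```

Note that the sketch does no tokenization: "sat." stays a single token, which is why the text above recommends running a tokenizer on raw text first.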
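After downloading the source code, one change is needed in the Makefile: replace -Ofast with -O2 in CFLAGS, i.e. change
CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
to
CFLAGS = -lm -pthread -O2 -march=native -funroll-loops -Wno-unused-result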
The concrete commands executed by demo.sh:
- ./vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
- ./cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
- ./shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
- ./glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
- octave -nodisplay -nodesktop -nojvm -nosplash < ./eval/read_and_evaluate.m 1>&2
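Once the glove step has finished, the learned vectors can be inspected from Python. The sketch below assumes the vectors were saved in text form, one word per line followed by its space-separated components; whether a text file is written, and under what name, depends on $BINARY and $SAVE_FILE. The helper names load_vectors and cosine and the three sample vectors are made up for illustration.

```python
import io
import math


def load_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            # Skip blank lines and a possible two-field header line
            # (assumption: real vectors have at least two components).
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample data standing in for a real vectors file.
sample = io.StringIO(
    "king 0.5 0.5 0.0\n"
    "queen 0.5 0.4 0.1\n"
    "apple -0.3 0.1 0.9\n"
)
vectors = load_vectors(sample)
```

With a real output file, pass an open file handle instead of the StringIO object, for example open("vectors.txt", encoding="utf-8") if that is the name your run produced.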
Corpus download
Reference: https://nlp.stanford.edu/pubs/glove.pdf
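For orientation when choosing the -x-max option of the glove command, here is the weighted least-squares objective as given in the paper referenced above. The notation is the paper's, not this post's: X_ij is the co-occurrence count of words i and j, V is the vocabulary size, w and w-tilde are the word and context vectors, and b and b-tilde are their biases. The paper reports using alpha = 3/4.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function f caps the influence of very frequent pairs: every pair with a count of x_max or more gets the same weight of 1.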