References:
- https://github.com/maciejkula/glove-python
- https://blog.csdn.net/sinat_26917383/article/details/83029140
- https://blog.csdn.net/beilizhang/article/details/108175380
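Notes
This tutorial uses the glove_python package rather than Stanford's GloVe, because the former is a Python package and is more approachable.
Major pitfall
glove_python only supports Python up to 3.5; newer versions will not work.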
If your machine does not have Python 3.5, you can create a new environment for it with anaconda.
Then run:
pip install libpython
pip install glove_python
If necessary, use the Tsinghua mirror:
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple glove-python
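Preparing the data
This tutorial uses the training set of the small PTB (Penn Tree Bank) corpus, in which each line is one sentence. Read the file and split every line into tokens on whitespace:
with open('data/ptb/ptb.train.txt', 'r') as f:
    lines = f.readlines()
raw_dataset = [st.split() for st in lines]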
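To see what shape raw_dataset takes without needing the PTB file, here is a minimal sketch on an in-memory sample. The sample text and the decision to skip blank lines are assumptions made for illustration; the original code keeps blank lines as empty lists.

```python
import io

# Hypothetical stand-in for data/ptb/ptb.train.txt: one sentence per line.
sample_file = io.StringIO("the cat sat\n\nno it was n't black monday\n")

# str.split() with no argument splits on any whitespace and drops the newline.
# Blank lines are skipped here so they do not become empty sentences.
raw_dataset = [line.split() for line in sample_file if line.strip()]

print(raw_dataset)
```

The result is a list of token lists, which is the input format that the corpus fitting step below expects.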
Building the co-occurrence matrix
The window argument sets how many neighbouring words count as context for each word.
# construct a cooccurrence matrix from a corpus
corpus_model = Corpus()
corpus_model.fit(raw_dataset, window=10)
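To make the role of the window concrete, the following is a rough pure-Python sketch of windowed co-occurrence counting. It is not glove_python's implementation: the function name cooccurrence_counts is hypothetical, and the 1/distance weighting is taken from the GloVe paper, so treating it as what Corpus.fit does internally is an assumption.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=10):
    """Count word pairs that appear within `window` tokens of each other.

    Each pair is weighted by 1/distance (an assumption borrowed from the
    GloVe paper), and pairs are stored once in sorted order, so the result
    corresponds to a symmetric matrix.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right; the sorted key covers the left side.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                other = tokens[j]
                if other == word:
                    continue  # ignore a word co-occurring with itself
                pair = tuple(sorted((word, other)))
                counts[pair] += 1.0 / (j - i)
    return dict(counts)

toy_corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts)
```

A larger window adds more distant pairs, each with a smaller weight, so the matrix becomes denser while nearby words still dominate.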
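Building and training the model
When constructing the model you can choose the word-vector dimension (no_components) and the learning rate (learning_rate).
When training you can choose the number of epochs (epochs) and the number of threads (no_threads).
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus_model.matrix, epochs=10,
          no_threads=1, verbose=True)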
Finding similar words
Note that you must add the dictionary to the model first.
# Supply a word-id dictionary to allow similarity queries.
glove.add_dictionary(corpus_model.dictionary)
print(glove.most_similar('chip', number=10))
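Conceptually, a similarity query ranks the other words by how close their vectors are to the query word's vector. The sketch below does this with cosine similarity on made-up 2-dimensional vectors. The function names and the toy vectors are hypothetical, and it is an assumption that glove_python's most_similar behaves the same way in every detail.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, number=3):
    """Return the `number` words closest to `word`, best match first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:number]

# Made-up vectors, for illustration only.
toy_vectors = {
    "chip": [1.0, 0.0],
    "semiconductor": [0.9, 0.1],
    "processor": [0.7, 0.7],
    "banana": [0.0, 1.0],
}
neighbours = most_similar("chip", toy_vectors, number=2)
print(neighbours)
```

With real embeddings the dictionary of lists would be replaced by the trained vector matrix plus the word-to-id dictionary.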
Accessing the vector of any word
print(glove.word_vectors[glove.dictionary['chip']])
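Indexing the dictionary directly raises a KeyError for a word that never appeared in the corpus. A small guarded lookup avoids that. In this sketch lookup_vector is a hypothetical helper, and the plain dict and list stand in for glove.dictionary and glove.word_vectors so the example runs without the package.

```python
def lookup_vector(word, dictionary, word_vectors):
    """Return the vector for `word`, or None if the word is out of vocabulary."""
    index = dictionary.get(word)
    if index is None:
        return None
    return word_vectors[index]

# Stand-ins for glove.dictionary (word -> row id) and glove.word_vectors (rows).
toy_dictionary = {"chip": 0, "market": 1}
toy_word_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

print(lookup_vector("chip", toy_dictionary, toy_word_vectors))
print(lookup_vector("unseen-word", toy_dictionary, toy_word_vectors))
```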
Saving and loading
1) Saving and loading the GloVe model
glove.save('glove.model')
glove = Glove.load('glove.model')
2) Saving and loading the Corpus
corpus_model.save('corpus.model')
corpus_model = Corpus.load('corpus.model')
Complete code
#coding:utf-8
from glove import Glove
from glove import Corpus
with open('data/ptb/ptb.train.txt', 'r') as f:
    lines = f.readlines()
raw_dataset = [st.split() for st in lines]
# construct a cooccurrence matrix from a corpus
corpus_model = Corpus()
corpus_model.fit(raw_dataset, window=10)
#corpus_model.save('corpus.model')
print('Dict size: %s' % len(corpus_model.dictionary))
print('Collocations: %s' % corpus_model.matrix.nnz)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus_model.matrix, epochs=10,
          no_threads=1, verbose=True)
# Supply a word-id dictionary to allow similarity queries.
glove.add_dictionary(corpus_model.dictionary)
print(glove.most_similar('chip', number=10))
print(glove.word_vectors[glove.dictionary['chip']])
# save and load
glove.save('glove.model')
glove = Glove.load('glove.model')
corpus_model.save('corpus.model')
corpus_model = Corpus.load('corpus.model')