There is not much material available on gensim's doc2vec, so this post records some exploratory experiments based on the official API. It covers training a model with gensim's doc2vec, inferring vectors for new documents, and inferring similarities. Some parts are still rough and will be improved later.
Import modules
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

import gensim, logging
import os
import jieba

# logging information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Read the file
# get input file, text format
f = open('trainingdata.txt', 'r')
input = f.readlines()
count = len(input)
print count
Preprocess the file: word segmentation, etc.
# read file and separate words
output = open('output.seq', 'w')  # segmented text, one document per line
alldocs = []                      # for the sake of check, can be removed
count = 0                         # for the sake of check, can be removed
for line in input:
    line = line.strip('\n')
    seg_list = list(jieba.cut(line))  # materialize the generator so it can be used twice
    output.write(' '.join(seg_list) + '\n')
    alldocs.append(gensim.models.doc2vec.TaggedDocument(seg_list, [count]))  # tags must be a list
    count += 1                    # for the sake of check, can be removed
output.close()
Choosing a model
gensim Doc2Vec provides two models, DM and DBOW. The gensim documentation recommends training over the dataset multiple times, either decaying the learning rate on each pass or shuffling the order of the input, to get the best results.
# PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
Doc2Vec(sentences, dm=1, dm_concat=1, size=100, window=2, hs=0, min_count=2, workers=cores)

# PV-DBOW
Doc2Vec(sentences, dm=0, size=100, hs=0, min_count=2, workers=cores)

# PV-DM w/ average
Doc2Vec(sentences, dm=1, dm_mean=1, size=100, window=2, hs=0, min_count=2, workers=cores)
Train and save the model
# train and save the model
sentences = gensim.models.doc2vec.TaggedLineDocument('output.seq')
model = gensim.models.Doc2Vec(sentences, size=100, window=3)  # passing a corpus here already trains the model
model.train(sentences)  # optional extra pass over the corpus
model.save('all_model.txt')
Save the document vectors
# save vectors
out = open("all_vector.txt", "w")
for num in range(0, count):
    docvec = model.docvecs[num]
    out.write(' '.join(str(x) for x in docvec) + '\n')  # write as text, one vector per line
    # print num
    # print docvec
out.close()
Sanity check: compute document similarities within the training set
# test, calculate the similarity
# note: docids are counted from 0
# find the documents most similar to the first document in the training set
sims = model.docvecs.most_similar(0)
print sims

# get similarity between doc1 and doc2 in the training data
sims = model.docvecs.similarity(1, 2)
print sims
Infer vectors and compare similarities
The code below sanity-checks the model: pick a random document from the training set, re-infer its vector with the model, and compute its similarity to the training documents. If the model is well trained, the most similar document should be the one we picked.
# check
#############################################################################
# A good check is to re-infer a vector for a document already in the model. #
# If the model is well-trained,                                             #
# the nearest doc should (usually) be the same document.                    #
#############################################################################
import numpy as np

print 'examining'
doc_id = np.random.randint(model.docvecs.count)  # pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
inferred_docvec = model.infer_vector(alldocs[doc_id].words)
print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))