There is not much material available on gensim's doc2vec, so this post records some exploratory experiments based on the official API. It covers training a model with gensim's doc2vec, inferring vectors for new documents, and inferring similarities. Some parts are still rough and will be improved later.
Import modules
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

import gensim, logging
import os
import jieba

# logging information
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Read the file
# get input file, text format
f = open('trainingdata.txt', 'r')
input = f.readlines()
count = len(input)
print count
Preprocess the file: word segmentation, etc.
# read file and separate words
output = open('output.seq', 'w')  # segmented text, one document per line
alldocs = []                      # for the sake of check, can be removed
count = 0                         # for the sake of check, can be removed
for line in input:
    line = line.strip('\n')
    seg_list = list(jieba.cut(line))  # materialize the generator so it can be used twice
    output.write(' '.join(seg_list) + '\n')
    alldocs.append(gensim.models.doc2vec.TaggedDocument(seg_list, [count]))  # tags must be a list
    count += 1                    # for the sake of check, can be removed
output.close()
Choosing a model
gensim Doc2Vec provides two models, DM and DBOW. The gensim documentation recommends training over the dataset multiple times, either decaying the learning rate on each pass or shuffling the order of the input, to get the best results.
# PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
Doc2Vec(sentences, dm=1, dm_concat=1, size=100, window=2, hs=0, min_count=2, workers=cores)

# PV-DBOW
Doc2Vec(sentences, dm=0, size=100, hs=0, min_count=2, workers=cores)

# PV-DM w/ average
Doc2Vec(sentences, dm=1, dm_mean=1, size=100, window=2, hs=0, min_count=2, workers=cores)
Train and save the model
# train and save the model
sentences = gensim.models.doc2vec.TaggedLineDocument('output.seq')
model = gensim.models.Doc2Vec(sentences, size=100, window=3)  # passing a corpus here already trains the model
model.train(sentences)  # optional extra pass over the corpus
model.save('all_model.txt')
Save the document vectors
# save vectors
out = open("all_vector.txt", "w")
for num in range(0, count):
    docvec = model.docvecs[num]
    out.write(' '.join(str(x) for x in docvec) + '\n')  # write as text, one vector per line
    # print num
    # print docvec
out.close()
Sanity check: compute document similarities within the training set
# test, calculate the similarity
# note: docids are counted from 0
# find the documents most similar to the first document in the training set
sims = model.docvecs.most_similar(0)
print sims

# get similarity between doc1 and doc2 in the training data
sims = model.docvecs.similarity(1, 2)
print sims
Infer vectors and compare similarities
The code below sanity-checks the model: pick a random document from the training set, re-infer its vector with the model, and compute its similarity to the training documents. If the model is well trained, the most similar document should be the one we picked.
# check
#############################################################################
# A good check is to re-infer a vector for a document already in the model. #
# If the model is well-trained,                                             #
# the nearest doc should (usually) be the same document.                    #
#############################################################################
import numpy as np

print 'examining'
doc_id = np.random.randint(model.docvecs.count)  # pick random doc; re-run cell for more examples
print('for doc %d...' % doc_id)
inferred_docvec = model.infer_vector(alldocs[doc_id].words)
print('%s:\n %s' % (model, model.docvecs.most_similar([inferred_docvec], topn=3)))