I am preparing a Doc2Vec model using tweets. Each tweet's word array is treated as a separate document and is tagged "SENT_1", "SENT_2", and so on.
import gensim
from gensim.models.doc2vec import TaggedDocument

taggeddocs = []
for index, i in enumerate(cleaned_tweets):
    if len(i) > 2:  # skip empty tweets
        sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[u'SENT_{:d}'.format(index)])
        taggeddocs.append(sentence)
# build the model
model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(taggeddocs)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
I want to find tweets similar to a given tweet, say "SENT_2". How can I do that?
I get the tags of the similar tweets like this:
sims = model.docvecs.most_similar('SENT_2')
for label, score in sims:
    print(label)
It prints:
SENT_4372
SENT_1143
SENT_4024
SENT_4759
SENT_3497
SENT_5749
SENT_3189
SENT_1581
SENT_5127
SENT_3798
But given a tag, how do I get the original tweet words or sentence? For example, what is the tweet behind, say, "SENT_3497"? Can I query the Doc2Vec model for it?
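To make the goal concrete, here is a minimal sketch of the lookup I am after, done outside the model. As far as I can tell, the model keeps only the tags and vectors and not the text, so the sketch builds a plain dictionary alongside the tagging loop. The sample `cleaned_tweets` list, the `tag_to_tweet` dictionary, and the `tweet_for_tag` helper are hypothetical names used for illustration; none of them come from gensim.

```python
# Hypothetical sample data standing in for the real cleaned_tweets list.
cleaned_tweets = ["good morning world", "", "doc2vec on tweets is fun"]

# Build a tag -> tweet lookup with the same filter and tag format
# that the tagging loop uses, so the keys match the model's tags.
tag_to_tweet = {}
for index, tweet in enumerate(cleaned_tweets):
    if len(tweet) > 2:  # same non-empty check as in the tagging loop
        tag_to_tweet['SENT_{:d}'.format(index)] = tweet


def tweet_for_tag(tag):
    """Return the original tweet text for a tag such as 'SENT_2'."""
    return tag_to_tweet[tag]


print(tweet_for_tag('SENT_2'))
```

With something like this, each label returned by most_similar could be passed to `tweet_for_tag` to print the tweet text instead of the tag.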