我使用Python的gensim库来做潜在的语义索引。我看了网站上的教程,效果很好。现在我试图对其进行一些修改;每次添加文档时,我都希望运行lsi模型。
这是我的代码:stoplist = set('for a of the and to in'.split())
num_factors=3
corpus = []
for i in range(len(urls)):
print "Importing", urls[i]
doc = getwords(urls[i])
cleandoc = [word for word in doc.lower().split() if word not in stoplist]
if i == 0:
dictionary = corpora.Dictionary([cleandoc])
else:
dictionary.addDocuments([cleandoc])
newVec = dictionary.doc2bow(cleandoc)
corpus.append(newVec)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, numTopics=num_factors, id2word=dictionary)
corpus_lsi = lsi[corpus_tfidf]
geturl是我编写的函数,它以字符串形式返回网站的内容。同样,如果我在执行tfidf和