Several popular VSM algorithms:
Term Frequency * Inverse Document Frequency, Tf-Idf
from gensim import corpora, models, similarities

# texts is a tokenized corpus, e.g.:
texts = [["human", "computer", "interaction"],
         ["graph", "minors", "trees"],
         ["human", "graph", "trees"]]
dictionary = corpora.Dictionary(texts)                 # map each token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                      # fit idf weights from the corpus
corpus_tfidf = tfidf[corpus]                           # re-weight every document by tf-idf
Latent Semantic Indexing, LSI (or sometimes LSA)
LSA applies SVD (singular value decomposition) to the term-document matrix built from text data.
Most current LSI models are not based on mere local weights; they combine local, global, and document-normalization weights, and some also incorporate entropy weights or link weights. Modern models typically ignore stop words and terms that occur only once in a document. Term stemming and sorting terms in alphabetical order are optional.
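One common local-times-global scheme is log-entropy weighting (a log-scaled term frequency times a global entropy weight, followed by per-document normalization). A minimal sketch on a toy count matrix, with all values made up for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 1],
], dtype=float)

# Local weight: log-scaled term frequency.
local = np.log1p(counts)

# Global (entropy) weight: g_i = 1 + sum_j p_ij * log(p_ij) / log(n_docs),
# where p_ij = tf_ij / gf_i. Terms concentrated in few documents get a
# weight near 1; terms spread evenly over all documents get a weight near 0.
gf = counts.sum(axis=1, keepdims=True)
p = np.where(counts > 0, counts / gf, 1.0)   # p=1 where count=0 so log(p)=0
n_docs = counts.shape[1]
entropy = 1 + (np.where(counts > 0, p * np.log(p), 0.0).sum(axis=1)
               / np.log(n_docs))

# Combine local and global weights, then normalize each document vector.
weighted = local * entropy[:, None]
weighted /= np.linalg.norm(weighted, axis=0, keepdims=True)
print(weighted.round(3))
```

Note how the term that appears once in every document (row 3) gets entropy weight 0, while the term concentrated in a single document (row 2) gets weight 1.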
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)  # fit LSI on the tf-idf corpus
corpus_lsi = lsi[corpus_tfidf]  # project every document into LSI space
Example: using LSI to rank documents against a query:
- Step 1. Build the term-document matrix A and the query vector q
- Step 2. Compute the SVD of A
- Step 3. Approximate A with a rank-k truncation; document vectors are then given by the rows of V_k
- Step 4. Project the query q into the same space: q' = q^T U_k S_k^{-1}
- Step 5. Compute the cosine similarity between the query and each document
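The five steps above can be sketched with plain numpy (the toy matrix, the rank k = 2, and the variable names are assumptions for illustration):

```python
import numpy as np

# Step 1: toy term-document matrix A (rows = terms, cols = docs)
# and a query vector q over the same vocabulary.
A = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
], dtype=float)
q = np.array([1, 0, 1], dtype=float)   # here q equals document 1

# Step 2: SVD of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: rank-k truncation; document vectors are the rows of V_k.
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
docs_k = Vt_k.T                        # each row is one document in LSI space

# Step 4: project the query: q' = q^T U_k S_k^{-1}.
q_k = q @ U_k @ np.linalg.inv(S_k)

# Step 5: cosine similarity between the query and every document.
sims = (docs_k @ q_k) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims.round(3))
```

Since q is exactly the first column of A, its projection coincides with that document's row of V_k, so the first similarity comes out as 1.0.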
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples
Probabilistic Latent Semantic Analysis, PLSA
PLSA was proposed by Hofmann in 1999. In Hofmann's view, a document is a mixture of several topics, each topic is a probability distribution over the vocabulary, and every word in a document is generated by one fixed topic.
Documents are independent and exchangeable, and so are the words within a document: this is a bag-of-words model. There are K topic-word distributions, written φ_1, ⋯, φ_K. For each document d_m in a corpus C = (d_1, d_2, ⋯, d_M) of M documents, there is a document-specific doc-topic distribution θ_m. In PLSA, the probability of generating each word w of the m-th document d_m is therefore

p(w | d_m) = Σ_{k=1}^{K} p(w | z = k) p(z = k | d_m) = Σ_{k=1}^{K} φ_{kw} θ_{mk}
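The mixture can be illustrated numerically; the topic count, vocabulary size, and all probabilities below are made up:

```python
import numpy as np

# Assumed toy setup: K = 2 topics over a 4-word vocabulary.
# phi[k] is the topic-word distribution φ_k; theta_m is the
# doc-topic distribution θ_m for one document d_m.
phi = np.array([
    [0.5, 0.3, 0.1, 0.1],   # φ_1
    [0.1, 0.1, 0.4, 0.4],   # φ_2
])
theta_m = np.array([0.7, 0.3])  # θ_m

# PLSA mixture: p(w | d_m) = Σ_k θ_mk * φ_kw, for every word w at once.
p_w_given_dm = theta_m @ phi
print(p_w_given_dm)  # a proper probability distribution over the vocabulary
```

Because each φ_k sums to 1 and θ_m sums to 1, the resulting word distribution also sums to 1, as a generative probability must.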