Several popular VSM algorithms:
Term Frequency * Inverse Document Frequency, Tf-Idf
from gensim import corpora, models, similarities

# texts is a tokenized corpus, e.g.:
texts = [["human", "computer", "interaction"],
         ["graph", "minors", "trees"],
         ["human", "graph", "trees"]]
dictionary = corpora.Dictionary(texts)                 # map each token to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                      # fit idf weights from the corpus
corpus_tfidf = tfidf[corpus]                           # re-weight every document by tf-idf
Latent Semantic Indexing, LSI (or sometimes LSA)
LSA applies SVD (singular value decomposition) to the term-document matrix built from text data.
Most current LSI models are not based on mere local weights; they combine local, global, and document-normalization weights, and some also incorporate entropy weights or link weights. Modern models typically ignore stop words and terms that occur only once in a document. Term stemming and sorting terms in alphabetical order are optional.
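One common local-times-global scheme is log-entropy weighting (a log-scaled term frequency times a global entropy weight, followed by per-document normalization). A minimal sketch on a toy count matrix, with all values made up for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
counts = np.array([
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 1],
], dtype=float)

# Local weight: log-scaled term frequency.
local = np.log1p(counts)

# Global (entropy) weight: g_i = 1 + sum_j p_ij * log(p_ij) / log(n_docs),
# where p_ij = tf_ij / gf_i. Terms concentrated in few documents get a
# weight near 1; terms spread evenly over all documents get a weight near 0.
gf = counts.sum(axis=1, keepdims=True)
p = np.where(counts > 0, counts / gf, 1.0)   # p=1 where count=0 so log(p)=0
n_docs = counts.shape[1]
entropy = 1 + (np.where(counts > 0, p * np.log(p), 0.0).sum(axis=1)
               / np.log(n_docs))

# Combine local and global weights, then normalize each document vector.
weighted = local * entropy[:, None]
weighted /= np.linalg.norm(weighted, axis=0, keepdims=True)
print(weighted.round(3))
```

Note how the term that appears once in every document (row 3) gets entropy weight 0, while the term concentrated in a single document (row 2) gets weight 1.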
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)  # fit LSI on the tf-idf corpus
corpus_lsi = lsi[corpus_tfidf]  # project every document into LSI space
Example: using LSI to rank documents against a query:
- Step 1. Build the term-document matrix A and the query vector q
- Step 2. Compute the SVD of A
- Step 3. Approximate A with a rank-k truncation; document vectors are then given by the rows of V_k
- Step 4. Project the query q into the same space: q' = q^T U_k S_k^{-1}
- Step 5. Compute the cosine similarity between the query and each document
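The five steps above can be sketched with plain numpy (the toy matrix, the rank k = 2, and the variable names are assumptions for illustration):

```python
import numpy as np

# Step 1: toy term-document matrix A (rows = terms, cols = docs)
# and a query vector q over the same vocabulary.
A = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
], dtype=float)
q = np.array([1, 0, 1], dtype=float)   # here q equals document 1

# Step 2: SVD of A.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 3: rank-k truncation; document vectors are the rows of V_k.
k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
docs_k = Vt_k.T                        # each row is one document in LSI space

# Step 4: project the query: q' = q^T U_k S_k^{-1}.
q_k = q @ U_k @ np.linalg.inv(S_k)

# Step 5: cosine similarity between the query and every document.
sims = (docs_k @ q_k) / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(sims.round(3))
```

Since q is exactly the first column of A, its projection coincides with that document's row of V_k, so the first similarity comes out as 1.0.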
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow] # convert the query to LSI space
index = similarities.MatrixSimilarity(lsi[corpus]) # transform corpus to LSI space and index it
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims) # print sorted (document number, similarity score) 2-tuples
Probabilistic Latent Semantic Analysis, PLSA
PLSA was proposed by Hofmann in 1999. In Hofmann's view, a document is a mixture of several topics, each topic is a probability distribution over the vocabulary, and every word in a document is generated by one fixed topic.
Documents are independent and exchangeable, and so are the words within a document: this is a bag-of-words model. There are K topic-word distributions, written φ_1, ⋯, φ_K. For each document d_m in a corpus C = (d_1, d_2, ⋯, d_M) of M documents, there is a document-specific doc-topic distribution θ_m. In PLSA, the probability of generating each word w of the m-th document d_m is therefore

p(w | d_m) = Σ_{k=1}^{K} p(w | z = k) p(z = k | d_m) = Σ_{k=1}^{K} φ_{kw} θ_{mk}
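The mixture can be illustrated numerically; the topic count, vocabulary size, and all probabilities below are made up:

```python
import numpy as np

# Assumed toy setup: K = 2 topics over a 4-word vocabulary.
# phi[k] is the topic-word distribution φ_k; theta_m is the
# doc-topic distribution θ_m for one document d_m.
phi = np.array([
    [0.5, 0.3, 0.1, 0.1],   # φ_1
    [0.1, 0.1, 0.4, 0.4],   # φ_2
])
theta_m = np.array([0.7, 0.3])  # θ_m

# PLSA mixture: p(w | d_m) = Σ_k θ_mk * φ_kw, for every word w at once.
p_w_given_dm = theta_m @ phi
print(p_w_given_dm)  # a proper probability distribution over the vocabulary
```

Because each φ_k sums to 1 and θ_m sums to 1, the resulting word distribution also sums to 1, as a generative probability must.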