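Topic modeling and document similarity analysis with Python's Gensim library involves four steps: preprocessing the text, building a bag-of-words model, training a topic model, and computing document similarity. A detailed walkthrough follows.
Installing dependencies
First, make sure Gensim and the other required libraries are installed: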
pip install gensim nltk
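Text preprocessing
Before Gensim can be used for topic modeling and document similarity analysis, the text has to be preprocessed. A basic preprocessing pipeline consists of:
- Tokenization
- Stop word removal
- Stemming or lemmatization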
import nltk
import numpy as np
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import SnowballStemmer, WordNetLemmatizer

np.random.seed(2024)  # seed NumPy's global RNG so runs are repeatable
nltk.download('wordnet')  # WordNet data used by the lemmatizer
# Some NLTK versions may also need: nltk.download('omw-1.4')

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(token):
    # Lemmatize the token as a verb, then reduce the lemma to its stem.
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stop words and tokens of 3 characters or fewer.
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
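Building the bag-of-words model
Next, build the bag-of-words model: a dictionary that maps each token to an integer id, and a corpus in which every document is represented as a list of (token id, count) pairs.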
from gensim import corpora

# texts is a list of preprocessed documents, each one a list of tokens
texts = [
    preprocess("Human machine interface for lab abc computer applications"),
    preprocess("A survey of user opinion of computer system response time"),
    preprocess("The EPS user interface management system"),
    preprocess("System and human system engineering testing of EPS"),
    preprocess("Relation of user perceived response time to error measurement"),
    preprocess("The generation of random binary unordered trees"),
    preprocess("The intersection graph of paths in trees"),
    preprocess("Graph minors IV Widths of trees and well quasi ordering"),
    preprocess("Graph minors A survey"),
]

# Build the dictionary (token -> integer id)
dictionary = corpora.Dictionary(texts)

# Build the corpus: one bag-of-words vector per document
corpus = [dictionary.doc2bow(text) for text in texts]
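To make the (token id, count) representation concrete, here is a conceptual sketch in plain Python. It is not Gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, the input tokens are made up, and Gensim may assign ids in a different order. The shape of the result matches what `doc2bow` returns, and tokens missing from the dictionary are silently dropped, as they are in Gensim.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign each distinct token an integer id in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["human", "interfac"], ["graph", "tree", "graph"]]
token2id = build_token2id(docs)
bow = to_bow(["graph", "graph", "human", "unseen"], token2id)

print(token2id)  # {'human': 0, 'interfac': 1, 'graph': 2, 'tree': 3}
print(bow)       # [(0, 1), (2, 2)]
```

Word order is discarded: only which tokens occur, and how often, survives into the corpus.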
Training the LDA topic model
Train the topic model with LDA (Latent Dirichlet Allocation).
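from gensim.models.ldamodel import LdaModel

# Train the LDA model; random_state fixes the seed so results are repeatable
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15, random_state=2024)

# Print every topic with its highest-weighted words
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))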
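Computing document similarity
The topic distributions produced by the LDA model can be used to measure how similar documents are. MatrixSimilarity indexes the topic vectors of the corpus and scores a query document against each of them by cosine similarity.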
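from gensim.similarities import MatrixSimilarity

# Convert the corpus to topic distributions
corpus_lda = lda_model[corpus]

# Build the similarity index; num_features is set explicitly because
# low-probability topics are omitted from the sparse topic vectors
index = MatrixSimilarity(corpus_lda, num_features=lda_model.num_topics)

# new_doc is a new, unseen document
new_doc = "Human computer interaction"
new_doc_bow = dictionary.doc2bow(preprocess(new_doc))
new_doc_lda = lda_model[new_doc_bow]

# Compute similarity: sims[i] is the cosine similarity between new_doc and document i
sims = index[new_doc_lda]
print(list(enumerate(sims)))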
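The scores returned by the index are cosine similarities between topic vectors. As a rough illustration of what that means, the following plain-Python sketch computes the same measure by hand and ranks the documents from most to least similar. The topic vectors are invented for the example, and `cosine` is a hypothetical helper rather than a Gensim function.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Made-up topic distributions for three indexed documents and one query.
indexed = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.05, 0.90],
]
query = [0.80, 0.10, 0.10]

scores = [cosine(query, doc) for doc in indexed]

# Rank documents from most to least similar to the query.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
for doc_id, score in ranked:
    print(f"doc {doc_id}: {score:.3f}")
```

The same sorting step can be applied to the `sims` array from the Gensim index to get a ranked list instead of scores in corpus order.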
Complete example
Putting the steps above together gives the following complete code example:
import nltk
import numpy as np
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from gensim.parsing.preprocessing import STOPWORDS
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
from nltk.stem import SnowballStemmer, WordNetLemmatizer

nltk.download('wordnet')  # WordNet data used by the lemmatizer
# Some NLTK versions may also need: nltk.download('omw-1.4')
np.random.seed(2024)  # seed NumPy's global RNG so runs are repeatable

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(token):
    # Lemmatize the token as a verb, then reduce the lemma to its stem.
    return stemmer.stem(lemmatizer.lemmatize(token, pos='v'))

def preprocess(text):
    # Tokenize and lowercase, then drop stop words and tokens of 3 characters or fewer.
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

# Sample texts
texts = [
    preprocess("Human machine interface for lab abc computer applications"),
    preprocess("A survey of user opinion of computer system response time"),
    preprocess("The EPS user interface management system"),
    preprocess("System and human system engineering testing of EPS"),
    preprocess("Relation of user perceived response time to error measurement"),
    preprocess("The generation of random binary unordered trees"),
    preprocess("The intersection graph of paths in trees"),
    preprocess("Graph minors IV Widths of trees and well quasi ordering"),
    preprocess("Graph minors A survey"),
]

# Build the dictionary and the corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15, random_state=2024)

# Print the topics
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

# Index the corpus by topic distribution for similarity queries
corpus_lda = lda_model[corpus]
index = MatrixSimilarity(corpus_lda, num_features=lda_model.num_topics)

# Similarity of a new document to every document in the corpus
new_doc = "Human computer interaction"
new_doc_bow = dictionary.doc2bow(preprocess(new_doc))
new_doc_lda = lda_model[new_doc_bow]
sims = index[new_doc_lda]
print(list(enumerate(sims)))
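This example shows how to use Gensim for text preprocessing, topic modeling, and document similarity analysis. Adjust the preprocessing steps, the number of topics, and the other parameters to suit your own data.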