【python-sklearn】中文文本处理LDA主题模型分析

数据集和资料:
链接:LDA主题模型
提取码:rlns

数据概览
在这里插入图片描述

代码:

import os
import pandas as pd
import re
import jieba
import jieba.posseg as psg


#######预处理

output_path = 'D:/lda/result'
file_path = 'D:/lda/data'
os.chdir(file_path)
data=pd.read_excel("data.xlsx")#content type
os.chdir(output_path)
dic_file = "D:/lda/stop_dic/dict.txt"
stop_file = "D:/lda/stop_dic/stopwords.txt"


def chinese_word_cut(mytext):
    jieba.load_userdict(dic_file)
    jieba.initialize()
    try:
        stopword_list = open(stop_file,encoding ='utf-8')
    except:
        stopword_list = []
        print("error in stop_file")
    stop_list = []
    flag_list = ['n','nz','vn']
    for line in stopword_list:
        line = re.sub(u'\n|\\r', '', line)
        stop_list.append(line)
    
    word_list = []
    #jieba分词
    seg_list = psg.cut(mytext)
    for seg_word in seg_list:
        #word = re.sub(u'[^\u4e00-\u9fa5]','',seg_word.word) 
        word = seg_word.word
        find = 0
        for stop_word in stop_list:
            if stop_word == word or len(word)<2:     #this word is stopword
                    find = 1
                    break
        if find == 0 and seg_word.flag in flag_list:
            word_list.append(word)      
    return (" ").join(word_list)


data["content_cutted"] = data.content.apply(chinese_word_cut)




#######LDA分析

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation



def print_top_words(model, feature_names, n_top_words):
    tword = []
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        topic_w = " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
        tword.append(topic_w)
        print(topic_w)
    return tword



n_features = 1000 #提取1000个特征词语
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                max_features=n_features,
                                stop_words='english',
                                max_df = 0.5,
                                min_df = 10)
tf = tf_vectorizer.fit_transform(data.content_cutted)


n_topics = 8
lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50,
                                learning_method='batch',
                                learning_offset=50,
#                                 doc_topic_prior=0.1,
#                                 topic_word_prior=0.01,
                               random_state=0)
lda.fit(tf)

###########每个主题对应词语
n_top_words = 25
tf_feature_names = tf_vectorizer.get_feature_names()
topic_word = print_top_words(lda, tf_feature_names, n_top_words)

###########输出每篇文章对应主题
import numpy as np
topics=lda.transform(tf)
topic = []
for t in topics:
    topic.append(list(t).index(np.max(t)))
data['topic']=topic
data.to_excel("data_topic.xlsx",index=False)
topics[0]#0 1 2 



###########可视化

import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()
pic = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)
pyLDAvis.display(pic)
pyLDAvis.save_html(pic, 'lda_pass'+str(n_topics)+'.html')
#去工作路径下找保存好的html文件
pyLDAvis.show(pic)


###########困惑度
import matplotlib.pyplot as plt

plexs = []
n_max_topics = 16
for i in range(1,n_max_topics):
    print(i)
    lda = LatentDirichletAllocation(n_components=i, max_iter=50,
                                    learning_method='batch',
                                    learning_offset=50,random_state=0)
    lda.fit(tf)
    plexs.append(lda.perplexity(tf))


n_t=15#区间最右侧的值。注意:不能大于n_max_topics
x=list(range(1,n_t))
plt.plot(x,plexs[1:n_t])
plt.xlabel("number of topics")
plt.ylabel("perplexity")
plt.show()


可视化结果:


请添加图片描述
如上图的困惑度曲线可知,当取值为8时效果较好

请添加图片描述

处理结果
在这里插入图片描述
参考:up主—>五连单排一班

  • 11
    点赞
  • 148
    收藏
    觉得还不错? 一键收藏
  • 10
    评论
首先需要明确的是,LDA(Latent Dirichlet Allocation)是一种主题模型,不是一种情感分析方法。但是可以在LDA模型的基础上进行情感分析。下面是一个基于LDA中文文本情感分析代码示例: 1. 数据预处理 首先需要对中文文本进行分词、去停用词等预处理操作。这里使用jieba分词库和stopwords中文停用词库。 ```python import jieba import codecs # 加载中文停用词库 with codecs.open('stopwords.txt','r',encoding='utf8') as f: stopwords = [line.strip() for line in f] # 对文本进行分词和去停用词处理 def cut_stop_words(text): words = jieba.cut(text) return [word for word in words if word not in stopwords] ``` 2. LDA模型训练 使用gensim库进行LDA模型训练。 ```python import gensim from gensim import corpora # 加载预处理后的文本 with codecs.open('data.txt','r',encoding='utf8') as f: texts = [cut_stop_words(line.strip()) for line in f] # 构建词典和语料库 dictionary = corpora.Dictionary(texts) corpus = [dictionary.doc2bow(text) for text in texts] # 训练LDA模型 lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=10) ``` 3. 情感分析 基于LDA模型主题分布,可以对文本进行情感分析。这里使用snownlp库进行情感分析。 ```python import snownlp # 对每个文本进行情感分析 def sentiment_analysis(text): topic_dist = lda_model.get_document_topics(dictionary.doc2bow(cut_stop_words(text)), minimum_probability=0.0) positive_prob = 0.0 negative_prob = 0.0 for topic_id, prob in topic_dist: topic_words = [word for word, _ in lda_model.show_topic(topic_id)] topic_text = ' '.join(topic_words) sentiment = snownlp.SnowNLP(topic_text).sentiments if sentiment > 0.5: positive_prob += prob else: negative_prob += prob if positive_prob > negative_prob: return 'positive' elif positive_prob < negative_prob: return 'negative' else: return 'neutral' ``` 以上就是一个基于LDA中文文本情感分析代码示例。需要注意的是,LDA模型训练需要较大的文本语料库,并且情感分析的准确度也受到LDA模型的影响。
评论 10
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值