Hands-On LDA Topic Modeling in Python

import numpy as np
import lda  # the standalone 'lda' package (pip install lda)
X = lda.datasets.load_reuters()
X.shape
(395, 4258)
  • X is a 395 × 4258 matrix, i.e. there are 395 training documents, each represented by its word counts over the vocabulary (a quick sanity check follows below).
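A minimal sanity check, not from the original post (it assumes, as the lda package documents, that load_reuters() returns a dense NumPy count matrix):

# each row of X holds one document's word counts, so row sums are document lengths
doc_lengths = X.sum(axis=1)
print(doc_lengths[:5])  # lengths of the first five documents
print(X.sum())          # total token count; matches n_words (84010) in the training log below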
vocab = lda.datasets.load_reuters_vocab()
len(vocab)  # the complete vocabulary
4258
  • So there are 4258 distinct words in the vocabulary (an indexing sketch follows below).
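The vocabulary aligns with the columns of X: vocab[j] is the word counted in column j. A small sketch ('church' is only an illustrative choice; it shows up in the topic lists further down):

j = vocab.index('church')  # vocab is an ordinary sequence of strings, so .index() works
print(j, X[:, j].sum())    # column index and corpus-wide count of 'church'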
  • Take a look at the titles of the first ten training documents.
title = lda.datasets.load_reuters_titles()
title[:10]
('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',
 '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21',
 "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23",
 '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25',
 '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25',
 "5 INDIA: Mother Teresa's condition unchanged, thousands pray. CALCUTTA 1996-08-25",
 '6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26',
 "7 INDIA: Mother Teresa's condition improves, many pray. CALCUTTA, India 1996-08-25",
 '8 INDIA: Mother Teresa improves, nuns pray for "miracle". CALCUTTA 1996-08-26',
 '9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26')
  • Start training. Here the number of topics is set to 20 and the number of iterations to 1500.
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)  # initialize the model; n_iter is the number of iterations
model.fit(X)
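The INFO lines below are emitted through Python's standard logging module under the 'lda' logger name; run the following before model.fit(X) if they do not show up in your console (an environment note, not part of the original post):

import logging
logging.basicConfig(level=logging.INFO)  # surfaces lda's per-iteration progress messages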

Console output:


INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
...
INFO:lda:<1450> log likelihood: -654884
INFO:lda:<1460> log likelihood: -655493
INFO:lda:<1470> log likelihood: -655415
INFO:lda:<1480> log likelihood: -655192
INFO:lda:<1490> log likelihood: -655728
INFO:lda:<1499> log likelihood: -655858

<lda.lda.LDA at 0x7effa0508550>
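The log likelihood climbs steeply at first and then plateaus around -655,000, which suggests the Gibbs sampler has roughly converged well before iteration 1500. The final value is also available after fitting; as far as I can tell this is the same method the package uses to print the trace above:

print(model.loglikelihood())  # complete log likelihood of the final sampler state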
  • Inspect the word distribution of each of the 20 topics.
topic_word = model.topic_word_
print(topic_word.shape)
topic_word

Output:

(20, 4258)

array([[3.62505347e-06, 3.62505347e-06, 3.62505347e-06, ...,
        3.62505347e-06, 3.62505347e-06, 3.62505347e-06],
       [1.87498968e-02, 1.17916463e-06, 1.17916463e-06, ...,
        1.17916463e-06, 1.17916463e-06, 1.17916463e-06],
       [1.52206232e-03, 5.05668544e-06, 4.05040504e-03, ...,
        5.05668544e-06, 5.05668544e-06, 5.05668544e-06],
       ...,
       [4.17266923e-02, 3.93610908e-06, 9.05698699e-03, ...,
        3.93610908e-06, 3.93610908e-06, 3.93610908e-06],
       [2.37609835e-06, 2.37609835e-06, 2.37609835e-06, ...,
        2.37609835e-06, 2.37609835e-06, 2.37609835e-06],
       [3.46310752e-06, 3.46310752e-06, 3.46310752e-06, ...,
        3.46310752e-06, 3.46310752e-06, 3.46310752e-06]])
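Each row of topic_word is a full probability distribution over the 4258 vocabulary terms, which is why even words unrelated to a topic keep a tiny nonzero probability (repeated values such as 3.62505347e-06 above are that smoothing floor). A quick normalization check:

# every topic's word distribution should sum to 1, up to floating-point error
print(np.allclose(topic_word.sum(axis=1), 1.0))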
  • Get the top 8 words of each topic.
for i, topic_dist in enumerate(topic_word):
    # np.argsort sorts ascending, so the [:-9:-1] slice walks backwards over
    # the last 8 indices: the 8 most probable words, in descending order
    print(np.array(vocab)[np.argsort(topic_dist)][:-9:-1])
['british' 'churchill' 'sale' 'million' 'major' 'letters' 'west' 'britain']
['church' 'government' 'political' 'country' 'state' 'people' 'party'
 'against']
['elvis' 'king' 'fans' 'presley' 'life' 'concert' 'young' 'death']
['yeltsin' 'russian' 'russia' 'president' 'kremlin' 'moscow' 'michael'
 'operation']
['pope' 'vatican' 'paul' 'john' 'surgery' 'hospital' 'pontiff' 'rome']
['family' 'funeral' 'police' 'miami' 'versace' 'cunanan' 'city' 'service']
['simpson' 'former' 'years' 'court' 'president' 'wife' 'south' 'church']
['order' 'mother' 'successor' 'election' 'nuns' 'church' 'nirmala' 'head']
['charles' 'prince' 'diana' 'royal' 'king' 'queen' 'parker' 'bowles']
['film' 'french' 'france' 'against' 'bardot' 'paris' 'poster' 'animal']
['germany' 'german' 'war' 'nazi' 'letter' 'christian' 'book' 'jews']
['east' 'peace' 'prize' 'award' 'timor' 'quebec' 'belo' 'leader']
["n't" 'life' 'show' 'told' 'very' 'love' 'television' 'father']
['years' 'year' 'time' 'last' 'church' 'world' 'people' 'say']
['mother' 'teresa' 'heart' 'calcutta' 'charity' 'nun' 'hospital'
 'missionaries']
['city' 'salonika' 'capital' 'buddhist' 'cultural' 'vietnam' 'byzantine'
 'show']
['music' 'tour' 'opera' 'singer' 'israel' 'people' 'film' 'israeli']
['church' 'catholic' 'bernardin' 'cardinal' 'bishop' 'wright' 'death'
 'cancer']
['harriman' 'clinton' 'u.s' 'ambassador' 'paris' 'president' 'churchill'
 'france']
['city' 'museum' 'art' 'exhibition' 'century' 'million' 'churches' 'set']
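To see how peaked a topic is, it helps to print the probabilities alongside the words. A small sketch for topic 8, the royals topic above (the topic index is an arbitrary choice for illustration):

top = np.argsort(topic_word[8])[:-9:-1]  # indices of the 8 most probable words
for j in top:
    print('{:>10s}  {:.4f}'.format(vocab[j], topic_word[8, j]))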
  • Get each document's distribution over the topics, and the most likely topic for each document.
doc_topic = model.doc_topic_
print(doc_topic.shape)  # a 395 x 20 matrix: each row is one training document's distribution over the 20 topics
print("Topic distribution of the first document:", doc_topic[0])
print("Most likely topic of the first document:", doc_topic[0].argmax())
(395, 20)
Topic distribution of the first document: [4.34782609e-04 3.52173913e-02 4.34782609e-04 9.13043478e-03
 4.78260870e-03 4.34782609e-04 9.13043478e-03 3.08695652e-02
 5.04782609e-01 4.78260870e-03 4.34782609e-04 4.34782609e-04
 3.08695652e-02 2.17826087e-01 4.34782609e-04 4.34782609e-04
 4.34782609e-04 3.95652174e-02 4.34782609e-04 1.09130435e-01]
Most likely topic of the first document: 8
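A natural follow-up is to pair each title with its most likely topic; document 0, the Prince Charles story, indeed lands on topic 8, whose top words are 'charles', 'prince', 'diana'. A short sketch mirroring the example in the lda package's documentation:

for n in range(5):
    topic_most_pr = doc_topic[n].argmax()
    print('doc: {}  topic: {}  {}'.format(n, topic_most_pr, title[n][:40]))

Recent versions of the lda package also expose transform(), which infers topic distributions for new count vectors; here two training rows are reused just to show the call shape (an assumption about the installed version, not something the original post covers):

doc_topic_test = model.transform(X[:2])  # hedged: transform() may be missing in very old versions
print(doc_topic_test.shape)              # expected: (2, 20)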

Reposted from: https://blog.csdn.net/jiangzhenkang/article/details/84335646
