Basic Usage of LDA in Python

zhourunan123

"""
Part 1: Load the data
"""
import numpy as np
import lda
import lda.datasets
 
# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])
 
# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:5])
 
# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))

print(titles[:5])


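# After loading the datasets bundled with the lda package, the output is as follows:
# X is a 395 x 4258 document-term matrix: 395 documents and 4258 vocabulary words. Each entry is the number of times a word occurs in a document (its term frequency); X[:5, :5] prints the top-left 5 x 5 block.
# vocab holds the 4258 words themselves; vocab[j] labels column j of X. Its first 5 words are printed, so column 0 of X holds the counts for 'church'.
# titles holds the titles of the 395 articles; the titles of articles 0 to 4 are printed below.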

type(X): <class 'numpy.ndarray'>
shape: (395, 4258)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]
type(vocab): <class 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother')
type(titles): <class 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20', '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21', "2 INDIA: Mother Teresa's condition said still unstable. CALCUTTA 1996-08-23", '3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25', '4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25')
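
To make the relationship between X and vocab concrete, here is a minimal sketch that builds a document-term matrix from three made-up sentences using only the standard library. The function name `build_doc_term_matrix` and the sample sentences are hypothetical, and the alphabetical column order is an assumption for illustration; the Reuters vocabulary shipped with the lda package is not ordered this way.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (matrix, vocab) where matrix[d][j] counts vocab[j] in docs[d]."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # One column per distinct word; sorting only fixes a reproducible order.
    vocab = tuple(sorted(set().union(*counts)))
    # A Counter returns 0 for words a document does not contain.
    matrix = [[c[word] for word in vocab] for c in counts]
    return matrix, vocab


# Hypothetical mini-corpus, not taken from the Reuters data.
docs = [
    "pope visits church",
    "church bells ring at church",
    "pope blesses people",
]
matrix, vocab = build_doc_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column one word, so `matrix[1][vocab.index("church")]` is 2 here, in the same way that `X[0, 3117]` is a single word count in the Reuters matrix.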

# Look up a single entry: document 0, word 3117, i.e. X[0, 3117]:
# X[0,3117] is the number of times that word 3117 occurs in document 0
doc_id = 0
word_id = 3117
print("doc id: {} word id: {}".format(doc_id, word_id))
print("-- count: {}".format(X[doc_id, word_id]))
print("-- word : {}".format(vocab[word_id]))
print("-- doc  : {}".format(titles[doc_id]))

doc id: 0 word id: 3117
-- count: 2
-- word : heir-to-the-throne
-- doc  : 0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20

"""
Part 2: Train the model
The model uses 20 topics and 500 iterations.

"""
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)          # model.fit_transform(X) is also available
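
The lda package describes its fitting procedure as collapsed Gibbs sampling. The toy sampler below is a sketch of that general technique in plain Python, not the package's actual implementation: the function `gibbs_lda`, the matrix `X_toy`, and the values chosen for `alpha` and `eta` are all assumptions made for illustration. It shows what one "iteration" means: every token's topic is resampled once, conditioned on all the other assignments.

```python
import random


def gibbs_lda(X, n_topics, n_iter=200, alpha=0.1, eta=0.01, seed=1):
    """Toy collapsed Gibbs sampler for LDA on a document-term count matrix."""
    rng = random.Random(seed)
    n_docs, n_words = len(X), len(X[0])
    # Expand the counts into one (doc, word) pair per token.
    tokens = [(d, w) for d in range(n_docs) for w in range(n_words)
              for _ in range(X[d][w])]
    ndz = [[0] * n_topics for _ in range(n_docs)]   # doc-topic counts
    nzw = [[0] * n_words for _ in range(n_topics)]  # topic-word counts
    nz = [0] * n_topics                             # tokens per topic
    z = []                                          # topic of each token
    for d, w in tokens:                             # random initialisation
        k = rng.randrange(n_topics)
        z.append(k)
        ndz[d][k] += 1
        nzw[k][w] += 1
        nz[k] += 1
    for _ in range(n_iter):
        for i, (d, w) in enumerate(tokens):
            k = z[i]
            # Remove the token's current assignment from the counts.
            ndz[d][k] -= 1
            nzw[k][w] -= 1
            nz[k] -= 1
            # Conditional p(z_i = t | all other assignments), unnormalised.
            weights = [(ndz[d][t] + alpha) * (nzw[t][w] + eta)
                       / (nz[t] + n_words * eta) for t in range(n_topics)]
            k = rng.choices(range(n_topics), weights=weights)[0]
            z[i] = k
            ndz[d][k] += 1
            nzw[k][w] += 1
            nz[k] += 1
    # Smooth and normalise the counts into probability distributions.
    topic_word = [[(c + eta) / (nz[t] + n_words * eta) for c in nzw[t]]
                  for t in range(n_topics)]
    doc_topic = [[(c + alpha) / (sum(row) + n_topics * alpha) for c in row]
                 for row in ndz]
    return topic_word, doc_topic


# Hypothetical corpus: two documents use words 0-1, two use words 2-3.
X_toy = [
    [4, 3, 0, 0],
    [3, 5, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 5, 3],
]
toy_topic_word, toy_doc_topic = gibbs_lda(X_toy, n_topics=2)
for row in toy_doc_topic:
    print([round(p, 2) for p in row])
```

Fixing `seed` plays the same role as `random_state=1` in the call above: the sampler is stochastic, so a fixed seed is what makes a run repeatable.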

"""
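Part 3: Topic-word distribution
The code below prints the weights of the three words 'church', 'pope' and 'years' in every topic (n_topics=20, so 20 topics in total). It also prints, for each of the first 5 topics, the sum of the weights over the whole vocabulary; each sum equals 1 up to floating-point rounding.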

"""
topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))
print(vocab[:3])
print(topic_word[:, :3])
 
for n in range(5):
    sum_pr = sum(topic_word[n,:])
    print("topic: {} sum: {}".format(n, sum_pr))

type(topic_word): <class 'numpy.ndarray'>
shape: (20, 4258)
('church', 'pope', 'years')
[[  2.72436509e-06   2.72436509e-06   2.72708945e-03]
 [  2.29518860e-02   1.08771556e-06   7.83263973e-03]
 [  3.97404221e-03   4.96135108e-06   2.98177200e-03]
 [  3.27374625e-03   2.72585033e-06   2.72585033e-06]
 [  8.26262882e-03   8.56893407e-02   1.61980569e-06]
 [  1.30107788e-02   2.95632328e-06   2.95632328e-06]
 [  2.80145003e-06   2.80145003e-06   2.80145003e-06]
 [  2.42858077e-02   4.66944966e-06   4.66944966e-06]
 [  6.84655429e-03   1.90129250e-06   6.84655429e-03]
 [  3.48361655e-06   3.48361655e-06   3.48361655e-06]
 [  2.98781661e-03   3.31611166e-06   3.31611166e-06]
 [  4.27062069e-06   4.27062069e-06   4.27062069e-06]
 [  1.50994982e-02   1.64107142e-06   1.64107142e-06]
 [  7.73480150e-07   7.73480150e-07   1.70946848e-02]
 [  2.82280146e-06   2.82280146e-06   2.82280146e-06]
 [  5.15309856e-06   5.15309856e-06   4.64294180e-03]
 [  3.41695768e-06   3.41695768e-06   3.41695768e-06]
 [  3.90980357e-02   1.70316633e-03   4.42279319e-03]
 [  2.39373034e-06   2.39373034e-06   2.39373034e-06]
 [  3.32493234e-06   3.32493234e-06   3.32493234e-06]]
topic: 0 sum: 1.0000000000000875
topic: 1 sum: 1.0000000000001148
topic: 2 sum: 0.9999999999998656
topic: 3 sum: 1.0000000000000042
topic: 4 sum: 1.0000000000000928
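
Two details of the output above are worth a closer look: within a row, several words share exactly the same tiny value (for example 2.72436509e-06 twice in topic 0), and the row sums differ from 1 in the last few digits. A minimal sketch, assuming each row is obtained by adding a small smoothing constant to the topic's word counts and then normalising, reproduces both effects. The function `normalize_topic_word`, the counts in `nzw`, and the value of `eta` are hypothetical.

```python
def normalize_topic_word(nzw, eta=0.01):
    """Turn topic-word counts into one probability row per topic."""
    rows = []
    for counts in nzw:
        smoothed = [c + eta for c in counts]  # no word ends up at exactly 0
        total = sum(smoothed)
        rows.append([s / total for s in smoothed])
    return rows


# Hypothetical counts: 2 topics over a 4-word vocabulary.
nzw = [
    [30, 0, 0, 10],
    [0, 25, 5, 0],
]
rows = normalize_topic_word(nzw)
for k, row in enumerate(rows):
    print("topic: {} probs: {} sum: {}".format(k, row, sum(row)))
```

Every word with a count of zero in a topic receives the same share `eta / total`, which is why identical small values repeat within a row, and the sums are only approximately 1 because each division is rounded to the nearest float.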

"""
Part 4: Top-N words for each topic
The code below lists the 5 most probable words in each topic.
"""
n = 5
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
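
The slice `[:-(n+1):-1]` used above is compact but easy to misread: argsort orders the word indices from lowest to highest probability, and the slice then walks backwards from the end for n steps. The sketch below does the same selection in plain Python on a made-up 5-word distribution; `top_n_words` and the sample values are hypothetical.

```python
def top_n_words(topic_dist, vocab, n):
    """Return the n most probable words, highest probability first."""
    # Indices sorted by ascending probability, the same order np.argsort gives.
    order = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j])
    # Walk backwards from the last (largest) index and stop after n items.
    return [vocab[j] for j in order[:-(n + 1):-1]]


# Hypothetical distribution over the first five vocabulary words.
sample_vocab = ("church", "pope", "years", "people", "mother")
sample_dist = [0.05, 0.40, 0.10, 0.15, 0.30]
print(top_n_words(sample_dist, sample_vocab, 3))
# prints ['pope', 'mother', 'people']
```

Asking for more words than the vocabulary contains simply returns all of them in descending order, because the slice stops at the start of the list.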

"""
Part 5: Document-topic distribution
Find the most probable topic for each of the first 10 documents.

"""
doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))
for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))

type(doc_topic): <class 'numpy.ndarray'>
shape: (395, 20)
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
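
argmax reports only which topic wins, not by how much. As a small sketch of one way to keep that information, the code below groups documents by their dominant topic and stores the winning weight next to each document id. The function `group_by_dominant_topic` and the 4 x 3 matrix are hypothetical stand-ins for `doc_topic`.

```python
from collections import defaultdict


def group_by_dominant_topic(doc_topic_rows):
    """Map each topic index to (doc_id, weight) pairs for documents it dominates."""
    groups = defaultdict(list)
    for doc_id, dist in enumerate(doc_topic_rows):
        # Index of the largest weight, the same role as dist.argmax().
        best = max(range(len(dist)), key=lambda k: dist[k])
        groups[best].append((doc_id, dist[best]))
    return dict(groups)


# Hypothetical distributions: 4 documents over 3 topics, each row sums to 1.
sample_doc_topic = [
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
    [0.55, 0.40, 0.05],
    [0.05, 0.05, 0.90],
]
for topic, members in sorted(group_by_dominant_topic(sample_doc_topic).items()):
    print("topic: {} docs: {}".format(topic, members))
# prints:
# topic: 0 docs: [(0, 0.8), (2, 0.55)]
# topic: 2 docs: [(1, 0.7), (3, 0.9)]
```

Documents 0 and 2 land in the same group, but the weight of 0.55 shows that document 2 is split between two topics, which a bare argmax would hide.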

"""
Part 6: Two kinds of plots
See the original English article for details.
1. The distribution of word weights within individual topics:

"""
import matplotlib.pyplot as plt
f, ax= plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([0, 5, 9, 14, 19]):
    ax[i].stem(topic_word[k,:], linefmt='b-',
               markerfmt='bo', basefmt='w-')
    ax[i].set_xlim(-50,4350)
    ax[i].set_ylim(0, 0.08)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("topic {}".format(k))
 
ax[4].set_xlabel("word")
 
plt.tight_layout()
plt.show()
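# In a Jupyter notebook, run the magic `%matplotlib inline` once before plotting; it is not valid syntax in a plain Python script.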

# 2. The second plot shows how each document is distributed over the topics:
import matplotlib.pyplot as plt
f, ax= plt.subplots(5, 1, figsize=(8, 6), sharex=True)
for i, k in enumerate([1, 3, 4, 8, 9]):
    ax[i].stem(doc_topic[k,:], linefmt='r-',
               markerfmt='ro', basefmt='w-')
    ax[i].set_xlim(-1, 21)
    ax[i].set_ylim(0, 1)
    ax[i].set_ylabel("Prob")
    ax[i].set_title("Document {}".format(k))
 
ax[4].set_xlabel("Topic")
 
plt.tight_layout()
plt.show()

Reposted from: (link to the original article)
