Using LDA for Text Processing in Python

Notes:

Original article: http://chrisstrelioff.ws/sandbox/2014/11/13/getting_started_with_latent_dirichlet_allocation_in_python.html

This post covers the main content of that article.

For background on LDA, see LDA漫游指南 (a Chinese-language guide).

The Python library lda used here comes from https://github.com/ariddell/lda .

The gensim library also provides LDA functionality.
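
As an aside, here is a minimal sketch of the gensim equivalent (the two-document toy corpus below is made up for illustration):

from gensim import corpora, models

# toy corpus: each document is a list of tokens
texts = [["church", "pope", "vatican"],
         ["yeltsin", "russia", "kremlin"]]
dictionary = corpora.Dictionary(texts)           # token <-> integer id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
print(lda_model.print_topics())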

Installation

$ pip install lda --user

Example

from __future__ import division, print_function

import numpy as np
import lda
import lda.datasets

# document-term matrix
X = lda.datasets.load_reuters()
print("type(X): {}".format(type(X)))
print("shape: {}\n".format(X.shape))
print(X[:5, :5])

'''Output:

type(X): <type 'numpy.ndarray'>
shape: (395L, 4258L)

[[ 1  0  1  0  0]
 [ 7  0  2  0  0]
 [ 0  0  0  1 10]
 [ 6  0  1  0  0]
 [ 0  0  0  2 14]]
'''

X is a 395×4258 matrix: 395 documents over a vocabulary of 4258 words. Each entry is the number of times a word occurs in a document.
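
If you want to build a document-term matrix like this from raw text yourself, scikit-learn's CountVectorizer is one common option (an aside; the original post does not use scikit-learn, and get_feature_names_out() requires scikit-learn >= 1.0):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the pope visited the church",
        "yeltsin leads russia"]
vec = CountVectorizer()
X_own = vec.fit_transform(docs).toarray()  # counts, shape (n_docs, n_unique_words)
vocab_own = vec.get_feature_names_out()    # column index -> word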

Let's look at which words these are:

# the vocab
vocab = lda.datasets.load_reuters_vocab()
print("type(vocab): {}".format(type(vocab)))
print("len(vocab): {}\n".format(len(vocab)))
print(vocab[:6])

'''Output:
type(vocab): <type 'tuple'>
len(vocab): 4258

('church', 'pope', 'years', 'people', 'mother', 'last')
'''

In X, column 0 corresponds to the word church, and column 1 to pope.
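
Since vocab is a plain tuple, the reverse lookup (word to column index) is simply:

print(vocab.index('pope'))  # -> 1, the column for 'pope'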

Next, let's look at the document titles:

# titles for each story
titles = lda.datasets.load_reuters_titles()
print("type(titles): {}".format(type(titles)))
print("len(titles): {}\n".format(len(titles)))
print(titles[:2])  # titles of the first two documents

'''Output:
type(titles): <type 'tuple'>
len(titles): 395

('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20', '1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21')
'''

Train the model with 20 topics and 500 iterations:

model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)
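
While fitting, the lda package logs the model's log likelihood every few iterations; recent versions also expose it as a method, so as a version-dependent sketch you can query the final value:

# higher (less negative) log likelihood generally indicates a better fit
print(model.loglikelihood())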

The topic-word distribution:

topic_word = model.topic_word_
print("type(topic_word): {}".format(type(topic_word)))
print("shape: {}".format(topic_word.shape))

'''Output:
type(topic_word): <type 'numpy.ndarray'>
shape: (20L, 4258L)
'''

Each row of topic_word corresponds to one topic, and each row sums to 1. Let's look at the weights of the three words 'church', 'pope', and 'years' across the topics:

print(topic_word[:, :3])

'''Output:
[[  2.72436509e-06   2.72436509e-06   2.72708945e-03]
 [  2.29518860e-02   1.08771556e-06   7.83263973e-03]
 [  3.97404221e-03   4.96135108e-06   2.98177200e-03]
 [  3.27374625e-03   2.72585033e-06   2.72585033e-06]
 [  8.26262882e-03   8.56893407e-02   1.61980569e-06]
 [  1.30107788e-02   2.95632328e-06   2.95632328e-06]
 [  2.80145003e-06   2.80145003e-06   2.80145003e-06]
 [  2.42858077e-02   4.66944966e-06   4.66944966e-06]
 [  6.84655429e-03   1.90129250e-06   6.84655429e-03]
 [  3.48361655e-06   3.48361655e-06   3.48361655e-06]
 [  2.98781661e-03   3.31611166e-06   3.31611166e-06]
 [  4.27062069e-06   4.27062069e-06   4.27062069e-06]
 [  1.50994982e-02   1.64107142e-06   1.64107142e-06]
 [  7.73480150e-07   7.73480150e-07   1.70946848e-02]
 [  2.82280146e-06   2.82280146e-06   2.82280146e-06]
 [  5.15309856e-06   5.15309856e-06   4.64294180e-03]
 [  3.41695768e-06   3.41695768e-06   3.41695768e-06]
 [  3.90980357e-02   1.70316633e-03   4.42279319e-03]
 [  2.39373034e-06   2.39373034e-06   2.39373034e-06]
 [  3.32493234e-06   3.32493234e-06   3.32493234e-06]]
'''
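
A quick sanity check of the claim that each row is a probability distribution:

# each topic's word distribution should sum to 1 (up to floating-point error)
assert np.allclose(topic_word.sum(axis=1), 1.0)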

Get the 5 highest-weighted words in each topic:

n = 5
for i, topic_dist in enumerate(topic_word):
    # argsort sorts ascending; the reversed slice takes the n largest weights
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))

'''Output:
*Topic 0
- government british minister west group
*Topic 1
- church first during people political
*Topic 2
- elvis king wright fans presley
*Topic 3
- yeltsin russian russia president kremlin
*Topic 4
- pope vatican paul surgery pontiff
*Topic 5
- family police miami versace cunanan
*Topic 6
- south simpson born york white
*Topic 7
- order church mother successor since
*Topic 8
- charles prince diana royal queen
*Topic 9
- film france french against actor
*Topic 10
- germany german war nazi christian
*Topic 11
- east prize peace timor quebec
*Topic 12
- n't told life people church
*Topic 13
- years world time year last
*Topic 14
- mother teresa heart charity calcutta
*Topic 15
- city salonika exhibition buddhist byzantine
*Topic 16
- music first people tour including
*Topic 17
- church catholic bernardin cardinal bishop
*Topic 18
- harriman clinton u.s churchill paris
*Topic 19
- century art million museum city
'''

The document-topic distribution:

doc_topic = model.doc_topic_
print("type(doc_topic): {}".format(type(doc_topic)))
print("shape: {}".format(doc_topic.shape))

'''Output:
type(doc_topic): <type 'numpy.ndarray'>
shape: (395, 20)
'''

Each row corresponds to one document, and each row sums to 1.

Print the most likely topic for each of the first 10 documents:

for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}".format(n, topic_most_pr))

'''Output:
doc: 0 topic: 8
doc: 1 topic: 1
doc: 2 topic: 14
doc: 3 topic: 8
doc: 4 topic: 14
doc: 5 topic: 14
doc: 6 topic: 14
doc: 7 topic: 14
doc: 8 topic: 14
doc: 9 topic: 8
'''
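
Going the other way, you can rank documents by their weight on a given topic. A minimal sketch for topic 14 (the mother teresa topic above):

topic_id = 14
top_docs = np.argsort(doc_topic[:, topic_id])[::-1][:3]  # 3 highest-weight documents
for d in top_docs:
    print("{:.3f}  {}".format(doc_topic[d, topic_id], titles[d]))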


Swapping in your own dataset

After installing the package, replace the data files referenced in datasets.py with your own: reuters.ldac in load_reuters(), reuters.tokens in load_reuters_vocab(), and reuters.titles in load_reuters_titles(). Generate your files in the same formats as the ones bundled with the package.
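
Alternatively, you can load your own files directly instead of editing datasets.py. reuters.ldac uses the LDA-C format (each line is "<num_unique_terms> <term_id>:<count> ..."), and reuters.tokens holds one token per line. A minimal loader sketch, with my_corpus.ldac and my_corpus.tokens as hypothetical file names:

import numpy as np
import lda

def load_ldac(path, n_vocab):
    # read an LDA-C file into an (n_docs, n_vocab) count matrix
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            row = np.zeros(n_vocab, dtype=np.int64)
            for pair in parts[1:]:  # parts[0] is the number of unique terms
                term_id, count = pair.split(":")
                row[int(term_id)] = int(count)
            rows.append(row)
    return np.vstack(rows)

vocab = tuple(open("my_corpus.tokens").read().split())
X = load_ldac("my_corpus.ldac", len(vocab))
model = lda.LDA(n_topics=20, n_iter=500, random_state=1)
model.fit(X)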