There are plenty of online resources explaining the theory behind the LDA topic model.
I personally recommend http://download.csdn.net/detail/a1368783069/9592238
which covers the mathematical principles and the derivations in detail.
So below I will instead walk through a concrete example of running LDA with gensim in Python.
Since I don't have a suitable Chinese corpus at hand, I will use a corpus bundled with sklearn.
The workflow is identical for Chinese and English text, except that Chinese text additionally requires word segmentation.
Obtaining the corpus
Fetch the 20 Newsgroups corpus bundled with sklearn:
from sklearn import datasets
news_dataset=datasets.fetch_20newsgroups(subset="all",remove=("headers","footers","quotes"))
documents=news_dataset.data
print(documents[0])
#"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game. PENS RULE!!!\n\n"
Text preprocessing
Turn each document into a list of tokens:
1. (English text) Lowercase everything and strip non-word characters to get an initial token list.
   (Chinese text) Segment the text into words and strip irrelevant characters to build the token list.
2. Remove stop words from the token list.
3. Build the corpus dictionary.
import re
import gensim
from gensim.parsing.preprocessing import STOPWORDS
stopwords = STOPWORDS
def tokenize(text):
    text = text.lower()
    words = re.sub(r"\W", " ", text).split()
    words = [w for w in words if w not in stopwords]
    return words
processed_docs = [tokenize(doc) for doc in documents]
# obtain: (word_id: word)
word_count_dict = gensim.corpora.Dictionary(processed_docs)
word_count_dict.filter_extremes(no_below=20, no_above=0.1)
# keep only words that appear in at least 20 documents and in no more than 10% of all documents
Convert each document (token list) into bag-of-words (BOW) format: a list of (token_id, token_count) tuples.
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
LDA analysis
lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)
Alternatively, use the parallel LDA implementation to speed up training.
Its signature:
gensim.models.ldamulticore.LdaMulticore(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)
Output the topics
lda_model.print_topics(10)
Output:
[(0, '0.006*government + 0.005*going + 0.004*law + 0.004*q + 0.004*work + 0.003*public + 0.003*gun + 0.003*ll + 0.003*mr + 0.003*said'),
(1, '0.018*god + 0.006*believe + 0.005*jesus + 0.005*said + 0.004*bible + 0.004*church + 0.004*christian + 0.004*life + 0.004*things + 0.004*christ'),
(2, '0.687*ax + 0.052*max + 0.045*q + 0.023*p + 0.018*r + 0.016*g + 0.015*7 + 0.011*n + 0.010*145 + 0.008*pl'),
(3, '0.014*x + 0.014*car + 0.007*bike + 0.005*cars + 0.004*ground + 0.004*engine + 0.004*miles + 0.004*road + 0.004*light + 0.003*power'),
(4, '0.021*edu + 0.010*com + 0.010*mail + 0.009*information + 0.009*space + 0.007*data + 0.006*available + 0.006*1993 + 0.006*list + 0.006*send'),
(5, '0.011*game + 0.010*team + 0.009*year + 0.007*games + 0.005*season + 0.005*hockey + 0.005*play + 0.004*league + 0.004*players + 0.004*years'),
(6, '0.010*dos + 0.009*drive + 0.009*card + 0.007*windows + 0.007*disk + 0.007*thanks + 0.005*5 + 0.005*scsi + 0.005*hard + 0.005*problem'),
(7, '0.076*0 + 0.033*4 + 0.029*5 + 0.028*6 + 0.026*7 + 0.024*w + 0.024*8 + 0.023*p + 0.022*9 + 0.021*c'),
(8, '0.015*x + 0.013*file + 0.012*image + 0.008*program + 0.008*files + 0.008*windows + 0.007*key + 0.006*version + 0.006*window + 0.006*code'),
(9, '0.007*armenian + 0.007*war + 0.007*jews + 0.006*armenians + 0.006*000 + 0.006*turkish + 0.005*world + 0.005*states + 0.005*turkey + 0.004*history')]
References:
Gensim and LDA: a quick tour
http://nbviewer.jupyter.org/gist/boskaiolo/cc3e1341f59bfbd02726
models.ldamodel – Latent Dirichlet Allocation
http://radimrehurek.com/gensim/models/ldamodel.html
Text topic model (LDA) analysis based on gensim
http://blog.csdn.net/u010297828/article/details/50464845