Practicing LDA in gensim

There are plenty of online resources introducing the theory behind the LDA topic model.
I personally recommend http://download.csdn.net/detail/a1368783069/9592238,
which explains the mathematical principles and the derivation process in detail.

So the rest of this post walks through a concrete example of how LDA is implemented with gensim in Python.

Since I don't have a suitable Chinese corpus at hand, I use a corpus bundled with sklearn.
The workflow is identical for English and Chinese text, except that Chinese must first be segmented into words (a minimal sketch follows).
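For the Chinese case, here is a minimal tokenization sketch, assuming the third-party jieba segmentation library (not used elsewhere in this post) is installed:

import jieba

def tokenize_zh(text):
    # jieba.cut returns a generator of segmented words
    words = [w.strip() for w in jieba.cut(text)]
    # keep only non-empty alphanumeric tokens (drops punctuation)
    return [w for w in words if w and w.isalnum()]

print(tokenize_zh("我爱自然语言处理"))
# e.g. ['我', '爱', '自然语言', '处理'] (exact segmentation depends on jieba's dictionary)

The resulting token lists can then go through the same dictionary / bag-of-words pipeline described below.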

Obtaining the corpus

Fetch the dataset bundled with sklearn:

from sklearn import datasets
news_dataset=datasets.fetch_20newsgroups(subset="all",remove=("headers","footers","quotes"))
documents=news_dataset.data
print(documents[0])

#"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n"

Text processing

Convert the documents into lists of tokens:
1. (English text) Normalize the text: lowercase everything and strip irrelevant symbols to obtain an initial token list.
   (Chinese text) Segment the text into words and strip irrelevant symbols to build the token list.
2. Remove stopwords from the token list.
3. Build a dictionary over the corpus.

import re
import gensim
from gensim.parsing.preprocessing import STOPWORDS

stopwords = STOPWORDS

def tokenize(text):
    text = text.lower()
    # replace non-word characters with spaces, then split into tokens
    words = re.sub(r"\W", " ", text).split()
    # drop stopwords
    words = [w for w in words if w not in stopwords]
    return words

processed_docs = [tokenize(doc) for doc in documents]
# build the dictionary: a mapping word_id -> word
word_count_dict = gensim.corpora.Dictionary(processed_docs)

# keep only tokens that appear in at least 20 documents
# and in no more than 10% of all documents
word_count_dict.filter_extremes(no_below=20, no_above=0.1)

Convert each document (token list) to bag-of-words (BOW) format: a list of (token_id, token_count) tuples.

bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
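To illustrate what doc2bow produces (the ids below are made up; real ids depend on the dictionary, and tokens filtered out of the dictionary are simply ignored):

print(word_count_dict.doc2bow(["game", "team", "game"]))
# e.g. [(1512, 2), (2871, 1)] -- "game" counted twice, "team" once (illustrative ids)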

LDA analysis

lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)

Alternatively, use the parallelized LDA implementation to speed up training.
Its signature:

 gensim.models.ldamulticore.LdaMulticore(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)
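For example, a minimal call mirroring the single-core model above (workers=3 is an assumption; set it to roughly your number of physical CPU cores minus one):

lda_model = gensim.models.ldamulticore.LdaMulticore(
    bag_of_words_corpus, num_topics=10, id2word=word_count_dict,
    passes=5, workers=3)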

Printing the topics

lda_model.print_topics(10)

Output:

[(0, '0.006*government + 0.005*going + 0.004*law + 0.004*q + 0.004*work + 0.003*public + 0.003*gun + 0.003*ll + 0.003*mr + 0.003*said'), 
(1, '0.018*god + 0.006*believe + 0.005*jesus + 0.005*said + 0.004*bible + 0.004*church + 0.004*christian + 0.004*life + 0.004*things + 0.004*christ'), 
(2, '0.687*ax + 0.052*max + 0.045*q + 0.023*p + 0.018*r + 0.016*g + 0.015*7 + 0.011*n + 0.010*145 + 0.008*pl'), 
(3, '0.014*x + 0.014*car + 0.007*bike + 0.005*cars + 0.004*ground + 0.004*engine + 0.004*miles + 0.004*road + 0.004*light + 0.003*power'), 
(4, '0.021*edu + 0.010*com + 0.010*mail + 0.009*information + 0.009*space + 0.007*data + 0.006*available + 0.006*1993 + 0.006*list + 0.006*send'), 
(5, '0.011*game + 0.010*team + 0.009*year + 0.007*games + 0.005*season + 0.005*hockey + 0.005*play + 0.004*league + 0.004*players + 0.004*years'), 
(6, '0.010*dos + 0.009*drive + 0.009*card + 0.007*windows + 0.007*disk + 0.007*thanks + 0.005*5 + 0.005*scsi + 0.005*hard + 0.005*problem'), 
(7, '0.076*0 + 0.033*4 + 0.029*5 + 0.028*6 + 0.026*7 + 0.024*w + 0.024*8 + 0.023*p + 0.022*9 + 0.021*c'), 
(8, '0.015*x + 0.013*file + 0.012*image + 0.008*program + 0.008*files + 0.008*windows + 0.007*key + 0.006*version + 0.006*window + 0.006*code'), 
(9, '0.007*armenian + 0.007*war + 0.007*jews + 0.006*armenians + 0.006*000 + 0.006*turkish + 0.005*world + 0.005*states + 0.005*turkey + 0.004*history')]
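The trained model can also be queried for the topic distribution of an individual document. A minimal sketch (the exact probabilities and topic ids vary between training runs):

bow = word_count_dict.doc2bow(tokenize(documents[0]))
print(lda_model.get_document_topics(bow))
# e.g. [(5, 0.85), ...] -- the hockey post above would load mainly on the sports topic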

References:
Gensim and LDA: a quick tour
http://nbviewer.jupyter.org/gist/boskaiolo/cc3e1341f59bfbd02726

models.ldamodel – Latent Dirichlet Allocation
http://radimrehurek.com/gensim/models/ldamodel.html

Text topic modeling (LDA) with gensim (in Chinese)
http://blog.csdn.net/u010297828/article/details/50464845
