There are plenty of online resources explaining the theory behind the LDA topic model.
I personally recommend http://download.csdn.net/detail/a1368783069/9592238
which covers the mathematical principles and the derivations in detail.
So below I will instead walk through a concrete example of running LDA with gensim in Python.
Since I don't have a suitable Chinese corpus at hand, I will use a corpus bundled with sklearn.
The workflow is identical for Chinese and English text, except that Chinese text additionally requires word segmentation.
Obtaining the corpus
Fetch the 20 Newsgroups corpus bundled with sklearn:
from sklearn import datasets
news_dataset=datasets.fetch_20newsgroups(subset="all",remove=("headers","footers","quotes"))
documents=news_dataset.data
print(documents[0])
#"\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game. PENS RULE!!!\n\n"
Text preprocessing
Turn each document into a list of tokens:
1. (English text) Lowercase everything and strip non-word characters to get an initial token list.
   (Chinese text) Segment the text into words and strip irrelevant characters to build the token list.
2. Remove stop words from the token list.
3. Build the corpus dictionary.
import re
import gensim
from gensim.parsing.preprocessing import STOPWORDS
stopwords = STOPWORDS
def tokenize(text):
    text = text.lower()
    words = re.sub(r"\W", " ", text).split()
    words = [w for w in words if w not in stopwords]
    return words
processed_docs = [tokenize(doc) for doc in documents]
# obtain: (word_id: word)
word_count_dict = gensim.corpora.Dictionary(processed_docs)
word_count_dict.filter_extremes(no_below=20, no_above=0.1)
# keep only words that appear in at least 20 documents and in no more than 10% of all documents
Convert each document (token list) into bag-of-words (BOW) format: a list of (token_id, token_count) tuples.
bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]
LDA analysis
lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=10, id2word=word_count_dict, passes=5)
Alternatively, use the parallel LDA implementation to speed up training.
Its signature:
gensim.models.ldamulticore.LdaMulticore(corpus=None, num_topics=100, id2word=None, workers=None, chunksize=2000, passes=1, batch=False, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, random_state=None)
Output the topics
lda_model.print_topics(10)
Output:
[(0, '0.006*government + 0.005*going + 0.004*law + 0.004*q + 0.004*work + 0.003*public + 0.003*gun + 0.003*ll + 0.003*mr + 0.003*said'),
(1, '0.018*god + 0.006*believe + 0.005*jesus + 0.005*said + 0.004*bible + 0.004*church + 0.004*christian + 0.004*life + 0.004*things + 0.004*christ'),
(2, '0.687*ax + 0.052*max + 0.045*q + 0.023*p + 0.018*r + 0.016*g + 0.015*7 + 0.011*n + 0.010*145 + 0.008*pl'),
(3, '0.014*x + 0.014*car + 0.007*bike + 0.005*cars + 0.004*ground + 0.004*engine + 0.004*miles + 0.004*road + 0.004*light + 0.003*power'),
(4, '0.021*edu + 0.010*com + 0.010*mail + 0.009*information + 0.009*space + 0.007*data + 0.006*available + 0.006*1993 + 0.006*list + 0.006*send'),
(5, '0.011*game + 0.010*team + 0.009*year + 0.007*games + 0.005*season + 0.005*hockey + 0.005*play + 0.004*league + 0.004*players + 0.004*years'),
(6, '0.010*dos + 0.009*drive + 0.009*card + 0.007*windows + 0.007*disk + 0.007*thanks + 0.005*5 + 0.005*scsi + 0.005*hard + 0.005*problem'),
(7, '0.076*0 + 0.033*4 + 0.029*5 + 0.028*6 + 0.026*7 + 0.024*w + 0.024*8 + 0.023*p + 0.022*9 + 0.021*c'),
(8, '0.015*x + 0.013*file + 0.012*image + 0.008*program + 0.008*files + 0.008*windows + 0.007*key + 0.006*version + 0.006*window + 0.006*code'),
(9, '0.007*armenian + 0.007*war + 0.007*jews + 0.006*armenians + 0.006*000 + 0.006*turkish + 0.005*world + 0.005*states + 0.005*turkey + 0.004*history')]
References:
Gensim and LDA: a quick tour
http://nbviewer.jupyter.org/gist/boskaiolo/cc3e1341f59bfbd02726
models.ldamodel – Latent Dirichlet Allocation
http://radimrehurek.com/gensim/models/ldamodel.html
Text topic model (LDA) analysis based on gensim
http://blog.csdn.net/u010297828/article/details/50464845