NLP Topic Extraction: LDA Code Practice with the gensim Package

Original work; please credit the source when reposting: [ Mr.Scofield  http://blog.csdn.net/scotfield_msn/article/details/72904651 ]

From RxNLP.




        This post shares a hands-on code practice: using the LDA model from the gensim package for a typical NLP task, topic extraction.

        A side note: for NLP tasks, the best approach is to first get the code running, then dig into the theory, and finally implement the model, algorithm and framework yourself.

        Another side note: for getting NLP or ML tasks running, I recommend working in Python with mature packages such as sklearn and numpy, which is efficient. If you want to be stricter with yourself, run the same task again in Java with the corresponding libraries, and then you can compare the differences between the two language platforms.


        No more talk; here is the code. The comments are clear (they are written in English, bear with them, thanks).


import gensim
from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
import os
from pprint import pprint



# Step 1: prepare the data; fetch_20newsgroups comes from sklearn's datasets

news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

documents = news_dataset.data

print "In the dataset there are", len(documents), "textual documents"
print "And this is the first one:\n", documents[0]


# Step 2: tokenize the sentences (segmentation, stopword removal, etc.) and build bag-of-words vectors

def tokenize(text):
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]


# print "After the tokenizer, the previous document becomes:\n", tokenize(documents[0])


# Next step: tokenize all the documents and build a Dictionary that maps each token to an id and tracks its frequency over the complete corpus.
processed_docs = [tokenize(doc) for doc in documents]
word_count_dict = Dictionary(processed_docs)
# print "In the corpus there are", len(word_count_dict), "unique tokens"

# print "\n",word_count_dict,"\n"

word_count_dict.filter_extremes(no_below=20, no_above=0.1)  # keep tokens that appear in at least 20 documents and in no more than 10% of the documents
# print "After filtering, in the corpus there are only", len(word_count_dict), "unique tokens"

bag_of_words_corpus = [word_count_dict.doc2bow(pdoc) for pdoc in processed_docs]  # bag-of-words vector for every document in the corpus
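
# (Optional sanity check, not in the original post.) Each entry of bag_of_words_corpus
# is a sparse list of (token_id, count) pairs; map the first few ids of document 0 back to words:
for token_id, count in bag_of_words_corpus[0][:5]:
    print word_count_dict[token_id], count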



# Step 3: fit the LDA model


model_name = "./model.lda"
if os.path.exists(model_name):
    lda_model = gensim.models.LdaModel.load(model_name)
    print "loaded existing model from", model_name
else:
    # num_topics: the number of latent topics the model will extract
    lda_model = gensim.models.LdaModel(bag_of_words_corpus, num_topics=100, id2word=word_count_dict, passes=5)
    lda_model.save(model_name)
    print "trained new model and saved to", model_name


# Step 4: test topic extraction on unseen sentences/documents; three experiments

# 1.
# Without a target document, lda_model.print_topics(k) prints the top keywords of k topics
# learned from the whole bag_of_words_corpus.
# Given a single document instead (experiments 2 and 3 below), the model reports the topics
# of that document only.

pprint(lda_model.print_topics(30, 6))  # num_topics defaults to 10 and cannot exceed the model's topic count; num_words defaults to 10
print "\n"
# pprint(lda_model.print_topics(10))

# 2.
# when you pass a particular document from the training corpus:
for index, score in sorted(lda_model[bag_of_words_corpus[0]], key=lambda tup: -1 * tup[1]):
    print "Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5))
print
print news_dataset.target_names[news_dataset.target[0]]  # bag_of_words_corpus is aligned with news_dataset, so this is the true label of document 0
print "\n"

# 3.
# process an unseen document
unseen_document = "In my spare time I either play badminton or drive my car"
print "The unseen document is composed by the following text:", unseen_document
print
bow_vector = word_count_dict.doc2bow(tokenize(unseen_document))
for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1 * tup[1]):
    print "Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 7))






refs:

http://nbviewer.jupyter.org/gist/boskaiolo/cc3e1341f59bfbd02726
http://www.voidcn.com/blog/u010297828/article/p-4995136.html
http://radimrehurek.com/gensim/models/ldamodel.html
http://blog.csdn.net/accumulate_zhang/article/details/62453672

