NLP Topic Extraction: a Topic LDA Learning Case
For reference material on the data-preparation step, see: https://blog.csdn.net/xiaoql520/article/details/79883409
Further references are listed at the end of the code.
# -*- coding: UTF-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
import os
from pprint import pprint

# Prepare the data
news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))  # fetch and cache the dataset
documents = news_dataset.data

print("In the dataset there are", len(documents), "textual documents")
"""
In the dataset there are 18846 textual documents
"""

print("And this is the first one:\n", documents[0])
"""And this is the first one:
I am sure some bashers of Pens fans are pretty confused about the lack of any
kind of posts about the recent Pens massacre of the Devils. Actually, I am
bit puzzled too and a bit relieved. However, I am going to put an end to
non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are
killing those Devils worse than I thought. Jagr just showed you why he is
much better than his regular season stats. He is also a lot fo fun to watch
in the playoffs. Bowman should let JAgr have a lot of fun in the next couple
of games since the Pens are going to beat the pulp out of Jersey anyway. I
was very disappointed not to see the Islanders lose the final regular season
game. PENS RULE!!!
"""

print("In the dataset, the filenames are as follows:\n", news_dataset.filenames)
"""
In the dataset, the filenames are as follows:
['C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.sport.hockey\\54367'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60215'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76120'
 ...
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60695'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38319'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.autos\\103195']
"""

print("In the dataset, the target is as follows:\n", news_dataset.target)
"""
In the dataset, the target is as follows:
[10 3 17 ... 3 1 7]
"""

# Tokenize the sentences (word segmentation, stopword removal, etc.); each document
# will later be represented as a bag-of-words vector.
def tokenize(text):
    """
    Tokenize `text` and remove stopwords. STOPWORDS is the stopword collection of
    Stone, Denis & Kwantes (2010).
    :param text: the text to process
    :return: the token sequence with stopwords removed
    """
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

print("Result of tokenizing the first document and removing stopwords:\n", tokenize(documents[0]))
"""
Result of tokenizing the first document and removing stopwords:
['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts',
 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit',
 'relieved', 'going', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise',
 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed',
 'better', 'regular', 'season', 'stats', 'lot', 'fo', 'fun', 'watch', 'playoffs',
 'bowman', 'let', 'jagr', 'lot', 'fun', 'couple', 'games', 'pens', 'going',
 'beat', 'pulp', 'jersey', 'disappointed', 'islanders', 'lose', 'final',
 'regular', 'season', 'game', 'pens', 'rule']
"""

# Tokenize every document in the collection and remove stopwords, yielding the
# corresponding token sequences.
processed_docs = [tokenize(doc) for doc in documents]

# Dictionary encapsulates the mapping between normalized words and their integer ids.
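The listing breaks off at the Dictionary step. Below is a minimal sketch of how the pipeline would typically continue, using gensim's standard Dictionary / doc2bow / LdaModel API and the variables defined above; the pruning thresholds and LDA hyperparameters (no_below=5, no_above=0.2, num_topics=10, passes=2) are illustrative assumptions, not values from the original post.

# --- Illustrative continuation (assumed, not from the original post) ---
# Build the Dictionary: the mapping between normalized words and their integer ids.
dictionary = Dictionary(processed_docs)
print("A few token -> id mappings:", list(dictionary.token2id.items())[:5])

# Prune tokens that appear in fewer than 5 documents or in more than 20% of them
# (illustrative thresholds).
dictionary.filter_extremes(no_below=5, no_above=0.2)

# Convert each token sequence into a bag-of-words vector: a list of (token_id, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train an LDA topic model on the bag-of-words corpus (illustrative hyperparameters)
# and show the top words of each topic.
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=10,
                                   id2word=dictionary, passes=2)
pprint(lda_model.print_topics(num_topics=10, num_words=5))

Note that doc2bow only counts token occurrences per document, so the Dictionary's integer ids are what actually flow into the LDA model.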