NLP Topic Extraction: a Topic LDA Learning Case
For reference material on the data-preparation step, see: https://blog.csdn.net/xiaoql520/article/details/79883409
Further references are listed at the end of the code.
# -*- coding: UTF-8 -*-
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
import gensim
from sklearn.datasets import fetch_20newsgroups
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim.corpora import Dictionary
import os
from pprint import pprint

# Prepare the data
news_dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))  # fetch and cache the dataset
documents = news_dataset.data

print("In the dataset there are", len(documents), "textual documents")
"""
In the dataset there are 18846 textual documents
"""

print("And this is the first one:\n", documents[0])
"""And this is the first one:
I am sure some bashers of Pens fans are pretty confused about the lack of any
kind of posts about the recent Pens massacre of the Devils. Actually, I am
bit puzzled too and a bit relieved. However, I am going to put an end to
non-PIttsburghers' relief with a bit of praise for the Pens. Man, they are
killing those Devils worse than I thought. Jagr just showed you why he is
much better than his regular season stats. He is also a lot fo fun to watch
in the playoffs. Bowman should let JAgr have a lot of fun in the next couple
of games since the Pens are going to beat the pulp out of Jersey anyway. I
was very disappointed not to see the Islanders lose the final regular season
game. PENS RULE!!!
"""

print("In the dataset, the filenames are as follows:\n", news_dataset.filenames)
"""
In the dataset, the filenames are as follows:
['C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.sport.hockey\\54367'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60215'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.politics.mideast\\76120'
 ...
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.sys.ibm.pc.hardware\\60695'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38319'
 'C:\\Users\\xiaoQ\\scikit_learn_data\\20news_home\\20news-bydate-test\\rec.autos\\103195']
"""

print("In the dataset, the target is as follows:\n", news_dataset.target)
"""
In the dataset, the target is as follows:
[10 3 17 ... 3 1 7]
"""

# Tokenize the sentences (word segmentation, stopword removal, etc.); each document
# will later be represented as a bag-of-words vector.
def tokenize(text):
    """
    Tokenize `text` and remove stopwords. STOPWORDS is the stopword collection of
    Stone, Denis & Kwantes (2010).
    :param text: the text to process
    :return: the token sequence with stopwords removed
    """
    return [token for token in simple_preprocess(text) if token not in STOPWORDS]

print("Result of tokenizing the first document and removing stopwords:\n", tokenize(documents[0]))
"""
Result of tokenizing the first document and removing stopwords:
['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts',
 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit',
 'relieved', 'going', 'end', 'non', 'pittsburghers', 'relief', 'bit', 'praise',
 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed',
 'better', 'regular', 'season', 'stats', 'lot', 'fo', 'fun', 'watch', 'playoffs',
 'bowman', 'let', 'jagr', 'lot', 'fun', 'couple', 'games', 'pens', 'going',
 'beat', 'pulp', 'jersey', 'disappointed', 'islanders', 'lose', 'final',
 'regular', 'season', 'game', 'pens', 'rule']
"""

# Tokenize every document in the collection and remove stopwords, yielding the
# corresponding token sequences.
processed_docs = [tokenize(doc) for doc in documents]

# Dictionary encapsulates the mapping between normalized words and their integer ids.
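The listing breaks off at the Dictionary step. Below is a minimal sketch of how the pipeline would typically continue, using gensim's standard Dictionary / doc2bow / LdaModel API and the variables defined above; the pruning thresholds and LDA hyperparameters (no_below=5, no_above=0.2, num_topics=10, passes=2) are illustrative assumptions, not values from the original post.

# --- Illustrative continuation (assumed, not from the original post) ---
# Build the Dictionary: the mapping between normalized words and their integer ids.
dictionary = Dictionary(processed_docs)
print("A few token -> id mappings:", list(dictionary.token2id.items())[:5])

# Prune tokens that appear in fewer than 5 documents or in more than 20% of them
# (illustrative thresholds).
dictionary.filter_extremes(no_below=5, no_above=0.2)

# Convert each token sequence into a bag-of-words vector: a list of (token_id, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

# Train an LDA topic model on the bag-of-words corpus (illustrative hyperparameters)
# and show the top words of each topic.
lda_model = gensim.models.LdaModel(bow_corpus, num_topics=10,
                                   id2word=dictionary, passes=2)
pprint(lda_model.print_topics(num_topics=10, num_words=5))

Note that doc2bow only counts token occurrences per document, so the Dictionary's integer ids are what actually flow into the LDA model.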