Word Frequency Statistics and Word Cloud Drawing

ryo007gnnu

Article link: https://blog.csdn.net/ryo007gnnu/article/details/109067643

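In the previous posts we extracted the content of many articles. Now we will run word frequency statistics on that text and draw a word cloud to see which topics are hot.
The text extracted earlier contains many unwanted characters, such as list punctuation, because it was extracted and stored as lists. A list can be converted to a string with the ''.join() method; you can handle this yourself, so I won't go into it here.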
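As a minimal sketch of that conversion, assume each article was saved as a Python list of paragraph strings. The sample data and the name `paragraphs` are hypothetical and not taken from the earlier posts. `''.join()` concatenates the pieces into one string without the brackets, quotes and commas of the list representation:

```python
# Hypothetical sample: one article stored as a list of paragraph strings.
paragraphs = ["绿色金融发展迅速。", "多家银行发行绿色债券。"]

# str(paragraphs) keeps list punctuation such as brackets, quotes and commas.
as_list_text = str(paragraphs)

# ''.join() concatenates the elements with no separator and no list punctuation.
article = "".join(paragraphs)

print(as_list_text)
print(article)
```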
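The basic idea of word frequency counting is to segment all of the text into words and then remove the stop words, which are words that carry little meaning on their own, such as “的” (of), “那么” (then) and “如果” (if). Because the segmentation results still contained many unwanted words, I cleaned the text first. All of my text files are stored in the folder 绿色金融文本库 (green finance text library) on drive D.
os.walk traverses a folder recursively and yields, for each directory, its path, its subdirectories and its files, which lets us reach every file under the folder.
First we get the cleaned text: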

import os

os.chdir('D:\\')  # relative paths, such as the output image, resolve against D:\
text1 = ''
# Walk the corpus folder and concatenate the contents of every file.
for root, dirs, files in os.walk(r'D:\绿色金融文本库'):
    for name in files:
        path = os.path.join(root, name)
        with open(path, 'r', encoding='gb18030', errors='ignore') as f:
            text1 += f.read()  # read() returns the entire file contents

# Strip whitespace and frequent but uninformative strings. Order matters:
# '近日' and '中国' are removed before the single characters '日' and '中'.
for unwanted in [' ', '新华社记者', '中国', '月', '近日', '日', '年', '中', '\n']:
    text1 = text1.replace(unwanted, '')
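The cleaning step can also be wrapped in a small function, which makes it easy to try on a short string. This is only a sketch: the function name `clean_text` and the sample sentence are made up for illustration. It shows why the removal order matters, since longer strings such as '近日' and '中国' have to be removed before the single characters '日' and '中' that they contain.

```python
# Removal order matters: longer strings come before the characters they contain.
UNWANTED = [" ", "新华社记者", "中国", "月", "近日", "日", "年", "中", "\n"]


def clean_text(raw, unwanted=UNWANTED):
    """Remove every occurrence of each unwanted string, in list order."""
    for piece in unwanted:
        raw = raw.replace(piece, "")
    return raw


# Hypothetical sample sentence.
sample = "新华社记者 近日报道，中国绿色金融\n稳步发展"
print(clean_text(sample))  # 报道，绿色金融稳步发展
```

Note that the single characters are removed wherever they occur, so a word such as 中心 is shortened to 心. If that is a problem for your corpus, one option is to put those characters in the stop word list instead, so they are only dropped when they appear as separate tokens.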

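After cleaning, the text needs to be segmented and the stop words removed. This requires the jieba package and a stop word list (stopword.txt), which can be downloaded online.
Load the stop word list: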

with open(r'D:\stopword.txt', encoding='utf-8', errors='ignore') as f:
    stopwords = {line.rstrip() for line in f}  # a set gives fast membership tests

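Segment the text and remove the stop words. The result, str1, is the list of tokens that remain after filtering:

import jieba
# Filter the tokens directly and keep them as a list. Joining them back into a
# string and segmenting again could shift word boundaries and bring stop words back.
str1 = [w for w in jieba.cut(text1) if w.strip() and w not in stopwords]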
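To illustrate the stop word filtering on its own, without depending on jieba or the stop word file, here is a small sketch with a hand-written token list and stop word set. Both are hypothetical stand-ins for the real data.

```python
# Hypothetical stop words and tokens; in the article these come from
# stopword.txt and jieba.cut(text1).
stop_words = {"的", "那么", "如果", "，"}
tokens = ["绿色", "金融", "的", "发展", "，", "如果", "绿色", "债券"]

# Keep only tokens that are not stop words, preserving order and repeats.
kept = [tok for tok in tokens if tok not in stop_words]

print(kept)  # ['绿色', '金融', '发展', '绿色', '债券']
```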

Next come the word frequency statistics and the word cloud. The counting uses the collections module from the standard library:

import collections
word_counts = collections.Counter(str1)  # count how often each token occurs
word_counts_top50 = word_counts.most_common(50)  # the 50 most frequent tokens
print(word_counts_top50)
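For readers who have not used `collections.Counter` before, the following sketch shows what it and `most_common` return, using a made-up token list in place of the real segmentation output:

```python
from collections import Counter

# Hypothetical tokens standing in for the filtered jieba output.
tokens = ["绿色", "金融", "发展", "绿色", "债券", "绿色", "金融"]

counts = Counter(tokens)      # maps each token to its number of occurrences
top2 = counts.most_common(2)  # list of (token, count) pairs, highest count first

print(top2)  # [('绿色', 3), ('金融', 2)]
```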

Finally, generate the word cloud, which requires the wordcloud package:

from wordcloud import WordCloud
wc = WordCloud(background_color='black', max_words=300,
               font_path='C:/Windows/Fonts/simkai.ttf',  # a font with Chinese glyphs
               min_font_size=15, max_font_size=50, width=600, height=600)
wc.generate_from_frequencies(word_counts)
wc.to_file('wordcloud.png')
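`generate_from_frequencies` takes a mapping from word to frequency, so the counts can be trimmed before drawing. The sketch below is an optional step that is not part of the original code. It drops single-character tokens, on the assumption that they are rarely informative in the cloud, and keeps only the most frequent words. The name `trim_frequencies` and the sample counts are hypothetical.

```python
from collections import Counter


def trim_frequencies(counts, top_n=300, min_len=2):
    """Return a dict of the top_n most frequent words with at least min_len characters."""
    filtered = Counter({w: c for w, c in counts.items() if len(w) >= min_len})
    return dict(filtered.most_common(top_n))


# Hypothetical counts.
sample = Counter({"绿色": 5, "金融": 4, "在": 9, "债券": 2})
print(trim_frequencies(sample, top_n=2))  # {'绿色': 5, '金融': 4}
```

The returned dict could then be passed to `generate_from_frequencies` in place of `word_counts`.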