The keywords and high-frequency terms in the report of the 20th CPC National Congress are very helpful for study. Here we try to build a simple word-segmentation and word-cloud system for the report using three libraries:
Word segmentation: jieba
Word cloud: wordcloud
Image processing: imageio
Using online resources, prepare a txt file of the report and a white-background map of China. Then choose the image size, font, background color, and mask, and remove single characters that carry little meaning, such as "的", "了", "是", and "又".
The source code is as follows:
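The stopword step is, at its core, just filtering tokens and counting frequencies. A minimal pure-Python sketch of that idea, using a hypothetical toy token list standing in for jieba's output:

```python
from collections import Counter

# Toy token list standing in for jieba.lcut() output (hypothetical example data)
tokens = ["发展", "的", "发展", "是", "创新", "发展", "了", "创新"]
stopwords = {"的", "是", "了"}  # single characters with little meaning

# Drop stopwords, then count word frequencies
counts = Counter(t for t in tokens if t not in stopwords)
print(counts.most_common(2))  # -> [('发展', 3), ('创新', 2)]
```

WordCloud performs the same kind of filtering internally through its stopwords parameter.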
# Word-frequency statistics and word-cloud generation
import wordcloud as wc
import jieba
import imageio.v2 as img

with open("20d报告.txt", encoding="utf-8") as f:  # read the report text
    s = f.read()
ls = jieba.lcut(s)                 # segment the text into a list of words
text = ' '.join(ls)                # join the words into a space-separated string
mask = img.imread('chinamap.jpg')  # white regions of the mask stay blank
stopwords = {"的", "地", "是", "了", "不", "为", "在", "既", "但", "有",
             "又", "还", "并", "和", "就", "都", "这"}
w = wc.WordCloud(font_path="msyh.ttc",       # a Chinese font is required
                 mask=mask,
                 width=1000,                 # ignored when mask is set;
                 height=700,                 # the mask determines the size
                 background_color='white',
                 max_words=100,
                 stopwords=stopwords).generate(text)
w.to_file('20大.png')              # save the word-cloud image
The generated word cloud: