These notes mainly cover using the jieba package for Chinese word segmentation, the wordcloud package for generating a word cloud, and matplotlib for outputting the image.
"""词云与输出图形的包"""
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
"""分词包 jieba"""
import jieba
"""与词云背景相关的包"""
from PIL import Image
import numpy as np
# No part-of-speech filtering
# raw strings avoid backslash-escape problems in Windows paths
file = r'C:\Users\Administrator\Desktop\python笔记\meg1.txt'
background_file = r'C:\Users\Administrator\Desktop\python笔记\background.jpg'
font_path = r'C:\windows\Fonts\STZHONGS.TTF'
mask = np.array(Image.open(background_file))
with open(file, encoding='utf-8') as f:
    text = f.read()
words = jieba.cut(text)
split = " ".join(words)
wordcloud = WordCloud(font_path=font_path,
                      mask=mask,
                      background_color='white',
                      stopwords=STOPWORDS).generate(split)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
First, read the text to be turned into a word cloud (meg1.txt) and the background image (background.jpg). wordcloud's default font does not support Chinese, so you can point font_path at a suitable font yourself, or replace the default font inside the wordcloud package (I replaced the default font).
Segment the text with jieba's cut method. cut takes a few other parameters that I won't discuss in detail here; a quick sketch of them is below.
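A minimal sketch of jieba's segmentation modes (the sample sentence is my own; the outputs in the comments are typical results):
import jieba

sample = '我来到北京清华大学'
print('/'.join(jieba.cut(sample)))                # precise mode (the default)
print('/'.join(jieba.cut(sample, cut_all=True)))  # full mode: every possible word
print('/'.join(jieba.cut_for_search(sample)))     # search-engine mode: finer splits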
Then join the segmented words with the join method, i.e. separate the words with spaces.
Finally, call WordCloud's generate method on the joined string to build the word cloud; plt's imshow method then renders the image, with each word sized by its frequency. Turn off the axes and show the result.
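If you want to save the image rather than only display it, WordCloud also provides a to_file method (the output filename here is my own choice):
wordcloud.to_file('wordcloud.png')  # write the rendered cloud to a PNG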
From the output you can see that the most frequent words are 我们, 就是, 什么, 他们, 时候 and other words that carry little meaning; often we care more about the nouns that appear. jieba's part-of-speech tagging can handle this.
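Before re-plotting, a quick frequency count makes it easy to confirm which words dominate; a sketch using the standard library (text is the string read above):
from collections import Counter
import jieba

print(Counter(jieba.lcut(text)).most_common(10))  # the ten most frequent tokens
The full version with part-of-speech filtering: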
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import jieba
from PIL import Image
import numpy as np
import jieba.posseg as pseg
file = r'C:\Users\Administrator\Desktop\python笔记\meg1.txt'
background_file = r'C:\Users\Administrator\Desktop\python笔记\background.jpg'
font_path = r'C:\windows\Fonts\STZHONGS.TTF'
mask = np.array(Image.open(background_file))
with open(file, encoding='utf-8') as f:
    text = f.read()
words = pseg.cut(text)  # yields pairs with .word and .flag attributes
split = " "
for w in words:
    if w.flag == 'n' and w.word != '时候' and w.word != '公司':  # keep nouns, drop two dominant ones
        split = split + w.word + " "
wordcloud = WordCloud(font_path=font_path,  # needed for Chinese unless you replaced the default font
                      mask=mask,
                      background_color='white',
                      stopwords=STOPWORDS).generate(split)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
First, import the jieba.posseg package:
import jieba.posseg as pseg
When segmenting, use pseg.cut instead of jieba.cut, then iterate over the elements it yields and pick out the nouns.
For each element, w.flag is its part-of-speech tag and w.word is the word itself.
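A minimal sketch of what pseg.cut yields (the sample sentence is my own; the tags in the comment are typical output):
import jieba.posseg as pseg

for w in pseg.cut('我爱北京天安门'):
    print(w.word, w.flag)  # e.g. 我 r, 爱 v, 北京 ns, 天安门 ns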
For the full list of flag values, see this article: https://www.cnblogs.com/adienhsuan/p/5674033.html
After the first run, the words 公司 and 时候 still took up a large share, so the code above also removes those two words explicitly; the resulting image looks much better.
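Hard-coding each unwanted word in the if condition gets unwieldy as the list grows; an alternative sketch is to keep your own exclusion set (the two words here come from the example above):
noise = {'时候', '公司'}  # words to exclude; extend as needed

split = " ".join(w.word for w in pseg.cut(text) if w.flag == 'n' and w.word not in noise)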