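I just covered word clouds in my Python class, so here are my notes on that part.
Topics: reading files, the jieba library, the wordcloud library.
The goal is to combine word segmentation with the word cloud feature to generate a word cloud of Romance of the Three Kingdoms (三国演义) in the shape of a picture of Guan Yu (关羽).
The code is as follows: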
import jieba
import wordcloud
from PIL import Image
import numpy as np
import imageio
import os
import matplotlib.pyplot as plt
import re
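f = open("./sanguoyanyi.txt", "r", encoding="utf-8")
This line opens the txt file of Romance of the Three Kingdoms. The encoding="utf-8" argument tells Python how to decode the file, which avoids garbled text and decoding errors with Chinese characters.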
t = f.read()
f.close()
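A `with` block is a common alternative to the open/read/close sequence above, because it closes the file even if reading raises an error. Here is a minimal sketch. The helper name `read_text` is hypothetical, and the temporary sample file is only there so the snippet runs on its own:

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the whole file as one string; the file is closed automatically."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Stand-in for sanguoyanyi.txt so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("话说天下大势,分久必合,合久必分。")
    sample_text = read_text(sample_path)
```

In the real script this would be something like `t = read_text("./sanguoyanyi.txt")`.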
# Set the image mask
mask = imageio.imread("./关羽.jpg")
# Add any other words you want to exclude to the excludes set below
excludes = {"将军", "却说", "二人", "不可", "荆州", "不能", "如此"}
words = jieba.lcut(t)
counts = {}
# Extend this chain of branches to map more alternate names onto one name
for word in words:
    if len(word) == 1:
        # Skip single-character tokens
        continue
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孔明" or word == "孔明曰":
        rword = "诸葛亮"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "都督":
        rword = "周瑜"
    elif word == "翼德":
        rword = "张飞"
    elif word == "孟德":
        rword = "曹操"
    elif word == "后主":
        rword = "刘禅"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
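# Remove the noise words
for w in excludes:
    counts.pop(w, None)  # unlike del, pop with a default does not raise KeyError if the word never appeared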
items = list(counts.items())
items.sort(key=lambda item: item[1],reverse = True)
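The counting steps above (skip single characters, merge alternate names, drop noise words, sort by frequency) can also be written with a lookup table and `collections.Counter`. This is only a sketch of one alternative. `ALIASES` and `count_words` are hypothetical names, and the short hand-written token list stands in for the output of `jieba.lcut`:

```python
from collections import Counter

# Hypothetical alias table: each alternate name maps to one canonical name.
ALIASES = {
    "玄德": "刘备", "玄德曰": "刘备",
    "孔明": "诸葛亮", "孔明曰": "诸葛亮",
    "关公": "关羽", "云长": "关羽",
    "都督": "周瑜",
    "翼德": "张飞",
    "孟德": "曹操",
    "后主": "刘禅",
}


def count_words(tokens, aliases, excludes):
    """Count tokens after merging aliases and dropping excluded words."""
    counts = Counter()
    for word in tokens:
        if len(word) == 1:
            continue  # skip single-character tokens
        counts[aliases.get(word, word)] += 1
    for word in excludes:
        counts.pop(word, None)  # no error if the word never appeared
    return counts


# Hand-made tokens standing in for jieba.lcut(t).
sample_tokens = ["玄德", "曰", "孔明", "关公", "云长", "将军", "玄德曰", "刘备"]
sample_counts = count_words(sample_tokens, ALIASES, {"将军", "却说"})
top_items = sample_counts.most_common(2)  # sorted by count, highest first
```

Adding a new alias then only means adding one entry to the table instead of another `elif` branch.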
# Build the text passed to the word cloud function
txt = "".join(t)
# Call wordcloud to generate the word cloud
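w = wordcloud.WordCloud(
    font_path="Hiragino Sans GB.ttc",
    width=500, height=400,
    mask=mask,
    max_words=200,
    max_font_size=100,
    background_color="white",
    font_step=3,
    color_func=wordcloud.ImageColorGenerator(mask),
    prefer_horizontal=0.9)
# Note: generate() receives the raw text t, so the counts dictionary built above is not used here.
# WordCloud.generate_from_frequencies(counts) would draw the cloud from those counts instead.
w = w.generate(t)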
plt.imshow(w)  # show the word cloud
plt.axis("off")  # hide the axes
plt.show()  # display the figure
outputFileFolder = "output"
if not os.path.exists(outputFileFolder):
    os.mkdir(outputFileFolder)
w.to_file(outputFileFolder + "/SanGuoWordCloudV1.png")
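The check-then-create step for the output folder can be collapsed into a single `os.makedirs(..., exist_ok=True)` call. A small sketch, where `build_output_path` is a hypothetical helper and a temporary directory is used so the demo does not touch the real project folder:

```python
import os
import tempfile


def build_output_path(folder, filename):
    """Create folder if needed and return the full path for filename."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, filename)


# A temporary directory keeps the demo away from the real "output" folder.
with tempfile.TemporaryDirectory() as base_dir:
    out_folder = os.path.join(base_dir, "output")
    first_path = build_output_path(out_folder, "SanGuoWordCloudV1.png")
    second_path = build_output_path(out_folder, "SanGuoWordCloudV1.png")  # safe to repeat
    folder_created = os.path.isdir(out_folder)
```

In the script above, the returned path could then be handed to `w.to_file(...)`.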
Result:
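Note: I'm on macOS. In this part of the code:
# Call wordcloud to generate the word cloud
w = wordcloud.WordCloud(
    font_path="Hiragino Sans GB.ttc",
the font has to match your operating system, because different systems ship different fonts. I only figured this out after checking a lot of references. Set font_path to a font that exists on your own machine and can display Chinese characters.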
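One way to handle the per-system font difference is to pick the font from a small table keyed by `platform.system()`. This is only a sketch. `FONT_CANDIDATES` and `pick_font_path` are hypothetical names, and the Windows and Linux entries are assumptions about typical installs, so check which fonts your own machine actually has:

```python
import platform

# Assumed font locations per system; examples only, not guaranteed to exist.
FONT_CANDIDATES = {
    "Darwin": "Hiragino Sans GB.ttc",  # macOS, the font used in this post
    "Windows": "msyh.ttc",  # Microsoft YaHei (assumption)
    "Linux": "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",  # assumption
}


def pick_font_path(system_name=None):
    """Return a candidate font for the given system, or None if unknown."""
    if system_name is None:
        system_name = platform.system()  # e.g. "Darwin", "Windows", "Linux"
    return FONT_CANDIDATES.get(system_name)


font_path = pick_font_path()
```

The result could then be passed as `font_path=font_path` when creating the `WordCloud`. If it is `None`, you would need to supply a path by hand.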
I'm still a Python beginner, so thanks for bearing with me, everyone 🔥