Using the WordCloud Library, with a Word Cloud Generation Example

曲措Oz

Last modified 2024-06-06 15:54:15


First published 2024-03-27 16:56:00

Article link: https://blog.csdn.net/2301_78959712/article/details/137073940


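This article introduces Python's WordCloud library: how to install it, its basic functions and configuration parameters, and how to combine it with the jieba library to process text and generate a word cloud. Douban movie reviews serve as the worked example.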


I. Introduction to the WordCloud Library and Installation

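A word cloud (also called a text cloud) is a visual representation of text data: words are arranged into a colored, cloud-like graphic that summarizes a large body of text. The importance of each word is shown by its font size or color.

WordCloud is a third-party Python library for generating word clouds. Install it from the terminal with pip: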

pip install wordcloud

II. Using the WordCloud Library

(1) Commonly used WordCloud functions

In the table below, w is a WordCloud object:

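Function	Meaning
wordcloud.WordCloud()	Creates a WordCloud object from the given parameters
w.generate(text)	Loads the text into w and generates the word cloud
w.fit_words(frequencies)	Generates the word cloud from a dictionary of word frequencies
w.process_text(text)	Splits a long text into words and removes stopwords (designed for English)
w.to_file(filename)	Saves the word cloud as an image file (.png or .jpg)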

(2) WordCloud object configuration parameters

The configuration parameters are listed below:

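Parameter	Description
font_path: string	Path to the font file (OTF or TTF) to use
width: int (default = 400)	Width of the canvas, 400 pixels by default
height: int (default = 200)	Height of the canvas, 200 pixels by default
prefer_horizontal: float (default = 0.90)	Ratio of attempts to place a word horizontally rather than vertically, 0.9 by default
mask: nd-array or None (default = None)	If not None, a binary mask that defines where words may be drawn; width and height are then ignored and the shape of the mask is used instead. Pure white (#FFFFFF) areas are not drawn on; all other areas are used for the word cloud
scale: float (default = 1)	Scaling factor applied to the canvas when drawing
min_font_size: int (default = 4)	Smallest font size to use
max_font_size: int or None (default = None)	Largest font size to use; if None, the height of the image is used
font_step: int (default = 1)	Step size for the font
max_words: number (default = 200)	Maximum number of words to display
stopwords: set of strings or None	Words to exclude; if None, the built-in STOPWORDS set is used
background_color: color value (default = "black")	Background color
mode: string (default = "RGB")	When mode is "RGBA" and background_color is None, the background is transparent
relative_scaling: float (default = 'auto')	How strongly font size depends on word frequency: 0 uses only the rank of each word, 1 makes a word that is twice as frequent twice as large; 'auto' means 0.5 unless repeat is True, in which case it is 0
color_func: callable (default = None)	Function that returns a color for each word; overrides colormap. If None, colors are taken from colormap
regexp: string or None (optional)	Regular expression used to split the input text into tokens
collocations: bool (default = True)	Whether to include collocations (bigrams) of two words
contour_width: float (default = 0)	Width of the mask contour line; the contour is drawn only if a mask is given and the value is greater than 0
contour_color: color value (default = "black")	Color of the mask contour line
colormap: string or matplotlib colormap (default = "viridis")	Colormap from which word colors are drawn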

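(3) Using the jieba library as a helper

jieba offers three main segmentation modes:

1. Accurate mode
Accurate mode splits the text along the most probable segmentation, so the result contains no redundant words.

words=jieba.cut(sentence)

2. Full mode
Full mode outputs every possible word found in the text, so the result may contain redundant words.

words=jieba.cut(sentence,cut_all=True)

3. Search-engine mode
Search-engine mode starts from the accurate-mode result and splits long words again, which suits search-engine indexing.

words=jieba.cut_for_search(sentence)

For word clouds, the first mode (accurate mode) is used most often.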

(4) Colormaps
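
The colormap parameter accepts the name of a matplotlib colormap; the default "viridis" and the "Blues" used in the example in section III are both matplotlib colormaps. As a small sketch, assuming matplotlib is installed (it is a dependency of wordcloud), the available names can be listed like this:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# plt.colormaps() returns the names of all registered colormaps.
names = plt.colormaps()
print(len(names), "colormaps available")

# Check that the two names mentioned in this article are registered.
for name in ("viridis", "Blues"):
    print(name, name in names)
```

Any of the listed names can be passed as colormap="..." when creating the WordCloud object.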

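III. Word Cloud Generation Example

Combining basic web scraping (covered in detail in other articles) with jieba, we can generate a word cloud from Douban movie reviews: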

import requests
from bs4 import BeautifulSoup
import jieba
from wordcloud import WordCloud
import numpy as np
from PIL import Image

# URL of the Douban movie comments page
url = "https://movie.douban.com/subject/10463953/comments?status=P"

# Request headers: the User-Agent string, stored as a dictionary key-value pair
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}

# Send the GET request, passing the dictionary through the headers parameter
response = requests.get(url, headers=headers)

# Read the page content from the .text attribute
html = response.text

# Parse html with BeautifulSoup using the lxml parser
soup = BeautifulSoup(html, "lxml")

# Find all nodes in soup with class="short"
content_all = soup.find_all(class_="short")

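# Words to exclude; str(content_all) below keeps the HTML markup,
# so tag and attribute names would otherwise show up as words
excludes = {'span', 'class', 'short', 'not'}

# Segment the comments with jieba and keep only words longer than one character
content = str(content_all)
words = jieba.cut(content)
text = " ".join(word for word in words if len(word) > 1)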

# Open the image file and convert it to an array to use as the mask
mask = np.array(Image.open("Apple.png"))

# Create the WordCloud object
wordCloud = WordCloud(
    background_color="white",
    repeat=False,
    max_words=100,
    max_font_size=300,
    colormap="Blues",
    font_path="Fonts/STXINGKA.TTF",
    mask=mask,
    stopwords=excludes,
)

# Load the text into the WordCloud object and generate the word cloud
wordCloud.generate(text)

# Save the word cloud as an image file
wordCloud.to_file("The Imitation Game.png")

# Print "success" when finished
print("success")