[Python] Word Clouds: A Complete Guide to the wordcloud Library

感谢地心引力

Last modified: 2023-01-20 13:22:06


First published: 2023-01-12 22:05:35

This is an original article by the author. Do not repost it without the author's permission.

Article link: https://blog.csdn.net/weixin_43764974/article/details/128663965


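1. Basic Usage

In the snippet below, self.Read_txt() returns the contents of my txt file, and mask_img is a mask image loaded as a NumPy array (see section 4).

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wd_0 = WordCloud(font_path='simhei.ttf',
                 # background_color='white',
                 colormap='autumn',
                 width=800, height=400,
                 collocations=True,
                 scale=4,
                 mask=mask_img).generate(self.Read_txt())
plt.imshow(wd_0, interpolation='bilinear')
plt.axis('off')
plt.show()

The code above is already enough to produce a clear, usable word cloud. To tune other details, see the parameter reference below.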

2. WordCloud Class Parameters

The constructor parameters control the font, size, colors, and other aspects of the word cloud image.

The full signature of the WordCloud class is:

WordCloud(font_path=None, width=400, height=200, margin=2, ranks_only=None, prefer_horizontal=0.9, mask=None, scale=1, color_func=None, max_words=200, min_font_size=4, stopwords=None, random_state=None, background_color='black', max_font_size=None, font_step=1, mode='RGB', relative_scaling='auto', regexp=None, collocations=True, colormap=None, normalize_plurals=True, contour_width=0, contour_color='black', repeat=False, include_numbers=False, min_word_length=0, collocation_threshold=30)

In most cases, setting a handful of parameters (font, size, colors, scale) is enough. This article goes through all of the WordCloud parameters.

2.1 Common Parameters

2.11 Font: font_path

 |  font_path : string
 |      Font path to the font that will be used (OTF or TTF).
 |      Defaults to DroidSansMono path on a Linux machine. If you are on
 |      another OS or don't have this font, you need to adjust this path.

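Example:

font_path='simhei.ttf'

Default: the DroidSansMono path described in the docstring above. That font does not cover Chinese characters, so for Chinese text (and on systems where the default font is unavailable) you need to set font_path yourself.

Otherwise the word cloud cannot render the words correctly and they show up as empty boxes, much like garbled text. Font files are usually named like name.ttf.
On Windows, you can browse all installed fonts under C:\Windows\Fonts, or open Control Panel > Appearance and Personalization > Fonts.

The value to pass is the font file name, not the display name such as 仿宋 or 楷体. Right-click the font and choose Properties to see the file name; for example, the file name of 华文彩云 is STCAIYUN.TTF.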

2.12 Canvas size: width, height

 |  width : int (default=400)
 |      Width of the canvas.
 |  
 |  height : int (default=200)
 |      Height of the canvas.

The default is 400x200, in pixels.

2.13 Scale: scale

 |  scale : float (default=1)
 |      Scaling between computation and drawing. For large word-cloud images,
 |      using scale instead of larger canvas size is significantly faster, but
 |      might lead to a coarser fit for the words.

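With a 400x200 canvas and scale=5, the output image is 2000x1000 pixels.

A good approach is to set a fairly small canvas and scale it up to the target size. Keep the scale factor reasonable, though: the layout is computed on the small canvas, so a very large factor gives a coarser fit.

Setting the canvas directly to 2000x1000 gives the same output size, but generation takes much longer than with scale=5.

The more pixels the output has, the sharper it looks, and low-frequency words drawn in small fonts become legible.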

2.14 Colormap: colormap

 |  colormap : string or matplotlib colormap, default="viridis"
 |      Matplotlib colormap to randomly draw colors from for each word.
 |      Ignored if "color_func" is specified.
 |  
 |      .. versionadded: 2.0

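colormap is the name of a predefined Matplotlib colormap (a colormap object also works).

The default is viridis.

Usually one of Matplotlib's built-in colormaps is all you need. Matplotlib ships many of them, such as viridis, jet, winter, summer, spring, autumn, cool, hot, and gray.

Usage:

colormap = 'spring'

Note: if color_func is set, colormap is ignored.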
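To see which colormap names are available on your machine, a small sketch like the following may help. It assumes Matplotlib is installed and only lists names; it does not draw anything.

```python
import matplotlib.pyplot as plt

# plt.colormaps() returns the names of all registered colormaps.
names = sorted(plt.colormaps())

# Names ending in "_r" are the reversed variants of the same colormaps.
base_names = [name for name in names if not name.endswith("_r")]
print(len(base_names), base_names[:10])
```

Any of these names can be passed as the colormap argument, for example colormap='autumn_r' for a reversed autumn palette.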

2.15 Color function: color_func

 |  color_func : callable, default=None
 |      Callable with parameters word, font_size, position, orientation,
 |      font_path, random_state that returns a PIL color for each word.
 |      Overwrites "colormap".
 |      See colormap for specifying a matplotlib colormap instead.
 |      To create a word cloud with a single color, use
 |      ``color_func=lambda *args, **kwargs: "white"``.
 |      The single color can also be specified using RGB code. For example
 |      ``color_func=lambda *args, **kwargs: (255,0,0)`` sets color to red.

color_func can be a predefined function or one you write yourself; it decides the color of each word.

In most cases the colormap parameter is enough.

color_func example:
Make every word white:

color_func=lambda *args, **kwargs: "white"
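Beyond a single fixed color, a custom function can pick the color from the arguments listed in the docstring. The sketch below is one hypothetical rule (the function name and the 40-pixel threshold are my own choices, not part of the library): large words are red, the rest gray.

```python
def size_based_color(word, font_size, position, orientation,
                     random_state=None, **kwargs):
    """Return an RGB tuple chosen from the word's font size."""
    # Hypothetical rule: words drawn at 40 px or larger are red, others gray.
    if font_size >= 40:
        return (255, 0, 0)
    return (128, 128, 128)


# Calling it by hand to show what it returns:
big = size_based_color("python", font_size=80, position=(0, 0), orientation=None)
small = size_based_color("the", font_size=12, position=(0, 0), orientation=None)
print(big, small)
```

It would be used as WordCloud(color_func=size_based_color). The **kwargs catch-all absorbs arguments the rule does not need, such as font_path.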

2.16 Collocations (bigrams): collocations

 |  collocations : bool, default=True
 |      Whether to include collocations (bigrams) of two words. Ignored if using
 |      generate_from_frequencies.

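collocations controls whether two-word phrases (bigrams) are counted in addition to single words. When it is True, wordcloud also looks at pairs of adjacent words and may show a pair that occurs together often as one entry.

Setting it to False is recommended.

Comparison:

1. collocations=True
[Image: word cloud generated with collocations=True]

2. collocations=False
[Image: word cloud generated with collocations=False]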

2.17 Mask: mask

 |  mask : nd-array or None (default=None)
 |      If not None, gives a binary mask on where to draw words. If mask is not
 |      None, width and height will be ignored and the shape of mask will be
 |      used instead. All white (#FF or #FFFFFF) entries will be considered
 |      "masked out" while other entries will be free to draw on. [This
 |      changed in the most recent version!]

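mask is commonly used to give the word cloud a shape. When a mask is set, the size of the mask image determines the size of the word cloud, and width and height are ignored.

The mask value is the image's pixel data as a matrix, i.e. a NumPy ndarray. Pure white pixels are masked out; words are drawn everywhere else.

It is often combined with the contour parameters in 2.18.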
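A mask does not have to come from an image file. As a minimal sketch, assuming only NumPy, the array below describes a circle: white (255) outside the circle is masked out, 0 inside is free to draw on. The size and radius are arbitrary.

```python
import numpy as np

size = 300
y, x = np.ogrid[:size, :size]
center, radius = size // 2, 130

# True outside the circle, False inside.
outside = (x - center) ** 2 + (y - center) ** 2 > radius ** 2

# 255 (white) = masked out, 0 = free to draw on.
mask = (255 * outside).astype(np.uint8)
print(mask.shape, mask[0, 0], mask[center, center])
```

Passing mask=mask to WordCloud would then confine the words to the circle and produce a 300x300 image.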

2.18 Contour width and color: contour_width, contour_color

 |  contour_width: float (default=0)
 |      If mask is not None and contour_width > 0, draw the mask contour.
 |  
 |  contour_color: color value (default="black")
 |      Mask contour color.

The width and color of the outline drawn around the mask shape.
For example:

mask=mask_img,
contour_width=10.0,
contour_color='blue',

[Image: masked word cloud with a blue contour]

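2.2~2.3 Less Common Parameters

2.21 Spacing around words: margin

The default is 2 (pixels).

margin is the spacing left around each word when it is placed, rather than a border around the whole image (the wordcloud command-line tool describes the option as "spacing to leave around words").

With an 800x400 canvas and margin=20, the image is still 800x400; the words are simply laid out further apart, so fewer of them fit.

The idea is the same as margin in GUI-related frameworks such as Android, WeChat mini programs, front-end CSS, and tkinter.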

2.22 Share of horizontal words: prefer_horizontal

 |  prefer_horizontal : float (default=0.90)
 |      The ratio of times to try horizontal fitting as opposed to vertical.
 |      If prefer_horizontal < 1, the algorithm will try rotating the word
 |      if it doesn't fit. (There is currently no built-in way to get only
 |      vertical words.)

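How often words are laid out horizontally. With the default of 0.9, about 90% of placement attempts start with the horizontal orientation and the rest start vertical; set it to 1 to get horizontal words only.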

2.23 Maximum number of words: max_words

 |  max_words : number (default=200)
 |      The maximum number of words.

This parameter is sometimes adjusted to fit the use case. By default, at most 200 words are shown.

2.24 Minimum and maximum font size: min_font_size, max_font_size

 |  min_font_size : int (default=4)
 |      Smallest font size to use. Will stop when there is no more room in this
 |      size.

 |  max_font_size : int or None (default=None)
 |      Maximum font size for the largest word. If None, height of the image is
 |      used.

Usually neither needs to be set; at most, adjust the minimum font size.

2.25 Font step: font_step

 |  font_step : int (default=1)
 |      Step size for the font. font_step > 1 might speed up computation but
 |      give a worse fit.

The default is fine.

2.26 Stopwords: stopwords

 |  stopwords : set of strings or None
 |      The words that will be eliminated. If None, the build-in STOPWORDS
 |      list will be used. Ignored if using generate_from_frequencies.

The words to exclude. If not set, the built-in STOPWORDS list is used; that list contains English words only.

2.27 Background color: background_color

 |  background_color : color value (default="black")
 |      Background color for the word cloud image.

The background color of the image, given either as a hex code or as a color name.

background_color='#450073',
# or
background_color='black',

2.28 Color mode: mode

 |  mode : string (default="RGB")
 |      Transparent background will be generated when mode is "RGBA" and
 |      background_color is None.

The default is RGB. If mode is set to RGBA and background_color is set to None, the background is transparent.
That is:

mode='RGBA',
background_color=None,

2.29 Repeat words when there are few: repeat

 |  repeat : bool, default=False
 |      Whether to repeat words and phrases until max_words or min_font_size
 |      is reached.

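For example, the cloud shows up to 200 words by default, but my text has only 50 distinct words. repeat decides whether those words are repeated until the count reaches 200 (or until min_font_size is reached). It is off by default.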

2.30 Minimum word length: min_word_length

 |  min_word_length : int, default=0
 |      Minimum number of letters a word must have to be included.

2.31 Include numbers: include_numbers

 |  include_numbers : bool, default=False
 |      Whether to include numbers as phrases or not.

2.32 Regular expression: regexp

 |  regexp : string or None (optional)
 |      Regular expression to split the input text into tokens in process_text.
 |      If None is specified, ``r"\w[\w']+"`` is used. Ignored if using
 |      generate_from_frequencies.

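regexp is the regular expression that process_text uses to split the input text into tokens (words). Only substrings that match the pattern become words, so a custom pattern lets you keep just the tokens you care about, for example words that start with a given letter or contain a specific character sequence, and leave out everything else. If it is not set, the default r"\w[\w']+" is used, which matches tokens of at least two characters.

This parameter is useful in some specific applications.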
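Since the pattern is an ordinary Python regular expression, its effect can be previewed with the re module alone. In this sketch the default pattern is the one quoted in the docstring; the hashtag pattern and the sample text are hypothetical.

```python
import re

text = "I don't like C++ a lot #python #wordcloud"

default_pattern = r"\w[\w']+"   # the default quoted in the docstring
hashtag_pattern = r"#\w+"       # hypothetical custom pattern: hashtags only

default_tokens = re.findall(default_pattern, text)
hashtag_tokens = re.findall(hashtag_pattern, text)

print(default_tokens)
print(hashtag_tokens)
```

The custom pattern would be passed as WordCloud(regexp=hashtag_pattern). Note that single-character tokens such as "I" and "a" never match the default pattern.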

2.33 Collocation score threshold: collocation_threshold

 |  collocation_threshold: int, default=30
 |      Bigrams must have a Dunning likelihood collocation score greater than this
 |      parameter to be counted as bigrams. Default of 30 is arbitrary.

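A collocation is a group of words that frequently occur together in a text. collocation_threshold is not a raw count: as the docstring says, a pair of words is counted as a bigram only if its Dunning likelihood collocation score is greater than this value. Pairs that score lower are not treated as collocations, and their words are counted individually. Raising the threshold keeps only the strongest collocations. The parameter only matters when collocations=True.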

2.34 Merge plurals into singulars: normalize_plurals

 |  normalize_plurals : bool, default=True
 |      Whether to remove trailing 's' from words. If True and a word
 |      appears with and without a trailing 's', the one with trailing 's'
 |      is removed and its counts are added to the version without
 |      trailing 's' -- unless the word ends with 'ss'. Ignored if using
 |      generate_from_frequencies.

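normalize_plurals merges simple English plurals into their singular form when counting words. When it is True and a word appears both with and without a trailing 's', the count of the 's' form is added to the form without it. For example, if both "dogs" and "dog" appear in the text, they are counted together as "dog". A plural whose singular never appears in the text is left unchanged, and so are words ending in "ss". This reduces near-duplicate entries and makes the counts more meaningful.

It is enabled by default.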

2.35 Relative word size: relative_scaling

 |  relative_scaling : float (default='auto')
 |      Importance of relative word frequencies for font-size.  With
 |      relative_scaling=0, only word-ranks are considered.  With
 |      relative_scaling=1, a word that is twice as frequent will have twice
 |      the size.  If you want to consider the word frequencies and not only
 |      their rank, relative_scaling around .5 often looks good.
 |      If 'auto' it will be set to 0.5 unless repeat is true, in which
 |      case it will be set to 0.

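relative_scaling controls how strongly font size depends on word frequency rather than only on rank. With relative_scaling=0, only the rank order of the words matters; with relative_scaling=1, a word that is twice as frequent is drawn twice as large. Values in between blend the two, and around 0.5 often looks good. The default 'auto' means 0.5, or 0 when repeat is True. Tuning it helps the cloud reflect the relative importance of the words.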
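To make the blending concrete, here is a rough sketch of how the starting font size for the next word could be derived from the previous one. It is based on my reading of the library source, so treat it as an approximation; the function name is hypothetical and not part of the wordcloud API.

```python
def next_font_size(font_size, freq, last_freq, relative_scaling):
    """Starting font size for the next word, given the previous word's size."""
    if relative_scaling == 0:
        # Rank only: the size changes only when a word does not fit.
        return font_size
    ratio = freq / float(last_freq)
    blended = relative_scaling * ratio + (1 - relative_scaling)
    return int(round(blended * font_size))


# Previous word drawn at 100 px; the next word is half as frequent.
full = next_font_size(100, 0.5, 1.0, relative_scaling=1)
half = next_font_size(100, 0.5, 1.0, relative_scaling=0.5)
rank_only = next_font_size(100, 0.5, 1.0, relative_scaling=0)
print(full, half, rank_only)
```

In the real layout the size is reduced further whenever a word cannot be placed, so these values are upper bounds for the next word.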

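2.36 ranks_only (deprecated)

ranks_only is a deprecated parameter. As far as I can tell from the library source, it has no effect: passing a value only triggers a deprecation warning that points to relative_scaling. Leave it at None, and use relative_scaling (2.35) and max_words (2.23) to control how the words are sized and how many are shown.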

2.37 Random seed: random_state

 |      random_state : RandomState, int, or None, default=None
 |          If not None, a fixed random state is used. If an int is given, this
 |          is used as seed for a random.Random state.

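random_state sets the seed of the random number generator used while building the word cloud. Fixing it makes the result reproducible, so running the same code several times gives the same cloud. That is very useful when comparing the effect of different parameters. If random_state is not set, the cloud differs from run to run.

For example: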

from wordcloud import WordCloud

text = "This is a sample text for generating a word cloud"
wc = WordCloud(random_state=42).generate(text)

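3. Common Methods

3.1 Generating the Word Cloud

The generate() method is usually all you need. There are a few other ways to build a word cloud as well.

3.11 Generate from a frequency dict: fit_words()

fit_words() is a method of the WordCloud class that builds the word cloud from a given frequency dict, where keys are words and values are their frequencies.

Example: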

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    frequencies = {'word1': 10, 'word2': 20, 'word3': 5}
    wordcloud = WordCloud(font_path='simhei.ttf').fit_words(frequencies)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()

[Image: word cloud generated from the frequency dict]

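3.12 The usual way: generate()

The most common way to generate a word cloud. generate(text) is an alias for generate_from_text(text).

3.13 Generate from a frequency dict: generate_from_frequencies()

It builds the word cloud from a given frequency dict, where keys are words and values are their frequencies. fit_words() is simply an alias that calls generate_from_frequencies(), so the two behave identically. Because the frequencies are supplied directly, the text-processing parameters (stopwords, collocations, normalize_plurals, regexp) are ignored.

For example: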

frequencies = {'word1': 10, 'word2': 20, 'word3': 5}
from wordcloud import WordCloud
wordcloud = WordCloud().generate_from_frequencies(frequencies)

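3.14 Generate from text: generate_from_text()

It builds the word cloud from the given text. The text must be a string (to use a file, read its contents first). The method splits the text into words, counts their frequencies, and then builds the cloud from those frequencies.

For example:

from wordcloud import WordCloud

with open('text.txt', encoding='utf-8') as f:
    text = f.read()

wordcloud = WordCloud().generate_from_text(text)

Or pass a string directly:

wordcloud = WordCloud().generate_from_text("This is a sample text for generating a wordcloud")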

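3.15 Word counting: process_text()

Splits a text into words and returns a dict of word counts. The text must be a string; stopwords are removed along the way.

For example:

from wordcloud import WordCloud

frequencies = WordCloud().process_text("This is a sample text for generating a wordcloud")
print(frequencies)

This prints:

{'sample': 1, 'text': 1, 'generating': 1, 'wordcloud': 1}

Stopwords such as This and a are filtered out automatically.

This method is mainly useful for counting how often each word occurs.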

3.16 Recoloring words: recolor()

  recolor(self, random_state=None, color_func=None, colormap=None)

Rarely used.

3.2 Saving and Exporting

3.21 Convert to an array: to_array

 |  to_array(self)
 |      Convert to numpy array.
 |      
 |      Returns
 |      -------
 |      image : nd-array size (width, height, 3)
 |          Word cloud image as numpy matrix.

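Converts the WordCloud object to a NumPy array that contains the pixels of the rendered image. It does not hold per-word sizes and positions.

For example:

wordcloud_array = wordcloud.to_array()

wordcloud_array holds the RGB value of every pixel. In practice its shape is (height, width, 3), even though the docstring above says (width, height, 3).

You can check the dimensions with the array's shape attribute.

The array can be displayed with matplotlib.pyplot.imshow().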

3.22 Save to a file: to_file

 |  to_file(self, filename)
 |      Export to image file.
 |      
 |      Parameters
 |      ----------
 |      filename : string
 |          Location to write to.
 |      
 |      Returns
 |      -------
 |      self

Saves the WordCloud image to a file. It takes a file name, and the image format follows the file extension, so PNG is the usual choice but other formats such as JPEG or BMP work too.

For example:

wd.to_file('2.PNG')

3.23 Convert to a PIL image: to_image

 to_image(self)

Converts the WordCloud object to a PIL image. It returns a PIL image object instead of writing a file.

For example:

image = wd.to_image()
image.show()

3.24 Convert to SVG: to_svg

 |  to_svg(self, embed_font=False, optimize_embedded_font=True, embed_image=False)
 |      Export to SVG.

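Converts the WordCloud object to SVG (Scalable Vector Graphics). SVG is a vector format that browsers can display and that scales without loss of quality.
For example:

    svg = wd.to_svg()
    with open("wordcloud.svg", "wb") as f:
        f.write(svg.encode())

This writes an SVG image; double-click it to open it in a browser:
[Image: the SVG word cloud opened in a browser]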

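4. Background (Mask) Images for Word Clouds

The image should have a white background and a clearly outlined foreground shape; that gives the best result. See the sample images under img/ in the GitHub repository linked at the end of this article. The converted_img/ directory there holds images already converted to a white background.

4.1 Converting the Image Background to White

The PIL program below pastes the image onto a white canvas, which turns a transparent background white. Images in other modes are converted to RGBA first.

There are many other ways to do this; anything that ends up with a white background works.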

from PIL import Image


class Img_Convert:
    def __init__(self, img_name):
        self.name = img_name

    def img_convert(self):
        img = Image.open(f'img/{self.name}')
        # Convert to RGBA so the alpha channel can serve as the paste mask
        if img.mode != 'RGBA':
            img = img.convert('RGBA')
        width = img.width
        height = img.height
        # Start from an all-white canvas and paste the image on top of it
        img_new = Image.new('RGB', size=(width, height), color=(255, 255, 255))
        img_new.paste(img, (0, 0), mask=img)
        # img_new.show()
        img_new.save(f'img/converted_img/{self.name}', 'png')


if __name__ == '__main__':
    # Put the source image name here (this program expects images under img/)
    img_obj = Img_Convert('butterfly.png')
    # The new white-background image is written to img/converted_img/
    img_obj.img_convert()

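4.2 Example: Setting the Word Cloud Mask

Open the white-background image with PIL's Image.open() and convert it to an array:

pic = np.array(Image.open("img/converted_img/1.png"))

Then pass it as the mask argument of WordCloud:

wd = WordCloud(... mask = pic ...).generate(text)

You can also set the width and color of the shape's contour. A complete example: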

    def wd_0(self):
        pic = np.array(Image.open("img/converted_img/1.png"))
        wd_0 = WordCloud(font_path='simhei.ttf',
                         background_color='black',
                         colormap='spring',
                         width=800, height=400,
                         collocations=False,
                         scale=3,
                         # min_font_size=1,
                         mask=pic,
                         contour_width=10.0,
                         contour_color='red',
                         ).generate(self.Read_txt())
        return wd_0

Sample output:

[Image: masked word cloud with a red contour]

5. Complete Demo

The txt file and other resources are on GitHub.

Addr: https://github.com/CQUPTLei/chatGPT_based

# -*- coding: utf-8 -*-
# @TIME :     2023-1-11 3:06 PM
# @Author :   CQUPTLei
# @File :     wordcloud_resolve.py
# @Software : PyCharm
# @Abstract : wordcloud parameter reference and examples

import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
# import cv2

class WD(object):
    # The argument is the name (without extension) of the txt file to build the word cloud from
    def __init__(self, txt_name):
        self.txt = txt_name

    # Read the txt file; it has already been segmented into words and had stopwords removed
    def Read_txt(self):
        with open(f'{self.txt}.txt', 'r', encoding='UTF-8') as f:
            dm = f.read()
            return dm

    # Build the word cloud
    def wd_0(self):
        pic = np.array(Image.open("img/converted_img/flower.png"))
        wd_0 = WordCloud(font_path='simhei.ttf',
                         background_color='black',
                         colormap='spring',
                         width=800, height=400,
                         collocations=False,
                         scale=3,
                         # min_font_size=1,
                         mask=pic,
                         contour_width=10.0,
                         contour_color='white',
                         ).generate(self.Read_txt())
        return wd_0

    # Display the word cloud
    @staticmethod
    def wd_show(wd):
        plt.imshow(wd, interpolation='bilinear')
        plt.axis("off")
        plt.show()


if __name__ == '__main__':
    dm_obj = WD('anhao_dm')
    # Pick one example and generate the word cloud
    wd = dm_obj.wd_0()
    wd.to_file('img/wd_output_img/out.PNG')
    dm_obj.wd_show(wd)