Python Word Clouds

Silence_Jy

Last modified: 2023-04-21 11:25:36

Tags: python, matplotlib, programming languages

First published: 2023-04-16 10:38:05

Article link: https://blog.csdn.net/weixin_60896526/article/details/130179098

Word clouds with wordcloud

1. Install the third-party libraries

The jieba, matplotlib, and wordcloud libraries.

2. Process

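1. Use the jieba library to segment and tidy the data, save the result as a txt file, and turn it into a single string of words separated by spaces.
2. Call wordcloud and related functions to draw the cloud.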
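Step 1 can be sketched roughly as follows. The helper name `save_segmented` and the file name `words.txt` are hypothetical. The tokenizer is passed in as a parameter so that the snippet runs without jieba; with jieba installed, one would presumably pass `jieba.lcut` instead of the default.

```python
import tempfile
from pathlib import Path


def save_segmented(text, path, tokenize=str.split):
    """Tokenize `text`, write the tokens to `path` separated by spaces,
    and return the space-separated string.

    `tokenize` is any callable that maps a string to a list of tokens.
    The default (str.split) only suits whitespace-delimited text; for
    Chinese, a segmenter such as jieba.lcut would be passed instead.
    """
    tokens = [tok.strip() for tok in tokenize(text)]
    string = " ".join(tok for tok in tokens if tok)  # drop empty tokens
    Path(path).write_text(string, encoding="utf-8")
    return string


# Demo with an English sentence and a temporary txt file.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = Path(tmp) / "words.txt"
    string = save_segmented("word clouds   show frequent words", txt_path)
    reloaded = txt_path.read_text(encoding="utf-8")

print(string)    # word clouds show frequent words
```

The returned string is the form that `generate(text)` expects in step 2.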

English tokenization:

For English, spaces separate the individual words.
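A minimal sketch of the English case, using only the standard library. The sample sentence is made up, and punctuation is not handled here.

```python
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

words = text.lower().split()   # split on runs of whitespace
counts = Counter(words)        # word -> number of occurrences

print(len(words))              # 11
print(counts.most_common(1))   # [('the', 3)]
```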

Chinese word segmentation:

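A single Chinese character and the word it belongs to sometimes mean completely different things, so segmenting Chinese is much harder than tokenizing English.
jieba relies on a Chinese dictionary: it estimates how likely characters are to belong together, groups the high-probability combinations into words, and outputs the resulting segmentation.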
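The idea of picking the most probable split from a dictionary can be illustrated with a toy example. This is only a sketch under stated assumptions, not jieba's code: the dictionary `FREQ` and its counts are invented, unknown single characters are given a count of 1, and the best split is found with a small dynamic program over word probabilities. As far as we know, the real library also handles out-of-dictionary words with a statistical model, which is left out here.

```python
import math

# Hypothetical toy dictionary: word -> frequency count (not jieba's data).
FREQ = {"我": 50, "不": 40, "喜欢": 30, "不喜欢": 1, "下雨天": 10,
        "下雨": 12, "天": 20, "刮风": 8, "刮": 3, "风": 9}


def segment(sentence, freq=FREQ):
    """Return the split whose product of word probabilities is highest."""
    total = sum(freq.values())
    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[i:],
    #            end index of the first word in that split)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Unknown single characters count as 1 so a path always exists.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print("/ ".join(segment("我不喜欢下雨天刮风")))
# 我/ 不/ 喜欢/ 下雨天/ 刮风

# Raising the count of one word changes which split wins.
boosted = {**FREQ, "不喜欢": 500}
print("/ ".join(segment("我不喜欢下雨天刮风", boosted)))
# 我/ 不喜欢/ 下雨天/ 刮风
```

In this toy model a word is split or kept purely because of the counts in the dictionary, which is why adjusting a word's frequency can change the result.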

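(1) The three segmentation modes of jieba:
Precise mode: splits the text precisely, with no redundant words
Full mode: scans out every word that can be found in the text, so redundant words appear
Search-engine mode: builds on precise mode and splits long words again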
(2) Commonly used library functions:
[Image: table of commonly used jieba functions]
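To make the difference between the modes concrete, here is a toy sketch in plain Python. It is loosely modelled on the descriptions above and is not jieba's implementation; the dictionary and the "precise" result fed into `search_mode` are made up for illustration.

```python
# Hypothetical toy dictionary (not jieba's data).
DICTIONARY = {"下雨", "下雨天", "雨天", "刮风"}


def full_mode(sentence, dictionary):
    """Emit every dictionary word found in the sentence (redundant on purpose)."""
    words, covered_until = [], 0
    for i in range(len(sentence)):
        found = [sentence[i:j] for j in range(i + 1, len(sentence) + 1)
                 if sentence[i:j] in dictionary]
        if found:
            words.extend(found)
            covered_until = max(covered_until, i + len(found[-1]))
        elif i >= covered_until:
            words.append(sentence[i])   # character not covered by any word
            covered_until = i + 1
    return words


def search_mode(precise_words, dictionary):
    """Start from a precise split and also emit shorter words inside long ones."""
    words = []
    for word in precise_words:
        for size in (2, 3):             # look for 2- and 3-character pieces
            if len(word) > size:
                for i in range(len(word) - size + 1):
                    piece = word[i:i + size]
                    if piece in dictionary:
                        words.append(piece)
        words.append(word)
    return words


print(full_mode("下雨天刮风", DICTIONARY))
# ['下雨', '下雨天', '雨天', '刮风']
print(search_mode(["下雨天", "刮风"], DICTIONARY))
# ['下雨', '雨天', '下雨天', '刮风']
```

With jieba itself, the three modes correspond to `jieba.cut(s)`, `jieba.cut(s, cut_all=True)`, and `jieba.cut_for_search(s)`; the `lcut` and `lcut_for_search` variants return lists instead of generators.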

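Sometimes jieba's normal segmentation splits a word that we want to keep together, and this changes the meaning of the sentence. In the following example, we want 不喜欢 ("dislike") to stay a single word instead of being split: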

import jieba
messages = jieba.cut("我不喜欢下雨天刮风")  # precise mode by default
print("/ ".join(messages))
Output:
我/ 不/ 喜欢/ 下雨天/ 刮风

In this case, suggest_freq(segment, tune=True) can adjust the frequency of a single word so that it can (or cannot) be segmented out:

import jieba
# Tune the frequency first so that "不喜欢" is kept as one word
jieba.suggest_freq('不喜欢', tune=True)
messages = jieba.cut("我不喜欢下雨天刮风")  # precise mode by default
print("/ ".join(messages))
Output:
我/ 不喜欢/ 下雨天/ 刮风

3. Common wordcloud parameters and methods

Parameters:

1. $font\_path : string$ : path to the font, given as the directory path plus the file name with its extension, e.g. $C:\backslash windows\backslash Font\backslash white.ttf$
2. $width : int (default=400)$ : width of the output canvas.
3. $height : int (default=200)$ : height of the output canvas.
4. $prefer\_horizontal : float (default=0.90)$ : how often words are laid out horizontally; the vertical share is the remainder (1 minus this value).
5. $scale : float (default=1)$ : scales the canvas proportionally; with $scale = 2$, both width and height are $2$ times the original.
6. $min\_font\_size : int (default=4)$ : smallest font size displayed.
7. $max\_words : int (default=200)$ : maximum number of words displayed.
8. $background\_color : (default='black')$ : background color.
9. $max\_font\_size : int (default=None)$ : largest font size displayed.
10. $mask : np.array, None$ : when left empty (None), the word cloud takes the default rectangular shape.
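The parameters above can be collected as keyword arguments, roughly like this. The helper names (`merged_options`, `output_size`, `make_cloud`) are hypothetical, and `output_size` assumes no mask is given, in which case the rendered image should be the canvas size multiplied by `scale`. The `WordCloud` import sits inside `make_cloud` so the rest of the snippet runs even without the library.

```python
# Keyword arguments mirroring the list above. Every value except
# font_path is the default quoted there.
options = {
    "font_path": None,          # assumption: point this at a font file that covers Chinese
    "width": 400,
    "height": 200,
    "prefer_horizontal": 0.90,
    "scale": 1,
    "min_font_size": 4,
    "max_words": 200,
    "background_color": "black",
    "max_font_size": None,
    "mask": None,               # None keeps the rectangular canvas
}


def merged_options(**overrides):
    """Return a copy of `options` with some values replaced."""
    return {**options, **overrides}


def output_size(opts):
    """Pixel size of the rendered image when no mask is given."""
    return (int(opts["width"] * opts["scale"]),
            int(opts["height"] * opts["scale"]))


def make_cloud(**overrides):
    """Build a WordCloud object; wordcloud is imported only when this is called."""
    from wordcloud import WordCloud  # third-party
    return WordCloud(**merged_options(**overrides))


print(output_size(merged_options(scale=2)))   # (800, 400)
```

For Chinese text, `font_path` usually has to be set explicitly, since a font without Chinese glyphs tends to render the words as empty boxes.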

Methods:

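1. $generate\_from\_text(text)$ : generates a word cloud from text.
2. $generate(text)$ : generates a word cloud from text (an alias of $generate\_from\_text$).
3. $generate\_from\_frequencies(frequencies[, ...])$ : generates a word cloud from word frequencies.
4. $to\_file(filename)$ : writes the image to a file.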
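A rough sketch of how the frequency route might be used. The helper names `word_frequencies` and `save_cloud`, the output file name, and the font path are hypothetical. Counting is done with the standard library, and `wordcloud` is imported only inside `save_cloud`, so the counting part runs on its own.

```python
from collections import Counter


def word_frequencies(string, stopwords=()):
    """Count the words of a space-separated string, skipping stopwords."""
    return Counter(w for w in string.split() if w not in stopwords)


def save_cloud(frequencies, filename, **wordcloud_kwargs):
    """Render `frequencies` (word -> count) and write the image to `filename`."""
    from wordcloud import WordCloud  # third-party; imported only when called
    cloud = WordCloud(**wordcloud_kwargs)
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(filename)
    return cloud


freqs = word_frequencies("我 不喜欢 下雨天 刮风 下雨天 我", stopwords={"我"})
print(freqs["下雨天"])   # 2

# With wordcloud installed and a Chinese-capable font (hypothetical path):
# save_cloud(freqs, "cloud.png", font_path="path/to/font.ttf")
```

To show the result on screen instead of saving it, the returned object can be passed to matplotlib, e.g. `plt.imshow(cloud)` followed by `plt.axis("off")` and `plt.show()`.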

def generate(self, text):
    """Generate wordcloud from text.
    The input "text" is expected to be a natural text. If you pass a sorted
    list of words, words will appear in your output twice. To remove this
    duplication, set ``collocations=False``.
    Alias to generate_from_text.
    Calls process_text and generate_from_frequencies.
    Returns
    -------
    self
    """
    return self.generate_from_text(text)

def generate_from_text(self, text):
    """Generate wordcloud from text.
    The input "text" is expected to be a natural text. If you pass a sorted
    list of words, words will appear in your output twice. To remove this
    duplication, set ``collocations=False``.
    Calls process_text and generate_from_frequencies.
    ..versionchanged:: 1.2.2
        Argument of generate_from_frequencies() is not return of
        process_text() any more.

    Returns
    -------
    self
    """
    words = self.process_text(text)
    self.generate_from_frequencies(words)
    return self
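Since generate_from_text is just process_text followed by generate_from_frequencies, the two steps can also be called separately, which leaves room to adjust the frequencies in between. Below is a sketch of that idea. `generate_filtered` and `FakeCloud` are hypothetical names; `FakeCloud` is only a stand-in so the snippet runs without the library, and its word counting is far simpler than what the real process_text does.

```python
def generate_filtered(cloud, text, min_count=2):
    """Do what generate_from_text does, but drop rare words in between.

    `cloud` is expected to offer process_text() and
    generate_from_frequencies(), as in the source shown above.
    """
    frequencies = cloud.process_text(text)
    kept = {word: n for word, n in frequencies.items() if n >= min_count}
    cloud.generate_from_frequencies(kept)
    return kept


class FakeCloud:
    """Hypothetical stand-in for WordCloud, used only for this demo."""

    def process_text(self, text):
        counts = {}
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    def generate_from_frequencies(self, frequencies):
        self.frequencies = dict(frequencies)
        return self


fake = FakeCloud()
print(generate_filtered(fake, "rain wind rain sun rain wind"))
# {'rain': 3, 'wind': 2}
```

With the real library, a `WordCloud` instance would be passed in place of `fake`.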