Building a Hamlet word cloud with Python

Latest recommended article published 2023-11-26 02:30:00

遇见夏日的风即是缘

Tags: python, matplotlib, programming languages

Article link: https://blog.csdn.net/m0_66763187/article/details/127216384

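I. Brief introduction

This word cloud uses three Python modules: re, wordcloud, and matplotlib. re ships with the standard library; wordcloud and matplotlib need to be installed and set up separately. There may well be mistakes in what follows, so corrections are welcome.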

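II. How it works

1. Prepare the hamlet text file. It is opened with open(), and the re module is then used to strip punctuation before the text is split into words (this handling is specific to the hamlet source file).

2. Pass the processed words to the WordCloud class in wordcloud to generate the word cloud image.

3. Use matplotlib to display the image, or save it to a file.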

III. Code walkthrough

import re
import wordcloud
import matplotlib.pyplot as plt

1. These three imports bring in the modules needed to build the word cloud.

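# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r"D:\浏览器下载\hamlet.txt", "r", encoding='utf-8') as f:
    txt = f.read()
txt = txt.lower()
# Character class listing the punctuation to strip (ASCII and full-width)
expect = '[,.;!()?*，。；：:、！《》（）]'
txt_date = re.sub(expect, '', txt)
words = txt_date.split()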

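2. Some basic preparation happens here. open() and read() open the file and read its contents, lower() converts all uppercase letters to lowercase, re.sub() replaces the listed punctuation with an empty string, and the string method split() breaks the text into words on whitespace.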
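
As a small illustration of the cleaning step, the sketch below runs the same lower() / re.sub() / split() chain on a short inline sample instead of the hamlet file. The sample sentence, the constant PUNCTUATION, and the helper name tokenize are made up for this example.

```python
import re

# Same punctuation character class as in the walkthrough
PUNCTUATION = '[,.;!()?*，。；：:、！《》（）]'


def tokenize(text):
    """Lowercase the text, strip the listed punctuation, split on whitespace."""
    cleaned = re.sub(PUNCTUATION, '', text.lower())
    return cleaned.split()


sample = "To be, or not to be: that is the question."
words = tokenize(sample)
print(words)
```

Characters that are not in the class, such as apostrophes or quotation marks, stay attached to the words, so the class may need extending for other texts.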

counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1

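3. A for loop walks over the words produced by the split. The if/else skips single-character words; every other word is stored in the dictionary counts and its count is increased by 1. get() returns the value for the given key, or the default (here 0) if the key is not yet in the dictionary, so a word seen for the first time starts at 0 + 1.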
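
The same counting can also be written with collections.Counter from the standard library. This is an alternative sketch, not the code used in this post, and the word list is a made-up sample.

```python
from collections import Counter

# Made-up sample standing in for the words list from the previous step
words = ['to', 'be', 'or', 'not', 'to', 'be', 'a', 'question']

# Skip single-character words, as the loop above does
counts = Counter(word for word in words if len(word) != 1)

print(counts['to'])   # count for a word that occurs twice in the sample
print(counts['a'])    # skipped words are absent; Counter reports 0 for them
print(counts.most_common(2))
```

Counter is a dict subclass, so counts.items() and the sorting shown later work on it unchanged.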

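for ch in {'not','with','for','be','as','what','him','the', 'and','to','that','this','of', 'you', 'an','we','it', 'my','me', 'in','your','he','is','his','but'}:
    counts.pop(ch, None)  # pop() with a default does not fail if the word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

4. Some very common words (articles, prepositions, pronouns and the like) need to be removed, so a for loop deletes them from the dictionary. items() returns a view of the (word, count) pairs, which list() turns into a list. sort() with reverse=True orders the list from highest to lowest, and the lambda is an anonymous function that tells sort() to order by the value (the count) rather than the key.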
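
A minimal sketch of the removal and sorting step on a tiny hand-made dictionary. It uses dict.pop() with a default so that a stop word missing from the text does not raise KeyError, and sorted() in place of list() plus sort(). The counts and the name stop_words are illustrative only.

```python
counts = {'hamlet': 5, 'the': 9, 'lord': 3, 'and': 7, 'king': 3}
stop_words = {'the', 'and', 'of'}   # 'of' is deliberately not in counts

for word in stop_words:
    counts.pop(word, None)          # no KeyError when the word is absent

# Sort (word, count) pairs by count, highest first
items = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(items)
```

Python's sort is stable, so words with equal counts keep the order in which they were first inserted into the dictionary.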

list1 = []
for i in range(10):
    word, count = items[i]
    print('{0}:{1}'.format(word, count))
    list1.append(word)
T = ','.join(list1)
print(T)
# random_state takes an int seed (or None); a fixed seed makes the layout reproducible
wc = wordcloud.WordCloud(width=3000, height=2000, random_state=0, background_color='white')
wc.generate(T)
plt.imshow(wc)
plt.axis('off')
plt.show()

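5. A for loop prints the ten most frequent words with their counts, join() combines them into one comma-separated string, generate() from wordcloud builds the word cloud from that string, and imshow() and show() from matplotlib display it. Because each word appears in T only once, the counts themselves are not passed to generate().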
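
One possible variation, sketched below, passes the counts to WordCloud.generate_from_frequencies() so that more frequent words are drawn larger. This is not what the code in this post does. The helper names top_frequencies and build_cloud and the sample items list are hypothetical, and wordcloud is imported inside build_cloud so that the counting helper can be tried without the package installed.

```python
def top_frequencies(items, n=10):
    """Turn the first n (word, count) pairs of a sorted list into a dict."""
    return dict(items[:n])


def build_cloud(frequencies):
    """Build a WordCloud whose word sizes follow the given counts."""
    import wordcloud  # third-party package, imported only when needed
    wc = wordcloud.WordCloud(width=3000, height=2000, random_state=0,
                             background_color='white')
    return wc.generate_from_frequencies(frequencies)


# Made-up sample in the same shape as the sorted items list
items = [('hamlet', 5), ('lord', 3), ('king', 3), ('have', 2)]
freqs = top_frequencies(items, n=3)
print(freqs)
```

The object returned by build_cloud(freqs) can be passed to plt.imshow() in the same way as wc, or written to an image file with its to_file() method.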

Finally, the complete code:

import re
import wordcloud
import matplotlib.pyplot as plt
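# Word frequency count
# 1. Open the file (raw string keeps the backslashes in the Windows path literal)
with open(r"D:\浏览器下载\hamlet.txt", "r", encoding='utf-8') as f:
    txt = f.read()
# 2. Convert all uppercase letters to lowercase
txt = txt.lower()
# 3. Remove punctuation and other special characters
expect = '[,.;!()?*，。；：:、！《》（）]'
txt_date = re.sub(expect, '', txt)
# 4. Split the text into words on whitespace
words = txt_date.split()
# 5. Count how often each word occurs, skipping single-character words
counts = {}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
# 6. Remove common stop words (articles, prepositions, pronouns, ...)
list1 = []
for ch in {'not','with','for','be','as','what','him','the', 'and','to','that','this','of', 'you', 'an','we','it', 'my','me', 'in','your','he','is','his','but'}:
    counts.pop(ch, None)  # pop() with a default does not fail if the word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# 7. Print the ten most frequent words and collect them
for i in range(10):
    word, count = items[i]
    print('{0}:{1}'.format(word, count))
    list1.append(word)
T = ','.join(list1)
print(T)
# 8. Build and display the word cloud; an int random_state makes the layout reproducible
wc = wordcloud.WordCloud(width=3000, height=2000, random_state=0, background_color='white')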
wc.generate(T)
plt.imshow(wc)
plt.axis('off')
plt.show()