Python Programming: [Week 6, Composite Data Types] 6.6 Example 10: Text Word Frequency Statistics

Article link: https://blog.csdn.net/qq_36045093/article/details/104505282

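1. Problem Analysis

Text word frequency statistics

- Requirement: given an article, which words appear in it, and which words appear most often?
- How should we go about it?

A point to consider: how does processing English text differ from processing Chinese text?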
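
To make the difference concrete, here is a minimal sketch. The two sample strings are illustrative assumptions, not taken from the course files. English words are separated by spaces, so `split()` already yields word-like tokens (with punctuation still attached), whereas Chinese text has no spaces between words and comes back as a single token, which is why a segmentation library is needed for it.

```python
# Illustrative sample strings (assumptions, not from hamlet.txt or threekingdoms.txt)
english = "To be, or not to be"
chinese = "滚滚长江东逝水"

# split() with no arguments splits on runs of whitespace
english_words = english.split()
chinese_words = chinese.split()

print(english_words)  # separate tokens; note that "be," still carries its comma
print(chinese_words)  # the whole sentence stays as one token
```

This also shows why the English program has to strip punctuation before splitting.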

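Texts and download links

- English text: Hamlet, for word frequency analysis
https://python123.io/resources/pye/hamlet.txt
- Chinese text: Romance of the Three Kingdoms (《三国演义》), for character analysis
https://python123.io/resources/pye/threekingdoms.txt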

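2. Example Walkthrough: "Hamlet English Word Frequency Count"

The program should:
- denoise and normalize the text
- use a dictionary to represent word frequencies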

#CalHamletV1.py

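def getText():  # read the text and normalize it
    path = r"C:\Users\PC\AppData\Local\Programs\Python\Python37\hamlet.txt"
    with open(path, "r") as f:  # the with block closes the file automatically
        txt = f.read()  # read the whole text
    txt = txt.lower()  # convert all text to lowercase

    # replace every special character with a space
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")
    return txt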

hamletTxt = getText()

words = hamletTxt.split()
# split() is a string method: it splits the text on whitespace and returns the pieces as a list

counts = {}
# dictionary that stores each word together with its frequency

for word in words:
    # get(word, 0) returns the word's current count if it is already in the dictionary, otherwise 0
    counts[word] = counts.get(word, 0) + 1
    
items = list(counts.items())  # convert the dictionary into a list of (word, count) pairs

# sort() orders the list by the second element of each (word, count) pair; reverse=True sorts from largest to smallest
items.sort(key=lambda x: x[1], reverse=True)

for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output (screenshot not reproduced here):
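
The counting step can be tried without the data file. The sketch below runs the same normalize, split, and count pipeline on a short inline sentence (an illustrative assumption, not the contents of hamlet.txt), and compares the `dict.get` loop with `collections.Counter` from the standard library. `Counter` is an alternative technique, not the one used in the course program, and `normalize` is a hypothetical helper name introduced here.

```python
from collections import Counter

# Same special characters as in the program above
PUNCT = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def normalize(text):
    """Lowercase the text and replace each special character with a space."""
    table = str.maketrans(PUNCT, " " * len(PUNCT))
    return text.lower().translate(table)

# Illustrative sample sentence (assumption)
sample = "To be, or not to be: that is the question."
words = normalize(sample).split()

# Approach 1: the dict.get idiom used in the program above
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Approach 2: Counter does the same bookkeeping internally
counter = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
top = counter.most_common(2)
for word, count in top:
    print("{0:<10}{1:>5}".format(word, count))
```

Both approaches produce the same counts; `Counter` simply replaces the explicit loop and the manual sort.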

3. Example Walkthrough: "Character Appearance Count in Romance of the Three Kingdoms"

#CalThreeKingdomsV1.py
import jieba
txt = open(r"C:\Users\PC\AppData\Local\Programs\Python\Python37\threekingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # segment the text with jieba; the result is a list
counts = {}  # dictionary that stores the counts
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output (screenshot not reproduced here):

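Analysis: the output above shows that plain word segmentation is not enough, because some tokens are misidentified as names. Word frequency alone is therefore insufficient; the counting has to take the actual problem into account in order to tally characters.

Hence the upgraded program: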

#CalThreeKingdomsV2.py
import jieba
txt = open(r"C:\Users\PC\AppData\Local\Programs\Python\Python37\threekingdoms.txt", "r", encoding="utf-8").read()
excludes = {"将军","却说","荆州","二人","不可","不能","如此"}  # words misidentified as character names, to be removed
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":  # merge different names of the same character
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for word in excludes:
    counts.pop(word, None)  # pop() avoids a KeyError if the word never appeared in the text
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output (screenshot not reproduced here):
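
As the number of aliases grows, the `elif` chain becomes long. One possible variation, sketched below under stated assumptions, keeps the aliases in a dictionary and looks each token up with `get`. The token list is a hand-written, hypothetical stand-in for the output of `jieba.lcut(txt)`, and the `aliases` table is a name introduced here for illustration, so the example runs without jieba or the novel's text.

```python
# Hypothetical token list standing in for jieba.lcut(txt)
words = ["孔明", "曰", "诸葛亮", "玄德", "玄德曰", "将军",
         "关公", "云长", "曹操", "丞相", "却说"]

# Alias table: maps each alternative name to one canonical name
aliases = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}

# Tokens that are not character names
excludes = {"将军", "却说", "荆州"}

counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    name = aliases.get(word, word)  # fall back to the token itself
    counts[name] = counts.get(name, 0) + 1

for word in excludes:
    # "荆州" never occurs in this token list; pop() with a default does not raise
    counts.pop(word, None)

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for word, count in items:
    print("{0:<10}{1:>5}".format(word, count))
```

Adding another alias then means adding one dictionary entry instead of another `elif` branch.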