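Having just covered word-frequency counting for English text, we now turn to counting how often each character appears in a Chinese text.
We use Romance of the Three Kingdoms (《三国演义》) as the example.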
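I. Approach
1. Using the jieba library
jieba is a well-regarded third-party library for Chinese word segmentation. With it we can split Chinese text into individual words.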
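2. Filtering the words
The goal is to count how many times each character appears in Romance of the Three Kingdoms, so the segmented words have to be filtered:
- Drop single-character words, which cannot be a person's full name.
- Inspect the output, drop the words that are not names, and repeat until the output matches what we expect.
- Some characters are referred to by several names or titles, and these need to be merged into one entry.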
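The three rules above can be sketched on a hand-written token list, without running the segmenter. The names `ALIASES`, `EXCLUDES`, and `count_names` are illustrative assumptions rather than part of the scripts in this post, and the token list is made up for the demonstration.

```python
# Illustrative alias table: each alternative title maps to one canonical name.
ALIASES = {
    "诸葛亮": "孔明",
    "孔明曰": "孔明",
    "玄德": "刘备",
    "玄德曰": "刘备",
}
# Illustrative stop list: frequent words that are not personal names.
EXCLUDES = {"将军", "却说"}


def count_names(tokens, aliases=ALIASES, excludes=EXCLUDES):
    """Count tokens after dropping single characters and excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1:  # rule 1: skip single-character tokens
            continue
        name = aliases.get(token, token)  # rule 3: merge alternative titles
        if name in excludes:  # rule 2: drop words known not to be names
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts


# Made-up tokens standing in for the segmenter's output.
tokens = ["却说", "玄德", "与", "孔明", "商议", "孔明曰", "将军", "玄德曰"]
print(count_names(tokens))
```

In this toy run the word "商议" survives the filter. Spotting a leftover like that in the output is the cue to add it to the stop list and run again, which is the repeat step described in the second rule.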
3. Sorting by number of appearances
Sort the entries by their dictionary values and print the 20 characters with the most appearances.
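As a minimal sketch of the ranking step, the dictionary can be sorted by value and then sliced. The helper name `top_n` and the sample counts are assumptions made up for illustration.

```python
def top_n(counts, n=20):
    """Return up to n (name, count) pairs, highest count first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]  # slicing is safe even when fewer than n entries exist


# Made-up counts, for illustration only.
sample = {"曹操": 5, "孔明": 4, "刘备": 7}
for name, count in top_n(sample, n=2):
    print("{0:<10}{1:>5}".format(name, count))
```

Slicing instead of indexing with `range(n)` avoids an `IndexError` when the dictionary holds fewer than `n` entries.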
II. Implementation
1. CalThreeKingdomsV1
Code
# CalThreeKingdomsV1.py
import jieba

# Read the whole novel; the encoding must match how the file was saved.
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # precise-mode segmentation, returns a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # highest count first
for word, count in items[:15]:
    print("{0:<10}{1:>5}".format(word, count))
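Notes:
- Open the Chinese text with an explicit encoding ("utf-8" here, matching how the file was saved). On a system whose default encoding is different, the read may otherwise fail or return garbled text.
- jieba.lcut() segments the text in precise mode and returns a list, so the result contains no redundant overlapping words.
- A dictionary tallies the appearances, and the list of (word, count) pairs is then sorted in descending order with list.sort().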
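The encoding point can be checked with a small experiment. This is a sketch that assumes a short sample sentence and a temporary file instead of the real novel, and it uses ASCII as the deliberately mismatched encoding.

```python
import os
import tempfile

text = "却说曹操引兵而来"  # sample sentence, for illustration only

# Write the sample as UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    f.write(text)
    path = f.name

try:
    # Matching encoding: the text comes back unchanged.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()

    # Mismatched encoding: ASCII cannot decode the multi-byte sequences.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        failed = False
    except UnicodeDecodeError:
        failed = True
finally:
    os.remove(path)  # clean up the temporary file

print(decoded == text, failed)
```

Other mismatched encodings may not raise an error at all and can silently produce garbled characters instead, which is harder to notice than an exception.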
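Output
The output is not what we were hoping for:
- “将军”, “却说”, “二人”, “不可”, “不能”, “如此”, and “荆州” are not personal names.
- “曹操” and “丞相” refer to the same person, as do “孔明” and “孔明曰”.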
2. CalThreeKingdomsV2
Remove the unwanted words from the dictionary and merge the entries of characters who go by several names.
Code
# CalThreeKingdomsV2.py
import jieba

# Frequent words from the V1 output that are not personal names.
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    # Merge alternative names and titles into one canonical name.
    elif word in ("诸葛亮", "孔明曰"):
        rword = "孔明"
    elif word in ("关公", "云长"):
        rword = "关羽"
    elif word in ("玄德", "玄德曰"):
        rword = "刘备"
    elif word in ("孟德", "丞相"):
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # unlike del, this does not fail if the word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # highest count first
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
Output
3. CalThreeKingdomsV3
After repeatedly filtering the results, we finally arrive at the 20 names with the most appearances:
Code
# CalThreeKingdomsV3.py
import jieba

# Words that are not personal names, collected by inspecting earlier outputs.
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议", "如何",
            "主公", "军士", "左右", "军马", "引兵", "次日", "大喜", "天下", "东吴",
            "于是", "今日", "不敢", "魏兵", "陛下", "一人", "都督", "人马", "不知",
            "汉中", "只见", "众将", "蜀兵", "上马", "大叫", "太守", "此人", "夫人",
            "后人", "背后", "城中", "一面", "何不", "大军", "忽报", "先生", "百姓",
            "何故", "然后", "先锋", "不如", "赶来", "原来", "令人", "江东", "下马",
            "喊声", "正是", "徐州", "忽然", "因此", "成都", "不见", "未知", "大败",
            "大事", "之后", "一军", "引军", "起兵", "军中", "接应", "进兵", "大惊",
            "可以"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    # Merge alternative names and titles into one canonical name.
    elif word in ("诸葛亮", "孔明曰"):
        rword = "孔明"
    elif word in ("关公", "云长"):
        rword = "关羽"
    elif word in ("玄德", "玄德曰", "先主"):
        rword = "刘备"
    elif word in ("孟德", "丞相"):
        rword = "曹操"
    elif word == "后主":
        rword = "刘禅"
    elif word == "天子":
        rword = "刘协"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # unlike del, this does not fail if the word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # highest count first
for word, count in items[:20]:
    print("{0:<10}{1:>5}".format(word, count))
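Output:
Note: some of the excluded words are ambiguous, such as “先生” and “夫人”. Each may refer to a specific character in context, so excluding them can drop some genuine mentions.
In the final result, the character with the most appearances is 曹操 (Cao Cao). Does that surprise you?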