"Romance of the Three Kingdoms" Character Appearance Count: A Worked Example

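Having just covered English word-frequency counting, we now turn to counting character appearances in a Chinese text.

We use Romance of the Three Kingdoms (《三国演义》) as the example.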

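I. Approach

1. Using the jieba library

jieba is a widely used third-party library for Chinese word segmentation. With it we can split Chinese text into individual words.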

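2. Filtering words

The goal is to count how many times each character appears in Romance of the Three Kingdoms, so the segmented words need to be filtered:

  • Drop single-character words, which cannot be full names.
  • Inspect the output, remove the words that are not names, and repeat until the output matches what we expect.
  • Some characters go by several names, which need to be merged into one.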
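
The three filtering steps above can be sketched on a small, hand-made token list. This is only an illustration: the token list, the `ALIASES` table, the `EXCLUDES` set and the `count_names` helper are assumptions made for the example, and a lookup table is used here in place of a chain of comparisons.

```python
# Hypothetical tokens, standing in for what a segmenter might return.
tokens = ["孔明", "曰", "诸葛亮", "却说", "玄德", "曹操", "丞相", "的", "孔明曰"]

# Illustrative alias table: alternative name -> canonical name.
ALIASES = {"诸葛亮": "孔明", "孔明曰": "孔明", "玄德": "刘备", "丞相": "曹操"}

# Illustrative set of words that are not names.
EXCLUDES = {"却说"}


def count_names(tokens, aliases, excludes):
    """Count tokens after dropping one-character words, merging aliases,
    and skipping excluded words."""
    counts = {}
    for token in tokens:
        if len(token) == 1:               # single characters are dropped
            continue
        name = aliases.get(token, token)  # merge alternative names
        if name in excludes:              # skip known non-names
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts


result = count_names(tokens, ALIASES, EXCLUDES)
print(result)
```

Keeping the aliases in a dictionary means a newly discovered alternative name is a one-line change to the table rather than another branch in the loop.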

3. Sorting by appearance count

Sort the entries by dictionary value and print the 20 characters with the most appearances.
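
Sorting by value and taking the top entries might look like the following sketch. The counts are made-up numbers for illustration, and `top_n` is a hypothetical helper. Slicing is used so that asking for more entries than exist does not raise an error.

```python
# Made-up counts for illustration only.
counts = {"孔明": 3, "刘备": 1, "曹操": 5}


def top_n(counts, n):
    """Return the n (word, count) pairs with the highest counts."""
    items = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return items[:n]  # slicing is safe even if n > len(items)


for word, count in top_n(counts, 20):
    print("{0:<10}{1:>5}".format(word, count))
```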


II. Implementation

1. CalThreeKingdomsV1

Code

# CalThreeKingdomsV1.py
import jieba

# Read the whole novel; the file is opened as UTF-8.
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character words
        continue
    counts[word] = counts.get(word, 0) + 1
# Sort the (word, count) pairs by count, highest first.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

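Notes:

  • Open the Chinese text with encoding "utf-8" so that it matches the file; if the encoding does not match, reading may fail or produce garbled text.
  • jieba.lcut() segments the text in precise mode and returns a list, with no redundant words.
  • A dictionary accumulates the counts, and the list's sort() method orders the (word, count) pairs by count.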
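
To see why the encoding matters, the sketch below writes a short UTF-8 file to a temporary directory and then reads it back twice, once with the matching encoding and once with a mismatched one. The file name `sample.txt` and its contents are made up for the example.

```python
import os
import tempfile

text = "曹操与孔明"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Write the text as UTF-8 bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading with the matching encoding round-trips the text.
    with open(path, "r", encoding="utf-8") as f:
        decoded_ok = f.read()

    # Reading with a mismatched encoding fails: ASCII cannot decode these bytes.
    try:
        with open(path, "r", encoding="ascii") as f:
            f.read()
        mismatch_failed = False
    except UnicodeDecodeError:
        mismatch_failed = True

print(decoded_ok == text, mismatch_failed)
```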

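Output

 The output is not quite what we want:

  • "将军" (general), "却说" (a storyteller's phrase), "二人" (the two men), "不可" (must not), "不能" (cannot), "如此" (thus) and "荆州" (Jingzhou, a place) are not names of characters.
  • "曹操" and "丞相" (chancellor) refer to the same person, as do "孔明" and "孔明曰" ("Kongming said").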

2. CalThreeKingdomsV2

Remove the words that are not names from the dictionary, and merge the alternative names of the same character.

Code

# CalThreeKingdomsV2.py
import jieba

# Frequent words from the V1 output that are not character names.
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    # Skip single characters, then map alternative names to one name.
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# Drop the non-name words; pop() with a default tolerates missing keys.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
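
When excluded words are removed from a counts dictionary after counting, it matters whether every excluded word is actually a key. The sketch below contrasts `del`, which raises `KeyError` for a missing key, with `dict.pop(key, None)`, which does not. The counts and the helper names `remove_strict` and `remove_tolerant` are made up for illustration.

```python
# Made-up counts for illustration only.
counts = {"曹操": 5, "将军": 4, "孔明": 3}
excludes = {"将军", "荆州"}  # "荆州" is not a key in counts


def remove_strict(counts, excludes):
    """Remove excluded words with del; fails if a word is missing."""
    result = dict(counts)
    for word in excludes:
        del result[word]
    return result


def remove_tolerant(counts, excludes):
    """Remove excluded words with pop(); missing words are ignored."""
    result = dict(counts)
    for word in excludes:
        result.pop(word, None)
    return result


try:
    remove_strict(counts, excludes)
    strict_failed = False
except KeyError:
    strict_failed = True

cleaned = remove_tolerant(counts, excludes)
print(strict_failed, cleaned)
```

The tolerant form is convenient while the exclusion list is still being tuned, since a word can stay in the list even if an earlier merge rule has already absorbed it.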

Output


3. CalThreeKingdomsV3

After repeated rounds of filtering, the output finally lists the top 20 names by appearance count:

Code

# CalThreeKingdomsV3.py
import jieba
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议", "如何",
            "主公", "军士", "左右", "军马", "引兵", "次日", "大喜", "天下", "东吴",
            "于是", "今日", "不敢", "魏兵", "陛下", "一人", "都督", "人马", "不知",
            "汉中", "只见", "众将", "蜀兵", "上马", "大叫", "太守", "此人", "夫人",
            "后人", "背后", "城中", "一面", "何不", "大军", "忽报", "先生", "百姓",
            "何故", "然后", "先锋", "不如", "赶来", "原来", "令人", "江东", "下马",
            "喊声", "正是", "徐州", "忽然", "因此", "成都", "不见", "未知", "大败",
            "大事", "之后", "一军", "引军", "起兵", "军中", "接应", "进兵", "大惊", 
            "可以"}
with open("threekingdoms.txt", "r", encoding="utf-8") as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰" or word == "先主":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    elif word == "后主":
        rword = "刘禅"
    elif word == "天子":
        rword = "刘协"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# Drop the non-name words; pop() with a default tolerates missing keys.
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(20):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

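Output:

 Note: some of the excluded words are ambiguous, such as "先生" (sir) and "夫人" (lady), since they may refer to specific characters.

Judging from the final result, the character who appears most often is 曹操 (Cao Cao). Does that surprise you?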

