python哈姆雷特词频统计_python day 17 文本词频统计

文本词频统计

一、概述

1.需求:一篇文章,出现了哪些词?哪些词出现得最多?

2.首先,要知道英文文本和中文文本的词频统计是不同的

二、“HAMLET”

1.噪音处理:提取单词,去除不必要的其他东西。

2.提取单词,split按空格切分,形成列表

3.统计单词和对应的词频,使用字典

4.词频按关键字:出现次数 排序,使用列表sort method

5.输出

Hamlet

def gettext():

text = open("hamlet.txt",'r').read()

text = text.lower()

for ch in '"#$%^&*()_+-,./<>=@{}[]\~\'':

text = text.replace(ch,'')

return text

hamlettext = gettext()

words = hamlettext.split()

counts = {}

for word in words:

counts[word]=counts.get(word,0)+1

items = list(counts.items())

items.sort(key = lambda x:x[1],reverse = True)

for i in range (20):

word,count = items[i]

print("{0:<10}{1:>5}".format(word,count))

1812747-20191229014051596-53445679.png

三、《三国演义》人名出场次数统计

1.第一版

#三国演义

#first,get words;second,count the times word appear in text;third,print the top 20

import jieba

txt = open('三国演义.txt','r',encoding='utf-8').read()

words = jieba.lcut(txt)

counts = {}

for word in words:

if len(word)==1:

continue

else:

counts[word] = counts.get(word,0) + 1

items = list(counts.items())

items.sort(key = lambda x:x[1],reverse = True)

for i in range (10):

word,times = items[i]

print('{0:<10}{1:>5}'.format(word,times))

发现问题:

孔明和孔明曰应该算作一个人

荆州等不是人名

改进:

从列表中删除非人名词组

在建立集合统计词语出场次数的时候,把孔明和孔明曰,算作一个次。

1812747-20191229014106816-173363822.png

2.第二版

import jieba

txt =open('D:/pythonfiles/三国演义.txt','r',encoding='utf-8').read()

excludes = {'将军','却说','荆州','二人','不可','不能','如此','如何','军士','商议','左右','军马','次日','引兵','大喜','天下','东吴','于是','今日','不敢'}

words = jieba.lcut(txt)

counts = {}

for word in words:

if len (word) == 1 :

continue

elif word == '诸葛亮' or word == '孔明曰':

reword = '孔明'

elif word == '玄德' or word == '玄德曰':

reword = '刘备'

elif word == '关公' or word == '云长':

reword = '关羽'

elif word == '孟德' or word == '丞相':

reword = '曹操'

else :

reword = word

counts[reword]=counts.get(reword,0)+1

for word in excludes:

del counts[word]

items = list(counts.items())

items.sort(key= lambda x:x[1],reverse = True)

for i in range(20):

word, count = items[i]

print('{:<10}{:>5}'.format(word,count))

依旧还是老问题,按照改进的方法,进一步优化即可。

1812747-20191229014127063-1355148596.png

3. 三国演义TOP20

import jieba

txt = open('三国演义.txt','r',encoding='utf-8').read()

txt = jieba.lcut(txt)

counts = {}

def countword(a):

global counts

counts[a] = counts.get(a,0)+1

for word in txt:

if len(word) == 1:

continue

elif word == '孔明曰' or word == '诸葛亮' :

word = '孔明'

countword(word)

elif word == '玄德' or word =='玄德曰' or word == '主公' or word == '先主':

word = '刘备'

countword(word)

elif word == '丞相' or word == '孟德':

word ='曹操'

countword(word)

elif word == '关公' or word == '云长':

word = '关羽'

countword(word)

elif word == '都督':

word = '周瑜'

countword(word)

elif word == '后主':

word = '刘禅'

countword(word)

else:

countword(word)

excluse = ['二人','不可','荆州','却说','不能','将军','如此','军士','如何','商议','左右','次日','引兵','大喜','天下','东吴','军马','于是',

'今日','不敢','陛下','魏兵','人马','一人','不知','汉中','众将','只见','蜀兵','大叫','上马','此人','太守','天子','背后','后人','城中'

,'一面','何不','忽报','大军','先生','何故','然后','先锋','夫人','不如','赶来','原来','令人','江东','徐州','正是','忽然','下马','喊声'

,'成都','因此','百姓','未知','大败','一军','大事','之后','不见','起兵','接应','军中','进兵','引军','大惊','可以']

for i in excluse:

del counts[i]

items = list(counts.items())

items.sort(key = lambda x:x[1],reverse = True)

for i in range(20):

item ,times = items[i]

print('{0:<10}{1:>5}'.format(item,times))

1812747-20191229022328947-1879736644.png

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值