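Text word-frequency statistics
English text separates words with spaces or punctuation, so extracting the words and counting them is straightforward.
Chinese text has no natural delimiter between characters, so it must first be segmented into words; this requires the third-party jieba library.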
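The difference between the two cases can be seen in a small sketch. The helper name split_words and both sample strings are assumptions made for illustration; the punctuation list is the one used in the Hamlet program in these notes.

```python
def split_words(text):
    """Lowercase the text, replace ASCII punctuation with spaces, then split on whitespace."""
    text = text.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    return text.split()

# Hypothetical sample sentences, not taken from the files used in these notes.
english = "To be, or not to be: that is the question."
chinese = "中文字符之间没有天然的分隔符"

print(split_words(english))  # ten separate words
print(split_words(chinese))  # one unsegmented string, because there is no whitespace to split on
```

Splitting on whitespace is enough for the English sentence, while the Chinese sentence comes back as a single item, which is why a segmentation step is needed before counting.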
Hamlet word-frequency count -> an example of English word-frequency statistics
Create the Hamlet.txt file first, then write the program:
def getText():
    # Adjust the drive letter and path to wherever the file is stored
    with open("D:\\Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()  # convert all English letters to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # a word not seen before starts from 0
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
for i in range(10):
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))
====================================================================================
Result:
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
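The dict-and-sort steps in the program above can also be written with collections.Counter from the standard library. This is only a minimal sketch of an alternative, not the method used in these notes, and the sample string is a hypothetical stand-in for the contents of Hamlet.txt.

```python
from collections import Counter

# Hypothetical sample text standing in for the cleaned contents of Hamlet.txt.
sample = "the king and the queen and the prince"

# Counter builds the word -> count mapping in one step.
counts = Counter(sample.split())

# most_common(n) returns the n highest counts as (word, count) pairs.
top = counts.most_common(2)
for word, count in top:
    print('{0:<10}{1:>5}'.format(word, count))
```

Counter replaces the counts.get(word, 0) + 1 loop, and most_common replaces the conversion to a list followed by sort.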
Character appearance count for Romance of the Three Kingdoms (三国演义) -> Chinese word-frequency statistics
import jieba

with open("D:\\三国演义.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens such as punctuation
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, highest first
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
=======================================================================================
玄德           18
张角           10
黄巾            8
张梁            8
卢植            8
玄德曰           8
张飞            8
天下            7
三人            7
一名            6
刘焉            6
叔父            6
英雄            5
张宝            5
姓名            5
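The output above still mixes character names with ordinary words (天下, 三人) and counts 玄德 and 玄德曰 separately. One possible refinement, sketched here under assumptions, is to merge aliases and skip non-name words before counting. The alias table, the exclusion set, and the token list are all hypothetical choices made by looking at the output; the token list stands in for the result of jieba.lcut(txt) so that the sketch runs without the text file.

```python
# Hypothetical token list standing in for the output of jieba.lcut(txt).
words = ["玄德", "玄德曰", "天下", "张飞", "玄德", "三人", "张飞", "英雄"]

# Assumed alias table and exclusion set, chosen by inspecting the output above.
aliases = {"玄德曰": "玄德"}
excludes = {"天下", "三人", "一名", "叔父", "英雄", "姓名"}

counts = {}
for word in words:
    if len(word) == 1:
        continue  # same single-character filter as the program above
    name = aliases.get(word, word)  # map an alias to its canonical name
    if name in excludes:
        continue  # skip words that are not character names
    counts[name] = counts.get(name, 0) + 1

items = sorted(counts.items(), key=lambda x: x[1], reverse=True)
for name, count in items:
    print("{0:<10}{1:>5}".format(name, count))
```

On a full text both tables would need to be extended by hand after inspecting each run's output.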