English Word Frequency Statistics

taon1607

Published 2020-06-23 13:07:38

Article link: https://blog.csdn.net/taon1607/article/details/106921206

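This example uses Hamlet, one of Shakespeare's four great tragedies, to count how often each word occurs in the text. The overall steps are: read the text, convert it to lowercase, replace special characters, split it into words, count word frequencies, and sort. Looking at the most frequent words gives a rough idea of what the text is about. This section does not cover stop word removal for English text.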

Stop words: words that occur very frequently but contribute little to the main idea of a text. In English, examples include I, and, but, here, there, and some.
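Although the code in this section does not remove stop words, the idea can be illustrated with a minimal sketch. The `STOP_WORDS` set below is hypothetical and deliberately tiny (it only contains the examples listed above), and `remove_stop_words` and the sample word list are made-up names and data for illustration, not part of the original code.

```python
# Hypothetical, deliberately tiny stop word set built from the examples above;
# a real project would use a much longer list.
STOP_WORDS = {'i', 'and', 'but', 'here', 'there', 'some'}

def remove_stop_words(words, stop_words=STOP_WORDS):
    """Keep only the words that are not in the stop word set."""
    return [w for w in words if w not in stop_words]

# Made-up sample input: lowercase words, as produced by the split step
words = ['i', 'love', 'hamlet', 'and', 'some', 'tragedy']
print(remove_stop_words(words))
# ['love', 'hamlet', 'tragedy']
```

Using a set for the stop words keeps each membership check fast, which matters when the word list has tens of thousands of entries.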

Document link: https://pan.baidu.com/s/17ehiYKripA–noIjfFLBbQ
Extraction code: yuhq

The code below shows the English word frequency count:

# Load the text; the with-block closes the file automatically
with open('./data/hamlet.txt', 'r') as f:
    txt = f.read()
print(txt)

# Only part of the output is shown here
# The Tragedy of Hamlet, Prince of Denmark
# Shakespeare homepage | Hamlet | Entire play
# ACT I

# SCENE I. Elsinore. A platform before the castle.

# FRANCISCO at his post. Enter to him BERNARDO

# Convert the entire text to lowercase
txt = txt.lower()

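# Replace special characters with spaces
# (the apostrophe is not in this list, so apostrophes inside words are kept)
for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
    txt = txt.replace(ch, ' ')

# Split on whitespace to extract all the words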
words = txt.split()
print(words)
#['the','tragedy','of','hamlet','prince','of'......]

# Total number of words
print(len(words))
# 32259

# Number of distinct words
print(len(set(words)))  # set() drops duplicate elements
# 4793

# Count how often each word occurs
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Convert the counts dictionary into a list of (word, count) pairs
counts = list(counts.items())
print(counts)
#[('the', 1138),('tragedy', 3),('of', 669),('hamlet', 462),('prince', 10)......]

# Sort the counts list by word frequency, highest first
counts.sort(key=lambda x: x[1], reverse=True)

# Print the 10 most frequent words
for i in range(10):
    print(counts[i][0],counts[i][1])
# the 1138
# and 965
# to 754
# of 669
# you 550
# i 542
# a 542
# my 514
# hamlet 462
# in 436
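
The same pipeline can also be written more compactly with the standard library. The following is only a sketch of one possible alternative, not the method used above: it relies on `collections.Counter` and `str.translate`, and the function name `top_words`, the constant `PUNCTUATION`, and the sample string are assumptions introduced for illustration.

```python
from collections import Counter

# Punctuation list modeled on the one used in the loop above (no apostrophe)
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Map every punctuation character to a space in a single pass
    table = str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION))
    words = text.lower().translate(table).split()
    # Counter does the counting; most_common sorts by count, highest first
    return Counter(words).most_common(n)

# Short sample string used instead of the full file
sample = "To be, or not to be: that is the question."
print(top_words(sample, 2))
# [('to', 2), ('be', 2)]
```

To run it on the full play, one would pass the string read from `./data/hamlet.txt` in place of `sample`.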