[Word Frequency] Counting word frequencies in English text with Python's jieba

Fx_x

Article link: https://blog.csdn.net/Fx_2003/article/details/127594849

1. Basic idea: count the 20 most frequent words in a Harry Potter novel, excluding stopwords (such as "is")

2. Stopwords (excerpt)
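The stopword file itself is not reproduced here, so the following is only a minimal sketch of how such a list might be loaded. It assumes one word per line; the sample words and the helper name `load_stopwords` are illustrative and do not come from the author's `stopwords_EN.txt`. It also shows why a set is a safer container than a raw string: a set matches whole words only.

```python
import io


def load_stopwords(handle):
    """Read one stopword per line into a set, skipping blank lines."""
    return {line.strip().lower() for line in handle if line.strip()}


# Illustrative sample standing in for stopwords_EN.txt (not the author's list).
sample_file = io.StringIO("is\nthe\nand\nof\n\nto\n")
stopwords = load_stopwords(sample_file)

print(sorted(stopwords))       # ['and', 'is', 'of', 'the', 'to']
print("he" in stopwords)       # False: a set only matches whole words
print("he" in "is the and")    # True: a string matches any substring
```

With a real file, `load_stopwords(open("stopwords_EN.txt", encoding="utf-8"))` would be the equivalent call, assuming the file is UTF-8 encoded.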

3. The code

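# -*- coding: utf-8 -*-
"""
@File  : 04.py
@author: FxDr
@Time  : 2022/10/19 14:33

Word-frequency count for an English text.
"""
import jieba

# Open the file and read its contents (assumes the file is UTF-8 encoded)
with open("Harry Potter and The Half Blood Prince.txt", "r", encoding="utf-8") as f:
    txt = f.read()

# Convert to lowercase
txt = txt.lower()
# Replace punctuation and special symbols with spaces
for each in '"’—!|“#$%&()*+,-./:;<=>?@[\\]^{|}~”':
    txt = txt.replace(each, " ")

# Tokenize; jieba.cut returns a generator, and whitespace comes back as tokens too
words = jieba.cut(txt)

# Load the stopwords (words not worth counting) into a set, so that
# membership tests match whole words rather than substrings
with open('stopwords_EN.txt', 'r', encoding="utf-8") as f:
    stopwords = set(f.read().split())

counts = {}

for word in words:
    word = word.strip()  # drop whitespace-only tokens
    if word and word not in stopwords:
        counts[word] = counts.get(word, 0) + 1

items = list(counts.items())  # convert the dict to a list of (word, count) tuples
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
print(type(items))  # <class 'list'>
print(type(items[0]))  # <class 'tuple'>
for word, count in items[:20]:
    print("{0:<10}{1:>5}".format(word, count))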

The output is as follows:

harry 2815
dumbledore 1034
hermione 694
slughorn 397
snape 379
malfoy 371
professor 280
voldemort 243
ginny 233
hagrid 231
weasley 211
eyes 206
dark 200
voice 195
wand 191
door 179
moment 167
people 165
head 162
told 160
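jieba is designed mainly for Chinese word segmentation, so for English text the same count can also be done with the standard library alone. The sketch below is an alternative, not the author's method: it swaps jieba for a regular expression and the hand-written dictionary for `collections.Counter`. The function name `top_words` and the sample sentence are illustrative.

```python
import re
from collections import Counter


def top_words(text, stopwords, n=20):
    """Return the n most frequent words in text, skipping stopwords."""
    # [a-z]+ keeps runs of letters only, so punctuation and digits are dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


sample = "Harry looked at the wand. The wand was Harry's wand."
for word, count in top_words(sample, {"the", "at", "was", "s"}, n=3):
    print("{0:<10}{1:>5}".format(word, count))
```

`most_common(n)` returns the entries already sorted by count, so the explicit `sort` call and the slice are no longer needed.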

4. Side note: a tuple can be unpacked in an assignment like this

# -*- coding: utf-8 -*-
"""
@File  : 00.py
@author: FxDr
@Time  : 2022/10/30 2:07
"""
t = (1, 2)
print(type(t))  # <class 'tuple'>
a, b = t  # unpack the tuple: a = 1, b = 2
print(a, b)  # 1 2
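The same unpacking is what makes `word, count = items[i]` work in the word-frequency script. As a short sketch with made-up sample data, it also applies directly in a `for` loop header, in a swap, and with a starred name:

```python
counts = {"harry": 3, "wand": 2}

# Each element of items() is a (key, value) tuple, unpacked by the loop header.
pairs = []
for word, count in counts.items():
    pairs.append((count, word))

# Unpacking also swaps two values without a temporary variable.
a, b = 1, 2
a, b = b, a

# A starred name collects the remaining elements into a list.
first, *rest = (1, 2, 3)

print(pairs, a, b, first, rest)
```

The number of names on the left must match the number of elements on the right unless one of them is starred; otherwise Python raises a `ValueError`.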
