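Python Word Frequency Counting

Yesterday a friend asked me to help with a lab report on word frequency counting, so I am writing it up as a blog post as well.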

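In brief, the lab report requires:

① Count how many times each word occurs in an English article
② Remove all punctuation
③ Present the result as a dictionary
④ Convert all uppercase letters to lowercase
⑤ Extra requirement: print the output with pprint.pprint()

Enough preamble, on to the code!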

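Contents

Step 1: Import the libraries

Step 2: Read the article

Step 3: Process the article text

Step 4: Create the dictionary and fill in the values

Step 5: Process the dictionary

Full code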


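Step 1: Import the libraries

Splitting the text into words uses the jieba library, and printing with pprint.pprint() requires the pprint module. jieba is a third-party package built for Chinese word segmentation (install it with pip install jieba); here it is used simply to split the English article into tokens. pprint ships with the Python standard library.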

import jieba
import pprint

Step 2: Read the article

The article is as follows:

文章1.txt

In recent years, there has been a big growth in the area of college athletics. Sports are so popular on campus that they have become an indispensable part in college education.The popularity of college athletics results from the fact that they enable students to reap three great benefits. The first great benefit is that sports help students study better. Sports offer a healthy form of escape from their monotonous routine studies. Taking part in swimming, jogging and ball games can make their academic life richer and fuller. Besides, sports contribute to students physical well being. According to the recent statistics, almost half of students’ bodies are increasingly weak and vulnerable to disease because of inactivity. School authorities, as well as students themselves, have come to realize that sports are an effective way to solve such kind of problems.And most importantly, sports help to promote the development of a student’s personality. Such traits of character as competitiveness, self-discipline, teamwork, strong will and endurance are encouraged and cultivated in sports. These are the things much needed not only on campus but also in their later career and life.College athletics plays a vital role in a student’s life, and it deserves further attention from colleges and universities.

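Next comes a very simple file operation. Here is the code:

with open("文章1.txt", "r", encoding="utf-8") as f:   # assumes the file is saved as UTF-8; the with block closes it automatically
    text = f.read()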

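Step 3: Process the article text

We need to remove the commas and periods that appear throughout the article, convert uppercase letters to lowercase, and finally use jieba to split the article into individual words.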

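text_cut_douhao = text.replace(',', ' ')            # replace ',' with a space
text_cut_juhao = text_cut_douhao.replace('.', ' ')  # replace '.' with a space so "education.The" does not merge into one word
text_lower = text_cut_juhao.lower()                 # convert everything to lowercase

words = jieba.lcut(text_lower)                      # split the string into a list of tokens
print(words)                                        # debug print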
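The code above only handles commas and periods, while requirement ② asks for all punctuation to be removed. As a rough sketch of one way to do that with the standard library alone (this is not the approach used in this post, and the helper name tokenize and the sample sentence are made up for illustration), str.translate can map every punctuation character to a space before splitting on whitespace:

```python
import string


def tokenize(text):
    """Hypothetical helper: lowercase, strip punctuation, split on whitespace."""
    # string.punctuation covers ASCII punctuation only, so the curly quotes
    # found in the article are added by hand. The hyphen is kept on purpose
    # so that "self-discipline" stays a single word (an assumption of this sketch).
    to_remove = string.punctuation.replace("-", "") + "’‘“”"
    # Map each punctuation character to a space so that "education.The"
    # becomes two words instead of merging into one.
    table = str.maketrans({ch: " " for ch in to_remove})
    return text.lower().translate(table).split()


sample = "Sports are popular in college education.The student’s self-discipline, teamwork and will."
print(tokenize(sample))
```

Unlike the jieba version, this sketch never produces space tokens, so no length check is needed afterwards to filter them out.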

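Step 4: Create the dictionary and fill in the values

Create an empty dictionary named counts, then fill it in with a for loop. In the line counts[word] = counts.get(word,0) + 1, the call counts.get(word, 0) returns the current count if word is already a key in counts and 0 otherwise; adding 1 then records the current occurrence. Note that the len(word) == 1 check skips every single-character token: spaces and leftover punctuation, but also one-letter words such as "a".
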
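counts = {}                      # create an empty dictionary
for word in words:               # fill the dictionary in a for loop
    if len(word) == 1:           # skip spaces and other single-character tokens
        continue
    counts[word] = counts.get(word,0) + 1   # add one to this word's count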
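To see the counts.get(word, 0) + 1 pattern in isolation, here is a small sketch on a hand-written word list (the list is invented for illustration and is not taken from the article). It also compares the result with collections.Counter from the standard library, which produces the same tallies and is shown only as an alternative, not as the method used in this post:

```python
from collections import Counter

# Made-up sample data for illustration.
words = ["sports", "help", "students", "sports", "are", "sports"]

counts = {}
for word in words:
    # get() returns 0 the first time a word is seen, so no KeyError is raised.
    counts[word] = counts.get(word, 0) + 1

print(counts)

# Counter builds the same word-to-count mapping in one call.
print(Counter(words) == counts)
```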

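Step 5: Process the dictionary

Sort with a lambda as the key, then loop over the result and print each entry with pprint. A note on how pprint differs from print: pprint.pprint() pretty-prints the repr of a Python object, which mainly helps with long or nested data structures, whereas print() writes the plain str form. Because each formatted line here is a string, pprint.pprint() shows it wrapped in quotes.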

items = list(counts.items())                                # list of (word, count) pairs
print(items)                                                # debug print
items.sort(key=lambda x: x[1], reverse=True)                # sort by count, highest first
for word, count in items:                                   # one pair per distinct word
    pprint.pprint("{0:<15}{1:>5}".format(word, count))      # word left-aligned in 15 columns, count right-aligned in 5
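Requirements ③ and ⑤ ask for the result to be shown as a dictionary via pprint.pprint(). As a possible variation on the loop above (a sketch only, with made-up counts, and assuming Python 3.8 or later for the sort_dicts parameter), the sorted pairs can be turned back into a dict and handed to pprint in one go:

```python
import pprint

# Made-up counts for illustration; in the post these come from Step 4.
counts = {"college": 4, "sports": 9, "athletics": 3, "students": 4}

# Sort by count, highest first. Ties keep their original relative order.
ranked = dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))

# sort_dicts=False keeps the frequency order instead of re-sorting by key.
pprint.pprint(ranked, sort_dicts=False)

# On a plain string, pprint shows the repr (with quotes); print does not.
pprint.pprint("sports")
print("sports")
```

Without sort_dicts=False, pprint would reorder the dictionary alphabetically by key and the frequency ranking would be lost in the output.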

Full code

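# coding=UTF-8
"""
 Author: 程序员弘羽
 Written: 2021/12/13 9:44
"""
import jieba
import pprint

with open("文章1.txt", "r", encoding="utf-8") as f:   # assumes the file is saved as UTF-8; the with block closes it automatically
    text = f.read()
text_cut_douhao = text.replace(',', ' ')            # replace ',' with a space
text_cut_juhao = text_cut_douhao.replace('.', ' ')  # replace '.' with a space so "education.The" does not merge into one word
text_lower = text_cut_juhao.lower()                 # convert everything to lowercase

words = jieba.lcut(text_lower)                      # split the string into a list of tokens
print(words)                                        # debug print

counts = {}                      # create an empty dictionary
for word in words:               # fill the dictionary in a for loop
    if len(word) == 1:           # skip spaces and other single-character tokens
        continue
    counts[word] = counts.get(word,0) + 1   # add one to this word's count


items = list(counts.items())                                # list of (word, count) pairs
print(items)                                                # debug print
items.sort(key=lambda x: x[1], reverse=True)                # sort by count, highest first
for word, count in items:                                   # one pair per distinct word
    pprint.pprint("{0:<15}{1:>5}".format(word, count))      # word left-aligned in 15 columns, count right-aligned in 5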
