Word Frequency Counting in Python

程序员弘羽

Last modified: 2023-03-02 17:41:07

First published: 2021-12-13 15:22:16

Article link: https://blog.csdn.net/weixin_45821611/article/details/121905778

Yesterday a friend asked me to help with a lab report on word frequency counting, so I am writing it up as a blog post as well.

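The requirements of the lab report, briefly:

① Count how many times each word occurs in an English article

② Remove all punctuation

③ Display the result as a dictionary

④ Convert all uppercase letters to lowercase

⑤ Extra requirement: print the output with pprint.pprint()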

Without further ado, here is the good stuff!

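Step 1. Import the libraries

Splitting the text into words relies on the jieba library here (jieba is primarily a Chinese word segmentation library, but it also separates English words), and printing with pprint.pprint() requires the pprint module from the standard library. jieba is a third-party package and can be installed with pip install jieba.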

import jieba
import pprint

Step 2. Read the article

The article is as follows:

文章1.txt

In recent years, there has been a big growth in the area of college athletics. Sports are so popular on campus that they have become an indispensable part in college education.The popularity of college athletics results from the fact that they enable students to reap three great benefits. The first great benefit is that sports help students study better. Sports offer a healthy form of escape from their monotonous routine studies. Taking part in swimming, jogging and ball games can make their academic life richer and fuller. Besides, sports contribute to students physical well being. According to the recent statistics, almost half of students’ bodies are increasingly weak and vulnerable to disease because of inactivity. School authorities, as well as students themselves, have come to realize that sports are an effective way to solve such kind of problems.And most importantly, sports help to promote the development of a student’s personality. Such traits of character as competitiveness, self-discipline, teamwork, strong will and endurance are encouraged and cultivated in sports. These are the things much needed not only on campus but also in their later career and life.College athletics plays a vital role in a student’s life, and it deserves further attention from colleges and universities.

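Next comes a very simple file operation. Here is the code:

with open("文章1.txt", "r", encoding="utf-8") as f:  # assumes the file is saved as UTF-8
    text = f.read()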
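A slightly more defensive way to read the file is sketched below. It assumes the article is saved as UTF-8 (the text contains curly apostrophes such as ’, which may fail to decode under some platform default encodings). The helper name `read_text` and the temporary demo file are hypothetical and exist only so the sketch can run without 文章1.txt.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole text file and close it automatically.

    `read_text` is a hypothetical helper; UTF-8 is an assumption
    about how the article file was saved.
    """
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo with a temporary file so the sketch runs without 文章1.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("A student’s life.")
    demo_text = read_text(demo_path)

print(demo_text)
```

With the real file, the call would simply be read_text("文章1.txt").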

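Step 3. Process the article content

We need to remove the commas and periods that appear throughout the article, convert the uppercase letters to lowercase, and finally use jieba to split the article into individual words.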

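text_cut_douhao = text.replace(',', ' ')            # replace ',' with a space
text_cut_juhao = text_cut_douhao.replace('.', ' ')  # replace '.' with a space so "education.The" does not become "educationThe"
text_lower = text_cut_juhao.lower()                 # convert everything to lowercase

words = jieba.lcut(text_lower)                      # split the string into a list of tokens
print(words)                                        # print for testing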
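Requirement ② asks for all punctuation to be removed, while the code above only handles commas and periods. One possible way to cover the rest is sketched below using only the standard library: str.translate maps every punctuation character to a space, and str.split() then separates the words. This is an alternative to the jieba-based approach in this post, not a description of it; the names `PUNCTUATION`, `TO_SPACE`, `normalize`, and the sample sentence are made up for illustration.

```python
import string

# ASCII punctuation plus the curly quotes that appear in the article.
PUNCTUATION = string.punctuation + "‘’“”"

# Map each punctuation character to a space so that words on either
# side of it (e.g. "education.The") are not glued together.
TO_SPACE = str.maketrans({ch: " " for ch in PUNCTUATION})


def normalize(text):
    """Lowercase the text and replace punctuation with spaces."""
    return text.lower().translate(TO_SPACE)


sample = "Sports are popular in college education.The first benefit, however, is health."
sample_words = normalize(sample).split()  # split() collapses runs of whitespace
print(sample_words)
```

Note that this also splits "student’s" into "student" and "s" and "self-discipline" into "self" and "discipline", so a length filter like the one in Step 4 is still useful afterwards.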

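Step 4. Create the dictionary and fill in the values

Create an empty dictionary named counts, then fill it in with a for loop. Here,

counts[word] = counts.get(word,0) + 1 means: if word is already a key in counts, counts.get(word,0) returns its current count; otherwise it returns the default 0. Either way, 1 is added and the result is stored back under word.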

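counts = {}                      # create an empty dictionary
for word in words:               # fill the dictionary with a for loop
    if len(word) == 1:           # skip single-character tokens: spaces and leftover punctuation (this also drops the word "a")
        continue
    counts[word] = counts.get(word,0) + 1   # increment the count for this word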
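To make the counts.get(word, 0) + 1 idiom concrete, the sketch below runs it on a tiny hand-made word list and compares the result with collections.Counter from the standard library, which builds the same mapping. The list `demo_words` is invented for illustration and is not taken from the article.

```python
from collections import Counter

demo_words = ["sports", "help", "students", "sports", "help", "sports"]

demo_counts = {}
for word in demo_words:
    # First time a word is seen, get() falls back to 0, so the count becomes 1.
    # Afterwards get() returns the stored count, which is then incremented.
    demo_counts[word] = demo_counts.get(word, 0) + 1

print(demo_counts)  # {'sports': 3, 'help': 2, 'students': 1}

# Counter builds the same mapping in one call.
assert demo_counts == dict(Counter(demo_words))
```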

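Step 5. Process the dictionary

Sort the items using a lambda as the sort key, then loop over them and print each one with pprint. A note on the difference between pprint and print: print writes the str() form of an object on a single line, whereas pprint writes its repr() and breaks long or nested data structures such as dictionaries and lists over several lines so they are easier to read. Because the loop here passes a formatted string to pprint, each line is printed with quotation marks around it.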

items = list(counts.items())                                # list of (word, count) pairs
print(items)
items.sort(key=lambda x:x[1], reverse=True)                 # sort by count, highest first
for word, count in items:                                   # one iteration per distinct word
    pprint.pprint("{0:<15}{1:>5}".format(word, count))      # pprint shows the string's repr, so each line is quoted
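Requirements ③ and ⑤ ask for the result to be shown as a dictionary and printed with pprint.pprint(). One way to satisfy both at once, sketched under the assumption of Python 3.8 or later, is to rebuild a dictionary from the sorted items and hand the whole dictionary to pprint. The counts in `demo_counts` are made-up values for illustration, not real results from the article.

```python
import pprint

demo_counts = {"students": 4, "sports": 6, "college": 3, "athletics": 2}

# Sort by count, highest first; dicts keep insertion order (Python 3.7+).
sorted_counts = dict(
    sorted(demo_counts.items(), key=lambda item: item[1], reverse=True)
)

# print() writes the dict on one line; pprint() wraps it to the given
# width, here one entry per line. sort_dicts=False (Python 3.8+) keeps
# the frequency order instead of re-sorting the keys alphabetically.
print(sorted_counts)
pprint.pprint(sorted_counts, width=30, sort_dicts=False)
```

Without sort_dicts=False, pprint would order the keys alphabetically and the frequency ranking would be lost in the output.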

Complete code

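# coding=UTF-8
"""
 Author: 程序员弘羽
 Created: 2021/12/13 9:44
"""
import jieba
import pprint

with open("文章1.txt", "r", encoding="utf-8") as f:  # assumes the file is saved as UTF-8
    text = f.read()
text_cut_douhao = text.replace(',', ' ')            # replace ',' with a space
text_cut_juhao = text_cut_douhao.replace('.', ' ')  # replace '.' with a space so "education.The" does not become "educationThe"
text_lower = text_cut_juhao.lower()                 # convert everything to lowercase

words = jieba.lcut(text_lower)                      # split the string into a list of tokens
print(words)                                        # print for testing

counts = {}                      # create an empty dictionary
for word in words:               # fill the dictionary with a for loop
    if len(word) == 1:           # skip single-character tokens: spaces and leftover punctuation (this also drops the word "a")
        continue
    counts[word] = counts.get(word,0) + 1   # increment the count for this word


items = list(counts.items())                                # list of (word, count) pairs
print(items)
items.sort(key=lambda x:x[1], reverse=True)                 # sort by count, highest first
for word, count in items:                                   # one iteration per distinct word
    pprint.pprint("{0:<15}{1:>5}".format(word, count))      # pprint shows the string's repr, so each line is quoted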