Python: Counting the Most Frequent Words in a Text

Latest recommended article published on 2023-04-12 17:25:16

阿土的炼丹炉

Article link: https://blog.csdn.net/qq_43527713/article/details/114482509

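A common problem: given an article, we want to count the words that occur repeatedly and use them to get a rough picture of what the article is about. A solution to this problem can be used to automatically retrieve and archive online information. In an age of information explosion, this kind of archiving or classification is very much needed. This is the "word frequency counting" problem.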

Note: throughout this article, txt is assumed to be a string.

Problem: word frequency counting, i.e. counting how often each word occurs in an English article

Method:

Step 1: split the English article and extract its words
Step 2: count each word
Step 3: sort the words by count, from highest to lowest
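
The three steps above can be sketched as a single function. This is only a minimal illustration: the function name word_frequencies and the sample sentence are assumptions made here, not part of the original article.

```python
def word_frequencies(txt):
    """Return (word, count) pairs sorted from most to least frequent."""
    # Step 1: lowercase the text, turn separators into spaces, split into words
    txt = txt.lower()
    for ch in ',.\n':
        txt = txt.replace(ch, ' ')
    words = txt.split()

    # Step 2: count each word
    count = {}
    for word in words:
        count[word] = count.get(word, 0) + 1

    # Step 3: sort by count, highest first
    return sorted(count.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample input, for illustration only
print(word_frequencies("The cat saw the dog. The dog ran."))
```

Each step is explained in detail in the implementation steps that follow.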

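Implementation steps

Step 1: split the article and extract its words

Calling txt.lower() converts every letter to lowercase, so that differences in capitalization in the original text do not distort the counts. To unify the separators, replace special characters and punctuation marks with spaces using txt.replace(), then extract the words with txt.split().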

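txt = txt.lower()              # normalize case
for s in ',.\n ':              # separator characters to replace
    txt = txt.replace(s, ' ')  # turn each separator into a space
list = txt.split()             # split on whitespace; note this name shadows the built-in list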
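
The loop above only handles commas, periods and newlines, so other punctuation such as question marks or quotes stays attached to words. One possible alternative, shown here as a sketch rather than the author's method, is to extract the words with a regular expression. The function name tokenize and the sample sentence are assumptions made for illustration. Note that this pattern keeps only letters, so digits are dropped.

```python
import re


def tokenize(txt):
    """Split text into lowercase words, keeping inner hyphens and apostrophes."""
    # [a-z]+ matches a run of letters; (?:[-'][a-z]+)* allows further parts
    # joined by a hyphen or apostrophe, e.g. "well-groomed" or "don't".
    return re.findall(r"[a-z]+(?:[-'][a-z]+)*", txt.lower())


# Hypothetical sample input, for illustration only
print(tokenize('He said: "Wait!" Is it well-groomed? (Yes; it is.)'))
# ['he', 'said', 'wait', 'is', 'it', 'well-groomed', 'yes', 'it', 'is']
```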

Step 2: count each word

count = {}
for word in list:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1

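Alternatively, the same logic can be written more concisely with count.get(), which returns the default 0 when the word has not been seen yet:
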
for word in list:
    count[word]=count.get(word,0)+1
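
The standard library also provides collections.Counter, which performs the same kind of counting. The sketch below is an alternative to the dictionary code above, not part of the original method, and the sample words are made up for illustration.

```python
from collections import Counter

# Hypothetical sample data, for illustration only
words = "the cat saw the dog the dog ran".split()

count = Counter(words)   # counts every element of the list
print(count["the"])      # 3
print(count["bird"])     # 0: a missing key reads as zero and is not added
```

A Counter is a dict subclass, so count.items() works on it exactly as it does on a plain dictionary.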

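Step 3: sort the words by count, from highest to lowest
A dictionary is not ordered by its values, so convert it to a list of (word, count) pairs with count.items(), then use the built-in sorted() function with a lambda key to order the pairs by count.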

sort=sorted(count.items(), key=lambda item:item[1],reverse=True)
print(sort)
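
With reverse=True, words that share the same count keep the order in which they were first seen. If a deterministic alphabetical order among ties is preferred, or only the top few words are needed, a variation like the following may help. It is a sketch: the function name top_words and the sample counts are assumptions made for illustration.

```python
# Hypothetical sample counts, for illustration only
count = {'the': 3, 'dog': 2, 'cat': 2, 'ran': 1}


def top_words(count, n):
    """Return the n most frequent (word, count) pairs, ties in alphabetical order."""
    # Sort key: the negated count puts larger counts first;
    # the word itself breaks ties alphabetically.
    ranked = sorted(count.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


print(top_words(count, 3))  # [('the', 3), ('cat', 2), ('dog', 2)]
```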

Example: counting word frequencies in a CET-6 essay

txt='''To be successful in a job interview or in almost any interview situation, the applicants houlddemonstrate certain personal and professional qualities.

　　Most likely, the first and often a lasting impression of a person is determined by the clotheshe wears. The job applicant should take care to appear well-groomed and modestly dressed, avoiding the extremes of too pompous or too casual attire .

　　Besides care for personal appearance, he should pay close attention to his manner of speaking, which should be neither ostentatious nor familiar but rather straight forward, grammaticallyaccurate, and in a friendly way.

　　In addition, he should be prepared to talk knowledgeably about the requirements of theposition, for which he is applying in relation to his own professional experience and interests.

　　And finally, the really impressive applicant must convey a sense of self-confidence andenthusiasm for work, as these are factors all interviewers value highly.

　　If the job seeker displays the above-mentioned characteristics, he, with a little luck, willcertainly succeed in the typical personnel interview.'''
for s in ',.\n ':
    txt=txt.replace(s,' ')
txt=txt.lower()
list=txt.split()
print(list)

count=dict()
for i in list:
    count[i]=count.get(i,0)+1
print(count)
sort=sorted(count.items(), key=lambda item:item[1],reverse=True)
print(sort)

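The final print(sort) call outputs the following. Run-together tokens such as 'houlddemonstrate' and 'clotheshe' come from missing spaces in the sample text itself: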

[('the', 10), ('in', 6), ('a', 6), ('and', 6), ('to', 5), ('of', 5), ('should', 4), ('he', 4), ('be', 3), ('job', 3), ('interview', 3), ('for', 3), ('or', 2), ('personal', 2), ('professional', 2), ('is', 2), ('applicant', 2), ('care', 2), ('too', 2), ('his', 2), ('which', 2), ('successful', 1), ('almost', 1), ('any', 1), ('situation', 1), ('applicants', 1), ('houlddemonstrate', 1), ('certain', 1), ('qualities', 1), ('most', 1), ('likely', 1), ('first', 1), ('often', 1), ('lasting', 1), ('impression', 1), ('person', 1), ('determined', 1), ('by', 1), ('clotheshe', 1), ('wears', 1), ('take', 1), ('appear', 1), ('well-groomed', 1), ('modestly', 1), ('dressed', 1), ('avoiding', 1), ('extremes', 1), ('pompous', 1), ('casual', 1), ('attire', 1), ('besides', 1), ('appearance', 1), ('pay', 1), ('close', 1), ('attention', 1), ('manner', 1), ('speaking', 1), ('neither', 1), ('ostentatious', 1), ('nor', 1), ('familiar', 1), ('but', 1), ('rather', 1), ('straight', 1), ('forward', 1), ('grammaticallyaccurate', 1), ('friendly', 1), ('way', 1), ('addition', 1), ('prepared', 1), ('talk', 1), ('knowledgeably', 1), ('about', 1), ('requirements', 1), ('theposition', 1), ('applying', 1), ('relation', 1), ('own', 1), ('experience', 1), ('interests', 1), ('finally', 1), ('really', 1), ('impressive', 1), ('must', 1), ('convey', 1), ('sense', 1), ('self-confidence', 1), ('andenthusiasm', 1), ('work', 1), ('as', 1), ('these', 1), ('are', 1), ('factors', 1), ('all', 1), ('interviewers', 1), ('value', 1), ('highly', 1), ('if', 1), ('seeker', 1), ('displays', 1), ('above-mentioned', 1), ('characteristics', 1), ('with', 1), ('little', 1), ('luck', 1), ('willcertainly', 1), ('succeed', 1), ('typical', 1), ('personnel', 1)]