Python: Counting the Most Frequent Words in a Text


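In many situations you run into this problem: given an article, you want to count the words that occur repeatedly and use them to get a rough idea of what the article is about. A solution to this problem can be used to automatically index and archive information from the web. In an age of information overload, this kind of archiving or classification is very much needed. This is the "word frequency counting" problem.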

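Note: throughout this article, txt is a string.

Problem: word frequency counting, i.e. counting how often each word occurs in an English article

Method:

  1. Step 1: split the English article and extract its words
  2. Step 2: count the occurrences of each word
  3. Step 3: sort the words by count, from highest to lowest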

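Implementation steps

Step 1: split the article and extract its words

Call the txt.lower() method to convert all letters to lowercase, so that differences in capitalization in the original text do not distort the counts. To unify the separators, replace special characters and punctuation marks with spaces using txt.replace(), then extract the words with txt.split().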

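# Lowercase everything so that "The" and "the" count as the same word
txt=txt.lower()
# Replace commas, periods and newlines with spaces
for s in ',.\n ':
    txt=txt.replace(s,' ')
# Split on whitespace (note: the name list shadows the built-in type)
list=txt.split()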
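The loop above only handles commas, periods and newlines, so other marks such as "!" or ";" stay attached to the words. As a rough sketch of one way to cover all ASCII punctuation, the hypothetical helper tokenize below (not part of the original article) uses str.maketrans with string.punctuation. Note that this variant also splits hyphenated words such as "well-groomed", which the original code keeps whole.

```python
import string


def tokenize(text):
    """Lowercase text and split it into words, treating ASCII punctuation as separators."""
    # Build a table that maps every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    # Lowercase, replace punctuation, then split on any run of whitespace.
    return text.lower().translate(table).split()


sample = "Hello, world! Hello; Python?"
print(tokenize(sample))  # ['hello', 'world', 'hello', 'python']
```

Whether hyphens and apostrophes should act as separators depends on the text being analyzed, so the set of characters to replace may need adjusting.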

第二步:对每个单词进行计数

count={}
for word in list:
    if word in count:
        count[word] = count[word] + 1
    else:
        count[word] = 1

Alternatively, the same logic can be written more concisely as follows:


for word in list:
    count[word]=count.get(word,0)+1
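For comparison, the standard library's collections.Counter performs the same tally in one call. The sketch below uses a short made-up word list (words) rather than the article's text, just to show the behavior.

```python
from collections import Counter

# A small illustrative word list; in the article this would be the result of txt.split().
words = ['the', 'job', 'the', 'interview', 'job', 'the']

# Counter is a dict subclass that counts hashable items.
count = Counter(words)

print(count['the'])      # 3
print(count['job'])      # 2
print(count['missing'])  # 0, absent keys read as zero instead of raising KeyError
```

Because Counter is a dict subclass, count.items() still works, so the sorting step that follows does not need to change.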

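Step 3: sort the words by count, from highest to lowest
A dictionary is not ordered by its values, so its items are first converted into a list of (word, count) pairs, which can be ordered. The sorted() function, together with a lambda that selects the count, then sorts those pairs by how often each word occurs.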

sort=sorted(count.items(), key=lambda item:item[1],reverse=True)
print(sort)
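Words with the same count keep their first-seen order in the sort above, because Python's sort is stable. If a deterministic alphabetical order among ties is preferred, one possible variation (an assumption about what a reader might want, not something the article requires) is to sort on a tuple key and then slice off the top entries. The dictionary below is made-up sample data.

```python
# Made-up counts for illustration.
count = {'the': 3, 'job': 2, 'a': 2, 'interview': 1}

# Sort by count descending (negated count), then by word ascending to break ties.
ranked = sorted(count.items(), key=lambda item: (-item[1], item[0]))

# Keep only the three most frequent words.
top = ranked[:3]
print(top)  # [('the', 3), ('a', 2), ('job', 2)]
```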

Example: counting word frequencies in a CET-6 (College English Test Band 6) essay

txt='''To be successful in a job interview or in almost any interview situation, the applicants houlddemonstrate certain personal and professional qualities.

  Most likely, the first and often a lasting impression of a person is determined by the clotheshe wears. The job applicant should take care to appear well-groomed and modestly dressed, avoiding the extremes of too pompous or too casual attire .

  Besides care for personal appearance, he should pay close attention to his manner of speaking, which should be neither ostentatious nor familiar but rather straight forward, grammaticallyaccurate, and in a friendly way.

  In addition, he should be prepared to talk knowledgeably about the requirements of theposition, for which he is applying in relation to his own professional experience and interests.

  And finally, the really impressive applicant must convey a sense of self-confidence andenthusiasm for work, as these are factors all interviewers value highly.

  If the job seeker displays the above-mentioned characteristics, he, with a little luck, willcertainly succeed in the typical personnel interview.'''
for s in ',.\n ':
    txt=txt.replace(s,' ')
txt=txt.lower()
list=txt.split()
print(list)

count=dict()
for i in list:
    count[i]=count.get(i,0)+1
print(count)
sort=sorted(count.items(), key=lambda item:item[1],reverse=True)
print(sort)

The sorted list printed by the last statement is as follows:

[('the', 10), ('in', 6), ('a', 6), ('and', 6), ('to', 5), ('of', 5), ('should', 4), ('he', 4), ('be', 3), ('job', 3), ('interview', 3), ('for', 3), ('or', 2), ('personal', 2), ('professional', 2), ('is', 2), ('applicant', 2), ('care', 2), ('too', 2), ('his', 2), ('which', 2), ('successful', 1), ('almost', 1), ('any', 1), ('situation', 1), ('applicants', 1), ('houlddemonstrate', 1), ('certain', 1), ('qualities', 1), ('most', 1), ('likely', 1), ('first', 1), ('often', 1), ('lasting', 1), ('impression', 1), ('person', 1), ('determined', 1), ('by', 1), ('clotheshe', 1), ('wears', 1), ('take', 1), ('appear', 1), ('well-groomed', 1), ('modestly', 1), ('dressed', 1), ('avoiding', 1), ('extremes', 1), ('pompous', 1), ('casual', 1), ('attire', 1), ('besides', 1), ('appearance', 1), ('pay', 1), ('close', 1), ('attention', 1), ('manner', 1), ('speaking', 1), ('neither', 1), ('ostentatious', 1), ('nor', 1), ('familiar', 1), ('but', 1), ('rather', 1), ('straight', 1), ('forward', 1), ('grammaticallyaccurate', 1), ('friendly', 1), ('way', 1), ('addition', 1), ('prepared', 1), ('talk', 1), ('knowledgeably', 1), ('about', 1), ('requirements', 1), ('theposition', 1), ('applying', 1), ('relation', 1), ('own', 1), ('experience', 1), ('interests', 1), ('finally', 1), ('really', 1), ('impressive', 1), ('must', 1), ('convey', 1), ('sense', 1), ('self-confidence', 1), ('andenthusiasm', 1), ('work', 1), ('as', 1), ('these', 1), ('are', 1), ('factors', 1), ('all', 1), ('interviewers', 1), ('value', 1), ('highly', 1), ('if', 1), ('seeker', 1), ('displays', 1), ('above-mentioned', 1), ('characteristics', 1), ('with', 1), ('little', 1), ('luck', 1), ('willcertainly', 1), ('succeed', 1), ('typical', 1), ('personnel', 1)]
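The full list is long, and usually only the first few entries matter. As a small sketch, the hypothetical helper format_top below (not part of the original code) turns the first n pairs into aligned lines. The sample pairs are the first five entries of the result above.

```python
def format_top(pairs, n=10):
    """Return the first n (word, count) pairs as aligned text lines."""
    # Left-align the word in 15 columns, right-align the count in 5 columns.
    return [f"{word:<15}{freq:>5}" for word, freq in pairs[:n]]


# The first five entries of the sorted result shown above.
sort = [('the', 10), ('in', 6), ('a', 6), ('and', 6), ('to', 5)]

for line in format_top(sort, n=3):
    print(line)
```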