Word Frequency Counting in Python

This article shows how to count word frequencies in English and Chinese text with Python: removing noise, tokenizing, and tallying with a dictionary. It walks through the process and ends with an optimized version of the code that excludes irrelevant words.

Requirement: given an article, which words appear in it, and which appear most often?

English Text Word Frequency

Counting English word frequencies takes two steps:

Denoise and normalize the text

Tally word frequencies with a dictionary

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

the 1138
and 965
to 754
of 669
you 550
i 542
a 542
my 514
hamlet 462
in 436
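The dictionary-and-sort pattern above is also available ready-made in the standard library as `collections.Counter`. A minimal sketch of the same normalize-split-count pipeline, applied to an inline sample string instead of `hamlet.txt` (the sample text and function name are illustrative, not from the article):

```python
from collections import Counter

def top_words(text, n=3):
    """Normalize text and return the n most common words with counts."""
    text = text.lower()
    # same punctuation-to-space replacement as CalHamletV1.py
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        text = text.replace(ch, " ")
    return Counter(text.split()).most_common(n)

print(top_words("To be, or not to be: that is the question."))
# → [('to', 2), ('be', 2), ('or', 1)]
```

`Counter.most_common(n)` replaces the manual `list(counts.items())` plus `sort` step, ordering ties by first appearance.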

Chinese Text Word Frequency

Counting Chinese word frequencies takes two steps:

Segment the Chinese text

Tally word frequencies with a dictionary

#CalThreeKingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)  # segment the Chinese text with jieba
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

曹操 953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358

Clearly, some of these entries are irrelevant, and some refer to the same person under different names.

Optimized Version

The optimized version takes three steps:

Segment the Chinese text

Tally word frequencies with a dictionary

Extend the program to fix the problems above

Irrelevant or duplicate entries are collected in an excludes set and removed from the result.

#CalThreeKingdomsV2.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1  # merge aliases of the same person
for word in excludes:
    del counts[word]  # drop irrelevant words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
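The elif chain above can also be written as a lookup table, which is easier to extend when more aliases are found. A sketch using the same aliases and excludes as the article; the `count_words` helper and the sample word list are ours:

```python
# Alias merging with a dict instead of an elif chain.
aliases = {"诸葛亮": "孔明", "孔明曰": "孔明",
           "关公": "关羽", "云长": "关羽",
           "玄德": "刘备", "玄德曰": "刘备",
           "孟德": "曹操", "丞相": "曹操"}
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_words(words):
    """Tally segmented words, merging aliases and skipping excludes."""
    counts = {}
    for word in words:
        if len(word) == 1 or word in excludes:
            continue  # filter early instead of deleting afterwards
        rword = aliases.get(word, word)  # map each alias to a canonical name
        counts[rword] = counts.get(rword, 0) + 1
    return counts

print(count_words(["玄德", "曰", "孔明曰", "关公", "玄德曰", "却说"]))
# → {'刘备': 2, '孔明': 1, '关羽': 1}
```

Filtering excludes inside the loop also avoids the `KeyError` that `del counts[word]` would raise if an excluded word never appeared in the text.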

Exam-English Word Frequency

Applying word frequency counting to the English texts of China's postgraduate entrance exam, we can extract the keywords that appear most often.

Text file: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA  password: fw3r

# Adapted from CalHamletV1.py
def getText():
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

pyTxt = getText()  # text with all punctuation stripped
words = pyTxt.split()  # split into words
counts = {}  # dictionary of word -> count
excludes = {"the","a","of","to","and","in","b","c","d","is",
            "was","are","have","were","had","that","for","it",
            "on","be","as","with","by","not","their","they",
            "from","more","but","or","you","at","has","we","an",
            "this","can","which","will","your","one","he","his","all","people","should","than","points","there","i","what","about","new","if","”",
            "its","been","part","so","who","would","answer","some","our","may","most","do","when","1","text","section","2","many","time","into",
            "10","no","other","up","following","【答案】","only","out","each","much","them","such","world","these","sheet","life","how","because","3","even",
            "work","directions","use","could","now","first","make","years","way","20","those","over","also","best","two","well","15","us","write","4","5","being","social","read","like","according","just","take","paragraph","any","english","good","after","own","year","must","american","less","her","between","then","children","before","very","human","long","while","often","my","too",
            "40","four","research","author","questions","still","last","business","education","need","information","public","says","passage","reading","through","women","she","health","example","help","get","different","him","mark","might","off","job","30","writing","choose","words","economic","become","science","society","without","made","high","students","few","better","since","6","rather","however","great","where","culture","come",
            "both","three","same","government","old","find","number","means","study","put","8","change","does","today","think","future","school","yet","man","things","far","line","7","13","50","used","states","down","12","14","16","end","11","making","9","another","young","system","important","letter","17","chinese","every","see","s","test","word","century","language","little",
            "give","said","25","state","problems","sentence","food","translation","given","child","18","longer","question","back","don’t","19","against","always","answers","know","having","among","instead","comprehension","large","35","want","likely","keep","family","go","why","41","home","law","place","look","day","men","22","26","45","it’s","others","companies","countries","once","money","24","though",
            "27","29","31","say","national","ii","23","based","found","28","32","past","living","university","scientific","–","36","38","working","around","data","right","21","jobs","33","34","possible","feel","process","effect","growth","probably","seems","fact","below","37","39","history","technology","never","sentences","47","true","scientists","power","thought","during","48","early","parents",
            "something","market","times","46","certain","whether","000","did","enough","problem","least","federal","age","idea","learn","common","political","pay","view","going","attention","happiness","moral","show","live","until","52","49","ago","percent","stress","43","44","42","meaning","51","e","iii","u","60","anything","53","55","cultural","nothing","short","100","water","car","56","58","【解析】","54","59","57","v","。","63","64","65","61","62","66","70","75","f","【考点分析】","67","here","68","71","72","69","73","74","选项a","ourselves","teachers","helps","参考范文","gdp","yourself","gone","150"}

for word in words:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
x = len(counts)
print(x)  # number of distinct words kept

r = 0
next = eval(input("Enter 1 to continue: "))
while next == 1:
    for i in range(r, r + 100):  # print the next batch of 100 words
        word, count = items[i]
        print("\"{}\"".format(word), end=",")
    r += 100  # advance to the next batch
    next = eval(input("Enter 1 to continue: "))
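The interactive loop above pages through the ranked words 100 at a time so that new stop words can be copied into `excludes`. The same idea can be written as a small batching helper; this is a sketch under our own names (`batches` and the sample list are not from the article):

```python
def batches(items, size):
    """Yield successive slices of `size` items; the last may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ranked = ["alpha", "beta", "gamma", "delta", "epsilon"]
for batch in batches(ranked, 2):
    print(batch)
# → ['alpha', 'beta']
# → ['gamma', 'delta']
# → ['epsilon']
```

A generator like this avoids keeping a manual `r` counter in sync with the print loop, and naturally stops at the end of the list instead of raising `IndexError` on the final partial batch.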
