Word Frequency Counting in Python

This article shows how to count word frequencies in English and Chinese text with Python: removing noise, tokenizing, and tallying with a dictionary. It walks through the process and ends with an optimized version of the code that excludes irrelevant words.

Requirement: given an article, which words appear in it, and which appear most often?

English Text Word Frequency

Counting English word frequencies takes two steps:

Denoise and normalize the text

Tally word frequencies with a dictionary

#CalHamletV1.py
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

the 1138
and 965
to 754
of 669
you 550
i 542
a 542
my 514
hamlet 462
in 436
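The dictionary-and-sort pattern above is also available ready-made in the standard library as `collections.Counter`. A minimal sketch of the same normalize-split-count pipeline, applied to an inline sample string instead of `hamlet.txt` (the sample text and function name are illustrative, not from the article):

```python
from collections import Counter

def top_words(text, n=3):
    """Normalize text and return the n most common words with counts."""
    text = text.lower()
    # same punctuation-to-space replacement as CalHamletV1.py
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        text = text.replace(ch, " ")
    return Counter(text.split()).most_common(n)

print(top_words("To be, or not to be: that is the question."))
# → [('to', 2), ('be', 2), ('or', 1)]
```

`Counter.most_common(n)` replaces the manual `list(counts.items())` plus `sort` step, ordering ties by first appearance.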

Chinese Text Word Frequency

Counting Chinese word frequencies takes two steps:

Segment the Chinese text

Tally word frequencies with a dictionary

#CalThreeKingdomsV1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)  # segment the Chinese text with jieba
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Output:

曹操 953
孔明 836
将军 772
却说 656
玄德 585
关公 510
丞相 491
二人 469
不可 440
荆州 425
玄德曰 390
孔明曰 390
不能 384
如此 378
张飞 358

Clearly, some of these entries are irrelevant, and some refer to the same person under different names.

Optimized Version

The optimized version takes three steps:

Segment the Chinese text

Tally word frequencies with a dictionary

Extend the program to fix the problems above

Irrelevant or duplicate entries are collected in an excludes set and removed from the result.

#CalThreeKingdomsV2.py
import jieba

excludes = {"将军","却说","荆州","二人","不可","不能","如此"}
txt = open("threekingdoms.txt", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1  # merge aliases of the same person
for word in excludes:
    del counts[word]  # drop irrelevant words
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
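The elif chain above can also be written as a lookup table, which is easier to extend when more aliases are found. A sketch using the same aliases and excludes as the article; the `count_words` helper and the sample word list are ours:

```python
# Alias merging with a dict instead of an elif chain.
aliases = {"诸葛亮": "孔明", "孔明曰": "孔明",
           "关公": "关羽", "云长": "关羽",
           "玄德": "刘备", "玄德曰": "刘备",
           "孟德": "曹操", "丞相": "曹操"}
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此"}

def count_words(words):
    """Tally segmented words, merging aliases and skipping excludes."""
    counts = {}
    for word in words:
        if len(word) == 1 or word in excludes:
            continue  # filter early instead of deleting afterwards
        rword = aliases.get(word, word)  # map each alias to a canonical name
        counts[rword] = counts.get(rword, 0) + 1
    return counts

print(count_words(["玄德", "曰", "孔明曰", "关公", "玄德曰", "却说"]))
# → {'刘备': 2, '孔明': 1, '关羽': 1}
```

Filtering excludes inside the loop also avoids the `KeyError` that `del counts[word]` would raise if an excluded word never appeared in the text.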

Exam-English Word Frequency

Applying word frequency counting to the English texts of China's postgraduate entrance exam, we can extract the keywords that appear most often.

Text file: https://pan.baidu.com/s/1Q6uVy-wWBpQ0VHvNI_DQxA  password: fw3r

# Adapted from CalHamletV1.py
def getText():
    txt = open("86_17_1_2.txt", "r").read()
    txt = txt.lower()
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_‘{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

pyTxt = getText()  # text with all punctuation stripped
words = pyTxt.split()  # split into words
counts = {}  # dictionary of word -> count
excludes = {"the","a","of","to","and","in","b","c","d","is",
            "was","are","have","were","had","that","for","it",
            "on","be","as","with","by","not","their","they",
            "from","more","but","or","you","at","has","we","an",
            "this","can","which","will","your","one","he","his","all","people","should","than","points","there","i","what","about","new","if","”",
            "its","been","part","so","who","would","answer","some","our","may","most","do","when","1","text","section","2","many","time","into",
            "10","no","other","up","following","【答案】","only","out","each","much","them","such","world","these","sheet","life","how","because","3","even",
            "work","directions","use","could","now","first","make","years","way","20","those","over","also","best","two","well","15","us","write","4","5","being","social","read","like","according","just","take","paragraph","any","english","good","after","own","year","must","american","less","her","between","then","children","before","very","human","long","while","often","my","too",
            "40","four","research","author","questions","still","last","business","education","need","information","public","says","passage","reading","through","women","she","health","example","help","get","different","him","mark","might","off","job","30","writing","choose","words","economic","become","science","society","without","made","high","students","few","better","since","6","rather","however","great","where","culture","come",
            "both","three","same","government","old","find","number","means","study","put","8","change","does","today","think","future","school","yet","man","things","far","line","7","13","50","used","states","down","12","14","16","end","11","making","9","another","young","system","important","letter","17","chinese","every","see","s","test","word","century","language","little",
            "give","said","25","state","problems","sentence","food","translation","given","child","18","longer","question","back","don’t","19","against","always","answers","know","having","among","instead","comprehension","large","35","want","likely","keep","family","go","why","41","home","law","place","look","day","men","22","26","45","it’s","others","companies","countries","once","money","24","though",
            "27","29","31","say","national","ii","23","based","found","28","32","past","living","university","scientific","–","36","38","working","around","data","right","21","jobs","33","34","possible","feel","process","effect","growth","probably","seems","fact","below","37","39","history","technology","never","sentences","47","true","scientists","power","thought","during","48","early","parents",
            "something","market","times","46","certain","whether","000","did","enough","problem","least","federal","age","idea","learn","common","political","pay","view","going","attention","happiness","moral","show","live","until","52","49","ago","percent","stress","43","44","42","meaning","51","e","iii","u","60","anything","53","55","cultural","nothing","short","100","water","car","56","58","【解析】","54","59","57","v","。","63","64","65","61","62","66","70","75","f","【考点分析】","67","here","68","71","72","69","73","74","选项a","ourselves","teachers","helps","参考范文","gdp","yourself","gone","150"}

for word in words:
    if word not in excludes:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
x = len(counts)
print(x)  # number of distinct words kept

r = 0
next = eval(input("Enter 1 to continue: "))
while next == 1:
    for i in range(r, r + 100):  # print the next batch of 100 words
        word, count = items[i]
        print("\"{}\"".format(word), end=",")
    r += 100  # advance to the next batch
    next = eval(input("Enter 1 to continue: "))
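The interactive loop above pages through the ranked words 100 at a time so that new stop words can be copied into `excludes`. The same idea can be written as a small batching helper; this is a sketch under our own names (`batches` and the sample list are not from the article):

```python
def batches(items, size):
    """Yield successive slices of `size` items; the last may be shorter."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

ranked = ["alpha", "beta", "gamma", "delta", "epsilon"]
for batch in batches(ranked, 2):
    print(batch)
# → ['alpha', 'beta']
# → ['gamma', 'delta']
# → ['epsilon']
```

A generator like this avoids keeping a manual `r` counter in sync with the print loop, and naturally stops at the end of the list instead of raising `IndexError` on the final partial batch.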
