Stage Assignment 1: Complete Chinese and English Word Frequency Statistics


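Steps:

1. Prepare a UTF-8 encoded text file (file)

2. Read the file into a string (str)

3. Preprocess the text

4. Split the text to extract the words (list)

5. Count the words in a dictionary (set, dict)

6. Sort by frequency with list.sort(key=)

7. Exclude grammatical words: pronouns, articles, conjunctions, and other words that carry no meaning of their own

8. Output the TOP (20)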
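
The eight steps above can be sketched end to end as follows. This is a minimal sketch, not the assignment's reference solution: the names `top_words` and `top_words_from_file`, the regular expression, and the tiny stop-word set are assumptions chosen for illustration, and `collections.Counter` stands in for the hand-built dictionary used later in this post.

```python
import re
from collections import Counter

# Hypothetical stop-word set for illustration; a real list would be longer.
STOP_WORDS = {'a', 'the', 'and', 'i', 'it', 'to', 'of', 'in'}

def top_words(text, stop_words=STOP_WORDS, n=20):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Step 3: normalise case so 'The' and 'the' are counted together.
    text = text.lower()
    # Step 4: extract words; letters plus inner apostrophes (e.g. "i'm").
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text)
    # Steps 5 and 7: count, skipping stop words.
    counts = Counter(w for w in words if w not in stop_words)
    # Steps 6 and 8: most_common sorts by count, descending.
    return counts.most_common(n)

def top_words_from_file(path, n=20):
    # Steps 1 and 2: read a UTF-8 text file into one string.
    with open(path, encoding='utf-8') as f:
        return top_words(f.read(), n=n)

if __name__ == '__main__':
    sample = "I will run, I will climb, I will soar"
    print(top_words(sample, n=3))
```

Keeping the file reading in a separate function means the counting logic can be tried on a short string before it is pointed at a real file.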

1. English song: word frequency statistics

str2='''I will run, I will climb, I will soar
I'm undefeated
Jumping out of my skin, pull the chord
Yeah I believe it
The past, is everything we were
don't make us who we are
So I'll dream, until I make it real,
and all I see is stars
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
possible
And we'll dream it possible
I will chase, I will reach, I will fly
Until I'm breaking, until I'm breaking
Out of my cage, like a bird in the night
I know I'm changing, I know I'm changing
In, into something big, better than before
And if it takes, takes a thousand lives
Then it's worth fighting for
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
it possible
From the bottom to the top
We're sparking wild fire's
Never quit and never stop
The rest of our lives
From the bottom to the top
We're sparking wild fire's
Never quit and never stop
Its not until you fall that you fly
When your dreams come alive you're unstoppable
Take a shot, chase the sun, find the beautiful
We will glow in the dark turning dust to gold
And we'll dream it possible
possible
And we'll dream it possible'''.lower()
# Replace punctuation with spaces; apostrophes are kept so contractions such as "i'm" stay whole
for ch in ',."?!':
    str2 = str2.replace(ch, ' ')
str2 = str2.replace('\n', ' ')
print(str2)  # text with special symbols removed

str2 = str2.strip()  # remove leading and trailing whitespace
str2 = str2.split()  # split on whitespace into a list of words
print(str2)

print('Number of occurrences of each word:')
for word in sorted(set(str2)):  # visit each distinct word once
    print(word, str2.count(word))

strSet=set(str2)
newSet={'a','will','it','out','of','my','the','i','in','to','when','and'}
strSet1 = strSet - newSet  # remove prepositions and other function words
print(strSet1)


strdict = {}  # word-count dictionary
for word in strSet1:
    strdict[word] = str2.count(word)
print(len(strdict),strdict)

strList = list(strdict.items())
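def takesecond(elem):  # key function: return the count from a (word, count) pair
    return elem[1]
#strList.sort(key=lambda x:x[1],reverse=True)  # equivalent, using an anonymous function
strList.sort(key=takesecond, reverse=True)  # sort by count, descending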
print(strList)


for item in strList[:20]:
    print(item)  # top 20
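
Because `strSet1` is a set, words with equal counts can come out in a different order from one run to the next, and `str2.count(word)` rescans the whole list for every word. A possible alternative, sketched here under the assumption that ties should be broken alphabetically, counts in a single pass and sorts on a two-part key. The function name `rank_words` and the sample sentence are hypothetical.

```python
def rank_words(words, exclude=frozenset(), n=20):
    """Count words in one pass and return the top n (word, count) pairs."""
    counts = {}
    for word in words:
        if word in exclude:
            continue
        counts[word] = counts.get(word, 0) + 1
    # Sort by count descending, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Slicing never raises IndexError, even with fewer than n words.
    return ranked[:n]

# Hypothetical sample input, already lower-cased and split.
words = "we will dream it possible and we will glow".split()
for word, count in rank_words(words, exclude={'it', 'and'}, n=3):
    print(word, count)
```

Negating the count in the key lets one ascending sort produce descending counts with ascending words, which a single `reverse=True` cannot do.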

  

2. Chinese novel: word frequency statistics

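import jieba

# Read the whole novel as a UTF-8 string
with open('《活着》.txt', 'r', encoding='utf-8') as f:
    life = f.read()

lifelist = list(jieba.cut(life))  # segment the text into words
lifedict = {}
for word in lifelist:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    lifedict[word] = lifedict.get(word, 0) + 1

wordlist = list(lifedict.items())
wordlist.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending

for item in wordlist[:15]:  # print the top 15
    print(item)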
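
Step 7 of the list at the top (excluding function words) is not applied in the Chinese version. One way to add it, sketched under the assumption that the tokens come from `jieba.cut`, is to filter against a stop-word set while counting. The function `count_tokens` and the three-word stop list are hypothetical, and the token list is hard-coded with made-up values so the sketch runs without jieba or the novel's text.

```python
from collections import Counter

# Hypothetical stop words; a real list would be loaded from a file.
STOP_WORDS = {'一个', '没有', '自己'}

def count_tokens(tokens, stop_words=STOP_WORDS, n=15):
    """Count multi-character tokens, skipping stop words and whitespace."""
    counts = Counter(
        t for t in tokens
        if len(t.strip()) > 1 and t not in stop_words
    )
    return counts.most_common(n)

# Stand-in for list(jieba.cut(life)); the tokens here are made up.
tokens = ['福贵', '一个', '人', '福贵', '家珍', '，', '没有', '家珍', '福贵']
print(count_tokens(tokens))
```

Stripping each token before measuring its length also drops runs of whitespace such as consecutive newlines, which a plain `len(word) == 1` check would let through.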

  

 

Reposted from: https://www.cnblogs.com/1998hxw/p/9794699.html
