1. English word frequency count
Download the lyrics of an English song, or an English article.
article = '''An empty street
An empty house
A hole inside my heart
I'm all alone
The rooms are getting smaller
I wonder how
I wonder why
I wonder where they are
The days we had
The songs we sang together
Oh yeah
And oh my love
I'm holding on forever
Reaching for a love that seems so far
So i say a little prayer
And hope my dreams will take me there
Where the skies are blue to see you once again, my love
Over seas and coast to coast
To find a place i love the most
Where the fields are green to see you once again, my love
I try to read
I go to work
I'm laughing with my friends
But i can't stop to keep myself from thinking
Oh no I wonder how
I wonder why
I wonder where they are
The days we had
The songs we sang together
Oh yeah And oh my love
I'm holding on forever
Reaching for a love that seems so far Mark:
To hold you in my arms
To promise you my love
To tell you from the heart
You're all i'm thinking of
I'm reaching for a love that seems so far
So i say a little prayer
And hope my dreams will take me there
Where the skies are blue to see you once again, my love
Over seas and coast to coast
To find a place i love the most
Where the fields are green to see you once again,my love
say a little prayer
dreams will take me there
Where the skies are blue to see you once again '''
Replace all separators such as , . ? ! ' : with spaces.
sep = ''':.,?!'''
for i in sep:
    article = article.replace(i, ' ')
Convert all uppercase letters to lowercase.
article = article.lower()
Generate the word list.
article_list = article.split()
print(article_list)
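As an aside, the replace loop can be condensed with `str.translate`, which maps every separator to a space in a single pass. A sketch of the same cleanup on a short sample line (the sample string is illustrative, not from the lyrics above):

```python
# Build a table mapping each separator character to a space,
# then clean, lowercase, and split in one chain.
sep = ''':.,?!'''
table = str.maketrans(sep, ' ' * len(sep))

sample = "An empty street, an empty house."
words = sample.translate(table).lower().split()
print(words)  # ['an', 'empty', 'street', 'an', 'empty', 'house']
```

`split()` with no argument collapses the runs of spaces left behind by the substitution, so no extra cleanup is needed.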
Generate the word frequency count.
# # Method ①: count by iterating over the set
# article_dict = {}
# article_set = set(article_list) - exclude  # remove duplicates (exclude must already be defined)
# for w in article_set:
#     article_dict[w] = article_list.count(w)
# # iterate over the dict
# for w in article_dict:
#     print(w, article_dict[w])
# Method ②: iterate over the list
article_dict = {}
for w in article_list:
    article_dict[w] = article_dict.get(w, 0) + 1
Exclude grammatical words: pronouns, articles, conjunctions (this must happen before sorting, so they never reach the TOP 20).
exclude = {'the', 'to', 'is', 'and'}
# exclude the unwanted words
for w in exclude:
    article_dict.pop(w, None)  # pop() avoids a KeyError if a word is absent
for w in article_dict:
    print(w, article_dict[w])
Sort in descending order of frequency.
dictList = list(article_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
Print the TOP 20 most frequent words.
for i in range(20):
    print(dictList[i])
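The count, exclude, and sort steps above can also be done with `collections.Counter` from the standard library. An alternative sketch (the short `words` list here is illustrative):

```python
from collections import Counter

words = ['the', 'days', 'we', 'had', 'the', 'songs', 'we', 'sang']
exclude = {'the', 'to', 'is', 'and'}

# Filtering before counting means no keys need deleting afterwards;
# most_common(20) replaces the manual sort-and-slice.
freq = Counter(w for w in words if w not in exclude)
print(freq.most_common(20))
```

`most_common(n)` returns the n highest-count (word, count) pairs already sorted in descending order.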
Save the text to be analysed as a UTF-8 encoded file, and obtain the content for the frequency analysis by reading the file.
file = open("test.txt", "r", encoding='utf-8')
article = file.read()
file.close()
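The open/read/close sequence is usually written with a `with` block, which closes the file automatically even if reading raises an exception. A self-contained sketch (it first writes a temporary stand-in for test.txt so it runs anywhere):

```python
import os
import tempfile

# stand-in path for test.txt, so the sketch is self-contained
path = os.path.join(tempfile.gettempdir(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("An empty street")

# the file is closed automatically when the with block ends
with open(path, "r", encoding="utf-8") as file:
    article = file.read()
print(article)  # An empty street
```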
2. Chinese word frequency count
Download a long Chinese article.
Read the text to be analysed from the file (read() is needed so jieba receives a string, not a file object).
news = open('gzccnews.txt', 'r', encoding='utf-8').read()
Install and use jieba for Chinese word segmentation.
pip install jieba
import jieba
jieba.lcut(news)
Generate the word frequency count.
Sort.
Exclude grammatical words: pronouns, articles, conjunctions.
Print the TOP 20 most frequent words (or write the results to a file).
import jieba
# open the file
file = open("gzccnews.txt", 'r', encoding="utf-8")
notes = file.read()
file.close()
# replace punctuation with spaces
sep = ''':。,?!;∶ ...“”'''
for i in sep:
    notes = notes.replace(i, ' ')
notes_list = jieba.lcut(notes)  # lcut() returns a list directly
# words to exclude
exclude = [' ', '\n', '我', '你', '边', '上', '说', '了', '的', '那', '些', '什', '么', '话', '呢']
# Method ②: iterate over the list
notes_dict = {}
for w in notes_list:
    notes_dict[w] = notes_dict.get(w, 0) + 1
# exclude the unwanted words
for w in exclude:
    notes_dict.pop(w, None)  # pop() avoids a KeyError if a word is absent
for w in notes_dict:
    print(w, notes_dict[w])
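Hard-coding the excluded words works for a single article, but a common refinement is loading them from a stopword file, one word per line. A hedged sketch (the file name and its contents are illustrative, not from the original notes):

```python
import os
import tempfile

def load_stopwords(path):
    # one stopword per line; strip() drops the trailing newline
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

# write an illustrative stopword file so the sketch is self-contained
path = os.path.join(tempfile.gettempdir(), "stopwords.txt")
with open(path, 'w', encoding='utf-8') as f:
    f.write("我\n你\n的\n了\n")

stopwords = load_stopwords(path)
print(len(stopwords))  # 4
```

Returning a set makes the later `w not in stopwords` membership tests O(1).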
# sort in descending order
dictList = list(notes_dict.items())
dictList.sort(key=lambda x: x[1], reverse=True)
print(dictList)
# print the TOP 20 most frequent words
for i in range(20):
    print(dictList[i])
# write the results to a file (specify utf-8 so the Chinese words are written correctly)
outfile = open("top20.txt", "a", encoding="utf-8")
for i in range(20):
    outfile.write(dictList[i][0] + " " + str(dictList[i][1]) + "\n")
outfile.close()
Post the code and screenshots of the results on the blog.