Word Frequency Analysis with Python
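Yesterday I came across a few lines of Python code for word frequency analysis, and it really showed me how powerful Python is. (Especially since I have been learning C lately and its syntax is driving me up the wall; Python has never been that demanding.)
The code is as follows: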
import re

def parse(text):
    # Replace punctuation and newlines with spaces using a regular expression
    text = re.sub(r'[^\w ]', ' ', text)
    text = text.lower()  # convert to lowercase
    word_list = text.split(' ')  # split into a list of words
    # Drop the empty strings left behind by consecutive spaces
    word_list = filter(None, word_list)
    # Build a dictionary mapping each word to its count
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_word_cnt

with open('in.txt', 'r') as fin:  # read the input file
    text = fin.read()

word_and_freq = parse(text)

with open('out.txt', 'w') as fout:  # write the results to a file
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))
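
As a side note, the same counting can probably be written more compactly with the standard library's collections.Counter. This is only a sketch of an alternative, not the approach used in the code above: the function name count_words and the sample sentence are made up for illustration, and it uses re.findall instead of re.sub plus split.

```python
import re
from collections import Counter

def count_words(text):
    # Hypothetical helper: lowercase the text, then collect every run of
    # word characters, which skips punctuation and whitespace in one step.
    words = re.findall(r'\w+', text.lower())
    # most_common() returns (word, count) pairs sorted by descending count;
    # words with equal counts keep the order in which they first appeared.
    return Counter(words).most_common()

# Made-up sample sentence, used only to show the shape of the result.
sample = "I have a dream that one day. I have a dream today."
print(count_words(sample)[:4])
# prints [('i', 2), ('have', 2), ('a', 2), ('dream', 2)]
```

The result has the same shape as the list returned by parse, so the file-writing loop would work with it unchanged.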
The text used for the analysis is as follows:
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be