Python Word Frequency Analysis
Yesterday I came across a few lines of code that do word frequency analysis in Python, and they really drove home how powerful Python is. (Especially since I have recently been learning C, whose syntax has nearly driven me crazy; Python never makes that many demands.)
The code is as follows:
import re

def parse(text):
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split on spaces into a list of words
    word_list = text.split(' ')
    # Drop the empty strings left by consecutive spaces
    word_list = filter(None, word_list)
    # Build a dictionary mapping each word to its frequency
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_word_cnt

# Read the input file
with open('in.txt', 'r') as fin:
    text = fin.read()

word_and_freq = parse(text)

# Write the results to the output file
with open('out.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))
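As a side note, the dictionary-building and sorting steps can be collapsed using the standard library's collections.Counter. Here is a minimal sketch of an equivalent parse (my own variant, not the original code). Because split() with no argument splits on any whitespace and discards empty strings, the filter step is no longer needed:

import re
from collections import Counter

def parse(text):
    # Same normalization as above: strip punctuation, lowercase, split
    words = re.sub(r'[^\w ]', ' ', text).lower().split()
    # most_common() returns (word, count) pairs sorted by descending count
    return Counter(words).most_common()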
The sample text I used for the analysis is as follows:
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight.
