Python Word Frequency Analysis
Yesterday I came across a few lines of code that do word frequency analysis in Python, and they really drove home how powerful Python is. (Especially since I have recently been learning C, whose syntax has nearly driven me crazy; Python never makes that many demands.)
The code is as follows:
import re

def parse(text):
    # Use a regular expression to replace punctuation and newlines with spaces
    text = re.sub(r'[^\w ]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Split on spaces into a list of words
    word_list = text.split(' ')
    # Drop the empty strings left by consecutive spaces
    word_list = filter(None, word_list)
    # Build a dictionary mapping each word to its frequency
    word_cnt = {}
    for word in word_list:
        if word not in word_cnt:
            word_cnt[word] = 0
        word_cnt[word] += 1
    # Sort by frequency, highest first
    sorted_word_cnt = sorted(word_cnt.items(), key=lambda kv: kv[1], reverse=True)
    return sorted_word_cnt

# Read the input file
with open('in.txt', 'r') as fin:
    text = fin.read()

word_and_freq = parse(text)

# Write the results to the output file
with open('out.txt', 'w') as fout:
    for word, freq in word_and_freq:
        fout.write('{} {}\n'.format(word, freq))
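As a side note, the dictionary-building and sorting steps can be collapsed using the standard library's collections.Counter. Here is a minimal sketch of an equivalent parse (my own variant, not the original code). Because split() with no argument splits on any whitespace and discards empty strings, the filter step is no longer needed:

import re
from collections import Counter

def parse(text):
    # Same normalization as above: strip punctuation, lowercase, split
    words = re.sub(r'[^\w ]', ' ', text).lower().split()
    # most_common() returns (word, count) pairs sorted by descending count
    return Counter(words).most_common()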
The sample text I used for the analysis is as follows:
I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today.
I have a dream that one day down in Alabama, with its vicious racists, . . . one day right there in Alabama little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers. I have a dream today.
I have a dream that one day every valley shall be exalted, every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight.
