Counting Chinese word frequencies in a Word file with Python in one step

m0_66450607

Last modified 2024-01-18 18:35:38


Category: Python practical tips. Tags: word, python, Chinese word segmentation

First published 2024-01-18 18:32:46

Article link: https://blog.csdn.net/m0_66450607/article/details/135681125


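When studying, what matters most is mastering the core concepts. Important concepts tend to appear again and again, so the most frequent words in a document you are studying are often its core concepts. Here is a brief summary of how to count word frequencies in a Word document: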

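# Use jieba for Chinese word segmentation, Counter from collections to count
# word frequencies, and pandas to write the result to an Excel file.
# Document comes from the python-docx package; writing .xlsx with pandas needs openpyxl.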
import jieba
from collections import Counter
import pandas as pd
from docx import Document

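# Read the Word file
def read_word(file_path):
    doc = Document(file_path)
    # doc.paragraphs covers body paragraphs only; text inside tables is not included.
    # Join with newlines so words from adjacent paragraphs are not glued together.
    return '\n'.join(para.text for para in doc.paragraphs)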

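# Segment the text and count word frequencies
def count_words(text):
    words = jieba.lcut(text)
    # jieba also returns whitespace and punctuation as tokens; keep only tokens
    # that contain at least one letter, digit, or Chinese character.
    words = [w for w in words if any(ch.isalnum() for ch in w)]
    word_counts = Counter(words)
    return word_counts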

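# Write the result to Excel, most frequent words first
def output_to_excel(word_counts, file_path):
    # most_common() returns (word, count) pairs sorted by count in descending order
    df = pd.DataFrame(word_counts.most_common(), columns=['Word', 'Count'])
    df.to_excel(file_path, index=False)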

# Main function
def main():
    file_path_input=r'C:\Users\simaxiaohu\Desktop\11.docx'
    file_path_output=r'C:\Users\simaxiaohu\Desktop\统计结果.xlsx'
    text = read_word(file_path_input)
    word_counts = count_words(text)
    output_to_excel(word_counts, file_path_output)

if __name__ == '__main__':
    main()

Change file_path_input (the source file to analyze) and file_path_output (the location and file name of the result) to match your own setup.
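In practice, the raw counts tend to be dominated by function words and single characters, which say little about the core concepts. The following is a minimal sketch of one way to filter them out before ranking. It is not part of the original script: the helper name top_words, the tiny STOPWORDS set, and the min_len threshold are all assumptions made for illustration, and the token list is hand-written so the example runs without jieba.

```python
from collections import Counter

# Hypothetical stop-word set for illustration only; a real list would be much longer.
STOPWORDS = {"的", "了", "是", "在", "和", "我们"}


def top_words(tokens, stopwords=STOPWORDS, min_len=2, n=20):
    """Return the n most frequent tokens after dropping stop words and short tokens."""
    kept = [t for t in tokens if len(t) >= min_len and t not in stopwords]
    return Counter(kept).most_common(n)


# Tokens written by hand, in the shape of a list that jieba.lcut returns.
tokens = ["我们", "机器", "学习", "的", "机器", "模型", "是", "机器", "学习", "，"]
result = top_words(tokens, n=2)
print(result)
```

In the script above, the same helper could be applied to the list returned by jieba.lcut(text) inside count_words, with the stop-word set loaded from whatever list suits your documents.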