Word-frequency statistics and data processing for provincial government work reports

黄黄黄黄黄66

Category column: 垃圾的心路历程. Tags: python, natural language processing, nlp

Article link: https://blog.csdn.net/m0_57011532/article/details/120521383

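Here is the background: I need to count how often energy- and environment-related words appear in each province's government work report for each year, as evidence of how much attention that province pays to energy and the environment.
The work reports are all txt files, and the first four characters of each file name must be the year digits (this makes the later statistics easier);
the file path, relative to the root directory, is './XX省/2020年工作报告.txt';
the txt files must be encoded as utf-8.
The output is an Excel file in the following format: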

province	2000	2001	…	2021
湖南省	frequency	…	…	…
河北省	frequency	…	…	…
XX省	…	…	…	…
…	…		…	…
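The file-naming rule above (the first four characters of the file name are the year) can be checked before any counting is done. The following is only a minimal sketch: the helper name year_from_filename and its default year range are assumptions made for illustration, not part of the original script.

```python
import re

# Hypothetical helper (not in the original post): read the year from a
# report file name whose first four characters are expected to be digits.
def year_from_filename(name, date_range=(2000, 2021)):
    match = re.match(r'[0-9]{4}', name)
    if match is None:
        return None  # the file name does not start with four digits
    year = int(match.group())
    first, last = date_range
    # Accept only years inside the closed interval [first, last]
    return year if first <= year <= last else None

print(year_from_filename('2020年工作报告.txt'))
print(year_from_filename('工作报告2020.txt'))
```

Returning None for a badly named file lets the caller skip or report it instead of failing on the integer conversion.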

The following packages are needed:

# build paths, used for traversing directories
import os
import os.path
# open and decode txt files
import codecs
# word segmentation
import jieba

import pandas as pd

# fuzzy matching
# see: https://blog.csdn.net/AlexTan_/article/details/107319603
import difflib

First, define a configuration class:

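# Parameter settings
class Config:
    def __init__(self):
        self.path = '.\\text\\'  # root directory (Windows-style separators)
        self.provinces = ['湖南省', '河北省']  # subdirectories under the root directory
        self.words = ['能源', '煤']  # words to count
        self.date_range = [2000, 2021]  # span of years covered, inclusive at both ends
        self.cutoff = 0.6  # fuzzy-match similarity threshold; the larger it is, the stricter the match; at 0.6, '能源' still matches '能源资源'

config = Config()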
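To see what the cutoff of 0.6 means in practice, the sketch below prints the similarity ratio that difflib computes between '能源' and a few sample tokens, then runs get_close_matches with the same cutoff. The token list is made up for illustration; only the standard-library difflib calls are used.

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(a, b):
    # Ratio in [0, 1]: 2 * (matched characters) / (total characters in both strings)
    return SequenceMatcher(None, a, b).ratio()

# Made-up sample tokens, standing in for the output of a word segmenter
tokens = ['能源', '能源资源', '新能源', '资源', '经济']

for token in tokens:
    print(token, round(similarity('能源', token), 3))

# n=len(tokens) asks for every token whose ratio reaches the cutoff
matches = get_close_matches('能源', tokens, n=len(tokens), cutoff=0.6)
print(matches)
```

Here '能源资源' scores 2*2/6, about 0.667, so it passes the 0.6 threshold, while '资源' scores 0.5 and is dropped. Raising the cutoff above 0.667 would exclude '能源资源' as well.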

The segmentation and counting function:

# Word-frequency statistics
def word_freq():
    # Create the DataFrame that stores the results: one row per province, one column per year
    years = list(range(config.date_range[0], config.date_range[1] + 1))
    df = pd.DataFrame(columns=years, index=range(len(config.provinces)))
    df.insert(0, 'province', config.provinces)

    # Iterate over the province directories under the root directory
    for i, province in enumerate(config.provinces):
        root_path = config.path + province

        # Walk through every file under root_path
        # root: path of the current folder
        # dirs: names of all subfolders in that folder
        # files: names of all files in that folder
        for root, dirs, files in os.walk(root_path):
            for name in files:
                num_words = 0
                filepath = os.path.join(root, name)
                print(filepath)
                # Open and read the txt file; the with block closes it afterwards
                with codecs.open(filepath, 'r', encoding='UTF-8') as f:
                    filecontent = f.read()
                seg = jieba.lcut(filecontent)  # segment the text into words
                # Count the words
                for word in config.words:
                    # Fuzzy matching; n must be positive, so guard against an empty file
                    select_word1 = difflib.get_close_matches(word, seg, max(len(seg), 1), cutoff=config.cutoff)
                    # Drop matches shorter than the target: a selected token must be at least as long as the target word
                    select_word2 = [w for w in select_word1 if len(w) >= len(word)]
                    # To inspect the difference:
                    # print(select_word1)
                    # print(select_word2)
                    # print(len(select_word1))
                    # print(len(select_word2))
                    num_words += len(select_word2)
                # The first four characters of the file name are the year
                df.loc[i, int(name[0:4])] = num_words  # fill in the frequency for this province and year
    # Save the DataFrame to Excel (ExcelWriter.save() was removed in pandas 2.0)
    df.to_excel('./words_freq.xlsx', index=False)
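The counting step inside the function can be tried on its own, without jieba or any files. The sketch below applies the same two stages, fuzzy matching followed by the length filter, to a hand-written token list. The function name count_fuzzy and the tokens are assumptions for illustration only.

```python
import difflib

# Hypothetical stand-alone version of the counting step (illustration only)
def count_fuzzy(word, tokens, cutoff=0.6):
    if not tokens:
        return 0  # get_close_matches requires n > 0
    close = difflib.get_close_matches(word, tokens, n=len(tokens), cutoff=cutoff)
    # Keep only tokens at least as long as the target word
    return sum(1 for token in close if len(token) >= len(word))

# Made-up tokens; a token that occurs twice is counted twice
tokens = ['发展', '新能源', '煤炭', '能源', '能源', '煤', '资源']

print(count_fuzzy('能源', tokens))      # '能源' twice plus '新能源'
print(count_fuzzy('煤', tokens))        # '煤' plus '煤炭'
print(count_fuzzy('能源资源', tokens))  # close matches exist but are all shorter
```

The last call shows why the length filter is there: '能源' and '资源' are both similar enough to '能源资源' to pass the cutoff, yet neither actually contains the longer target, so they are discarded.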

Run (in a Jupyter notebook, where the %%time cell magic reports the run time):

%%time
word_freq()
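Outside a notebook the %%time magic is not available. A rough equivalent for a plain script, assuming only the standard library, is sketched below. The helper timed and the stand-in workload are hypothetical; in practice word_freq would be passed in its place.

```python
import time

# Hypothetical helper: run a function and return its result and elapsed seconds
def timed(func, *args, **kwargs):
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs by itself; replace with word_freq
def stand_in():
    return sum(range(1000))

result, elapsed = timed(stand_in)
print(f'finished in {elapsed:.4f} s')
```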