Using pandas in Python to process Excel files and extract keywords

Latest recommended article published on 2023-09-23 14:29:14

EaSoNgo111

Tags: pandas python excel

Article link: https://blog.csdn.net/EaSoNgo111/article/details/129733636

import os

import pandas as pd

# Path of the folder to scan
folder_path = r'C:\Users\win10\Desktop\新建文件夹'
# Keywords to look for (matched case-insensitively, as plain substrings)
keywords_list = ['china', 'investment', 'trade', 'infrastructure', 'finance', 'debt', 'rule', 'sanction', 'International politics', 'military affairs', 'technology']

# Walk the folder and all of its subfolders
for root, dirs, files in os.walk(folder_path):
    for file in files:
        # Only process CSV files
        if file.endswith('.csv'):
            file_path = os.path.join(root, file)
            # Read the CSV file
            df = pd.read_csv(file_path)

            # Lowercase the article content and title so matching ignores case
            df['art_content'] = df['art_content'].str.lower()
            df['art_title'] = df['art_title'].str.lower()
            # Combine both text columns; missing values count as empty text
            text = df['art_content'].fillna('') + '\n' + df['art_title'].fillna('')
            # New type column: the list of keywords found in each row
            df['type'] = text.apply(lambda t: [k for k in keywords_list if k.lower() in t])

            # Drop the rows that matched no keyword
            df = df[df['type'].map(len) > 0]
            # Explode the lists so that each matched keyword gets its own row
            df = df.explode('type')
            # Reset the index
            df.reset_index(drop=True, inplace=True)
            # Save back to the same CSV file as UTF-8 (this overwrites the original)
            df.to_csv(file_path, encoding='utf-8', index=False)
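
To make the tagging step easier to follow, here is a minimal sketch on a tiny in-memory DataFrame instead of real files. The sample rows and the shortened keyword list are made up for illustration; only the column names art_title, art_content and type come from the script above.

```python
import pandas as pd

# Hypothetical sample data standing in for one CSV file
df = pd.DataFrame({
    'art_title': ['Trade and Debt', 'Weather report', None],
    'art_content': ['new investment rules', 'sunny all week', 'china trade talks'],
})
# Shortened keyword list, for illustration only
keywords = ['china', 'investment', 'trade', 'debt']

# Join title and content, treating missing values as empty text
text = (df['art_title'].fillna('') + '\n' + df['art_content'].fillna('')).str.lower()
# One list of matched keywords per row
df['type'] = text.apply(lambda t: [k for k in keywords if k in t])
# Drop rows without a match, then give each matched keyword its own row
tagged = df[df['type'].map(len) > 0].explode('type').reset_index(drop=True)

print(tagged[['art_title', 'type']])
```

The first row matches three keywords and so appears three times in the result, the second row matches nothing and is dropped, and the third row (with a missing title) appears twice.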

for root, dirs, files in os.walk(folder_path):

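os.walk is the function in Python's os module for traversing a directory tree: it visits the given directory and every subdirectory below it. For each directory it visits, it yields one 3-tuple (root, dirs, files), and each pass of the for loop receives one such tuple:

root: the path of the directory currently being visited (starting with the top directory itself).
dirs: the names of the subdirectories directly inside root (not those nested deeper).
files: the names of the files directly inside root (not those inside subdirectories).

Because dirs and files contain bare names only, call os.path.join(root, file) inside the loop to build each file's full path before operating on it.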
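
A small sketch of this traversal, run against a throwaway directory tree so that it does not depend on any real folder. The helper name find_csv_files and the file names are hypothetical.

```python
import os
import tempfile

def find_csv_files(folder):
    """Return the full paths of all .csv files under folder, sorted."""
    paths = []
    for root, dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith('.csv'):
                # files holds bare names, so join them with root
                paths.append(os.path.join(root, name))
    return sorted(paths)

# Build a small temporary tree: two CSV files (one in a subfolder) and one text file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ['a.csv', 'notes.txt', os.path.join('sub', 'b.csv')]:
        with open(os.path.join(tmp, rel), 'w', encoding='utf-8') as f:
            f.write('x\n')
    found = [os.path.relpath(p, tmp) for p in find_csv_files(tmp)]
    print(found)
```

The file in the subfolder is found as well, because os.walk descends into sub on its own.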

Note: when saving, always write the file as UTF-8.

Checking which encoding the file uses

import chardet
# Open one of the CSV files in binary mode and detect its encoding
with open('\\Users\\a\\Desktop\\428.csv', 'rb') as f:
    data = f.read()
print(chardet.detect(data))
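
If chardet is not installed, a rough alternative is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. This is only a sketch, not what chardet does internally: the helper name guess_encoding and the candidate list are assumptions, and the order matters, since a permissive encoding placed first would accept almost any bytes.

```python
def guess_encoding(data, candidates=('utf-8', 'gb18030')):
    """Return the first candidate that decodes data without error, else None."""
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None

# Hypothetical sample bytes: the same header line in two encodings
sample_utf8 = '标题,内容'.encode('utf-8')
sample_gb = '标题,内容'.encode('gb2312')

print(guess_encoding(sample_utf8))
print(guess_encoding(sample_gb))
```

gb18030 is used as the fallback candidate because it is a superset of GB2312, so it also decodes files saved as GB2312 or GBK.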

Converting the encoding

import csv
# Read the original CSV file, decoding it as GB2312
with open(r'D:\bruegel_art_info.csv', 'r', encoding='GB2312', newline='') as f:
    reader = csv.reader(f)
    data = [row for row in reader]
# Write the rows back to the same path as UTF-8 (this overwrites the original file)
with open(r'D:\bruegel_art_info.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(data)
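
A more cautious variant writes the converted text to a separate file, so the original survives if the source encoding was guessed wrong. The sketch below assumes a hypothetical helper convert_encoding and uses temporary files with made-up content in place of the real CSV.

```python
import os
import tempfile

def convert_encoding(src, dst, src_encoding, dst_encoding='utf-8'):
    """Copy a text file from src to dst, re-encoding it line by line."""
    # newline='' keeps the original line endings untouched
    with open(src, 'r', encoding=src_encoding, newline='') as fin, \
         open(dst, 'w', encoding=dst_encoding, newline='') as fout:
        for line in fin:
            fout.write(line)

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in_gb2312.csv')
    dst = os.path.join(tmp, 'out_utf8.csv')
    # Create a small GB2312-encoded sample file
    with open(src, 'w', encoding='gb2312', newline='') as f:
        f.write('标题,内容\r\n中国,贸易\r\n')
    convert_encoding(src, dst, 'gb2312')
    # Read the result back as raw bytes and decode it as UTF-8
    with open(dst, 'rb') as f:
        result = f.read().decode('utf-8')
    print(result)
```

Since the file is only re-encoded and not parsed, the csv module is not needed here.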

Details:

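Go by the file extension only, not by the file type the operating system displays. For example, if the extension is .csv but the system shows the file as xls, you still need to read it with pd.read_csv.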
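
One way to apply this rule in code is to choose the pandas reader from the suffix alone. The sketch below only returns the reader's name, so it runs without pandas; the READERS mapping and the helper reader_name_for are hypothetical names.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of the pandas reader function
READERS = {'.csv': 'read_csv', '.xls': 'read_excel', '.xlsx': 'read_excel'}

def reader_name_for(path):
    """Pick the pandas reader from the file extension alone."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f'unsupported file extension: {suffix!r}')
    return READERS[suffix]

print(reader_name_for('428.csv'))
print(reader_name_for('report.XLSX'))
```

With pandas imported as pd, the file could then be loaded with getattr(pd, reader_name_for(path))(path).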