[Python] Python 3.6 Data Processing Example: Fast Data Cleaning, Iterated Version

姜大炮

Published 2024-07-30 20:11:55

Category: 姜大炮的工作笔记 (Jiang Dapao's Work Notes). Tags: python, linux, artificial intelligence

Article link: https://blog.csdn.net/sqcainiao/article/details/140804893

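Iteration notes

This code improves on the data-cleaning version I posted two days ago, "[Python] Python 3.6 Data Processing Example: Fast Data Cleaning". It works, but a few small bugs remain in places, and I would welcome suggestions from more experienced readers on how to fix them.

The core function is unchanged: read the CSV files in a given directory and convert each one to an XLSX Excel file. During the conversion the data is pre-processed, for example by stripping leading and trailing whitespace from strings and converting percentage strings in specific columns to numeric values. The code also processes files concurrently in multiple processes and reports progress.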

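Feature overview

File reading and conversion: reads CSV files from a given directory and converts them to XLSX Excel files.
Data pre-processing: cleans the data and converts formats.
Multi-process concurrency: uses Python's concurrent.futures.ProcessPoolExecutor to process several files at the same time.
Progress monitoring: records timing information during processing and counts the files completed so far.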
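To make the pre-processing item concrete, here is a minimal sketch of what it means for a single cell: trim the whitespace, and turn a percentage string into a fraction. The function name `clean_cell` and the sample values are made up for this illustration; the full script further down uses its own `convert_text_to_number`.

```python
def clean_cell(value):
    """Trim whitespace; turn 'NN%' strings into fractions; leave the rest alone."""
    if not isinstance(value, str):
        return value              # numbers, None, etc. pass through unchanged
    text = value.strip()
    if text.endswith('%'):
        try:
            return float(text[:-1]) / 100.0
        except ValueError:
            return text           # e.g. '-%' is not a number: keep the trimmed text
    return text


if __name__ == '__main__':
    # Hypothetical sample cells
    for raw in [' 12.5% ', '3%', ' abc ', '-%', 42]:
        print(repr(raw), '->', repr(clean_cell(raw)))
```

Storing the fraction (0.125 rather than 12.5) is what lets Excel show the value correctly once a percent number format is applied to the cell.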

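Modules used

os: file path operations.
re: regular expression matching.
datetime: getting the current time and formatting timestamps.
pandas: reading CSV files and processing the data.
xlsxwriter: writing XLSX files.
concurrent.futures: provides ProcessPoolExecutor, the process pool that runs the per-file tasks.
multiprocessing: multi-process support; here mainly Manager, which creates the shared queue and list.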

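Advantages

Handles many files efficiently: files are processed in parallel by several worker processes, which is faster than handling them one by one when there are many files.
Data pre-processing: the data is cleaned before it is written, which improves its quality.
Exception handling: exceptions raised while processing a file are caught and the error message is logged.
Progress monitoring: progress is printed as files finish, so the run is easy to follow.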

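Drawbacks

Memory use: every process loads its whole CSV file into memory, so very large files can exhaust memory.
Inter-process communication overhead: the queue and list created by Manager add some cost whenever data passes between processes.
File locking: if two processes try to write the same file at the same time (a log file, for example), the writes can conflict. This one nearly drove me crazy. Some files are large, around 20 MB each, so I tried batch_size=10 to split a single file into 10 parts for processing, but I kept hitting all sorts of conflict errors, and I am not sure whether file locking was the cause.
Resource management: although the code uses with statements to manage resources, error paths may still need more careful cleanup.
Readability and maintainability: the code is fairly complex, especially the multi-process part, which makes it harder to understand and to maintain later.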
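One common way around the log-file conflict described above is to let workers put log records on a queue and have a single listener write the file. The sketch below uses the standard library's `logging.handlers.QueueHandler` and `QueueListener`. The function names, logger name, and log path are assumptions for illustration, and the demo runs in one process with `queue.Queue`; in the multi-process script the queue would presumably be a `manager.Queue()` passed to each worker.

```python
import logging
import logging.handlers
import os
import queue
import tempfile


def start_log_listener(log_queue, log_path):
    """Start the single writer: only this listener ever opens the log file."""
    file_handler = logging.FileHandler(log_path, encoding='utf-8')
    file_handler.setFormatter(
        logging.Formatter('[%(asctime)s] %(processName)s %(message)s'))
    listener = logging.handlers.QueueListener(log_queue, file_handler)
    listener.start()
    return listener


def stop_log_listener(listener):
    """Process the records still queued, stop the thread, close the file."""
    listener.stop()
    for handler in listener.handlers:
        handler.close()


def get_worker_logger(log_queue, name='csv_worker'):
    """Return a logger that only puts records on the queue (no file access)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not any(isinstance(h, logging.handlers.QueueHandler)
               for h in logger.handlers):
        logger.addHandler(logging.handlers.QueueHandler(log_queue))
    return logger


if __name__ == '__main__':
    # In-process demo; the path below is a hypothetical location.
    demo_path = os.path.join(tempfile.gettempdir(), 'clean_demo.log')
    log_queue = queue.Queue()
    listener = start_log_listener(log_queue, demo_path)
    log = get_worker_logger(log_queue)
    log.info('started a.csv')
    log.info('finished a.csv')
    stop_log_listener(listener)
    print('log written to', demo_path)
```

Because only the listener touches the file, workers never compete for it. This does not address conflicts on the XLSX output itself; those would need each task to write to its own file.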

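What I would like to improve

Memory: use pandas' chunked reading to reduce memory use.
Logging: write logs to a separate log file to avoid file-lock problems between processes.
Error handling: add more detailed error handling, such as retries.
Degree of parallelism: tune the maximum number of worker processes in the pool to the actual hardware to get the best performance.
Code structure: wrap the logic in functions and classes to improve readability and maintainability.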
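The chunked-reading idea from the list above can be sketched with the `chunksize` parameter of `pandas.read_csv`, which returns an iterator of DataFrames instead of one large frame. The function name `iter_clean_chunks`, the chunk size, and the sample data are assumptions for illustration. Reading with `dtype=str` is a simplification so that every column can be stripped as text; numeric conversion would follow as a separate step.

```python
import io

import pandas as pd


def iter_clean_chunks(csv_source, chunk_rows=50000, encoding=None):
    """Yield (first_output_row, cleaned_chunk), holding one chunk in memory at a time."""
    next_row = 1  # row 0 is reserved for the header in the output sheet
    reader = pd.read_csv(csv_source, chunksize=chunk_rows,
                         dtype=str, encoding=encoding)
    for chunk in reader:
        for col in chunk.columns:
            chunk[col] = chunk[col].str.strip()  # missing values stay missing
        yield next_row, chunk
        next_row += len(chunk)


if __name__ == '__main__':
    # Hypothetical in-memory CSV standing in for a large file
    sample = io.StringIO("name,rate\n a ,10%\nb , 20%\n c, 30%\n")
    for first_row, chunk in iter_clean_chunks(sample, chunk_rows=2):
        print('write', len(chunk), 'rows starting at output row', first_row)
```

The returned row offset is there so that each chunk can be written to the worksheet right after the previous one. That fits xlsxwriter's constant_memory mode, which expects rows to be written in order.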

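That sums up this code. I hope someone more experienced sees it and can point out how it could be optimized further, and I hope it is useful to other working folks who need something similar.

The complete code follows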

import os
import re
import pandas as pd
import concurrent.futures
from datetime import datetime
from xlsxwriter.workbook import Workbook
from multiprocessing import Manager, cpu_count

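# Directory that contains the CSV files
csv_dir = 'inputpath'  # replace inputpath with the folder that holds the CSV files to clean
# Output folder
output_dir = os.path.join(csv_dir, 'outputpath')  # replace outputpath with the folder name for the cleaned files
os.makedirs(output_dir, exist_ok=True)  # make sure the output folder exists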

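# Convert the text in a single cell to a number where possible
def convert_text_to_number(cell_value):
    # A string with a percent sign: drop the sign and convert to a fraction
    if isinstance(cell_value, str) and '%' in cell_value:
        try:
            return float(re.sub(r'%', '', cell_value)) / 100.0
        except ValueError:
            # Not a number after all (e.g. '-%'): keep the original value
            return cell_value
    # A number, or a string of digits with at most one decimal point: convert to float
    if isinstance(cell_value, (int, float)) or (
            isinstance(cell_value, str) and cell_value.replace('.', '', 1).isdigit()):
        return float(cell_value)
    # Nothing numeric found: return the original value
    return cell_value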

# Excel column letters of the percentage columns, which get a percent number format on output
specific_cols = ['Y', 'AF', 'AJ', 'AR', 'AV']

# Convert an Excel column letter ('A', 'Y', 'AX', ...) to a zero-based position
def excel_col_to_index(letters):
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord('A') + 1)
    return index - 1

# Process a single CSV file
def process_csv_file(csv_file, output_queue, processed_files):
    try:
        start_time = datetime.now()
        formatted_start_time = start_time.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        output_queue.put(f"[{formatted_start_time}] Started processing {csv_file}...")

        # Read the CSV file with pandas
        file_path = os.path.join(csv_dir, csv_file)
        df = pd.read_csv(file_path, encoding='gbk', low_memory=False)
        # Strip leading and trailing whitespace from the text cells in columns A to AX
        # '编码' (code) is column A, '支付买家数' (paying buyers) is column AX
        for col in df.columns[df.columns.get_loc('编码'):df.columns.get_loc('支付买家数') + 1]:
            if pd.notnull(col) and pd.api.types.is_string_dtype(df[col]):
                df[col] = df[col].str.strip()
        # Convert the numeric columns: '访客数' (visitors) is column V, '支付买家数' is column AX.
        # Percentage strings such as '12.5%' become fractions (0.125).
        for col in df.columns[df.columns.get_loc('访客数'):df.columns.get_loc('支付买家数') + 1]:
            df[col] = df[col].apply(convert_text_to_number)
        # specific_cols holds Excel letters, while df.columns holds header names, so
        # comparing them directly never matches. Map the letters to header names first.
        # Assumption: the CSV column order matches the Excel layout (column A comes first).
        percent_cols = {df.columns[i] for i in map(excel_col_to_index, specific_cols)
                        if i < len(df.columns)}

        # Replace the .csv extension with .xlsx
        xlsx_file = os.path.join(output_dir, os.path.splitext(csv_file)[0] + '.xlsx')
        # Create a new workbook
        workbook = Workbook(xlsx_file, {'constant_memory': True, 'remove_timezone': True})
        worksheet = workbook.add_worksheet()
        # Percent number format
        percent_format = workbook.add_format({'num_format': '0.00%'})
        # Write the header row
        for col_num, col_name in enumerate(df.columns):
            worksheet.write(0, col_num, col_name)
        # Write the data rows
        for row_idx, (index, row) in enumerate(df.iterrows(), start=1):
            for col_idx, col_name in enumerate(df.columns):
                cell_value = row[col_name]
                if pd.isna(cell_value):
                    continue  # skip empty cells
                if col_name in percent_cols:
                    worksheet.write(row_idx, col_idx, cell_value, percent_format)
                else:
                    worksheet.write(row_idx, col_idx, cell_value)
        # Close the workbook
        workbook.close()
        end_time = datetime.now()
        elapsed_time = end_time - start_time
        # Format the elapsed time as minutes and seconds
        minutes, seconds = divmod(elapsed_time.total_seconds(), 60)
        formatted_time_taken = f"{int(minutes)} min {int(seconds)} s"
        # Format the end time as YYYY-MM-DD HH:MM:SS.mmm
        formatted_end_time = end_time.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        output_queue.put(f"[{formatted_end_time}] Finished processing {csv_file}. Elapsed: {formatted_time_taken}")

        # Record the file as processed
        processed_files.append(csv_file)
    except Exception as e:
        error_time = datetime.now()
        # Format the error time as YYYY-MM-DD HH:MM:SS.mmm
        formatted_error_time = error_time.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
        output_queue.put(f"[{formatted_error_time}] Error while processing {csv_file}: {e}")

# Program entry point
if __name__ == '__main__':
    # Collect the CSV files in the input directory
    csv_files = [f for f in os.listdir(csv_dir) if f.endswith('.csv')]
    total_files = len(csv_files)

    # Create a queue that carries log messages back from the workers
    with Manager() as manager:
        output_queue = manager.Queue()
        # Shared list of the files that were processed successfully
        processed_files = manager.list()

        # Process the files with a ProcessPoolExecutor
        with concurrent.futures.ProcessPoolExecutor(max_workers=cpu_count()) as executor:
            # Submit one task per file to the pool
            futures = {executor.submit(process_csv_file, csv_file, output_queue, processed_files): csv_file for csv_file
                       in csv_files}

            # Collect results as the tasks finish
            for future in concurrent.futures.as_completed(futures):
                csv_file = futures[future]
                # Drain and print the queued log messages
                while not output_queue.empty():
                    print(output_queue.get())

                # Count the files completed so far
                completed_files = len(processed_files)
                remaining_files = total_files - completed_files
                print(f"Completed {completed_files} files, {remaining_files} remaining.")

    print("All files processed!")
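The retry idea from the improvement list could look roughly like the sketch below. The helper name `run_with_retry`, its parameters, and the choice of `OSError` as the retryable error are assumptions for illustration. Note that `process_csv_file` catches every exception itself, so it would have to let errors propagate (or be split into an inner function that does) before a wrapper like this could take effect.

```python
import time


def run_with_retry(task, *args, attempts=3, delay_seconds=1.0,
                   retry_on=(OSError,)):
    """Call task(*args); retry on the listed exception types, then re-raise."""
    for attempt in range(1, attempts + 1):
        try:
            return task(*args)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay_seconds * attempt)  # simple linear back-off


if __name__ == '__main__':
    calls = []

    def flaky_task(name):
        # Hypothetical task that fails twice (e.g. file busy), then succeeds
        calls.append(name)
        if len(calls) < 3:
            raise OSError('file is busy')
        return f'{name} done'

    print(run_with_retry(flaky_task, 'a.csv', delay_seconds=0.1))
    print('attempts used:', len(calls))
```

Limiting retries to specific exception types matters here: a malformed CSV fails the same way every time, so retrying it only wastes time, while a temporarily locked file may succeed on the next attempt.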
