【Python】Python3.6处理数据实例：快速进行数据清洗，可用但局部有小bug，静候大佬指导调整方案！-CSDN博客

本文链接：https://blog.csdn.net/sqcainiao/article/details/140129836

背景

日常需要做数据积累，每天的销售、运营相关数据需要保存，数据都单品，除了核心单品的数据，我还需要整理到品牌到天的存下来。系统直接导出的数据源是CSV格式，太慢没空一个个做数据清理再保存，所以想偷懒让Python帮帮忙，以下是绞尽脑汁写出来的，能跑但是有BUG。

import pandas as pd
import os

# CSV文件所在的目录
csv_dir = 'F:/0000-运营日报每日数据积累/0-1-销售日报积累/日度'  # 替换为您的CSV文件所在的目录

# 遍历CSV文件目录中的每个文件
for csv_file in os.listdir(csv_dir):
    if csv_file.endswith('.csv'):
        # 读取CSV文件
        file_path = os.path.join(csv_dir, csv_file)
        df = pd.read_csv(file_path, encoding='GBK')  # 根据实际情况修改编码

        # 清空A到AX列所有有值单元格内的前后空字符串
        for col in df.columns[df.columns.get_loc('A') if 'A' in df.columns else None:
        df.columns.get_loc('AX') + 1 if 'AX' in df.columns else None]:
            if pd.notnull(col) and pd.api.types.is_string_dtype(df[col]):
                df[col] = df[col].str.strip()

                # 确保V到AX列都是存在的（如果某些文件不包含这些列，则忽略）
        numeric_cols = [col for col in df.columns if
                        col.upper() in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 
                        'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 
                        'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB','AC', 
                        'AD', 'AE', 'AF', 'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 
                        'AM', 'AN','AO', 'AP', 'AQ', 'AR', 'AS', 'AT', 'AU', 
                        'AV', 'AW', 'AX', ]]  # 填充...以包含所有可能的列
        numeric_cols = [col for col in numeric_cols if col in df.columns]  # 过滤出实际存在的列

        # 把V列到AX列全部转换成数值（将不能转换的设置为NaN）
        for col in numeric_cols[numeric_cols.index('V') if 'V' in numeric_cols else None:
        numeric_cols.index('AX') + 1 if 'AX' in numeric_cols else None]:
            df[col] = pd.to_numeric(df[col], errors='coerce')

            # 去除文件名中的.csv扩展名，并添加.xlsx扩展名
        xlsx_file = os.path.join(csv_dir, os.path.splitext(csv_file)[0] + '.xlsx')

        # 将处理后的DataFrame保存为Excel文件
        df.to_excel(xlsx_file, index=False, engine='openpyxl')

print("所有文件处理完毕！")

问题

以上文件跑完我发现，其余诸如流量买家数之类的数值类都没有问题，但是有几列是百分比的数据还是有问题的，我需要使用另外一段代码把转换成xlsx后的文件汇总到一起的时候，转化率、退款率、商详页跳出率等百分比指标无法计算，需要再次手工准换成数值才能被计算。
在这里插入图片描述

# 清空A到AX列所有有值单元格内的前后空字符串
        for col in df.columns[df.columns.get_loc('A') if 'A' in df.columns else None:
        df.columns.get_loc('AX') + 1 if 'AX' in df.columns else None]:
            if pd.api.types.is_string_dtype(df[col]):
                df[col] = df[col].str.strip()
                # 确保V到AX列都是存在的（如果某些文件不包含这些列，则忽略）
        numeric_cols = [col for col in df.columns if
                        col.upper() in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 
                        'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 
                        'V', 'W', 'X', 'Y', 'Z', 'AA', 'AB','AC', 'AD', 'AE', 'AF', 
                        'AG', 'AH', 'AI', 'AJ', 'AK', 'AL', 'AM', 'AN','AO', 'AP', 
                        'AQ', 'AR', 'AS', 'AT', 'AU', 'AV', 'AW','AX', ]]  # 填充...以包含所有可能的列
        numeric_cols = [col for col in numeric_cols if col in df.columns]  # 过滤出实际存在的列

                # 将V列到AX列（包括Y列）转换为数值
        numeric_cols = [col for col in df.columns if 'V' <= col.upper() <= 'AX']
        for col in numeric_cols:
            if col.upper() == 'Y':
                # 特殊处理Y列，将百分比字符串转换为小数
                df[col] = pd.to_numeric(df[col].str.rstrip('%'), errors='coerce') / 100
            else:
                df[col] = pd.to_numeric(df[col], errors='coerce')

                # 去除文件名中的.csv扩展名，并添加.xlsx扩展名
        xlsx_file = os.path.join(csv_dir, os.path.splitext(csv_file)[0] + '.xlsx')

        # 使用xlsxwriter作为引擎来保存Excel文件，并设置Y列的格式为百分比
        excel_writer = pd.ExcelWriter(xlsx_file, engine='xlsxwriter')
        df.to_excel(excel_writer, sheet_name='Sheet1', index=False)

        # 获取xlsxwriter对象以便进行格式化
        workbook = excel_writer.book
        worksheet = excel_writer.sheets['Sheet1']

        # 设置Y列的格式为百分比（注意：索引从0开始，且Excel中的百分比格式需要乘以100）
        percent_format = workbook.add_format({'num_format': '0.00%'})
        worksheet.set_column('Y:Y', None, percent_format)  # 假设Y列在Excel中是Y列，根据实际情况调整

        # 保存Excel文件
        excel_writer.save()

print("所有文件处理完毕！")