Reading and Writing Large Amounts of CSV Data Quickly with Python

Background

The problem: a program that performs the following processing takes far too long.

 1. From a CSV file of several GB, obtain the list of row numbers whose data satisfies a certain condition (a sketch of this step is shown just below).
    ⇛ e.g., use rows [0, 1, 5, 7, 10, ...]

 2. Read the input file line by line, extract the rows to use, and write them to an output file.
    ⇛ only rows [0, 1, 5, 7, 10, ...] are written to the output file

The IO (read/write) processing in step 2 above takes several hours, so it needs to be improved.
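
For reference, step 1 might look something like the minimal sketch below. The actual condition is not shown in this article, so the filter used here (keep rows whose column "a" is divisible by 25) is only a placeholder assumption.

import pandas as pd

# Minimal sketch of step 1 (the condition on column "a" is a placeholder
# assumption; the real condition is not part of this article).
def get_selected_row_indices(path, chunk_size=100_000):
    indices = []
    offset = 0  # global 0-based data-row offset of the current chunk
    # Read in chunks so a multi-GB file never has to fit in memory at once
    for chunk in pd.read_csv(path, chunksize=chunk_size):
        hits = (chunk['a'] % 25 == 0).to_numpy()
        indices.extend(offset + i for i, hit in enumerate(hits) if hit)
        offset += len(chunk)
    return indices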

Goal

Speed up the process of reading and writing the large amount of data without affecting the contents of the output file.

Verification

I ran a number of tests; the details and results are written below.
In the end, the pandas-based approach was the fastest.

Input file

It has 26 columns, and its rows contain the numbers from 0 to 1,000,000.

"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1"
...
"999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"

Verification 1: As-is processing

The current process reads the input file one line at a time and writes each row to be extracted to the output file.

import time
import csv

# Specify the input and output files
input_file = 'input_large.csv'
output_file = 'output_selected_rows.csv'

# Specify the list of row numbers to extract (0-based)
selected_row_indices = [i for i in range(0, 1000000+1, 25)] # extract 1 row out of every 25

# Record the start time
start_time = time.time()

# Open the input and output files
with open(input_file, 'r', newline='') as infile, \
     open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)

    # Write the header row first
    header = next(reader)
    writer.writerow(header)
    
    # Step through the remaining data and write the matching rows to the output file
    for row_index, row in enumerate(reader):
        if row_index in selected_row_indices:
            writer.writerow(row)

# Record the end time
end_time = time.time()

# Compute and print the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Until file reading and writing finished…
⇛ 131.00 seconds

Output file
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25"
...
"999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"

⇛ No problem.

Verification 2: Accumulate several rows in a buffer, then write them to the file

First, I had heard that writing many rows at once is faster than writing one row at a time, so I prepared a buffer.
Once 1,000 rows have accumulated in the buffer, its contents are written to the file.

import csv
import time
from itertools import islice

# Specify the input and output files
input_file = 'input_large.csv'
output_file = 'output_selected_rows_2.csv'


# Specify the list of row numbers to extract (0-based)
selected_row_indices = [i for i in range(0, 1000000+1, 25)]

# Specify the number of rows per chunk
chunk_size = 1000 

# Record the start time
start_time = time.time()

# Open the input and output files
with open(input_file, 'r', newline='') as infile, \
     open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)

    # Write the header row first
    header = next(reader)
    writer.writerow(header)
    
    # Prepare a buffer that holds rows until they are written
    buffer = []

    # Step through the remaining data and buffer the rows to be written.
    # The row index is tracked across chunks so that it always refers to a
    # position in the whole file, not just in the current chunk.
    global_offset = 0
    for chunk in iter(lambda: list(islice(reader, chunk_size)), []):
        for row_index, row in enumerate(chunk, start=global_offset):
            if row_index in selected_row_indices:
                buffer.append(row)
        global_offset += len(chunk)

        # Write the contents of the buffer to the file
        if len(buffer) >= chunk_size:
            writer.writerows(buffer)
            buffer = []

    # Write any rows remaining in the buffer to the file
    if buffer:
        writer.writerows(buffer)

# Record the end time
end_time = time.time()

# Compute and print the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Until file reading and writing finished…
⇛ 126.57 seconds (surprisingly, not that much faster…)

Output file
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25"
...
"999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"

⇛ No problem.

Verification 3: Using pandas

I actually tried various approaches with pandas, and the following worked best:
read the input file 1,000 rows at a time using chunks, and leave the writing to pandas' to_csv method.

Notes
  • read_csv converts data types by default, so if you don't want that, you need to add an option such as dtype=str.
  • If the lineterminator='\r\n' option is not passed to to_csv, the line endings in the output file change from CRLF to LF.
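
For example, keeping every column as a string would look roughly like this (a minimal sketch, not part of the measured code that follows):

import pandas as pd

# Sketch only: dtype=str keeps every value as a string instead of letting
# pandas infer numeric types
chunks_as_str = pd.read_csv('input_large.csv', dtype=str, chunksize=1000)

The measured code itself is below.
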
import csv
import time
import pandas as pd

# Specify the input and output files
input_file = 'input_large.csv'
output_file = 'output_selected_rows_3.csv'

# Specify the list of row numbers to extract (0-based)
selected_row_indices = [i for i in range(0, 1000000+1, 25)]

# Specify the number of rows per chunk
chunk_size = 1000

# Record the start time
start_time = time.time()

# Create the chunked reader
chunks = pd.read_csv(input_file, chunksize=chunk_size)

selected_rows = []
current_row = 0
for chunk in chunks:
    # Work out the row range covered by the current chunk and keep only the wanted row numbers within it
    end_row = current_row + len(chunk)
    chunk_indices = [idx for idx in selected_row_indices if current_row <= idx < end_row]
    
    # Convert the whole-file row numbers to positions within the current chunk
    adjusted_indices = [idx - current_row for idx in chunk_indices]
    
    # Add the corresponding rows to the list
    selected_rows.append(chunk.iloc[adjusted_indices])
    current_row = end_row

# Concatenate the selected rows
selected_rows_df = pd.concat(selected_rows)

# Write all selected rows to the output file
selected_rows_df.to_csv(output_file, index=False, quoting=csv.QUOTE_ALL, lineterminator='\r\n')

# Record the end time
end_time = time.time()

# Compute and print the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Until file reading and writing finished…
⇛ 3.54 seconds! (^◇^)

Output file
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25"
...
"999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"

⇛ No problem.

Summary

When handling large amounts of CSV data, pandas clearly has the advantage.
Processing the data in chunks has the advantages of preventing excessive memory use and speeding up the IO, so give it a try (a memory-friendlier variant is sketched below).
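
As a side note, the verification 3 code above still gathers every selected row in memory and writes only once at the end. If the selection itself were very large, each chunk could instead be appended to the output file as it is processed. A rough sketch of that idea (the output file name output_selected_rows_4.csv is hypothetical, and this is not the measured code):

import csv
import pandas as pd

# Sketch only: write each chunk's selected rows immediately instead of
# concatenating everything at the end
chunk_size = 1000
selected = set(range(0, 1000000 + 1, 25))  # a set makes membership tests cheap

current_row = 0
first_chunk = True
for chunk in pd.read_csv('input_large.csv', chunksize=chunk_size):
    # Positions within this chunk whose whole-file row number was selected
    local = [i - current_row
             for i in range(current_row, current_row + len(chunk))
             if i in selected]
    chunk.iloc[local].to_csv('output_selected_rows_4.csv',
                             mode='w' if first_chunk else 'a',
                             header=first_chunk, index=False,
                             quoting=csv.QUOTE_ALL, lineterminator='\r\n')
    current_row += len(chunk)
    first_chunk = False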

The results are summarized below.

  Verification 1 (as-is, row by row):  131.00 seconds
  Verification 2 (buffered writes):    126.57 seconds
  Verification 3 (pandas, chunked):      3.54 seconds

Incidentally, if the data is split across multiple files, parallel processing is also worth considering. I often use joblib for this, so take a look if you need it (a sketch follows below).
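
A minimal sketch of that idea with joblib; the split file names, the worker count, and the every-25th-row filter are all placeholder assumptions:

import csv
import pandas as pd
from joblib import Parallel, delayed

# Sketch only: filter several already-split CSV files in parallel
# (file names, n_jobs and the every-25th-row rule are assumptions)
def filter_file(in_path, out_path, stride=25):
    df = pd.read_csv(in_path)
    df.iloc[::stride].to_csv(out_path, index=False,
                             quoting=csv.QUOTE_ALL, lineterminator='\r\n')
    return out_path

inputs = ['input_part_1.csv', 'input_part_2.csv', 'input_part_3.csv']
outputs = [p.replace('input', 'output') for p in inputs]
results = Parallel(n_jobs=3)(
    delayed(filter_file)(i, o) for i, o in zip(inputs, outputs)
)
print(results)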

That's all for now.
