Background
The problem: a program that performs the following processing takes far too long.
1. From a CSV file of several GB, obtain a list of the row numbers whose data satisfies a certain condition.
⇛ e.g., rows [0,1,5,7,10...] are to be used
2. Read the input file line by line, extract the rows to be used, and write them to an output file.
⇛ only rows [0,1,5,7,10...] are written to the output file
The IO (read/write) in step 2 above takes several hours, so it needs improvement.
Purpose
Speed up the reading and writing of large amounts of data without affecting the contents of the output file.
Verification
I ran a number of tests; the details and results are below.
In the end, the pandas-based approach was the fastest.
Input file
It has 26 columns, and its data rows contain the numbers 0 through 1000000 (a sample is shown below; a short script to generate it follows the sample).
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1","1"
...
"999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999","999999"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"
Verification ①: AS-IS processing
The current process reads the input file line by line and writes each row to be extracted to the output file.
import time
import csv

# Specify the input and output files
input_file = 'input_large.csv'
output_file = 'output_selected_rows.csv'

# Specify the list of row numbers to extract (0-based)
selected_row_indices = [i for i in range(0, 1000000+1, 25)]  # extract 1 row out of every 25

# Record the start time
start_time = time.time()

# Open the input and output files
with open(input_file, 'r', newline='') as infile, \
     open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)

    # Write the header row first
    header = next(reader)
    writer.writerow(header)

    # Step through the remaining data and write the matching rows to the output file
    for row_index, row in enumerate(reader):
        if row_index in selected_row_indices:
            writer.writerow(row)

# Record the end time
end_time = time.time()

# Compute and print the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")
Time until reading and writing completed…
⇛ 131.00 seconds
Output file
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25"
...
"999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"
⇛ No problem.
Verification ②: Accumulate rows in a buffer, then write them to the file
I had heard that writing many rows at once is faster than writing one row at a time, so I prepared a buffer.
Once 1000 rows have accumulated in the buffer, its contents are written out.
import csv
import time
from itertools import islice

# Specify the input and output files
input_file = 'input_large.csv'
output_file = 'output_selected_rows_2.csv'

# Specify the list of row numbers to extract (0-based)
selected_row_indices = [i for i in range(0, 1000000+1, 25)]

# Specify the number of rows per chunk
chunk_size = 1000

# Record the start time
start_time = time.time()

# Open the input and output files
with open(input_file, 'r', newline='') as infile, \
     open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)

    # Write the header row first
    header = next(reader)
    writer.writerow(header)

    # Prepare a buffer to hold rows before writing
    buffer = []

    # Step through the remaining data and write the matching rows to the output file
    # (the row index must be tracked globally across chunks; enumerating each
    # chunk on its own would restart from 0 every 1000 rows)
    row_index = 0
    for chunk in iter(lambda: list(islice(reader, chunk_size)), []):
        for row in chunk:
            if row_index in selected_row_indices:
                buffer.append(row)
            row_index += 1

        # Write the contents of the buffer to the file
        if len(buffer) >= chunk_size:
            writer.writerows(buffer)
            buffer = []

    # Write any rows remaining in the buffer
    if buffer:
        writer.writerows(buffer)

# Record the end time
end_time = time.time()

# Compute and print the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")
Time until reading and writing completed…
⇛ 126.57 seconds (surprisingly, not that much faster… a likely reason is discussed after the output check below)
Output file
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25"
...
"999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"
⇛ No problem.
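As an aside, a plausible reason the buffering barely helped: in both versions above, every data row runs row_index in selected_row_indices against a plain list of 40,001 elements, and that linear scan, rather than the writing, likely dominates the runtime. A minimal variant of Verification ① using a set for O(1) lookups (not part of the original timings) would look like this:

import csv

input_file = 'input_large.csv'
output_file = 'output_selected_rows_set.csv'  # hypothetical file name

# Same extraction as Verification ①, but with the row numbers held in a
# set so each membership test is O(1) instead of a linear scan
selected_set = set(range(0, 1000000 + 1, 25))

with open(input_file, 'r', newline='') as infile, \
     open(output_file, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    writer.writerow(next(reader))  # header row
    for row_index, row in enumerate(reader):
        if row_index in selected_set:
            writer.writerow(row)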
Verification ③: Using pandas
I actually tried several approaches with pandas; the one below was the most effective:
read the input file 1000 rows at a time with chunks, and leave the writing to pandas' to_csv method. This did the trick.
Notes
- read_csv converts data types by default, so if you don't want that, you need to add an option such as dtype=str.
- If the lineterminator='\r\n' option is not passed to the to_csv method, the line endings in the output file change from CRLF to LF. A minimal sketch of both options follows.
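For reference, here is a sketch of those two options together (it reads the whole file in one go purely to illustrate the options; the timed code below omits dtype=str because, for this all-integer input with QUOTE_ALL quoting, the output is identical either way):

import csv
import pandas as pd

# dtype=str keeps every field as a string (no type conversion on read);
# lineterminator='\r\n' keeps CRLF line endings in the output.
# output_copy.csv is a hypothetical file name for this illustration.
df = pd.read_csv('input_large.csv', dtype=str)
df.to_csv('output_copy.csv', index=False,
          quoting=csv.QUOTE_ALL, lineterminator='\r\n')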
import csv
import time
import pandas as pd

# Specify the input and output files
input_file = 'input_large.csv'
output_file = 'output_selected_rows_3.csv'

# Specify the list of row numbers to extract (0-based)
selected_row_indices = [i for i in range(0, 1000000+1, 25)]

# Specify the number of rows per chunk
chunk_size = 1000

# Record the start time
start_time = time.time()

# Set up chunked reading of the input file
chunks = pd.read_csv(input_file, chunksize=chunk_size)

selected_rows = []
current_row = 0
for chunk in chunks:
    # Compute the range covered by the current chunk and narrow the
    # row-number list down to the rows that fall inside it
    end_row = current_row + len(chunk)
    chunk_indices = [idx for idx in selected_row_indices if current_row <= idx < end_row]

    # Convert the global row numbers to positions within the current chunk
    adjusted_indices = [idx - current_row for idx in chunk_indices]

    # Add the corresponding rows to the list
    selected_rows.append(chunk.iloc[adjusted_indices])
    current_row = end_row

# Concatenate the selected rows
selected_rows_df = pd.concat(selected_rows)

# Write all selected rows to the output file
selected_rows_df.to_csv(output_file, index=False, quoting=csv.QUOTE_ALL, lineterminator='\r\n')

# Record the end time
end_time = time.time()

# Compute and print the elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")
Time until reading and writing completed…
⇛ 3.54 seconds! (^◇^)
Output file
"a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z"
"0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0"
"25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25","25"
...
"999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975","999975"
"1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000","1000000"
⇛ No problem.
Summary
When processing large amounts of CSV data, pandas definitely has the advantage.
Processing in chunks both keeps memory use in check and speeds up IO, so give it a try.
The results are summarized below.
Verification ① (line-by-line read/write) ⇛ 131.00 seconds
Verification ② (buffered writes) ⇛ 126.57 seconds
Verification ③ (pandas chunked read + to_csv) ⇛ 3.54 seconds
Incidentally, if the data is split across multiple files, parallel processing becomes advantageous.
I often use joblib for that, so check it out if you need it; a minimal sketch follows.
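This sketch rests on assumptions: the data is already split into files named input_part_0.csv … input_part_3.csv, and extract_rows is a hypothetical helper, not code from the tests above.

import csv
import pandas as pd
from joblib import Parallel, delayed

def extract_rows(path, selected_set):
    # Read one CSV file and keep only the data rows whose 0-based
    # row numbers are in selected_set
    df = pd.read_csv(path, dtype=str)
    keep = [i for i in range(len(df)) if i in selected_set]
    return df.iloc[keep]

# Hypothetical input split across four files
paths = [f'input_part_{n}.csv' for n in range(4)]
selected_set = set(range(0, 250000, 25))

# Process the files in parallel, one worker per file
parts = Parallel(n_jobs=4)(
    delayed(extract_rows)(path, selected_set) for path in paths
)

# Concatenate the pieces and write the combined result
pd.concat(parts).to_csv('output_parallel.csv', index=False,
                        quoting=csv.QUOTE_ALL, lineterminator='\r\n')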
Well, that's all for now.