Python: Speeding Up CSV Reads - Improving the Efficiency of Reading Large CSV Files

As mentioned in the comments, Python already buffers writes to files, so implementing your own buffer in Python (as opposed to C, which is how the built-in buffering is already done) will only make things slower. You can adjust the buffer size with an argument to the open() call.
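For example, here is a minimal sketch of tuning the write buffer through open()'s buffering argument (the file name, sample data, and the 1 MiB size are illustrative choices, not values from the original post):

BUFFER_SIZE = 1024 * 1024  # 1 MiB; an arbitrary starting point - tune for your workload

# the buffering argument sets the size in bytes of the underlying buffer
with open("keywords.out", "w", buffering=BUFFER_SIZE) as outfile:
    for keyword in ("alpha", "beta", "gamma"):  # stand-in data
        outfile.write(keyword + "\n")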

Another approach is to read the file in chunks. The basic algorithm is this: iterate through the file based on its total size.

As you iterate, record the start and end byte positions of each chunk.

In a worker process (using multiprocessing.Pool()), read in a chunk using its start and end byte positions.

Each process writes its own keyword file.

Reconcile the separate files. You have a few options: read the keyword files back into memory into a single list.

On *nix, combine the keyword files with the cat command (a rough sketch of this step is shown right after these options).

On Windows, you can keep a list of the keyword files (instead of a single file path) and iterate over those files as needed.
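To make the last two steps concrete, here is a rough sketch of each worker writing its own keyword file and of the *nix cat reconciliation step (the file-naming scheme, helper names, and the subprocess call are illustrative additions, not part of the original answer):

import subprocess

def write_keyword_file(worker_id, keywords):
    # each worker process writes its results to its own file
    out_name = "keywords_%d.txt" % worker_id
    with open(out_name, "w") as outfile:
        outfile.write("\n".join(keywords) + "\n")
    return out_name

def combine_keyword_files(file_names, combined_name="keywords_all.txt"):
    # on *nix, shell out to cat to concatenate the per-worker files into one
    with open(combined_name, "w") as outfile:
        subprocess.run(["cat"] + list(file_names), stdout=outfile, check=True)
    return combined_name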

There are plenty of blog posts and recipes out there on reading large files in parallel.

Side note: I once tried to do the same thing, with the same result. Offloading the file writes to a separate thread did not help either (at least it didn't when I tried it).

Here is a code snippet that demonstrates the algorithm:

import functools
import multiprocessing

BYTES_PER_MB = 1048576

# stand-in for whatever processing you need to do on each line
# for demonstration, we'll just grab the first character of every non-empty line
def line_processor(line):
    try:
        return line[0]
    except IndexError:
        return None

# here's your worker function that executes in a worker process
def parser(file_name, start, end):
    with open(file_name) as infile:
        # get to the proper starting position
        infile.seek(start)

        # use read() to force exactly the number of bytes we want
        lines = infile.read(end - start).split("\n")

    return [line_processor(line) for line in lines]

# this function splits the file into chunks and yields the start and end byte
# positions of each chunk
def chunk_file(file_name):
    chunk_start = 0
    chunk_size = 512 * BYTES_PER_MB  # 512 MB chunk size
    with open(file_name) as infile:
        # we can't use the 'for line in infile' construct because infile.tell()
        # is not accurate during that kind of iteration
        while True:
            # move chunk end to the end of this chunk
            chunk_end = chunk_start + chunk_size
            infile.seek(chunk_end)

            # reading a line advances the file pointer to the end of that line
            # so that chunks don't break lines
            line = infile.readline()

            # check to see if we've read past the end of the file
            if line == '':
                yield (chunk_start, chunk_end)
                break

            # adjust chunk end to ensure it didn't break a line
            chunk_end = infile.tell()
            yield (chunk_start, chunk_end)

            # move the starting point to the beginning of the next chunk
            chunk_start = chunk_end
    return

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    keywords = []

    file_name = "your_file.csv"  # enter your file name here

    # bind the file name argument to the parsing function so we don't have to
    # explicitly pass it every time
    new_parser = functools.partial(parser, file_name)

    # chunk out the file and launch the subprocesses in one step
    for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):
        # as each list becomes available, extend the keyword list with it
        # there are definitely faster ways to do this - have a look at
        # itertools.chain() for other ways to iterate over or combine your
        # keyword lists
        keywords.extend(keyword_list)

    # now do whatever you need to do with your list of keywords
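As the comment above suggests, if the per-chunk lists are large you can iterate over the results lazily with itertools.chain instead of building one big combined list; here is a sketch reusing the pool, new_parser, chunk_file, and file_name names from the snippet above:

import itertools

# note: pool.starmap() still waits for every chunk to finish; chaining just
# avoids copying each per-chunk result list into one combined list
all_keywords = itertools.chain.from_iterable(
    pool.starmap(new_parser, chunk_file(file_name)))

for keyword in all_keywords:
    print(keyword)  # stand-in for whatever processing you need per keyword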
