As mentioned in the comments, Python already buffers writes to files, so implementing your own buffer in Python (as opposed to in C, where it's already done for you) will only slow things down. You can tune the buffer size with an argument to the open call.
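For example, a minimal sketch of passing a buffer size to open (the file name and the 1 MB value are just illustrative, not recommendations):

# tune the size of Python's built-in write buffer via the buffering argument
with open("output.txt", "w", buffering=1048576) as outfile:
    outfile.write("some keyword\n")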
Another approach is to read the file in chunks. The basic algorithm is:

- Iterate over the file using its total size
- While iterating, record the start and end byte positions of each chunk
- In a worker process (using multiprocessing.Pool()), read the chunk in using its start and end byte positions
- Have each process write its own keyword file
- Coordinate the separate files. You have a few options here (see the sketch after this list):
  - Read the keyword files back into memory into a single list
  - On *nix, combine the keyword files with the "cat" command
  - On Windows, keep a list of the keyword file paths (rather than a single file) and iterate over those files as needed
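A rough sketch of that coordination step, assuming the workers have already written files named keywords_0.txt, keywords_1.txt, and so on (those names are purely illustrative):

import glob
import subprocess
import sys

# hypothetical per-process output files written by the workers
keyword_files = sorted(glob.glob("keywords_*.txt"))

if sys.platform.startswith("win"):
    # Windows: keep the list of files and iterate over them as needed
    keywords = []
    for path in keyword_files:
        with open(path) as infile:
            keywords.extend(line.rstrip("\n") for line in infile)
else:
    # *nix: concatenate everything into one file with cat
    with open("keywords_combined.txt", "w") as outfile:
        subprocess.run(["cat"] + keyword_files, stdout=outfile, check=True)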
There are plenty of blog posts and recipes out there on reading large files in parallel.
Side note: I've tried to do the same thing myself, with the same result. Offloading the file writes to a separate thread didn't help either (at least it didn't when I tried it).
Here's a code snippet that demonstrates the algorithm:

import functools
import multiprocessing

BYTES_PER_MB = 1048576

# stand-in for whatever processing you need to do on each line
# for demonstration, we'll just grab the first character of every non-empty line
def line_processor(line):
    try:
        return line[0]
    except IndexError:
        return None

# here's your worker function that executes in a worker process
def parser(file_name, start, end):
    with open(file_name) as infile:
        # get to the proper starting position
        infile.seek(start)
        # use read() to force exactly the number of bytes we want
        lines = infile.read(end - start).split("\n")
    return [line_processor(line) for line in lines]

# this function splits the file into chunks and yields the start and end byte
# positions of each chunk
def chunk_file(file_name):
    chunk_start = 0
    chunk_size = 512 * BYTES_PER_MB  # 512 MB chunk size
    with open(file_name) as infile:
        # we can't use the 'for line in infile' construct because infile.tell()
        # is not accurate during that kind of iteration
        while True:
            # move the chunk end to the end of this chunk
            chunk_end = chunk_start + chunk_size
            infile.seek(chunk_end)
            # reading a line will advance the file pointer to the end of the line
            # so that chunks don't break lines
            line = infile.readline()
            # check to see if we've read past the end of the file
            if line == '':
                yield (chunk_start, chunk_end)
                break
            # adjust the chunk end to ensure it didn't break a line
            chunk_end = infile.tell()
            yield (chunk_start, chunk_end)
            # move the starting point to the beginning of the next chunk
            chunk_start = chunk_end

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    keywords = []
    file_name = ""  # enter your file name here
    # bind the file name argument to the parsing function so we don't have to
    # explicitly pass it every time
    new_parser = functools.partial(parser, file_name)
    # chunk out the file and launch the subprocesses in one step
    for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):
        # as each list is available, extend the keyword list with the new one
        # there are definitely faster ways to do this - have a look at
        # itertools.chain() for other ways to iterate over or combine your
        # keyword lists
        keywords.extend(keyword_list)
    # now do whatever you need to do with your list of keywords
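As the comment above hints, one alternative to repeatedly calling extend is to chain the per-chunk result lists together; a minimal sketch, assuming the pool, new_parser, chunk_file, and file_name defined in the snippet above:

import itertools

# lazily flatten the per-chunk lists instead of building one big list up front
keyword_iter = itertools.chain.from_iterable(
    pool.starmap(new_parser, chunk_file(file_name))
)
for keyword in keyword_iter:
    pass  # process each keyword as it comes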