Python: reading a large file in parallel?

I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my Python code is too slow. Here is a minimal example that shows the problem.

First, make some fake data:

paste largefile.txt

Now here is a minimal piece of Python code to read it in and build the dictionary:

import sys
from collections import defaultdict

# Map each key (first column) to the list of values (second column) seen for it.
d = defaultdict(list)

with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])

Timings:

time ./read.py largefile.txt

real 0m55.746s

However, it is possible to read through the whole file much faster:

time cut -f1 largefile.txt > /dev/null

real 0m1.702s

My CPU has 8 cores; is it possible to parallelize this program in Python to speed it up?

One possibility might be to read in a large chunk of the input, run 8 processes in parallel on non-overlapping subchunks to build dictionaries from the data in memory, and then read in the next large chunk. Is this possible in Python using multiprocessing somehow?
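A minimal sketch of that idea using multiprocessing.Pool (a sketch under assumptions, not a drop-in replacement for the script above): it assumes the same two-column input, reads the whole file at once for brevity instead of looping over successive chunks, builds one dictionary per worker, and merges them at the end.

import sys
from collections import defaultdict
from multiprocessing import Pool

def build_dict(lines):
    # Each worker builds its own dictionary from its slice of lines.
    d = defaultdict(list)
    for line in lines:
        parts = line.split()
        d[parts[0]].append(parts[1])
    return d

if __name__ == '__main__':
    workers = 8  # one per core
    with open(sys.argv[1]) as fin:
        lines = fin.readlines()  # one large chunk held in memory
    step = len(lines) // workers + 1
    slices = [lines[i:i + step] for i in range(0, len(lines), step)]

    with Pool(workers) as pool:
        partial_dicts = pool.map(build_dict, slices)

    # Merge the per-worker dictionaries into one.
    result = defaultdict(list)
    for d in partial_dicts:
        for key, values in d.items():
            result[key].extend(values)

Note that slicing and pickling the lines to send them to the workers, and merging the results afterwards, both cost time, so the speedup is likely to be well below 8x.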

Update: the fake data was not very good, as it had only one value per key. A better generator is:

perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
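For reference, a roughly equivalent generator in plain Python (an assumed equivalent of the Perl one-liner above, not taken from the original post):

import random

# ~1e7 lines of "key value": keys drawn from [0, 1e7), values from [0, 1e4),
# so some keys end up with several values.
with open('largefile.txt', 'w') as out:
    for _ in range(10000000):
        out.write('%d %d\n' % (random.randrange(10000000), random.randrange(10000)))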

Solution

There was a blog post series about this, the "Wide Finder Project", several years ago at Tim Bray's site [1]. You can find there a solution [2] by Fredrik Lundh of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code.
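For readers who cannot follow the links, the general shape of those solutions is: split the file into byte ranges, let each worker open the file and parse only its own range, then merge the partial dictionaries. The sketch below is not Lundh's code; it is one common way to do that splitting, and the line-boundary handling is an assumption of this sketch:

import os
import sys
from collections import defaultdict
from multiprocessing import Pool

FILENAME = sys.argv[1]

def process_range(byte_range):
    # Parse only the lines that *start* inside [start, end); each worker
    # opens the file itself, so nothing large is sent between processes.
    # The file is opened in binary mode, so keys and values are bytes.
    start, end = byte_range
    d = defaultdict(list)
    with open(FILENAME, 'rb') as f:
        if start != 0:
            f.seek(start - 1)
            f.readline()  # skip the partial line owned by the previous range
        pos = f.tell()
        while pos < end:
            line = f.readline()
            if not line:
                break
            parts = line.split()
            d[parts[0]].append(parts[1])
            pos = f.tell()
    return d

def byte_ranges(path, pieces):
    # Cut the file into roughly equal byte ranges; workers realign to line
    # boundaries themselves in process_range().
    size = os.path.getsize(path)
    step = size // pieces + 1
    return [(i, min(i + step, size)) for i in range(0, size, step)]

if __name__ == '__main__':
    with Pool(8) as pool:
        partials = pool.map(process_range, byte_ranges(FILENAME, 8))
    result = defaultdict(list)
    for d in partials:
        for key, values in d.items():
            result[key].extend(values)

Compared with the Pool.map version earlier, this avoids having the parent process read and pickle all of the lines, which is usually the dominant overhead there.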
