Parallel directory traversal in Python

I need to read every file in the directory tree starting from a given root location. I would like to do this as fast as possible using parallelism. I have 48 cores at my disposal and 1 TB of RAM, so thread resources are not an issue. I also need to log every file that was read.

I looked at using joblib but was unable to combine it with os.walk.

I can think of two ways:

1. Walk the tree, add all files to a queue or list, and have a worker pool of threads dequeue them: best load balancing, but perhaps extra time from the initial walk and queue overhead (a sketch of this option follows below).

2. Spawn threads and statically assign portions of the tree to each thread (e.g. directories picked by a hash of some sort): no initial walk, but poor load balancing.

or is there a better way?
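
For reference, here is a minimal sketch of the first option, assuming a thread pool fed by a single os.walk; process_file and walk_and_process are placeholder names, and the real per-file work would replace the body of process_file:

import os
from concurrent.futures import ThreadPoolExecutor

def process_file(path):  # placeholder: replace with the real per-file work
    with open(path, "rb") as f:
        f.read()
    return path

def walk_and_process(root, workers=48):
    # A single thread walks the tree and submits files; the executor's
    # internal work queue provides the dynamic load balancing described above.
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(process_file, os.path.join(dirpath, name))
                   for dirpath, _dirs, names in os.walk(root)
                   for name in names]
        return [f.result() for f in futures]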

EDIT: Storage performance is not a concern; assume infinitely fast storage that can handle an unlimited number of parallel reads.

EDIT: Removed the multi-node scenario to keep the focus on the parallel directory walk.

Solution

The simplest approach is probably to use a multiprocessing.Pool to process the output of an os.walk performed in the main process.

This assumes that the main work you want to parallelize is the processing of the individual files, not the effort of recursively scanning the directory structure. That may not hold if your files are small and you don't need to do much processing on their contents. I'm also assuming that the process creation handled for you by multiprocessing will distribute the load properly across your cores (which may or may not be true).

import itertools
import multiprocessing
import os  # needed for os.walk and os.path.join

def worker(filename):
    pass  # do something here!

def main():
    with multiprocessing.Pool(48) as pool:  # pool of 48 processes
        walk = os.walk("some/path")
        fn_gen = itertools.chain.from_iterable(
            (os.path.join(root, file) for file in files)
            for root, dirs, files in walk
        )
        results_of_work = pool.map(worker, fn_gen)  # this does the parallel processing

if __name__ == "__main__":
    main()  # guard required for multiprocessing's spawn start method
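
Since the question requires logging every file that was read, here is a hedged sketch of what the worker might look like; the name read_and_log, the log format, and the return value are illustrative assumptions, not part of the original answer:

import logging

# Each worker process configures its own logger; with the default
# fork start method on Linux this can also be inherited from main.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(processName)s %(message)s")

def read_and_log(filename):  # hypothetical worker, passed to pool.map
    with open(filename, "rb") as f:
        data = f.read()  # read the whole file into memory
    logging.info("read %s (%d bytes)", filename, len(data))
    return len(data)  # keep the return value small to limit IPC cost

Passing read_and_log instead of worker to pool.map would then log each file as it is read.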

It is entirely possible that parallelizing the work this way will be slower than doing it in a single process. This is because IO on the hard disks underlying your shared filesystem may be the bottleneck, and attempting many disk reads in parallel could make all of them slower if the disks need to seek more often rather than reading longer linear sections of data. Even if the IO is a little faster, the overhead of communicating between the processes could eat up all of the gains.
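
One knob that can reduce that inter-process overhead is the chunksize argument to pool.map, which batches multiple filenames into each task sent to a worker. A small sketch; the value 64 is an arbitrary guess to tune, not a recommendation:

# Send filenames to workers in batches of 64 instead of one at a
# time, amortizing the cost of each inter-process round trip.
results_of_work = pool.map(worker, fn_gen, chunksize=64)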
