I need to read every file in the directory tree starting from a given root location, and I would like to do this as fast as possible using parallelism. I have 48 cores at my disposal and 1 TB of RAM, so thread resources are not an issue. I also need to log every file that was read.
I looked at using joblib but am unable to combine joblib with os.walk.
I can think of two ways:
1. Walk the tree, add every file to a queue or list, and have a pool of worker threads dequeue files: best load balancing, but possibly extra time for the initial walk and the queue overhead. (A sketch of this option follows the list.)
2. Spawn threads and statically assign portions of the tree to each thread, e.g. directories assigned by some hash: no initial walk, but poor load balancing.
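A minimal sketch of option 1, assuming plain threads are enough because file reads release the GIL, and with files enqueued while the walk is still running so the workers start immediately. read_and_log is a hypothetical placeholder for the real per-file work:

import os
import queue
import threading

NUM_WORKERS = 48   # one thread per core, matching the setup above
SENTINEL = None    # tells a worker to shut down

def read_and_log(path):
    # hypothetical per-file work: read the contents and log the path
    with open(path, "rb") as f:
        f.read()
    print(path)  # replace with real logging

def worker(q):
    while True:
        path = q.get()
        if path is SENTINEL:
            break
        read_and_log(path)

def run(root):
    q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q,))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    # producer: enqueue files as the walk discovers them, so the
    # initial walk overlaps with the workers' reads
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            q.put(os.path.join(dirpath, name))
    for _ in threads:
        q.put(SENTINEL)   # one sentinel per worker
    for t in threads:
        t.join()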
Or is there a better way?
EDIT: Storage performance is not a concern; assume infinitely fast storage that can handle an unlimited number of parallel reads.
EDIT: Removed the multi-node scenario to keep the focus on the parallel directory walk.
Solution
The simplest approach is probably to use a multiprocessing.Pool to process the output of an os.walk performed in the main process.
This assumes that the main work you want to parallelize is whatever processing happens to the individual files, not the effort of recursively scanning the directory structure. That may not be true if your files are small and you don't need to do much processing on their contents. I'm also assuming that the process creation handled for you by multiprocessing will distribute the load properly over your cores (which may or may not be true).
import itertools
import multiprocessing
import os

def worker(filename):
    pass  # do something with the file here!

def main():
    with multiprocessing.Pool(48) as pool:  # pool of 48 worker processes
        walk = os.walk("some/path")
        # flatten the (root, dirs, files) tuples into a stream of full file paths
        fn_gen = itertools.chain.from_iterable(
            (os.path.join(root, file) for file in files)
            for root, dirs, files in walk
        )
        results_of_work = pool.map(worker, fn_gen)  # this does the parallel processing

if __name__ == "__main__":
    main()
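Since the question also requires logging every file that was read, here is a hedged sketch of what worker might look like, assuming the standard logging module and files small enough to read whole. Note that with multiple processes writing to one log file, lines can interleave; per-process log files or a logging queue are safer:

import logging

logging.basicConfig(filename="reads.log", level=logging.INFO)

def worker(filename):
    # read the whole file and record that it was read
    with open(filename, "rb") as f:
        data = f.read()
    logging.info("read %s (%d bytes)", filename, len(data))
    return len(data)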
It is entirely possible that parallelizing the work this way will be slower than just doing it in a single process. This is because IO on the hard disks underlying your shared filesystem may be the bottleneck, and attempting many disk reads in parallel could make them all slower if the disks need to seek more often rather than reading longer linear sections of data. Even if the IO is a little faster, the overhead of communicating between the processes could eat up all of the gains.
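One concrete way to cut the inter-process communication cost: Pool.map accepts a chunksize argument that batches items into fewer messages between processes. A one-line change to the example above (256 is an arbitrary starting point; tune it to your file count and per-file work):

results_of_work = pool.map(worker, fn_gen, chunksize=256)  # send 256 paths per message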