Background:
Python 3.5.1, Windows 7
I have a network drive that holds a large number of files and directories. I am trying to write a script to parse all of them as quickly as possible, find every file that matches a RegEx, and copy those files to my local PC for review. There are roughly 3500 directories and subdirectories, and a few million files. I am trying to keep this as generic as possible (i.e. not hard-coding it to this exact file structure) so I can reuse it on other network drives. The code works when run against a small network drive; the problem here appears to be scalability.
I have tried a few things with the multiprocessing library, but I cannot seem to get it to work reliably. My idea was to create a new job to parse each subdirectory so that the work finishes as quickly as possible. I have a recursive function that parses every object in a directory, calls itself for any subdirectories, and checks any files it finds against the RegEx.

Question: How can I limit the number of threads/processes, without using a Pool, to achieve my goal?

My attempts: If I only use Process jobs, then after more than a few hundred threads have started I get the error RuntimeError: can't start new thread, and the drive starts dropping connections. I end up finding roughly half of the files, because about half of the directories error out (code below).

To limit the total number of threads, I tried using the Pool method, but according to this question a Pool object cannot be passed into the called method, which makes a recursive implementation impossible.

To fix that, I tried calling Processes inside the Pool method, but I get the error daemonic processes are not allowed to have children.

I think that if I can limit the number of concurrent threads, then my solution will work as designed.
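One way to cap concurrency while keeping the recursive structure is a BoundedSemaphore shared across all processes — a minimal sketch, not the original code (the names parse_dir, collect, and MAX_CONCURRENT are made up). Note the caveat: it limits how many processes scan directories at once, but one (mostly idle) process per directory is still created, so it mitigates rather than fully removes the scaling problem:

```python
import os
from multiprocessing import BoundedSemaphore, Manager, Process

MAX_CONCURRENT = 8  # assumed cap; tune to the machine and network


def parse_dir(path, results, sem):
    # Hold a semaphore slot only while scanning this directory's entries.
    jobs = []
    with sem:
        for item in os.scandir(path):
            if item.is_dir(follow_symlinks=False):
                p = Process(target=parse_dir, args=(item.path, results, sem))
                jobs.append(p)
                p.start()  # the child blocks in its own `with sem` until a slot frees
            elif item.is_file():
                results.append(item.path)
    # Join outside the `with` block: this process's slot is already released,
    # so its children can acquire it and no deadlock occurs.
    for job in jobs:
        job.join()


def collect(root):
    with Manager() as manager:
        results = manager.list()
        sem = BoundedSemaphore(MAX_CONCURRENT)
        parse_dir(root, results, sem)
        return list(results)
```

The key design point is releasing the slot before joining the children; joining while holding the slot would let blocked descendants starve the scan.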
Code:

import os
import re
import shutil
from multiprocessing import Process, Manager

CheckLocations = ['network drive location 1', 'network drive location 2']
SaveLocation = 'local PC location'
FileNameRegex = re.compile('RegEx here', flags=re.IGNORECASE)

# Loop through all items in folder, and call itself for subfolders.
def ParseFolderContents(path, DebugFileList):
    FolderList = []
    jobs = []
    TempList = []
    if not os.path.exists(path):
        return
    try:
        for item in os.scandir(path):
            try:
                if item.is_dir():
                    p = Process(target=ParseFolderContents, args=(item.path, DebugFileList))
                    jobs.append(p)
                    p.start()
                elif FileNameRegex.search(item.name) != None:
                    DebugFileList.append((path, item.name))
                else:
                    pass
            except Exception as ex:
                if hasattr(ex, 'message'):
                    print(ex.message)
                else:
                    print(ex)
                # print('Error in file:\t' + item.path)
    except Exception as ex:
        if hasattr(ex, 'message'):
            print(ex.message)
        else:
            print('Error in path:\t' + path)
            pass
    else:
        print('\tToo many threads to restart directory.')
    for job in jobs:
        job.join()
# Save list of debug files.
def SaveDebugFiles(DebugFileList):
    for file in DebugFileList:
        try:
            shutil.copyfile(file[0] + '\\' + file[1], SaveLocation + file[1])
        except PermissionError:
            continue
if __name__ == '__main__':
    with Manager() as manager:
        # Iterate through all directories to make a list of all desired files.
        DebugFileList = manager.list()
        jobs = []
        for path in CheckLocations:
            p = Process(target=ParseFolderContents, args=(path, DebugFileList))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
        print('\n' + str(len(DebugFileList)) + ' files found.\n')
        if len(DebugFileList) == 0:
            quit()
        # Iterate through all debug files and copy them to local PC.
        n = 25  # Number of files to grab for each parallel path.
        TempList = [DebugFileList[i:i + n] for i in range(0, len(DebugFileList), n)]  # Split list into small chunks.
        jobs = []
        for item in TempList:
            p = Process(target=SaveDebugFiles, args=(item,))
            jobs.append(p)
            p.start()
        for job in jobs:
            job.join()
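For comparison, the recursion above can be dropped entirely: a fixed set of worker processes share a JoinableQueue of directories, and a worker that finds a subdirectory enqueues it instead of spawning a new Process. This puts a hard cap on the total process count without using a Pool. A sketch under assumed names (find_matching, _worker, num_workers are mine, not part of the original script):

```python
import os
import re
from multiprocessing import JoinableQueue, Manager, Process


def _worker(dir_queue, found, pattern):
    regex = re.compile(pattern, flags=re.IGNORECASE)
    while True:
        path = dir_queue.get()
        try:
            for item in os.scandir(path):
                if item.is_dir(follow_symlinks=False):
                    dir_queue.put(item.path)  # enqueue instead of recursing
                elif regex.search(item.name):
                    found.append((path, item.name))
        except OSError as ex:
            print('Error in path:\t{}\t{}'.format(path, ex))
        finally:
            dir_queue.task_done()  # lets dir_queue.join() count this directory


def find_matching(roots, pattern, num_workers=8):
    # Return (directory, filename) pairs whose filename matches `pattern`.
    with Manager() as manager:
        found = manager.list()
        dir_queue = JoinableQueue()
        for root in roots:
            dir_queue.put(root)
        workers = [Process(target=_worker, args=(dir_queue, found, pattern),
                           daemon=True)
                   for _ in range(num_workers)]
        for w in workers:
            w.start()
        dir_queue.join()  # blocks until every queued directory has been scanned
        return list(found)


if __name__ == '__main__':
    matches = find_matching(['network drive location 1'], 'RegEx here')
    print(len(matches), 'files found.')
```

Because task_done() is called in a finally block, dir_queue.join() returns once every directory has been scanned even when some paths raise errors; the daemon workers then simply die with the main process.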