As mentioned in the comments, Python already buffers writes to files, so implementing your own buffer in Python (as opposed to in C, where it's already done for you) will only slow things down. You can tune the buffer size with an argument to the open call.
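For example, a minimal sketch of passing a buffer size to open (the file name and the 1 MB value are just illustrative, not recommendations):

# tune the size of Python's built-in write buffer via the buffering argument
with open("output.txt", "w", buffering=1048576) as outfile:
    outfile.write("some keyword\n")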
Another approach is to read the file in chunks. The basic algorithm is:

- Iterate over the file using its total size
- While iterating, record the start and end byte positions of each chunk
- In a worker process (using multiprocessing.Pool()), read the chunk in using its start and end byte positions
- Have each process write its own keyword file
- Coordinate the separate files. You have a few options here (see the sketch after this list):
  - Read the keyword files back into memory into a single list
  - On *nix, combine the keyword files with the "cat" command
  - On Windows, keep a list of the keyword file paths (rather than a single file) and iterate over those files as needed
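A rough sketch of that coordination step, assuming the workers have already written files named keywords_0.txt, keywords_1.txt, and so on (those names are purely illustrative):

import glob
import subprocess
import sys

# hypothetical per-process output files written by the workers
keyword_files = sorted(glob.glob("keywords_*.txt"))

if sys.platform.startswith("win"):
    # Windows: keep the list of files and iterate over them as needed
    keywords = []
    for path in keyword_files:
        with open(path) as infile:
            keywords.extend(line.rstrip("\n") for line in infile)
else:
    # *nix: concatenate everything into one file with cat
    with open("keywords_combined.txt", "w") as outfile:
        subprocess.run(["cat"] + keyword_files, stdout=outfile, check=True)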
There are plenty of blog posts and recipes out there on reading large files in parallel.
Side note: I've tried to do the same thing myself, with the same result. Offloading the file writes to a separate thread didn't help either (at least it didn't when I tried it).
Here's a code snippet that demonstrates the algorithm:

import functools
import multiprocessing

BYTES_PER_MB = 1048576

# stand-in for whatever processing you need to do on each line
# for demonstration, we'll just grab the first character of every non-empty line
def line_processor(line):
    try:
        return line[0]
    except IndexError:
        return None

# here's your worker function that executes in a worker process
def parser(file_name, start, end):
    with open(file_name) as infile:
        # get to the proper starting position
        infile.seek(start)
        # use read() to force exactly the number of bytes we want
        lines = infile.read(end - start).split("\n")
    return [line_processor(line) for line in lines]

# this function splits the file into chunks and yields the start and end byte
# positions of each chunk
def chunk_file(file_name):
    chunk_start = 0
    chunk_size = 512 * BYTES_PER_MB  # 512 MB chunk size
    with open(file_name) as infile:
        # we can't use the 'for line in infile' construct because infile.tell()
        # is not accurate during that kind of iteration
        while True:
            # move the chunk end to the end of this chunk
            chunk_end = chunk_start + chunk_size
            infile.seek(chunk_end)
            # reading a line will advance the file pointer to the end of the line
            # so that chunks don't break lines
            line = infile.readline()
            # check to see if we've read past the end of the file
            if line == '':
                yield (chunk_start, chunk_end)
                break
            # adjust the chunk end to ensure it didn't break a line
            chunk_end = infile.tell()
            yield (chunk_start, chunk_end)
            # move the starting point to the beginning of the next chunk
            chunk_start = chunk_end

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    keywords = []
    file_name = ""  # enter your file name here
    # bind the file name argument to the parsing function so we don't have to
    # explicitly pass it every time
    new_parser = functools.partial(parser, file_name)
    # chunk out the file and launch the subprocesses in one step
    for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):
        # as each list is available, extend the keyword list with the new one
        # there are definitely faster ways to do this - have a look at
        # itertools.chain() for other ways to iterate over or combine your
        # keyword lists
        keywords.extend(keyword_list)
    # now do whatever you need to do with your list of keywords
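As the comment above hints, one alternative to repeatedly calling extend is to chain the per-chunk result lists together; a minimal sketch, assuming the pool, new_parser, chunk_file, and file_name defined in the snippet above:

import itertools

# lazily flatten the per-chunk lists instead of building one big list up front
keyword_iter = itertools.chain.from_iterable(
    pool.starmap(new_parser, chunk_file(file_name))
)
for keyword in keyword_iter:
    pass  # process each keyword as it comes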