As I see it, your problem is not re-reading the file but matching slices of a long list against a short list. As other answers have pointed out, you can speed up your program by using a plain list or a memory-mapped file.

If you want a specific data structure for a further speedup, I would suggest looking into blist, in particular because it has better slicing performance than the standard Python list: it claims O(log n) instead of O(n).
On a list of roughly 10 MB I measured a speedup of about 4x:

import random
from blist import blist

LINE_NUMBER = 1000000

def write_files(line_length=LINE_NUMBER):
    # Write a large haystack file, plus a smaller needles file of
    # random (needle, start, end) queries against it.
    with open('haystack.txt', 'w') as outfile:
        for _ in range(line_length):
            outfile.write('an example\n')
    with open('needles.txt', 'w') as outfile:
        for _ in range(line_length // 100):  # // so range() gets an int
            first_rand = random.randint(0, line_length)
            second_rand = random.randint(first_rand, line_length)
            needle = random.choice(['an example', 'a sample'])
            outfile.write('%s\t%s\t%s\n' % (needle, first_rand, second_rand))

def read_files():
    # Load the haystack once, both as a plain list and as a blist.
    with open('haystack.txt', 'r') as infile:
        normal_list = []
        for line in infile:
            normal_list.append(line.strip())
    enhanced_list = blist(normal_list)
    return normal_list, enhanced_list

def match_over(list_structure):
    # For each needle, test membership in the requested slice.
    matches = 0
    total = len(list_structure)
    with open('needles.txt', 'r') as infile:
        for line in infile:
            needle, start, end = line.split('\t')
            start, end = int(start), int(end)
            if needle in list_structure[start:end]:
                matches += 1
    return float(matches) / float(total)
Measured with IPython's %time magic, the blist version takes 12 seconds where the plain list takes 46 seconds.
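As a stdlib-only aside (not part of the measurement above): part of the O(n) cost of `needle in list_structure[start:end]` on a plain list is that the slice is materialized as a temporary copy before the scan. A hypothetical helper like the sketch below scans the index range directly and avoids that temporary list, at the price of a Python-level loop:

```python
def contains_in_range(haystack, needle, start, end):
    """True if needle occurs in haystack[start:end], without
    materializing the slice as a temporary list."""
    for i in range(start, min(end, len(haystack))):
        if haystack[i] == needle:
            return True
    return False

# Small usage example with made-up data:
words = ['an example', 'a sample', 'an example']
print(contains_in_range(words, 'a sample', 0, 3))  # True
print(contains_in_range(words, 'a sample', 2, 3))  # False
```

Whether this beats the slice-and-copy approach depends on slice sizes and hit rates, so it is worth profiling against your own data before switching.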