Efficient file handling in Python: efficiently processing large .txt files in Python

Latest recommended article published on 2022-07-22 22:05:30

weixin_39769807

Article tags: efficient file handling in Python

I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (I've had it running now for about 40 hours). The math is quite simple, so I don't think it should be taking this long.

The way I am reading my .txt file right now is with csv.DictReader from the csv module. My code is as follows:

    import csv

    file1 = '/Users/Shared/SmallSetbee.txt'
    newfile = open(file1, newline='')  # text mode, as the csv module expects in Python 3
    reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only specified lines at a time, perform the calculation and then repeat with a new specified "chunk" or "window" of specified lines?

Solution

A collections.deque is an ordered collection of items that can be given a maximum size. Once it is full, adding an item to one end makes one fall off the other end. This means that to iterate over a "window" on your csv, you just need to keep appending rows to the deque, and it will discard the oldest ones for you.
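To see the "falls off the other end" behaviour in isolation, here is a minimal sketch with a tiny window. The size of 3 and the integer "rows" are stand-ins chosen for illustration, not values from the question.

```python
from collections import deque

# A window that holds at most 3 items (a tiny stand-in for maxlen=50000).
window = deque(maxlen=3)

for row in range(5):
    # Once the deque is full, each append silently drops the oldest item.
    window.append(row)
    print(list(window))

# After the loop, only the 3 most recently appended items remain.
final_window = list(window)
```

The deque never grows beyond maxlen, so memory use stays bounded by the window size no matter how long the input is.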

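    import collections
    import csv

    dq = collections.deque(maxlen=50000)

    with open(...) as csv_file:
        # strip NUL characters from each line before the csv module parses it
        reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

        # initial fill: load the first full window of 50,000 rows
        for _ in range(50000):
            dq.append(next(reader))

        # repeated compute: process the window, then slide it forward by 10,000 rows
        # (compute is your own calculation over the rows currently in the window)
        try:
            while True:
                compute(dq)
                for _ in range(10000):
                    dq.append(next(reader))
        except StopIteration:
            # the reader is exhausted; a trailing partial step is not computed
            pass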
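The same idea can be wrapped in a generator so it can be tried on a small in-memory file first. This is only a sketch under stated assumptions: the names sliding_windows and mean_value, the column names pos and value, and the nine-row sample are all hypothetical and stand in for the real file and calculation.

```python
import csv
import io
from collections import deque
from itertools import islice


def sliding_windows(rows, size, step):
    """Yield a list snapshot of each full window of `size` rows, advancing by `step` rows."""
    rows = iter(rows)  # make sure islice consumes from a single shared iterator
    window = deque(islice(rows, size), maxlen=size)
    while len(window) == size:
        yield list(window)
        new_rows = list(islice(rows, step))
        if len(new_rows) < step:
            return  # not enough rows left for another full step
        window.extend(new_rows)  # the deque drops the oldest rows automatically


def mean_value(window):
    """Hypothetical calculation: the mean of the 'value' column over one window."""
    return sum(float(row["value"]) for row in window) / len(window)


# Hypothetical tab-delimited sample: a header line plus 9 data rows.
sample = io.StringIO("pos\tvalue\n" + "".join(f"{i}\t{i * 10}\n" for i in range(9)))
reader = csv.DictReader(sample, delimiter="\t")

# Window of 5 rows, moved forward 2 rows at a time.
results = [mean_value(w) for w in sliding_windows(reader, size=5, step=2)]
print(results)
```

For the real data, the reader would be built from the opened file and the call would use size=50000 and step=10000. Only one window of rows is held in memory at a time, plus the snapshot handed to the calculation.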