Processing large .txt files efficiently in Python

I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (I've had it running now for about 40 hours). The math is quite simple, so I don't think it should be taking this long.

The way I am reading my .txt file right now is with csv.DictReader from the csv module. My code is as follows:

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe that this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of data at a time, is there a more efficient way to read in only specified lines at a time, perform the calculation and then repeat with a new specified "chunk" or "window" of specified lines?

Solution

A collections.deque is an ordered collection of items that can be given a maximum size. When you add an item to one end of a full deque, one falls off the other end. This means that to iterate over a "window" on your csv, you just need to keep appending rows to the deque, and it will discard the oldest rows for you.
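To see the eviction behavior in isolation, here is a minimal sketch that uses a tiny maxlen of 3 instead of 50,000. The values are made up purely for illustration.

```python
import collections

# A deque capped at 3 items: appending to a full deque drops the oldest item.
window = collections.deque(maxlen=3)
for value in range(5):
    window.append(value)

# Only the three most recently appended values remain.
snapshot = list(window)
```

After the loop, the deque holds only the last three values appended; the earlier ones were dropped from the left end without any explicit cleanup.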

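import collections
import csv

dq = collections.deque(maxlen=50000)

with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill: load the first 50,000 rows into the window
    for _ in range(50000):
        dq.append(next(reader))

    # repeated compute: process the window, then slide it forward by 10,000 rows
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        # the file ended mid-slide: process the final, partially advanced window
        compute(dq)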
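The same idea can be wrapped in a generator so that the windowing logic is reusable and testable on small inputs. The following is only a sketch under assumptions: the helper name sliding_windows, the column names pos and value, and the in-memory sample data are all hypothetical and not part of the original question. It also differs slightly from the loop above: it tolerates a file shorter than one window, and it does not recompute the last window when no new rows were read. Note that csv.DictReader yields one row at a time as it is iterated, so only the rows held in the deque stay in memory.

```python
import collections
import csv
import io
import itertools


def sliding_windows(rows, size, step):
    """Yield lists of up to `size` consecutive rows, advancing `step` rows each time.

    Hypothetical helper for illustration; not part of the csv or collections modules.
    """
    rows = iter(rows)
    # Initial fill: take at most `size` rows (fewer if the input is short).
    window = collections.deque(itertools.islice(rows, size), maxlen=size)
    if not window:
        return
    yield list(window)
    while True:
        # Pull the next `step` rows; a shorter batch means the input is ending.
        fresh = list(itertools.islice(rows, step))
        if not fresh:
            return
        window.extend(fresh)  # the deque drops the oldest rows automatically
        yield list(window)


# Made-up tab-delimited sample standing in for the real file (10 data rows).
sample = "pos\tvalue\n" + "".join("{}\t{}\n".format(i, i * 10) for i in range(10))
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")

# Stand-in calculation: sum the hypothetical `value` column over each window.
sums = [
    sum(int(row["value"]) for row in window)
    for window in sliding_windows(reader, size=4, step=2)
]
```

For the real data, the reader would be built from the opened file as in the loop above, with size=50000 and step=10000, and the stand-in sum replaced by the actual calculation.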
