Processing a huge file [1000 GB or more] with Python

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.

Is there any faster way to do this than the one I am using below?

And how long would it take to complete the task?

phrase = "how fast it is"
count = 0

with open('bigfile.txt') as f:
    for line in f:
        count += line.count(phrase)

If I am right, since I do not have this file in memory, I would need to wait for the PC to load the file each time I do the search, which should take at least 4000 seconds for a 250 MB/s hard drive and a file of 1000 GB.
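As a back-of-envelope check on that estimate (assuming a sustained sequential read speed of 250 MB/s and no caching):

file_size_mb = 1000 * 1024        # 1000 GB expressed in MB
read_speed = 250                  # assumed sustained HDD throughput, MB/s
print(file_size_mb / read_speed)  # ~4096 seconds, i.e. roughly 70 minutes

So any single-disk approach is bounded below by I/O: about an hour per full pass, no matter how fast the counting itself is.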

Solution

I used file.read() to read the data in chunks; in the examples below the chunk sizes were 100 MB, 500 MB, 1 GB, and 2 GB respectively. The size of my text file is 2.1 GB.

Code (Python 2, as timed below):

from functools import partial

def read_in_chunks(size_in_bytes):
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'r+b') as f:
        prev = ''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, ''):
            if not text.endswith('\n'):
                # the chunk ends with a partial line: split it off so it is
                # not used when counting the substring in this chunk.
                text, rest = text.rsplit('\n', 1)
                # prepend the previous partial line, if any.
                text = prev + text
                prev = rest
            else:
                # the chunk ends on a line boundary: simply prepend the
                # previous partial line.
                text = prev + text
                prev = ''
            count += text.count(s)
        count += prev.count(s)
        print count
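The code above is Python 2. Under Python 3, read() on a binary file returns bytes and signals EOF as b'', so the sentinel and the search string change accordingly; here is a minimal, untimed sketch of the same approach, using rpartition so that a chunk containing no newline at all is also handled:

from functools import partial

def read_in_chunks_py3(size_in_bytes):
    # Python 3 sketch of the same chunked approach (assumes the same
    # 'data.txt' test file; not one of the timed runs below).
    s = b'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'rb') as f:
        prev = b''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, b''):   # read() returns b'' at EOF
            # Split off the trailing partial line so the phrase is never
            # counted across a cut; if the chunk has no newline at all,
            # rpartition carries the whole thing over to the next chunk.
            text, _, prev = (prev + text).rpartition(b'\n')
            count += text.count(s)
        count += prev.count(s)
        print(count)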

Timings:

read_in_chunks(104857600)

$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s

read_in_chunks(524288000)

$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s

read_in_chunks(1073741824)

$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s

read_in_chunks(2147483648)

$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s

On the other hand, the simple line-by-line loop takes around 6 seconds on my system:

def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        print sum(line.count(s) for line in f)

$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s

Results of @SlaterTyranus's grep version on my file:

$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt | wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s

Results of @woot's solution:

$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s

I got the best timing with a 100 MB block size:

$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s

Results of woot's second solution:

$ time python woot_thread.py  # CHUNK_SIZE = 1073741824
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s

$ time python woot_thread.py  # CHUNK_SIZE = 2147483648
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s

System Specs: Core i5-4670, 7200 RPM HDD
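The woot_thread.py script itself is not reproduced in the answer. Purely as an illustration (not the actual script), one way to get that shape in Python 3 is a reader loop handing each chunk to a small thread pool, so disk reads overlap with counting:

from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1073741824  # 1 GB, matching the first timed run above
PHRASE = b'Lets say i have a text file of 1000 GB'

def count_with_threads(path):
    # Illustrative sketch only -- NOT the actual woot_thread.py.
    # bytes.count holds the GIL, so the win here comes from overlapping
    # disk reads (which release the GIL) with counting, not from
    # counting several chunks truly in parallel.
    futures = []
    with ThreadPoolExecutor(max_workers=2) as pool, open(path, 'rb') as f:
        prev = b''
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # split off the trailing partial line so no phrase is cut in half
            text, _, prev = (prev + chunk).rpartition(b'\n')
            futures.append(pool.submit(text.count, PHRASE))
        futures.append(pool.submit(prev.count, PHRASE))
    return sum(fut.result() for fut in futures)

print(count_with_threads('data.txt'))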
