Processing a huge file [1000 GB or more] with Python

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.

Is there any faster way to do this than the one I am using below?

And how long would it take to complete the task?

phrase = "how fast it is"
count = 0

with open('bigfile.txt') as f:
    for line in f:
        count += line.count(phrase)

If I am right, since I do not have this file in memory, I would need to wait for the PC to load the file each time I do the search, which should take at least 4000 seconds for a 250 MB/s hard drive and a file of 1000 GB.
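As a back-of-envelope check on that estimate (assuming a sustained sequential read speed of 250 MB/s and no caching):

file_size_mb = 1000 * 1024        # 1000 GB expressed in MB
read_speed = 250                  # assumed sustained HDD throughput, MB/s
print(file_size_mb / read_speed)  # ~4096 seconds, i.e. roughly 70 minutes

So any single-disk approach is bounded below by I/O: about an hour per full pass, no matter how fast the counting itself is.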

Solution

I used file.read() to read the data in chunks; in the examples below the chunk sizes were 100 MB, 500 MB, 1 GB, and 2 GB respectively. The size of my text file is 2.1 GB.

Code (Python 2, as timed below):

from functools import partial

def read_in_chunks(size_in_bytes):
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'r+b') as f:
        prev = ''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, ''):
            if not text.endswith('\n'):
                # the chunk ends with a partial line: split it off so it is
                # not used when counting the substring in this chunk.
                text, rest = text.rsplit('\n', 1)
                # prepend the previous partial line, if any.
                text = prev + text
                prev = rest
            else:
                # the chunk ends on a line boundary: simply prepend the
                # previous partial line.
                text = prev + text
                prev = ''
            count += text.count(s)
        count += prev.count(s)
        print count
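The code above is Python 2. Under Python 3, read() on a binary file returns bytes and signals EOF as b'', so the sentinel and the search string change accordingly; here is a minimal, untimed sketch of the same approach, using rpartition so that a chunk containing no newline at all is also handled:

from functools import partial

def read_in_chunks_py3(size_in_bytes):
    # Python 3 sketch of the same chunked approach (assumes the same
    # 'data.txt' test file; not one of the timed runs below).
    s = b'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'rb') as f:
        prev = b''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, b''):   # read() returns b'' at EOF
            # Split off the trailing partial line so the phrase is never
            # counted across a cut; if the chunk has no newline at all,
            # rpartition carries the whole thing over to the next chunk.
            text, _, prev = (prev + text).rpartition(b'\n')
            count += text.count(s)
        count += prev.count(s)
        print(count)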

Timings:

read_in_chunks(104857600)

$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s

read_in_chunks(524288000)

$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s

read_in_chunks(1073741824)

$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s

read_in_chunks(2147483648)

$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s

On the other hand, the simple line-by-line loop takes around 6 seconds on my system:

def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        print sum(line.count(s) for line in f)

$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s

Results of @SlaterTyranus's grep version on my file:

$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt | wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s

Results of @woot's solution:

$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s

I got the best timing with a 100 MB block size:

$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s

Results of woot's second solution:

$ time python woot_thread.py  # CHUNK_SIZE = 1073741824
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s

$ time python woot_thread.py  # CHUNK_SIZE = 2147483648
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s

System Specs: Core i5-4670, 7200 RPM HDD
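The woot_thread.py script itself is not reproduced in the answer. Purely as an illustration (not the actual script), one way to get that shape in Python 3 is a reader loop handing each chunk to a small thread pool, so disk reads overlap with counting:

from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1073741824  # 1 GB, matching the first timed run above
PHRASE = b'Lets say i have a text file of 1000 GB'

def count_with_threads(path):
    # Illustrative sketch only -- NOT the actual woot_thread.py.
    # bytes.count holds the GIL, so the win here comes from overlapping
    # disk reads (which release the GIL) with counting, not from
    # counting several chunks truly in parallel.
    futures = []
    with ThreadPoolExecutor(max_workers=2) as pool, open(path, 'rb') as f:
        prev = b''
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # split off the trailing partial line so no phrase is cut in half
            text, _, prev = (prev + chunk).rpartition(b'\n')
            futures.append(pool.submit(text.count, PHRASE))
        futures.append(pool.submit(prev.count, PHRASE))
    return sum(fut.result() for fut in futures)

print(count_with_threads('data.txt'))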
