Python grep code is much slower than command-line grep

I'm just grepping some Xliff files for the pattern approved="no". I have a Shell script and a Python script, and the difference in performance is huge (for a set of 393 files, and a total of 3,686,329 lines, 0.1s user time for the Shell script, and 6.6s for the Python script).

Shell: grep 'approved="no"' FILE

Python:

import codecs
import re

def grep(pattern, file_path):
    ret = False
    with codecs.open(file_path, "r", encoding="utf-8") as f:
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret

Any ideas to improve performance with a multiplatform solution?

Results

Here are a couple of results after applying some of the proposed solutions.

Tests were run on a RHEL6 Linux machine, with Python 2.6.6.

Working set: 393 Xliff files, 3,686,329 lines in total.

Numbers are user time in seconds.

grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s

Solution

Python, being an interpreted language, will always be slower than grep's compiled C implementation.

Apart from that, your Python implementation is not the same as your grep example. It is not returning the matching lines; it is merely testing whether the pattern matches the characters on any one line. A closer comparison would be:

grep -q 'approved="no"' FILE

which will return as soon as a match is found and not produce any output.
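For completeness, that same early-exit comparison can be driven from Python with subprocess. This is a sketch of my own (the `grep_shell` name is mine, and it still depends on an external grep being installed, so it is not a multiplatform answer to the question):

```python
import subprocess

def grep_shell(pattern, file_path):
    # grep -q prints nothing and exits with status 0 as soon as a
    # match is found; status 1 means no match, and anything higher
    # signals an error (e.g. the file does not exist).
    status = subprocess.call(["grep", "-q", pattern, file_path])
    return status == 0
```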

You can substantially speed up your code by writing your grep() function more efficiently:

import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True

This uses io instead of codecs, which I found was a little faster. The while loop condition does not need to check ret, and you can return from the function as soon as the result is known. There's no need to run re.search() for each individual line - just join the lines and perform a single search.
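As a usage sketch, here is how grep_1 might be driven over a directory of Xliff files. The `find_unapproved` helper and the `*.xlf` glob are my own illustration, not part of the answer:

```python
import glob
import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True

def find_unapproved(directory):
    # Return the paths of all .xlf files in `directory` that still
    # contain an unapproved translation unit.
    pattern = 'approved="no"'
    return [path for path in sorted(glob.glob(directory + "/*.xlf"))
            if grep_1(pattern, path)]
```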

At the cost of memory usage you could try this:

import io
import re

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())

If memory is an issue you could mmap the file and run the regex search on the mmap:

import io
import mmap
import re

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern,
                         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.
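One portability caveat (my note, not part of the original answer, which targets Python 2): under Python 3 an mmap exposes raw bytes, so the file has to be opened in binary mode and the pattern encoded to bytes before searching. A sketch, assuming the files on disk are UTF-8:

```python
import io
import mmap
import re

def grep_3_py3(pattern, file_path):
    # Open in binary mode; the mmap yields bytes, so encode the
    # (text) pattern with the same encoding as the files on disk.
    with io.open(file_path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return re.search(pattern.encode("utf-8"), mapped) is not None
        finally:
            mapped.close()
```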

Using timeit for each of these functions shows that this is the case:

10 loops, best of 3: 639 msec per loop    # grep()
10 loops, best of 3: 78.7 msec per loop   # grep_1()
10 loops, best of 3: 19.4 msec per loop   # grep_2()
100 loops, best of 3: 5.32 msec per loop  # grep_3()

The file was /usr/share/dict/words containing approx 480,000 lines and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when pattern is near the start of the file, e.g. abaciscus, the times are:

10 loops, best of 3: 62.6 msec per loop     # grep()
1000 loops, best of 3: 1.6 msec per loop    # grep_1()
100 loops, best of 3: 14.2 msec per loop    # grep_2()
10000 loops, best of 3: 37.2 usec per loop  # grep_3()

which again shows that the mmap version is fastest.
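Timings like the ones above can be reproduced with a small harness around the timeit module. The `time_call` helper is my own sketch, not the answer's exact invocation; `timeit.timeit` accepts a zero-argument callable directly:

```python
import timeit

def time_call(func, number=100):
    # Run `func` `number` times and report milliseconds per call,
    # roughly matching the "msec per loop" figures quoted above.
    total = timeit.timeit(func, number=number)
    return total / number * 1000.0

# e.g. time_call(lambda: grep_3('zymurgies', '/usr/share/dict/words'))
```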

Now comparing the grep command with the Python mmap version:

$ time grep -q zymurgies /usr/share/dict/words

real    0m0.010s
user    0m0.007s
sys     0m0.003s

$ time python x.py grep_3   # uses mmap

real    0m0.023s
user    0m0.019s
sys     0m0.004s

That is not too bad, considering the advantages that grep has.
