Python grep code is much slower than command-line grep

I'm just grepping some Xliff files for the pattern approved="no". I have a Shell script and a Python script, and the difference in performance is huge (for a set of 393 files, and a total of 3,686,329 lines, 0.1s user time for the Shell script, and 6.6s for the Python script).

Shell: grep 'approved="no"' FILE

Python:

import codecs
import re

def grep(pattern, file_path):
    ret = False
    with codecs.open(file_path, "r", encoding="utf-8") as f:
        while 1 and not ret:
            lines = f.readlines(100000)
            if not lines:
                break
            for line in lines:
                if re.search(pattern, line):
                    ret = True
                    break
    return ret

Any ideas to improve performance with a multiplatform solution?

Results

Here are a couple of results after applying some of the proposed solutions.

Tests were run on a RHEL6 Linux machine, with Python 2.6.6.

Working set: 393 Xliff files, 3,686,329 lines in total.

Numbers are user time in seconds.

grep_1 (io, joining 100,000 file lines): 50s
grep_3 (mmap): 0.7s
Shell version (Linux grep): 0.130s

Solution

Python, being an interpreted language, will always be slower than grep's compiled C implementation.

Apart from that, your Python implementation is not the same as your grep example. It is not returning the matching lines; it is merely testing whether the pattern matches the characters on any one line. A closer comparison would be:

grep -q 'approved="no"' FILE

which will return as soon as a match is found and not produce any output.
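For completeness, that same early-exit comparison can be driven from Python with subprocess. This is a sketch of my own (the `grep_shell` name is mine, and it still depends on an external grep being installed, so it is not a multiplatform answer to the question):

```python
import subprocess

def grep_shell(pattern, file_path):
    # grep -q prints nothing and exits with status 0 as soon as a
    # match is found; status 1 means no match, and anything higher
    # signals an error (e.g. the file does not exist).
    status = subprocess.call(["grep", "-q", pattern, file_path])
    return status == 0
```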

You can substantially speed up your code by writing your grep() function more efficiently:

import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True

This uses io instead of codecs, which I found was a little faster. The while loop condition does not need to check ret, and you can return from the function as soon as the result is known. There's no need to run re.search() for each individual line - just join the lines and perform a single search.
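As a usage sketch, here is how grep_1 might be driven over a directory of Xliff files. The `find_unapproved` helper and the `*.xlf` glob are my own illustration, not part of the answer:

```python
import glob
import io
import re

def grep_1(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        while True:
            lines = f.readlines(100000)
            if not lines:
                return False
            if re.search(pattern, ''.join(lines)):
                return True

def find_unapproved(directory):
    # Return the paths of all .xlf files in `directory` that still
    # contain an unapproved translation unit.
    pattern = 'approved="no"'
    return [path for path in sorted(glob.glob(directory + "/*.xlf"))
            if grep_1(pattern, path)]
```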

At the cost of memory usage you could try this:

import io
import re

def grep_2(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern, f.read())

If memory is an issue you could mmap the file and run the regex search on the mmap:

import io
import mmap
import re

def grep_3(pattern, file_path):
    with io.open(file_path, "r", encoding="utf-8") as f:
        return re.search(pattern,
                         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

mmap will efficiently read the data from the file in pages without consuming a lot of memory. Also, you'll probably find that mmap runs faster than the other solutions.
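One portability caveat (my note, not part of the original answer, which targets Python 2): under Python 3 an mmap exposes raw bytes, so the file has to be opened in binary mode and the pattern encoded to bytes before searching. A sketch, assuming the files on disk are UTF-8:

```python
import io
import mmap
import re

def grep_3_py3(pattern, file_path):
    # Open in binary mode; the mmap yields bytes, so encode the
    # (text) pattern with the same encoding as the files on disk.
    with io.open(file_path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return re.search(pattern.encode("utf-8"), mapped) is not None
        finally:
            mapped.close()
```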

Using timeit for each of these functions shows that this is the case:

10 loops, best of 3: 639 msec per loop    # grep()
10 loops, best of 3: 78.7 msec per loop   # grep_1()
10 loops, best of 3: 19.4 msec per loop   # grep_2()
100 loops, best of 3: 5.32 msec per loop  # grep_3()

The file was /usr/share/dict/words containing approx 480,000 lines and the search pattern was zymurgies, which occurs near the end of the file. For comparison, when pattern is near the start of the file, e.g. abaciscus, the times are:

10 loops, best of 3: 62.6 msec per loop     # grep()
1000 loops, best of 3: 1.6 msec per loop    # grep_1()
100 loops, best of 3: 14.2 msec per loop    # grep_2()
10000 loops, best of 3: 37.2 usec per loop  # grep_3()

which again shows that the mmap version is fastest.
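Timings like the ones above can be reproduced with a small harness around the timeit module. The `time_call` helper is my own sketch, not the answer's exact invocation; `timeit.timeit` accepts a zero-argument callable directly:

```python
import timeit

def time_call(func, number=100):
    # Run `func` `number` times and report milliseconds per call,
    # roughly matching the "msec per loop" figures quoted above.
    total = timeit.timeit(func, number=number)
    return total / number * 1000.0

# e.g. time_call(lambda: grep_3('zymurgies', '/usr/share/dict/words'))
```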

Now comparing the grep command with the Python mmap version:

$ time grep -q zymurgies /usr/share/dict/words

real    0m0.010s
user    0m0.007s
sys     0m0.003s

$ time python x.py grep_3   # uses mmap

real    0m0.023s
user    0m0.019s
sys     0m0.004s

That is not too bad, considering the advantages that grep has.
