Python re (regular expression) handling

Compiled vs. uncompiled regular expressions

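With a compiled pattern, Python does not have to process the pattern string again on every call, so matching inside a loop is faster. (The module-level re functions do cache recently compiled patterns, so in practice the gap is mostly the per-call cache lookup.)
Loosely speaking, this is the same reason compiled programs tend to run faster than interpreted ones: the translation work is done once, up front.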
  • test_re_nocompile.py
#!/usr/bin/python
# _*_ coding: UTF-8 _*_
from __future__ import print_function
import re
import time

def main():
    time_start = time.time()
    pattern = "[0-9]+"
    with open('test/1.txt') as f:
        for line in f:
            print(re.findall(pattern,line))
    time_end = time.time()
    print('{:_^+10.4f}'.format(time_end-time_start))
    #print(": {:_^+10.4f}".format(3.1415926))

if __name__ == '__main__':
    main()
  • test_re_compile.py
#!/usr/bin/python
# _*_ coding: UTF-8 _*_
from __future__ import print_function
import re
import time


def main():
    time_start = time.time()
    pattern = "[0-9]+"
    re_obj = re.compile(pattern)
    with open('test/1.txt') as f:
        for line in f:
            print(re_obj.findall(line))
    time_end = time.time()
    print('{:_^10.4f}'.format(time_end-time_start))


if __name__ == '__main__':
    main()
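
The two scripts above also time file reading and print, which can hide the difference between the two approaches. A minimal sketch that isolates the regex calls with timeit is shown below. The LINES sample data and the function names are made up for illustration and stand in for the contents of test/1.txt; the actual numbers will vary by machine and Python version.

```python
import re
import timeit

PATTERN = r"[0-9]+"
# Hypothetical sample input standing in for the lines of test/1.txt.
LINES = ["order 66 shipped 2018-04-25", "no digits here", "7 of 9"] * 1000


def scan_uncompiled(lines):
    # re.findall looks the pattern up in the module's cache on every call.
    return [re.findall(PATTERN, line) for line in lines]


def scan_compiled(lines):
    # Compile once, then call the method on the pattern object directly.
    digits = re.compile(PATTERN)
    return [digits.findall(line) for line in lines]


if __name__ == "__main__":
    t_plain = timeit.timeit(lambda: scan_uncompiled(LINES), number=20)
    t_compiled = timeit.timeit(lambda: scan_compiled(LINES), number=20)
    print("uncompiled: {:.4f}s".format(t_plain))
    print("compiled:   {:.4f}s".format(t_compiled))
```

Both functions return the same result; only the way the pattern is reached differs, so any timing gap comes from the re calls themselves.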

Commonly used re methods

Matching functions

Without further ado, here is the code (IPython):
In [1]: import re

In [2]: data = "What is the difference between python 2.7.13 and 3.6.0"

In [3]: re.findall(r"[0-9]\.[0-9]\.[0-9]",data)
Out[3]: ['2.7.1', '3.6.0']

In [4]: re.findall(r"\d\.\d\.\d",data)
Out[4]: ['2.7.1', '3.6.0']

In [5]: re.findall(r"Python [0-9]\.[0-9]\.[0-9]", data, flags=re.IGNORECASE)
Out[5]: ['python 2.7.1']

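findall returns every match of the pattern in data. Note that [0-9] and \d each match a single digit, which is why 2.7.13 comes back as 2.7.1.
flags=re.IGNORECASE makes the match case-insensitive.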
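
To make the single-digit point concrete, here is a small sketch using the same data string. It compares \d with \d+ and also shows what findall returns once the pattern contains capture groups. The variable names are only illustrative.

```python
import re

data = "What is the difference between python 2.7.13 and 3.6.0"

# \d matches exactly one digit, so "2.7.13" is cut short to "2.7.1".
single_digit = re.findall(r"\d\.\d\.\d", data)

# \d+ matches one or more digits per component and keeps the full version.
full_version = re.findall(r"\d+\.\d+\.\d+", data)

# With capture groups, findall returns one tuple of groups per match.
parts = re.findall(r"(\d+)\.(\d+)\.(\d+)", data)

print(single_digit)  # ['2.7.1', '3.6.0']
print(full_version)  # ['2.7.13', '3.6.0']
print(parts)         # [('2', '7', '13'), ('3', '6', '0')]
```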

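match: matching at the start of a string

You surely know str.startswith. I think of match as an enhanced version of it: it brings in regular expressions, which makes matching at the start of a string much more flexible.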

In [9]: s = "12345上山打老虎"

In [10]: re.match(r'\d+',s)
Out[10]: <_sre.SRE_Match object; span=(0, 5), match='12345'>

# startswith is not much help in a case like this. The match object returned by re.match also has some useful methods and attributes

In [5]: import re

In [6]: s = "12345上山打老虎"

In [7]: r = re.match(r'\d+',s)

In [8]: r.start()
Out[8]: 0

In [9]: r.end()
Out[9]: 5

In [10]: r.string
Out[10]: '12345上山打老虎'

In [11]: r.group()
Out[11]: '12345'
  • search deserves a mention too: it scans the whole string and returns the first match, wherever it is
In [12]: re.search("山",s)
Out[12]: <_sre.SRE_Match object; span=(6, 7), match='山'>

In [13]: r = re.search("山",s)

In [14]: r.start()
Out[14]: 6

In [15]: r.string
Out[15]: '12345上山打老虎'

In [16]: r.group()
Out[16]: '山'
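
The difference between match and search is easiest to see side by side. The sketch below reuses the same string and adds re.fullmatch for comparison; leading_number is a hypothetical helper added for illustration, showing the usual check for None before calling group().

```python
import re

s = "12345上山打老虎"

# match only tries at position 0, so a pattern that occurs later fails.
print(re.match("山", s))                   # None
# search scans forward and returns the first match anywhere in the string.
print(re.search("山", s).span())           # (6, 7)
# fullmatch requires the pattern to cover the entire string.
print(re.fullmatch(r"\d+", s))             # None
print(re.fullmatch(r"\d+.*", s).group())   # 12345上山打老虎


def leading_number(text):
    """Return the leading run of digits as an int, or None if there is none."""
    m = re.match(r"\d+", text)
    return int(m.group()) if m else None


print(leading_number(s))      # 12345
print(leading_number("abc"))  # None
```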
  • finditer is similar to findall, but returns an iterator of match objects
In [17]: data = "What is the difference between python 2.7.13 and Python 3.6.0" 

In [18]: r = re.finditer(r"\d+\.\d+\.\d+", data)

In [19]: for it in r:
    ...:     print(it.group(0))
    ...:     
2.7.13
3.6.0
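
Because finditer yields match objects rather than plain strings, each hit carries its position and its groups. A short sketch, assuming we want the version components by name; the group names major, minor and patch are my own choice:

```python
import re

data = "What is the difference between python 2.7.13 and Python 3.6.0"

# Named groups make each component addressable by name on the match object.
version_re = re.compile(r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)")

versions = []
for m in version_re.finditer(data):
    # span() gives the (start, end) offsets of this match inside data.
    versions.append((m.group("major"), m.group("minor"), m.group("patch"), m.span()))

for major, minor, patch, (start, end) in versions:
    print(major, minor, patch, data[start:end])
```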

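"Modifying" functions

If you come from C++ and its std::string, the Python method for substituting text inside a string is str.replace, which only handles literal text. For more customizable, pattern-based substitution there is re.sub.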

In [28]: data = "What is the difference between python 2.7.13 and Python 3.6.0" 

In [29]: data.replace("Python","*****")
Out[29]: 'What is the difference between python 2.7.13 and ***** 3.6.0'


In [35]: re.sub(r"\d\.\d\.\d+","*.*.*",data,flags=re.IGNORECASE)
Out[35]: 'What is the difference between python *.*.* and Python *.*.*'

In [36]: re.sub(r"python \d\.\d\.\d+","*.*.*",data,flags=re.IGNORECASE)
Out[36]: 'What is the difference between *.*.* and *.*.*'
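  • I wanted to pass flags=re.IGNORECASE to replace, but flags belong to the re module, so str.replace does not accept them
  • re.sub can also rearrange captured groups through backreferences such as \1, \2 and \3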
In [40]: text = "Today is 25/4/2018. PyCon starts 5/25/2017"

In [41]: re.sub(r"(\d+)/(\d+)/(\d+)",r'\3-\1-\2', text)
Out[41]: 'Today is 2018-25-4. PyCon starts 2017-5-25'
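
When a backreference template is not enough, re.sub also accepts a function as the replacement. The sketch below keeps the same \3-\1-\2 ordering as the session above but zero-pads the two short fields. pad_date is a hypothetical helper, and since the sample text mixes day/month and month/day order, the output is not a real ISO date.

```python
import re

text = "Today is 25/4/2018. PyCon starts 5/25/2017"

date_re = re.compile(r"(\d+)/(\d+)/(\d+)")


def pad_date(m):
    # Keep the original \3-\1-\2 ordering, but zero-pad the last two fields.
    first, second, year = m.groups()
    return "{}-{:0>2}-{:0>2}".format(year, first, second)


padded = date_re.sub(pad_date, text)
print(padded)
# Today is 2018-25-04. PyCon starts 2017-05-25
```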
  • Remember split? The re module has a regex-based version of it too
In [90]: text = "MySQL slave binlog position: master host '10.173.33.35',filename 'mysql-bin.000002', positon '524993060'"

In [91]: re.split(r"[':,\s]+",text.strip("'"))
Out[91]: 
['MySQL',
 'slave',
 'binlog',
 'position',
 'master',
 'host',
 '10.173.33.35',
 'filename',
 'mysql-bin.000002',
 'positon',
 '524993060']
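
Two more re.split features are worth knowing: maxsplit, and the fact that a capture group in the pattern keeps the delimiters in the result. A minimal sketch on a made-up key=value line (the line string is an assumption, not taken from the log above):

```python
import re

# Hypothetical input line, loosely modelled on the binlog message above.
line = "host=10.173.33.35; file=mysql-bin.000002;pos=524993060"

# Split on a semicolon followed by optional whitespace.
fields = re.split(r";\s*", line)
# ['host=10.173.33.35', 'file=mysql-bin.000002', 'pos=524993060']

# maxsplit limits the number of splits; the remainder stays in one piece.
head, rest = re.split(r";\s*", line, maxsplit=1)

# A capture group in the pattern keeps the delimiters in the result.
tokens = re.split(r"(=)", "pos=524993060")
# ['pos', '=', '524993060']

# Build a dict from the key=value fields.
info = dict(field.split("=", 1) for field in fields)
print(info)
```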

Tip:

(This only works in IPython.)
In [82]: temp = !ls -al | grep test.py

In [83]: temp
Out[83]: ['-rw-r--r--   1 root root    14 Apr 25 16:53 test.py']
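
Outside IPython the ! syntax is not available. A rough equivalent in a plain script, assuming Python 3.7 or later and a Unix-like system for the ls call, can be sketched with subprocess; run_lines is a hypothetical helper name.

```python
import subprocess


def run_lines(cmd):
    """Run cmd (a list of arguments) and return its stdout as a list of lines."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


# Equivalent of `!ls -al | grep test.py` on a Unix-like system:
# matches = [line for line in run_lines(["ls", "-al"]) if "test.py" in line]
```

Filtering in Python instead of piping to grep avoids going through a shell.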

Case-insensitive matching

No need to go over this again: just pass flags=re.IGNORECASE.
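
For reference, the flag can be supplied in three ways, and it also gives a workaround for the missing case-insensitive str.replace. The sketch below is one possible approach; replace_ignore_case is a hypothetical helper, not something the re module provides.

```python
import re

data = "What is the difference between python 2.7.13 and Python 3.6.0"

# The flag can be passed per call...
print(re.findall(r"python", data, flags=re.IGNORECASE))  # ['python', 'Python']
# ...or baked into a compiled pattern (re.I is the short alias)...
py_re = re.compile(r"python", re.I)
print(py_re.findall(data))                               # ['python', 'Python']
# ...or written inline at the start of the pattern.
print(re.findall(r"(?i)python", data))                   # ['python', 'Python']


def replace_ignore_case(text, old, new):
    # re.escape makes `old` match literally, like str.replace would.
    # A callable replacement keeps `new` literal (no backslash processing).
    return re.sub(re.escape(old), lambda m: new, text, flags=re.IGNORECASE)


print(replace_ignore_case(data, "PYTHON", "*****"))
# What is the difference between ***** 2.7.13 and ***** 3.6.0
```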

Greedy and non-greedy matching

In [98]: text = "Beautiful is better than ugly.Explicit is better than implicit."

In [99]: re.findall(r"Beautiful.*\.",text)
Out[99]: ['Beautiful is better than ugly.Explicit is better than implicit.']

In [100]: re.findall(r"Beautiful.*?\.",text)
Out[100]: ['Beautiful is better than ugly.']
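
The same effect shows up whenever a pattern has a wildcard between two delimiters. A small sketch on a made-up HTML fragment (the html string is an assumption), with a negated character class as an alternative to the non-greedy quantifier:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)
# Non-greedy: .+? stops at the first '>' that lets the match succeed.
lazy = re.findall(r"<.+?>", html)
# A negated character class gets the same result without relying on laziness.
negated = re.findall(r"<[^>]+>", html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```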

Bonus: a crawler demo

In [120]: import requests

In [121]: import re

In [122]: r = requests.get('https://news.ycombinator.com')

In [124]: re.findall('"https?://.*?"', str(r.content))
Out[124]: 
['"https://news.ycombinator.com"',
 '"http://conference.startupschool.org/"',
 '"http://norvig.com/spell-correct.html"',
 ... ...
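
The one-liner above returns the URLs with their quotes and with duplicates. A slightly tidier sketch, run here on a made-up HTML snippet so it works offline, is shown below. extract_links and the sample string are assumptions for illustration; with requests, r.text could be passed in instead. Regex-based link extraction is fragile on real pages, so treat this as a demo rather than a robust parser.

```python
import re

# Matches href="http(s)://..." and captures the URL without the quotes.
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def extract_links(html):
    """Return the unique absolute http(s) links in html, in order of appearance."""
    seen = set()
    links = []
    for url in LINK_RE.findall(html):
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Hypothetical page snippet standing in for a downloaded page.
sample = (
    '<a href="https://news.ycombinator.com">HN</a>'
    '<a href="http://norvig.com/spell-correct.html">spell</a>'
    '<a href="https://news.ycombinator.com">HN again</a>'
)
print(extract_links(sample))
# ['https://news.ycombinator.com', 'http://norvig.com/spell-correct.html']
```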

Multithreaded crawler
