Lesson 1: Regular expression symbols and methods
1.
. : matches any single character except a newline:
>>> import re
>>> a='xy123'
>>> b=re.findall('x',a)
>>> b
['x']
>>> b=re.findall('x...',a)
>>> b
['xy12']
So "." is a placeholder: each "." stands for exactly one character.
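To make the "except a newline" part concrete, here is a minimal sketch. The sample string and the variable names are made up for illustration and are not from the lesson.

```python
import re

# Hypothetical sample: the second "x" is followed directly by a newline.
text = "xab\nx\ncd"

# "x.." means: an "x" followed by any two characters that are not "\n".
three_chars = re.findall("x..", text)
print(three_chars)  # ['xab']
```

The second "x" yields nothing because the character right after it is "\n", which "." refuses to match by default.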
2.
* : matches the preceding character zero or more times:
>>> import re
>>> a='xy123'
>>> b=re.findall('x*',a)
>>> b
['x', '', '', '', '', '']
>>> a='xyx123'
>>> b=re.findall('x*',a)
>>> b
['x', '', 'x', '', '', '', '']
3.
? : matches the preceding character zero times or once:
>>> b=re.findall('x?',a)
>>> b
['x', '', 'x', '', '', '', '']
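The empty strings in the results above come from positions where "zero occurrences" is a valid match. The sketch below uses a hypothetical string with a doubled "x" so that "*" and "?" actually differ, and adds "+" (one or more) for contrast. The outputs assume Python 3.7 or later.

```python
import re

# Hypothetical sample with two consecutive "x" characters.
a = "xxyx"

zero_or_more = re.findall("x*", a)  # takes the whole run "xx" at once
zero_or_one = re.findall("x?", a)   # takes at most one "x" per match
one_or_more = re.findall("x+", a)   # never matches the empty string

print(zero_or_more)  # ['xx', '', 'x', '']
print(zero_or_one)   # ['x', 'x', '', 'x', '']
print(one_or_more)   # ['xx', 'x']
```

If the empty strings are unwanted, "+" is usually the simpler choice.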
4.
.* : greedy matching (grabs as much as possible):
>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx.*xx',a)
>>> b
['xxIxxhyhxxlovexxhhhxxyouxx']
5.
.*? : non-greedy matching (grabs as little as possible):
>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx.*?xx',a)
>>> b
['xxIxx', 'xxlovexx', 'xxyouxx']
6.
() : capture group; only the part inside the parentheses is returned:
>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx(.*?)xx',a)
>>> b
['I', 'love', 'you']
The target has been extracted: I love you
Here is another example:
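As a side-by-side sketch of points 4 to 6, the snippet below runs the greedy and non-greedy forms with the same capture group on the lesson's string. The variable names greedy, lazy and sentence are illustrative only.

```python
import re

a = "fffxxIxxhyhxxlovexxhhhxxyouxxghh"

# Greedy: the group stretches from the first "xx" to the last "xx".
greedy = re.findall("xx(.*)xx", a)

# Non-greedy: the group stops at the nearest closing "xx".
lazy = re.findall("xx(.*?)xx", a)

sentence = " ".join(lazy)

print(greedy)    # ['Ixxhyhxxlovexxhhhxxyou']
print(lazy)      # ['I', 'love', 'you']
print(sentence)  # I love you
```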
import re
s='ffsdxxhello\nxxfgfgxxworldxxhffh'
d=re.findall('xx(.*?)xx',s)
Result:
>>> d
['fgfg']
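Note that the string contains a newline, so only the match on the second line was found. (Our goal was to find hello and world.)
So how can this be avoided?
Answer: pass the re.S flag, which makes "." match newlines as well.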
import re
s='ffsdxxhello\nxxfgfgxxworldxxhffh'
d=re.findall('xx(.*?)xx',s,re.S)
Result:
>>> d
['hello\n', 'world']
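The two runs can be compared in one short script. This is a sketch; re.DOTALL is the long name of the same flag as re.S, and the final strip() step is an optional cleanup added here, not part of the lesson.

```python
import re

s = "ffsdxxhello\nxxfgfgxxworldxxhffh"

# Default: "." cannot cross the newline, so "hello" is never closed by "xx".
without_flag = re.findall("xx(.*?)xx", s)

# re.S (alias of re.DOTALL): "." also matches "\n".
with_flag = re.findall("xx(.*?)xx", s, re.S)

# Optional cleanup: drop the newline captured along with "hello".
cleaned = [piece.strip() for piece in with_flag]

print(without_flag)  # ['fgfg']
print(with_flag)     # ['hello\n', 'world']
print(cleaned)       # ['hello', 'world']
```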
Next, compare findall with search:
>>> s2='asssxxIxx123xxlovexxdh'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(2)
>>> f
'love'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(1)
>>> f
'I'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(0)
>>> f
'xxIxx123xxlovexx'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(3)
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(3)
IndexError: no such group
whereas findall returns every match as a list, with one tuple of groups per match:
>>> f=re.findall('xx(.*?)xx123xx(.*?)xx',s2)
>>> f
[('I', 'love')]
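One practical difference is worth a sketch: search returns a match object for the first match only, or None when nothing matches, so calling .group() on it directly can fail. The pattern absent_pattern below is a made-up example that does not occur in s2.

```python
import re

s2 = "asssxxIxx123xxlovexxdh"
pattern = "xx(.*?)xx123xx(.*?)xx"

# search: first match only, returned as a match object (or None).
match = re.search(pattern, s2)
whole = match.group(0) if match else None
first_groups = match.groups() if match else ()

# findall: every match, as a list of group tuples.
all_matches = re.findall(pattern, s2)

# Hypothetical pattern that is not present in s2.
absent_pattern = "xx(.*?)xx999xx(.*?)xx"
no_match = re.search(absent_pattern, s2)

print(whole)         # xxIxx123xxlovexx
print(first_groups)  # ('I', 'love')
print(all_matches)   # [('I', 'love')]
print(no_match)      # None
```

Checking for None before calling .group() avoids an AttributeError when the pattern is not found, while findall simply returns an empty list in that case.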
Next, how to use sub:
sub replaces the text that matches the pattern.
>>> s='123hfhdfhdxhdhd123'
>>> output=re.sub('123(.*?)123','123789123',s)
>>> output
'123789123'
>>> output=re.sub('123(.*?)123','123%d123'%789,s)
>>> output
'123789123'
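In the calls above the capture group is not actually used. As a sketch of two standard re.sub features, the snippet below reuses the captured text with the backreference \1 and limits the number of replacements with count. The second sample string is made up for illustration.

```python
import re

s = "123hfhdfhdxhdhd123"

# Plain replacement, as in the lesson.
replaced = re.sub("123(.*?)123", "123789123", s)

# \1 in the replacement stands for whatever group 1 captured.
kept = re.sub(r"123(.*?)123", r"[\1]", s)

# count=1 replaces only the first match (hypothetical sample string).
once = re.sub(r"\d", "#", "a1b2c3", count=1)

print(replaced)  # 123789123
print(kept)      # [hfhdfhdxhdhd]
print(once)      # a#b2c3
```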
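For simple one-off patterns like these, it is best not to bother with compile; the re module already caches recently compiled patterns.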
A special pattern for matching digits only:
\d+
>>> a='dsgdgd1112255555555555hdhdgdgd'
>>> c=re.findall(r'(\d+)',a)
>>> c
['1112255555555555']
>>> b='dghgd11111111ysdysdys2222223ddh'
>>> dc=re.findall(r'(\d+)',b)
>>> dc
['11111111', '2222223']
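findall returns the digit runs as strings. A small sketch of a common next step, converting them to integers, is shown below. The raw-string prefix r keeps Python from interpreting the backslash itself, and the names digit_runs and numbers are illustrative.

```python
import re

b = "dghgd11111111ysdysdys2222223ddh"

# \d+ matches one or more consecutive digits; no parentheses are needed
# because the whole match is what we want.
digit_runs = re.findall(r"\d+", b)

# The results are strings, so convert them for arithmetic.
numbers = [int(run) for run in digit_runs]

print(digit_runs)  # ['11111111', '2222223']
print(numbers)     # [11111111, 2222223]
```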