Web Scraping Lesson 1: Regular Expression Symbols and Methods


付修磊


Article link: https://blog.csdn.net/qq_38669138/article/details/80418348


Lesson 1: Regular Expression Symbols and Methods

1.

. : matches any single character except a newline:

>>> import re

>>> a='xy123'

>>> b=re.findall('x',a)

>>> b

['x']

>>> b=re.findall('x...',a)
>>> b

['xy12']

So "." is a placeholder: each "." stands for exactly one character.
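To make the "placeholder" idea concrete, here is a minimal sketch that also checks the newline exception. The test strings other than 'xy123' are made up for illustration.

```python
import re

# Each "." consumes exactly one character, whatever it is (except "\n").
three_after_x = re.findall('x...', 'xy123')   # ['xy12']

# Without flags, "." refuses to match a newline, so nothing is found here.
across_newline = re.findall('x.y', 'x\ny')    # []

# If fewer than three characters follow "x", the pattern cannot match at all.
too_short = re.findall('x...', 'xy1')         # []

print(three_after_x, across_newline, too_short)
```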

2.

* : matches the preceding character zero or more times:

>>> import re
>>> a='xy123'
>>> b=re.findall('x*',a)
>>> b

['x', '', '', '', '', '']

>>> a='xyx123'
>>> b=re.findall('x*',a)
>>> b
['x', '', 'x', '', '', '', '']

3.

? : matches the preceding character zero or one time:

>>> b=re.findall('x?',a)
>>> b
['x', '', 'x', '', '', '', '']
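The empty strings in the results above appear because "zero occurrences" also counts as a match at every position where no x starts. The difference between * and ? is easier to see when the quantifier follows another literal. The following is a small sketch with a made-up test string:

```python
import re

text = 'a ab abb abbb'

# "b*" allows zero or more "b" after the "a".
star = re.findall('ab*', text)       # ['a', 'ab', 'abb', 'abbb']

# "b?" allows at most one "b" after the "a".
optional = re.findall('ab?', text)   # ['a', 'ab', 'ab', 'ab']

# A pattern that can match nothing (such as 'x*') yields '' wherever no "x"
# starts; dropping the empty strings is one simple way to tidy the result.
non_empty = [m for m in re.findall('x*', 'xyx123') if m]   # ['x', 'x']

print(star, optional, non_empty)
```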

4.

.* : greedy matching (matches as much as possible):

>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx.*xx',a)
>>> b

['xxIxxhyhxxlovexxhhhxxyouxx']

5.

.*? : non-greedy matching (matches as little as possible):

>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx.*?xx',a)
>>> b
['xxIxx', 'xxlovexx', 'xxyouxx']
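The same contrast shows up when pulling tags out of a page, which is the usual scraping situation. Below is a minimal sketch; the HTML snippet is invented for illustration.

```python
import re

# Hypothetical HTML fragment, not taken from any real page.
html = '<b>one</b> and <b>two</b>'

# Greedy: ".*" runs to the LAST "</b>", so both tags merge into one match.
greedy = re.findall('<b>.*</b>', html)
# ['<b>one</b> and <b>two</b>']

# Non-greedy: ".*?" stops at the FIRST "</b>" it can reach.
lazy = re.findall('<b>.*?</b>', html)
# ['<b>one</b>', '<b>two</b>']

print(greedy)
print(lazy)
```

For scraping, the non-greedy form is usually the one you want, since each item on a page should be a separate match.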

6.

() : capture group; findall returns only the text matched inside the parentheses:

>>> a
'fffxxIxxhyhxxlovexxhhhxxyouxxghh'
>>> b=re.findall('xx(.*?)xx',a)
>>> b
['I', 'love', 'you']

The target text has been extracted: I love you
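In a crawler, the "xx" markers are typically replaced by the fixed text surrounding the data you want. The sketch below assumes a made-up HTML fragment; regular expressions are fragile on real-world HTML, so treat it as an illustration of grouping only.

```python
import re

# Hypothetical HTML fragment used only for illustration.
html = '<a href="/page1">First</a><a href="/page2">Second</a>'

# Without parentheses, findall returns the whole match, markers included.
whole = re.findall('href=".*?"', html)
# ['href="/page1"', 'href="/page2"']

# With parentheses, only the captured part is returned.
links = re.findall('href="(.*?)"', html)    # ['/page1', '/page2']
titles = re.findall('>(.*?)</a>', html)     # ['First', 'Second']

print(whole, links, titles)
```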

Let's look at another example:

import re
s='ffsdxxhello\nxxfgfgxxworldxxhffh'

d=re.findall('xx(.*?)xx',s)

Result:

>>> d

['fgfg']

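Note that the string contains a newline. Because "." does not match a newline, the first pair of xx markers is skipped and only the match on the second line is found. (Our goal was to find hello and world.)

So how do we avoid this?

Answer: pass the re.S flag (also spelled re.DOTALL), which lets "." match newlines as well: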

import re
s='ffsdxxhello\nxxfgfgxxworldxxhffh'

d=re.findall('xx(.*?)xx',s,re.S)

Result:

>>> d

['hello\n', 'world']
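The two calls can be put side by side as follows. The strip() step is an addition of mine, not part of the original example; it simply removes the trailing newline that re.S allows into the first capture.

```python
import re

s = 'ffsdxxhello\nxxfgfgxxworldxxhffh'

# Default behaviour: "." stops at "\n", so the first pair of markers is skipped.
without_flag = re.findall('xx(.*?)xx', s)        # ['fgfg']

# re.S (an alias of re.DOTALL) lets "." cross line boundaries.
with_flag = re.findall('xx(.*?)xx', s, re.S)     # ['hello\n', 'world']

# Optional clean-up: drop surrounding whitespace, including the newline.
cleaned = [m.strip() for m in with_flag]         # ['hello', 'world']

print(without_flag, with_flag, cleaned)
```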

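Next, compare findall and search. search returns only the first match as a match object; group(n) then selects the n-th capture group, and group(0) is the whole match: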

>>> s2='asssxxIxx123xxlovexxdh'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(2)
>>> f
'love'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(1)
>>> f
'I'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(0)
>>> f
'xxIxx123xxlovexx'
>>> f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(3)
Traceback (most recent call last):
File "<pyshell#22>", line 1, in <module>
f=re.search('xx(.*?)xx123xx(.*?)xx',s2).group(3)

IndexError: no such group

By contrast, findall returns a list with one tuple per match, each tuple holding all the groups:

>>> f=re.findall('xx(.*?)xx123xx(.*?)xx',s2)

>>> f
[('I', 'love')]
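One practical difference worth sketching: search returns None when nothing matches, so calling .group() directly on its result can raise AttributeError. The helper name first_groups below is hypothetical, introduced only for this illustration.

```python
import re

s2 = 'asssxxIxx123xxlovexxdh'
pattern = 'xx(.*?)xx123xx(.*?)xx'


def first_groups(pat, text):
    """Return the groups of the first match as a tuple, or () if no match."""
    match = re.search(pat, text)
    if match is None:          # search found nothing
        return ()
    return match.groups()      # all capture groups of the first match


found = first_groups(pattern, s2)           # ('I', 'love')
missing = first_groups('xxnopexx', s2)      # ()

# findall never returns None; with no match it returns an empty list.
all_matches = re.findall(pattern, s2)       # [('I', 'love')]
no_matches = re.findall('xxnopexx', s2)     # []

print(found, missing, all_matches, no_matches)
```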

Next, using sub:

sub replaces every match of the pattern with the given replacement string:

>>> s='123hfhdfhdxhdhd123'

>>> output=re.sub('123(.*?)123','123789123',s)
>>> output
'123789123'
>>> output=re.sub('123(.*?)123','123%d123'%789,s)
>>> output

'123789123'
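A few common variations on sub can be sketched as follows. The URL is a made-up placeholder; generating page URLs by swapping a number is one typical crawler use, though the original post does not show it.

```python
import re

s = '123hfhdfhdxhdhd123'

# \1 in the replacement re-inserts whatever group 1 captured.
bracketed = re.sub('123(.*?)123', r'123[\1]123', s)
# '123[hfhdfhdxhdhd]123'

# Hypothetical listing URL; only the page number changes between requests.
base_url = 'http://example.com/list?page=1'
page_urls = [re.sub(r'page=\d+', 'page=%d' % i, base_url) for i in range(1, 4)]

# count limits how many matches are replaced.
first_only = re.sub('x', '-', 'xxx', count=1)   # '-xx'

print(bracketed)
print(page_urls)
print(first_only)
```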

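The author's advice is to avoid compile. The module-level functions (re.findall, re.search, re.sub) already cache recently compiled patterns, so calling re.compile explicitly is rarely needed in short scripts.

A common pattern for matching runs of digits:

\d+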

>>> a='dsgdgd1112255555555555hdhdgdgd'
>>> c=re.findall(r'(\d+)',a)

>>> c

['1112255555555555']

>>> b='dghgd11111111ysdysdys2222223ddh'

>>> dc=re.findall(r'(\d+)',b)
>>> dc
['11111111', '2222223']
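findall always returns strings, so the digits usually need converting before they can be used as numbers. A minimal sketch, which also uses a raw string (r'...') so that Python does not try to interpret \d as a string escape:

```python
import re

b = 'dghgd11111111ysdysdys2222223ddh'

# The parentheses are optional here: with no group, findall returns
# the whole match, which is the same text in this case.
as_text = re.findall(r'\d+', b)            # ['11111111', '2222223']
same_with_group = re.findall(r'(\d+)', b)  # ['11111111', '2222223']

# Convert each run of digits to an integer.
numbers = [int(n) for n in as_text]        # [11111111, 2222223]

print(as_text, numbers)
```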
