Regular Expressions

GJShine107

Article link: https://blog.csdn.net/wojiaodabai/article/details/80269923

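What is a regular expression?

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic to apply to strings.

This article covers five functions from Python's re module: re.match, re.search, re.findall, re.sub and re.compile.

Online testing tool: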

http://tool.oschina.net/regex/

Common patterns:
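A few of the pattern tokens used throughout the rest of this article can be sketched as follows. This is only a minimal illustration, not a complete reference; the variable names (sample, digits, words, spaces, whole) are chosen here for the example.

```python
import re

sample = "hello 1590 word"

# \d matches one digit; + means "one or more" of the preceding item
digits = re.findall(r"\d+", sample)

# \w matches a letter, digit or underscore
words = re.findall(r"\w+", sample)

# \s matches one whitespace character
spaces = re.findall(r"\s", sample)

# ^ anchors the start of the string, $ the end;
# . matches any character except a newline, * means "zero or more"
whole = re.match(r"^h.*d$", sample)

print(digits)         # ['1590']
print(words)          # ['hello', '1590', 'word']
print(spaces)         # [' ', ' ']
print(whole.group())  # hello 1590 word
```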

re.match

re.match(pattern, string, flags=0)

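re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

The most basic match

import re

s='hello 1590 word'

res=re.match(r'^hello\s\d\d\d\d\sword$',s)
print(res)
print(res.span())  # returns the span of the match
print(res.group())  # returns the matched text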

print(res)
<_sre.SRE_Match object; span=(0, 15), match='hello 1590 word'>

print(res.span())  # returns the span of the match
(0, 15)

print(res.group())  # returns the matched text
hello 1590 word
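Because re.match returns None when the pattern does not match at the start, calling .group() directly on the result raises AttributeError in that case. A minimal sketch of guarding against this; the helper name first_number and the second test string are made up for illustration.

```python
import re

def first_number(text):
    """Return the digits that follow a leading 'hello ', or None if there is no match."""
    m = re.match(r"hello\s(\d+)", text)
    if m is None:
        # No match at the start of the string, so there is nothing to extract
        return None
    return m.group(1)

print(first_number("hello 1590 word"))  # 1590
print(first_number("say hello 1590"))   # None
```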

Generic matching (using .*)

res1=re.match(r'^hello.*word$',s)
print(res1)
<_sre.SRE_Match object; span=(0, 15), match='hello 1590 word'>

Capturing the target

res2=re.match(r'^hello\s(\d+)\s.*$',s)
print(res2)
print(res2.group(1))
<_sre.SRE_Match object; span=(0, 15), match='hello 1590 word'>
1590

As you can see, if you wrap the part you want in parentheses () inside the pattern, you can retrieve it from the result with group(1).
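The same idea extends to several groups. The sketch below, using the same sample string, shows group(0), groups(), a per-group span and a named group; the group name number is chosen here only for illustration.

```python
import re

s = "hello 1590 word"
m = re.match(r"^(\w+)\s(\d+)\s(\w+)$", s)

print(m.group(0))  # hello 1590 word  (group 0 is the whole match)
print(m.group(2))  # 1590             (groups are numbered from 1, left to right)
print(m.groups())  # ('hello', '1590', 'word')
print(m.span(2))   # (6, 10)

# A named group can be read back by name instead of by index
named = re.match(r"^hello\s(?P<number>\d+)", s)
print(named.group("number"))  # 1590
```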

Greedy vs. non-greedy matching

Greedy matching

s='hello 1590 word'

res3=re.match(r'he.*(\d+).*rd$',s)
print(res3.group(1))
0

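We wanted to capture the four digits 1590, but only one digit came back. This is because .* is greedy: it consumes as many characters as it can while still letting the rest of the pattern match, so it swallows 159 and leaves (\d+) only the single digit it needs.

Non-greedy matching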

res4=re.match(r'he.*?(\d+).*rd$',s)
print(res4.group(1))
1590

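This time all four digits 1590 are captured. .*? is non-greedy: it consumes as few characters as possible, so the pattern that follows it gets to match as early as it can.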
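To see exactly what each quantifier consumed, one option is to put the .* itself in a group as well. This is a small variation on the patterns above, not the original code; the names greedy and lazy are illustrative.

```python
import re

s = "hello 1590 word"

# Group 1 shows what .* (greedy) or .*? (non-greedy) consumed,
# group 2 shows what was left over for \d+
greedy = re.match(r"he(.*)(\d+).*rd$", s)
lazy = re.match(r"he(.*?)(\d+).*rd$", s)

print(greedy.groups())  # ('llo 159', '0')
print(lazy.groups())    # ('llo ', '1590')
```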

Matching across line breaks

ss="""hello 1590 
    new word"""


res4=re.match(r'he.*?(\d+).*?rd$',ss)
print(res4)


None

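If the string contains a line break, .* fails to match, because:

. matches any character except a newline; when the re.DOTALL flag is specified, it matches any character including a newline.

So by default the match cannot cross a line break.

How, then, can we match across a line break?

Pass re.S (the short alias of re.DOTALL).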

res4=re.match(r'he.*?(\d+).*?rd$',ss,re.S)
print(res4)
<_sre.SRE_Match object; span=(0, 24), match='hello 1590 \n    new word'>

print(res4.group(1))
1590

With the flag added, the match succeeds.
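The effect of the flag can be checked side by side. The sketch below also shows one common alternative that needs no flag, the character class [\s\S]; that alternative is an addition here, not something the original example uses.

```python
import re

text = "hello 1590 \n    new word"
pattern = r"he.*?(\d+).*?rd$"

without_flag = re.match(pattern, text)
with_flag = re.match(pattern, text, re.S)

print(without_flag)        # None
print(with_flag.group(1))  # 1590
print(re.S == re.DOTALL)   # True, re.S is just the short name

# [\s\S] matches any character, including a newline, without any flag
alt = re.match(r"he.*?(\d+)[\s\S]*?rd$", text)
print(alt.group(1))        # 1590
```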

Escaping: matching special characters

s1='price is 5.00$'
res=re.match(r'price is \d\.\d+\$$',s1)
print(res)
<_sre.SRE_Match object; span=(0, 14), match='price is 5.00$'>

The backslash (\) escapes special characters so that they are matched literally.
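When the literal text comes from a variable, re.escape can add the backslashes for you. A minimal sketch, assuming the same price string as above; the variable names literal, pattern, escaped, unescaped and loose are illustrative.

```python
import re

s1 = "price is 5.00$"

# Unescaped, $ means "end of string", so the literal dollar sign is never matched
unescaped = re.match(r"price is 5.00$", s1)
print(unescaped)  # None

# re.escape backslash-escapes the special characters in a literal string
literal = "5.00$"
pattern = r"price is " + re.escape(literal) + r"$"
escaped = re.match(pattern, s1)
print(escaped.group())  # price is 5.00$

# An unescaped dot is a wildcard, so it also accepts text that was not intended
loose = re.match(r"price is 5.00", "price is 5x00")
print(loose is not None)  # True
```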

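Quick summary:

Prefer generic matching, use parentheses to capture the target, prefer non-greedy mode, and use re.S when the text contains newlines.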

re.search

re.search scans the whole string and returns the first successful match.

Compare the following two calls:

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match('Hello.*?(\\d+).*?Demo', content)
print(result)
None

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.search('Hello.*?(\\d+).*?Demo', content)
print(result)
print(result.group(1))
<_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
1234567

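As you can see, re.match strictly requires the match to begin at the start of the string, whereas re.search does not.

Summary: for convenience, use search rather than match whenever you can.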
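The comparison can be condensed into one runnable snippet. It also shows that anchoring the pattern with ^ makes search behave like match for this input; that last check is an illustration added here, not part of the original example.

```python
import re

content = "Extra stings Hello 1234567 World_This is a Regex Demo Extra stings"
pattern = r"Hello.*?(\d+).*?Demo"

# match only tries at position 0, where the text is "Extra ..."
matched = re.match(pattern, content)
print(matched)  # None

# search tries every position and returns the first hit
found = re.search(pattern, content)
print(found.span())    # (13, 53)
print(found.group(1))  # 1234567

# With a leading ^, search is pinned to the start of the string as well
anchored = re.search("^" + pattern, content)
print(anchored)  # None
```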

re.findall

Searches the string and returns all matching substrings as a list.

html = """<div id=\"songs-list\">
        <h2 class=\"title\">经典老歌</h2>
       <p class=\"introduction\">
            经典老歌列表
        </p>
        <ul id=\"list\" class=\"list-group\">
            <li data-view=\"2\">一路上有你</li>
            <li data-view=\"7\">
                <a href=\"/2.mp3\" singer=\"任贤齐\">沧海一声笑</a>
            </li>
            <li data-view=\"4\" class=\"active\">
                <a href=\"/3.mp3\" singer=\"齐秦\">往事随风</a>
            </li>
            <li data-view=\"6\"><a href=\"/4.mp3\" singer=\"beyond\">光辉岁月</a></li>
            <li data-view=\"5\"><a href=\"/5.mp3\" singer=\"陈慧琳\">记事本</a></li>
            <li data-view=\"5\">
                <a href=\"/6.mp3\" singer=\"邓丽君\">但愿人长久</a>
            </li>"""




results = re.findall('<li.*?>\\s*?(<a.*?>)?(\\w+)(</a>)?\\s*?</li>', html, re.S)
print(results)

[('', '一路上有你', ''), ('<a href="/2.mp3" singer="任贤齐">', '沧海一声笑', '</a>'), ('<a href="/3.mp3" singer="齐秦">', '往事随风', '</a>'), ('<a href="/4.mp3" singer="beyond">', '光辉岁月', '</a>'), ('<a href="/5.mp3" singer="陈慧琳">', '记事本', '</a>'), ('<a href="/6.mp3" singer="邓丽君">', '但愿人长久', '</a>')]

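The results are returned as a list. Because this pattern contains three groups, each item in the list is a tuple holding the text captured by each group.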
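The shape of the list that re.findall returns depends on how many groups the pattern contains. A short sketch on a made-up HTML fragment (the snippet string and its contents are invented for illustration):

```python
import re

snippet = '<li><a href="/2.mp3">song two</a></li><li><a href="/3.mp3">song three</a></li>'

# No groups: each item is the whole matched text
whole = re.findall(r'href="[^"]+"', snippet)
print(whole)  # ['href="/2.mp3"', 'href="/3.mp3"']

# One group: each item is the text captured by that group
hrefs = re.findall(r'href="([^"]+)"', snippet)
print(hrefs)  # ['/2.mp3', '/3.mp3']

# Two or more groups: each item is a tuple with one entry per group
pairs = re.findall(r'href="([^"]+)">(.*?)</a>', snippet)
print(pairs)  # [('/2.mp3', 'song two'), ('/3.mp3', 'song three')]
```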

re.sub

Replaces every matching substring and returns the resulting string.

For example, removing the digits:

s='hello 12333 word'
print(re.sub(r'\d','',s))
hello  word

For example, replacing the digits with new:

print(re.sub(r'\d+','new',s))
hello new word
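re.sub also accepts a count limit, backreferences in the replacement, and a function as the replacement. The sketch below goes slightly beyond the two examples above; the helper name double is made up for illustration.

```python
import re

s = "hello 12333 word"

# count limits how many replacements are made
limited = re.sub(r"\d", "", s, count=2)
print(limited)  # hello 333 word

# \1 in the replacement refers back to the first captured group
wrapped = re.sub(r"(\d+)", r"[\1]", s)
print(wrapped)  # hello [12333] word

# The replacement may be a function that receives each match object
def double(m):
    return str(int(m.group()) * 2)

doubled = re.sub(r"\d+", double, s)
print(doubled)  # hello 24666 word

# re.subn also reports how many replacements were made
removed = re.subn(r"\d", "", s)
print(removed)  # ('hello  word', 5)
```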

re.compile

Compiles a regular expression string into a regular expression object, so that the same pattern can be reused.

s = """Hello 1234567 World
        new day"""

pattern = re.compile('Hello.*day', re.S)
result = re.match(pattern, s)
print(result)
<_sre.SRE_Match object; span=(0, 35), match='Hello 1234567 World\n        new day'>
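A compiled pattern object also exposes the matching functions as methods, which is where the reuse pays off. A minimal sketch; the pattern name number and the sample lines are chosen here for illustration.

```python
import re

# Compile once, then reuse the same pattern object
number = re.compile(r"\d+")

lines = ["hello 1590 word", "no digits here", "price is 5.00$"]
found = [number.findall(line) for line in lines]
print(found)  # [['1590'], [], ['5', '00']]

# The module-level functions are available as methods on the pattern object
first = number.search("hello 1590 word").group()
print(first)  # 1590

replaced = number.sub("new", "hello 12333 word")
print(replaced)  # hello new word
```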