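Today, while testing a URL-matching regular expression, I ran into a puzzle: an online regex tester matched the URL correctly, but in Python the result came back split into pieces. The input looked something like this: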
<div class="right">\r\n <div style="text-align: left;color: #1D51B4;font-weight: 600;padding-left: 40px;font-size: 16px;">\r\n 交管12123APP下载\r\n </div>\r\n <img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">\r\n </div>\r\n </div>\r\n </div>\r\n <div id="content">\r\n \r\n <div id="content-title">\r\n \r\n \r\n \r\n 各地平台网站\r\n \r\n \r\n
The regular expression:
(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
On the online tester, this matches:
https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png
But when written in Python like this:
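import re
# Excerpt of the page source shown above
html = '<img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">'
# Every pair of parentheses in this pattern is a capturing group
pattern = re.compile(r'(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
print(pattern.findall(html))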
the result came back as fragments instead of the full URL:
[('https', '.cn', '/V1.24.5/plat/static/img/mapSelect.png')]
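The regex engine's match is in fact the same as on the online tester; only the returned value differs. A minimal sketch to check this, assuming a shortened sample string (just the img tag, not the full page) and using re.search to put the whole match next to the captured groups:

```python
import re

# Same pattern as above, split across two raw strings for readability.
pattern = re.compile(
    r'(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
)

# Shortened sample (assumption): only the <img> tag from the page above.
sample = '<img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">'

m = pattern.search(sample)
whole_match = m.group(0)  # the entire text matched by the pattern
captured = m.groups()     # one entry per capturing group

print(whole_match)
print(captured)
```

Here whole_match is the complete URL, while captured holds one piece per pair of parentheses. For a group that repeats, such as the second one, only its last repetition is kept.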
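Later I came across this post: https://blog.csdn.net/yu_1628060739/article/details/102767158
It turns out that ?: needs to be added right after each opening parenthesis. When a pattern contains capturing groups, re.findall returns the contents of those groups rather than the whole match.
Turning each capturing group into a non-capturing group makes it return the complete URL: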
import re
# (?:...) groups do not capture, so findall returns the whole match as a string
pattern = re.compile(r'(?:http|ftp|https):\/\/[\w\-_]+(?:\.[\w\-_]+)+(?:[\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
print(pattern.findall(html))
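If the groups are still wanted elsewhere (for example, to read the scheme), another option is to leave them in place and take the whole match from each match object via re.finditer. This is a sketch of that alternative, not the approach from the referenced post; find_urls and the sample string are hypothetical names introduced for illustration:

```python
import re

# Original pattern with its capturing groups left untouched.
URL_PATTERN = re.compile(
    r'(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
)

def find_urls(text):
    """Return the full text of every URL match (hypothetical helper)."""
    # group(0) is always the entire match, whatever groups the pattern has.
    return [m.group(0) for m in URL_PATTERN.finditer(text)]

# Shortened sample (assumption): only the <img> tag from the page above.
sample = '<img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">'
print(find_urls(sample))
```

The trade-off is small: the non-capturing version is shorter to call, while finditer keeps m.group(1) and the other groups available when they are needed.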