Python regex: findall returns fragments instead of the whole match

yanjiee

Article link: https://blog.csdn.net/yanjiee/article/details/124409980

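Today, while testing a URL regex, I ran into a case where an online regex tester matched the URL correctly, but Python returned the match broken into pieces. The input looked like this: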

<div class="right">\r\n <div style="text-align: left;color: #1D51B4;font-weight: 600;padding-left: 40px;font-size: 16px;">\r\n 交管12123APP下载\r\n </div>\r\n <img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">\r\n </div>\r\n </div>\r\n </div>\r\n <div id="content">\r\n \r\n <div id="content-title">\r\n \r\n \r\n \r\n 各地平台网站\r\n \r\n \r\n

The regular expression:

(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?

The online tester matches the full URL:
https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png
But with the following Python code:

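import re
# html holds the page source; a one-line excerpt is enough to reproduce the problem
html = '<img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">'
pattern = re.compile(r'(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
print(pattern.findall(html))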

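the result came back in pieces, one tuple of captured groups per match instead of the URL:
[('https', '.cn', '/V1.24.5/plat/static/img/mapSelect.png')]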
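The fragmented result follows from how re.findall treats groups. The sketch below illustrates this with a simplified pattern and a made-up sample string (both are assumptions for illustration, not the pattern or data from this post): with no groups findall returns whole matches, with one group it returns that group's text, and with several groups it returns tuples.

```python
import re

# Hypothetical sample text, not taken from the original page.
text = "see https://example.com/a.png and ftp://files.example.org/b.txt"

# No groups: findall returns each whole match.
no_groups = re.findall(r"\w+://[\w.]+/[\w.]+", text)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"(\w+)://[\w.]+/[\w.]+", text)

# Several capturing groups: findall returns one tuple of group texts per match.
many_groups = re.findall(r"(\w+)://([\w.]+)/([\w.]+)", text)

print(no_groups)
print(one_group)
print(many_groups)
```

An online tester typically highlights the whole match, which is why the same pattern appears to behave differently there.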
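Later I consulted: https://blog.csdn.net/yu_1628060739/article/details/102767158
It turns out that when a pattern contains capturing groups, findall returns the groups rather than the whole match. Adding ?: inside each pair of parentheses turns every capturing group into a non-capturing group, and findall then returns the complete URL: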

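import re
# html is the page source shown above
pattern = re.compile(r'(?:http|ftp|https):\/\/[\w\-_]+(?:\.[\w\-_]+)+(?:[\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?')
print(pattern.findall(html))  # ['https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png']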
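If the capturing groups are still needed (for example to read the scheme separately), an alternative is to keep the original pattern and iterate with finditer, taking group(0) for the whole match. This is a minimal sketch of that option, not the approach from the referenced post; the name URL_RE and the shortened html string are assumptions for illustration.

```python
import re

# Original pattern with its capturing groups left in place.
URL_RE = re.compile(
    r'(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?'
)

# Sample input cut down from the page source above (shortened for illustration).
html = '<img src="https://static.122.gov.cn/V1.24.5/plat/static/img/mapSelect.png" style="width: 87%;">'

# group(0) is always the whole match, no matter how many groups the pattern has.
urls = [m.group(0) for m in URL_RE.finditer(html)]

# The individual groups stay available, e.g. group(1) is the scheme.
schemes = [m.group(1) for m in URL_RE.finditer(html)]

print(urls)
print(schemes)
```

Non-capturing groups are the simpler fix when only the full URL is wanted; finditer is worth considering when both the full match and its parts are useful.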