Python: removing a, img and other HTML tags with regular expressions

进阶中的小檀

Published 2022-12-20 10:32:28

Tags: python, programming language, removing a tags from rich text with python

Article link: https://blog.csdn.net/qq_35260798/article/details/128381417

A regex that matches all a tags

// Group 1 and group 2 are the href and the value (the content between the opening and closing tags)
<a\b[^>]+\bhref="([^"]*)"[^>]*>([\s\S]*?)</a>

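Explanation:

<a\b # matches the start of the a tag; \b stops it from matching tags such as <abbr>
[^>]+ # matches whatever precedes href inside the a tag
\bhref="([^"]*)" # matches the href value and captures it into group 1
[^>]*> # matches whatever follows href inside the a tag
([\s\S]*?) # matches the a tag's value and captures it into group 2; ? makes the match lazy
</a> # matches the end of the a tag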
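To make the two capture groups concrete, here is a minimal sketch that runs the same pattern over a small HTML string. The sample markup and the names `A_TAG` and `sample` are made up for illustration and are not part of the original article.

```python
import re

# Same pattern as above, compiled case-insensitively so <A HREF=...> also matches.
A_TAG = re.compile(r'<a\b[^>]+\bhref="([^"]*)"[^>]*>([\s\S]*?)</a>', re.I)

# Hypothetical sample input.
sample = ('<p>See <a class="ext" href="https://example.com/x">the <b>docs</b></a> '
          'and <a href="/home">home</a>, not <abbr title="x">this</abbr>.</p>')

# groups() returns (href, value) for each matched a tag.
links = [m.groups() for m in A_TAG.finditer(sample)]
for href, value in links:
    print(href, '|', value)
```

Note that the pattern only recognises double-quoted href values; a tag written as href='...' or with an unquoted value would be skipped, so for arbitrary real-world HTML a proper parser is usually the safer choice.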

The corresponding handling in Python

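import re

def replaceA(match):
    # re.sub calls this once per matched a tag; the return value replaces the tag
    print('-----')
    print(">>1 " + match.group(0))  # the whole matched a tag
    print(">>2 " + match.group(1))  # href
    print(">>3 " + match.group(2))  # value (content between the tags)
    return ''  # an empty string removes the tag together with its content

# content holds the HTML to clean
a3 = r'<a\b[^>]+\bhref="([^"]*)"[^>]*>([\s\S]*?)</a>'
content = re.sub(a3, replaceA, content, flags=re.I | re.M | re.S)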
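The title also mentions img tags, which the code above does not cover. The following is a rough sketch of how the same approach might be extended, under the assumption that no attribute value contains a ">" character. The helper names `strip_img` and `unwrap_a` are hypothetical, and `unwrap_a` keeps the link text instead of deleting it, which may or may not be what you want.

```python
import re

# img is a void element, so there is no closing tag to match.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.I)
# Looser than the pattern above: it does not require an href attribute.
A_TAG = re.compile(r'<a\b[^>]*>([\s\S]*?)</a>', re.I)

def strip_img(html):
    """Delete every <img ...> tag."""
    return IMG_TAG.sub('', html)

def unwrap_a(html):
    """Drop <a ...> and </a> but keep whatever sits between them."""
    # A function replacement avoids backslash-escape handling in the kept text.
    return A_TAG.sub(lambda m: m.group(1), html)

# Hypothetical sample input.
html = '<p><a href="/x">link</a> <img src="a.png" alt="pic"/> text</p>'
cleaned = unwrap_a(strip_img(html))
print(cleaned)
```

The space on each side of the removed img tag is left in place, so the result contains two consecutive spaces; collapse whitespace afterwards if that matters.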