Python regular expressions: removing every <xml> tag except <em> and <strong>

1. Match tags except <em> and <strong>
Match:
</?(?!(?:em|strong)\b)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace with:
(empty string)

Example input:
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>

Result:
inikkk
nihao
<em ff>nihao</em nihao>
nihao
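
The pattern above can be tried directly with Python's re module. The following is a minimal sketch, not a definitive implementation: the names TAG_EXCEPT_EM_STRONG and strip_tags are made up for illustration, and the sample text is the example input from this section.

```python
import re

# Pattern from section 1: matches any tag whose name is not exactly
# "em" or "strong". The \b stops "em" from also shielding names such as "embed".
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_tags(text):
    """Delete every matched tag by substituting the empty string."""
    return TAG_EXCEPT_EM_STRONG.sub("", text)


sample = "\n".join([
    "<b fdjfkjdk>inikkk</b fdjkfjk>",
    "<i fff>nihao</i nihao>",
    "<em ff>nihao</em nihao>",
    "<big ff>nihao</big nihao>",
])

print(strip_tags(sample))
```

Passing "" as the replacement is the Python equivalent of the empty "Replace" field. A raw triple-quoted string is used only so that the single and double quotes inside the pattern need no escaping.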

2. Match all tags except bare <em> and <strong>; an <em> or <strong> tag that carries attributes is matched too
Match:
</?(?!(?:em|strong)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace with:
(empty string)

Example input:
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>
<strong>nihao</strong>

Result:
inikkk
nihao
nihao
nihao
<strong>nihao</strong>
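
The only difference from section 1 is the lookahead: (?:em|strong)\s*> requires the tag to close immediately after its name, so an <em> or <strong> with attributes is no longer protected. A small sketch under the same assumptions as before (BARE_EM_STRONG_ONLY and strip_attributed are illustrative names, the sample is the example input from this section):

```python
import re

# Pattern from section 2: only bare <em>, </em>, <strong>, </strong>
# (optionally with whitespace before ">") escape the match.
BARE_EM_STRONG_ONLY = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_attributed(text):
    """Remove every tag except attribute-free <em> and <strong>."""
    return BARE_EM_STRONG_ONLY.sub("", text)


sample = "\n".join([
    "<b fdjfkjdk>inikkk</b fdjkfjk>",
    "<i fff>nihao</i nihao>",
    "<em ff>nihao</em nihao>",
    "<big ff>nihao</big nihao>",
    "<strong>nihao</strong>",
])

print(strip_attributed(sample))
```

Note that the tag is removed entirely rather than stripped of its attributes, so <em ff>nihao</em nihao> becomes plain nihao.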

3. Whitelist specific attributes
Match all tags except <a>, <em> and <strong>, with two exceptions:
an <em> or <strong> tag that has any attribute is matched, and an <a> tag that has attributes other than href or title is matched.
Match:
</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>

Replace with:
(empty string)

Example input:
<b fdjfkjdk>inikkk</b fdjkfjk>
<i fff>nihao</i nihao>
<em ff>nihao</em nihao>
<big ff>nihao</big nihao>
<strong>nihao</strong>
<a nihao>fdjkf</nihao>
<a href="2222">fdjkf</nihao>

Result:
inikkk
nihao
nihao
nihao
<strong>nihao</strong>
fdjkf
<a href="2222">fdjkf   <=== The closing tag is removed here because the example input closes with </nihao>, not </a>, and nihao is not a whitelisted tag name. A real </a> is kept by this pattern.
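
To check the whitelist behaviour, here is a hedged sketch in Python. It assumes the attribute group is meant to be the non-capturing (?:href|title); written as (:href|title) it would look for a literal ":href" and the <a href="2222"> tag would be stripped. A_WHITELIST and strip_disallowed are illustrative names only.

```python
import re

# Pattern from section 3, split over two raw strings for readability.
# Assumption: the attribute group is the non-capturing (?:href|title).
A_WHITELIST = re.compile(
    r"""</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)"""
    r"""[a-zA-Z](?:[^>"']|"[^"]*"|'[^']*')*>"""
)


def strip_disallowed(text):
    """Remove every tag that the whitelist lookahead does not protect."""
    return A_WHITELIST.sub("", text)


# A properly closed link with only whitelisted attributes survives intact.
print(strip_disallowed('<a href="2222">fdjkf</a>'))

# The input used in the example closes with </nihao>, which is stripped.
print(strip_disallowed('<a href="2222">fdjkf</nihao>'))

# An <a> with a non-whitelisted attribute loses its opening tag only.
print(strip_disallowed('<a href="x" onclick="y">link</a>'))
```

The last call shows a limitation of the per-tag approach: the pattern inspects each tag on its own, so a bare </a> is always kept even when its opening <a ...> was removed. Pairing opening and closing tags would need an HTML parser rather than a single substitution.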