Filtering HTML Tags from Text with Python

Latest recommended article published on 2022-09-09 16:47:02

gtfaww

Category: python. Tags: python


I recently needed to strip HTML tags from text and found this Python function online, written by an expert:

import re

def filter_tag(htmlstr):
    """Strip tags and other markup from an HTML string.

    htmlstr: the HTML string to clean.
    """
    re_cdata = re.compile(r'<!DOCTYPE HTML PUBLIC[^>]*>', re.I)  # DOCTYPE declaration
    re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script blocks
    re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style blocks
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # any remaining tag
    re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
    s = re_cdata.sub('', htmlstr)
    s = re_script.sub('', s)
    s = re_style.sub('', s)
    s = re_br.sub('\n', s)
    s = re_h.sub(' ', s)
    s = re_comment.sub('', s)
    # Collapse runs of blank lines into a single newline
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    # Collapse every whitespace run (including the newlines above) into one space
    s = re.sub(r'\s+', ' ', s)
    s = replaceCharEntity(s)
    return s
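One limitation worth noting: the `re_script` and `re_style` patterns match the element body with `[^<]*`, so a script that contains a `<` (for example `a < b`) is not removed as a whole, and its body is left in the output. The following is a minimal sketch of one possible variant, not part of the original function; the names `RE_SCRIPT_OR_STYLE` and `strip_script_and_style` are hypothetical. It uses a non-greedy match together with `re.S` so the body may contain `<` and newlines.

```python
import re

# Hypothetical variant of the script/style patterns (not from the original post).
# ".*?" with re.S matches the body lazily, across newlines, up to the first
# closing tag whose name matches the opening one (backreference \1).
RE_SCRIPT_OR_STYLE = re.compile(
    r'<\s*(script|style)\b[^>]*>.*?<\s*/\s*\1\s*>', re.I | re.S)

def strip_script_and_style(html_text):
    """Remove <script> and <style> elements together with their contents."""
    return RE_SCRIPT_OR_STYLE.sub('', html_text)

sample = ('<p>Hi</p><script>if (a < b) { alert("x"); }</script>'
          '<style>\np { color: red; }\n</style>')
print(strip_script_and_style(sample))
```

Regular expressions still cannot handle every case (for instance a string literal inside a script that itself contains `</script>`), so this only widens what the pattern covers.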

def replaceCharEntity(htmlstr):
    """Replace common HTML character entities with the characters they stand for.

    Add new entries to CHAR_ENTITIES to handle more entities.
    htmlstr: the HTML string to process.
    """
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"'}
    # Named group: the \w+ part is captured as "name" and read back with group('name')
    re_charEntity = re.compile(r'&#?(?P<name>\w+);')
    sz = re_charEntity.search(htmlstr)
    while sz:
        key = sz.group('name')  # fetch the named group
        try:
            # The count argument 1 replaces only the first match
            htmlstr = re_charEntity.sub(CHAR_ENTITIES[key], htmlstr, 1)
        except KeyError:
            # Unknown entity: remove it
            htmlstr = re_charEntity.sub('', htmlstr, 1)
        sz = re_charEntity.search(htmlstr)
    return htmlstr
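Because the loop searches the string again after every substitution, a double-escaped entity such as `&amp;lt;` ends up as `<` instead of `&lt;`, and any entity missing from `CHAR_ENTITIES` is silently dropped. As a possible alternative, the standard library's `html.unescape` decodes named and numeric entities in a single pass. The sketch below is an assumption about how one might swap it in, not the original author's method; the wrapper name `unescape_entities` is hypothetical.

```python
import html

def unescape_entities(text):
    """Decode HTML entities in one pass using the standard library.

    html.unescape turns &nbsp; into U+00A0 (a non-breaking space), so that
    character is mapped to a plain space afterwards.
    """
    return html.unescape(text).replace('\xa0', ' ')

print(unescape_entities('1 &lt; 2 &amp;&amp; 3 &gt; 2'))
print(unescape_entities('&amp;lt; stays escaped once'))
```

Here `&amp;lt;` is decoded only once, giving `&lt;`, which is usually what is wanted when the input was escaped twice on purpose.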

It works very well; thanks to the original author. Source: "python 过滤文本中的HTML标签" - guotengfei19880205 - guotengfei19880205's blog
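For input that is not well behaved, a parser-based approach may be more robust than regular expressions. The following is a minimal sketch using the standard library's `html.parser.HTMLParser`; it is a different technique from the regex function in this post, and the names `TextExtractor` and `strip_tags` are hypothetical. It keeps text nodes, skips the contents of `script` and `style`, and turns `br` into a newline.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text nodes
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1
        elif tag == 'br':
            self._chunks.append('\n')

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return ''.join(self._chunks)

def strip_tags(html_text):
    """Return the text content of html_text with all tags removed."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.text()

print(strip_tags('<p>Tom &amp; <b>Jerry</b></p><script>if (a < b) {}</script>'))
```

Comments and DOCTYPE declarations are ignored by the default handlers, so they need no extra code. Whitespace is left untouched here; collapsing it, as `filter_tag` does, would be a separate step.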
