Removing HTML characters in Python: Python code to strip HTML tags from a string

Latest recommended article published 2022-09-09 16:47:02

张戎

Tags: Python, removing HTML characters

Using a regex

Using a regex, you can remove everything inside <>:

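import re

def cleanhtml(raw_html):
    # Non-greedy match: '<', as few characters as possible, then '>'
    cleanr = re.compile('<.*?>')
    # Replace every matched tag with the empty string
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext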
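As a quick check of the function above, here is a minimal, self-contained sketch. The names `TAG_RE` and `strip_tags` and the sample string are made up for illustration. It also shows why the quantifier should be non-greedy. Note that `.` does not match newlines by default, so a tag that spans several lines would need `re.DOTALL`.

```python
import re

# Non-greedy: '<', then as few characters as possible, then '>'.
TAG_RE = re.compile(r'<.*?>')

def strip_tags(raw_html):
    """Return raw_html with every <...> span removed."""
    return TAG_RE.sub('', raw_html)

sample = '<p>Hello, <b>world</b>!</p>'
print(strip_tags(sample))                  # Hello, world!

# A greedy pattern runs from the first '<' to the last '>' on the line,
# so it swallows the text between the tags as well.
print(repr(re.sub(r'<.*>', '', sample)))   # ''
```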

Some HTML text also contains entities that are not enclosed in brackets, such as '&nbsp;'. If that is the case, you might want to write the regex as

cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

This link contains more details on this.
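The sketch below shows what the extended pattern does in practice. It deletes entities rather than decoding them, so '&amp;' disappears instead of becoming '&'. The sample string and the name `CLEANR` are illustrative. The last lines use the standard library's `html.unescape` as an alternative that decodes entities; that function is not part of the original answer.

```python
import html
import re

# Tags, named entities (&nbsp;), decimal (&#169;) and hex (&#xa9;) references.
CLEANR = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def cleanhtml(raw_html):
    """Remove tags and entities; entities are dropped, not decoded."""
    return CLEANR.sub('', raw_html)

sample = '<p>Tom&nbsp;&amp;&nbsp;Jerry</p>'
print(cleanhtml(sample))                   # TomJerry

# Alternative (assumption of this sketch): strip only the tags,
# then decode the remaining entities with html.unescape.
tags_only = re.sub(r'<.*?>', '', sample)
print(repr(html.unescape(tags_only)))      # 'Tom\xa0&\xa0Jerry'
```

The character classes are lowercase only, so entities such as '&Eacute;' or '&#xA9;' are left in place unless the pattern is compiled with `re.IGNORECASE`.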

Using BeautifulSoup

You could also use the additional BeautifulSoup package to extract all the raw text.

You will need to explicitly set a parser when calling BeautifulSoup.

I recommend "lxml", as mentioned in alternative answers. It is much more robust than 'html.parser', the default parser (i.e. the one available without an additional install).

from bs4 import BeautifulSoup

cleantext = BeautifulSoup(raw_html, "lxml").text
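For a slightly fuller picture, here is a hedged sketch that wraps the call in a function. The name `html_to_text` and its `parser` argument are made up for illustration. It defaults to "html.parser" only so that it runs without lxml installed; pass "lxml" to follow the recommendation above. It uses `get_text` with a separator so that text from adjacent tags does not run together.

```python
from bs4 import BeautifulSoup

def html_to_text(raw_html, parser="html.parser"):
    """Parse raw_html and return only its text content.

    `parser` is an assumption of this sketch: "html.parser" ships with
    Python, and "lxml" can be passed instead once that package is installed.
    """
    soup = BeautifulSoup(raw_html, parser)
    # separator=" " keeps text from adjacent tags apart;
    # strip=True trims whitespace around each text fragment.
    return soup.get_text(separator=" ", strip=True)

print(html_to_text("<div><p>Hello</p><p>world</p></div>"))  # Hello world
```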

But this approach does not spare you from depending on external libraries, so I recommend the first solution.
