Stripping HTML formatting and converting to unicode with Python


splayx

License: CC 4.0 BY-SA


Article link: https://blog.csdn.net/splayx/article/details/84459620


This post shows how to use Python to strip HTML tags and convert entity-encoded text into an encoding the system can handle, such as UTF-8.


HTML sometimes contains odd-looking encodings known as character entities. For a reference, see:

http://www.w3school.com.cn/tags/html_ref_symbols.html

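For example, the string below is entity-encoded text. It appears here already decoded; in the raw HTML source each character would be written as a character reference (for instance, `&#23567;` for 小):

小何很好啊！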
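To make entity encoding concrete, here is a minimal sketch using `html.unescape` from the Python 3 standard library. The encoded strings are illustrative; the original post does not show which reference form (decimal, hexadecimal, or named) its sample used.

```python
import html

# Decimal character references for the six characters of the sample string
# (an assumed form; the post's raw source is not shown).
encoded = "&#23567;&#20309;&#24456;&#22909;&#21834;&#65281;"
decoded = html.unescape(encoded)
print(decoded)

# Named and hexadecimal references are decoded by the same call.
mixed = html.unescape("&lt;b&gt; &amp; &#x5C0F;")
print(mixed)
```

The result is an ordinary Python `str`, which is unicode text and can then be encoded to any target encoding.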

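Suppose you have a piece of HTML text and want to strip its tags, then convert the result into an encoding the system can handle (for example, UTF-8).

The Python code below converts it to unicode; from there you can encode it however you like.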

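from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

def strip_tags(html):
    # convert_charrefs=True (the default since Python 3.5) makes the parser
    # turn entities such as &amp; or &#23567; into unicode characters before
    # they reach handle_data, so no separate unescape pass is needed.
    # (HTMLParser.unescape was removed in Python 3.9.)
    html_parser = HTMLParser(convert_charrefs=True)

    # remove format: collect only the text between tags
    result = []
    html_parser.handle_data = result.append
    html_parser.feed(html)

    # close() flushes any text the parser is still buffering
    html_parser.close()
    return "".join(result)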
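As a usage sketch, the same idea can be written for Python 3 as an `HTMLParser` subclass and followed by an explicit UTF-8 encoding step. The names `TextExtractor` and `html_to_text` and the sample markup are hypothetical, introduced only for illustration; this is one possible arrangement, not the post's own code.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags (illustrative class name)."""

    def __init__(self):
        # With convert_charrefs=True the parser decodes entities itself,
        # so handle_data receives plain unicode text.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    """Return the text content of markup with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Assumed sample input, not taken from the original post.
sample = "<p>&#23567;&#20309; <b>&amp;</b> co.</p>"
text = html_to_text(sample)
print(text)

# Once the text is a unicode str, encoding it is a separate, explicit step.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

Keeping decoding (entities to unicode) apart from encoding (unicode to bytes) means the same extracted text can be written out as UTF-8, GBK, or anything else without touching the parser.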