【Python】BeautifulSoup

Latest recommended article published 2024-08-30 10:01:19

风吹我亦散

Article tags: python

Article link: https://blog.csdn.net/weixin_45468845/article/details/108498707


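This article introduces the Python library BeautifulSoup: parsing HTML documents, traversing and searching the document tree, modifying the document structure, and formatting output. Examples show how to work with HTML tags, attributes, strings, and comments, and several ways to traverse and search a document are presented.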


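Introduction

As we know, a web page is built from an HTML document. HTML is a structured format that follows defined rules, and that structure makes it easier to extract information.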

Beautiful Soup 4.4.0 documentation

My understanding: the BeautifulSoup() constructor parses an HTML document into an object, and you then work with that object.
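To make the "parse into an object, then operate on it" idea concrete, here is a minimal sketch. The one-line snippet string is a hypothetical input, not something from the article; it only illustrates that the parsed result exposes tags, attributes, strings, and comments as distinct Python objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical one-line document used only for illustration.
snippet = '<p class="note">Hello<!-- hidden --></p>'

# The constructor turns the markup into a tree of Python objects.
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p                  # first <p> element, a Tag object
text, comment = p.contents  # its two children: a string and a comment

print(type(soup).__name__)  # class of the parsed document object
print(p.name, p["class"])   # tag name and its class attribute (a list)
print(isinstance(text, NavigableString), isinstance(comment, Comment))
```

The built-in html.parser is used here so that no extra parser package is needed; other parsers may build a slightly different tree for malformed markup.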

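Beautiful Soup is a Python library for extracting data from HTML or XML files. Its name comes from Alice's Adventures in Wonderland, and the code below, taken from the official documentation, is a passage from that book.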

[Image in the original post: the html_doc HTML snippet]
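Because the snippet was published as an image, the html_doc variable used in the following code is never defined in the text. The definition below is a reconstruction based on the prettified output shown further down; the exact whitespace and line breaks are an assumption.

```python
# Reconstructed input document (assumed layout; the original was an image).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```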
Parsing this snippet with BeautifulSoup yields a BeautifulSoup object, which can then be printed as a conventionally indented structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

# Output
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

A few simple ways to navigate the structured data:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
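
The navigation calls above can be tried end to end with the sketch below. It assumes a shortened copy of the sample document (reconstructed, since the original was an image) and adds a few attribute and search lookups that are part of the Beautiful Soup API; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened, reconstructed copy of the sample document (an assumption).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_name = soup.title.name            # name of the <title> tag
title_text = soup.title.string          # text inside <title>
title_parent = soup.title.parent.name   # name of the enclosing tag
first_p_classes = soup.p["class"]       # class attribute of the first <p>

# find_all returns every matching tag; collect each link's href.
links = [a["href"] for a in soup.find_all("a")]

# find returns the first match; here the tag whose id is "link3".
third_sister = soup.find(id="link3").get_text()

print(title_name, title_text, title_parent, first_p_classes)
print(links)
print(third_sister)
```

Attribute access such as soup.title only ever returns the first matching tag, which is why find_all is used to collect all three links.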
