Python Web Scraping: A Guide to the Beautiful Soup Module

Latest recommended article published 2025-01-23 17:56:42

hoxis



Categories: Learning Python, Fun with Python. Tags: Python, web scraping

Article link: https://blog.csdn.net/bruce_6/article/details/80764000


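This article introduces the BeautifulSoup module for Python web scraping: installation, basic usage, the object types (Tag, NavigableString, BeautifulSoup, Comment), and the ways to search the document tree (by name, id, attribute, CSS selector, and so on). Worked examples show how to extract and process the contents of HTML tags, so that readers can pick up the basic techniques for scraping web data with BeautifulSoup.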


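Scraping a web page usually goes like this:

Choose the URL you want to scrape
Open that URL from Python (urlopen, requests, etc.)
Read the page content (with read())
Feed the content into BeautifulSoup
Use BeautifulSoup to select tags and other information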
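The steps above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: the helper names `fetch_html` and `extract_links` and the inline `sample_page` are made up for this example, and the built-in `html.parser` is used so that nothing beyond beautifulsoup4 needs to be installed.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Steps 1-3: open the URL and read the raw page bytes (hypothetical helper)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def extract_links(html):
    """Steps 4-5: hand the markup to BeautifulSoup, then select tag data."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Inline stand-in for a downloaded page, so the sketch runs without network access.
sample_page = (
    b'<html><body><a href="http://example.com/elsie">Elsie</a> '
    b'<a name="anchor-only">no href</a></body></html>'
)
print(extract_links(sample_page))

# For a live page the two helpers would be chained, for example:
# extract_links(fetch_html("http://example.com/"))
```

Keeping the download and the parsing in separate functions makes the filtering logic easy to try out on saved or hand-written HTML before pointing it at a real site.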

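As you can see, fetching the page is not the hard part. The hard part is filtering the data, that is, getting exactly the data you want. This article walks through how to use BeautifulSoup for that.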

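The official BeautifulSoup documentation describes it as follows:

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.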

1 Installation

You can install it directly with pip:

$ pip install beautifulsoup4

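Besides the HTML parser in Python's standard library (html.parser), BeautifulSoup supports third-party parsers such as lxml (for both HTML and XML) and html5lib, but each of these requires installing the corresponding library. If none of them is installed, BeautifulSoup falls back to Python's built-in parser. The lxml parser is more capable and faster, so installing it is recommended.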

$ pip install html5lib
$ pip install lxml
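Since the third-party parsers are optional, a script may want to degrade gracefully when one is missing. The sketch below is one possible approach: `make_soup` is a hypothetical helper written for this example, and it relies on BeautifulSoup raising `bs4.FeatureNotFound` when an explicitly requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Try each parser in order and return (soup, parser_name) for the first one available."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser's library is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, parser_used = make_soup("<p>hello</p>")
print(parser_used)    # which parser was actually picked on this machine
print(soup.p.string)
```

Naming the parser explicitly also keeps results consistent across machines, because different parsers can build slightly different trees from the same malformed HTML.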

2 Basic usage of BeautifulSoup

First, let's create a string that the rest of the article uses to demonstrate BeautifulSoup.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Parsing this string with BeautifulSoup gives us a BeautifulSoup object, which can print the document in a standard indented layout:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "lxml")
>>> print(soup.prettify())

To save space, the output is not shown here.
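To get a feel for what prettify() produces without printing the whole document, it can be tried on a much smaller fragment. This is only an illustrative sketch: the `snippet` string is a one-line excerpt chosen for this example, and `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# A one-line fragment taken from the sample document.
snippet = "<p class='title'><b>The Dormouse's story</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

# prettify() returns a string with each tag and text node on its own indented line.
pretty = soup.prettify()
print(pretty)
```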

Here are a few simple ways to navigate the structured data:

>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string
"The Dormouse's story"

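The same attribute-style access works for other tags in the sample document. The script below is a small self-contained sketch that goes slightly beyond the three calls shown above: it repeats `html_doc` so that it runs on its own, and it uses `html.parser` instead of lxml to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.parent.name)         # head
print(soup.p["class"])                # ['title'] (class is multi-valued, so it is a list)
print(soup.a["href"])                 # http://example.com/elsie (soup.a is the first <a>)

# find_all() returns every matching tag, not just the first one.
link_ids = [a["id"] for a in soup.find_all("a")]
print(link_ids)                       # ['link1', 'link2', 'link3']

# find() with a keyword argument filters on that attribute.
print(soup.find(id="link3").string)   # Tillie
```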