Web Scraping Study Notes (3): Using Beautiful Soup


Last modified: 2022-08-09 16:47:06


Tags: web scraping study notes

First published: 2022-08-09 16:39:58

Article link: https://blog.csdn.net/weixin_52024937/article/details/126250533


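This article describes in detail how to use the Python parsing library Beautiful Soup, covering installation, basic usage, node selectors, extracting information, relational selection, and CSS selectors, with a focus on parsing HTML and XML with Beautiful Soup to extract the data you need.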


1. Introduction to Beautiful Soup

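When we extracted information from web pages with regular expressions earlier, any mistake in the pattern meant we could not get the result we wanted. Web pages have a specific structure and hierarchy, so a powerful parsing tool, Beautiful Soup, can use a page's structure and attributes to parse it. Compared with regular expressions, it extracts page content with much simpler statements.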
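To make the contrast with regular expressions concrete, here is a minimal sketch that extracts the same title both ways. The `page` string and the variable names are made up for illustration, and the sketch assumes the `beautifulsoup4` and `lxml` packages are installed.

```python
import re
from bs4 import BeautifulSoup

# A small, hypothetical page used only for this comparison.
page = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Regex approach: the pattern has to match the markup exactly,
# so a small change in the HTML (or a typo in the pattern) yields no match.
match = re.search(r"<title>(.*?)</title>", page, re.S)
title_by_regex = match.group(1) if match else None

# Beautiful Soup approach: navigate to the node by its name.
soup = BeautifulSoup(page, "lxml")
title_by_soup = soup.title.string

print(title_by_regex)
print(title_by_soup)
```

Both lines print the same title, but the second approach does not depend on how the tag happens to be written in the source.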

Simply put, Beautiful Soup is a Python library for parsing HTML and XML, and it makes extracting data from web pages convenient.

2. Parsers

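Comparing the available parsers shows that the lxml parser can handle both HTML and XML, is fast, and tolerates malformed markup well, so it is the recommended choice. To use it, simply pass 'lxml' as the second argument when initializing BeautifulSoup.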

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>hello</p>','lxml')
print(soup.p.string)

Output:

hello
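As a rough illustration of how the second argument selects the parser, the sketch below parses the same fragment with Python's built-in `html.parser` and with `lxml`. The variable names are illustrative, and the sketch assumes both `beautifulsoup4` and `lxml` are installed.

```python
from bs4 import BeautifulSoup

snippet = "<p>hello</p>"

# "html.parser" is backed by Python's standard library;
# "lxml" requires the separate lxml package.
soup_builtin = BeautifulSoup(snippet, "html.parser")
soup_lxml = BeautifulSoup(snippet, "lxml")

# Both parsers find the same text inside the p node.
print(soup_builtin.p.string, soup_lxml.p.string)

# The resulting trees differ: lxml wraps the fragment in <html><body>,
# while html.parser keeps the fragment as given.
print(soup_builtin)
print(soup_lxml)
```

Because different parsers can build different trees from the same input, it helps to name the parser explicitly rather than rely on a default.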

3. Installing Beautiful Soup

Before use, make sure that both the Beautiful Soup and lxml libraries are installed correctly. Install them with pip from the command line (cmd):

pip install beautifulsoup4

pip install lxml
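One way to confirm that both packages are importable after installation is to print their versions. This is only a quick check, not part of the original post.

```python
import bs4
from lxml import etree

# bs4 exposes its version as a string; lxml exposes a tuple of integers.
print("beautifulsoup4:", bs4.__version__)
print("lxml:", ".".join(str(part) for part in etree.LXML_VERSION))
```

If either import fails with ModuleNotFoundError, the corresponding pip command above has not been run in the active Python environment.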

4. Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())  # pretty-print the parsed tree; the parser has already completed the missing tags
print(soup.title.string)  # text content of the title node

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

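First we declare a string variable html. Note that it is not a complete HTML document: the body and html nodes are never closed. We then pass it as the first argument to BeautifulSoup, with the parser type ('lxml') as the second argument. This initializes the BeautifulSoup object, which we assign to the variable soup. From then on, we can call the methods and attributes of soup to parse this HTML.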

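① Call the prettify() method. It outputs the parsed document as a string with standard indentation. The correction of the non-standard HTML string (filling in the missing closing tags) actually happens when the BeautifulSoup object is initialized, not in prettify() itself.

② Call soup.title.string. This outputs the text content of the title node in the HTML.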
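The automatic completion described above can be observed on a smaller input. The following sketch uses a made-up fragment (`broken`) with an unclosed p tag and assumes `lxml` is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the p tag is never closed and there is no html/body.
broken = '<p class="story">an unclosed paragraph'

soup = BeautifulSoup(broken, "lxml")

# The tree is already repaired at this point, before prettify() is called.
repaired = str(soup)
print(repaired)

# prettify() only changes the layout: one node per line, indented by depth.
print(soup.prettify())
```

Converting the soup to a string without prettify() already shows the closing tags, which confirms that the repair is done by the parser.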

5. Node selectors

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title)  # the selected title node
print(type(soup.title))  # the type of the title node
print(soup.title.string)  # the text inside the title node
print(soup.head)  # the head node
print(soup.p)  # the first p node only; later p nodes are ignored

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

[Note] bs4.element.Tag is an important data structure in Beautiful Soup. Every result selected with a node selector is of this Tag type.
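Because each selection is a Tag, it carries the node's name, attributes, and text, and it can be selected from again. A minimal sketch, using only the first p node from the example above and assuming `lxml` is installed:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "lxml")

p = soup.p
print(isinstance(p, Tag))  # the selection result is a Tag
print(p.name)              # node name
print(p.attrs)             # all attributes as a dict
print(p["name"])           # a single attribute value
print(p["class"])          # class is multi-valued, so it comes back as a list

# A Tag can be selected from again; the result is another Tag.
print(isinstance(p.b, Tag))
print(p.b.string)          # text inside the nested b node
```

Note that class is returned as a list because an HTML node may carry several classes, while name is returned as a plain string.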

6. Extracting information

# All of the examples below use this HTML text:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h
