Python Web Scraping: The BeautifulSoup Parsing Library

Latest recommended article published on 2023-09-05 10:07:03

江南小作坊

Category: Python study notes

Article link: https://blog.csdn.net/Cherish1ove/article/details/82817161

Using BeautifulSoup

Installing BeautifulSoup
Parsers
Basic usage
Finding elements with BeautifulSoup
Additional notes
Summary

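Installing BeautifulSoup

BeautifulSoup is a third-party tool. It ships in a package named bs4 and has to be installed separately. In a command window, go to the Python installation directory (for example c:\Python36), enter the Scripts subdirectory, locate the pip program, and run pip install bs4 (the project is published on PyPI as beautifulsoup4; bs4 is a thin alias that pulls it in). To check whether bs4 is installed, run the statement from bs4 import BeautifulSoup in the Python interpreter. If it raises no error, the installation succeeded.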
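The import check described above can also be scripted. The following is a minimal sketch; the helper name `is_installed` is hypothetical and not part of bs4 or of the original article.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("bs4"):
    import bs4
    print("bs4 is installed, version", bs4.__version__)
else:
    print("bs4 is missing; try: python -m pip install beautifulsoup4")
```

Using `python -m pip` instead of locating pip in the Scripts directory usually avoids installing into a different interpreter than the one you run.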

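Parsers

BeautifulSoup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it also supports several third-party parsers (such as lxml). The table below lists the parsers BeautifulSoup supports.

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses the document the way a browser does; produces valid HTML5	Slow; requires an external Python package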
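Because lxml and html5lib are optional installs, one way to use the table above is to try the preferred parser first and fall back to the built-in one. This is a sketch under that assumption; `make_soup` and its `preferred` parameter are hypothetical names, while `FeatureNotFound` is the exception bs4 raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse `markup` with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) pair so the caller knows which one was used.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is available")


soup, parser = make_soup("<p>hi</p>")
print(parser, soup.p.string)
```

Different parsers can build different trees from the same malformed markup, so it is worth logging which parser was actually chosen.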

Basic usage

doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title" name="dormouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three sisters; and their names were
        <a href="http://example/com/elsie" class="sister" id="link1"><!-- Elise --></a>,
        <a href="http://example/com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example/com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>
        <p class="story">...</p>
"""
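# Create a BeautifulSoup object. doc is an HTML document string; "lxml" tells BeautifulSoup to parse it with the lxml parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc,"lxml")
# prettify() returns the parsed document as a string with standard indentation
print(soup.prettify())
# Print the text of the title node: soup.title selects the title node, and its string attribute returns the text inside it
print(soup.title.string)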

The output is:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three sisters; and their names were
   <a class="sister" href="http://example/com/elsie" id="link1">
    <!-- Elise -->
   </a>
   ,
   <a class="sister" href="http://example/com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example/com/tillie" id="link3">
    Tillie
   </a>
   ;
        and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

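BeautifulSoup is very tolerant when loading documents. If it finds that elements in the HTML document are missing (for example the closing html and body tags in the sample above), it repairs the document as far as it can so that the final document tree is complete. This matters because most real web pages have some missing or unclosed elements, and BeautifulSoup can still load them in most cases.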
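A small illustration of this repair behaviour, using the built-in html.parser so that no extra install is assumed (the `broken` fragment is made up for this sketch):

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in the input.
broken = "<p>Hello <b>world"

soup = BeautifulSoup(broken, "html.parser")

# The parser closes the open tags when it reaches the end of the input.
print(soup)  # <p>Hello <b>world</b></p>
```

With lxml or html5lib the repaired tree would additionally be wrapped in html and body elements, as in the prettify() output above.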

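Finding elements with BeautifulSoup

Node selectors

Accessing a node by its tag name selects that node element, and its string attribute then returns the text inside it. This way of selecting is very fast. If the node you need sits at a clear, unambiguous place in the hierarchy, this is a good way to parse it.

Selecting elements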

doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
        <p class="title" name="dormouse"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three sisters; and their names were
        <a href="http://example/com/elsie" class="sister" id="link1"><!-- Elise --></a>,
        <a href="http://example/com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example/com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.</p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(doc,"lxml")
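# Print the selected title node: the output is the whole node, including its tags and text
print(soup.title)
# Print the type of the title node
print(type(soup.title))
# A Tag has attributes such as string, which returns the node's text
print(soup.title.string)
# When several nodes match, this style selects only the first one
print(soup.head)
print(soup.p)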

The output is:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
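As the last line of output shows, attribute-style access returns only the first matching node. When every match is needed, `find_all()` is the usual tool; it is a standard bs4 method that this part of the article has not introduced, so treat the following as a side sketch on a shortened copy of the sample document.

```python
from bs4 import BeautifulSoup

doc = """
<p class="story">
  <a href="http://example/com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example/com/lacie" class="sister" id="link2">Lacie</a> and
  <a href="http://example/com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(doc, "html.parser")

# Attribute-style access: only the first <a> node.
first_id = soup.a["id"]

# find_all: every <a> node, in document order.
all_ids = [a["id"] for a in soup.find_all("a")]

print(first_id)  # link1
print(all_ids)   # ['link1', 'link2', 'link3']
```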

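Extracting information

(1) Getting the name

The name attribute returns the name of a node. Using the document above as an example, select the title node and read its name attribute: print(soup.title.name) prints title.

(2) Getting attributes

A node may have several attributes, such as id and class. Once the node element is selected, attrs returns all of them: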

print(soup.p.attrs)
print(soup.p.attrs['name'])

The output is:

{'class': ['title'], 'name': 'dormouse'}
dormouse

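As you can see, attrs returns a dictionary that maps every attribute of the selected node to its value. Getting the name attribute is then just a dictionary lookup: put the attribute name in square brackets, as in attrs['name'].
That is still a little verbose. The same lookup can be written directly on the node: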

print(soup.p['name'])
print(soup.p['class'])

The output is:

dormouse
['title']

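Note: some results are strings and others are lists of strings. The name attribute holds a single value, so the result is a string. A node can have several classes, so class is returned as a list.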
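The string-versus-list distinction is easiest to see on a node with more than one class. A minimal sketch on a made-up fragment; `Tag.get()` is the standard bs4 method that returns None instead of raising KeyError when an attribute is absent.

```python
from bs4 import BeautifulSoup

html = '<p class="story intro" id="first">Once upon a time</p>'
p = BeautifulSoup(html, "html.parser").p

classes = p["class"]    # multi-valued attribute: a list of strings
node_id = p["id"]       # single-valued attribute: a plain string
missing = p.get("name") # absent attribute: None rather than KeyError

print(classes)  # ['story', 'intro']
print(node_id)  # first
print(missing)  # None
```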

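(3) Getting the content

The string attribute returns the text contained in a node element. For example, to get the text of the first p node:

# The p node selected here is the first p node, so the text is also that of the first p node
print(soup.p.string)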

The output is:

The Dormouse's story
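The string attribute works here because the first p node has a single child. For a node with several children, string is None, and `get_text()` (a standard bs4 method not covered above) joins all the text inside the node. A short sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one <b>two</b> three</p>", "html.parser")

single = soup.b.string      # <b> has exactly one child, so this is its text
mixed = soup.p.string       # <p> has three children, so string is None
all_text = soup.p.get_text()  # concatenates every piece of text inside <p>

print(single)    # two
print(mixed)     # None
print(all_text)  # one two three
```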

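Nested selection

As the examples above show, every result is of type bs4.element.Tag, and a Tag can itself be used to select further nodes. For example, after selecting the head node element, you can call title on it to select the title node element inside it.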

html = """
    <html><head><title>The
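A minimal sketch of nested selection. The `html` string below is a short illustrative document of the same shape as the earlier samples, not the original listing, and html.parser is used only to avoid assuming lxml is installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

head = soup.head          # first selection returns a Tag
title = soup.head.title   # selecting again on that Tag also returns a Tag

print(title)                 # <title>The Dormouse's story</title>
print(type(title) is Tag)    # True
print(soup.head.title.string)  # The Dormouse's story
```

Each step of the chain returns a Tag, so selections can be nested to any depth as long as every intermediate node exists.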
