Web Scraping (8): bs4, Part 1

辉子2020

Category: Web scraping

Article link: https://blog.csdn.net/m0_46738467/article/details/112464099

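This chapter introduces BeautifulSoup4 in detail: installing bs4, basic operations, object types, traversing the document tree, and the key methods find and find_all. Worked examples show how to locate and extract tags, attributes, and text from HTML, and highlight how convenient find_all is for extracting multiple matches.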

Chapter 8: bs4, Part 1

1. Introduction to bs4

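Beautiful Soup is a library for extracting data from HTML and XML files.
Install it first. The examples here use lxml as the parser, so install both: pip install lxml, then pip install beautifulsoup4 (pip install bs4 also works; it is a thin wrapper package that pulls in beautifulsoup4). If lxml is missing, BeautifulSoup(..., 'lxml') fails with a FeatureNotFound error.
With bs4 there is no separate query syntax to memorize; you simply call its methods, which is what makes it more convenient than regular expressions or XPath.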
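
The installation note above can be made a little more forgiving in code. The following is a minimal sketch, not part of the original article: the helper name make_soup is hypothetical, and it assumes that falling back to Python's built-in html.parser is acceptable when lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, else use the built-in parser.

    make_soup is a hypothetical helper introduced for this sketch.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

The two parsers can build slightly different trees for malformed markup, so the fallback is a convenience rather than a guarantee of identical results.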

2. Getting started with bs4

We'll use a short HTML document to demonstrate how bs4 works.

from bs4 import BeautifulSoup   # import the BeautifulSoup class, the class you will use most in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

To extract content from the document above with Beautiful Soup, we first parse it into a BeautifulSoup object.

from bs4 import BeautifulSoup   # import the BeautifulSoup class, the class you will use most in bs4
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, features='lxml')  # two arguments: the document itself, and features='lxml', which selects the parser
print(soup)  # prints the parsed document held by the BeautifulSoup object

Output:

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

For a more readable printout, use prettify():

print(soup.prettify())

This produces a clearly indented tree that makes the relationships between tags easy to see.

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

To print the title element, do this:

print(soup.title)

This gives:

<title>The Dormouse's story</title>

To get the tag's name and the string inside the tag:

print(soup.title.name)
print(soup.title.string)

Output:

title
The Dormouse's story
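
One detail worth knowing about .string: it only returns a value when the tag has a single child. The sketch below uses a shortened, made-up snippet (not the article's full document) and the built-in html.parser so that it runs without lxml; get_text() is shown as one way to collect all the text inside a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration only.
html = '<p class="story">Once <a href="http://example.com/elsie">Elsie</a> lived.</p>'
soup = BeautifulSoup(html, "html.parser")

single = soup.a.string         # <a> has exactly one text child
mixed = soup.p.string          # <p> holds text, a tag, and more text, so this is None
full_text = soup.p.get_text()  # concatenates every text piece inside <p>

print(single)
print(mixed)
print(full_text)
```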

To get the p tag:

print(soup.p)

The result is only the first of the three p tags in the document:

<p class="title"><b>The Dormouse's story</b></p>

To find all of them, use the find_all method:

res = soup.find_all('p')
print(res,len(res))

The result is a list (a ResultSet, which subclasses list) whose elements are all the p tags. Its length is 3, so every p tag was found.

[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>] 3
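
As a rough illustration of how find and find_all relate, here is a small sketch on a trimmed copy of the sample document. The trimmed markup and the variable names are assumptions made for this example; class_ and limit are standard find_all arguments, and html.parser is used so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Trimmed version of the sample document, for illustration only.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time ...
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.find("p")                      # first match only, like soup.p
stories = soup.find_all("p", class_="story")  # filter by CSS class
first_link = soup.find_all("a", limit=1)      # stop after one match, still a list
missing = soup.find("table")                  # no match: find returns None

print(first_p)
print(len(stories), len(first_link), missing)
```

Note that find returns a single Tag or None, while find_all always returns a list, which is empty when nothing matches.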

Each a tag has an href attribute holding a URL. How do we get those? Like this:

links = soup.find_all('a')
for link in links:
    print(link.get('href'))

This gives:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
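
The difference between link.get('href') and link['href'] matters when an attribute may be absent. A minimal sketch follows, using a made-up two-link snippet (the second link deliberately has no href) and html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute.
html = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="no-href">Anchor</a>'
)
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

with_get = [link.get("href") for link in links]  # missing attribute -> None

try:
    links[1]["href"]                             # dictionary-style access
    raised = False
except KeyError:
    raised = True                                # missing attribute -> KeyError

# href=True keeps only tags that actually carry the attribute.
hrefs = [link["href"] for link in soup.find_all("a", href=True)]

print(with_get, raised, hrefs)
```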

Those are the basic bs4 operations; as you can see, they are convenient and concise.

3. Types of bs4 objects

Tag: a tag
NavigableString: a navigable string
BeautifulSoup: the soup object
Comment: a comment

Let's get to know them through code.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(type(soup.title))
print(type(soup.p))
print(type(soup.a))

All three results are Tag objects:

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>

A Tag object supports the following operations:

print(soup.p.name)
print(soup.p.attrs)
print(soup.p.string)

Output:

p
{'class': ['title']}
The Dormouse's story
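
The attrs output above shows class as a list. The sketch below looks a little closer at how a Tag exposes its attributes; the markup is a made-up variant of the sample link with two classes, and html.parser is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the sample link with two class values.
html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

classes = tag["class"]            # multi-valued attribute: a list of strings
tag_id = tag["id"]                # single-valued attribute: a plain string
has_href = tag.has_attr("href")   # membership test without raising KeyError

tag["id"] = "first-link"          # attributes can be reassigned like dict items

print(tag.name, classes, tag_id, has_href)
print(tag.attrs["id"])
```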

The type of soup.p.string can be checked with:

print(type(soup.p.string))

This gives:

<class 'bs4.element.NavigableString'>

This is the NavigableString type. It is a subclass of Python's str, so it supports concatenation and the other usual string operations.
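
A short sketch of what "behaves like an ordinary string" means in practice. The one-tag markup is illustrative, and html.parser is used so the snippet does not require lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>The Dormouse's story</b>", "html.parser")
s = soup.b.string

is_navigable = isinstance(s, NavigableString)
is_str = isinstance(s, str)      # NavigableString subclasses str

combined = s + " (title)"        # concatenation works as with any string
upper = s.upper()                # so do the usual str methods
plain = str(s)                   # a plain str copy, detached from the parse tree

print(is_navigable, is_str)
print(combined)
print(upper)
```

Converting with str() is useful when the text will be kept around after parsing, since a NavigableString keeps a reference to the tree it came from.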

print(type(soup))

We see that soup itself is a BeautifulSoup object:

<class 'bs4.BeautifulSoup'>

Next is the Comment type, which is not used often. Let's write an arbitrary comment (the text means "Happy New Year!!"):

html = '<a><!--新年快乐！！--></a>'
soup = BeautifulSoup(html,'lxml')
print(soup.a.string)

First, print it to see the effect:

新年快乐！！

The comment text is printed. Now let's check its type.

html = '<a><!--新年快乐！！--></a>'
soup = BeautifulSoup(html,'lxml')
print(type(soup.a.string))

It is a Comment object:

<class 'bs4.element.Comment'>
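
Because .string returns comment text just as it returns ordinary text, it can help to check the type before using the value. This is a minimal sketch on a made-up snippet, assuming html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical snippet mixing visible text with a comment.
html = "<p>Hello<!-- internal note -->World</p>"
soup = BeautifulSoup(html, "html.parser")

# Collect only the comment nodes.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Comment is itself a NavigableString subclass, so exclude it explicitly
# when only the visible text pieces are wanted.
visible = [
    str(child)
    for child in soup.p.children
    if isinstance(child, NavigableString) and not isinstance(child, Comment)
]

print(comments)
print(visible)
```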

With that, we have met all four object types.

4. Traversing the document tree

Let's first look at the commonly used parsers:

Parser	Usage	Advantages	Disadvantages
