python3 beautifulsoup simplifying web page elements_BeautifulSoup explained: getting web page data with ease

Latest recommended article published on 2022-06-24 09:47:33

weixin_39605345


Article link: https://blog.csdn.net/weixin_39605345/article/details/111958078


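Being excellent is not enough; aim to be irreplaceable!

Python version: 3.8.0. Development tool: PyCharm.

In the previous section we managed to fetch the content of a web page, but what came back was one long string of HTML, not the data we actually want.

In this section we look at how to parse such pages and pull out the data we want with little effort.

There are many tools for parsing web pages, and the regular expressions covered earlier can be used for this as well.

This section introduces parsing with BeautifulSoup4.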

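Installing BeautifulSoup4

Open cmd.

Run pip3 install beautifulsoup4

pip3 is the pip that belongs to Python 3; if you do not need to distinguish between Python versions, plain pip works as well.

After a successful installation, pip prints a line beginning with "Successfully installed".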

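BeautifulSoup4 quick start

1. Import the bs4 library

from bs4 import BeautifulSoup

2. Create a BeautifulSoup object

First create a demo page:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Create a BeautifulSoup object, naming the parser explicitly (html.parser ships with Python):

soup = BeautifulSoup(html, 'html.parser')

Or create the object from a local HTML file:

with open('demo.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')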
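The object-creation step can be tried on its own with a much smaller page. This is a minimal sketch; html_doc is a made-up stand-in document, and html.parser is chosen only because it needs no extra installation.

```python
from bs4 import BeautifulSoup

# A small stand-in document (hypothetical, not the article's demo page).
html_doc = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Naming the parser explicitly keeps the behaviour the same on every machine.
soup = BeautifulSoup(html_doc, "html.parser")

print(type(soup).__name__)   # the class name of the parsed document object
print(soup.title.string)     # the text inside <title>
print(soup.prettify())       # the same document, indented one node per line
```

prettify() is handy for checking what the parser actually built before writing any lookups against it.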

3. BeautifulSoup converts an HTML document into a tree in which every node is a Python object. All of these objects fall into 4 types:

Tag

NavigableString

BeautifulSoup

Comment

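(1) Tag

A Tag corresponds to a single tag in the HTML, for example:

<title>The Dormouse's story</title>
<b>The Dormouse's story</b>
<p>Once upon a time there were three little sisters;</p>

An HTML tag such as title, b, or p above, together with the content it encloses, is a Tag. Let's see how to get Tags with BeautifulSoup:

print(soup.title)
# Output: <title>The Dormouse's story</title>

print(soup.head)
# Output: <head><title>The Dormouse's story</title></head>

print(soup.p)
# Output: <p class="title" name="dromouse"><b>The Dormouse's story</b></p>

Note: accessing a tag by name returns only the first matching tag in the document.

Every tag has a name, and a tag can carry several attributes, available as attrs. Tag attributes are read and written the same way as a dictionary.

print(soup.title.name)
# Output: title

print(soup.p.attrs)
# Output: {'class': ['title'], 'name': 'dromouse'}

Get a single attribute dictionary-style:

# Either form works; get() returns None when the attribute is missing, while [] raises KeyError
print(soup.p.get('class'))
print(soup.p['class'])
# Both print: ['title']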
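The dictionary-style access described above can be sketched on a one-tag document. The markup and variable names here are illustrative assumptions, not part of the demo page.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" id="intro">Hello</p>', "html.parser")
tag = soup.p

name = tag.name                # the tag's name as a string
classes = tag["class"]         # class is multi-valued, so it comes back as a list
tag_id = tag["id"]             # ordinary attributes are plain strings
missing = tag.get("href")      # get() returns None for a missing attribute
has_id = tag.has_attr("id")    # explicit membership test

# Attributes can be assigned and deleted like dictionary entries.
tag["id"] = "summary"
del tag["class"]
print(tag)
```

Because class can hold several values, it is always returned as a list, which is why the article's output shows ['title'] and not 'title'.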

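(2) NavigableString

Literally, a string that can be navigated.

Once a Tag has been located, the text inside it is available as a NavigableString through .string:

# Get the text inside the tag
print(soup.p.string)
# Output: The Dormouse's story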
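To see what "navigable" means in practice, the sketch below inspects the value returned by .string. The one-line document is an assumption made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p><b>Bold text</b></p>", "html.parser")

text = soup.b.string
is_navigable = isinstance(text, NavigableString)  # it is a NavigableString...
is_str = isinstance(text, str)                    # ...which is also a str subclass
parent_name = text.parent.name                    # the string still knows where it sits in the tree

# Convert to a plain str when the link back to the tree is no longer needed.
plain = str(text)
print(plain, parent_name)
```

Converting to a plain str is worthwhile when many strings are kept around, since a NavigableString keeps a reference to the whole parsed tree.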

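(3) BeautifulSoup

The BeautifulSoup object represents the document as a whole. Most of the time it can be treated as a special Tag, so we can read its name and attributes:

print(soup.name)
# Output: [document]

print(soup.attrs)
# Output: {}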

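(4) Comment

A Comment object is a special type of NavigableString; when printed, its content does not include the comment markers.

For example:

print(soup.a)
# Output: <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

print(soup.a.string)
# Output: Elsie (with the spaces that surround it inside the comment preserved)

The content of the a tag is really a comment, yet .string prints it with the comment markers removed, so it is easy to mistake for ordinary text. To tell the two apart, check the type before using the value:

from bs4 import Comment

if isinstance(soup.a.string, Comment):
    print("comment:" + soup.a.string)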
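The type check can also be used to separate comments from real text across a whole document. A minimal sketch, assuming a two-link document invented for this example and Beautiful Soup 4.4.0 or later (for the string argument):

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>', "html.parser"
)

# Label each link according to the type of its .string value.
kinds = {}
for a in soup.find_all("a"):
    kinds[a["id"]] = "comment" if isinstance(a.string, Comment) else "text"
print(kinds)

# Collect every comment in the document by filtering on the string type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```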

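Finding and extracting data with BeautifulSoup4

Traversing the document tree

BeautifulSoup turns the HTML document into a tree, which we can then walk.

(1) Node content

Use the .string attribute to get the content of a node.

If the current tag contains only text, or contains exactly one child tag, .string returns the content:

# Current tag:
# <head><title>The Dormouse's story</title></head>

print(soup.head.string)
print(soup.title.string)
# Both print the same thing:
# The Dormouse's story

If the current tag has more than one child, .string returns None:

print(soup.html.string)
# Output: None

So how do we get the content when the current tag has several children? Use .strings.

.strings is a generator, so iterate over it:

for string in soup.strings:
    print(string)

Use stripped_strings to skip whitespace-only strings and strip leading and trailing whitespace from the rest:

for string in soup.stripped_strings:
    print(string)

# Output:
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Lacie
# and
# Tillie
# and they lived at the bottom of a well.
# ...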
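The difference between .string, .strings and .stripped_strings is easiest to see side by side. The sketch below uses a tiny made-up div; get_text() is included as a closely related convenience method that the article does not cover.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> one </p><p>two</p></div>", "html.parser")

single = soup.p.string                       # the first p has exactly one string child
none_for_many = soup.div.string              # div has two children, so this is None
all_strings = list(soup.div.strings)         # every string, whitespace kept
stripped = list(soup.div.stripped_strings)   # whitespace trimmed, empty strings skipped
joined = soup.div.get_text(" ", strip=True)  # all text joined with a separator

print(repr(single), none_for_many, all_strings, stripped, repr(joined))
```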

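(2) Parent nodes

.parent returns the direct parent of a node; .parents iterates over all of its ancestors.

Use .parent to get the direct parent:

# Start from the title node
current_tag = soup.head.title
# Print the name of its parent
print(current_tag.parent.name)
# Output: head

Use .parents to walk up through every ancestor:

# Start from the title node
current_tag = soup.head.title
# Print the name of every ancestor
for now_tag in current_tag.parents:
    print(now_tag.name)
# Output:
# head
# html
# [document]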

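A short sketch of walking upward from a deeply nested tag, using a made-up document. It also shows where the walk ends: the BeautifulSoup object itself has no parent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div><p><b>deep</b></p></div></body></html>", "html.parser"
)

b = soup.b
direct_parent = b.parent.name                # one level up
ancestors = [tag.name for tag in b.parents]  # every level, nearest first
top_parent = soup.parent                     # the document object is the root

print(direct_parent, ancestors, top_parent)
```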
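(3) Child nodes

.contents returns the direct children of a tag as a list:

print(soup.head.contents)
# Output:
# [<title>The Dormouse's story</title>]

.children returns an iterator over the same direct children; loop over it to get each one:

for child in soup.body.children:
    print(child)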
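The following sketch contrasts direct children with all descendants on a small made-up list. .descendants is a related attribute not covered in the text; it is shown here only to make clear that .contents and .children stop at the first level.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>a</li><li><b>b</b></li></ul>", "html.parser")
ul = soup.ul

child_list = ul.contents                         # a real list of direct children
child_names = [c.name for c in ul.children]      # same nodes, via an iterator
descendant_tags = [n.name for n in ul.descendants if isinstance(n, Tag)]
descendant_count = len(list(ul.descendants))     # tags and strings at every depth

print(len(child_list), child_names, descendant_tags, descendant_count)
```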

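(4) Sibling nodes

.next_sibling and .previous_sibling return the next and the previous node under the same parent (sibling nodes). Note that a sibling can be a string, such as the line break between two tags, and not only a tag.

If there is no such node, they return None.

# The previous sibling of p's next sibling is p itself
print(soup.body.p.next_sibling.previous_sibling.string)
# Output: The Dormouse's story

.next_siblings and .previous_siblings iterate over all following or preceding siblings of the current node:

for sibling in soup.body.p.next_siblings:
    print(sibling.string)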
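Because whitespace between tags counts as a sibling, .next_sibling often returns a line break where a tag was expected. The sketch below, on a made-up three-paragraph document, shows this and uses find_next_sibling / find_previous_siblings to skip over strings.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p id='a'>A</p>\n<p id='b'>B</p>\n<p id='c'>C</p></div>", "html.parser"
)

first = soup.find(id="a")
raw_next = first.next_sibling                      # the newline string, not a tag
next_tag = first.find_next_sibling("p")            # the next sibling that is a p tag
later_ids = [s["id"] for s in first.find_next_siblings("p")]

last = soup.find(id="c")
no_next = last.next_sibling                        # nothing follows the last p
earlier_ids = [s["id"] for s in last.find_previous_siblings("p")]  # nearest first

print(repr(raw_next), next_tag["id"], later_ids, no_next, earlier_ids)
```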

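(5) Next and previous elements

.next_element and .previous_element point to the node parsed immediately after and immediately before the current one, regardless of nesting level.

# Print the content of the node right after title
print(soup.title.next_element.string)
# Output: The Dormouse's story

# Print the content of the node right before title
print(soup.title.previous_element.string)
# Output: The Dormouse's story

.next_elements and .previous_elements iterate over every node after or before the current one:

# Print every node that comes before the first p node in body
for current_tag in soup.body.p.previous_elements:
    print(current_tag.string)

# The output includes The Dormouse's story (from the title) and None (for tags such as html,
# whose .string is None because they have several children). Depending on the parser,
# whitespace-only strings between tags may be printed as blank lines as well.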

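The difference between sibling navigation and element navigation can be sketched on a two-tag paragraph (an assumed example): .next_sibling stays on the same level, while .next_element follows parse order and steps into the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b = soup.b

sibling = b.next_sibling            # the <i> tag, on the same level as <b>
element = b.next_element            # the string inside <b>, parsed right after the tag
before_i = soup.i.previous_element  # the string 'one', parsed just before <i>

# Walk everything parsed after <p>: tag names for tags, text for strings.
order = [el.name if isinstance(el, Tag) else str(el) for el in soup.p.next_elements]

print(sibling, repr(element), repr(before_i), order)
```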
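Searching the document tree

(1) find_all: search all descendants and return a list

find_all(name, attrs, recursive, text, limit, **kwargs) searches all tag descendants of the current tag and keeps those that match the filters. (Since Beautiful Soup 4.4.0 the text argument is also accepted under the name string.)

The name parameter

name finds every tag whose name matches; string nodes are ignored automatically.

name accepts several kinds of filter.

A string: matches tags with exactly that name.

For example, 'b' matches b tags.

A regular expression: matches every tag whose name matches the expression.

For example, re.compile("^b") matches all body tags and b tags.

A list: matches any tag whose name is in the list.

For example, ['a', 'b'] matches all a tags and b tags.

True: matches every tag, but never returns string nodes.

A function: the function takes a tag as its only argument; if it returns True the tag matches and is included in the result, and if it returns False the tag is skipped.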
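The five kinds of name filter listed above can be sketched against one small document. The markup and the helper has_href are hypothetical, written only for this illustration.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><b>bold</b><a href='/x'>link</a><a>bare</a></body></html>",
    "html.parser",
)

by_string = [t.name for t in soup.find_all("b")]              # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]         # any name in the list
by_true = [t.name for t in soup.find_all(True)]               # every tag, no strings

def has_href(tag):
    """Hypothetical filter: keep only a tags that carry an href attribute."""
    return tag.name == "a" and tag.has_attr("href")

by_function = soup.find_all(has_href)

print(by_string, by_regex, by_list, by_true, by_function)
```

Results always come back in document order, regardless of the order of names in a list filter.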

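The attrs parameter

Any keyword argument that is not one of find_all's own parameter names is treated as a filter on the tag attribute of that name.

For example, passing an argument named id makes BeautifulSoup filter on each tag's "id" attribute. Because class is a reserved word in Python, the class attribute is filtered with class_.

import re

# All descendants whose id is link2
soup.find_all(id='link2')

# All a nodes whose class is sister
soup.find_all("a", class_="sister")

# All descendants whose href matches elsie
soup.find_all(href=re.compile('elsie'))

# Several attribute filters at once; a tag must satisfy all of them
soup.find_all(id='link2', class_="sister", href=re.compile('lacie'))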
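A runnable sketch of the attribute filters, on markup assumed for this example. It also shows the attrs dictionary, which is needed for attribute names such as data-role that cannot be written as Python keyword arguments.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a class="sister big" id="link1" href="http://example.com/elsie">E</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">L</a>'
    '<div data-role="note">n</div>',
    "html.parser",
)

by_id = [t["id"] for t in soup.find_all(id="link2")]
# class_ matches when any one of the tag's classes equals the value.
by_class = [t["id"] for t in soup.find_all("a", class_="sister")]
by_href = [t["id"] for t in soup.find_all(href=re.compile("elsie"))]
# data-role is not a valid keyword name, so it goes in the attrs dictionary.
by_data = [t.name for t in soup.find_all(attrs={"data-role": "note"})]
# All filters must hold at once.
combined = [t["id"] for t in soup.find_all("a", id="link2", href=re.compile("lacie"))]

print(by_id, by_class, by_href, by_data, combined)
```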

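The recursive parameter

To search only the direct children of the current node, and not its grandchildren or deeper descendants, set recursive=False.

# Search all descendants of the html node
print(soup.html.find_all(name="title"))
# Output: [<title>The Dormouse's story</title>]

# Search only the direct children of the html node
print(soup.html.find_all(name="title", recursive=False))
# Output: []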

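The effect of recursive=False is clearest when the same tag name appears at two depths. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>child</p><section><p>grandchild</p></section></div>", "html.parser"
)
div = soup.div

# Default: every descendant p is found.
all_p = [str(p.string) for p in div.find_all("p")]
# recursive=False: only p tags that are direct children of div.
direct_p = [str(p.string) for p in div.find_all("p", recursive=False)]

print(all_p, direct_p)
```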
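The text parameter

text searches the strings in the document instead of the tags, and returns the strings that match.

text accepts a string, a regular expression, a list, or True.

print(soup.find_all(text="Elsie"))
# Output: []  (the only Elsie is inside a comment, stored as " Elsie " with spaces, so there is no exact match)

print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
# Output: ['Lacie', 'Tillie']

print(soup.find_all(text=re.compile("Dormouse")))
# Output: ["The Dormouse's story", "The Dormouse's story"]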
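A sketch of string searching on a two-paragraph document invented for the example. It uses the newer argument name string, which assumes Beautiful Soup 4.4.0 or later; on older versions the same calls would be written with text.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Lacie</p><p>Tillie and Lacie</p>", "html.parser")

# On its own, string returns the matching strings themselves.
exact = soup.find_all(string="Lacie")                 # whole-string equality
partial = soup.find_all(string=re.compile("Lacie"))   # regular-expression search

# Combined with a tag name, it returns the tags whose .string matches.
tags = soup.find_all("p", string=re.compile("Tillie"))

print(exact, partial, tags)
```

An exact string only matches when the whole string is equal, which is why a regular expression is usually the better choice for "contains" searches.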

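The limit parameter

When the document tree is very large and a full search would take a long time, we can cap the number of results returned, much like the LIMIT keyword in SQL. The search stops as soon as the cap is reached.

# Return at most two a tags
soup.find_all(name='a', limit=2)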

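(2) find: search all descendants and return the first result

find_all returns every matching descendant, as a list.

find returns only the first matching descendant, or None if nothing matches.

(3) find_parent: search the ancestors

find_parent returns the closest ancestor of the current node that matches.

find_parents returns all matching ancestors of the current node.

(4) find_next_sibling: search the siblings after this node

find_next_sibling returns the first matching sibling after the current node.

find_next_siblings returns all matching siblings after the current node.

(5) find_previous_sibling: search the siblings before this node

find_previous_sibling returns the first matching sibling before the current node.

find_previous_siblings returns all matching siblings before the current node.

(6) find_all_next: search all nodes after this node

find_next returns the first matching node after the current node.

find_all_next returns all matching nodes after the current node.

(7) find_all_previous: search all nodes before this node

find_previous returns the first matching node before the current node.

find_all_previous returns all matching nodes before the current node.

The filter parameters of (2)-(7) work the same way as in (1), so apply them as described there.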
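The methods in (2)-(7) differ only in the direction they search. The sketch below runs several of them from the same starting tag; the document and its ids are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>T</title></head><body>"
    '<p id="first">one <a id="link1">A</a> <a id="link2">B</a></p>'
    '<p id="second">two</p>'
    "</body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

link1 = soup.find("a")                                # first match only
missing = soup.find("table")                          # None when nothing matches
parent_id = link1.find_parent("p")["id"]              # closest matching ancestor
ancestor_names = [t.name for t in link1.find_parents(["p", "body"])]
next_sibling_id = link1.find_next_sibling("a")["id"]  # same level, after link1
ids_after = [t["id"] for t in link1.find_all_next(id=True)]  # any level, after link1
title_before = str(link1.find_previous("title").string)      # any level, before link1

print(parent_id, ancestor_names, next_sibling_id, ids_after, title_before)
```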

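CSS selectors

When writing CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #.

We can filter elements the same way here with soup.select(), which returns a list.

(1) Select by tag name

Returns every match as a list.

# Find title tags
print(soup.select('title'))

# Find a tags
print(soup.select('a'))

(2) Select by class name

# Find everything whose class is sister
print(soup.select('.sister'))

(3) Select by id

# Find everything whose id is link1
print(soup.select('#link1'))

(4) Combined selectors

# Inside p tags, find everything whose id is link1
print(soup.select('p #link1'))

# Inside p tags, find everything whose class is sister
print(soup.select('p .sister'))

# Child selector: p tags that are direct children of body
print(soup.select('body > p'))

# Combined lookup:
# everything with id link1, inside a tag with class story, inside body
print(soup.select('body .story #link1'))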

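(5) Select by attribute

Attributes can be part of a selector too; they are written inside square brackets.

Note that the attribute belongs to the same node as the tag name, so there must be no space between them; with a space the selector no longer describes the same node and will fail to match.

# Find the a tag whose id is link1
print(soup.select('a[id="link1"]'))

# Find the a tag with id link2 inside a p tag
print(soup.select('p a[id="link2"]'))

# Print the text of the a tag whose id is link2
print(soup.select('a[id="link2"]')[0].string)
# Output: Lacie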
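The selector forms above can be checked together on a small document (assumed markup). The sketch also uses select_one, which is not covered in the text; it returns the first match or None, and so avoids indexing into an empty list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '</p><a id="outside">Other</a></body>',
    "html.parser",
)

by_class = [a["id"] for a in soup.select("a.sister")]           # tag name plus class
children = [a["id"] for a in soup.select("p.story > a")]        # direct children only
by_suffix = [a["id"] for a in soup.select('a[href$="lacie"]')]  # href ends with lacie
first_match = soup.select_one("#link2").get_text()              # first match, not a list
no_match = soup.select_one("table")                             # None when nothing matches

print(by_class, children, by_suffix, first_match, no_match)
```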

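CSS selection and find_all do essentially the same job, so use whichever suits the situation.

Feel free to bookmark this article and come back to it when you need it.

This article covered parsing web pages with BeautifulSoup, focusing on its search features. It can also modify and delete parts of the tree, which were not covered here, but in web scraping we search far more often, so the features described today are enough to get started.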
