02_BeautifulSoup4 Module: Introduction and Usage / Data Persistence

疋瓞

Last modified 2023-10-06 15:14:27


Category: Python web scraping (python爬虫). Tags: python

First published 2023-06-05 15:58:19

Article link: https://blog.csdn.net/sz1125218970/article/details/131043933


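1. Introduction to the BeautifulSoup4 module:

What it is: a third-party Python library.
Purpose: once you have a page's source code, extract data from the HTML or XML document.
Install command: pip install BeautifulSoup4
Installation note: besides the command above, you can also install it through PyCharm's graphical package installer.
Calling BeautifulSoup on the page source parses the document and returns a BeautifulSoup object (in essence, a tree structure). This parsing step requires a parser.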
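
To make the "tree structure" idea concrete, here is a minimal sketch. The `tiny_html` string and the variable names are made up for illustration and are not part of the original article. It uses Python's built-in `html.parser`, so nothing beyond bs4 itself needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document, used only for this illustration
tiny_html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"

# The second argument names the parser; 'html.parser' ships with Python
tiny_soup = BeautifulSoup(tiny_html, "html.parser")

# The result is a BeautifulSoup object whose nodes know their parents and children
root_type = type(tiny_soup).__name__
title_text = tiny_soup.title.string
p_parent = tiny_soup.p.parent.name

print(root_type)   # class name of the parsed document object
print(title_text)  # text inside <title>
print(p_parent)    # name of the tag that contains <p>
```

Other parsers such as lxml can be passed in the same position, but they have to be installed separately.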

2. Example code:

html_str = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="title"><b>The Dormouse's story</b></p>
        <p class="story">Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

# BeautifulSoup(page source, parser)
soup = BeautifulSoup(html_str, 'html.parser')
# Parsing converts the HTML source into a tree structure, which makes later lookups easier.

# print(soup, type(soup))

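# Methods and attributes for extracting content from the tree
# select: takes a CSS selector (tag, id, class, parent-child, descendant,
# nth-of-type, and so on) and returns every match in the tree as a list.

# select_one: takes the same kinds of CSS selector (tag, id, class, parent-child,
# descendant, nth-of-type, and so on) and returns only the first match in the tree.

# text: the text content inside a tag.
# attrs: the tag's attributes as a dictionary; look up a value by attribute name.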

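# Q1: extract the p tags
# Tag selector: write only the tag name to get every such tag in the whole HTML source
p_list = soup.select('p')
print(p_list)
# Parent-child selector: write from the outermost tag to the innermost, joined with >
# (keep a space on each side of >; older Beautiful Soup versions required it)
p_list2 = soup.select('html > body > p')
print(p_list2)
# Descendant selector: write from outer to inner, joined with a space
# (the tag right of the space is a descendant of the tag left of it)
p_list3 = soup.select('html p')
print(p_list3)

# Q2: get the three a tags whose class attribute value is sister
# class selector: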
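
The comments above describe the class selector, `select_one`, `text` and `attrs` without showing them in use. The following is a hedged sketch of how they might be combined, not the author's own code. The `demo_html` fragment and all variable names are assumptions, modeled on the article's `html_str`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the article's html_str (an assumption)
demo_html = """
<p class="story">
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# class selector: a dot followed by the class value
sisters = demo_soup.select("a.sister")
names = [a.text for a in sisters]  # .text gives the content inside each tag
print(names)

# id selector: a hash followed by the id value; select_one returns the first match
first = demo_soup.select_one("#link1")
first_href = first.attrs["href"]  # .attrs is a dict keyed by attribute name
print(first.text, first_href)

# nth-of-type selector: the second a tag among its siblings
second = demo_soup.select_one("p > a:nth-of-type(2)")
second_id = second.attrs["id"]
print(second_id)

# select_one returns None when nothing matches, so check before using the result
missing = demo_soup.select_one("a.brother")
print(missing)
```

Because `select` always returns a list (possibly empty) while `select_one` returns a single tag or `None`, it is usually worth guarding `select_one` results before reading `.text` or `.attrs`.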
