Using the BeautifulSoup Module

funNLPer

Article link: https://blog.csdn.net/orangerfun/article/details/117265604

1. Introduction

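BeautifulSoup quickly extracts data from HTML and XML documents. The markup has to be obtained first, for example by downloading it from a URL with urllib.request or by reading a saved file from disk. The call BeautifulSoup(html, "html.parser") takes two arguments: the first is the HTML to extract data from (a string or an open file object), and the second names the parser to use.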
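As a minimal sketch of the two-argument call described above, the snippet below parses a small inline HTML string. The helper fetch_html is a hypothetical name introduced only for illustration, and the UTF-8 decoding inside it is an assumption about the target page; it is defined but not called here, so the example runs without network access.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Hypothetical helper: download a page and decode it as UTF-8 (assumed encoding)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# An inline document stands in for a downloaded page.
html = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup(html, "html.parser")
title_text = soup.title.string
print(title_text)
```

"html.parser" ships with Python, so no extra parser package is needed for this form of the call.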

2. Using BeautifulSoup

The following HTML file is used as the example to demonstrate what the module does.

<!DOCTYPE html>
<!-- saved from url=(0022)https://www.baidu.com/ -->
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <script async="" src="./百度一下，你就知道_files/every_cookie_4644b13.js.下载"></script>
        <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        <meta content="always" name="referrer">
        <meta name="theme-color" content="#2932e1">
        <meta name="description" content="全球最大的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。">
        <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">
        <link rel="search" type="application/opensearchdescription+xml" href="https://www.baidu.com/content-search.xml" title="百度搜索">
        <link rel="icon" sizes="any" mask="" href="https://www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">
        <link rel="dns-prefetch" href="https://dss0.bdstatic.com/">
        <link rel="dns-prefetch" href="https://dss1.bdstatic.com/">
        <link rel="dns-prefetch" href="https://ss1.bdstatic.com/">
        <link rel="dns-prefetch" href="https://sp0.baidu.com/">
        <link rel="dns-prefetch" href="https://sp1.baidu.com/"><link rel="dns-prefetch" href="https://sp2.baidu.com/">
        <title>百度一下，你就知道</title>
        <script defer="" src="./百度一下，你就知道_files/cd37ed75a9387c5b.js.下载"></script>
    	<script src="./百度一下，你就知道_files/s_super_async-ea15d081fd.js.下载"></script>
    	<span id="s_strpx_span1" style="visibility:hidden;position:absolute;bottom:0;left:0;font-weight:bold;font-size:12px;font-family:&#39;arial&#39;;">中</span>
    </head>
</html>

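2.1 Extracting data with BeautifulSoup

from bs4 import BeautifulSoup

# Read the saved page; the with-block closes the file afterwards
with open("./baidu.html", "r", encoding="utf-8") as file:
    html = file.read()
soup = BeautifulSoup(html, "html.parser")  # parse with Python's built-in HTML parser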

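Printing a tag and its content

Access the tag name directly as an attribute of soup. **Note:** when several tags share the same name, only the first one found is returned.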

print(soup.script)
print(soup.title)

Output (corresponding to lines 6 and 19 of the HTML file)

<script async="" src="./百度一下，你就知道_files/every_cookie_4644b13.js.下载"></script>
<title>百度一下，你就知道</title>
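The "first match only" behaviour can be checked with a small sketch. The two-script document below is made up for illustration and is not the Baidu file used in this article.

```python
from bs4 import BeautifulSoup

# Hypothetical document with two tags of the same name
html = "<head><script src='a.js'></script><script src='b.js'></script></head>"
soup = BeautifulSoup(html, "html.parser")

first_script = soup.script               # attribute access returns the first match
same_as_find = soup.find("script")       # find() also returns the first match
all_scripts = soup.find_all("script")    # find_all() returns every match
missing = soup.video                     # a tag that does not exist gives None

print(first_script["src"], len(all_scripts), missing)
```

Because a missing tag yields None, chained access such as soup.video.string raises AttributeError, so it is worth checking for None when the tag may be absent.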

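Printing the text inside a tag

Access the tag through soup, then read its .string or .text attribute. Note that when the tag's only child is a comment, .string returns the comment's text without the <!-- and --> markers.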

print(soup.title.string)
print(soup.title.text)

Output (corresponding to line 19 of the HTML file)

百度一下，你就知道
百度一下，你就知道
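The difference between .string and .text shows up once a tag holds a comment or nested tags. The following is a sketch on a made-up fragment, not on the Baidu file.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment: one tag holding only a comment, one holding nested content
html = "<p><!--hidden note--></p><b>bold <i>text</i></b>"
soup = BeautifulSoup(html, "html.parser")

comment = soup.p.string                  # the only child is a comment
is_comment = isinstance(comment, Comment)
comment_body = str(comment)              # comment text without the <!-- --> markers

nested_string = soup.b.string            # more than one child, so .string is None
nested_text = soup.b.text                # .text joins all strings below the tag

print(is_comment, comment_body, nested_string, nested_text)
```

In short, .string is only defined when a tag has a single string-like child, while .text always returns the concatenated text of the whole subtree.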

Printing a tag's attributes

Use the .attrs attribute, which returns the tag's attributes as a dictionary.

print(soup.script.attrs)

Output (corresponding to line 6 of the HTML file)

{'async': '', 'src': './百度一下，你就知道_files/every_cookie_4644b13.js.下载'}
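Single attributes can also be read without going through the whole dictionary. The sketch below uses one link tag copied from the example file. Beautiful Soup treats rel as a multi-valued attribute in HTML, so it comes back as a list.

```python
from bs4 import BeautifulSoup

html = '<link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">'
soup = BeautifulSoup(html, "html.parser")
link = soup.link

href = link["href"]           # dictionary-style access; raises KeyError if absent
title = link.get("title")     # .get() returns None when the attribute is absent
rel_values = link["rel"]      # multi-valued attribute, returned as a list

print(href, title, rel_values)
```

Using .get() is the safer choice when an attribute may be missing from some tags.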

Traversing the document

.contents returns all direct children of a tag as a list.

print(soup.head.contents)

Output

['\n', <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, '\n', <script async="" src="./百度一下，你就知道_files/every_cookie_4644b13.js.下载"></script>, '\n', <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>, '\n', <meta content="always" name="referrer"/>, '\n', <meta content="#2932e1" name="theme-color"/>, '\n', <meta content="全球最大的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。" name="description"/>, '\n', <link href="https://www.baidu.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>, '\n', <link href="https://www.baidu.com/content-search.xml" rel="search" title="百度搜索" type="application/opensearchdescription+xml"/>, '\n', <link href="https://www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg" mask="" rel="icon" sizes="any"/>, '\n', <link href="https://dss0.bdstatic.com/" rel="dns-prefetch"/>, '\n', <link href="https://dss1.bdstatic.com/" rel="dns-prefetch"/>, '\n', <link href="https://ss1.bdstatic.com/" rel="dns-prefetch"/>, '\n', <link href="https://sp0.baidu.com/" rel="dns-prefetch"/>, '\n', <link href="https://sp1.baidu.com/" rel="dns-prefetch"/>, <link href="https://sp2.baidu.com/" rel="dns-prefetch"/>, '\n', <title>百度一下，你就知道</title>, '\n', <script defer="" src="./百度一下，你就知道_files/cd37ed75a9387c5b.js.下载"></script>, '\n', <script src="./百度一下，你就知道_files/s_super_async-ea15d081fd.js.下载"></script>, '\n', <span id="s_strpx_span1" style="visibility:hidden;position:absolute;bottom:0;left:0;font-weight:bold;font-size:12px;font-family:'arial';">中</span>, '\n']

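.children gives the same direct children of a Tag, but as an iterator rather than a list.

Descendant nodes and other relatives can be traversed as well; see the official documentation for details.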
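To make the difference between children and descendants concrete, here is a small sketch on a made-up fragment. It filters on Tag so that whitespace and text nodes are skipped.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment, chosen so that one child has a nested tag
html = "<div><p>one <b>two</b></p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

# .children: direct children only
child_names = [node.name for node in soup.div.children if isinstance(node, Tag)]

# .descendants: children, grandchildren and so on, in document order
descendant_names = [node.name for node in soup.div.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```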

Searching the document

By string: find_all() returns every tag whose name exactly matches the given string.

print(soup.find_all("link"))

Output

[<link href="https://www.baidu.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>, 
<link href="https://www.baidu.com/content-search.xml" rel="search" title="百度搜索" type="application/opensearchdescription+xml"/>, 
<link href="https://www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg" mask="" rel="icon" sizes="any"/>, 
<link href="https://dss0.bdstatic.com/" rel="dns-prefetch"/>,
<link href="https://dss1.bdstatic.com/" rel="dns-prefetch"/>, 
<link href="https://ss1.bdstatic.com/" rel="dns-prefetch"/>,
<link href="https://sp0.baidu.com/" rel="dns-prefetch"/>, 
<link href="https://sp1.baidu.com/" rel="dns-prefetch"/>, 
<link href="https://sp2.baidu.com/" rel="dns-prefetch"/>]
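find_all() also accepts a list of names and a limit argument. The sketch below uses a made-up head section with shortened href values.

```python
from bs4 import BeautifulSoup

# Hypothetical head section with placeholder href values
html = "<head><link href='a'><link href='b'><link href='c'><title>T</title></head>"
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches
first_two = [tag["href"] for tag in soup.find_all("link", limit=2)]

# a list matches any of the listed tag names, in document order
mixed_names = [tag.name for tag in soup.find_all(["title", "link"])]

print(first_two)
print(mixed_names)
```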

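By regular expression: pass a compiled pattern to find_all(). Each tag name is tested with the pattern's search() method, so any tag whose name contains a match is returned together with all of its content.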

import re
print(soup.find_all(re.compile("a.")))

Output

[<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<script async="" src="./百度一下，你就知道_files/every_cookie_4644b13.js.下载"></script>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<meta content="#2932e1" name="theme-color"/>
<meta content="全球最大的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。" name="description"/>
<link href="https://www.baidu.com/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="https://www.baidu.com/content-search.xml" rel="search" title="百度搜索" type="application/opensearchdescription+xml"/>
<link href="https://www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg" mask="" rel="icon" sizes="any"/>
<link href="https://dss0.bdstatic.com/" rel="dns-prefetch"/>
<link href="https://dss1.bdstatic.com/" rel="dns-prefetch"/>
<link href="https://ss1.bdstatic.com/" rel="dns-prefetch"/>
<link href="https://sp0.baidu.com/" rel="dns-prefetch"/>
<link href="https://sp1.baidu.com/" rel="dns-prefetch"/><link href="https://sp2.baidu.com/" rel="dns-prefetch"/>
<title>百度一下，你就知道</title>
<script defer="" src="./百度一下，你就知道_files/cd37ed75a9387c5b.js.下载"></script>
<script src="./百度一下，你就知道_files/s_super_async-ea15d081fd.js.下载"></script>
<span id="s_strpx_span1" style="visibility:hidden;position:absolute;bottom:0;left:0;font-weight:bold;font-size:12px;font-family:'arial';">中</span>
</head>, <span id="s_strpx_span1" style="visibility:hidden;position:absolute;bottom:0;left:0;font-weight:bold;font-size:12px;font-family:'arial';">中</span>]

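Explanation: the pattern "a." matches the letter a followed by any one character. The tag name head contains "ad", so the whole head tag is printed with all of its content. The name span contains "an", so the span tag matches as well. The result list therefore has exactly two elements.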
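Because the pattern is applied with search(), it can match anywhere inside a tag name. The sketch below, on a made-up document, contrasts the unanchored pattern with one anchored to the start of the name.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document containing html, head, title, body, span and a tags
html = "<html><head><title>t</title></head><body><span>s</span><a href='#'>x</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "a." needs an "a" followed by one more character somewhere in the name
loose_names = [tag.name for tag in soup.find_all(re.compile("a."))]

# "^h" only matches names that start with "h"
anchored_names = [tag.name for tag in soup.find_all(re.compile("^h"))]

print(loose_names)
print(anchored_names)
```

The a tag itself is not matched by "a." because nothing follows the letter a in its name.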

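By function: pass a function to find_all(), and every tag for which the function returns True is included in the result.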

def name_is_exists(tag):
    return tag.has_attr("http-equiv")
print(soup.find_all(name_is_exists))

Output (the tags that have an http-equiv attribute):

[<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>, <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>]
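A filter function can combine several conditions. The sketch below uses the hypothetical function name is_prefetch_link and a shortened copy of tags from the example file to select only link tags whose rel list contains dns-prefetch.

```python
from bs4 import BeautifulSoup

html = (
    '<link rel="dns-prefetch" href="https://dss0.bdstatic.com/">'
    '<link rel="shortcut icon" href="https://www.baidu.com/favicon.ico">'
    '<meta name="referrer" content="always">'
)
soup = BeautifulSoup(html, "html.parser")


def is_prefetch_link(tag):
    """Hypothetical filter: True for <link> tags whose rel values include dns-prefetch."""
    # rel is multi-valued, so tag.get returns a list; default to an empty list
    return tag.name == "link" and "dns-prefetch" in tag.get("rel", [])


prefetch_hrefs = [tag["href"] for tag in soup.find_all(is_prefetch_link)]
print(prefetch_hrefs)
```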

By keyword argument: a keyword argument passed to find_all() filters on the tag attribute of the same name.

print(soup.find_all(content="#2932e1"))

Output:

[<meta content="#2932e1" name="theme-color"/>]
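Some attribute names cannot be written as Python keyword arguments: http-equiv contains a hyphen, and name is already the parameter for the tag name. For those cases find_all() accepts an attrs dictionary. A sketch on a shortened copy of the meta tags:

```python
from bs4 import BeautifulSoup

html = (
    '<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">'
    '<meta content="always" name="referrer">'
    '<meta name="theme-color" content="#2932e1">'
)
soup = BeautifulSoup(html, "html.parser")

# Hyphenated attribute name: must go through attrs
by_http_equiv = soup.find_all(attrs={"http-equiv": "X-UA-Compatible"})

# "name" clashes with the tag-name parameter, so it also goes through attrs
by_name = soup.find_all("meta", attrs={"name": "referrer"})

print(by_http_equiv[0]["content"], by_name[0]["content"])
```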

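By text: find_all(text=...) searches the strings inside tags rather than the tags themselves. With a regular expression it returns every string that contains a match. Since Beautiful Soup 4.4.0 the same argument is also available under the name string.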
```
import re
print(soup.find_all(text=re.compile("知道")))
```
Output
```
['百度一下，你就知道']
```
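A text search returns strings, not tags. To get back to the enclosing tag, follow .parent. The sketch below uses a made-up document and the string argument name, which assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical document with two strings containing the word "Search"
html = "<head><title>Search home</title></head><body><p>Search results</p><p>Other</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Every string that contains "Search"
matches = soup.find_all(string=re.compile("Search"))

# Each match is a string node; .parent is the tag that holds it
parent_names = [match.parent.name for match in matches]

print(matches)
print(parent_names)
```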

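CSS selectors

t_list1 = soup.select("title")           # select by tag name
t_list2 = soup.select(".mnav")           # select by class name; "." denotes a class
t_list3 = soup.select("#u1")             # select by id; "#" denotes an id
t_list4 = soup.select("a[class='bri']")  # select by attribute: a tags with class="bri"
t_list5 = soup.select("head > title")    # select by child relationship: title directly under head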