Python Web Scraping 05 - BeautifulSoup4: Installation, Download, Source Code Overview, and Usage

Latest recommended article published on 2024-07-29 16:54:00

烈风回响

Category: Python web scraping. Tags: python

Article link: https://blog.csdn.net/LonelyDragons/article/details/108175107

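This article covers installing BeautifulSoup4, a brief tour of its source code, and how to use it, including basic concepts, source analysis, a quick start, traversing the tree, and searching the tree. Worked examples show how to turn an HTML document into an object, how to look up data with find_all(), find(), and related methods, and how to traverse child, parent, and sibling nodes. It is aimed at beginners who want to understand and master the basic operations of BeautifulSoup4.

Summary generated automatically by CSDN.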

1. Introduction to bs4

1.1 Basic concepts

Beautiful Soup is a Python library for extracting data from HTML and XML files.

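1.2 Source code analysis

• Download the source from GitHub
• Install
• pip install lxml
• pip install bs4

The library's canonical package name on PyPI is beautifulsoup4, so pip install beautifulsoup4 works as well. If the default index is slow, a mirror can be specified with -i:

pip install bs4 -i https://pypi.douban.com/simple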
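
A quick way to confirm the installation is to import the package and print its version. This is a minimal sketch that assumes only that bs4 was installed as described above; lxml is treated as optional and merely reported.

```python
import importlib.util

import bs4

# bs4 exposes its version string as bs4.__version__
print("Beautiful Soup version:", bs4.__version__)

# lxml is an optional parser; check for it without failing if it is missing
has_lxml = importlib.util.find_spec("lxml") is not None
print("lxml available:", has_lxml)
```

If the import fails, the package was most likely installed into a different Python interpreter than the one running the script.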

Download the BeautifulSoup source from GitHub: search for the project and download the first result.

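Overview of the BeautifulSoup source

The main source code lives in the bs4 directory. The two documentation folders in the middle of the listing and the scripts folder at the end can be skipped for now.

The repository also contains an illustration from Alice in Wonderland.

__init__ means initialization. In bs4/__init__.py, one definition you will see often is
class BeautifulSoup(Tag):
Tag is the class that represents a tag, and BeautifulSoup inherits from it. The BeautifulSoup constructor is where you pass in the HTML document, along with the name of a parser such as lxml.
Let's look at some of the methods available.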
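
The inheritance relationship can be checked directly. The sketch below is only an illustration; it uses Python's built-in html.parser instead of lxml so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# BeautifulSoup is declared as `class BeautifulSoup(Tag)`, so it is a Tag subclass
print(issubclass(BeautifulSoup, Tag))

# html.parser ships with Python; it is swapped in here for lxml
soup = BeautifulSoup("<p>hello</p>", "html.parser")

# Because soup is itself a Tag, it supports Tag methods such as find()
print(isinstance(soup, Tag))
print(soup.find("p").string)
```

This is why methods such as find() and find_all() can be called on the soup object itself as well as on any tag inside it.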

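insert_before inserts a node before the current one.
insert_after inserts a node after the current one.
Together with pop (removal), which we used earlier, these are methods for modifying the tree.
find() and find_all()
are methods for searching.

There are also methods for traversal,
and many more that are worth studying.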
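
As a rough illustration of the modification and search methods just mentioned, the sketch below inserts one new tag before and one after an existing b tag. The markup and the variable names (first, last, names) are made up for the example, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>middle</b></p>", "html.parser")
b_tag = soup.find("b")  # search: first <b> tag in the document

# Build a new <i> tag and place it before <b>
first = soup.new_tag("i")
first.string = "before"
b_tag.insert_before(first)

# Build a new <u> tag and place it after <b>
last = soup.new_tag("u")
last.string = "after"
b_tag.insert_after(last)

result = str(soup)
print(result)

# find_all(True) matches every tag, in document order
names = [tag.name for tag in soup.p.find_all(True)]
print(names)
```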

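next_sibling means "the next sibling"; it is one of the navigation features.
We write crawlers to collect freely available resources from the web, such as text and images. Normally you could create a new text file by copying and pasting, but pasting text from web pages by hand is slow. The core idea of a crawler is to write a program that fetches the text and saves it to a file, and out of this idea grew all sorts of tool modules.
bs4 is one of these modules. So how does it get at the data? By searching, navigating, and so on.
Take next_sibling as an example.

Ordinarily a method is called as
object.next_sibling(). Once it is decorated with @property, it can be accessed as object.next_sibling, as if it were an attribute.

In the IDE, anything decorated this way is marked with a blue-purple "p" icon.
If you are interested, read further in the source; translating its comments is a good exercise.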
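
The effect of @property can be seen in a few lines of plain Python. The Node class below is a hypothetical stand-in written for this illustration; it is not how bs4 implements its elements.

```python
class Node:
    """Hypothetical element used only to demonstrate @property."""

    def __init__(self, name):
        self.name = name
        self._next = None  # the following sibling, if any

    @property
    def next_sibling(self):
        # Accessed as node.next_sibling, without parentheses
        return self._next


first = Node("a")
second = Node("b")
first._next = second

print(first.next_sibling.name)
print(second.next_sibling)
```

Calling first.next_sibling() here would fail, because the property already returns the Node object and a Node is not callable.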

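2. Using bs4

Suppose you have just joined a company and are handed a brand-new technology, with few good blog posts about it. How do you learn it?
Read the official documentation: open it in whichever browser you have.

2.1 Quick start

Let's play with the same example the documentation uses.
As you can see, the HTML document below is not neatly formatted.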

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

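The body tag contains several p paragraphs and a tags. The tags carry attributes such as class="story";
the href attribute holds a link, there is also an id attribute, and there is text such as Elsie and Lacie.
Printed as-is, the structure is not very readable.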

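BeautifulSoup is a class,
so we can create an instance of it.
BeautifulSoup() takes the markup to parse as its first argument, so we pass in html_doc.

Running it produces a warning.

The warning mentions features="lxml": no parser was specified explicitly, so Beautiful Soup picked one on its own, and the message suggests naming the parser.

Pass 'lxml' as the second argument
and the warning goes away. Printing soup.prettify() then shows a clearly indented structure.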
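
The steps above can be sketched as follows. The document is a shortened version of the article's html_doc, and html.parser is swapped in for 'lxml' only because it ships with Python; with lxml installed, 'lxml' can be passed in the same position.

```python
from bs4 import BeautifulSoup

# Shortened version of the article's sample document
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>"""

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns the document as a string with one tag per line, indented by depth
pretty = soup.prettify()
print(pretty)
```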

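The HTML is now an object, and we can use that object's attributes and methods.

For example, suppose we want The Dormouse's story from the title tag, not from the p tag whose class is title:
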
print(soup.title)

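With XPath, we previously had to turn the data into an element object and then write an XPath expression
spelling out which tag sits under which tag.
Here, object.attribute finds it directly.
With a regular expression, you would copy that stretch of markup,

delete the middle, and replace it with (.*?).
And that is one of the simpler regular expressions.

So the new approach is about as direct as it gets: one attribute access and we have the data.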
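
To make the comparison concrete, the sketch below extracts the same title once with a regular expression and once with attribute access. The one-line document and the variable names are illustrative, and html.parser stands in for lxml.

```python
import re

from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Regular expression: copy the markup and replace the middle with (.*?)
match = re.search(r"<title>(.*?)</title>", html_doc)
regex_title = match.group(1)

# Beautiful Soup: navigate by tag name, then take the text
soup = BeautifulSoup(html_doc, "html.parser")
soup_title = soup.title.string

print(regex_title)
print(soup_title)
```

Both lines print the same text; the difference is that the regular expression has to be rewritten whenever the surrounding markup changes.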
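To get the tag's name, use soup.title.name.
And to get the text in the middle, use soup.title.string.

Navigating to p works the same way, with soup.p.
Note that navigating by tag name returns only the first match.

You may want every p tag instead. First let's see how many there are.
Note that p was an attribute last time, whereas find_all must be given the string 'p'; otherwise you get an error.
There are indeed 3 p tags.

Now look at the result of print(r).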
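
The calls described above can be tried together in one short script. The document here is a trimmed copy of the article's html_doc, and html.parser is used in place of lxml so the sketch has no extra dependencies.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document: one title and three p tags
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a></p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.name)    # the tag's name
print(soup.title.string)  # the text inside the tag
print(soup.p)             # attribute access: only the first p tag

r = soup.find_all("p")    # the tag name is passed as a string
print(len(r))
print(r)
```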

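The items are separated by commas, and the second p tag contains 3 a tags:
Once upon a time there were three little sisters; and their names were
followed by the text of the 3 a tags, and then
and they lived at the bottom of a well

All of this data is in a list, so you can get at the elements by iterating over it.
For example, suppose you want the links in the href attributes: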

links = soup.find_all('a')

for link in links:
    print(link.get('href'))

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup=BeautifulSoup(html_doc,'lxml')
# "<html><head><title>(.*?)</title></head>"
# print(soup.prettify())
# print(soup.title)
# print(soup.title.string)
# print(soup.p)
# r = soup.find_all('p')
# print(len(r))
# print(r)

links = soup.find_all('a')

for link in links:
    print(link.get('href'))

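In other words, to find data in html_doc, first turn it into an object with soup = BeautifulSoup(html_doc, 'lxml'). That object then offers many ways to get at the data: navigating, searching, modifying, and so on.

Summary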

# Create the BeautifulSoup object
bs = BeautifulSoup(html_doc, 'lxml')
# Print the document with tidier, indented formatting
print(bs.prettify())
print(bs.title)  # the title tag: <title>The Dormouse's story</title>
print(bs.title.name)  # the tag's name: title
print(bs.title.string)  # the text inside the title tag: The Dormouse's story
print(bs.p)  # the first p paragraph

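2.2 Kinds of objects in bs4

• Tag: a tag
• NavigableString: a navigable string
• BeautifulSoup: the bs object
• Comment: a comment

So which kind of bs4 object is the soup from above?
The third kind, of course: BeautifulSoup, the bs object.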

soup=BeautifulSoup(html_doc,'lxml')

Print its type to check.
Next, which kind of object is the title tag from above?
A Tag.

If title is a Tag object, then by the same reasoning a, p, and head are all Tag objects too.
Let's verify:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup=BeautifulSoup(html_doc,'lxml')
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup.a)) # <class 'bs4.element.Tag'>
print(type(soup.p)) # # <class 'bs4.element.Tag'>

So what does NavigableString, the navigable string, mean?
Say I want to see the text inside the p tag. How?

bs.p.string

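As before, this picks the first p tag by default.

The object returned by soup.p.string is of class NavigableString (a navigable string).
In other words, soup.p.string navigates us to the text content.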
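
A short check makes the type visible. This is only a sketch on a one-tag document, parsed with html.parser rather than lxml; the names text and plain are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")

# p has a single child <b>, so .string follows it down to the text
text = soup.p.string
print(type(text))
print(isinstance(text, NavigableString))

# NavigableString subclasses str, so ordinary string operations work on it
print(isinstance(text, str))

# Convert to a plain str when the link back to the parse tree is not needed
plain = str(text)
print(plain.upper())
```
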
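And what about the last kind, Comment?

First, a side note: after title_tag = soup.p, title_tag is clearly the first p tag,
so printing it shows the p tag's class attribute as well as its text.

Comments in Java start with //, and comments in Python start with #.
HTML comments are written as <!-- ... --> and are not rendered on the page.

The sample document contains none, so we have to build one to see the effect.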

print(soup.p.string) #The Dormouse's story

By contrast, running

html_comment = '<b><!--注释--></b>'
soup=BeautifulSoup(html_comment,'lxml')

print(soup.b.string)

prints only 注释, the text between the comment markers.
Now look at its type.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup.a)) # <class 'bs4.element.Tag'>
print(type(soup.p)) # # <class 'bs4.element.Tag'>
print(soup.p.string) #The Dormouse's story
print(type(soup.p.string)) # <class 'bs4.element.NavigableString'>

# title_tag=soup.p
# print(title_tag)
# // #
html_comment = '<b><!--注释--></b>'
soup=BeautifulSoup(html_comment,'lxml')
# print(soup.b.string)
print(type(soup.b.string))#<class 'bs4.element.Comment'>
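
Because a comment and ordinary text print the same way, code that collects text usually has to tell them apart by type. The following is a minimal sketch of that check; the markup is made up for the example and html.parser is used instead of lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<b><!--a comment--></b><i>plain text</i>", "html.parser")

for tag in soup.find_all(["b", "i"]):
    node = tag.string
    # Comment is a special kind of NavigableString, so test for it first
    kind = "comment" if isinstance(node, Comment) else "text"
    print(tag.name, kind, repr(str(node)))

print(issubclass(Comment, NavigableString))
```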

Summary

print(bs.title)
prints the title tag, including its markup.

print(bs.title.name)  # the tag's name: title
print(bs.title.string)  # the text inside the title tag: The Dormouse's story
print(bs.p['class'])  # the class attribute of the first p tag: ['title']

How do we get at that attribute value?


title_tag = soup.p

print(title_tag['class'])

The result is a list; to get the element inside it, add [0].
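
The different ways of reading attributes can be compared side by side. This sketch adds an id attribute to the p tag purely for illustration and parses with html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="title" id="first"><b>The Dormouse\'s story</b></p>', "html.parser"
)
p_tag = soup.p

# class can hold several values, so it comes back as a list
print(p_tag["class"])
print(p_tag["class"][0])

# id holds a single value, so it comes back as a string
print(p_tag["id"])

# get() returns None for a missing attribute instead of raising KeyError
print(p_tag.get("href"))

# attrs holds all attributes of the tag as a dictionary
print(p_tag.attrs)
```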

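The text content itself sits in the b tag, reachable as soup.p.b.

As we know, soup.p only ever finds the first p tag.
To get all of them, use find_all (demonstrated above).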

from bs4 import BeautifulSoup
soup=BeautifulSoup(html_doc,'lxml')

# print(soup.title)
# print(soup.p)
# print(soup.p.b)
# print(soup.a)
all_p = soup.find_all('p')  # returns a list containing every p tag

print(all_p)

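3. Traversing the tree: child nodes

Working with a bs object falls into three kinds of operation: traversing, searching, and modifying.

3.1 contents, children, descendants

• contents returns a list
• children returns an iterator that you can loop over
• descendants returns a generator that walks every descendant, children and grandchildren alike

Iterating means visiting the items of a collection (a list, for example, though there are others) one at a time in some order, as Python's for statement does.
Looping means repeating a block of code while some condition holds, as Python's while statement does.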
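html_doc is still the Alice in Wonderland document.
contents returns a list.
Take links = soup.contents. What is the value of links?

You will find that contents picks up the entire content of the HTML document (html_doc) and puts all of it in a list.

children

children returns an iterator that you can loop over.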

html = '''
<div>
<a href='#'>李若彤</a>
<a href='#'>热巴</a>
<a href='#'>老师</a>
</div>
'''

soup2=BeautifulSoup(html,'lxml')
links2=soup2.contents
for li in links2:
    r = li.find_all('a')  # find_all conveniently returns a list as well
    print(r)
    for l in r:
        print(l.string)

[<a href="#">李若彤</a>, <a href="#">热巴</a>, <a href="#">老师</a>]
李若彤
热巴
老师

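links is iterable,
so let's loop over it with for and see what we get.

from bs4 import BeautifulSoup

html = '''
<div>
<a href='#'>李若彤</a>
<a href='#'>热巴</a>
<a href='#'>老师</a>
</div>
'''

soup2 = BeautifulSoup(html, 'lxml')
links = soup2.div.children  # direct children of the div only
print(type(links))
for link in links:
    print(link)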

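descendants

descendants returns a generator that walks every descendant, children and grandchildren alike.

soup.contents is a list, so len() works on it, and here its length is 1.

TypeError: object of type 'generator' has no len()
That is, an object of type generator has no length, so len() cannot be applied to descendants directly.
A generator produces its items one at a time as you iterate over it.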
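
The three attributes can be compared on a small document. This is a sketch with made-up markup and html.parser in place of lxml; it counts direct children, lists them, and shows that descendants has to be turned into a list before its length can be taken.

```python
from bs4 import BeautifulSoup

html = "<div><a href='#'>one</a><a href='#'>two</a></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# contents is a list of direct children, so len() works
print(len(div.contents))

# children yields the same direct children one at a time
child_names = [child.name for child in div.children]
print(child_names)

# descendants is lazy and has no length
try:
    len(div.descendants)
    length_error = None
except TypeError as exc:
    length_error = exc
    print("TypeError:", exc)

# Materialize it as a list to count tags and text nodes at every depth
all_descendants = list(div.descendants)
print(len(all_descendants))
```

Here the div has two direct children but four descendants, because the text inside each a tag counts as a node of its own.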

D:\python\python.exe D:/LongProject/爬虫/day008/遍历子节点.py
---------
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little s