BeautifulSoup (bs4): Basics and In-Depth Techniques

Latest recommended article published on 2024-04-18 10:11:04

高山莫衣

Category: python爬虫初学笔记 (Python crawler beginner notes)    Tags: python, css, html

Original work; let's improve together!

Article link: https://blog.csdn.net/AdamCY888/article/details/104518934

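Installing BeautifulSoup

pip install bs4

The examples below use the "lxml" parser, which is a separate package (pip install lxml).

Loading HTML: prettify() tidies the markup and returns it as an indented string, which makes the document structure (the document tree) easy to read.

 from bs4 import BeautifulSoup

 # doc is a string of HTML, such as the sample document defined later in this article
 soup = BeautifulSoup(doc, "lxml")
 s = soup.prettify()
 print(s)

If the HTML is incomplete (for example, a closing tag is missing), BeautifulSoup fills in the gaps automatically.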
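To make the automatic repair concrete, here is a minimal sketch. It assumes the built-in "html.parser" backend instead of "lxml" (so nothing beyond bs4 is required), and the broken_html string is made up for illustration. With "lxml" the fragment would additionally be wrapped in <html> and <body> tags.

```python
from bs4 import BeautifulSoup

# Deliberately incomplete markup: neither <p> nor <b> is ever closed.
# (broken_html is a hypothetical input, not taken from the article.)
broken_html = "<p class='title'><b>Hello"

soup = BeautifulSoup(broken_html, "html.parser")

# Serializing the parse tree shows the closing tags the input lacked.
repaired = str(soup)
print(repaired)

# prettify() returns a string with one node per line, indented by depth.
pretty = soup.prettify()
print(pretty)
```

Printing repaired next to pretty is a quick way to see the difference between the compact and the indented serialization of the same tree.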

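Finding document elements with BeautifulSoup

Once we have a page's HTML, we load it into a BeautifulSoup object. How do we then find the target elements in the document?

How to find HTML elements
Use the find_all() function, whose signature is: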

find_all(self,name = None, attrs = {},
recursive = True, text = None, 
limit = None,**kwargs)

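Parameter	Meaning
self	The object the method is called on (find_all is a method of BeautifulSoup and Tag objects)
name	Name of the tag to look for. Defaults to None; if omitted, elements of every name are considered
attrs	The element's attributes, as a dictionary. Defaults to empty; if given, only elements with the specified attributes match
recursive	Whether to search all descendants (True, the default) or only direct children (False)
text	Match on the text of a node instead of on tags (newer bs4 releases call this argument string)
limit	Maximum number of results to return; None means no limit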

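find_all() returns every element node that matches. To get a single element node, use find():

find(self, name = None, attrs = {}, 
recursive = True, text = None, **kwargs)

It is used the same way as find_all(), except that it takes no limit argument and returns only the first matching node (or None if nothing matches) rather than a list.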
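The difference between the two calls, and the effect of limit and recursive, can be sketched as follows. The html snippet is a made-up example and "html.parser" is assumed as the parser so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = """
<div>
  <p class="title">first</p>
  <p class="story">second</p>
  <p class="story">third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_stories = soup.find_all("p", attrs={"class": "story"})  # list-like result
first_story = soup.find("p", attrs={"class": "story"})      # a single Tag
missing = soup.find("table")                                # None: no match
limited = soup.find_all("p", limit=2)                       # stop after two hits
top_level_p = soup.find_all("p", recursive=False)           # direct children only

print(len(all_stories), first_story.text, missing, len(limited), top_level_p)
```

Because find() can return None, code that goes on to read attributes or text from the result should check for None first.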

Example: find the <p> element with class="title" in the document

from bs4 import BeautifulSoup
doc = '''
<html><head><title>the Dormouse's story</title></head>
<body>
<p class="title"><b>the Dormouse's story</b></p>
<p class="story">
once uopn a time ther were three little sisters;and their names were
<a href="http://example.com/elsie"class="sister"
id="link1"elsie</a>,
<a href="http://example.com/lacie"class="sister"
id="link2">lacie</a>and
<a href="http://example.com/tillie"class="sister"
id="link3">tillie</a>;
and they lived an the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>

'''

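Using the find() function

soup = BeautifulSoup(doc, "lxml")
tag = soup.find("p", attrs={"class": "title"})
print(tag)
# Output: <p class="title"><b>the Dormouse's story</b></p>

Because the target is the first <p class="title"> in the document, find() is enough here; find_all() would return the same element wrapped in a list.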

soup = BeautifulSoup(doc, "lxml")
# name=None matches any tag name, so only the class attribute is tested
tags = soup.find_all(name=None, attrs={"class": "sister"})
for tag in tags:
    print(tag)
# Output (the first <a> in doc is missing the ">" after id="link1", so the parser reads elsie as an attribute):
<a class="sister" elsie="" href="http://example.com/elsie" id="link1">,
</a>
<a class="sister" href="http://example.com/lacie" id="link2">lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">tillie</a>

The same document can also be searched with:

tags = soup.find_all("a")
# attrs is a dictionary mapping attribute names to the values to match
tags_1 = soup.find_all("a", attrs={"class": "sister"})
print(tags)
print("\n", tags_1)
# Output:
[<a class="sister" elsie="" href="http://example.com/elsie" id="link1">,
</a>, <a class="sister" href="http://example.com/lacie" id="link2">lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">tillie</a>]

 [<a class="sister" elsie="" href="http://example.com/elsie" id="link1">,
</a>, <a class="sister" href="http://example.com/lacie" id="link2">lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">tillie</a>]

Both calls return the same result, because every <a> in this document has class="sister".

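Getting an element's attribute values with BeautifulSoup

Once an element has been found, for example an <a> element, tag[attrName] returns the value of the attribute named attrName, where tag is a bs4.element.Tag object.
Example: print the address of every hyperlink in the document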

soup = BeautifulSoup(doc,"lxml")
tags = soup.find_all("a")
for tag in tags:
    print(tag["href"])
# Output:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
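Square-bracket access raises KeyError when the attribute is missing, so it can help to know the alternatives. The sketch below uses a single made-up <a> tag and assumes the "html.parser" backend; it compares tag[...], tag.get(...) and tag.attrs.

```python
from bs4 import BeautifulSoup

# One hypothetical link, parsed on its own for illustration.
html = '<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>'
link = BeautifulSoup(html, "html.parser").find("a")

href = link["href"]                          # KeyError if "href" were absent
title = link.get("title")                    # None instead of an exception
title_or_default = link.get("title", "n/a")  # explicit fallback value
all_attrs = link.attrs                       # dict of every attribute
classes = link["class"]                      # class is multi-valued: a list

print(href, title, title_or_default, all_attrs, classes)
```

When scraping real pages, where some links lack an href, tag.get("href") is usually the safer choice inside a loop.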

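Getting the text an element contains with BeautifulSoup

Usage: tag.text returns the text contained in tag, where tag is a bs4.element.Tag object.
Example: print the text of every <a> hyperlink in the document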

soup = BeautifulSoup(doc,"lxml")
tags = soup.find_all("a")
for tag in tags:
    print(tag.text)
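# Output (the first entry is only "," followed by a line break, because the malformed first <a> has no link text):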
,

lacie
tillie

tag.text returns all of the text under a tag, including the text of its descendants.
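Since tag.text concatenates everything beneath a tag, nested text can run together. A small sketch of the related accessors follows, assuming "html.parser" and a made-up list as input: get_text() with a separator, the .string attribute, and the stripped_strings generator.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two child elements that each hold text.
html = "<ul><li>lacie</li><li>tillie</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

joined = ul.text                     # all nested text, concatenated
separated = ul.get_text(", ")        # same text, joined with a separator
only_string = ul.string              # None: <ul> has more than one child
first_item = ul.li.string            # the single string inside the first <li>
pieces = list(ul.stripped_strings)   # each text node, whitespace-stripped

print(joined, separated, only_string, first_item, pieces)
```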

Advanced searching with BeautifulSoup

find() and find_all() usually cover what we need. When they do not, we can write our own filter function and search with that.

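def myfilter(tag):
    print(tag.name)  # trace which tags the filter is called on
    return (tag.name == "a" and tag.has_attr("href")
            and tag["href"] == "http://example.com/lacie")

soup = BeautifulSoup(doc, "lxml")
tag = soup.find_all(myfilter)
print(tag)
# Output:
html
head
title
body
p
b
p
a
a
a
p
[<a class="sister" href="http://example.com/lacie" id="link2">lacie</a>]

Note: the program defines a filter function myfilter(tag) whose parameter is a Tag object. Calling soup.find_all(myfilter) passes every tag in the document to myfilter; a tag is kept when the function returns True and discarded otherwise.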
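As a further illustration of filter functions, the sketch below selects tags by a condition that is awkward to express with name and attrs alone: an id attribute is present but a class attribute is not. The function name has_id_but_no_class and the html snippet are made up for this example, and "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first <p> has an id and no class.
html = '<p id="intro">a</p><p id="body" class="story">b</p><a href="#top">c</a>'
soup = BeautifulSoup(html, "html.parser")

def has_id_but_no_class(tag):
    """Return True for tags that carry an id attribute but no class."""
    return tag.has_attr("id") and not tag.has_attr("class")

matches = soup.find_all(has_id_but_no_class)  # every tag passing the filter
first = soup.find(has_id_but_no_class)        # find() accepts a filter too
with_href = soup.find_all(lambda tag: tag.has_attr("href"))  # inline filter

print(matches, first, with_href)
```

A lambda is convenient for one-off conditions, while a named function is easier to reuse and to test.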

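Finding document elements with BeautifulSoup (continued)

In advanced searches, keep in mind that the value of the "class" attribute is a list.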

def myfilter(tag):
    # Never matches: tag["class"] is a list, not a string
    if tag.name == "p" and tag["class"] == "story":
        return True
soup = BeautifulSoup(doc, "html.parser")
tags = soup.find_all(myfilter)
print(tags)

def myfilter(tag):
    # Matches: the comparison is against a list of class names
    if tag.name == "p" and tag["class"] == ["story"]:
        return True
soup = BeautifulSoup(doc, "html.parser")
tags = soup.find_all(myfilter)
print("\n", tags)
# Output:
[]

 [<p class="story">
once uopn a time ther were three little sisters;and their names were
<a a="" class="sister" elsie<="" href="http://example.com/elsie" id="link1">,
<a class="sister" href="http://example.com/lacie" id="link2">lacie</a>and
<a class="sister" href="http://example.com/tillie" id="link3">tillie</a>;
and they lived an the bottom of a well.
</a></p>, <p class="story">...</p>]

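tag["class"] is a special attribute: its value is a list of class names.
If the markup were <p class="story exem">, the comparison would have to be written as:

tag["class"] == ["story", "exem"]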
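Comparing against the full list is brittle when an element carries several classes. The following sketch shows a few alternatives that test for a single class name; it assumes "html.parser" and a made-up three-paragraph snippet, and uses the class_ keyword of find_all() and the CSS class selector of select().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first <p> has two classes.
html = '<p class="story exem">x</p><p class="story">y</p><p class="title">z</p>'
soup = BeautifulSoup(html, "html.parser")

class_list = soup.find("p")["class"]              # list of class names

any_story = soup.find_all("p", class_="story")    # matches any one class
exact = soup.find_all("p", class_="story exem")   # matches the exact string
by_filter = soup.find_all(
    lambda t: t.name == "p" and "story" in t.get("class", [])
)                                                 # membership test in a filter
by_css = soup.select("p.story")                   # CSS class selector

print(class_list, any_story, exact, by_filter, by_css)
```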

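Traversing document elements with BeautifulSoup

Getting an element's parent node, child nodes, all descendant nodes, and neighbouring (sibling) nodes

Goal	Operation
Get the parent node	tag.parent
Get the element's direct children	tag.children
Get all descendants of tag, including element nodes, text nodes, and other node types	tag.descendants
Next sibling node	tag.next_sibling
Previous sibling node	tag.previous_sibling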

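# eg_1: walk from <b> up to the root, printing the name of each node on the way
soup = BeautifulSoup(doc, "lxml")
print(soup.name)
tag = soup.find("b")
while tag:
    print(tag.name)
    tag = tag.parent
# eg_2: <b> is the only child of its <p>, so there is no previous sibling
soup = BeautifulSoup(doc, "lxml")
tag = soup.find("b")
print(tag.previous_sibling)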
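The navigation attributes from the table can be tried on a smaller tree. The sketch below assumes "html.parser" and a made-up <div> with three paragraphs, written without whitespace between tags so that every sibling is an element rather than a text node.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with no whitespace between the tags.
html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("p")[1]

parent_name = second.parent.name                    # direct parent
ancestor_names = [t.name for t in second.parents]   # every ancestor, inner first
prev_text = second.previous_sibling.text            # sibling before
next_text = second.next_sibling.text                # sibling after
child_names = [c.name for c in soup.div.children]   # direct children only

# descendants yields both element nodes and text nodes, depth first.
descendants = list(soup.div.descendants)
tag_count = sum(isinstance(node, Tag) for node in descendants)

print(parent_name, ancestor_names, prev_text, next_text, child_names)
print(len(descendants), tag_count)
```

In real pages the whitespace between tags is parsed as text nodes, so next_sibling and previous_sibling often return a string containing only a line break rather than the neighbouring element.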

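Finding elements with CSS syntax in BeautifulSoup
Besides its own search functions, BeautifulSoup also accepts CSS selectors.
CSS syntax
tag.select(css)
The general form of a selector is:
tagName[attName=value]
Both tagName and the [attName=value] test are optional, and =value can be left out to test only that the attribute exists.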

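Item	Meaning
tagName	Element name; if omitted, elements of every name match
attName=value	attName is the attribute name and value is the value it must have
tag.select(css)	Returns a list of bs4.element.Tag objects, which may contain just one element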

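Selectors for common tasks

Code	Meaning
soup.select("p a")	Find every <a> node under any <p> node in the document
soup.select("p[class='story'] a")	Find every <a> node under a <p> node whose attribute is class="story"
soup.select("p[class] a")	Find every <a> node under a <p> node that has a class attribute
soup.select("body head title")	Find <title> nodes under a <head> that is itself under <body>
soup.select("body[class]")	Find <body> nodes that have a class attribute
soup.select("body[class] a")	Find every <a> node under a <body> node that has a class attribute
soup.select("a[id='link1']")	Find the <a> node whose attribute is id="link1"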
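The selectors in the table can be checked against a small page. This is a sketch under stated assumptions: the html string and the helper ids() are made up for illustration, and "html.parser" is used as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the ids exist only to make the results easy to read.
html = """
<html><head><title>demo</title></head>
<body class="page">
<p class="title"><a id="home" href="/">home</a></p>
<p class="story"><a id="link1" href="/elsie">elsie</a> <a id="link2" href="/lacie">lacie</a></p>
<p><a id="bare" href="/x">x</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

def ids(selector):
    """Return the id of every element matched by a CSS selector."""
    return [tag.get("id") for tag in soup.select(selector)]

under_p = ids("p a")                          # <a> under any <p>
under_story = ids("p[class='story'] a")       # <a> under <p class="story">
under_classed_p = ids("p[class] a")           # <a> under <p> with a class
misplaced = soup.select("body head title")    # <head> is not inside <body>
classed_body = [t.name for t in soup.select("body[class]")]
by_id = ids("a[id='link1']")
title_text = soup.select_one("head title").text  # select_one: first match

print(under_p, under_story, under_classed_p, misplaced, classed_body, by_id, title_text)
```

In this page "body head title" matches nothing, because <head> and <body> are siblings; the selector only matches documents where a <head> really is nested inside <body>.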

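Attribute selector syntax

Selector	Description
[attName]	Selects elements that have the specified attribute
[attName=value]	Selects elements whose attribute has exactly the specified value
[attName^=value]	Matches every element whose attribute value starts with the specified value
[attName$=value]	Matches every element whose attribute value ends with the specified value
[attName*=value]	Matches every element whose attribute value contains the specified value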
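Each attribute operator from the table can be tried on a handful of links. The sketch assumes "html.parser"; the four <a> tags and the helper texts() are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical links covering the different operators.
html = (
    '<a href="http://example.com/elsie">elsie</a>'
    '<a href="https://example.org/report.pdf">report</a>'
    '<a href="/local/page.html">local</a>'
    '<a name="anchor">anchor</a>'
)
soup = BeautifulSoup(html, "html.parser")

def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [tag.text for tag in soup.select(selector)]

has_href = texts("a[href]")                  # attribute is present
exact = texts("a[name='anchor']")            # attribute equals the value
starts = texts("a[href^='http']")            # value starts with "http"
ends = texts("a[href$='.pdf']")              # value ends with ".pdf"
contains = texts("a[href*='example']")       # value contains "example"

print(has_href, exact, starts, ends, contains)
```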

# Find every <p> that is a direct child of a <div>
# (grandchildren and deeper descendants are excluded)
# Note the spaces on both sides of >
soup.select("div > p")
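The contrast between the child combinator and the plain descendant selector is easiest to see side by side. A minimal sketch, assuming "html.parser" and a made-up snippet in which one <p> is a direct child of the <div> and another sits one level deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "direct" is a child of <div>, "nested" a grandchild.
html = '<div><p id="direct">a</p><section><p id="nested">b</p></section></div>'
soup = BeautifulSoup(html, "html.parser")

children_only = [p["id"] for p in soup.select("div > p")]   # direct children
all_descendants = [p["id"] for p in soup.select("div p")]   # any depth

print(children_only, all_descendants)
```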
