[Web Scraping] 2.3 Finding Document Elements with BeautifulSoup

即使再小的船也能远航

Last modified: 2023-08-29 16:15:31


First published: 2023-02-24 22:32:41

Article link: https://blog.csdn.net/qq_57268251/article/details/129106690


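1. Finding HTML elements

Finding elements in a document is a key step in scraping information from web pages. BeautifulSoup provides a family of search methods, and the powerful find_all function is one of the most commonly used.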

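The prototype of find_all is:

find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)
self: marks this as a method of the class; you do not pass it yourself;
name: the name of the tag to look for. The default is None, which matches every element;
attrs: the attributes the element must have. The default is empty; if given, only elements carrying the specified attributes match;
recursive: whether the search covers the whole subtree below the element (True, the default) or only its direct children (False);
text, limit and kwargs are more involved and are introduced later, where they are first used. Newer Beautiful Soup releases name the text parameter string and keep text as an older alias.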

find_all returns a list of all the elements that match; each item is a bs4.element.Tag object.
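The effect of recursive can be seen on a small document. The sketch below is only an illustration: the markup is made up for this purpose, and it assumes Python's built-in "html.parser" instead of lxml so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical mini-document, used only for this illustration.
html = "<div id='outer'><p>one</p><section><p>two</p></section></div>"
soup = BeautifulSoup(html, "html.parser")

outer = soup.find_all("div")[0]                  # the single <div>
all_p = outer.find_all("p")                      # searches the whole subtree
direct_p = outer.find_all("p", recursive=False)  # direct children only

print(type(all_p[0]))              # <class 'bs4.element.Tag'>
print([p.text for p in all_p])     # ['one', 'two']
print([p.text for p in direct_p])  # ['one']
```

The second <p> sits inside <section>, so it is found only when the search is allowed to descend below the direct children of the <div>.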

find_all looks for every element node that meets the criteria. If you only need a single element node, use the find function instead.

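The prototype of find is:

find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
It is used in the same way as find_all, except that it returns only the first node that matches rather than a list, so it takes no limit parameter. If nothing matches, it returns None.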
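The difference in return values matters most when nothing matches. A minimal sketch, assuming a one-line made-up document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document for illustration.
soup = BeautifulSoup("<p class='title'>Hello</p>", "html.parser")

first = soup.find("p")               # a single Tag
missing = soup.find("table")         # no match: None, not an empty list
none_found = soup.find_all("table")  # no match: an empty list

print(first.text)  # Hello
print(missing)     # None
print(none_found)  # []

# Check the result of find() before using it, otherwise
# missing.text would raise AttributeError on None.
if missing is not None:
    print(missing.text)
```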

Example:


from bs4 import BeautifulSoup

doc='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''

soup = BeautifulSoup(doc, "lxml")

# 1. Find the <title> element in the document
tag = soup.find("title")
print(type(tag), tag)  # <class 'bs4.element.Tag'> <title>The Dormouse's story</title>

# 2. Find all <a> elements in the document
tags = soup.find_all("a")
for tag in tags:
    print(tag)
    # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
    # <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>
    # <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>

# 3. Find the first <a> element in the document
tag = soup.find("a")  # for this document, same result as soup.find("a", attrs={"class": "sister"})
print(tag)  # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>

# 4. Find the <p> elements with class="title"
tags = soup.find_all("p", attrs={"class": "title"})
print(tags)  # [<p class="title"><b>The Dormouse's story</b></p>]

# 5. Find every element with class="sister", whatever its tag name
tags = soup.find_all(name=None, attrs={"class": "sister"})
for tag in tags:
    print(tag)
    # <a class="sister" href="https://example.com/elsie" id="link1">Elsie</a>
    # <a class="sister" href="https://example.com/lacie" id="link2">Lacie</a>
    # <a class="sister" href="https://example.com/tillie" id="link3">Tillie</a>

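2. Getting an element's attribute values

Once you have found an element, say an <a> element, how do you read its attribute values? BeautifulSoup uses tag[attrName] to get the value of the attribute named attrName on tag, where tag is a bs4.element.Tag object.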


from bs4 import BeautifulSoup

doc='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
soup = BeautifulSoup(doc, "lxml")
# 1. Print the address of every hyperlink in the document
tags = soup.find_all("a")
for tag in tags:
    print(tag["href"])
    # https://example.com/elsie
    # https://example.com/lacie
    # https://example.com/tillie
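Indexing with tag[...] raises KeyError when the attribute is absent, so it can help to know the gentler alternatives. The sketch below is an illustration under assumptions: the single <a> element (including its extra "big" class) is made up, and the built-in "html.parser" is used.

```python
from bs4 import BeautifulSoup

# Hypothetical element; the second class name "big" is added for illustration.
html = '<a href="https://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("a")

print(tag["href"])        # https://example.com/elsie
print(tag.get("title"))   # None: attribute absent, no exception
print(tag["class"])       # ['sister', 'big']: class is multi-valued, so a list
print(sorted(tag.attrs))  # ['class', 'href', 'id']: attrs is a dict of all attributes

try:
    tag["title"]
except KeyError:
    print("no title attribute")
```

tag.get(name) is convenient inside loops over many links, where some elements may lack the attribute you are reading.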

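3. Getting the text an element contains

Once you have found an element, say an <a> element, how do you get the text it contains? BeautifulSoup uses tag.text to get the text contained in tag, where tag is a bs4.element.Tag object.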


from bs4 import BeautifulSoup

doc='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="https://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="https://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="https://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''

soup=BeautifulSoup(doc, "lxml")

# 1. Print the text of every <a> hyperlink in the document
tags = soup.find_all("a")
for tag in tags:
    print(tag.text)
    # Elsie
    # Lacie
    # Tillie

# 2. Print the text of every <p> node (the combined text of all its descendants)
tags = soup.find_all("p")
for tag in tags:
    print(tag.text)
    # The Dormouse's story
    # 
    # Once upon a time there were three little sisters; and their names were
    # Elsie,
    # Lacie and
    # Tillie;
    # and they lived at the bottom of a well.
    # 
    # ...
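As the output above shows, tag.text keeps the line breaks from the source markup. When cleaner text is wanted, get_text accepts a separator and a strip flag. A minimal sketch, assuming a shortened made-up version of the story paragraph and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical variant of the story paragraph.
html = """<p class="story">
Three sisters:
<a href="https://example.com/elsie">Elsie</a> and
<a href="https://example.com/lacie">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# .text concatenates every descendant string, whitespace included.
print(repr(p.text))  # '\nThree sisters:\nElsie and\nLacie\n'

# Strip each piece and join the non-empty pieces with a single space.
print(p.get_text(" ", strip=True))  # Three sisters: Elsie and Lacie

# The individual stripped pieces, in document order.
print(list(p.stripped_strings))  # ['Three sisters:', 'Elsie', 'and', 'Lacie']
```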

4. Advanced searching

In most cases find or find_all covers the basic needs. When they do not, you can write your own filter function and search with it.


from bs4 import BeautifulSoup

doc='''
<html><head><title>The Dormouse's story</title></head>
<body>
<a href="https://example.com/elsie" >Elsie</a>
<a href="https://example.com/lacie" >Lacie</a>
<a href="https://example.com/tillie">Tillie</a>
</body>
</html>
'''


# 1. Find the <a> element whose href is "https://example.com/lacie"
def myFilter(tag):
    print(tag.name, end=" ")  # html head title body a a a
    return tag.name == "a" and tag.has_attr("href") and tag["href"] == "https://example.com/lacie"


soup = BeautifulSoup(doc, "lxml")
# Pass the myFilter function itself to find_all(). Each tag is handed to myFilter and is kept only if the filter returns True.
tags = soup.find_all(myFilter)
print(tags)  # [<a href="https://example.com/lacie">Lacie</a>]


# 2. Use a function for more complex searches: find all <a> elements whose text ends with "cie"
def endWith(s, t):
    # True if string s ends with string t (same result as s.endswith(t))
    if len(s) >= len(t):
        return s[len(s) - len(t):] == t
    return False


def myFilter(tag):
    return tag.name == "a" and endWith(tag.text, "cie")


soup = BeautifulSoup(doc, "lxml")
tags = soup.find_all(myFilter)
for tag in tags:
    print(tag)  # <a href="https://example.com/lacie">Lacie</a>

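Note 1:
The program defines a filter function myFilter(tag) whose parameter is a tag object. When soup.find_all(myFilter) is called, every tag in the document is passed to myFilter, which decides whether to keep it: if myFilter returns True the tag is added to the result set, otherwise it is discarded. Running the first program therefore shows html, head, title, body, a, a, a passing through myFilter one by one. Only the node
<a href="https://example.com/lacie">Lacie</a> meets the condition, so the result is:
[<a href="https://example.com/lacie">Lacie</a>]
Here:
tag.name is the name of the tag;
tag.has_attr(attName) tests whether the tag has an attribute named attName;
tag[attName] is the value of the tag's attName attribute.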
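For simple conditions like the two above, a named filter function is not the only option. The sketch below shows two alternatives that should give the same results: an inline lambda using str.endswith, and a compiled regular expression passed as the href keyword argument. It assumes the three links from the example and the built-in "html.parser"; it is an illustration, not a replacement for the approach described in this section.

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="https://example.com/elsie">Elsie</a>
<a href="https://example.com/lacie">Lacie</a>
<a href="https://example.com/tillie">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The filter written inline; str.endswith stands in for the endWith helper.
by_lambda = soup.find_all(lambda tag: tag.name == "a" and tag.text.endswith("cie"))

# A regular expression matched against the href attribute value.
by_href = soup.find_all("a", href=re.compile(r"/lacie$"))

print([t.text for t in by_lambda])   # ['Lacie']
print([t["href"] for t in by_href])  # ['https://example.com/lacie']
```

A separate named function remains the clearer choice once the condition grows beyond one line or needs to be reused.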