BeautifulSoup4 Library Study Notes

丙丁火

Category: Web scraping

Article link: https://blog.csdn.net/caicaibird0531/article/details/90714091

I. Main parsers

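Parser	Typical usage	Advantages
Python standard library	BeautifulSoup(text_doc, "html.parser")	* Built into Python * Moderate speed * Tolerant of malformed documents
lxml HTML parser	BeautifulSoup(text_doc, "lxml")	* Fast * Tolerant of malformed documents
lxml XML parser	BeautifulSoup(text_doc, ["lxml", "xml"]) BeautifulSoup(text_doc, "xml")	* Fast * The only parser that supports XML
html5lib	BeautifulSoup(text_doc, "html5lib")	* Most tolerant of malformed documents * Parses the document the way a browser does * Produces valid HTML5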
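
Because lxml and html5lib are separate installs, it can help to fall back to the built-in parser when a faster one is missing. The sketch below is one possible way to do that; `make_soup` and its `preferred` parameter are hypothetical names introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse `markup` with the first installed parser named in `preferred`."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
# Hello
```

Naming the parser explicitly, as every row of the table does, also keeps results the same across machines that have different parsers installed.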

II. Kinds of objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1.`Tag`

The most important features of a Tag are its name and its attributes.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(type(tag))
# <class 'bs4.element.Tag'>
print(tag.name)
# b
print(tag['class'])  # tag.get('class') also works
# ['boldest']
print(tag.attrs)
# {'class': ['boldest']}
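
As a small complement to the example above, the following sketch shows how attribute access behaves for multi-valued and single-valued attributes, and that attributes can be changed like dictionary entries. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="intro">text</p>', "html.parser")
p = soup.p

print(p["class"])      # multi-valued attribute, returned as a list
# ['body', 'strikeout']
print(p["id"])         # single-valued attribute, returned as a string
# intro
print(p.get("title"))  # .get() returns None instead of raising KeyError
# None

p["id"] = "lead"       # attributes can be assigned like dictionary keys
del p["class"]         # and deleted the same way
print(p)
# <p id="lead">text</p>
```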

2.`NavigableString`

Beautiful Soup uses the NavigableString class to wrap the text inside a tag:

print(tag.string)
# Extremely bold
print(type(tag.string))
# <class 'bs4.element.NavigableString'>

3.`BeautifulSoup`

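A BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object: it supports most of the methods described in Navigating the tree and Searching the tree. It does not correspond to a real HTML tag, so its .name is the special value [document]: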

print(soup.name)
# [document]

4.`Comment`

A Comment object is a special type of NavigableString:

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
print(comment)
# Hey, buddy. Want to buy a used parser?
print(soup.b.prettify())
# <b>
#  <!--Hey, buddy. Want to buy a used parser?-->
# </b>
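
Since a Comment is also a NavigableString, code that collects text may pick up comments by accident. The following is a minimal sketch, with made-up markup, of telling the two apart with isinstance() and removing the comments from the tree.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>visible<!-- hidden note --></p>", "html.parser")

# Collect only the strings that are comments.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print([str(c) for c in comments])
# [' hidden note ']

# Remove each comment from the tree.
for c in comments:
    c.extract()
print(soup.p)
# <p>visible</p>
```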

III. Navigating the tree


html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")

1.Child nodes

Tag names

print(soup.head)
# <head><title>The Dormouse's story</title></head>
print(soup.head.title)
# <title>The Dormouse's story</title>
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag's .contents attribute returns its direct children as a list:

head_tag = soup.head
print(head_tag)
# <head><title>The Dormouse's story</title></head>
print(head_tag.contents)
# [<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
print(title_tag)
# <title>The Dormouse's story</title>
print(title_tag.contents)
# ["The Dormouse's story"]

A tag's .children generator lets you loop over its direct children:

for child in title_tag.children:
    print(child)
# The Dormouse's story

.descendants

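.contents and .children cover only a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants: its children, its children's children, and so on.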

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
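
To make the difference concrete, here is a short sketch on made-up markup that counts the direct children and the descendants of the same tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
div = soup.div

print(len(div.contents))           # direct children only: the <p> tag
# 1
print(len(list(div.descendants)))  # <p>, 'one ', <b>, 'two'
# 4
```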

.string

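If a tag has exactly one child, .string returns that child's string, which is why soup.head.string below gives the title text. If a tag contains more than one child, it is ambiguous which one .string should refer to, so .string is None.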

print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story
print(soup.html.string)
# None
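
When .string is None because a tag has several children, get_text() is one way to obtain all of the text anyway. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")

print(soup.b.string)      # <b> has a single string child
# world
print(soup.p.string)      # <p> has two children: a string and <b>
# None
print(soup.p.get_text())  # concatenates every string under <p>
# Hello world
```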

.strings and .stripped_strings

If a tag contains more than one string, use the .strings generator to loop over them:

for string in soup.strings:
    print(repr(string))

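The strings produced this way often include a lot of extra spaces and blank lines. Use .stripped_strings to strip surrounding whitespace and skip strings that are whitespace only: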

for string in soup.stripped_strings:
    print(repr(string))

2.Parent nodes

Available attributes:
.parent
.parents
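
A brief sketch of both attributes, using made-up markup: .parent gives the immediate parent, while .parents walks all the way up to the BeautifulSoup object itself.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>bold</b></p></body></html>", "html.parser")
b = soup.b

print(b.parent.name)
# p
print([ancestor.name for ancestor in b.parents])
# ['p', 'body', 'html', '[document]']
```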

3.Sibling nodes

Available attributes:
.next_sibling
.previous_sibling
.next_siblings
.previous_siblings
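
The sketch below illustrates the sibling attributes on made-up markup without whitespace between tags. In real documents the nearest sibling is often a whitespace string rather than a tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser")

print(soup.b.next_sibling)
# <c>text2</c>
print(soup.c.previous_sibling)
# <b>text1</b>
print(soup.b.previous_sibling)  # <b> is the first child, so nothing precedes it
# None
print([tag.name for tag in soup.b.next_siblings])
# ['c', 'd']
```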

4.Going back and forth

.next_element
.previous_element
.next_elements
.previous_elements
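
These attributes follow the order in which the document was parsed, which is not the same as sibling order. A minimal sketch on made-up markup contrasts .next_element with .next_sibling:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")

print(soup.b.next_sibling)         # the next node at the same level
# <c>text2</c>
print(soup.b.next_element)         # the next node parsed: the string inside <b>
# text1
print(soup.b.string.next_element)  # after that string, the parser saw <c>
# <c>text2</c>
```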

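IV. Searching the tree

1.Filters

Filters can be applied to a tag's name, to its attributes, to strings, or to any combination of these.

String

The simplest filter is a string: Beautiful Soup matches tags with exactly that name. Example: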

print(soup.find_all("b"))
# [<b>The Dormouse's story</b>]

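Regular expression

If you pass a compiled regular expression, Beautiful Soup filters against it using its search() method, so a pattern without an anchor can match anywhere in the name.
Example: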

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
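
To show the effect of anchoring, the following sketch runs an anchored and an unanchored pattern against the same made-up document.

```python
import re

from bs4 import BeautifulSoup

markup = "<html><head><title>T</title></head><body><b>x</b></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Anchored: only tag names that start with "b".
print([tag.name for tag in soup.find_all(re.compile("^b"))])
# ['body', 'b']

# Unanchored: any tag name that contains "t".
print([tag.name for tag in soup.find_all(re.compile("t"))])
# ['html', 'title']
```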

List

If you pass a list, Beautiful Soup returns everything that matches any item in the list.
Example:

result = soup.find_all(['a','b'])
print(result)
# [<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

True matches every value it can. The code below finds all the tags in the document but none of the text strings. Example:

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

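Function

If none of the other filters fit, you can define a function that takes an element as its only argument. It should return True if the element matches and False otherwise. Example: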

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
result = soup.find_all(has_class_but_no_id)
print(result)
# [<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>, <p class="story">...</p>]

2.`find_all()`

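find_all(name, attrs, recursive, text, limit, **kwargs)

The name argument

The name argument accepts any kind of filter: a string, a regular expression, a list, a function, or True.
See the filter examples above.

Keyword arguments

Any keyword argument that is not one of the named parameters is treated as a filter on the attribute of that name. Its value can be a string, a regular expression, a list, or True. Example: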

result = soup.find_all(href=re.compile("lacie"))
print(result)
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
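
Some attribute names cannot be written as Python keyword arguments: class is a reserved word, and names such as data-role contain a hyphen. The sketch below, on made-up markup, shows the usual workarounds, class_ and the attrs dictionary.

```python
from bs4 import BeautifulSoup

markup = (
    '<a class="sister big" id="link1" data-role="nav">Elsie</a>'
    '<a id="link2">Lacie</a>'
)
soup = BeautifulSoup(markup, "html.parser")

print([a["id"] for a in soup.find_all(id=True)])          # any tag with an id
# ['link1', 'link2']
print([a["id"] for a in soup.find_all(class_="sister")])  # matches one of several classes
# ['link1']
print([a["id"] for a in soup.find_all(attrs={"data-role": "nav"})])
# ['link1']
```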

The attrs argument

Attribute filters can also be passed as a dictionary through attrs. Example:

result = soup.find_all("a", attrs={"class":"sister"})
print(result)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

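The text argument

The text argument searches the document's strings instead of its tags. It accepts a string, a regular expression, a list, or True. Since Beautiful Soup 4.4.0 the same argument is also available under the name string, which is the preferred spelling in current releases. Example: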

res1 = soup.find_all(text='Lacie')
print(res1)
# ['Lacie']
res2 = soup.find_all("a",text="Lacie")
print(res2)
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The limit argument

Three "a" tags match the search below, but limit makes find_all() stop after the first two. Example:

result = soup.find_all("a",limit=2)
print(result)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

When you call find_all() on a tag, Beautiful Soup searches all of the tag's descendants. To search only the tag's direct children, pass recursive=False. Example:

res1 = soup.html.find_all("title")
print(res1)
# [<title>The Dormouse's story</title>]
res2 = soup.html.find_all("title",recursive=False)
print(res2)
# []

Calling a tag is like calling find_all()

The following two lines are equivalent:

soup.find_all("a")
soup("a")

So are these two:

soup.title.find_all(text=True)
soup.title(text=True)

3.`find()`

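find(name, attrs, recursive, text, **kwargs)
find() is equivalent to find_all() with limit=1. The only difference is that find_all() returns a list containing the single result, whereas find() returns the result itself. When nothing matches, find_all() returns an empty list and find() returns None.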
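
Because find() can return None, indexing its result directly fails on pages where the tag is missing. A minimal sketch of guarding against that, on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")

print(soup.find_all("a"))
# []
print(soup.find("a"))
# None

link = soup.find("a")
# Check for None before reading attributes to avoid a TypeError.
href = link["href"] if link is not None else None
print(href)
# None
```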

4.Other methods

These methods take the same filters as find_all() and find() but search in other directions: up through the parents, sideways through the siblings, or forward and backward in parse order. They do not take a recursive argument.

find_parents(name, attrs, text, limit, **kwargs)
find_parent(name, attrs, text, **kwargs)

find_next_siblings(name, attrs, text, limit, **kwargs)
find_next_sibling(name, attrs, text, **kwargs)

find_previous_siblings(name, attrs, text, limit, **kwargs)
find_previous_sibling(name, attrs, text, **kwargs)

find_all_next(name, attrs, text, limit, **kwargs)
find_next(name, attrs, text, **kwargs)

find_all_previous(name, attrs, text, limit, **kwargs)
find_previous(name, attrs, text, **kwargs)
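
A short sketch of a few of these methods in use, starting from one tag in a made-up document:

```python
from bs4 import BeautifulSoup

markup = (
    '<div id="box"><p id="first">one</p><p id="second">two</p>'
    "<span>three</span></div>"
)
soup = BeautifulSoup(markup, "html.parser")
first = soup.find(id="first")

print(first.find_parent("div")["id"])                      # search upward
# box
print([p["id"] for p in first.find_next_siblings("p")])    # later siblings named p
# ['second']
print(first.find_next_sibling("span"))                     # first later sibling named span
# <span>three</span>
print(soup.span.find_previous_sibling("p")["id"])          # nearest earlier sibling named p
# second
print(first.find_next("span").string)                      # forward in parse order
# three
```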

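5.`CSS` selectors

Beautiful Soup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

Find tags layer by layer by tag name: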

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags directly beneath another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Find sibling tags:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:


soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by whether an attribute exists:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Find tags by language code:

multilingual_markup = """ 
<p lang="en">Hello</p> <p lang="en-us">Howdy, y'all</p> <p lang="en-gb">Pip-pip, old fruit</p> <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, "html.parser")
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]
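
All of the examples in this section return lists. When only the first match is needed, select_one() plays the same role for CSS selectors that find() plays for find_all(). A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">'
    '<a class="sister" id="link1" href="/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="/lacie">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(markup, "html.parser")

print(soup.select_one("a.sister")["id"])  # first match only
# link1
print(soup.select_one("a.brother"))       # no match gives None, not an empty list
# None
print([a["href"] for a in soup.select("p.story > a[href]")])
# ['/elsie', '/lacie']
```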
