Common Python Crawler Libraries (3): BeautifulSoup

I. Parser libraries

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; very lenient with broken documents | Requires installing the lxml C extension |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only supported XML parser | Requires installing the lxml C extension |
| html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires the external html5lib package |
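
As a quick sketch of the table above (assuming lxml and html5lib have been installed with pip), the same broken markup can be handed to each parser to compare how it gets repaired:

from bs4 import BeautifulSoup

markup = "<p>Some <b>broken<i>HTML"   # deliberately unclosed tags

# Each parser repairs the markup in its own way; lxml and html5lib are
# separate installs (pip install lxml html5lib).
print(BeautifulSoup(markup, "lxml").prettify())
print(BeautifulSoup(markup, "xml").prettify())
print(BeautifulSoup(markup, "html5lib").prettify())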

II. Parsing methods

Recommendations:
1. Use the lxml parser
2. Use find_all()

(I) Constructing the soup object

soup = BeautifulSoup(markup, parser) parses the given object (for example, a string of HTML; incomplete markup is completed automatically) and returns a BeautifulSoup object, which exposes the document's tags as members along with a set of navigation and search methods.

soup = BeautifulSoup(html, "lxml")

(II) Object members

The document to be parsed:

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>'''

1. Tag objects: soup.tag

print(soup.p) # soup.tag is a Tag object
# <p class="title"><b>The Dormouse's story</b></p>

2. Tag name

print(soup.p.name) # soup.tag.name is the tag's name, a str
# p

3. Tag attributes

(1) Dictionary of all the tag's attributes
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
for key, val in soup.a.attrs.items():
    print(key, val)
# href http://example.com/elsie
# class ['sister']
# id link1
(2) Access an attribute value by name; note that multi-valued attributes (such as class) return a list of values
print(soup.a["class"])
# ['sister']
print(soup.a.get("class"))
# ['sister']
soup.a["class"].append("yan")
print(soup.a["class"])
# ['sister', 'yan']
(3) Modifying attributes
soup.a["extra_attr"] = "extra"
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1', 'extra_attr': 'extra'}
del soup.a["extra_attr"]
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

4. Tag text

(1) Access the text inside a tag: soup.tag.string
soup = BeautifulSoup('''<p class="title"><b>The Dormouse's story</b></p>''', 'lxml')
print(soup.p.string)
# The Dormouse's story
(2) Replace the text inside a tag with soup.tag.string.replace_with(new_string); note that the text cannot be edited in place, only replaced
soup.p.string.replace_with("hahahahahah")
print(soup.p.string)
# hahahahahah

(III) Tag traversal

1. Going down: direct children

(1) soup.tag.contents returns a list of the tag's direct children
soup = BeautifulSoup(html, "lxml")
i = 0
for tag in soup.body.contents:
    i += 1
    print(i, ": ", tag)
# 1 :  
# 
# 2 :  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 3 :  
# 
# 4 :  <p class="story">...</p>
(2) soup.tag.children returns an iterator over the tag's direct children; it does not support direct indexing
for i, tag in enumerate(soup.body.children):
    print(i, ": ", tag)
# 0 :  
# 
# 1 :  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 2 :  
# 
# 3 :  <p class="story">...</p>

2. soup.tag.descendants goes down through all descendants, both direct children and tags nested deeper

i = 1
for tag in soup.body.descendants:
    print(i, ": ", tag)
    i += 1
# 1 :  
# 
# 2 :  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 3 :  Once upon a time there were three little sisters; and their names were
# 
# 4 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 5 :  Elsie
# 6 :  ,
# 
# 7 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 8 :  Lacie
# 9 :   and
# 
# 10 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 11 :  Tillie
# 12 :  ;
# and they lived at the bottom of a well.
# 13 :  
# 
# 14 :  <p class="story">...</p>
# 15 :  ...

3. Accessing all text content of a tag

(1) soup.tag.strings returns all text content inside the tag, as a generator
i = 1
for s in soup.p.strings:
    print(i, ": ", s)
    i += 1
# 1 :  Once upon a time there were three little sisters; and their names were
#
# 2 :  Elsie
# 3 :  ,
#
# 4 :  Lacie
# 5 :   and
#
# 6 :  Tillie
# 7 :  ;
# and they lived at the bottom of a well.
(2) soup.tag.stripped_strings returns the same text content with surrounding whitespace stripped
i = 1
for s in soup.p.stripped_strings:
    print(i, ": ", s)
    i += 1
# 1 :  Once upon a time there were three little sisters; and their names were
# 2 :  Elsie
# 3 :  ,
# 4 :  Lacie
# 5 :  and
# 6 :  Tillie
# 7 :  ;

4. soup.tag.parent returns the tag's direct parent; the top-level object has no parent and returns None

print(soup.p.parent)
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body>
print(soup.parent)
# None

5. soup.tag.parents returns an iterator over all of the tag's ancestors, direct and indirect; it does not support direct indexing

soup = BeautifulSoup(html, "lxml")
for i, tag in enumerate(soup.p.parents):
    print(i, ": ", tag)
# 0 :  <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body>
# 1 :  <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body></html>
# 2 :  <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body></html>

6. soup.tag.previous_sibling and soup.tag.next_sibling access the immediately preceding/following sibling, returning None if there is no sibling. Note: the result is often just whitespace, such as a newline

print(soup.p.previous_sibling)
# None
print(soup.p.next_sibling)
# <p class="story">...</p>

7. soup.tag.previous_siblings and soup.tag.next_siblings return iterators over all preceding/following siblings

for i, tag in enumerate(soup.p.next_siblings):
    print(i, ": ", tag)
# 0 :
#
# 1 :  <p class="story">...</p>
for i, tag in enumerate(soup.p.previous_siblings):
    print(i, ": ", tag)
# 0 :  

8. soup.tag.previous_elements and soup.tag.next_elements return iterators over every element that was parsed before/after the tag, regardless of nesting
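
A minimal sketch, reusing the html string defined above; the exact items yielded depend on how the parser inserts whitespace nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

# next_elements walks forward through everything parsed after the tag,
# regardless of nesting; previous_elements walks backward the same way.
first_link = soup.a
for i, element in enumerate(first_link.next_elements, 1):
    print(i, ": ", repr(element))
    if i >= 5:   # only show the first few elements
        break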

(IV) Searching for tags

1. Searching downward

soup.tag.find(name, attrs, recursive, string, **kwargs) returns the first matching tag
soup.tag.find_all(name, attrs, recursive, string, limit, **kwargs) returns a list of all matching tags

(1) name specifies the target tag name

Accepted values: a string, a compiled regular expression, a list of strings, True, or a function
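
A short sketch of each accepted value for name, run against the sample document above:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all("a"))                           # string: every <a> tag
print(soup.find_all(re.compile("^b")))              # regex: tag names starting with "b" (body, b)
print(soup.find_all(["a", "title"]))                # list of strings: <a> and <title> tags
print(len(soup.find_all(True)))                     # True: every tag in the document
print(soup.find_all(lambda tag: tag.name == "p"))   # function: receives each Tag object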

(2) attrs specifies the target tag's attributes

Formats:
find_all(id="…"): the keyword form; it cannot be used for attribute names that clash with find_all's own parameters (name, attrs) or that are not valid keyword names (such as data-*); for class, use class_
find_all(attrs={"id": "…"}): the attrs dict form, which works for any attribute name
Accepted values: a string, a regular expression, a function, or True
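
A short sketch of the attribute filters, again against the sample document above:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all(id="link2"))                  # keyword-argument form
print(soup.find_all(class_="sister"))             # "class" is a Python keyword, so use class_
print(soup.find_all(attrs={"id": "link1"}))       # attrs dict works for any attribute name,
                                                  # including data-* attributes
print(soup.find_all(href=re.compile("tillie")))   # regex matched against the attribute value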

(3) string specifies the target tag's text content

Accepted values: a string, a list of strings, a regular expression, a function, or True
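
A short sketch of the string filter; matches come back as strings unless combined with a tag name:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all(string="Elsie"))                 # exact string
print(soup.find_all(string=["Elsie", "Lacie"]))      # list of strings
print(soup.find_all(string=re.compile("Dormouse")))  # regular expression
print(soup.find_all("a", string="Elsie"))            # combined with a tag name: returns the <a> tag itself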

(4) limit caps the number of tags returned; it takes a non-negative integer
(5) recursive=False restricts the search to direct children; the default, recursive=True, searches all descendants
(6) Shorthand: soup(name, attrs, recursive, string, limit, **kwargs) is equivalent to soup.find_all(...); see the sketch below
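
A brief sketch combining limit, recursive, and the shorthand call form:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all("a", limit=2))                    # at most two results
print(soup.html.find_all("title"))                    # default recursive=True searches all descendants
print(soup.html.find_all("title", recursive=False))   # direct children of <html> only, so []
print(soup("a", class_="sister"))                     # soup(...) is shorthand for soup.find_all(...)
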
(7) Using functions as filters
def has_class_and_id(tag):
    return tag.has_attr("class") and tag.has_attr("id")

for tag in soup.find_all(has_class_and_id):  # passed as the name filter, the function receives each Tag object
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
import re

def no_lacie(href):
    return href and not re.search("lacie", href)

for tag in soup.find_all(href=no_lacie):  # passed as an attribute filter, the function receives the attribute value
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

2. Searching upward

soup.tag.find_parents(name, attrs, string, limit, **kwargs)
soup.tag.find_parent(name, attrs, string, **kwargs)
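
A minimal sketch of the upward search, starting from one of the links in the sample document above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

link = soup.find("a", id="link1")
print(link.find_parent("p"))                    # the nearest enclosing <p> tag
print([t.name for t in link.find_parents()])    # all ancestors, innermost first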

3. Sideways search (sibling tags)

soup.tag.find_previous_siblings(name, attrs, string, limit, **kwargs)
soup.tag.find_previous_sibling(name, attrs, string, **kwargs)
soup.tag.find_next_siblings(name, attrs, string, limit, **kwargs)
soup.tag.find_next_sibling(name, attrs, string, **kwargs)
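
A minimal sketch of the sibling searches; like the upward search, these are usually called on a specific tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

link = soup.find("a", id="link2")
print(link.find_next_sibling("a"))        # the following <a> sibling (Tillie)
print(link.find_previous_sibling("a"))    # the preceding <a> sibling (Elsie)
print(len(link.find_next_siblings()))     # all following sibling tags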

4. Forward/backward search (all elements)

soup.tag.find_next(name, attrs, string, **kwargs)
soup.tag.find_all_next(name, attrs, string, limit, **kwargs)
soup.tag.find_previous(name, attrs, string, **kwargs)
soup.tag.find_all_previous(name, attrs, string, limit, **kwargs)
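
A sketch of the whole-document forward/backward search, which ignores nesting entirely:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

link = soup.find("a", id="link1")
print(link.find_next("a"))             # next <a> in parse order (Lacie)
print(link.find_previous("title"))     # the <title> parsed before this link
print(len(link.find_all_next("a")))    # every <a> that comes later, here 2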

(V) CSS selectors: select() (string-based)

1. Select by tag name

print(soup.select("title"))
# [<title>The Dormouse's story</title>]
(1) soup.select("name1 name2") finds all tags named name2 nested at any depth inside tags named name1 (descendant combinator: a space)
for tag in soup.select("body a"):
    print(i, ": ", tag)
    i += 1
# 1 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
(2) soup.select("name1 > name2") finds all tags named name2 that are direct children of tags named name1 (child combinator: >)
print(soup.select("body > a"))  # body has no direct <a> children
# []

2. Select by attribute value (the . prefix matches the class attribute)

(1) By class value

soup.select(".val") finds all tags with class="val" (nesting with a space works here too)

for tag in soup.select(".story .sister"):# 检索class="story"的标签下所有class="sister"的子标签
    print(i, ": ", tag)
    i += 1
# 1 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
(2) By id value (the # prefix matches the id attribute)

soup.select("#val")检索所有id=val的标签(空格符嵌套)

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

3. Matching multiple selectors (separate alternatives with a comma)
Note: joining selectors with no separator at all (e.g. a.sister) means both conditions must be satisfied by the same tag

for tag in soup.select("title,.sister"):# 检索所有标签名为title及class="sister"的标签
    print(i, ": ", tag)
    i += 1
# 1 :  <title>The Dormouse's story</title>
# 2 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 3 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 4 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

4. Select by tag name and attribute value: soup.select('name[attr="…"]')

i = 1
for tag in soup.select('a[class="sister"]'):
    print(i, ": ", tag)
    i += 1
# 1 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>