Common Python Crawler Libraries (3): BeautifulSoup

I. Parser libraries

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; very lenient with broken documents | Requires installing the lxml C extension |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only supported XML parser | Requires installing the lxml C extension |
| html5lib | BeautifulSoup(markup, "html5lib") | Most lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires the external html5lib package |
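
As a quick sketch of the table above (assuming lxml and html5lib have been installed with pip), the same broken markup can be handed to each parser to compare how it gets repaired:

from bs4 import BeautifulSoup

markup = "<p>Some <b>broken<i>HTML"   # deliberately unclosed tags

# Each parser repairs the markup in its own way; lxml and html5lib are
# separate installs (pip install lxml html5lib).
print(BeautifulSoup(markup, "lxml").prettify())
print(BeautifulSoup(markup, "xml").prettify())
print(BeautifulSoup(markup, "html5lib").prettify())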

II. Parsing methods

Recommendations:
1. Use the lxml parser
2. Use find_all()

(I) Constructing the soup object

soup = BeautifulSoup(markup, parser) parses the given object (for example, a string of HTML; incomplete markup is completed automatically) and returns a BeautifulSoup object, which exposes the document's tags as members along with a set of navigation and search methods.

soup = BeautifulSoup(html, "lxml")

(II) Object members

The document to be parsed:

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>'''

1. Tag objects: soup.tag

print(soup.p) # soup.tag is a Tag object
# <p class="title"><b>The Dormouse's story</b></p>

2. Tag name

print(soup.p.name) # soup.tag.name is the tag's name, a str
# p

3. Tag attributes

(1) Dictionary of all the tag's attributes
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
for key, val in soup.a.attrs.items():
    print(key, val)
# href http://example.com/elsie
# class ['sister']
# id link1
(2) Access an attribute value by name; note that multi-valued attributes (such as class) return a list of values
print(soup.a["class"])
# ['sister']
print(soup.a.get("class"))
# ['sister']
soup.a["class"].append("yan")
print(soup.a["class"])
# ['sister', 'yan']
(3) Modifying attributes
soup.a["extra_attr"] = "extra"
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1', 'extra_attr': 'extra'}
del soup.a["extra_attr"]
print(soup.a.attrs)
# {'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

4. Tag text

(1) Access the text inside a tag: soup.tag.string
soup = BeautifulSoup('''<p class="title"><b>The Dormouse's story</b></p>''', 'lxml')
print(soup.p.string)
# The Dormouse's story
(2) Replace the text inside a tag with soup.tag.string.replace_with(new_string); note that the text cannot be edited in place, only replaced
soup.p.string.replace_with("hahahahahah")
print(soup.p.string)
# hahahahahah

(III) Tag traversal

1. Going down: direct children

(1) soup.tag.contents returns a list of the tag's direct children
soup = BeautifulSoup(html, "lxml")
i = 0
for tag in soup.body.contents:
    i += 1
    print(i, ": ", tag)
# 1 :  
# 
# 2 :  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 3 :  
# 
# 4 :  <p class="story">...</p>
(2) soup.tag.children returns an iterator over the tag's direct children; it does not support direct indexing
for i, tag in enumerate(soup.body.children):
    print(i, ": ", tag)
# 0 :  
# 
# 1 :  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 2 :  
# 
# 3 :  <p class="story">...</p>

2. soup.tag.descendants goes down through all descendants, both direct children and tags nested deeper

i = 1
for tag in soup.body.descendants:
    print(i, ": ", tag)
    i += 1
# 1 :  
# 
# 2 :  <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# 3 :  Once upon a time there were three little sisters; and their names were
# 
# 4 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 5 :  Elsie
# 6 :  ,
# 
# 7 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 8 :  Lacie
# 9 :   and
# 
# 10 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# 11 :  Tillie
# 12 :  ;
# and they lived at the bottom of a well.
# 13 :  
# 
# 14 :  <p class="story">...</p>
# 15 :  ...

3. Accessing all text content of a tag

(1) soup.tag.strings returns all text content inside the tag, as a generator
i = 1
for s in soup.p.strings:
    print(i, ": ", s)
    i += 1
# 1 :  Once upon a time there were three little sisters; and their names were
#
# 2 :  Elsie
# 3 :  ,
#
# 4 :  Lacie
# 5 :   and
#
# 6 :  Tillie
# 7 :  ;
# and they lived at the bottom of a well.
(2) soup.tag.stripped_strings returns the same text content with surrounding whitespace stripped
i = 1
for s in soup.p.stripped_strings:
    print(i, ": ", s)
    i += 1
# 1 :  Once upon a time there were three little sisters; and their names were
# 2 :  Elsie
# 3 :  ,
# 4 :  Lacie
# 5 :  and
# 6 :  Tillie
# 7 :  ;

4. soup.tag.parent returns the tag's direct parent; the top-level object has no parent and returns None

print(soup.p.parent)
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body>
print(soup.parent)
# None

5. soup.tag.parents returns an iterator over all of the tag's ancestors, direct and indirect; it does not support direct indexing

soup = BeautifulSoup(html, "lxml")
for i, tag in enumerate(soup.p.parents):
    print(i, ": ", tag)
# 0 :  <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body>
# 1 :  <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body></html>
# 2 :  <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# <p class="story">...</p></body></html>

6. soup.tag.previous_sibling and soup.tag.next_sibling access the immediately preceding/following sibling, returning None if there is no sibling. Note: the result is often just whitespace, such as a newline

print(soup.p.previous_sibling)
# None
print(soup.p.next_sibling)
# <p class="story">...</p>

7. soup.tag.previous_siblings and soup.tag.next_siblings return iterators over all preceding/following siblings

for i, tag in enumerate(soup.p.next_siblings):
    print(i, ": ", tag)
# 0 :
#
# 1 :  <p class="story">...</p>
for i, tag in enumerate(soup.p.previous_siblings):
    print(i, ": ", tag)
# 0 :  

8. soup.tag.previous_elements and soup.tag.next_elements return iterators over every element that was parsed before/after the tag, regardless of nesting
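
A minimal sketch, reusing the html string defined above; the exact items yielded depend on how the parser inserts whitespace nodes:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

# next_elements walks forward through everything parsed after the tag,
# regardless of nesting; previous_elements walks backward the same way.
first_link = soup.a
for i, element in enumerate(first_link.next_elements, 1):
    print(i, ": ", repr(element))
    if i >= 5:   # only show the first few elements
        break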

(IV) Searching for tags

1. Searching downward

soup.tag.find(name, attrs, recursive, string, **kwargs) returns the first matching tag
soup.tag.find_all(name, attrs, recursive, string, limit, **kwargs) returns a list of all matching tags

(1) name specifies the target tag name

Accepted values: a string, a compiled regular expression, a list of strings, True, or a function
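
A short sketch of each accepted value for name, run against the sample document above:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all("a"))                           # string: every <a> tag
print(soup.find_all(re.compile("^b")))              # regex: tag names starting with "b" (body, b)
print(soup.find_all(["a", "title"]))                # list of strings: <a> and <title> tags
print(len(soup.find_all(True)))                     # True: every tag in the document
print(soup.find_all(lambda tag: tag.name == "p"))   # function: receives each Tag object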

(2) attrs specifies the target tag's attributes

Formats:
find_all(id="…"): the keyword form; it cannot be used for attribute names that clash with find_all's own parameters (name, attrs) or that are not valid keyword names (such as data-*); for class, use class_
find_all(attrs={"id": "…"}): the attrs dict form, which works for any attribute name
Accepted values: a string, a regular expression, a function, or True
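
A short sketch of the attribute filters, again against the sample document above:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all(id="link2"))                  # keyword-argument form
print(soup.find_all(class_="sister"))             # "class" is a Python keyword, so use class_
print(soup.find_all(attrs={"id": "link1"}))       # attrs dict works for any attribute name,
                                                  # including data-* attributes
print(soup.find_all(href=re.compile("tillie")))   # regex matched against the attribute value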

(3) string specifies the target tag's text content

Accepted values: a string, a list of strings, a regular expression, a function, or True
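
A short sketch of the string filter; matches come back as strings unless combined with a tag name:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all(string="Elsie"))                 # exact string
print(soup.find_all(string=["Elsie", "Lacie"]))      # list of strings
print(soup.find_all(string=re.compile("Dormouse")))  # regular expression
print(soup.find_all("a", string="Elsie"))            # combined with a tag name: returns the <a> tag itself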

(4) limit caps the number of tags returned; it takes a non-negative integer
(5) recursive=False restricts the search to direct children; the default, recursive=True, searches all descendants
(6) Shorthand: soup(name, attrs, recursive, string, limit, **kwargs) is equivalent to soup.find_all(...); see the sketch below
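
A brief sketch combining limit, recursive, and the shorthand call form:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

print(soup.find_all("a", limit=2))                    # at most two results
print(soup.html.find_all("title"))                    # default recursive=True searches all descendants
print(soup.html.find_all("title", recursive=False))   # direct children of <html> only, so []
print(soup("a", class_="sister"))                     # soup(...) is shorthand for soup.find_all(...)
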
(7) Using functions as filters
def has_class_and_id(tag):
    return tag.has_attr("class") and tag.has_attr("id")

for tag in soup.find_all(has_class_and_id):  # passed as the name filter, the function receives each Tag object
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
import re

def no_lacie(href):
    return href and not re.search("lacie", href)

for tag in soup.find_all(href=no_lacie):  # passed as an attribute filter, the function receives the attribute value
    print(tag)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

2. Searching upward

soup.tag.find_parents(name, attrs, string, limit, **kwargs)
soup.tag.find_parent(name, attrs, string, **kwargs)
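
A minimal sketch of the upward search, starting from one of the links in the sample document above:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

link = soup.find("a", id="link1")
print(link.find_parent("p"))                    # the nearest enclosing <p> tag
print([t.name for t in link.find_parents()])    # all ancestors, innermost first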

3. Sideways search (sibling tags)

soup.tag.find_previous_siblings(name, attrs, string, limit, **kwargs)
soup.tag.find_previous_sibling(name, attrs, string, **kwargs)
soup.tag.find_next_siblings(name, attrs, string, limit, **kwargs)
soup.tag.find_next_sibling(name, attrs, string, **kwargs)
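
A minimal sketch of the sibling searches; like the upward search, these are usually called on a specific tag:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

link = soup.find("a", id="link2")
print(link.find_next_sibling("a"))        # the following <a> sibling (Tillie)
print(link.find_previous_sibling("a"))    # the preceding <a> sibling (Elsie)
print(len(link.find_next_siblings()))     # all following sibling tags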

4. Forward/backward search (all elements)

soup.tag.find_next(name, attrs, string, **kwargs)
soup.tag.find_all_next(name, attrs, string, limit, **kwargs)
soup.tag.find_previous(name, attrs, string, **kwargs)
soup.tag.find_all_previous(name, attrs, string, limit, **kwargs)
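
A sketch of the whole-document forward/backward search, which ignores nesting entirely:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the sample document from above

link = soup.find("a", id="link1")
print(link.find_next("a"))             # next <a> in parse order (Lacie)
print(link.find_previous("title"))     # the <title> parsed before this link
print(len(link.find_all_next("a")))    # every <a> that comes later, here 2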

(V) CSS selectors: select() (string-based)

1. Select by tag name

print(soup.select("title"))
# [<title>The Dormouse's story</title>]
(1) soup.select("name1 name2") finds all tags named name2 nested at any depth inside tags named name1 (descendant combinator: a space)
for tag in soup.select("body a"):
    print(i, ": ", tag)
    i += 1
# 1 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
(2) soup.select("name1 > name2") finds all tags named name2 that are direct children of tags named name1 (child combinator: >)
print(soup.select("body > a"))  # body has no direct <a> children
# []

2. Select by attribute value (the . prefix matches the class attribute)

(1) By class value

soup.select(".val") finds all tags with class="val" (nesting with a space works here too)

for tag in soup.select(".story .sister"):# 检索class="story"的标签下所有class="sister"的子标签
    print(i, ": ", tag)
    i += 1
# 1 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
(2) By id value (the # prefix matches the id attribute)

soup.select("#val")检索所有id=val的标签(空格符嵌套)

print(soup.select("#link1"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

3. Matching multiple selectors (separate alternatives with a comma)
Note: joining selectors with no separator at all (e.g. a.sister) means both conditions must be satisfied by the same tag

for tag in soup.select("title,.sister"):# 检索所有标签名为title及class="sister"的标签
    print(i, ": ", tag)
    i += 1
# 1 :  <title>The Dormouse's story</title>
# 2 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 3 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 4 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

4. Select by tag name and attribute value: soup.select('name[attr="…"]')

i = 1
for tag in soup.select('a[class="sister"]'):
    print(i, ": ", tag)
    i += 1
# 1 :  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 2 :  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# 3 :  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>