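BeautifulSoup in Detail
1. Introduction
BeautifulSoup is a web page parsing library that extracts data from HTML or XML files.
It works on top of several parsers, covering HTML, XML, and HTML5.
It is flexible and convenient, and it is a standard tool for writing crawlers.
With it you can scrape information from a page without writing regular expressions.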
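2. Parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires an external C library |
html5lib | BeautifulSoup(markup, "html5lib") | Best tolerance of malformed documents; parses the page the way a browser does; produces valid HTML5 | Slow; requires an external Python package |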
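The parser is chosen by the second argument, and Beautiful Soup raises FeatureNotFound when the requested parser is not installed. A minimal sketch of a fallback follows; the helper name make_soup and the preference order are assumptions made for illustration, not something the library provides.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each parser in order and use the first one installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # Beautiful Soup raises this when the named parser is unavailable
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)  # Hello
```

Because html.parser ships with Python, keeping it last in the list means the helper should always find a parser.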
3. Installation
pip install beautifulsoup4
pip install lxml
4. Basic usage
4.1 Tag selectors
4.1.1 .string
Gets the text content of a tag.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
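# 1. Import the class
from bs4 import BeautifulSoup
# 2. Build the soup object
soup = BeautifulSoup(html, 'lxml')  # argument 1: the markup to parse; argument 2: the parser
# print(soup.prettify())  # pretty-prints the tree; the parser also fills in the missing closing tags
# Selecting by tag name returns the tag itself together with everything inside it
print(soup.head)  # the whole head element, including the head tag
print(soup.p)  # only the first match is returned
print(soup.title.string)  # title is a node; .string is an attribute that returns its text
print(soup.html.head.title.string)
The output is: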
<head>
<title>The Dormouse's story</title>
</head>
<p class="title" name="dromouse"><b><span>The Dormouse's story</span></b></p>
The Dormouse's story
The Dormouse's story
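.string does not always return plain text. A small sketch of the three cases, using a shortened version of the sample markup and html.parser (chosen here only so the snippet runs without lxml):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = """
<p class="title"><b><span>The Dormouse's story</span></b></p>
<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a> and
<a id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

# A chain of single children (p > b > span > text) still resolves to the text
title_text = soup.p.string
# A tag with several children has no single string, so .string is None
story_text = soup.find("p", class_="story").string
# When the only child is a comment, .string returns a Comment object
first_link = soup.a.string

print(title_text)
print(story_text)
print(isinstance(first_link, Comment))
```

When a tag may hold more than one child, get_text() is the safer way to read its text.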
4.1.2 .name
Gets the name of the tag itself.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)  # the tag's own name --> title
print(soup.p.name)  # --> p
The output is:
title
p
4.1.3 .attrs
Looks up an attribute's value by the attribute's name.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title asdas" name="abc" id = "qwe"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/123" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
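print(soup.p.attrs['name'])  # value of the name attribute of the first p tag
print(soup.p.attrs['id'])  # value of the id attribute of the first p tag
print(soup.p['id'])  # shorthand for the line above
print(soup.p['class'])  # class is multi-valued, so it comes back as a list
print(soup.a['href'])  # again, only the first matching tag is used
The output is: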
abc
qwe
qwe
['title', 'asdas']
http://example.com/123
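Square-bracket access raises KeyError when the attribute is absent. A short sketch of the safer alternatives, again on a cut-down copy of the sample markup with html.parser as an assumed parser choice:

```python
from bs4 import BeautifulSoup

html = '<p class="title asdas" name="abc" id="qwe"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.attrs)             # the full attribute dictionary
print(p.get("id"))         # qwe
print(p.get("style"))      # None, because the attribute is absent
print(p.get("style", ""))  # a default can be supplied, as with dict.get
print(p.has_attr("name"))  # True

try:
    p["style"]             # square brackets raise KeyError for a missing attribute
except KeyError as exc:
    print("missing attribute:", exc)
```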
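4.2 Nested selection
Child and descendant nodes
Each selection returns a tag, so selections can be chained; every step searches inside the previous tag, so the target must be one of its descendants.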
html = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The abc Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.b)  # searches downward through the tree, level by level
The output is:
<b>The abc Dormouse's story</b>
4.2.1 .contents
Returns a tag's direct children as a list.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
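# A tag selector only returns the first match, not everything. How do we get the rest?
print(soup.p.a)
print("-----"*20)
# .contents returns the direct children of a tag as a list
print(soup.p.contents)  # a is a child of p; this lists every direct child of the first p tag
print("-----"*20)
for i in soup.p.contents:
    print(i)
The output is: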
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
----------------------------------------------------------------------------------------------------
['\n Once upon a time there were three little sisters; and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n and\n ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
----------------------------------------------------------------------------------------------------
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
4.2.2 .children
Iterates over a tag's direct children.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# .children is an iterator over the same nodes that .contents holds
print(soup.p.children)  # direct children, returned as an iterator
# for i in soup.p.children:
#     print(i)
print("----------------------"*5)
# enumerate() pairs each item of an iterable with its index,
# which is convenient in a for loop
for i, child in enumerate(soup.p.children):
    print(i, child)
The output is:
<list_iterator object at 0x000001FEC6D6CAF0>
--------------------------------------------------------------------------------------------------------------
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
4.2.3 .descendants
Returns a generator over all descendant nodes (children, grandchildren, and so on).
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)  # all descendants; a generator is a special kind of iterator
for i, child in enumerate(soup.p.descendants):
    print(i, child)
The output is:
<generator object Tag.descendants at 0x000001FEC6D3D040>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
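The difference between direct children and all descendants is easier to see on a compact document. The sketch below uses shortened markup and html.parser (both assumptions for brevity) and also filters out the text nodes, which is often what a scraper wants:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents                # text, a, text, a
everything = list(p.descendants)   # also walks into each a tag
child_tags = [node for node in p.children if isinstance(node, Tag)]

print(len(direct))                         # 4
print(len(everything))                     # 7
print([tag["id"] for tag in child_tags])   # ['link1', 'link2']
```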
Parent and ancestor nodes
4.2.4 .parent
Gets the parent node.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # .parent returns the direct parent node
The output is:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
4.2.5 .parents
Returns a generator over all ancestor nodes.
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
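# print(soup.a.parents)  # ancestors, returned as a generator (a special kind of iterator)
# print(list(soup.a.parents))  # list() accepts any iterable and builds a list from it
# The last item yielded is the BeautifulSoup object itself, which prints the same as the html tag
for i, child in enumerate(soup.a.parents):
    print(i, child)
The output is: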
0 <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
1 <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
2 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>
3 <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>
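Items 2 and 3 in the output above look identical because the last ancestor is the BeautifulSoup object itself rather than a second html tag. Printing the names instead of the full markup makes that visible; this sketch uses a tiny document and html.parser as assumptions:

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first a tag and keep only each ancestor's name
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```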
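Sibling nodes
4.2.6 .next_siblings and .previous_siblings
.next_siblings yields the siblings that come after a node.
.previous_siblings yields the siblings that come before it, nearest first.
Both return a generator object.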
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
<span>abcqweasd</span>
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.next_siblings)
print(list(enumerate(soup.a.next_siblings)))  # every sibling after the first a tag
print('---'*15)
print(list(enumerate(soup.a.previous_siblings)))  # every sibling before it
The output is:
<generator object PageElement.next_siblings at 0x000001FEC6D3D190>
[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they lived at the bottom of a well.\n ')]
---------------------------------------------
[(0, '\n Once upon a time there were three little sisters; and their names were\n '), (1, <span>abcqweasd</span>), (2, '\n')]
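As the output shows, whitespace between tags counts as a sibling too. When only tag siblings matter, find_next_sibling() and find_next_siblings() skip the text nodes. A minimal sketch on shortened markup, with html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

html = """<p><a id="link1">Elsie</a>
<a id="link2">Lacie</a>
<a id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is the newline between the two tags
whitespace = first.next_sibling
# Filtering by tag name skips over text nodes
nearest_tag = first.find_next_sibling("a")
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(str(whitespace)))
print(nearest_tag["id"])   # link2
print(later_ids)           # ['link2', 'link3']
```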
4.3 Standard selectors
find_all( name , attrs , recursive , text , **kwargs )
Searches the document by tag name, attributes, or text content.
4.3.1 find_all()
Search by tag name
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# print(soup.find_all('ul'))  # every ul tag, with its contents
print(soup.find_all('div'))
# print(soup.find_all('ul')[0])
The output is:
[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>
</div>, <div class="panel-heading">
<h4>Hello</h4>
</div>, <div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo-2</li>
<li class="element">Bar-2</li>
</ul>
</div>]
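find_all() takes a few more options than a single tag name. The sketch below shows limit, a list of names, and recursive=False on a reduced copy of the sample markup; html.parser is an assumed parser choice so that no html or body wrapper is added.

```python
from bs4 import BeautifulSoup

html = """<div class="panel">
<div class="panel-heading"><h4>Hello</h4></div>
<div class="panel-body">
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)   # stop after two matches
mixed = soup.find_all(["h4", "li"])        # match any name in the list
panel = soup.find("div")                   # the outer div
direct_divs = panel.find_all("div", recursive=False)  # direct children only

print([li.string for li in first_two])   # ['Foo', 'Bar']
print([tag.name for tag in mixed])       # ['h4', 'li', 'li', 'li']
print(len(direct_divs))                  # 2
```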
4.3.2 .string
Gets the text value
for ul in soup.find_all('ul'):
    # print(ul)
    for i in ul.find_all("li"):
        # print(i)
        print(i.string)
The output is:
Foo
Bar
Jay
Foo-2
Bar-2
4.3.3 get_text()
Gets the text content
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element2">Foo</li>
<li class="element2">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    # print(ul)
    for i in ul.find_all('li'):
        # print(i)
        # print(i.string)
        print(i.get_text())  # .string returns None when a tag has more than one child; get_text() still works
The output is:
Foo
Bar
Jay
Foo
Bar
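In the example above every li holds a single string, so .string and get_text() agree. They differ once a tag has mixed content, as this small sketch shows (the markup is invented for illustration and html.parser is an assumed parser):

```python
from bs4 import BeautifulSoup

html = "<p>Hello, <b>world</b>!</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.string)                      # None: p has three children
print(p.get_text())                  # Hello, world!
print(p.get_text("|", strip=True))   # Hello,|world|!
print(list(p.stripped_strings))      # ['Hello,', 'world', '!']
```

The separator and strip arguments are useful when text from neighbouring tags would otherwise run together.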
4.3.4 find_all()
Search by attribute
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
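# Style 1: pass the attributes through attrs
# Syntax: attrs={'attribute name': 'attribute value'}
print(soup.find_all(attrs={'id': 'list-1'}))  # match on the id attribute
# print("-----"*10)
# print(soup.find_all(attrs={'name': 'elements'}))  # match on the name attribute
# print("-----"*10)
# for ul in soup.find_all(attrs={'name': 'elements'}):
#     print(ul)  # each item taken from the result list
#     print(ul.li.string)  # only the first li: a tag selector returns the first match
#     print('-----')
#     for li in ul:
#         print(li)  # the direct children of ul, all at the same level
#         print(li.string)
# Style 2: keyword arguments
# Syntax: (attribute_name='attribute value')
# print(soup.find_all(id='list-1'))
# Special case: class
# print(soup.find_all(class='element'))  # wrong: this is a SyntaxError
# print(soup.find_all(class_='element'))  # class is a Python keyword, so append an underscore
# Style 3 (recommended): give both the tag name and the attributes
print(soup.find_all('li', {'class': 'element'}))
The output is: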
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
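Attribute filters accept more than exact strings. The following sketch, on a reduced copy of the sample markup with html.parser as an assumed parser, matches one class out of several, a regular expression, and a function:

```python
import re
from bs4 import BeautifulSoup

html = """<ul class="list" id="list-1" name="elements"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches when any one of the tag's classes equals the value
by_one_class = soup.find_all("ul", class_="list")
small_only = soup.find_all("ul", class_="list-small")
# A compiled regular expression is searched against the attribute value
by_regex = soup.find_all(id=re.compile(r"^list-\d+$"))
# A function receives each tag and returns True to keep it
with_name = soup.find_all(lambda tag: tag.has_attr("name"))

print(len(by_one_class), len(small_only), len(by_regex), len(with_name))  # 2 1 2 1
```

The name attribute has to go through attrs or a function because the keyword name in find_all() already means the tag name.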
4.3.5 text=
Select by text value
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
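# Syntax: text='the text to look for' (Beautiful Soup 4.4.0 and later name this argument string=)
print(soup.find_all(text='Foo'))  # handy for counting occurrences
print(soup.find_all(text='Bar'))
print(len(soup.find_all(text='Foo')))  # number of matches
The output is: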
['Foo', 'Foo']
['Bar', 'Bar']
2
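A text search returns the strings themselves, not the tags around them. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string=, and uses html.parser; it also shows a regular expression and how to get the enclosing tags instead.

```python
import re
from bs4 import BeautifulSoup

html = """<ul><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul><li class="element">Foo-2</li></ul>"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                    # exact match only
prefixed = soup.find_all(string=re.compile(r"^Foo"))   # anything starting with Foo
tags = soup.find_all("li", string=re.compile(r"^Foo")) # the li tags, not the strings

print(exact)                          # ['Foo']
print(prefixed)                       # ['Foo', 'Foo-2']
print([tag.name for tag in tags])     # ['li', 'li']
```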
4.3.6 find( name , attrs , recursive , text , **kwargs )
find() returns a single element, the first match; find_all() returns every match.
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))  # only the first match is returned
print('---------'*5)
print(soup.find('li'))
print('---------'*5)
print(soup.find('page'))  # returns None when no such tag exists
The output is:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
---------------------------------------------
<li class="element">Foo</li>
---------------------------------------------
None
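Because find() returns None for a missing tag, calling a method on the result without a check raises AttributeError. One possible guard is sketched here; the helper name first_text and its default argument are hypothetical, and html.parser is an assumed parser.

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")


def first_text(soup, name, default=""):
    """Hypothetical helper: text of the first <name> tag, or default when absent."""
    tag = soup.find(name)
    return tag.get_text(strip=True) if tag is not None else default


print(first_text(soup, "li"))                         # Foo
print(first_text(soup, "page", default="(missing)"))  # (missing)
```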
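Summary of related methods
1. find_parents()
2. find_parent()
Difference: find_parents() returns all matching ancestor nodes; find_parent() returns the nearest one (the direct parent when no filter is given).
3. find_next_siblings()
4. find_next_sibling()
Difference: find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
5. find_previous_siblings()
6. find_previous_sibling()
Difference: find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
7. find_all_next()
8. find_next()
Difference: find_all_next() returns every matching node after the current node; find_next() returns the first one.
9. find_all_previous()
10. find_previous()
Difference: find_all_previous() returns every matching node before the current node; find_previous() returns the first one.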
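The methods listed above all start from a node and search in one direction. A brief sketch, starting from the li that contains Bar in a reduced copy of the sample markup (html.parser and the string= spelling are assumptions):

```python
from bs4 import BeautifulSoup

html = """<div class="panel-body">
<ul class="list" id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>
<ul class="list list-small" id="list-2"><li>Foo-2</li><li>Bar-2</li></ul>
</div>"""
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")

parent_id = bar.find_parent("ul")["id"]
ancestors = [tag.name for tag in bar.find_parents(["ul", "div"])]
next_one = bar.find_next_sibling("li").string
previous_one = bar.find_previous_sibling("li").string
after = [li.string for li in bar.find_all_next("li")]       # crosses into the second ul
before = [li.string for li in bar.find_all_previous("li")]

print(parent_id, ancestors)      # list-1 ['ul', 'div']
print(previous_one, next_one)    # Foo Jay
print(before, after)             # ['Foo'] ['Jay', 'Foo-2', 'Bar-2']
```

Note that the sibling methods stay inside the same parent, while find_all_next() and find_all_previous() follow document order across parents.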
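4.4 CSS selectors
Overview:
1. Class selector -- class
2. Tag selector -- <p></p>
3. ID selector -- id
For more detail, consult a CSS selector reference.
Usage:
Pass a CSS selector straight to select() to make the selection.
This is a good choice if you already know CSS selectors well.
Notes:
1. In a CSS selector a tag name is written bare, a class name is prefixed with . and an id is prefixed with #
2. The method is soup.select(), and it returns a list
3. Conditions separated by a space are applied from left to right, each one searching among the descendants of the previous match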
html='''
<div class="pan">q321312321</div>
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Select by tag: tag names are written bare, and a space means "descendant of"
print(soup.select('ul li'))
print("----"*10)
# Prefix a class name with .
print(soup.select('.panel'))
print("----"*10)
# Separate several conditions with spaces
print(soup.select('.panel .panel-heading'))
print("----"*10)
# Selector types can be mixed,
# for example an id followed by a class
print(soup.select('#list-1 .element'))  # select() returns every element that matches
# print("----"*10)
The output is:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
----------------------------------------
[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]
----------------------------------------
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
----------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
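Beyond the descendant combinator, select() understands other common CSS syntax, and select_one() returns just the first match. A short sketch on reduced markup, with html.parser as an assumed parser:

```python
from bs4 import BeautifulSoup

html = """<div class="panel">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo-2</li></ul>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.select_one("li.element")              # first match, like find()
small_items = soup.select("ul.list.list-small > li")  # both classes, direct children
by_attr = soup.select('ul[id="list-1"] li')           # attribute selector
missing = soup.select_one("table")                    # None when nothing matches

print(first_li.string)                     # Foo
print([li.string for li in small_items])   # ['Foo-2']
print([li.string for li in by_attr])       # ['Foo', 'Bar']
print(missing)                             # None
```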
Getting attribute values
There are two ways to write it:
1. ul['id']
2. ul.attrs['id']
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    # print(ul)
    # print(ul['id'])
    # print(ul['class'])
    print(ul.attrs['id'])
    print(ul.attrs['class'])
    # the commented lines and the attrs lines are the two equivalent spellings
The output is:
list-1
['list']
list-2
['list', 'list-small']
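In practice, attribute values and text are usually collected together. One way to do that, sketched on reduced markup with html.parser as an assumed parser and variable names chosen for illustration:

```python
from bs4 import BeautifulSoup

html = """<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo-2</li></ul>"""
soup = BeautifulSoup(html, "html.parser")

# Map each list's id to the text of its items
items_by_list = {
    ul["id"]: [li.get_text(strip=True) for li in ul.select("li")]
    for ul in soup.select("ul")
}
# class comes back as a list, so join it to recover the original attribute string
class_names = {ul["id"]: " ".join(ul.get("class", [])) for ul in soup.select("ul")}

print(items_by_list)   # {'list-1': ['Foo', 'Bar'], 'list-2': ['Foo-2']}
print(class_names)     # {'list-1': 'list', 'list-2': 'list list-small'}
```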
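Summary
- The lxml parser is the recommended choice
- Tag selectors offer weak filtering but are fast
- Use find() and find_all() to match a single result or multiple results
- If you know CSS selectors well, use select()
- Remember the common ways to get attribute values and text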