Installation: pip3 install beautifulsoup4
Parsers
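Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup,"html.parser") | Built into Python, moderate speed, good tolerance of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,"lxml") | Fast, good tolerance of malformed documents | Requires the external lxml package (a C extension) |
lxml XML parser | BeautifulSoup(markup,"xml") | Fast, the only parser listed here that supports XML | Requires the external lxml package (a C extension) |
html5lib | BeautifulSoup(markup,"html5lib") | Best fault tolerance, parses documents the way a browser does, produces HTML5-style documents | Slow; requires the external html5lib package |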
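Because lxml and html5lib are optional installs, a script may want to prefer lxml and fall back to the built-in parser when it is missing. The following is a minimal sketch of that idea; the helper name `make_soup` and the fallback order are assumptions made for illustration, not part of Beautiful Soup. It relies on `bs4.FeatureNotFound`, the exception Beautiful Soup raises when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse markup with the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Unclosed tags are tolerated by both parsers.
soup = make_soup("<p>Hello<b>world")
print(soup.find("b").get_text())
```

html.parser ships with Python, so the fallback should always be available.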
Basic usage
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
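from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # build a BeautifulSoup object from html; the second argument selects the lxml parser
print(soup.prettify())  # pretty-print the parsed document
print(soup.title.string)  # print the text inside the title tag
Output:
①: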
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dromouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
②:The Dormouse's story
Tag selectors
Selecting elements
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.title)  # print the title tag
print(type(soup.title))  # print the type of the title tag
print(soup.head)  # print the head tag
print(soup.p)  # print the p tag (only the first match is returned)
Output:
①:<title>The Dormouse's story</title>
②:<class 'bs4.element.Tag'>
③:<head><title>The Dormouse's story</title></head>
④:<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
Getting the tag name
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.title.name)  # print the name of the tag
Output:
title
Getting attributes
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.p.attrs['name'])  # print the name attribute of the p tag
print(soup.p['name'])  # shorter way to read the same attribute
Output:
dromouse
dromouse
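The two lookups above raise `KeyError` when the attribute does not exist. As a small sketch (the markup below is a shortened variant of the example, with a second class value added for illustration), `Tag.get()` avoids the exception, and the `class` attribute comes back as a list because it can hold several values:

```python
from bs4 import BeautifulSoup

html = '<p class="title main" name="dromouse"><b>The Dormouse\'s story</b></p>'
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_attr = p["name"]                # raises KeyError if the attribute is missing
missing = p.get("id")                # .get() returns None instead of raising
with_default = p.get("id", "no-id")  # or a default value of your choice
classes = p["class"]                 # multi-valued attribute, returned as a list

print(name_attr, missing, with_default, classes)
```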
Getting the text content
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
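from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.p.string)  # print the text inside the first p tag (it works here because p contains a single b tag holding a single string)
Output: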
The Dormouse's story
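`.string` only returns text when a tag has exactly one child (or a single chain of only children); otherwise it is `None`. The sketch below, which uses a shortened variant of the example markup as an assumption, contrasts `.string` with `get_text()` and shows that the first link, whose only child is an HTML comment, yields a `Comment` object:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a> and '
    '<a id="link2">Lacie</a></p>'
)
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string         # p > b > text is a single chain, so the text is returned
several = story_p.string        # p has several children, so .string is None
all_text = story_p.get_text()   # joins the text nodes found under the tag
comment = soup.find(id="link1").string  # the only child is a comment

print(single, several, isinstance(comment, Comment))
print(all_text)
```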
Nested selection
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters;and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.head.title.string)  # drill down level by level: the text of the title tag inside head
Output:
The Dormouse's story
Children and descendants
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.p.contents)  # get all direct children of the p tag, returned as a list
Output:
['\n Once upon a time there were three little sisters; and their names were\n
', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n
and\n ', <a class="sister" href="http://example.com/tillie"
id="link3">Tillie</a>, '\n and they lived at the bottom of a well.\n ']
PS: another way to get the child nodes
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.p.children)  # .children is an iterator, not a list, so loop over it to get the child nodes
for i, child in enumerate(soup.p.children):  # enumerate yields each child node together with its index
    print(i, child)
Output:
<list_iterator object at 0x0000028DA218BBE0>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
PS: getting all descendants
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.p.descendants)  # also an iterator; it yields every child and deeper descendant of p, each listed separately
for i, child in enumerate(soup.p.descendants):
    print(i, child)
Output:
<generator object Tag.descendants at 0x0000015CFAA0A318>
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
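As the outputs above show, `.children` and `.descendants` also yield the whitespace-only text nodes between tags. A common way to skip them, sketched here on a shortened variant of the markup, is to keep only `Tag` instances:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">
  Once upon a time
  <a id="link1"><span>Elsie</span></a>
  <a id="link2">Lacie</a>
</p>"""
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

# Direct children that are tags (text nodes are filtered out).
child_ids = [child["id"] for child in soup.p.children if isinstance(child, Tag)]

# All descendant tags, in document order, including the nested span.
descendant_names = [node.name for node in soup.p.descendants if isinstance(node, Tag)]

print(child_ids)
print(descendant_names)
```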
Parents and ancestors
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.a.parent)  # get the parent node of the first a tag
Output:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(list(enumerate(soup.a.parents)))  # .parents yields the parent of the a tag and every ancestor above it
Output:
[(0, <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>), (1, <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>)]
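The last two entries of the output above look identical because the final ancestor is the `BeautifulSoup` object itself, which prints the same way as the `html` tag. Printing only the names makes the chain easier to read; this sketch uses a shortened document as an assumption:

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='story'><a id='link1'>Elsie</a></p></body></html>"
)
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the first a tag and collect the name of each ancestor.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The final name, `[document]`, is the name Beautiful Soup gives to the top-level `BeautifulSoup` object.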
Siblings
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(list(enumerate(soup.a.next_siblings)))  # get the siblings that come after the first a tag
print(list(enumerate(soup.a.previous_siblings)))  # get the siblings that come before the first a tag
Output:
①:[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>),
(2, '\n and\n '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n and they
lived at the bottom of a well.\n ')]
②:[(0, '\n Once upon a time there were three little sisters; and their names were\n ')]
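Because `.next_siblings` includes text nodes, the immediate sibling of a tag is often just whitespace. The sketch below, on a compact variant of the markup, compares `.next_sibling` with `find_next_sibling()`, which skips ahead to the next matching tag:

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>\n'
    '<a id="link2">Lacie</a>\n'
    '<a id="link3">Tillie</a></p>'
)
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

raw_next = first.next_sibling              # the newline text node between the tags
next_tag = first.find_next_sibling("a")    # the next sibling that is an a tag
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(raw_next), next_tag["id"], later_ids)
```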
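Standard selectors
find_all(name, attrs, recursive, text, **kwargs) ※ Much like re.findall for regular expressions, it finds every match and returns the results as a list
It can search the document by tag name, attributes, or text content
name: search by tag name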
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.find_all('ul'))  # pass the tag name ul to get every ul tag
print(type(soup.find_all('ul')[0]))  # check the type of the first ul in the result
Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
for ul in soup.find_all('ul'):  # loop over every ul tag
    print(ul.find_all("li"))  # extract the li tags inside each ul
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
attrs: search by attribute
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.find_all(attrs={'id':'list-1'}))  # attrs takes a dict: keys are attribute names, values are attribute values
print(soup.find_all(attrs={'name':'elements'}))  # the result is a list of matching elements
Output:
①:[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
②:[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
PS: another way to search by attribute
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
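from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.find_all(id="list-1"))  # common attributes can be passed directly as keyword arguments
print(soup.find_all(class_="element"))  # class is a Python keyword, so the argument is spelled class_
Output: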
①:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
②:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li
class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
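One detail worth knowing about `class_`: an element can carry several classes, and `class_` matches any single one of them. The following sketch, on a shortened variant of the markup, shows that `class_="list"` also matches the `ul` whose class is `list list-small`:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
"""
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

# Matches every tag that has "list" among its classes.
any_list = [ul["id"] for ul in soup.find_all(class_="list")]

# Matches only the tag that has "list-small" among its classes.
small_only = [ul["id"] for ul in soup.find_all(class_="list-small")]

# With a CSS selector, chaining classes requires all of them to be present.
both_classes = [ul["id"] for ul in soup.select("ul.list.list-small")]

print(any_list, small_only, both_classes)
```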
text: select by text content
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.find_all(text="Foo"))  # returns the matching text nodes, not the tags that contain them
Output:
['Foo', 'Foo']
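Text filters are not limited to exact strings. The sketch below assumes Beautiful Soup 4.4.0 or later, where the `text` argument is also available under the name `string`; it shows a regular-expression filter, a combined tag-and-text filter that returns the tags themselves, and the `limit` argument:

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Jay</li></ul>'
)
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

# A compiled pattern matches text nodes with a regular-expression search.
texts = soup.find_all(string=re.compile(r"^(Foo|Bar)$"))

# Combining a tag name with string= returns the tags, not the text nodes.
jay_tags = soup.find_all("li", string="Jay")

# limit stops the search after the given number of matches.
first_two = soup.find_all("li", limit=2)

print(texts, jay_tags, len(first_two))
```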
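find(name, attrs, recursive, text, **kwargs): used the same way as find_all; the only difference is what is returned
find returns a single element (the first match, or None if nothing matches); find_all returns all matching elements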
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.find('ul'))  # the first ul tag
print(type(soup.find('ul')))  # the type of the result
print(soup.find('page'))  # no such tag exists, so None is returned
Output:
①:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
②:
<class 'bs4.element.Tag'>
③:
None
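Since `find()` returns `None` when nothing matches, chaining directly onto its result can raise `AttributeError`. A minimal sketch of a guard follows; the helper name `text_or_default` is hypothetical and exists only for this illustration:

```python
from bs4 import BeautifulSoup

# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', "html.parser")


def text_or_default(root, name, default=""):
    """Hypothetical helper: text of the first matching tag, or a default."""
    node = root.find(name)
    return node.get_text() if node is not None else default


found = text_or_default(soup, "li")
not_found = text_or_default(soup, "page", default="n/a")
print(found, not_found)
```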
find_parents(): used the same way as find_all
Returns all matching ancestor nodes
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumers = soup.find_all(class_ = 'element')  # find the tags whose ancestors we want; with find() the indexing on the next line is not needed
primaryconsumer = primaryconsumers[0]  # take the first tag from the list
print(primaryconsumer.find_parents('div'))  # find every ancestor of this tag that is a div, returned as a list
Output:
[<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>, <div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]
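find_parent(): takes the same filter arguments as find
Returns the nearest ancestor that matches the filter (the direct parent when no filter is given)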
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find(class_ = 'element')  # locate the first tag whose class is element
print(primaryconsumer.find_parent('div'))  # find the nearest ancestor of this tag that is a div
Output:
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
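Note that the result above is the enclosing `div`, not the `ul` that is the direct parent of the `li`. The sketch below, on a compact variant of the markup, puts `.parent`, `find_parent()` and `find_parents()` side by side:

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel"><div class="panel-body">'
    '<ul id="list-1"><li class="element">Foo</li></ul>'
    '</div></div>'
)
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")
li = soup.find(class_="element")

direct = li.parent.name                       # the direct parent, here the ul
unfiltered = li.find_parent().name            # no filter: same as the direct parent
nearest_div = li.find_parent("div")["class"]  # nearest ancestor that is a div
all_divs = [d["class"][0] for d in li.find_parents("div")]  # innermost first

print(direct, unfiltered, nearest_div, all_divs)
```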
find_next_siblings():
Returns all matching siblings that come after the node
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find(class_ = 'element')  # locate the first tag whose class is element
print(primaryconsumer.find_next_siblings('li'))  # find all li siblings after this tag, returned as a list
Output:
[<li class="element">Bar</li>, <li class="element">Jay</li>]
find_next_sibling():
Returns the first matching sibling that comes after the node
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find(class_ = 'element')  # locate the first tag whose class is element
print(primaryconsumer.find_next_sibling('li'))  # find the first li sibling after this tag
Output:
<li class="element">Bar</li>
find_previous_siblings():
Returns all matching siblings that come before the node
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumers = soup.find_all(class_ = 'element')  # find every tag whose class is element
primaryconsumer = primaryconsumers[2]  # take the third tag
print(primaryconsumer.find_previous_siblings('li'))  # find all li siblings before this tag (nearest first)
Output:
[<li class="element">Bar</li>, <li class="element">Foo</li>]
find_previous_sibling():
Returns the first matching sibling that comes before the node
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumers = soup.find_all(class_ = 'element')  # find every tag whose class is element
primaryconsumer = primaryconsumers[2]  # take the third tag
print(primaryconsumer.find_previous_sibling('li'))  # find the nearest li sibling before this tag
Output:
<li class="element">Bar</li>
find_all_next():
Returns all matching nodes that come after the node in the document
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find("div")  # locate the first div tag
print(primaryconsumer.find_all_next('li'))  # return every li tag that comes after this tag
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li
class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
find_next():
Returns the first matching node that comes after the node in the document
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find("div")  # locate the first div tag
print(primaryconsumer.find_next('li'))  # return the first li tag that comes after this tag
Output:
<li class="element">Foo</li>
find_all_previous():
Returns all matching tags that come before the node in the document
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find(class_="list")  # locate the first tag whose class is list
print(primaryconsumer.find_all_previous('div'))  # return every div tag that comes before this tag
Output:
[<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>, <div class="panel-heading">
<h4>Hello</h4>
</div>, <div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]
find_previous():
Returns the first matching tag that comes before the node in the document
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
primaryconsumer = soup.find(class_="list")  # locate the first tag whose class is list
print(primaryconsumer.find_previous('div'))  # return the nearest div tag that comes before this tag
Output:
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
CSS selectors
Pass a CSS selector directly to select() to make a selection
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
print(soup.select('.panel .panel-heading'))  # to select by class, prefix the name with .
print(soup.select('ul li'))  # to select by tag name, no prefix is needed
print(soup.select("#list-2 .element"))  # to select by id, prefix the name with #
print(type(soup.select('ul')[0]))  # print the type of a selected node
Output:
①:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
②:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li
class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
③:
[<li class="element">Foo</li>, <li class="element">Bar</li>]
④:
<class 'bs4.element.Tag'> ※ The result is a bs4.element.Tag, so further nested selection is possible
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
for ul in soup.select('ul'):  # select level by level: first each ul, then the li tags inside it
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li
class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
PS: the nested loop above is an alternative to a single combined selector such as 'ul li'
Getting attributes
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
for ul in soup.select('ul'):
    print(ul['id'])  # get the id attribute of each ul
    print(ul.attrs['id'])  # equivalent form; it prints the same value
Output:
list-1
list-1
list-2
list-2
Getting the text content
html = '''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup  # import the BeautifulSoup class
soup = BeautifulSoup(html, "lxml")  # parse html with the lxml parser
for li in soup.select('li'):  # loop over every li tag
    print(li.get_text())  # get_text() returns the text inside the tag
Output:
Foo
Bar
Jay
Foo
Bar
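When only the first match is needed, `select_one()` plays the same role for CSS selectors that `find()` plays for `find_all()`. The sketch below assumes a Beautiful Soup version recent enough to provide `select_one()` (4.4.0 or later) and uses a shortened markup sample with an extra link added for illustration; it also shows `get_text(strip=True)` and an attribute selector:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element"> Foo </li>
  <li class="element">Bar</li>
</ul>
<a href="http://example.com/lacie" id="link2">Lacie</a>
"""
# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup(html, "html.parser")

first_li = soup.select_one("#list-1 .element")  # first match only
missing = soup.select_one("#no-such-id")        # None when nothing matches

# strip=True trims surrounding whitespace from the extracted text.
texts = [li.get_text(strip=True) for li in soup.select("li.element")]

# [href] selects only the a tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.select("a[href]")]

print(first_li.get_text(strip=True), missing, texts, hrefs)
```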
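Summary:
- The lxml parser is recommended; fall back to html.parser when necessary
- Tag selectors (such as soup.title or soup.p) are fast but offer only weak filtering
- Use find() and find_all() to match a single result or multiple results
- If you are familiar with CSS selectors, use select()
- Remember the common ways to get attribute values and text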