BeautifulSoup in Detail
BeautifulSoup is a flexible, convenient web-parsing library: it parses efficiently and supports multiple parsers.
With it you can extract information from a web page without writing any regular expressions.
Installation
pip3 install beautifulsoup4
Usage
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; decent speed; lenient with malformed documents | Versions before Python 2.7.3 / 3.2.2 are much less lenient |
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; lenient with malformed documents | Requires the C-based lxml library to be installed |
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the C-based lxml library to be installed |
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses documents the way a browser does | Slow; depends on an external Python package |
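A quick way to see the error tolerance the table describes: feed a deliberately broken snippet (this one is a made-up example, not from the document above) to the built-in parser, which needs no extra installation.

```python
from bs4 import BeautifulSoup

broken = "<p>Hello<b>world"  # unclosed <p> and <b> tags

# "html.parser" ships with Python -- no extra install needed
soup = BeautifulSoup(broken, "html.parser")

# The parser closes the dangling tags for us
print(soup.p.text)    # Helloworld
print(soup.b.string)  # world
```

The other parsers are used the same way; only the second argument changes.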
Basic usage
from bs4 import BeautifulSoup

# An incomplete snippet of HTML: the <body> and <html> tags are never closed
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2"><!--Lacie--></a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
# Build the BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# prettify() prints the completed document -- the missing tags are filled in (error tolerance)
print(soup.prettify())
print(soup.title.string)
Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dormouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!--Elsie-->
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    <!--Lacie-->
   </a>
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story
Tag selectors
from bs4 import BeautifulSoup

# The same incomplete HTML snippet as above
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dormouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>
<a href="http://example.com/lacie" class="sister" id="link2"><!--Lacie--></a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.title)            # the whole <title> tag
print(soup.title.name)       # the tag's name
print(type(soup.title))      # the tag's type
print(soup.head)
print(soup.p)                # a tag selector returns only the first match
print(soup.p.attrs['name'])  # get an attribute
print(soup.p['name'])        # shorthand for the same thing
Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content.
Searching by name
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Searching by attrs
attrs takes a dictionary of attribute/value pairs, e.g.:
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))      # equivalent: soup.find_all(id='list-1')
print(soup.find_all(attrs={'name': 'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>]
Searching by text
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # newer bs4 releases prefer the spelling string='Foo'
Output:
['Foo', 'Foo']
The find method takes exactly the same arguments as find_all; the difference is that find_all returns a list of every matching element, while find returns only the first match (the first value of that list), or None when nothing matches.
find(name, attrs, recursive, text, **kwargs)
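The difference can be sketched with a minimal fragment (a made-up snippet, not the panel markup from above):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all('li')   # a list of every match
first_item = soup.find('li')      # just the first match

print(len(all_items))             # 2
print(first_item.string)          # Foo
print(soup.find('table'))         # None -- find returns None when nothing matches
```

Because find can return None, guard chained attribute access (e.g. `tag.string`) accordingly.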
CSS selectors
Pass a CSS selector directly to select() to make a selection.
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# class selectors take a leading dot: .panel .panel-heading
print(soup.select('.panel .panel-heading'))
# tag names are written as-is
print(soup.select('ul li'))
# id selectors take a leading #
print(soup.select('#list-2 .element'))
Output:
[<div class="panel-heading">
<h4>hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Getting attributes
print(ul['id'])
You can also use print(ul.attrs['id']); the two forms are equivalent.
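As a self-contained sketch of both attribute-access forms (a minimal fragment, not the full panel markup):

```python
from bs4 import BeautifulSoup

html = '<ul class="list" id="list-1"><li>Foo</li></ul>'
soup = BeautifulSoup(html, "html.parser")

for ul in soup.select('ul'):
    print(ul['id'])        # dictionary-style access -> list-1
    print(ul.attrs['id'])  # equivalent, via the attrs dict
    print(ul['class'])     # multi-valued attributes come back as a list: ['list']
```

Note that class is treated as multi-valued, so it is returned as a list rather than a string.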
Getting text content
from bs4 import BeautifulSoup
html = '''
<div class="panel">
<div class="panel-heading">
<h4>hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
jay
Foo
Bar
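Earlier examples used .string, while this one uses get_text(); the two are not interchangeable. When a tag has more than one child, .string returns None, whereas get_text() concatenates all descendant text. A small sketch (made-up fragment):

```python
from bs4 import BeautifulSoup

html = '<p>Hello <b>world</b></p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.get_text())  # Hello world -- all descendant text, joined
print(soup.p.string)      # None -- <p> has two children, so .string gives up
print(soup.b.string)      # world -- exactly one text child, so .string works
```

In practice, get_text() is the safer choice when a tag may contain nested markup.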