BeautifulSoup
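A flexible and convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.
Installation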
pip install beautifulsoup4
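Parsers
Python standard library: BeautifulSoup(markup,"html.parser")
lxml HTML parser: BeautifulSoup(markup,"lxml") (requires the lxml package, a C extension: pip install lxml)
lxml XML parser: BeautifulSoup(markup,"xml") (requires the lxml package, a C extension: pip install lxml)
html5lib: BeautifulSoup(markup,"html5lib") (requires pip install html5lib)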
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.prettify())
print(soup.title.string)
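The snippets in these notes assume a variable html that already holds the page source. A minimal runnable sketch follows; the sample document is made up for illustration, and it uses "html.parser" only so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from the original notes.
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# "html.parser" ships with Python, so no extra install is needed here.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string   # text inside <title>
pretty = soup.prettify()         # re-indented markup, one node per line

print(title_text)
print(pretty)
```

Swapping "html.parser" for "lxml" should give the same results on well-formed input like this.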
Tag selectors
Selecting elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Only the first match is returned. For example, if the page has several p tags, soup.p selects only the first one.
Getting the name
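To see the first-match behaviour concretely, here is a small sketch with a made-up two-paragraph document (the HTML is an assumption for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical input with two <p> tags.
html = "<div><p>first</p><p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p             # attribute access stops at the first <p>
all_p = soup.find_all("p")   # find_all collects every <p>

print(first_p)
print(len(all_p))
```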
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title.name)
Returns the name of the tag, here title.
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs['name'])
print(soup.p['name'])
Both forms return the name attribute of the p tag.
Getting content
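Indexing a tag raises KeyError when the attribute is absent, so a sketch of the safer .get() form may be useful. The sample markup is an assumption made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical <p> tag with a name attribute and two classes.
html = '<p name="intro" class="lead big">Hi</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_attr = p["name"]     # same value as p.attrs["name"]
classes = p["class"]      # class is multi-valued, so a list comes back
missing = p.get("id")     # .get() returns None instead of raising KeyError

print(name_attr, classes, missing)
```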
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.string)
Returns the text between the opening and closing p tags.
Nested selection
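One caveat worth illustrating: .string is only defined when a tag has a single child, otherwise it is None. The sketch below uses made-up markup and contrasts it with get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has one text child, <div> has text plus a nested tag.
html = "<p>plain</p><div>a<b>b</b></div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string          # the only child is a string, so it is returned
mixed = soup.div.string         # more than one child, so this is None
joined = soup.div.get_text()    # concatenates all nested text

print(single, mixed, joined)
```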
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.head.title.string)
Returns the text of the title tag.
Child and descendant nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.contents)
contents returns the direct children as a list.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
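print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
children is an iterator over the direct child nodes only, so you need a loop to read the values. Here enumerate yields each child together with its index.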
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
descendants is also lazy (a generator). It yields every descendant node, not just the direct children.
Parent and ancestor nodes
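The difference between the three accessors is easier to see on a tiny tree. This is a sketch on made-up markup; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with two levels of nesting.
html = "<p>one<b>two<i>three</i></b></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

contents = p.contents               # list of direct children
children = list(p.children)         # same nodes, produced lazily
descendants = list(p.descendants)   # every nested node, in document order

print(len(contents), len(children), len(descendants))
for i, node in enumerate(descendants):
    print(i, repr(node))
```

Direct children here are the string "one" and the b tag; descendants additionally include "two", the i tag, and "three".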
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.a.parent)
Prints the parent node of the a tag (here, the enclosing p).
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.parents)))
Prints every ancestor of the a tag, each with its index.
Sibling nodes
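As a sketch of what parents yields, the example below walks up from a link in a made-up document and collects the tag names. The last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical document with a link nested three levels deep.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

parent_name = a.parent.name                    # direct parent only
ancestor_names = [t.name for t in a.parents]   # nearest ancestor first

print(parent_name)
print(ancestor_names)
```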
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
next_siblings yields all sibling nodes that follow the node.
previous_siblings yields all sibling nodes that precede it.
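A short sketch of both iterators on made-up markup. Note that previous_siblings walks backwards, so the closest sibling comes first. In real pages the whitespace between tags also shows up as string siblings; the sample avoids that by containing no whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with four adjacent links and no whitespace between them.
html = "<p><a>1</a><a>2</a><a>3</a><a>4</a></p>"
soup = BeautifulSoup(html, "html.parser")

third = soup.find_all("a")[2]

after = [s.get_text() for s in third.next_siblings]        # everything after the third link
before = [s.get_text() for s in third.previous_siblings]   # everything before it, closest first

print(after)
print(before)
```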
Standard selectors
find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.
name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
find_all returns a list (a ResultSet); each element in it is a bs4.element.Tag.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
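Because every result is itself a Tag, find_all can be called again on it. A minimal sketch, assuming a made-up page with two lists:

```python
from bs4 import BeautifulSoup

# Hypothetical page with two <ul> lists.
html = (
    "<ul id='list-1'><li>Foo</li><li>Bar</li></ul>"
    "<ul id='list-2'><li>Baz</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

lists = soup.find_all("ul")

# Search inside each <ul> separately and group the item texts by list id.
items_per_list = {
    ul["id"]: [li.get_text() for li in ul.find_all("li")]
    for ul in lists
}

print(items_per_list)
```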
attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
attrs takes a dictionary: each key is an attribute name and each value is the attribute value to match.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
Attributes can also be passed as keyword arguments. class is a Python keyword, so use class_ instead.
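The two styles can be compared side by side. In this sketch (made-up markup), the name attribute has to go through attrs, because the name keyword of find_all is already used for the tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical list; the second <li> carries two classes.
html = (
    "<ul id='list-1' name='elements'>"
    "<li class='element'>Foo</li>"
    "<li class='element extra'>Bar</li>"
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all(attrs={"id": "list-1"})
by_kw = soup.find_all(id="list-1")                    # keyword shortcut for the same query
by_class = soup.find_all(class_="element")            # matches any tag that has this class
by_name_attr = soup.find_all(attrs={"name": "elements"})

print(len(by_dict), len(by_kw), len(by_class), len(by_name_attr))
```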
text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(text='Foo'))
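This returns the matching text strings, not the tags that contain them, so it is suited to checking whether text is present rather than to locating elements. Newer Beautiful Soup releases name this argument string; text is the older alias.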
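A sketch of text matching, assuming Beautiful Soup 4.4.0 or later (where the argument is called string) and a made-up list. A plain string must match the whole text, while a compiled regular expression matches partially; .parent gets back from a matched string to its tag.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical list items.
html = "<ul><li>Foo</li><li>Bar</li><li>Foobar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# A plain string only matches text that is exactly equal to it.
exact = [str(s) for s in soup.find_all(string="Foo")]

# A compiled pattern matches any text it can be found in.
partial_nodes = soup.find_all(string=re.compile("Foo"))
partial = [str(s) for s in partial_nodes]

# Each result is a string node; .parent is the tag that holds it.
parent_tags = [s.parent.name for s in partial_nodes]

print(exact, partial, parent_tags)
```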
find(name,attrs,recursive,text,**kwargs)
find returns only the first matching element; find_all returns all of them.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))
If no matching tag exists, find returns None.
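Since find can return None, chained calls such as soup.find('page').get_text() fail with AttributeError on a miss. A minimal sketch of guarding against that, using made-up markup and a hypothetical fallback label:

```python
from bs4 import BeautifulSoup

# Hypothetical document that has a <ul> but no <page> tag.
html = "<ul><li>Foo</li></ul>"
soup = BeautifulSoup(html, "html.parser")

ul = soup.find("ul")
missing = soup.find("page")

# Check for None before dereferencing the result.
label = missing.get_text() if missing is not None else "(no <page> tag)"

print(ul.name, missing, label)
```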
find_parents() find_parent()
find_parents() returns all matching ancestor nodes; find_parent() returns the closest one (with no filter, the direct parent).
find_next_siblings() find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first such node.
find_all_previous() find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first such node.
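The directional variants above take the same filters as find_all. The sketch below exercises several of them from one starting node; the markup and ids are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a list inside two nested divs, followed by a paragraph.
html = (
    "<div id='outer'><div id='box'>"
    "<ul><li>a</li><li id='mid'>b</li><li>c</li></ul>"
    "<p>after</p>"
    "</div></div>"
)
soup = BeautifulSoup(html, "html.parser")
mid = soup.find(id="mid")

enclosing_divs = [d["id"] for d in mid.find_parents("div")]   # closest ancestor first
parent_list = mid.find_parent("ul").name
next_item = mid.find_next_sibling("li").get_text()
prev_item = mid.find_previous_sibling("li").get_text()
next_p = mid.find_next("p").get_text()         # not a sibling, but later in the document
later_items = [li.get_text() for li in mid.find_all_next("li")]
earlier_item = mid.find_previous("li").get_text()

print(enclosing_divs, parent_list, next_item, prev_item, next_p, later_items, earlier_item)
```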
CSS selectors
Pass a CSS selector string directly to select() to make the selection.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
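print(type(soup.select('ul')[0]))
Class selectors start with a dot (.), id selectors with a hash (#). In '#list-2 .element' the space before the dot makes it a descendant selector.
select() returns a list; each element in it is still a bs4.element.Tag.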
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.select('ul'):
    print(ul.select('li'))
select() can be nested: call it again on each result.
Getting attributes
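A self-contained sketch of the selectors used above, run against a made-up panel layout. It also uses select_one(), which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the same class and id names as the notes.
html = (
    '<div class="panel">'
    '<div class="panel-heading"><h4>Hello</h4></div>'
    '<div class="panel-body">'
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list" id="list-2"><li class="element">Baz</li></ul>'
    '</div></div>'
)
soup = BeautifulSoup(html, "html.parser")

headings = soup.select(".panel .panel-heading")                      # descendant selector on classes
in_list2 = [li.get_text() for li in soup.select("#list-2 .element")]
first_item = soup.select_one("ul li").get_text()                     # first match only

# Nested use: run select() again on each <ul>.
counts = {ul["id"]: len(ul.select("li")) for ul in soup.select("ul")}

print(len(headings), in_list2, first_item, counts)
```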
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for li in soup.select('li'):
    print(li.get_text())
Summary
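Putting attribute and text access together on select() results, as a sketch with made-up markup. get_text(strip=True) is shown because text scraped from real pages usually carries surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical lists; the first item has padding around its text.
html = (
    "<ul id='list-1'><li> Foo </li><li>Bar</li></ul>"
    "<ul id='list-2'><li>Baz</li></ul>"
)
soup = BeautifulSoup(html, "html.parser")

ids = [ul["id"] for ul in soup.select("ul")]

raw_first = soup.select("li")[0].get_text()                     # whitespace preserved
texts = [li.get_text(strip=True) for li in soup.select("li")]   # whitespace trimmed

print(ids, repr(raw_first), texts)
```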
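**Prefer the lxml parser; fall back to html.parser when necessary
**Tag selectors (soup.tag) offer weak filtering but are fast
**Use find() and find_all() to match a single result or multiple results
**If you are comfortable with CSS selectors, use select()
**Remember the common ways to get attribute values and text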