BeautifulSoup

BeautifulSoup is a flexible and convenient web page parsing library. It is efficient, supports multiple parsers, and lets you extract information from web pages without writing regular expressions.

Installation

pip install beautifulsoup4
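
The examples in this article use the lxml parser; lxml (and optionally html5lib) is a separate package, so install it as well if it is not already present:

pip install lxml
pip install html5lib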

Parsers

Python standard library: BeautifulSoup(markup, "html.parser")

lxml HTML parser: BeautifulSoup(markup, "lxml") (requires the lxml C library)

lxml XML parser: BeautifulSoup(markup, "xml") (requires the lxml C library)

html5lib: BeautifulSoup(markup, "html5lib")
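
All of the snippets below assume an html variable holding the page source to parse. Here is a minimal, hypothetical sample string; its tags and attributes are invented only so that the later examples have something to match:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were two little sisters:
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>.
    </p>
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
</body></html>
"""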

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")    # parse with the lxml parser
print(soup.prettify())                # pretty-print the parsed document
print(soup.title.string)              # the text inside the <title> tag

Tag selectors

Selecting elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title) 
print(type(soup.title))
print(soup.head)
print(soup.p)

Only the first match is returned; e.g. if there are several <p> tags, only the first <p> is selected.

Getting the name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title.name) 

Returns the name of the tag, here title.

Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs['name'])    # via the attrs dictionary
print(soup.p['name'])          # shorthand, equivalent

Both forms get the name attribute of the <p> tag.

Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.string)

Gets the text inside the <p> tag.

Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.head.title.string)

Gets the content of title by chaining tag attributes.

Children and descendants
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.contents)

contents returns the direct children as a list.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.children)
for i,child in enumerate(soup.p.children):
  print(i,child)

children is an iterator that yields only the direct children; you have to loop over it to get the values, here printing each child together with its index.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
  print(i,child)

descendants is also an iterator; it yields all descendant nodes (children, grandchildren, and so on).

Parent and ancestor nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.a.parent)

Outputs the parent of the <a> tag, here a <p>.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.parents)))

parents yields all ancestor nodes of the <a> tag.

Sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

next_siblings yields all sibling nodes that follow the current node.

previous_siblings yields all sibling nodes that precede it.


Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

Documents can be searched by tag name, attributes, or text content.

name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

find_all returns a list whose elements are of type bs4.element.Tag.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))

attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

attrs takes a dictionary whose keys are attribute names and whose values are the attribute values to match.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

class is a Python keyword, so the keyword argument class_ is used instead.

text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(text='Foo'))

This returns the matching text strings themselves rather than tags, so it is useful for checking that certain content is present, not for locating the elements that contain it.
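
If you do need the containing tag, one option (a small sketch using the assumed sample html above) is to walk from each matched string up to its parent:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
for s in soup.find_all(text='Foo'):    # each s is a NavigableString
    print(s.parent)                    # the tag that directly contains the text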

find(name,attrs,recursive,text,**kwargs)

find returns a single element (the first match), while find_all returns all matches.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))

If no matching tag is found, find returns None.
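
Because of that, it is safer to check the result before using it; a minimal sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
page = soup.find('page')
if page is not None:
    print(page.string)
else:
    print('no <page> tag in this document')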


find_parents() find_parent()

find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
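
A short sketch on the assumed sample html, where the <a> tag sits inside a <p> that is nested in <div> elements:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
a = soup.a
print(a.find_parent())         # the direct parent, here the <p> tag
print(a.find_parents('div'))   # all <div> ancestors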

find_next_siblings() find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
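
For example, starting from the first <li> in the assumed sample html:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
li = soup.li
print(li.find_next_sibling('li'))    # the first following <li>
print(li.find_next_siblings('li'))   # all following <li> siblings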

find_previous_siblings() find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
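
Again on the assumed sample html, starting from the last <li> in the document:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
last_li = soup.find_all('li')[-1]
print(last_li.find_previous_sibling('li'))    # the nearest preceding <li>
print(last_li.find_previous_siblings('li'))   # all preceding <li> siblings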

find_all_next() find_next()

find_all_next() returns all matching nodes that come after the current node in document order; find_next() returns the first such node.
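
A sketch on the assumed sample html, where the <a> tags appear before the <ul> lists:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
a = soup.a
print(a.find_next('ul'))        # the first <ul> that appears after the <a>
print(a.find_all_next('li'))    # every <li> that appears after the <a>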

find_all_previous() find_previous()

find_all_previous() returns all matching nodes that come before the current node; find_previous() returns the first such node, searching backwards.
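
And the reverse direction, again a sketch on the assumed sample html:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")
li = soup.find_all('li')[-1]
print(li.find_previous('a'))        # the nearest <a> before this <li>
print(li.find_all_previous('a'))    # every <a> that appears before it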


CSS selectors

Selection can be done by passing a CSS selector string directly to select().
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')))

Classes are selected with a dot (.) and ids with a hash (#); in '#list-2 .element' the space before the dot makes it a descendant selector.

The elements of the returned list are still of type bs4.element.Tag.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.select('ul'):
  print(ul.select('li'))

select() calls can be nested in the same way.

Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.select('ul'):
 print(ul['id']) 
 print(ul.attrs['id'])

Getting text content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for li in soup.select('li'):
  print(li.get_text())    # get_text() returns the element's text content

Summary

Use the lxml parser by default, and fall back to html.parser when necessary.

Tag selectors are fast but offer only weak filtering.

Use find() and find_all() to look up a single result or multiple results.

If you are comfortable with CSS selectors, select() is recommended.

Remember the common methods for getting attributes and text values.
