BeautifulSoup

--aasher

Category: Python Web Scraping. Tags: Python, BeautifulSoup, Python web scraping

Article link: https://blog.csdn.net/SysEchoo/article/details/80514735

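BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.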

Installation

pip install beautifulsoup4

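Parsers

Python standard library: BeautifulSoup(markup, "html.parser")

lxml HTML parser: BeautifulSoup(markup, "lxml"), requires the lxml package, which depends on C libraries

lxml XML parser: BeautifulSoup(markup, "xml"), requires the lxml package, which depends on C libraries

html5lib: BeautifulSoup(markup, "html5lib")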

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.prettify())
print(soup.title.string)
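
The snippets in this post use a variable named html without showing its value. As a minimal sketch, the basic usage above could be run against a small document like the one below. The sample markup is hypothetical, and "html.parser" is used here only so that nothing beyond beautifulsoup4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; the post never shows the contents of `html`.
html = """
<html><head><title>Demo page</title></head>
<body>
<p name="first">Hello <a href="/a">link A</a></p>
<p name="second">Second paragraph</p>
</body></html>
"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string  # text inside <title>
first_p = soup.p                # dotted access returns only the first <p>
print(title_text)
print(first_p["name"])
```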

Tag selectors

Selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title) 
print(type(soup.title))
print(soup.head)
print(soup.p)

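Only the first match is returned. For example, if the document has several p tags, soup.p selects only the first one.

Getting the name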

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.title.name)

Returns the name of the tag, here title

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.attrs['name'])
print(soup.p['name'])

Both forms return the name attribute of the p tag
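
Indexing a tag with a missing attribute raises KeyError, so it can help to know the alternatives. The sketch below uses a hypothetical one-line document; it compares attrs[...], tag[...] and tag.get(...), and shows that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to illustrate attribute access.
html = '<p name="intro" class="lead big">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

name_via_attrs = p.attrs["name"]  # attrs is a dict-like mapping of attributes
name_via_index = p["name"]        # shorthand for the same lookup
missing = p.get("id")             # get() returns None instead of raising KeyError
classes = p["class"]              # multi-valued attribute: returned as a list

print(name_via_attrs, name_via_index, missing, classes)
```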

Getting the content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.string)

Returns the text between the opening and closing p tags
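
One caveat worth a quick sketch: .string only works when the tag has a single child. With mixed content it returns None, and get_text() is the safer choice. The markup below is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with plain text, one with mixed content.
html = "<p>plain</p><div>Hello <b>world</b></div>"
soup = BeautifulSoup(html, "html.parser")

simple = soup.p.string         # single text child, so the string is returned
mixed = soup.div.string        # two children (text and <b>), so this is None
joined = soup.div.get_text()   # concatenates all text under the tag

print(simple, mixed, joined)
```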

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.head.title.string)

Returns the text of the title tag inside head

Children and descendants

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.contents)

contents returns the direct children as a list

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.children)
for i,child in enumerate(soup.p.children):
  print(i,child)

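children is an iterator over the direct children only, so you need a loop to read its values. With enumerate, each step gives the index and the child node.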

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
  print(i,child)

descendants is also an iterator; it yields every descendant node, not just the direct children
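
To make the difference between the three concrete, here is a small sketch on a hypothetical paragraph. Text nodes count as nodes too, which is why the numbers may be higher than the tag count suggests.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> with a text node and a nested <a><span>.
html = "<p>Hi <a href='#'><span>there</span></a></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

contents = p.contents              # list of direct children: 'Hi ' and <a>
children = list(p.children)        # same nodes, produced lazily
descendants = list(p.descendants)  # 'Hi ', <a>, <span>, 'there'

print(len(contents), len(children), len(descendants))
```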

Parent and ancestors

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.a.parent)

Prints the parent of the a tag, which is the p tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.parents)))

Prints all ancestors of the a tag
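
Printing whole ancestor tags gets verbose quickly. A lighter way to see the chain, sketched here on a hypothetical document, is to collect just the tag names. The last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an explicit html/body wrapper.
html = "<html><body><p><a href='#'>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

parent_name = soup.a.parent.name                   # direct parent only
ancestor_names = [t.name for t in soup.a.parents]  # nearest ancestor first

print(parent_name)
print(ancestor_names)
```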

Siblings

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

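next_siblings yields all siblings that come after the node (next_sibling, singular, returns only the immediately following one)

previous_siblings yields all siblings that come before the node (previous_sibling returns only the immediately preceding one)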
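
The plural and singular forms are easy to mix up, so here is a short sketch on hypothetical markup. The tags are written without whitespace between them on purpose: in real pages, the whitespace between tags shows up as text-node siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; no whitespace between tags, so every sibling is a tag.
html = "<p><b>one</b><a>two</a><i>three</i><u>four</u></p>"
soup = BeautifulSoup(html, "html.parser")

next_one = soup.a.next_sibling.name                  # immediate next sibling
after = [t.name for t in soup.a.next_siblings]       # all following siblings
before = [t.name for t in soup.a.previous_siblings]  # all preceding siblings

print(next_one, after, before)
```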

Standard selectors

find_all(name,attrs,recursive,text,**kwargs)

Searches the document by tag name, attributes, or text content

name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

find_all returns a list-like result (a ResultSet), and each item in it is a bs4.element.Tag

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))
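
Because every item in the result is itself a Tag, find_all can be called again on each one. A minimal sketch with a hypothetical pair of lists (the ids and class names mirror the ones used elsewhere in this post, but the markup itself is assumed):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two lists with different numbers of items.
html = (
    '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul id="list-2"><li class="element">Baz</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

uls = soup.find_all("ul")
# Search inside each <ul> separately and keep only the item text.
items_per_list = [[li.get_text() for li in ul.find_all("li")] for ul in uls]

print(len(uls))
print(items_per_list)
```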

attrs

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

attrs takes a dictionary: each key is an attribute name and each value is the attribute value to match

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

class is a Python keyword, so use class_ instead
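
The dictionary form and the keyword form find the same elements; a quick sketch on hypothetical markup shows this. It also shows one case where the dictionary is needed: an HTML attribute called name cannot be passed as a keyword, because find_all already uses name for the tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids and classes mirror those used in this post.
html = (
    '<ul id="list-1" name="elements"><li class="element">Foo</li></ul>'
    '<ul id="list-2"><li class="element">Bar</li><li class="element">Baz</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all(attrs={"id": "list-1"})
by_keyword = soup.find_all(id="list-1")
by_class = soup.find_all(class_="element")

# name= means "tag name" in find_all, so the name attribute needs attrs={...}.
by_name_attr = soup.find_all(attrs={"name": "elements"})

print(len(by_dict), len(by_keyword), len(by_class), len(by_name_attr))
```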

text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find_all(text='Foo'))

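With text, the result is the matching strings rather than tags, so it suits checking text content, not locating elements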
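
If the element is what you actually want, one option is to step from each matched string to its parent. The sketch below assumes a recent Beautiful Soup release, where this argument is spelled string (text being the older name for it), and uses a hypothetical list.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with three list items.
html = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# `string=` is assumed to be available (newer spelling of the `text` argument).
exact = soup.find_all(string="Foo")               # exact match on the text
partial = soup.find_all(string=re.compile("Ba"))  # regex match on the text
tags = [s.parent.name for s in partial]           # step up to the enclosing tag

print(exact, partial, tags)
```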

find(name,attrs,recursive,text,**kwargs)

find returns a single element (the first match); find_all returns all matching elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))

If the tag does not exist, find returns None
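
Since a missing tag gives None, chaining straight onto the result of find can fail with AttributeError. A small sketch of the usual guard, on hypothetical markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that contains no <page> tag.
soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

missing = soup.find("page")  # no such tag, so this is None
# Check for None before touching attributes or text.
text = missing.get_text() if missing is not None else ""

found = soup.find("li")
found_text = found.get_text() if found is not None else ""

print(repr(text), repr(found_text))
```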

find_parents() find_parent()

find_parents() returns all matching ancestors; find_parent() returns the direct parent

find_next_siblings() find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling

find_previous_siblings() find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling

find_all_next() find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first one

find_all_previous() find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first one
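
The post gives no code for these methods, so here is a hedged sketch on a hypothetical document with three paragraphs. Each call starts from the middle paragraph (or the first one) and takes the same kind of filter as find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three sibling paragraphs inside one div.
html = "<div><p id='a'>A</p><p id='b'>B</p><p id='c'>C</p></div>"
soup = BeautifulSoup(html, "html.parser")
a = soup.find(id="a")
b = soup.find(id="b")

parent = b.find_parent("div").name
next_sib = b.find_next_sibling("p")["id"]
prev_sib = b.find_previous_sibling("p")["id"]
all_after_a = [t["id"] for t in a.find_next_siblings("p")]
next_match = b.find_next("p")["id"]                       # next <p> in document order
prev_matches = [t["id"] for t in b.find_all_previous("p")]

print(parent, next_sib, prev_sib, all_after_a, next_match, prev_matches)
```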

CSS selectors

Pass a CSS selector directly to select() to make the selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
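print(type(soup.select('ul')[0]))

Classes start with a dot (.) and ids with a hash (#). In '#list-2 .element' the space before the dot makes it a descendant selector.

Each item returned by select() is still a bs4.element.Tag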

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.select('ul'):
  print(ul.select('li'))

Selection can be nested: call select() again on each result
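
For a runnable version of the selector examples, the sketch below uses hypothetical markup shaped to match the class and id names in this post. It also uses select_one(), which returns the first match or None, as a counterpart to find().

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class and id names follow the selectors used above.
html = (
    '<div class="panel"><div class="panel-heading"><h4>Hello</h4></div>'
    '<ul id="list-1"><li class="element">Foo</li></ul>'
    '<ul id="list-2"><li class="element">Bar</li><li class="element">Baz</li></ul>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

heading = soup.select_one(".panel .panel-heading").get_text()
in_list_2 = [li.get_text() for li in soup.select("#list-2 .element")]
# Nested selection: run select() inside each <ul>.
nested = [[li.get_text() for li in ul.select("li")] for ul in soup.select("ul")]

print(heading, in_list_2, nested)
```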

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for ul in soup.select('ul'):
  print(ul['id'])
  print(ul.attrs['id'])

Getting text content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html,"lxml")
for li in soup.select('li'):
  print(li.get_text())
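
Putting the last two snippets together, here is a small sketch on hypothetical markup that reads an attribute from each ul and the text from each li. Passing strip=True to get_text() trims surrounding whitespace, which is often what you want for scraped text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the first item has padding spaces on purpose.
html = '<ul id="list-1"><li> Foo </li></ul><ul id="list-2"><li>Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

ids = [ul["id"] for ul in soup.select("ul")]
raw_first = soup.select("li")[0].get_text()                  # whitespace kept
texts = [li.get_text(strip=True) for li in soup.select("li")]  # whitespace trimmed

print(ids, repr(raw_first), texts)
```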

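Summary

**Prefer the lxml parser; fall back to html.parser when necessary

**Tag selectors offer weak filtering but are fast

**Use find() and find_all() to match a single result or multiple results

**If you are comfortable with CSS selectors, use select()

**Remember the common ways to get attribute values and text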
