python_parser_BeautifulSoup

zk仔's blog

Category: python_web_scraping

Article link: https://blog.csdn.net/weixin_39532362/article/details/87934746

Parser_BeautifulSoup

Importing the module
Constructing
Common methods
Standard selectors
CSS selectors
example
xpath

Importing the module

Dependency: lxml (required only for the 'lxml' and 'xml' parsers)

from bs4 import BeautifulSoup

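Constructing

BeautifulSoup(markup,'html.parser'): the parser built into Python's standard library; moderate speed, tolerant of malformed markup
BeautifulSoup(markup,'lxml'): fast, tolerant of malformed markup; requires lxml
BeautifulSoup(markup,'xml'): fast, supports XML only; requires lxml
BeautifulSoup(markup,'html5lib'): the most tolerant; parses the document the way a browser does and produces an HTML5 document; slow; requires html5lib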
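
Since the third-party parsers may not be installed, one possible pattern is to try them in order and fall back to the built-in one. The sketch below assumes a hypothetical helper named make_soup (not part of bs4); FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return a soup built with the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# The <b> tag is never closed; both parsers repair it.
soup = make_soup("<p>Hello <b>world</p>")
bold_text = soup.b.string
paragraph_text = soup.p.get_text()
```

Different parsers can repair the same broken markup differently, so pinning one parser per project keeps results reproducible.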

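Common methods

The soup object can itself be treated as a node.

soup.tag: returns the first matching node
soup.tag.tag: nested selection of nodes
node.name: returns the node's tag name
node.string: returns the string inside the node; returns None when the node has more than one child
node.get_text(): returns all text inside the node and its descendants as a single string, with the tags stripped
node.contents: returns a list of all direct children
node.children: returns an iterator over all direct children
node.descendants: returns a generator over all descendants
enumerate(node.children): returns an enumerate object yielding (index, node) pairs for all direct children
enumerate(node.descendants): returns an enumerate object yielding (index, node) pairs for all descendants
node.next_sibling: returns the next sibling node (this may be a whitespace text node)
node.next_siblings: returns a generator over all following siblings
node.previous_sibling: returns the previous sibling node (this may be a whitespace text node)
node.previous_siblings: returns a generator over all preceding siblings
node.attrs['key']: returns the value of attribute key, as a list if key is class; raises KeyError if the attribute is missing
node['key']: returns the value of attribute key, as a list if key is class; raises KeyError if the attribute is missing
node.get('key'): returns the value of attribute key, as a list if key is class; returns None if the attribute is missing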
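
A minimal sketch of the navigation attributes above, using a small made-up snippet. The markup is written without whitespace between tags so that the sibling attributes return tags rather than whitespace text nodes.

```python
from bs4 import BeautifulSoup

markup = '<div id="box"><p class="a b">one</p><p>two <b>bold</b></p></div>'
soup = BeautifulSoup(markup, "html.parser")

div = soup.div                      # first <div> in the document
first_p = div.p                     # nested selection: first <p> inside the div
second_p = first_p.next_sibling     # the following <p>

tag_name = div.name                                   # tag name as a string
child_names = [child.name for child in div.children]  # direct children only
all_nodes = list(div.descendants)                     # tags and text nodes, depth-first

single_string = first_p.string      # exactly one child, so the string is returned
no_string = second_p.string         # two children (text + <b>), so this is None
full_text = second_p.get_text()     # text of all descendants, tags stripped

classes = first_p["class"]          # class is multi-valued, so a list comes back
missing = first_p.get("id")         # .get() returns None instead of raising
box_id = div.attrs["id"]
```

With real pages, which usually contain newlines between tags, next_sibling often returns a text node such as '\n'; find_next_sibling() skips those.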

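Standard selectors

soup.find(name,attrs,recursive,text,**kwargs): returns the first matching node, or None if nothing matches (since bs4 4.4.0 the text argument is also available under the name string)
soup.find_all(name,attrs,recursive,text,**kwargs): returns a list of all matching nodes
soup.find_parent(): returns the direct parent node
soup.find_parents(): returns a list of all ancestor nodes
soup.find_next_sibling(): returns the next sibling node
soup.find_next_siblings(): returns a list of all following sibling nodes
soup.find_previous_sibling(): returns the previous sibling node
soup.find_previous_siblings(): returns a list of all preceding sibling nodes
soup.find_next(): returns the next node in document order
soup.find_all_next(): returns a list of all nodes that come later in the document
soup.find_previous(): returns the previous node in document order
soup.find_all_previous(): returns a list of all nodes that come earlier in the document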
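
The selectors above can be sketched on a small made-up list. The markup and variable names are illustrative assumptions, not taken from a real page.

```python
from bs4 import BeautifulSoup

markup = (
    '<ul id="menu">'
    '<li class="item">a</li>'
    '<li class="item active">b</li>'
    '<li class="item">c</li>'
    '</ul>'
)
soup = BeautifulSoup(markup, "html.parser")

# class_ matches any one of a tag's classes, so all three <li> tags match.
items = soup.find_all("li", class_="item")

# find() returns only the first match, or None when nothing matches.
active = soup.find("li", class_="active")
missing = soup.find("table")

# Move from a found node to its relatives.
parent = active.find_parent("ul")
following = active.find_next_siblings("li")
previous = active.find_previous_sibling("li")
```

Because find() can return None, it is safer to check the result before chaining further attribute access on it.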

CSS selectors

soup.select(css): returns a list of all nodes matching the CSS selector
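
A short sketch of select() on made-up markup. It also uses select_one(), a companion bs4 method that returns only the first match or None; this assumes a bs4 version in which CSS selection is provided by the soupsieve package (4.7 or later).

```python
from bs4 import BeautifulSoup

markup = (
    '<p class="story">'
    '<a id="link1" class="sister" href="/elsie">Elsie</a>'
    '<a id="link2" class="sister" href="/lacie">Lacie</a>'
    '</p>'
    '<p class="other"><a id="link3" href="/tillie">Tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

# Child combinator plus class selectors: only the links inside p.story.
story_links = soup.select("p.story > a.sister")

# ID selector; select_one() returns a single node or None.
third = soup.select_one("#link3")
nothing = soup.select_one("table")

# Attribute selector: every <a> whose href ends in "ie".
ie_hrefs = [a["href"] for a in soup.select('a[href$="ie"]')]
```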

example

from bs4 import BeautifulSoup

html = '''
<html>
  <head><title>The Dormouse's story</title></head>
  <body>
    <div class="title"><b>The Dormouse's story</b></div>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister sister1" id="link1">Elsie</a>
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
  </body>
</html>
'''

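soup = BeautifulSoup(html, 'lxml')

# Direct lookup: the first <a> tag
soup.a

# By tag name
soup.find(name='a')

# By attribute
soup.find(id='link1')
soup.find(class_='sister')  # class is a Python keyword, hence class_

# By custom attributes passed as a dict
soup.find(attrs={'id': 'link1'})

# By regular expression or string: the first text node that matches
import re
regexp = re.compile('.*')
soup.find(text=regexp)

# By function: the first tag for which the function returns True
soup.find(lambda tag: tag.has_attr('id') and tag.get('id') == 'link1')

# By CSS selector: the class list of #link3 inside .story
soup.select('.story #link3')[0]['class']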
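
The lookups above return nodes; in practice the next step is usually to pull values out of them. The sketch below reuses a trimmed copy of the sample document and collects one dict per link. The variable names are illustrative, and 'html.parser' is used here only so the snippet runs without lxml.

```python
import re

from bs4 import BeautifulSoup

html = '''
<p class="story">
  <a href="http://example.com/elsie" class="sister sister1" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# One dict per link: attributes via node['key'], text via get_text().
links = [
    {'id': a['id'], 'href': a['href'], 'text': a.get_text(strip=True)}
    for a in soup.find_all('a', class_='sister')
]

# A compiled regex as an attribute filter is applied with search(),
# so this keeps only hrefs whose path starts with "l".
l_ids = [a['id'] for a in soup.find_all('a', href=re.compile(r'/l'))]
```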

xpath

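from lxml import etree
html = etree.HTML(text)  # text: a string containing the HTML source
result = html.xpath('//*')  # list of all element nodes
# xpath() returns a list, which has no decode(); serialize the tree instead
etree.tostring(html).decode('utf-8')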
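
A slightly fuller sketch of XPath queries with lxml, on made-up markup. Depending on the expression, xpath() returns a list of elements, attribute values, or text strings.

```python
from lxml import etree

text = (
    '<div>'
    '<a id="link1" href="/elsie">Elsie</a>'
    '<a id="link2" href="/lacie">Lacie</a>'
    '</div>'
)
root = etree.HTML(text)  # the HTML parser wraps the fragment in <html><body>

# Attribute values of every <a> element.
hrefs = root.xpath('//a/@href')

# Text of the element selected by an attribute predicate.
names = root.xpath('//a[@id="link2"]/text()')

# Elements themselves; serialize one back to a str.
anchors = root.xpath('//a')
serialized = etree.tostring(anchors[0], encoding='unicode')
```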
