BeautifulSoup4 and JsonPath

带着梦想飞翔

Article link: https://blog.csdn.net/u013008795/article/details/99885209

BeautifulSoup4

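BeautifulSoup extracts data from HTML and XML documents. BS4 is still under active development.
Official documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Installation
1. pip install beautifulsoup4
Import: from bs4 import BeautifulSoup
Initialization:
1. BeautifulSoup(markup="",features=None)
  - markup: the object to parse, either a file object or an HTML string
  - features: the parser to use
  - return: a document object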

from bs4 import BeautifulSoup

# From a file object
with open("test.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f)
# From a markup string
soup = BeautifulSoup("<html>data</html>")

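The parser argument can be omitted, in which case BeautifulSoup falls back on whichever parser library is installed on the system.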

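Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,"html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents before Python 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup,"lxml")	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup,["lxml","xml"]) or BeautifulSoup(markup,"xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup,"html5lib")	Best error tolerance; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package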

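BeautifulSoup(markup,"html.parser") uses the Python standard library. Its speed is moderate, and its tolerance of malformed markup is poor on older Python versions.
BeautifulSoup(markup,"lxml") is fast and tolerant of malformed markup. It requires a C library to be installed.
lxml is the recommended parser because of its speed.
Specify the parser explicitly so that the same parser is used in every environment the code runs in.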
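
To make the point about naming the parser concrete, here is a minimal sketch. It uses "html.parser" only because that parser needs no extra installation; the markup string and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately malformed markup: the <p> tag is never closed.
broken = "<p>hello"

# Naming the parser explicitly keeps the result the same on every machine.
soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)
print(repaired)  # <p>hello</p>
```

Other parsers may repair the same input differently (lxml, for example, wraps the fragment in html and body tags), which is why a fixed parser choice matters.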

Create test.html with the content below and parse it with bs4.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>首页</title>
</head>
<body>
<h1>xdd欢迎您</h1>
<div id="main">
    <h3 class="title highlight"><a href="http://www.python.org">python</a>高级班</h3>
    <div class="content">
        <p id="first">字典</p>
        <p id="second">列表</p>
        <input type="hidden" name="_csrf" value="absdoia23lkso234r23oslfn">
        <!-- comment -->
        <img id="bg1" src="http://www.xdd.com/">
        <img id="bg2" src="http://httpbin.org/">
    </div>
</div>
<p>bottom</p>
</body>
</html>

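Four kinds of objects

BeautifulSoup parses an HTML document into a tree in which every node is a Python object. There are four kinds:
- BeautifulSoup, Tag, NavigableString, Comment
1. BeautifulSoup object: represents the whole document.
2. Tag object: corresponds to a tag in the HTML. It has two commonly used attributes:
  1. name: the name of the Tag object, i.e. the tag name
  2. attrs: a dictionary of the tag's attributes
    - Multi-valued attributes: class can hold several values. For <h3 class="title highlight">python高级班</h3> the attribute is parsed as a list ({"class":["title","highlight"]})
    - Attributes can be modified and deleted

BeautifulSoup.prettify() # returns the parsed document with indentation. Note: printing the BeautifulSoup object directly outputs the parsed document without formatting.
BeautifulSoup.div # the first div in the document, including its content; the returned object is of type bs4.element.Tag
BeautifulSoup.h3.get("class") # the value of the class attribute of the first h3 tag in the document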

from bs4 import BeautifulSoup

with open("test.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.builder)
    # print(0,soup) # the whole parsed document, unformatted
    # print(1,soup.prettify()) # the document with indentation
    print("- "*30)
    # print(2,soup.div,type(soup.div)) # type bs4.element.Tag, a Tag object
    # print(3,soup.div["class"]) # raises KeyError: the first div has no class attribute
    print(3,soup.div.get("class")) # class attribute of the div; None if it is missing

    print(4,soup.div.h3["class"]) # multi-valued attribute
    print(4,soup.h3.get("class")) # multi-valued attribute: class of the first h3 in the document
    print(4,soup.h3.attrs.get("class")) # multi-valued attribute

    print(5,soup.img.get("src")) # src attribute of the first img
    soup.img["src"] = "http://www.xddupdate.com" # modify the value
    print(5,soup.img["src"])

    print(6,soup.a) # first a tag; None would be returned if the document had none
    del soup.h3["class"] # delete the attribute
    print(4,soup.h3.get("class")) # None now that class has been deleted
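
For readers without test.html or lxml at hand, the same attribute operations can be sketched on an inline string. This is an illustrative variant, not the original example: the markup is a cut-down assumption and "html.parser" is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><h3 class="title highlight">python</h3></div>'
soup = BeautifulSoup(html, "html.parser")

h3 = soup.h3
tag_name = h3.name                    # the tag name
classes = list(h3["class"])           # multi-valued attribute comes back as a list
missing = soup.div.get("class")       # .get() returns None instead of raising KeyError

h3["class"] = ["title"]               # attributes can be modified ...
del h3["class"]                       # ... and deleted
remaining = dict(h3.attrs)

print(tag_name)   # h3
print(classes)    # ['title', 'highlight']
print(missing)    # None
print(remaining)  # {}
```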

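Note: HTML is not normally manipulated through attribute access like this; the code above only serves to get familiar with the object types.
NavigableString

To get only the text inside a tag, without the markup, use NavigableString.

print(soup.div.p.string) # string of the first p inside the first div
print(soup.p.string) # same as above

Comment object: an HTML comment. After parsing, BeautifulSoup represents it as a Comment object.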
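
The difference between the two string types is easiest to see by checking them directly. The following is a small sketch on an assumed inline snippet (not test.html), using "html.parser" so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = '<div><p id="first">dict</p><!-- comment --></div>'
soup = BeautifulSoup(html, "html.parser")

text = soup.p.string             # text of the first <p>
comment = soup.div.contents[1]   # second child of the div is the HTML comment

print(type(text).__name__, repr(str(text)))        # NavigableString 'dict'
print(type(comment).__name__, repr(str(comment)))  # Comment ' comment '

# Comment is a subclass of NavigableString, so check for Comment first
# when the two need to be told apart.
print(isinstance(comment, NavigableString))        # True
```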

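Traversing the document tree

Finding the content you care about in the document tree is the day-to-day work, which comes down to traversing the nodes of the tree. The test.html above is used for testing.

Using Tag
- soup.div searches from the root and returns the first div node as a Tag object
- soup.div.p finds the first div from the root, which is a Tag object, then finds the first p under that Tag and returns it as a Tag object
- soup.p gives the p containing "字典" rather than the one containing "bottom", which shows that the search is depth-first; the result is again a Tag object
Traversing direct children
- Tag.contents # list of all direct children, of any node type
- Tag.children # iterator over the direct children
  - Tag.children # yields the same nodes as Tag.contents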
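
The depth-first lookup and the contents/children relationship can be checked with a short sketch. The markup below is an assumed, stripped-down stand-in for test.html, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = "<body><div><p>inner</p><span>x</span></div><p>bottom</p></body>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                    # first <p> in document order: the nested one
children_list = soup.div.contents   # a plain list of direct children
same_nodes = list(soup.div.children) == children_list  # iterator yields the same nodes

print(first_p.string)                          # inner
print([child.name for child in children_list]) # ['p', 'span']
print(same_nodes)                              # True
```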

Traversing all descendants

Tag.descendants # iterates over all descendants of a node, of any node type; the iteration order is depth-first

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("test.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.p.string)
    print(soup.div.contents) # list of direct children
    print("- "*30)

    for i in soup.div.children: # iterator over direct children
        print(i.name)
    print("- "*30)
    print(list(map(
        lambda x:x.name if x.name else x,
        soup.div.descendants # all descendants
    )))

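Traversing strings

For test.html, soup.div.string returns None, because string requires soup.div to have exactly one child of type NavigableString, as in <div>only string</div>.
Tag.string # the string under the Tag; returns None if the Tag has more than one child node
Tag.strings # iterator over all strings under the Tag, including surrounding whitespace
Tag.stripped_strings # iterator over all strings with surrounding whitespace removed; whitespace-only strings are skipped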

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("test.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.div.string) # None, because the div has more than one child
    print("- "*30)
    print("".join(soup.div.strings).strip()) # strings iterator keeps the whitespace between tags
    print("- "*30)
    print("".join(soup.div.stripped_strings)) # whitespace removed from every string
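
To see the three accessors side by side without the file, here is a minimal sketch on an assumed inline snippet; the markup and names are illustrative, and "html.parser" is used so that no extra library is needed.

```python
from bs4 import BeautifulSoup

html = "<div>\n  <p>dict</p>\n  <p>list</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

single = soup.p.string                    # <p> has exactly one string child
multiple = soup.div.string                # <div> has several children, so None
raw = list(soup.div.strings)              # includes the whitespace between tags
clean = list(soup.div.stripped_strings)   # whitespace-only strings are skipped

print(single)    # dict
print(multiple)  # None
print(len(raw) > len(clean))  # True
print(clean)     # ['dict', 'list']
```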
