lxml Basic Syntax

Latest recommended article published on 2024-08-30 20:41:27

顶峰相见_li

Tags: python, programming languages

Article link: https://blog.csdn.net/m0_53957045/article/details/126402994

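This article introduces the XML data format, compares it with JSON, and highlights the role XPath plays in parsing XML. Worked examples cover the basic XPath syntax: selecting nodes, node content, and attributes, and filtering with predicates. It also covers wildcards and combining several paths in one query.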

Summary generated by CSDN using AI.

"""
Author: 余婷
Time: 2022/8/18 09:22
Good Good Study, Day Day Up!
"""
from lxml import etree

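# 1. Basic XPath concepts
"""
1) Tree: the whole HTML (XML) document forms a tree structure
2) Node: each element (tag) in the tree is a node
3) Root node (root element): the outermost tag (element) of the HTML or XML document
4) Node content: the content of a tag
5) Node attributes: the attributes of a tag
"""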
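
The concepts above map directly onto lxml's element objects. The following is a minimal sketch; the tiny `shop` document and all variable names are made up for illustration and are not part of the original data.

```python
from lxml import etree

# A tiny hypothetical document, used only for illustration.
xml_text = '<shop name="demo"><item id="a1">apple</item><item id="a2">pear</item></shop>'
root = etree.XML(xml_text)                   # the root node of the tree

root_tag = root.tag                          # tag name of the root node
root_attrs = dict(root.attrib)               # node attributes as a plain dict
child_tags = [child.tag for child in root]   # iterating a node yields its child nodes
first_text = root[0].text                    # node content of the first child
first_id = root[0].get('id')                 # a single attribute value

print(root_tag, root_attrs, child_tags, first_text, first_id)
```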

# 2. The XML data format
# Like JSON, XML is a general-purpose data format (supported by the vast majority of programming languages)
"""
XML stores data in the content and the attributes of tags (elements).

Example: storing information about a supermarket
1) JSON data
{
    "name": "永辉超市",
    "address": "肖家河大厦",
    "staffs":[
        {"name": "小明", "id": "s001", "position": "收银员", "salary": 4000},
        {"name": "小花", "id": "s002", "position": "促销员", "salary": 3500},
        {"name": "张三", "id": "s003", "position": "保洁", "salary": 3000},
        {"name": "李四", "id": "s004", "position": "收银员", "salary": 4000},
        {"name": "王五", "id": "s005", "position": "售货员", "salary": 3800}
    ],
    "goodsList":[
        {"name": "泡面", "price": 3.5, "count": 120, "discount": 0.9},
        {"name": "火腿肠", "price": 1.5, "count": 332, "discount": 1},
        {"name": "矿泉水", "price": 2, "count": 549, "discount": 1},
        {"name": "面包", "price": 5.5, "count": 29, "discount": 0.85}
    ]
}

2) XML data
<supermarket name="永辉超市" address="肖家河大厦">
    <staffs>
        <staff  id="s001">
            <name>小明</name>
            <position>收银员</position>
            <salary>4000</salary>
        </staff>
        <staff  id="s002">
            <name>小花</name>
            <position>促销员</position>
            <salary>3500</salary>
        </staff>
        <staff  id="s003">
            <name>张三</name>
            <position>保洁</position>
            <salary>3000</salary>
        </staff>
        <staff  id="s004">
            <name>李四</name>
            <position>收银员</position>
            <salary>4000</salary>
        </staff>
        <staff  id="s005">
            <name>王五</name>
            <position>售货员</position>
            <salary>3800</salary>
        </staff>
    </staffs>
    
    <goodsList>
        <goods discount="0.9">
            <name>泡面</name>
            <price>3.5</price>
            <count>120</count>
        </goods>
        <goods>
            <name>火腿肠</name>
            <price>1.5</price>
            <count>332</count>
        </goods>
        <goods>
            <name>矿泉水</name>
            <price>2</price>
            <count>549</count>
        </goods>
        <goods discount="0.85">
            <name>面包</name>
            <price>5.5</price>
            <count>29</count>
        </goods>
    </goodsList>
</supermarket>
"""
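
To make the correspondence between the two formats concrete, here is a rough sketch that reads a shortened version of the XML and rebuilds a JSON-style dictionary from it. The shortened sample and the variable names are illustrative assumptions; only `etree.XML`, `get`, `findtext`, `iterfind`, and `json.dumps` are real APIs.

```python
import json

from lxml import etree

# Shortened, hypothetical version of the supermarket data.
xml_text = """
<supermarket name="永辉超市" address="肖家河大厦">
    <staffs>
        <staff id="s001"><name>小明</name><salary>4000</salary></staff>
        <staff id="s002"><name>小花</name><salary>3500</salary></staff>
    </staffs>
</supermarket>
"""
root = etree.XML(xml_text)

# Attributes become plain keys; child tag content becomes values.
supermarket = {
    'name': root.get('name'),
    'address': root.get('address'),
    'staffs': [
        {
            'id': staff.get('id'),
            'name': staff.findtext('name'),
            'salary': int(staff.findtext('salary')),  # XML text is always a string
        }
        for staff in root.iterfind('staffs/staff')
    ],
}

print(json.dumps(supermarket, ensure_ascii=False, indent=4))
```

Note that XML carries no type information, so numbers such as the salary have to be converted explicitly.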

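# 3. XPath syntax
# 1) Build the tree and get its root node
# etree.XML(xml data)
# etree.HTML(html data)
# Read the file as bytes so lxml can honor any encoding declaration inside the XML itself
with open('files/data.xml', 'rb') as f:
    root = etree.XML(f.read())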
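
The difference between the two constructors can be sketched with inline strings instead of a file. The markup below is invented for illustration: `etree.XML` requires well-formed input, while `etree.HTML` repairs sloppy markup and wraps it in `html` and `body` tags.

```python
from lxml import etree

# etree.XML expects well-formed markup and returns the root element.
xml_root = etree.XML('<data><item>1</item></data>')
xml_root_tag = xml_root.tag

# etree.HTML tolerates unclosed tags and adds the missing <html><body> wrapper.
html_root = etree.HTML('<p>hello<br>world')
html_root_tag = html_root.tag
paragraph_text = html_root.xpath('//p/text()')

# The same sloppy markup is rejected by the strict XML parser.
try:
    etree.XML('<p>hello<br>world')
    xml_rejected = False
except etree.XMLSyntaxError:
    xml_rejected = True

print(xml_root_tag, html_root_tag, paragraph_text, xml_rejected)
```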

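# 2) Select tags with an XPath
# node.xpath(path)    -   returns every tag matching the path as a list whose elements are tag objects (node objects)
"""
Ways to write a path:
1. Absolute path:   starts with "/"   -   /path of the tag in the tree    (must be written from the root node)
2. Relative path:   starts with "." for the current node (the node xpath is called on); ".." is the parent of the current node
3. Anywhere path:   starts with "//"  -   searches the whole tree for the tag

Note: for absolute paths and "//" paths, neither the syntax nor the result depends on which node xpath is called on
"""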
result = root.xpath('/supermarket/staffs/staff/name/text()')
print(result)

result = root.xpath('./staffs/staff/name/text()')
print(result)

staff1 = root.xpath('./staffs/staff')[0]            # the staff tag of the first employee
result = staff1.xpath('./name/text()')
print(result)       # ['小明']

result = staff1.xpath('../staff/name/text()')
print(result)       # ['小明', '小花', '张三', '李四', '王五']

result = root.xpath('//name/text()')
print(result)

result = staff1.xpath('//goods/name/text()')
print(result)
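
The three path styles can be compared side by side on a self-contained document. This is a small sketch with made-up data; the `.//` form (descendants of the current node only) is not used in the original and is included here as an addition, because it is the usual way to search below one node rather than the whole tree.

```python
from lxml import etree

# Hypothetical miniature of the supermarket document.
xml_text = (
    '<supermarket>'
    '<staffs><staff><name>A</name></staff><staff><name>B</name></staff></staffs>'
    '<goodsList><goods><name>G</name></goods></goodsList>'
    '</supermarket>'
)
root = etree.XML(xml_text)
staff1 = root.xpath('./staffs/staff')[0]

# Absolute and "//" paths ignore the node they are called on.
absolute = staff1.xpath('/supermarket/staffs/staff/name/text()')   # ['A', 'B']
anywhere = staff1.xpath('//name/text()')                           # ['A', 'B', 'G']

# Relative paths start from the calling node.
relative = staff1.xpath('./name/text()')                           # ['A']
via_parent = staff1.xpath('../staff/name/text()')                  # ['A', 'B']
below_current = staff1.xpath('.//name/text()')                     # ['A']

print(absolute, anywhere, relative, via_parent, below_current)
```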

# 3) Get tag content
# node.xpath(path to the tags/text())        -       returns the text content of every tag matching the path
result = root.xpath('//position/text()')
print(result)

# 4) Get attribute values
# node.xpath(path to the tags/@attribute_name)
result = root.xpath('/supermarket/@name')
print(result)       # ['永辉超市']

result = root.xpath('//staff/@id')
print(result)
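
When content and attributes need to be paired up, separate `text()` and `@attr` queries can drift out of step if some tag lacks a child. One possible approach, sketched below with invented data, is to select the tags first and then query each one with a relative path.

```python
from lxml import etree

# Hypothetical data: the second staff tag has no name child.
xml_text = """
<staffs>
    <staff id="s001"><name>小明</name></staff>
    <staff id="s002"></staff>
    <staff id="s003"><name>张三</name></staff>
</staffs>
"""
root = etree.XML(xml_text)

ids = root.xpath('//staff/@id')             # three attribute values
names = root.xpath('//staff/name/text()')   # only two text values, so zip() would misalign

# Query each staff tag on its own to keep id and name together.
staff_names = {}
for staff in root.xpath('//staff'):
    found = staff.xpath('./name/text()')
    staff_names[staff.get('id')] = found[0] if found else None

print(ids, names, staff_names)
```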

# 5) Predicates (conditions)
# a. Position predicates
"""
[N]     -      the Nth one (counting starts at 1)
[last()]    -   the last one
[last()-N]:   [last()-1] - second to last, [last()-2] - third to last
[position()>N], [position()<N], [position()>=N], [position()<=N]
"""
result = root.xpath('//staff[1]/name/text()')
print(result)       # ['小明']

result = root.xpath('//staff[last()]/name/text()')
print(result)       # ['王五']

result = root.xpath('//staff[last()-1]/name/text()')
print(result)       # ['李四']

result = root.xpath('//staff[position()<3]/name/text()')
print(result)   # ['小明', '小花']
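
One detail worth knowing is that a position predicate counts among the children of each parent, not across the whole result. The sketch below uses an invented document with two groups to show the difference, together with the parenthesized form `(//item)[1]` that picks the first match overall.

```python
from lxml import etree

# Hypothetical document with two parents, to show how positions are counted.
xml_text = (
    '<root>'
    '<group><item>a</item><item>b</item><item>c</item></group>'
    '<group><item>d</item><item>e</item></group>'
    '</root>'
)
root = etree.XML(xml_text)

first_per_parent = root.xpath('//item[1]/text()')              # first item of each group
first_overall = root.xpath('(//item)[1]/text()')               # first item in the whole tree
last_per_parent = root.xpath('//item[last()]/text()')          # last item of each group
second_to_last = root.xpath('//item[last()-1]/text()')         # second to last of each group
first_two = root.xpath('//item[position()<3]/text()')          # first two of each group
from_second = root.xpath('//group[1]/item[position()>=2]/text()')

print(first_per_parent, first_overall, last_per_parent)
print(second_to_last, first_two, from_second)
```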

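# b. Attribute predicates
"""
[@attr=value]      -      tags whose attribute equals the given value
[@attr]      -   tags that have the given attribute
"""
# staff[@class="c1"] corresponds to the CSS selector staff.c1 (exact only when class is exactly "c1")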
result = root.xpath('//staff[@class="c1"]/name/text()')
print(result)

result = root.xpath('//staff[@id="s003"]/name/text()')
print(result)

result = root.xpath('//goods[@discount]/name/text()')
print(result)
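
The attribute predicates can be tried on a self-contained sample. The data and the `class` values below are invented for illustration. The sketch also shows that `[@class="c1"]` is an exact string match, so a tag with several classes needs the well-known `contains(concat(...))` idiom, and that `not(@attr)` selects tags lacking an attribute.

```python
from lxml import etree

# Hypothetical goods list; class attributes are added only for this demo.
xml_text = """
<goodsList>
    <goods discount="0.9" class="c1"><name>泡面</name></goods>
    <goods><name>火腿肠</name></goods>
    <goods discount="0.85" class="c1 hot"><name>面包</name></goods>
</goodsList>
"""
root = etree.XML(xml_text)

has_discount = root.xpath('//goods[@discount]/name/text()')
exact_discount = root.xpath('//goods[@discount="0.9"]/name/text()')
no_discount = root.xpath('//goods[not(@discount)]/name/text()')

# Exact match: "c1 hot" is not equal to "c1".
class_exact = root.xpath('//goods[@class="c1"]/name/text()')

# Token match, closer to what the CSS selector .c1 does.
class_token = root.xpath(
    '//goods[contains(concat(" ", normalize-space(@class), " "), " c1 ")]/name/text()'
)

print(has_discount, exact_discount, no_discount, class_exact, class_token)
```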

# c. Child-content predicates       -    filter tags by the content of a child tag
"""
[child_tag>value]
[child_tag<value]
[child_tag>=value]
[child_tag<=value]
[child_tag=value]
"""
result = root.xpath('//goods[price=2]/name/text()')
print(result)
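
The comparison operators listed above behave the same way as the `=` case. A short sketch on invented data follows; it also combines two conditions with `and` and compares a child against a string, both of which are standard XPath 1.0 features not shown in the original.

```python
from lxml import etree

# Hypothetical goods list with the prices from the example data.
xml_text = """
<goodsList>
    <goods><name>泡面</name><price>3.5</price></goods>
    <goods><name>火腿肠</name><price>1.5</price></goods>
    <goods><name>矿泉水</name><price>2</price></goods>
    <goods><name>面包</name><price>5.5</price></goods>
</goodsList>
"""
root = etree.XML(xml_text)

equal_two = root.xpath('//goods[price=2]/name/text()')
above_two = root.xpath('//goods[price>2]/name/text()')
at_most_two = root.xpath('//goods[price<=2]/name/text()')
in_range = root.xpath('//goods[price>=2 and price<5]/name/text()')

# Comparing against a quoted value matches on the child's text instead.
bread_price = root.xpath('//goods[name="面包"]/price/text()')

print(equal_two, above_two, at_most_two, in_range, bread_price)
```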

# 6) Wildcards  - use * in a path to stand for any tag or any attribute
result = root.xpath('//staff[1]/*/text()')
print(result)

# *[@class="c1"] corresponds to the CSS selector .c1 (exact only when class is exactly "c1")
result = root.xpath('//*[@class="c1"]/name/text()')
print(result)

result = root.xpath('//goods[@*]/name/text()')
print(result)


result = root.xpath('/supermarket/@*')
print(result)
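
The wildcard forms can be checked on a compact sample. The document below is a made-up miniature of the supermarket data. The order of the attribute values returned by `@*` is treated as unspecified here, so the sketch sorts them before printing.

```python
from lxml import etree

# Hypothetical miniature of the supermarket document.
xml_text = """
<supermarket name="永辉超市" address="肖家河大厦">
    <staff id="s001"><name>小明</name><salary>4000</salary></staff>
    <goods><name>泡面</name></goods>
    <goods discount="0.9"><name>面包</name></goods>
</supermarket>
"""
root = etree.XML(xml_text)

# * as a tag: every child tag, whatever its name.
child_tags = [node.tag for node in root.xpath('/supermarket/*')]
staff_fields = root.xpath('//staff[1]/*/text()')

# @* as an attribute: any attribute at all.
goods_with_any_attr = root.xpath('//goods[@*]/name/text()')
root_attr_values = sorted(root.xpath('/supermarket/@*'))

print(child_tags, staff_fields, goods_with_any_attr, root_attr_values)
```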


# 7) Multiple paths - |
# path1|path2       -      returns the results of both path1 and path2
result = root.xpath('//goods/name/text()|//staff/position/text()')
print(result)
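
A point that is easy to miss with `|` is that the result is a single merged list, not two lists. In XPath 1.0 the union is a node-set, so the order in which the paths are written should not be relied on. The sketch below uses invented data and makes no assumption about ordering; if the two kinds of values need to stay apart, running two separate queries is the simpler option.

```python
from lxml import etree

# Hypothetical miniature of the supermarket document.
xml_text = """
<supermarket>
    <staffs>
        <staff><name>小明</name><position>收银员</position></staff>
    </staffs>
    <goodsList>
        <goods><name>泡面</name></goods>
        <goods><name>面包</name></goods>
    </goodsList>
</supermarket>
"""
root = etree.XML(xml_text)

# One merged list holding the results of both paths.
combined = root.xpath('//goods/name/text()|//staff/position/text()')

# Two separate queries keep the two kinds of values apart.
goods_names = root.xpath('//goods/name/text()')
positions = root.xpath('//staff/position/text()')

print(combined)
print(goods_names, positions)
```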