Basic Usage of the Python lxml Library

shier_smile

Article link: https://blog.csdn.net/m0_57459724/article/details/121288639

title: Basic usage of the lxml library
date: 2021-11-01 18:46:03
tags: python

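This post is a record of my own learning. (I am just a computer science student and a beginner.)
I wrote it after reading other people's articles and adding my own notes and understanding. It has been so long that I can no longer find the link to the original article; if you find it, please let me know and I will add it. Thank you.

Basic usage of the Python lxml library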

1. Installing lxml

Press Win+R and enter cmd to open a console.
To install:

With conda:

conda install lxml

With pip:

pip install lxml
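
A quick way to check that the installation worked is to import the package and print its version. This is only a minimal sketch: `etree.LXML_VERSION` is a version tuple exposed by lxml, and the numbers printed depend on the release that was installed.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers (major, minor, patch, ...);
# the actual values depend on the installed release.
version = etree.LXML_VERSION
print("lxml version:", ".".join(str(part) for part in version))
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.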

2. Basic usage of lxml

Importing lxml

import lxml
from lxml import etree

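1. The Element class

The Element class is the core of XML processing in lxml. An Element object can be thought of as a single XML node, and the basic operations all revolve around it. They fall into three groups:

Node operations
Node attribute operations
Node text operations

1. Node operations

Creating a node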

root = etree.Element('root')
print(root)
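# tostring() serializes the node so its content can be inspected
print(etree.tostring(root))

Output:

<Element root at 0x2aafbc6fe88>
b'<root/>'

Viewing the node's tag

# The tag attribute holds the node's tag name
print(root.tag)

Output:

root

Printing the node's content

print(etree.tostring(root))

Output:

b'<root/>'

Adding child nodes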

root_child1 = etree.SubElement(root,'sub1')
root_child2 = etree.SubElement(root,'sub2')
root_child3 = etree.SubElement(root,'sub3')
print(etree.tostring(root))

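Output:

b'<root><sub1/><sub2/><sub3/></root>'

Removing child nodes

Note: remove() detaches only the child element object passed to it; the parent node itself stays in place, and siblings that share the same tag are not affected. clear() resets the node: it removes all children at once, along with the node's attributes and text.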

root.remove(root_child1)
print(etree.tostring(root))
root.clear()
print(etree.tostring(root))

Output:

b'<root><sub2/><sub3/></root>'
b'<root/>'
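
To make the difference between remove() and clear() concrete, here is a small sketch with two children that share a tag. The element names and the `color` attribute are made up for illustration.

```python
from lxml import etree

root = etree.Element('root', color='white')
root.text = 'hello'
first = etree.SubElement(root, 'item')
second = etree.SubElement(root, 'item')  # same tag as `first`

# remove() takes an element object, so only `first` is detached.
root.remove(first)
after_remove = etree.tostring(root)
print(after_remove)

# clear() drops the remaining child together with the text and attributes.
root.clear()
after_clear = etree.tostring(root)
print(after_clear)
```

After remove() one `item` child is still present; after clear() only the bare `<root/>` tag is left.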

Shortcut operations on nodes:
Child nodes can be handled with list-style operations (children are indexed in document order).

root = etree.Element('root')
root_child1 = etree.SubElement(root,'sub1')
root_child2 = etree.SubElement(root,'sub2')
root_child3 = etree.SubElement(root,'sub3')
print(etree.tostring(root))
for child in root:
    print(child.tag)
print(root[0].tag)
print(root.index(root_child2))
# lxml elements have no pop(); take the first child and remove it instead
first = root[0]
root.remove(first)
print(first.tag)
print(len(root))
root.insert(0,etree.Element('insert_Element'))
# a slice returns a list of Element objects, so print their tags
print([child.tag for child in root[:]])
root.append(etree.Element('append_Element'))
print(etree.tostring(root))

Output:

b'<root><sub1/><sub2/><sub3/></root>'
sub1
sub2
sub3
sub1
1
sub1
2
['insert_Element', 'sub2', 'sub3']
b'<root><insert_Element/><sub2/><sub3/><append_Element/></root>'

Getting a node's parent

print(root[1].getparent().tag)

Output:

root

2. Node attribute operations

Creating attributes

# Attributes can be passed as keyword arguments when the node is created
root = etree.Element('root',classic= 'person')
print(etree.tostring(root))
print(root)

Output:

b'<root classic="person"/>'
<Element root at 0x2aafbccb708>

Use the set() method to set an attribute on an Element object: the first argument is the attribute name and the second is its value.

root.set('color','white')
print(etree.tostring(root))

Output:

b'<root classic="person" color="white"/>'

Getting attributes
On an Element instance, attributes are stored as key-value pairs and can be handled in a dictionary-like way.

print(root.get('classic',None))
print(root.values())
print(root.keys())
print(root.items())

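Output:

person
['person', 'white']
['classic', 'color']
[('classic', 'person'), ('color', 'white')]

Using the attrib property
The attrib property gives access to all attributes at once as a dictionary-like object. Note: changing its contents also changes the node's attributes.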

attribute = root.attrib
print(attribute)

attribute["test"] = "one"
print(etree.tostring(root))

Output:

{'classic': 'person', 'color': 'white'}
b'<root classic="person" color="white" test="one"/>'
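
Because attrib behaves like a dictionary, the usual dictionary idioms carry over. The following is a minimal sketch; the attribute names `size` and `extra` are placeholders chosen for the example.

```python
from lxml import etree

root = etree.Element('root', classic='person', color='white')
attrs = root.attrib                   # live, dict-like view of the attributes

has_color = 'color' in attrs          # membership test, as with a dict
missing = attrs.get('size', 'n/a')    # default when the attribute is absent
del attrs['color']                    # deleting a key removes the attribute

snapshot = dict(attrs)                # independent copy of the attributes
snapshot['extra'] = 'x'               # does not touch the element

print(has_color, missing)
print(etree.tostring(root))
```

Copying with dict() is useful when the attributes need to be modified without changing the node itself.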

3. Node text operations

Use text or tail to read and set a node's text content

root = etree.Element('root',classic = "person")
root.text = "i am a boy"
print(etree.tostring(root))
# Once text is set, the node is serialized as an opening and closing tag pair
print(root.text)

Output:

b'<root classic="person">i am a boy</root>'
i am a boy

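The tail attribute holds text that follows the node's closing tag. In addition, tostring() accepts a method argument; with method='text' the tags are filtered out and only the text is produced.

root.tail = "i am the tail"
print(etree.tostring(root))
print(etree.tostring(root, method='text'))

Output:

b'<root classic="person">i am a boy</root>i am the tail'
b'i am a boyi am the tail'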

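4. xpath (basic XPath syntax and the full tutorial link are given at the end)

On an Element, xpath() can be used to select nodes, attributes, and their text content.

# Rebuild a small tree: the text belongs to root, the tail follows <child/>
root = etree.Element('root')
root.text = "i am a boy"
child = etree.SubElement(root, 'child')
child.tail = "i am the tail"
# Filter out the tags and return the text
print(root.xpath("string()"))
# Return a list of text pieces split at the tags; each piece remembers where it came from
print(root.xpath("//text()"))
text = root.xpath("//text()")
# A text result has no tag of its own; getparent() gives the node it belongs to
print(text[0].getparent().tag)
print(text[0].is_text)
print(text[0].is_tail)

Output:

i am a boyi am the tail
['i am a boy', 'i am the tail']
root
True
False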

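2. Parsing and serializing text

Parsing text

There are three main ways to parse text:

fromstring
The XML method
The HTML method, which completes missing tags automatically

The fromstring method

data = '<root>data</root>'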
root = etree.fromstring(data)
print(root.tag)
print(etree.tostring(root))

Output:

root
b'<root>data</root>'

The XML method

root = etree.XML(data)
print(etree.tostring(root))

Output:

b'<root>data</root>'

The HTML method

root = etree.HTML(data)
print(etree.tostring(root))

Output:

b'<html><body><root>data</root></body></html>'
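
Unlike the HTML method, fromstring and XML require well-formed input and raise etree.XMLSyntaxError otherwise. The sketch below wraps that in a small helper; `parse_or_none` is a hypothetical name introduced for this example, not part of lxml.

```python
from lxml import etree

def parse_or_none(text):
    """Return the root element, or None when `text` is not well-formed XML."""
    try:
        return etree.fromstring(text)
    except etree.XMLSyntaxError:
        return None

good = parse_or_none('<root>data</root>')
bad = parse_or_none('<root>data<root>')   # closing tag lacks the slash

print(good.tag)
print(bad)
```

A missing slash in a closing tag is an easy mistake to make when typing test data by hand, and this kind of check surfaces it immediately.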

Serializing text

This is done mainly with the tostring function.

root = etree.XML('<root><a>info</a></root>')
print(etree.tostring(root))
print(etree.tostring(root, xml_declaration=True))
print(etree.tostring(root, encoding='utf-8'))

Output:

b'<root><a>info</a></root>'
b"<?xml version='1.0' encoding='ASCII'?>\n<root><a>info</a></root>"
b'<root><a>info</a></root>'
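
Two more tostring options tend to be handy in practice: `encoding='unicode'`, which returns a str instead of bytes, and `pretty_print=True`, which indents the result. A short sketch on the same sample document:

```python
from lxml import etree

root = etree.XML('<root><a>info</a></root>')

as_text = etree.tostring(root, encoding='unicode')   # str instead of bytes
pretty = etree.tostring(root, pretty_print=True)     # indented, ends with a newline

print(as_text)
print(pretty.decode())
```

The str form is convenient for logging or string comparison, while the bytes form is what is normally written to a file.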

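3. ElementTree objects (the most commonly used)

An ElementTree object can be thought of as a tree made up of Element objects, while ElementPath is used to locate individual Element nodes within it.

import xml.etree.ElementTree as ET

path = "test.xml"
# ET.parse reads the XML file and returns an ElementTree object
tree = ET.parse(path)
# getroot returns the root node of the ElementTree
root = tree.getroot()
# findall and find locate nodes; they support only a subset of XPath syntax
# For full XPath support, use etree from lxml instead
find_node = root.findall('.//c')  # find all c nodes below the root node

Note: the Element class in lxml.etree is not the same as xml.etree.ElementTree.Element.

Do not mix the two.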
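
For comparison, the same steps can be done with lxml.etree, which offers a parse/getroot interface of the same shape plus full XPath through xpath(). This is a sketch under assumptions: the XML content is invented, and an in-memory `io.BytesIO` object stands in for test.xml so the example runs without a file on disk.

```python
import io
from lxml import etree

# Stand-in for a file on disk; etree.parse() also accepts a file path.
xml_bytes = b'<a><b><c>1</c></b><c>2</c></a>'
tree = etree.parse(io.BytesIO(xml_bytes))
root = tree.getroot()

found = root.findall('.//c')                   # ElementPath subset
texts = root.xpath('//c[text()="2"]/text()')   # full XPath 1.0 expression

print([c.text for c in found])
print(texts)
```

findall behaves as in the standard library, while the predicate on text() is the kind of expression that needs the xpath() method.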

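XPATH

Link to the full tutorial

Expression	Description
nodename	Selects the child elements named nodename of the current node.
/	Selects from the root node.
//	Selects nodes in the document from the current node that match the selection, regardless of their position.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.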

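Path expression	Result
bookstore	Selects the bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) always represents an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, no matter where they are in the document.
bookstore//book	Selects all book elements that are descendants of the bookstore element, no matter where they are under bookstore.
//@lang	Selects all attributes named lang.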

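Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.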

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

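Path expression	Result
//book/title | //book/price	Selects all title and price elements of the book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.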
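
Several of the expressions in the tables above can be tried directly with lxml's xpath() method. The following is a minimal sketch; the bookstore document is hypothetical sample data made up for this example.

```python
from lxml import etree

# Hypothetical sample data, modelled on the bookstore used in the tables.
doc = etree.XML(
    '<bookstore>'
    '<book><title lang="eng">Harry Potter</title><price>29.99</price></book>'
    '<book><title lang="eng">Learning XML</title><price>39.95</price></book>'
    '<book><title>Untitled</title><price>50.00</price></book>'
    '</bookstore>'
)

first_title = doc.xpath('/bookstore/book[1]/title/text()')
last_title = doc.xpath('/bookstore/book[last()]/title/text()')
first_two = doc.xpath('/bookstore/book[position()<3]/title/text()')
with_lang = doc.xpath('//title[@lang]/text()')
expensive = doc.xpath('/bookstore/book[price>35.00]/title/text()')
langs = doc.xpath('//@lang')
union = doc.xpath('//book[1]/title | //book[1]/price')   # elements, in document order

print(first_title, last_title)
print(first_two)
print(with_lang, langs)
print(expensive)
print([node.tag for node in union])
```

Note that XPath positions start at 1, not 0, so book[1] is the first book.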