[Python] Study notes on the standard library module xml.etree.ElementTree

Latest recommended article published on 2024-01-04 17:12:35

wy_hhxx


Article link: https://blog.csdn.net/wy_hhxx/article/details/99863658


Official documentation:

Python 3 https://docs.python.org/3/library/xml.etree.elementtree.html

Python 2.7 https://docs.python.org/2/library/xml.etree.elementtree.html

=========================================================

Below are my reading notes, along with some experiments and thoughts.

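xml.etree.ElementTree (ET for short below) has two main classes: ElementTree and Element.

ElementTree represents the whole XML document as a tree. Interactions with the whole document (reading from and writing to files) are usually done at the ElementTree level.

Element represents a single node in that tree. Interactions with a single XML element and its children are done at the Element level.

1. Parsing XML

The examples use the sample XML from the official documentation, with Python 3.6.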

<?xml version="1.0"?>
<data>    <!--Element 'data' at 0x7eff4d8d8f48-->
    <country name="Liechtenstein">    <!--Element 'country' at 0x7eff4d7aee58-->
        <rank>1</rank>               <!--Element 'rank' at 0x7eff4d7ca458-->
        <year>2008</year>            <!--Element 'year' at 0x7eff4d0afc28-->
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">    <!--Element 'country' at 0x7eff4d04a458-->
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">    <!--Element 'country' at 0x7eff4d04a5e8-->
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Load the data from a file: get the root element and assign it to the variable root

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('country_data.xml')
>>> root = tree.getroot()
>>>
>>> print(tree)
<xml.etree.ElementTree.ElementTree object at 0x7eff4d7c7358>
>>> print(root)
<Element 'data' at 0x7eff4d8d8f48>
>>>
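When the XML is already in a string rather than a file, the same tree can be built without touching disk. The following is a minimal sketch; the shortened XML text is made up for illustration and is not the full sample above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened XML text used only for this illustration
xml_text = "<data><country name='Liechtenstein'><rank>1</rank></country></data>"

# fromstring() parses a string and returns the root Element directly
root = ET.fromstring(xml_text)

# Wrap the root in an ElementTree when document-level operations
# (such as write()) are needed later
tree = ET.ElementTree(root)
same_root = tree.getroot() is root

print(root.tag, same_root)
```

ET.parse() returns an ElementTree, while ET.fromstring() returns an Element, which is why getroot() is only needed in the file-based version.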

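An element has the following four attributes; attrib is a dictionary, and text and tail may also be None.

tag	string	name of the element
text	string	content of the element (the text right after its start tag)
attrib	dictionary	attributes of the element
tail	string	text that follows the element's closing tag

Example: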

>>> root.tag
'data'
>>> root.attrib
{}
>>> root.find('country').text
'\n        '
>>> root.find('country').attrib
{'name': 'Liechtenstein'}
>>> root.find('country/year').text
'2008'
>>> root.tail
>>> root.find('country/year').tail
'\n        '
>>> root.find('country').tail
'\n    '
>>>

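Notes:

1) The root element's tag is "data". Because the data tag carries no attributes, root.attrib is an empty dictionary.

2) root's first child <country></country> has one attribute, name.

3) The content of <year></year>, a child of root's first <country></country>, is read with ".text"; its value is '2008'.

4) Because the last line of the document is the closing tag </data>, root.tail holds no newline or spaces (it is None, so the REPL prints nothing). The tail of root's first child <country></country> is a newline plus 4 spaces, and the tail of that country's child <year></year> is a newline plus 8 spaces (spaces rather than Tabs, as the output above shows).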
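The difference between text and tail is easier to see on a tiny document with mixed content. This is a sketch on a made-up one-line XML string, not the country sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content document used only for this illustration
root = ET.fromstring("<a>hello<b>inner</b>after b<c/></a>")
b = root.find('b')
c = root.find('c')

# text: what follows the element's own start tag
# tail: what follows the element's end tag, up to the next tag
results = (root.text, b.text, b.tail, c.text, c.tail, root.tail)
print(results)
```

Both attributes are None when there is nothing at that position, so code that concatenates them should guard against None.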

You can also iterate over an element's children

>>> for child in root:
...     print(child.tag, child.attrib)
...
country {'name': 'Liechtenstein'}
country {'name': 'Singapore'}
country {'name': 'Panama'}
>>>
>>> for child in root.find('country'):
...     print(child.tag, child.attrib)
...
rank {}
year {}
gdppc {}
neighbor {'name': 'Austria', 'direction': 'E'}
neighbor {'name': 'Switzerland', 'direction': 'W'}
>>>

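Notes:

1) Iterates over root's children, the three <country>...</country> elements

2) Iterates over the children of root's first <country>...</country> child

What if the <country>...</country> to iterate over is not the first one? For example, suppose I want to iterate over the child tags of the Singapore country.

Here is how not to do it (facepalm)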

>>> for child in root.findall(".//country[@name='Singapore']"):
...     print(child.tag, child.attrib)
...
country {'name': 'Singapore'}
>>>

Note: root.findall(".//country[@name='Singapore']") finds every country element at any depth below root whose name attribute is 'Singapore'. The loop therefore visits the matching country elements themselves, not their children.

See the "Supported XPath syntax" section of the official documentation

//	Selects all subelements, on all levels beneath the current element. For example, .//egg selects all egg elements in the entire tree.
[@attrib='value']	Selects all elements for which the given attribute has the given value. The value cannot contain quotes.

Why is that? Compare with the previous two examples (on a closer look, the last one is a list)

>>> print(root)
<Element 'data' at 0x7eff4d8d8f48>
>>>
>>> print(root.find('country'))
<Element 'country' at 0x7eff4d7aee58>
>>>
>>> print(root.findall(".//country[@name='Singapore']"))
[<Element 'country' at 0x7eff4d04a458>]
>>>
>>> type(root)
<class 'xml.etree.ElementTree.Element'>
>>> type(root.findall(".//country[@name='Singapore']"))
<class 'list'>
>>>
>>> type(root.find('country'))
<class 'xml.etree.ElementTree.Element'>
>>> type(root.findall(".//country[@name='Singapore']"))
<class 'list'>
>>>

Note: root.findall() returns a list, even when only one element matches, so remember to take the first element of that list.

So the correct approach is:

>>> for child in root.findall(".//country[@name='Singapore']")[0]:
...     print(child.tag, child.attrib)
...
rank {}
year {}
gdppc {}
neighbor {'name': 'Malaysia', 'direction': 'N'}
>>>
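Since find() accepts the same path syntax and returns the first match directly, the [0] index can be avoided. A minimal sketch follows, using a shortened copy of the sample data; the name 'Atlantis' is a made-up value that matches nothing.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data, for illustration only
root = ET.fromstring("""<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank><neighbor name="Malaysia" direction="N"/></country>
</data>""")

# find() returns a single Element, or None when nothing matches
singapore = root.find(".//country[@name='Singapore']")

child_tags = []
# Compare with None explicitly: an Element without children is falsy
if singapore is not None:
    for child in singapore:
        child_tags.append(child.tag)

missing = root.find(".//country[@name='Atlantis']")
print(child_tags, missing)
```

Unlike findall(...)[0], which raises IndexError on an empty result, find() signals "no match" with None.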

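2. Finding elements of interest

Element.iter() recursively iterates over the whole subtree below the current element (its children, their children, and so on)
Element.find() finds the first child of the current element with a given tag; Element.text accesses the element's text content
Element.findall() finds only the direct children of the current element that have a given tag
Element.get() accesses an attribute of the current element

1) Element.iter() examples

1. country, the children of root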
>>> for country in root.iter('country'):
...     print(country.attrib)
...
{'name': 'Liechtenstein'}
{'name': 'Singapore'}
{'name': 'Panama'}
>>>

2. neighbor, the grandchildren of root
>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
>>>

3. neighbor, the children of the third country
>>> country_Panama = root.findall('country')[2]
>>> for neighbor in country_Panama.iter('neighbor'):
...     print(neighbor.attrib)
...
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
>>>

2) Element.find() examples

>>> root.find('country').attrib
{'name': 'Liechtenstein'}
>>>
>>> country_Panama = root.findall('country')[2]
>>> country_Panama.find('neighbor').attrib
{'name': 'Costa Rica', 'direction': 'W'}
>>>

3) Element.findall() examples

>>> root.findall('rank')
[]
>>> root.find('country').findall('rank')
[<Element 'rank' at 0x7eff4d7ca458>]
>>>
>>> root.find('country').findall('rank').text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'text'
>>>
>>> root.find('country').findall('rank')[0].text
'1'
>>>
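root.findall('rank') is empty because rank is not a direct child of root. A path can reach deeper levels in one call. The sketch below uses a shortened copy of the sample data.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data, for illustration only
root = ET.fromstring("""<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
</data>""")

# A bare tag name matches direct children only
direct = root.findall('rank')

# A path with "/" walks down one level per step
by_path = [r.text for r in root.findall('country/rank')]

# ".//" matches at any depth below the current element
any_depth = [r.text for r in root.findall('.//rank')]

print(direct, by_path, any_depth)
```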

4) Element.get() example

>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print(name, rank)
...
Liechtenstein 1
Singapore 4
Panama 68
>>>
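country.find('rank').text fails with AttributeError when a country has no rank child, because find() returns None. A more defensive variant might use get() with a default and findtext(). In this sketch the attribute name population and the tag capital are made up to show the missing case.

```python
import xml.etree.ElementTree as ET

# One country element, for illustration only
country = ET.fromstring('<country name="Panama"><rank>68</rank><year/></country>')

name = country.get('name')
# get() takes an optional default for attributes that are absent
population = country.get('population', 'unknown')

# findtext() returns the text of the first matching child
rank = country.findtext('rank')
# ... the default when no child matches
capital = country.findtext('capital', default='n/a')
# ... and an empty string when the child exists but has no text
year = country.findtext('year')

print(name, population, rank, capital, repr(year))
```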

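3. Modifying an XML file

Change a field's value: Element.text
Add or modify an attribute: Element.set()
Add a child element: Element.append()
Remove an element: Element.remove()

1) Changing a field's value -> increase each rank tag's value by 1; adding an attribute -> give each rank tag an updated attribute with the value yes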

>>> for rank in root.iter('rank'):
...     new_rank = int(rank.text) + 1
...     rank.text = str(new_rank)
...     rank.set('updated', 'yes')
...
>>> tree.write('country_data.xml')
>>>

After the write, the XML file looks like this:

<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor direction="E" name="Austria" />
        <neighbor direction="W" name="Switzerland" />
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor direction="N" name="Malaysia" />
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor direction="W" name="Costa Rica" />
        <neighbor direction="E" name="Colombia" />
    </country>
</data>
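Note that the <?xml version="1.0"?> line of the original file is gone after the write. If the declaration matters, write() can be asked to emit one. The following is a sketch that writes to a temporary directory; the file name out.xml is arbitrary.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Tiny tree, for illustration only
root = ET.fromstring("<data><rank>1</rank></data>")
tree = ET.ElementTree(root)

# Arbitrary output location inside a fresh temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'out.xml')

# xml_declaration=True forces the <?xml ...?> header to be written
tree.write(out_path, encoding='utf-8', xml_declaration=True)

with open(out_path, encoding='utf-8') as f:
    written = f.read()
print(written)
```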

2) Adding a child element, for example one more country. Create new_country.py with the following content:

from xml.etree import ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
# (For formatting) find the current last country element; its tail is a newline, so change it to a newline plus 4 spaces
last_country = root.findall('country')[-1]
last_country.tail = '\n' + ' '*4

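# Create the country element to add, with its name attribute set to Secret
# (For formatting) its text is a newline plus 8 spaces, the indentation of the children appended later
# (For formatting) its tail is a newline, because </data> follows this last </country> with no indentation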
country_toadd = ET.Element('country')
country_toadd.attrib = {'name':'Secret'}
country_toadd.text = '\n' + ' '*8
country_toadd.tail = '\n'

# Tags of the country's children, kept in a list, plus a counter
country_children = ['rank','year','gdppc','neighbor']
count = 0

# Append the children in a loop
# (For formatting) the last child's tail is a newline plus 4 spaces; every other child's is a newline plus 8 spaces
for ele in country_children:
    child_add = ET.Element(ele)
    count = count + 1
    if count == len(country_children):
        child_add.tail = '\n' + ' '*4
    else:
        child_add.tail = '\n' + ' '*8
    # Set rank's value to '0'. It must be a string: the integer 0 raises no error but is silently not written
    if ele == 'rank':
        child_add.text = '0'
    country_toadd.append(child_add)

# Append the newly created country element to the root and write the result to a new file, country_data1.xml
root.append(country_toadd)
tree.write('country_data1.xml')

Running python new_country.py produces country_data1.xml with the following content:

[root@xxx ~]# cat country_data1.xml
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor direction="E" name="Austria" />
        <neighbor direction="W" name="Switzerland" />
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor direction="N" name="Malaysia" />
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor direction="W" name="Costa Rica" />
        <neighbor direction="E" name="Colombia" />
    </country>
    <country name="Secret">
        <rank>0</rank>
        <year />
        <gdppc />
        <neighbor />
    </country>
</data>[root@xxx ~]#
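Managing text and tail by hand is only needed for pretty output. As an alternative sketch, ET.SubElement() creates and attaches a child in one call, and on Python 3.9 or later ET.indent() can fix the whitespace afterwards. This is not the approach used in new_country.py above, and the starting XML here is a shortened stand-in.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for country_data.xml, for illustration only
root = ET.fromstring("<data><country name='Panama'><rank>69</rank></country></data>")

# SubElement() creates the element and appends it to the parent
country = ET.SubElement(root, 'country', {'name': 'Secret'})
for tag in ('rank', 'year', 'gdppc', 'neighbor'):
    child = ET.SubElement(country, tag)
    if tag == 'rank':
        child.text = '0'  # text must be a string

# ET.indent() exists only on Python 3.9+, so guard the call
if hasattr(ET, 'indent'):
    ET.indent(root, space='    ')

xml_out = ET.tostring(root, encoding='unicode')
print(xml_out)
```

On older versions such as the Python 3.6 used in these notes, the hasattr guard skips the indentation step and the new elements are written on one line.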

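3) Removing elements, for example removing the country just added from country_data1.xml. Note that text is a string, so it must be converted before a numeric comparison.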

>>> tree1 = ET.parse('country_data1.xml')
>>> root1 = tree1.getroot()
>>>
>>> for country in root1.findall('country'):
...     rank = int(country.find('rank').text)
...     if rank < 1:
...             root1.remove(country)
...
>>> tree1.write('country_data1.xml')
>>>

Check country_data1.xml again: it once more contains the same countries as country_data.xml.
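The loop above is safe because findall() returns a separate list, so removing from root does not disturb the iteration. Iterating over root directly while removing from it can skip elements. A sketch of the safe pattern with an explicit snapshot, on made-up data with two rank-0 countries in a row:

```python
import xml.etree.ElementTree as ET

# Made-up data: two consecutive countries with rank 0, then one with rank 5
root = ET.fromstring(
    "<data>"
    "<country name='A'><rank>0</rank></country>"
    "<country name='B'><rank>0</rank></country>"
    "<country name='C'><rank>5</rank></country>"
    "</data>"
)

# list(root) takes a snapshot of the children before any removal
for country in list(root):
    if int(country.find('rank').text) < 1:
        root.remove(country)

remaining = [c.get('name') for c in root]
print(remaining)
```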

As another example, keep only the first neighbor in each country, delete the rest, and write the result to country_data2.xml

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('country_data.xml')
>>> root = tree.getroot()
>>> for country in root.findall('country'):
...     neighbor_list = country.findall('neighbor')
...     if len(neighbor_list) > 1:
...             for neighbor in neighbor_list[1:]:
...                     country.remove(neighbor)
...
>>> tree.write('country_data2.xml')

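Check country_data2.xml: each country now has only one neighbor. The extra indentation of some </country> lines comes from the tail of the remaining neighbor, which was not adjusted.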

[root@xxx ~]# cat country_data2.xml
<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor direction="E" name="Austria" />
        </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor direction="N" name="Malaysia" />
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor direction="W" name="Costa Rica" />
        </country>
</data>[root@xxx ~]#

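A caveat: I ran into problems when removing elements, and the removal did not take effect. Possibly I had read the file and written it back without re-reading it, or I had changed the tree in memory without writing it out. After quitting python3, restoring the initial backup of the XML file, and starting python3 again, the operation succeeded. So if something goes wrong while modifying XML, it is best to clean up the environment and re-read the file.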
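One way to check whether a change lives only in memory or has reached the file is to re-parse the file before and after write(). This is a sketch under assumed conditions (a throwaway file in a temporary directory with made-up content), not a reconstruction of the session described above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Throwaway file with made-up content
path = os.path.join(tempfile.mkdtemp(), 'demo.xml')
with open(path, 'w', encoding='utf-8') as f:
    f.write("<data><country name='A'/><country name='B'/></data>")

tree = ET.parse(path)
root = tree.getroot()

# The removal changes only the in-memory tree
root.remove(root.find("country[@name='B']"))
on_disk_before = len(ET.parse(path).getroot())

# The file changes only once the tree is written back
tree.write(path)
on_disk_after = len(ET.parse(path).getroot())

print(on_disk_before, on_disk_after)
```

Re-parsing after each write also guarantees that later steps operate on what is actually in the file rather than on a stale tree object.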
