python3 中解析简单的xml文档

**
xml是一种十分常见的表机性语言,可提供统一的方法来描述应用程序的结构化数据。
**
[code lang="xml"]
<?xml version="1.0"?>
<data>
<country name="Liechtenstein">
<rank>1</rank>
<year>2008</year>
<gdppc>141100</gdppc>
<neighbor name="Austria" direction="E"/>
<neighbor name="Switzerland" direction="W"/>
</country>
<country name="Singapore">
<rank>4</rank>
<year>2011</year>
<gdppc>59900</gdppc>
<neighbor name="Malaysia" direction="N"/>
</country>
<country name="Panama">
<rank>68</rank>
<year>2011</year>
<gdppc>13600</gdppc>
<neighbor name="Costa Rica" direction="W"/>
<neighbor name="Colombia" direction="E"/>
</country>
</data>

那么在python当中如何解析xml文档呢?

我们可以使用标准库中的xml.etree.ElementTree,其中的parse函数可以解析xml文档。

from xml.etree.ElementEtree import parse #导入这个函数

parse这个函数有两个参数parse(source,parse=None)

Python
In [28]: parse? Signature: parse(source, parser=None) Docstring: Parse XML document into element tree. *source* is a filename or file object containing XML data, *parser* is an optional parser instance defaulting to XMLParser. Return an ElementTree instance. File: /usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/xml/etree/ElementTree.py Type: function
1
2
3
4
5
6
7
8
9
10
11
In [ 28 ] : parse ?
Signature : parse ( source , parser = None )
Docstring :
Parse XML document into element tree .
 
* source * is a filename or file object containing XML data ,
* parser * is an optional parser instance defaulting to XMLParser .
 
Return an ElementTree instance .
File :        / usr / local / Cellar / python3 / 3.6.2 / Frameworks / Python . framework / Versions / 3.6 / lib / python3 . 6 / xml / etree / ElementTree . py
Type :        function

可以把上面这个xml文件作为source也就是输入元。

Python
In [31]: from xml.etree.ElementTree import fromstring,parse In [32]: f = open('01.xml') In [33]: et = parse(f) In [34]: root = et.getroot()
1
2
3
4
5
6
7
In [ 31 ] : from xml.etree.ElementTree import fromstring , parse
 
In [ 32 ] : f = open ( '01.xml' )
 
In [ 33 ] : et = parse ( f )
 
In [ 34 ] : root = et . getroot ( )
Python
# 获取根地址 In [35]: root Out[35]: <Element 'data' at 0x107949818>
1
2
3
4
# 获取根地址
 
In [ 35 ] : root
Out [ 35 ] : < Element 'data' at 0x107949818 >
Python
In [36]: root.tag Out[36]: 'data' # 获取标签
1
2
3
In [ 36 ] : root . tag
Out [ 36 ] : 'data'
# 获取标签
Python
# 获取属性 In [37]: root.attrib Out[37]: {}
1
2
3
# 获取属性
In [ 37 ] : root . attrib
Out [ 37 ] : { }
Python
# 获取文本 In [38]: root.text Out[38]: '\n ' In [39]: root.text.strip() Out[39]: ''
1
2
3
4
5
6
# 获取文本
In [ 38 ] : root . text
Out [ 38 ] : '\n    '
 
In [ 39 ] : root . text . strip ( )
Out [ 39 ] : ''
Python
# 获取子节点 In [40]: root.getchildren() Out[40]: [<Element 'country' at 0x107d76778>, <Element 'country' at 0x107d3e188>, <Element 'country' at 0x107d3e0e8>]
1
2
3
4
5
6
# 获取子节点
In [ 40 ] : root . getchildren ( )
Out [ 40 ] :
[ < Element 'country' at 0x107d76778 > ,
< Element 'country' at 0x107d3e188 > ,
< Element 'country' at 0x107d3e0e8 > ]
Python
# 获取子节点的文本 In [43]: for x in root: ...: print(x.get('name')) ...: ...: Liechtenstein Singapore Panama
1
2
3
4
5
6
7
8
# 获取子节点的文本
In [ 43 ] : for x in root :
     . . . :      print ( x . get ( 'name' ) )
     . . . :
     . . . :
Liechtenstein
Singapore
Panama
Python
# 获取root 节点的 country(第一个节点) In [44]: root.find('country') Out[44]: <Element 'country' at 0x107d76778>
1
2
3
4
# 获取root 节点的 country(第一个节点)
 
In [ 44 ] : root . find ( 'country' )
Out [ 44 ] : < Element 'country' at 0x107d76778 >
Python
# 获取root 节点下的所遇的country In [45]: root.findall('country') Out[45]: [<Element 'country' at 0x107d76778>, <Element 'country' at 0x107d3e188>, <Element 'country' at 0x107d3e0e8>]
1
2
3
4
5
6
7
# 获取root 节点下的所遇的country
 
In [ 45 ] : root . findall ( 'country' )
Out [ 45 ] :
[ < Element 'country' at 0x107d76778 > ,
< Element 'country' at 0x107d3e188 > ,
< Element 'country' at 0x107d3e0e8 > ]
Python
# 获取root 节点下的所遇的country 生成器 In [46]: root.iterfind('country') Out[46]: <generator object prepare_child.<locals>.select at 0x107147b48> In [47]: for e in root.iterfind('country'): ...: print(e.get('name')) ...: Liechtenstein Singapore Panama
1
2
3
4
5
6
7
8
9
10
# 获取root 节点下的所遇的country 生成器
In [ 46 ] : root . iterfind ( 'country' )
Out [ 46 ] : < generator object prepare_child . < locals > . select at 0x107147b48 >
 
In [ 47 ] : for e in root . iterfind ( 'country' ) :
     . . . :          print ( e . get ( 'name' ) )
     . . . :
Liechtenstein
Singapore
Panama
Python
# 不能获取孙子节点 In [48]: root.findall('rank') Out[48]: [] In [49]: root.iter() Out[49]: <_elementtree._element_iterator at 0x107d530a0>
1
2
3
4
5
6
7
8
# 不能获取孙子节点
 
In [ 48 ] : root . findall ( 'rank' )
Out [ 48 ] : [ ]
 
 
In [ 49 ] : root . iter ( )
Out [ 49 ] : < _elementtree . _element_iterator at 0x107d530a0 >
Python
# 获取所有的节点 In [50]: [x for x in root.iter()] Out[50]: [<Element 'data' at 0x107949818>, <Element 'country' at 0x107d76778>, <Element 'rank' at 0x107d76818>, <Element 'year' at 0x107d76728>, <Element 'gdppc' at 0x107d76598>, <Element 'neighbor' at 0x10784f9f8>, <Element 'neighbor' at 0x10784f8b8>, <Element 'country' at 0x107d3e188>, <Element 'rank' at 0x107d3e908>, <Element 'year' at 0x107d3e548>, <Element 'gdppc' at 0x107d3ec78>, <Element 'neighbor' at 0x107d3e638>, <Element 'country' at 0x107d3e0e8>, <Element 'rank' at 0x107d3ed18>, <Element 'year' at 0x107d3ea48>, <Element 'gdppc' at 0x107d3e368>, <Element 'neighbor' at 0x107d3e9a8>, <Element 'neighbor' at 0x107d3eb88>]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
# 获取所有的节点
 
In [ 50 ] : [ x for x in root . iter ( ) ]
Out [ 50 ] :
[ < Element 'data' at 0x107949818 > ,
< Element 'country' at 0x107d76778 > ,
< Element 'rank' at 0x107d76818 > ,
< Element 'year' at 0x107d76728 > ,
< Element 'gdppc' at 0x107d76598 > ,
< Element 'neighbor' at 0x10784f9f8 > ,
< Element 'neighbor' at 0x10784f8b8 > ,
< Element 'country' at 0x107d3e188 > ,
< Element 'rank' at 0x107d3e908 > ,
< Element 'year' at 0x107d3e548 > ,
< Element 'gdppc' at 0x107d3ec78 > ,
< Element 'neighbor' at 0x107d3e638 > ,
< Element 'country' at 0x107d3e0e8 > ,
< Element 'rank' at 0x107d3ed18 > ,
< Element 'year' at 0x107d3ea48 > ,
< Element 'gdppc' at 0x107d3e368 > ,
< Element 'neighbor' at 0x107d3e9a8 > ,
< Element 'neighbor' at 0x107d3eb88 > ]
Python
# 获取接点是 rank的所有节点,包括rank In [51]: [x for x in root.iter('rank')] Out[51]: [<Element 'rank' at 0x107d76818>, <Element 'rank' at 0x107d3e908>, <Element 'rank' at 0x107d3ed18>]
1
2
3
4
5
6
# 获取接点是 rank的所有节点,包括rank
In [ 51 ] : [ x for x in root . iter ( 'rank' ) ]
Out [ 51 ] :
[ < Element 'rank' at 0x107d76818 > ,
< Element 'rank' at 0x107d3e908 > ,
< Element 'rank' at 0x107d3ed18 > ]
Python
# 查找 country 下所有的子元素 In [52]: root.findall('country/*') Out[52]: [<Element 'rank' at 0x107d76818>, <Element 'year' at 0x107d76728>, <Element 'gdppc' at 0x107d76598>, <Element 'neighbor' at 0x10784f9f8>, <Element 'neighbor' at 0x10784f8b8>, <Element 'rank' at 0x107d3e908>, <Element 'year' at 0x107d3e548>, <Element 'gdppc' at 0x107d3ec78>, <Element 'neighbor' at 0x107d3e638>, <Element 'rank' at 0x107d3ed18>, <Element 'year' at 0x107d3ea48>, <Element 'gdppc' at 0x107d3e368>, <Element 'neighbor' at 0x107d3e9a8>, <Element 'neighbor' at 0x107d3eb88>]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 查找 country 下所有的子元素
In [ 52 ] : root . findall ( 'country/*' )
Out [ 52 ] :
[ < Element 'rank' at 0x107d76818 > ,
< Element 'year' at 0x107d76728 > ,
< Element 'gdppc' at 0x107d76598 > ,
< Element 'neighbor' at 0x10784f9f8 > ,
< Element 'neighbor' at 0x10784f8b8 > ,
< Element 'rank' at 0x107d3e908 > ,
< Element 'year' at 0x107d3e548 > ,
< Element 'gdppc' at 0x107d3ec78 > ,
< Element 'neighbor' at 0x107d3e638 > ,
< Element 'rank' at 0x107d3ed18 > ,
< Element 'year' at 0x107d3ea48 > ,
< Element 'gdppc' at 0x107d3e368 > ,
< Element 'neighbor' at 0x107d3e9a8 > ,
< Element 'neighbor' at 0x107d3eb88 > ]
Python
# 选中当前元素下的所有的rank In [54]: root.findall('.//rank') Out[54]: [<Element 'rank' at 0x107d76818>, <Element 'rank' at 0x107d3e908>, <Element 'rank' at 0x107d3ed18>]
1
2
3
4
5
6
# 选中当前元素下的所有的rank
In [ 54 ] : root . findall ( './/rank' )
Out [ 54 ] :
[ < Element 'rank' at 0x107d76818 > ,
< Element 'rank' at 0x107d3e908 > ,
< Element 'rank' at 0x107d3ed18 > ]
Python
# 选择country 包含name的属性 In [58]: root.findall('country[@name]') Out[58]: [<Element 'country' at 0x107d76778>, <Element 'country' at 0x107d3e188>, <Element 'country' at 0x107d3e0e8>]
1
2
3
4
5
6
7
# 选择country 包含name的属性
 
In [ 58 ] : root . findall ( 'country[@name]' )
Out [ 58 ] :
[ < Element 'country' at 0x107d76778 > ,
< Element 'country' at 0x107d3e188 > ,
< Element 'country' at 0x107d3e0e8 > ]
Python
# 选择country 包含name的属性是"Liechtenstein"的节点 In [59]: root.findall('country[@name="Liechtenstein"]') Out[59]: [<Element 'country' at 0x107d76778>]
1
2
3
4
5
# 选择country 包含name的属性是"Liechtenstein"的节点
 
 
In [ 59 ] : root . findall ( 'country[@name="Liechtenstein"]' )
Out [ 59 ] : [ < Element 'country' at 0x107d76778 > ]
Python
# 选择country 必须包含 rank 节点 In [60]: root.findall('country[rank]') Out[60]: [<Element 'country' at 0x107d76778>, <Element 'country' at 0x107d3e188>, <Element 'country' at 0x107d3e0e8>]
1
2
3
4
5
6
7
# 选择country 必须包含 rank 节点
 
In [ 60 ] : root . findall ( 'country[rank]' )
Out [ 60 ] :
[ < Element 'country' at 0x107d76778 > ,
< Element 'country' at 0x107d3e188 > ,
< Element 'country' at 0x107d3e0e8 > ]
Python
# 选择country 必须包含 rank 节点 而且节点的值必须是4 In [63]: root.findall('country[rank="4"]') Out[63]: [<Element 'country' at 0x107d3e188>]
1
2
3
4
# 选择country 必须包含 rank 节点 而且节点的值必须是4
 
In [ 63 ] : root . findall ( 'country[rank="4"]' )
Out [ 63 ] : [ < Element 'country' at 0x107d3e188 > ]



  • zeropython 微信公众号 5868037 QQ号 5868037@qq.com QQ邮箱
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值