A Brief Introduction to Parsing XML with Python

小大小丑


1. Save the following as feed.xml

<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Dive into history, 2009 edition</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
    <updated>2009-03-27T21:56:07Z</updated>
    <published>2009-03-27T17:20:42Z</published>
    <category scheme='http://diveintomark.org' term='diveintopython'/>
    <category scheme='http://diveintomark.org' term='docbook'/>
    <category scheme='http://diveintomark.org' term='html'/>
    <summary type='html'>Putting an entire chapter on one page sounds
      bloated, but consider this &amp;mdash; my longest chapter so far
      would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
      On dialup.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Accessibility is a harsh mistress</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>
    <id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
    <updated>2009-03-22T01:05:37Z</updated>
    <published>2009-03-21T20:09:28Z</published>
    <category scheme='http://diveintomark.org' term='accessibility'/>
    <summary type='html'>The accessibility orthodoxy does not permit people to
      question the value of features that are rarely useful and rarely used.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
    </author>
    <title>A gentle introduction to video encoding, part 1: container formats</title>
    <link rel='alternate' type='text/html'
      href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>
    <id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
    <updated>2009-01-11T19:39:22Z</updated>
    <published>2008-12-18T15:54:22Z</published>
    <category scheme='http://diveintomark.org' term='asf'/>
    <category scheme='http://diveintomark.org' term='avi'/>
    <category scheme='http://diveintomark.org' term='encoding'/>
    <category scheme='http://diveintomark.org' term='flv'/>
    <category scheme='http://diveintomark.org' term='GIVE'/>
    <category scheme='http://diveintomark.org' term='mp4'/>
    <category scheme='http://diveintomark.org' term='ogg'/>
    <category scheme='http://diveintomark.org' term='video'/>
    <summary type='html'>These notes will eventually become part of a
      tech talk on video encoding.</summary>
  </entry>
</feed>

-------------------------------------------------------------------------------------------------------------------------------------------

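2. Parsing XML with Python

Python can parse XML documents in several different ways. The standard library includes DOM and SAX parsers; this article uses ElementTree, which also ships with Python.

2.1 Parsing the XML file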

>>> import xml.etree.ElementTree as etree    # ElementTree is part of the standard library, at xml.etree.ElementTree

>>> tree = etree.parse('examples/feed.xml')  # a relative path, e.g. on Linux
>>> tree = etree.parse('C:\\feed.xml')       # an absolute path on Windows
# parse() reads and parses the XML file; its argument can be a filename or a file-like object

>>> root = tree.getroot()                    # get the root element
>>> root                                     # displays as follows
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>
# http://www.w3.org/2005/Atom is the namespace and feed is the tag name, so the root element is shown as {http://www.w3.org/2005/Atom}feed
# ElementTree represents XML elements as {namespace}localname
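
The {namespace}localname convention can be taken apart with plain string operations. The following is a minimal sketch, not part of the original walkthrough: SAMPLE is a cut-down stand-in for feed.xml so the snippet runs without a file, and split_tag is a hypothetical helper name.

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into mark</title>
</feed>"""


def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname)."""
    if tag.startswith('{'):
        namespace, _, localname = tag[1:].partition('}')
        return namespace, localname
    return '', tag  # the element is not in any namespace


root = etree.fromstring(SAMPLE)  # fromstring() returns the root Element directly
namespace, localname = split_tag(root.tag)
print(namespace)  # http://www.w3.org/2005/Atom
print(localname)  # feed
```

Unlike etree.parse(), which returns an ElementTree object, etree.fromstring() hands back the root element itself, so no getroot() call is needed.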

2.2 Enumerating an element's children

In the ElementTree API, an element behaves like a list whose items are the element's children.

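>>> root                            # displays the element object itself
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>

>>> root.tag                        # the element's tag as a string; compare with the object display above
'{http://www.w3.org/2005/Atom}feed'

>>> len(root)                       # number of child elements (the element behaves like a list)
8

>>> root[4]                         # just like a list, it can be indexed
<Element {http://www.w3.org/2005/Atom}link at e181b0>

>>> for child in root:              # loop over the children and print each one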
...   print(child) 
... 
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>
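# The output shows that the root element has 8 children: the feed-level metadata (title, subtitle, id, updated, link) followed by 3 entry elements

An element behaves like a list of its children.

An XML document is a tree, so the code above is all that is needed to enumerate every element in it: apply the same iteration to each child in turn.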
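
To make the "enumerate the whole tree" idea concrete, here is a small recursive sketch. It is an illustration only: SAMPLE is a shortened stand-in for the feed and walk is a hypothetical helper name.

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry>
    <author><name>Mark</name></author>
    <title>Dive into history, 2009 edition</title>
  </entry>
</feed>"""


def walk(element, depth=0):
    """Yield (depth, localname) for element and all of its descendants."""
    yield depth, element.tag.split('}')[-1]  # drop the {namespace} prefix
    for child in element:  # iterating an element yields its direct children
        yield from walk(child, depth + 1)


root = etree.fromstring(SAMPLE)
outline = list(walk(root))
for depth, name in outline:
    print('  ' * depth + name)
```

If the nesting depth is not needed, root.iter() from the standard library visits the same elements in the same document order.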

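2.3 Getting an element's attributes

XML is not just a collection of elements; each element also has its own set of attributes. Once you hold a reference to an element, you can read its attributes as easily as reading a Python dictionary.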

>>> root.attrib                      # the source line is: <feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
# On the root element the attribute key carries its namespace (xml:lang); the xmlns declaration itself is not listed. Compare with the plain keys below.

>>> root[4]                          # child root[4], the link element
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib                   # root[4] has 3 attributes; note how they are displayed
{'href': 'http://diveintomark.org/',
 'type': 'text/html',
 'rel': 'alternate'}
>>> root[3]                          # child root[3], the updated element
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib                   # root[3] has no attributes
{}

The attrib property is a dictionary.
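
Because attrib is a dictionary, the usual dictionary idioms apply, and Element.get() is a shortcut for attrib.get(). A minimal sketch, using an inlined stand-in (SAMPLE) for the feed:

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
  <updated>2009-03-27T21:56:07Z</updated>
</feed>"""

XML_NS = 'http://www.w3.org/XML/1998/namespace'  # namespace bound to the xml: prefix

root = etree.fromstring(SAMPLE)
link, updated = root[0], root[1]

href = link.get('href')               # same value as link.attrib['href']
missing = updated.get('href', 'n/a')  # a default instead of a KeyError
lang = root.get('{%s}lang' % XML_NS)  # namespaced attributes need the full key

print(href, missing, lang)
```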

2.4 Finding nodes (arbitrary elements) in an XML document

2.4.1 The findall method of an element

>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root.findall('{http://www.w3.org/2005/Atom}entry')    # findall() returns the children that match the query; note the format of the argument
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
# Found the 3 entry children of the root element

>>> root.tag
'{http://www.w3.org/2005/Atom}feed'
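>>> root.findall('{http://www.w3.org/2005/Atom}feed')     # the root element has no feed child (it is the feed element itself)
[]
>>> root.findall('{http://www.w3.org/2005/Atom}author')   # the root element has no author child; author is nested inside entry
[]

In other words, findall() searches only the direct children of the element it is called on. Now look at the following code:

>>> tree.findall('{http://www.w3.org/2005/Atom}entry')    # note that this is called on tree, the object returned by etree.parse()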
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
# tree.findall(path) is simply a shortcut for tree.getroot().findall(path)
# The root element does have 3 entry children

>>> tree.findall('{http://www.w3.org/2005/Atom}author')   # the root element has no author child
[]
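
Spelling out the full {namespace} prefix in every query gets tedious. findall() and find() also accept an optional mapping from prefixes to namespace URIs, so a query can be written with a short prefix. A minimal sketch follows; the prefix atom and the names NS and SAMPLE are choices made for this illustration, not something the feed defines.

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry><author><name>Mark</name></author><title>First</title></entry>
  <entry><author><name>Mark</name></author><title>Second</title></entry>
</feed>"""

NS = {'atom': 'http://www.w3.org/2005/Atom'}  # prefix chosen for this sketch

root = etree.fromstring(SAMPLE)

entries = root.findall('atom:entry', NS)          # direct children only
authors_direct = root.findall('atom:author', NS)  # author is nested deeper, so no match
names = [entry.find('atom:author/atom:name', NS).text for entry in entries]

print(len(entries), authors_direct, names)
```

The path 'atom:author/atom:name' walks two levels down from each entry, which is how nested elements can be reached even though a single step only looks at direct children.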

2.4.2 The find method of an element, which stops at the first match

The find() method returns the first matching element.

>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry')           # returns a list of entry elements (there are 3 entry children)
>>> len(entries)
3
>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title')  # find the title child of entries[0]
>>> title_element.text                                                     # .text is the text between <title> and </title>
'Dive into history, 2009 edition'

>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo')      # entries[0] has no foo child
>>> foo_element                                                            # find() returned None, so nothing is displayed
>>> type(foo_element)                                                      # None has the type NoneType
<class 'NoneType'>

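The code above shows that there are two distinct cases. If element.find('...') returns None, no matching child was found. If it returns an element that evaluates as false, a match was found, but the matched element has no children of its own.

In a boolean context, an ElementTree element object evaluates to False if it contains no children (that is, if len(element) is 0). This means that if element.find('...') does not test whether find() found a match; it tests whether the matched element has any children! To test whether find() returned an element, use if element.find('...') is not None. (Python 3.12 and later also emit a DeprecationWarning when an element is truth-tested, which is one more reason to use the explicit comparison.)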
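
The difference between "no match" and "a match with no children" is easiest to see side by side. The sketch below uses an inlined stand-in for the feed (SAMPLE) and a hypothetical helper, has_child; it inspects len() rather than truth-testing the element directly.

```python
import xml.etree.ElementTree as etree

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <title>Dive into history, 2009 edition</title>
  </entry>
</feed>"""


def has_child(element, tag):
    """Return True if element has a direct child with the given tag."""
    return element.find(tag) is not None  # explicit None check, not a truth test


root = etree.fromstring(SAMPLE)
entry = root.find(ATOM + 'entry')

title = entry.find(ATOM + 'title')  # a match, but <title> has no child elements
foo = entry.find(ATOM + 'foo')      # no match at all

print(title is not None, len(title))  # the match exists; it just has 0 children
print(foo is None)
print(has_child(entry, ATOM + 'title'), has_child(entry, ATOM + 'foo'))
```

Here len(title) is 0, which is exactly why a bare if title: would take the "not found" branch even though find() succeeded.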

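2.4.3 Finding an element directly (without nested lookups)

>>> all_links = tree.findall('.//{http://www.w3.org/2005/Atom}link')  # note the .// at the start of the path (a bare leading // is deprecated on a tree and rejected on an element)
>>> all_links                                                         # the .// tells findall(): "do not search only the direct
                                                                      # children; search at any nesting level."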
[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
 <Element {http://www.w3.org/2005/Atom}link at e2b570>,
 <Element {http://www.w3.org/2005/Atom}link at e2b480>,
 <Element {http://www.w3.org/2005/Atom}link at e2b5a0>]
>>> all_links[0].attrib                                              
{'href': 'http://diveintomark.org/',
 'type': 'text/html',
 'rel': 'alternate'}
>>> all_links[1].attrib                                              
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'type': 'text/html',
 'rel': 'alternate'}
>>> all_links[2].attrib
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',
 'type': 'text/html',
 'rel': 'alternate'}
>>> all_links[3].attrib
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
 'type': 'text/html',
 'rel': 'alternate'}
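
The same descendant search works on any element, not only on the tree object. As a sketch (SAMPLE is a shortened stand-in for the feed), the following collects every href in document order, once with a './/' path and once with Element.iter(), which also visits all descendants:

```python
import xml.etree.ElementTree as etree

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' href='http://diveintomark.org/'/>
  <entry>
    <link rel='alternate' href='http://diveintomark.org/archives/a'/>
  </entry>
  <entry>
    <link rel='alternate' href='http://diveintomark.org/archives/b'/>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# './/' means: search all descendants of the current element.
hrefs = [link.get('href') for link in root.findall('.//' + ATOM + 'link')]

# iter(tag) walks the same subtree and filters by tag.
hrefs_via_iter = [link.get('href') for link in root.iter(ATOM + 'link')]

print(hrefs)
```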

Open question: navigating from the top down is easy, but how do you navigate from the bottom up, for example to find an element's parent?
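
One possible answer, offered as a sketch rather than as the original author's solution: ElementTree elements do not store a reference to their parent, but a child-to-parent dictionary can be built in one pass over the tree. The names parent_map and SAMPLE are chosen for this illustration.

```python
import xml.etree.ElementTree as etree

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <author><name>Mark</name></author>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# Map every element to its parent; the root has no parent, so it is not a key.
parent_map = {child: parent for parent in root.iter() for child in parent}

name = root.find('.//' + ATOM + 'name')
author = parent_map[name]   # one level up
entry = parent_map[author]  # two levels up

print(author.tag, entry.tag)
```

The map reflects the tree at the time it was built, so it would need to be rebuilt after elements are added or removed.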

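3. Summary

Overall, ElementTree's findall() method is a very powerful feature, but its query language can be a little surprising. It is officially described as "limited support for XPath expressions." XPath is a W3C standard for querying XML documents. For basic queries, ElementTree's syntax is close enough to XPath, but if you already know XPath, the differences may annoy you. The third-party library lxml extends the ElementTree API to provide full XPath support.

Source: http://woodpecker.org.cn/diveintopython3/xml.html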
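
To give a feel for what the limited XPath subset does cover, here is a small sketch using an attribute predicate and the parent step (..). SAMPLE, NS and the atom prefix are assumptions made for this illustration.

```python
import xml.etree.ElementTree as etree

NS = {'atom': 'http://www.w3.org/2005/Atom'}  # prefix chosen for this sketch

# Hypothetical miniature of the feed, inlined so no file is needed.
SAMPLE = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <title>Dive into history, 2009 edition</title>
    <category term='docbook'/>
    <category term='html'/>
  </entry>
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <category term='accessibility'/>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# [@attrib='value'] keeps only elements whose attribute has that value.
html_categories = root.findall(".//atom:category[@term='html']", NS)

# '..' steps back up to the parent, here the entry that owns the category.
html_entries = root.findall("atom:entry/atom:category[@term='html']/..", NS)
html_titles = [entry.find('atom:title', NS).text for entry in html_entries]

print(len(html_categories), html_titles)
```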
