How to parse XML documents in Python

jj_liuxin

Article link: https://blog.csdn.net/jj_liuxin/article/details/3563643

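XML processing is both important and common in real applications, and there are many ways to do it. This article aims at general-purpose XML handling but, to keep things simple, covers only the xml.dom.minidom module from the Python standard library.
xml.dom.minidom is a lightweight yet practical DOM interface for working with XML in Python.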

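1) Creating an XML object
XML work usually starts by creating an XML object. With minidom this is simple, and three kinds of input are accepted: a file name, a file object, or a string.
For example: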

 
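# -*- coding: utf-8 -*-
from xml.dom.minidom import parse, parseString

fileName = 'example.xml'
dom1 = parse(fileName)  # parse an XML file by name

# Open in binary mode so the parser honours the encoding declared in the file,
# and let the with block close the file afterwards.
with open(fileName, 'rb') as datasource:
    dom2 = parse(datasource)  # parse an open file object

# parse XML held in a string
dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')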
 

The XML document (example.xml) is as follows:

 
<?xml version='1.0' encoding='utf-8'?>
<parent>
    <childs name='childs'>
        <child name='1' />
        <child name='2' />
    </childs>
</parent>
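
Once a document like this has been parsed, its elements and attributes can be read through the DOM. The following is a minimal sketch, not part of the original article: it inlines the same XML as a string (the name XML_TEXT is only illustrative) so that it runs without example.xml on disk.

```python
from xml.dom.minidom import parseString

# Same content as example.xml, inlined so the sketch runs on its own.
# XML_TEXT is an illustrative name, not something minidom defines.
XML_TEXT = """<?xml version='1.0' encoding='utf-8'?>
<parent>
    <childs name='childs'>
        <child name='1' />
        <child name='2' />
    </childs>
</parent>
"""

dom = parseString(XML_TEXT)
root = dom.documentElement                     # the <parent> element
children = dom.getElementsByTagName('child')   # exact tag match, so <childs> is excluded
names = [child.getAttribute('name') for child in children]
print(root.tagName, names)
dom.unlink()
```

getElementsByTagName searches the whole subtree, so the nesting inside childs does not have to be walked by hand.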
 

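As the string passed to parseString above shows, an XML object built from a string does not need the XML declaration on the first line. A declaration is still accepted, but only when it is the very first thing in the string. If anything precedes it, even a single space or a newline (which easily happens with a triple-quoted string that starts on a new line), the parser raises this exception: parser.Parse(string, True) xml.parsers.expat.ExpatError: XML or text declaration not at start of entity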
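
The "XML or text declaration not at start of entity" error depends on where the declaration sits in the string. The sketch below illustrates this under the assumption of a current Python 3 interpreter; the variable names good and bad are only illustrative.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

good = "<?xml version='1.0' encoding='utf-8'?><parent/>"
bad = "\n" + good  # a newline now precedes the declaration

# The declaration is the first thing in the string, so this parses.
dom = parseString(good)
root_name = dom.documentElement.tagName
dom.unlink()

# The same text with leading whitespace is rejected by expat.
try:
    parseString(bad)
    failed = False
except ExpatError as exc:
    failed = True
    print("ExpatError:", exc)
```

Calling lstrip() on the text, or writing the triple-quoted literal as """\ so that it does not begin with a newline, avoids the problem.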

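Specifically, the call signatures are:
1>xml.dom.minidom.parse(filename_or_file[, parser[, bufsize]])
2>xml.dom.minidom.parseString(string[, parser])
Both functions return a Document object. The optional parser argument, if given, must be a SAX2 parser object; when it is omitted, minidom uses its default parser.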
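
As a rough illustration of the optional second argument, the sketch below passes a SAX2 parser created with xml.sax.make_parser() to parseString. It assumes a current Python 3 standard library; the default parser is normally sufficient, so this only matters when a specially configured SAX parser is needed.

```python
import xml.sax
from xml.dom.minidom import parseString

# A SAX2 parser object from the standard library.
sax_parser = xml.sax.make_parser()

# Hand the parser to minidom instead of relying on its default one.
dom = parseString('<myxml>Some data<empty/> some more data</myxml>', sax_parser)
root_name = dom.documentElement.tagName
print(root_name)
dom.unlink()
```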

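Note: once you are finished with a DOM tree, remember to release it by calling the unlink() method of the dom object. The tree contains circular references (each child points back to its parent), and some early versions of Python had no garbage collector for reference cycles. Current versions do collect cycles, but unlink() still frees the nodes promptly instead of waiting for the collector.
For example: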

 
dom1.unlink()
dom2.unlink()
dom3.unlink()
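
In Python 3.2 and later a Document can also be used in a with statement, which calls unlink() automatically when the block exits. A minimal sketch, reusing the string from the first example:

```python
from xml.dom.minidom import parseString

# The document is unlinked automatically when the with block is left.
with parseString('<myxml>Some data<empty/> some more data</myxml>') as dom:
    root_name = dom.documentElement.tagName
    print(root_name)

# After the block the tree has been torn down, so no children remain.
remaining = len(dom.childNodes)
```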
 

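2) xml.dom.minidom and the DOM Level 1 standard
xml.dom.minidom is a Python implementation of the DOM standard recommended by the W3C, but the two are not identical. Specifically, minidom adds the following methods, which are not part of the DOM standard:
1>node.unlink()
2>node.writexml(writer[, indent=""[, addindent=""[, newl=""[, encoding=None]]]])
3>node.toxml([encoding])
4>node.toprettyxml([indent="\t"[, newl="\n"[, encoding=None]]])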
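
The serialization methods listed above can be tried on a small document. This is a minimal sketch assuming Python 3; the variable names are illustrative.

```python
import io
from xml.dom.minidom import parseString

dom = parseString("<parent><child name='1'/></parent>")
root = dom.documentElement

# toxml() on an element returns a str without an XML declaration.
compact = root.toxml()

# With an encoding argument, toxml() on the document returns bytes.
as_bytes = dom.toxml(encoding="utf-8")

# toprettyxml() adds newlines and indentation for human readers.
pretty = dom.toprettyxml(indent="  ")

# writexml() writes into any object that has a write() method.
buffer = io.StringIO()
dom.writexml(buffer, addindent="  ", newl="\n")
written = buffer.getvalue()

print(compact)
print(pretty)
dom.unlink()
```

toprettyxml() inserts whitespace text into the output, so it is better suited to display than to round-tripping data.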

Below is the example given in the Python documentation. It is short and typical, so it is reproduced here.

 
import xml.dom.minidom

# The backslash after the opening quotes keeps the string from starting
# with a newline.
document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""

dom = xml.dom.minidom.parseString(document)


def getText(nodelist):
    # Concatenate the data of every text node in nodelist.
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc


def handleSlideshow(slideshow):
    # Render the whole slideshow as a simple HTML page.
    print("<html>")
    handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
    slides = slideshow.getElementsByTagName("slide")
    handleToc(slides)
    handleSlides(slides)
    print("</html>")


def handleSlides(slides):
    for slide in slides:
        handleSlide(slide)


def handleSlide(slide):
    handleSlideTitle(slide.getElementsByTagName("title")[0])
    handlePoints(slide.getElementsByTagName("point"))


def handleSlideshowTitle(title):
    print("<title>%s</title>" % getText(title.childNodes))


def handleSlideTitle(title):
    print("<h2>%s</h2>" % getText(title.childNodes))


def handlePoints(points):
    print("<ul>")
    for point in points:
        handlePoint(point)
    print("</ul>")


def handlePoint(point):
    print("<li>%s</li>" % getText(point.childNodes))


def handleToc(slides):
    # Table of contents: one paragraph per slide title.
    for slide in slides:
        title = slide.getElementsByTagName("title")[0]
        print("<p>%s</p>" % getText(title.childNodes))


handleSlideshow(dom)
 

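In addition, xml.dom.minidom leaves some DOM interfaces unimplemented. The list below comes from early versions of the Python documentation; most of these interfaces have since been added, and the documentation for current Python versions names only DOMTimeStamp and EntityReference as unimplemented: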

DOMTimeStamp
DocumentType
DOMImplementation
CharacterData
CDATASection
Notation
Entity
EntityReference
DocumentFragment
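
Which of these interfaces exist depends on the Python version in use. As a quick, informal check (a sketch only, using hasattr on the module rather than any official API), the names can be looked up in xml.dom.minidom directly:

```python
import xml.dom.minidom as minidom

# Interface names taken from the list above.
names = [
    "DOMTimeStamp", "DocumentType", "DOMImplementation", "CharacterData",
    "CDATASection", "Notation", "Entity", "EntityReference", "DocumentFragment",
]

# True if the installed minidom module defines a class with that name.
available = {name: hasattr(minidom, name) for name in names}

for name, present in available.items():
    print(name, "implemented" if present else "not found")
```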
