Parsing XML with Python (and creating XML)

shanliangliuxing

Reposted from: http://blog.csdn.net/hechaoyuyu/article/details/6534639

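Three libraries are commonly used to parse XML files in Python: xml.dom.minidom (bundled with Python since version 2.0), cElementTree (which depends on the ElementTree library), and lxml (built on two C libraries: libxml2 and libxslt).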

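When I tested their efficiency with a 2 KB XML file, there was no real difference in parse time, but memory use was 3.5 MB, 2.9 MB, and 4.7 MB respectively. With a 968 KB file, parse times were 0.44 s, 0.03 s, and 0.02 s, and memory use was 69.0 MB, 9.8 MB, and 12.1 MB. With a 23 MB file the gap became far more obvious: after calling xml.dom.minidom.parse(filename) the machine froze. When I finally managed to switch to a terminal and run top, CPU usage was close to 100% and memory had climbed to 1.2 GB and was still rising, so I had to kill the process. ET.parse(filename) took 0.91 s and used 171.2 MB; lxml.etree.parse(filename) took 0.37 s and used 193.7 MB.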

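So for small XML files, such as the XML files used by GUI toolkits, Python's built-in xml.dom.minidom is good enough. cElementTree uses the least memory, but besides cElementTree itself you also have to install the ElementTree module (since Python 2.5 both ship in the standard library under xml.etree). For larger XML files, xml.dom simply cannot cope; cElementTree uses less memory than lxml, but lxml has the shortest parse time.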

To parse very large XML files, say over 1 GB, use the iterparse method provided by ElementTree and lxml to parse iteratively.
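
As a rough illustration of iterative parsing, here is a minimal Python 3 sketch using the standard library's xml.etree.ElementTree.iterparse. The function name count_elements and the tiny in-memory sample are assumptions made for the example; a real use would pass a file name or an open file instead.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count <tag> elements in a stream without keeping their content in memory."""
    count = 0
    # "end" events fire once an element and all of its children have been read.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children once it has been handled.
            elem.clear()
    return count

# Hypothetical sample data standing in for a very large file.
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other/><item>c</item></root>")
item_count = count_elements(sample, "item")
print(item_count)
```

Note that elem.clear() only empties each handled element; the root still holds references to the emptied children, so for truly huge files you may also want to clear the root periodically.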

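I recently came across two more XML parsing libraries: xml.parsers.expat and pyRXP. In my tests, xml.parsers.expat is more efficient on very large XML files, mainly because it processes the input sequentially and does not retain the tags it has already handled, so it never builds a large in-memory representation of the XML.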
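
The sequential, callback-driven style of xml.parsers.expat can be sketched as follows (Python 3). The helper name count_start_tags and the sample input are assumptions for illustration only.

```python
import xml.parsers.expat

def count_start_tags(xml_bytes):
    """Count how often each tag name opens, using expat callbacks only."""
    counts = {}

    def on_start(name, attrs):
        # Called once per opening tag; nothing else about the element is stored.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    # The second argument marks this chunk as the final piece of input.
    parser.Parse(xml_bytes, True)
    return counts

tag_counts = count_start_tags(b"<root><item/><item/><other/></root>")
print(tag_counts)
```

For a file on disk, parser.ParseFile(open(path, "rb")) feeds the parser from the file object instead of from a single byte string.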

Reposted from: http://blog.sina.com.cn/s/blog_7d31bbee0101la2w.html

import logging
import urllib2
import xml.dom.minidom as Dom

def config__publish(self):
    # Build the XML document.
    # config and SLIDE_PUBLISH_URL come from the surrounding project and are not shown here.
    document = Dom.Document()
    resources = document.createElement("resources")
    resources.setAttribute("version", str(config.version))
    resources.setAttribute("channelId", "afdaf")
    resources.setAttribute("apkVersion", "afdaf")
    document.appendChild(resources)

    # Serialize the tree to a UTF-8 encoded byte string
    data = document.toprettyxml(indent="\t", newl="\n", encoding="utf-8")
    logging.info("Outgoing XML: %s" % data)

    # POST the XML to the publish URL
    url = SLIDE_PUBLISH_URL
    try:
        urllib2.urlopen(url, data)
    except Exception, e:
        logging.error("Error %s" % str(e), exc_info=True)
        return

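Reposted from: http://blog.csdn.net/dikatour/article/details/2031997

Today I wanted to create an XML file with Python. I looked around and found little material; almost everything covers parsing XML files with Python instead.
For example, I want to write the following content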
<?xml version="1.0" encoding="utf-8"?>
    <root>
        <book isbn="34909023">
            <author>
                dikatour
            </author>
        </book>
    </root>

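to the file xmlstuff.xml.
It turns out to be simple. The basic idea is this:
I use the XML DOM approach: first create an empty DOM tree in memory, then keep adding the nodes I need until the tree has the shape I want, and finally write it out to a file.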

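1. I use the xml.dom.minidom module to create the XML file
from xml.dom import minidom
2. Every XML file is a Document object, which represents the DOM tree in memory
doc = minidom.Document()

3. With the empty DOM tree in place, add the root node
rootNode = doc.createElement("root")
doc.appendChild(rootNode) # Note: per the Python library reference, createElement does not attach the node to the DOM tree; you have to append it yourself

4. Create the other nodes

5. Write the tree out to the XML file
doc.writexml(f, "\t", "\t", "\n", "utf-8")
The first argument, f, is the target file object. The second (indent) is the indentation written before the current node, here the root node. The third (addindent) is the extra indentation added for each level of child nodes. The fourth (newl) is the string used to end each line (pass "" and there are no line breaks, so the whole XML ends up on a single line :) ). The fifth (encoding) is the encoding named in the XML declaration. Only the first argument is required; the others are optional.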
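
To see what the writexml arguments do, here is a small Python 3 sketch that writes into an in-memory io.StringIO buffer instead of a real file. The buffer and the variable names are assumptions for the example; the rest of this post uses Python 2, where the layout of text nodes in the output can differ.

```python
import io
from xml.dom import minidom

# Build the same small tree as in the steps above.
doc = minidom.Document()
root = doc.createElement("root")
doc.appendChild(root)
book = doc.createElement("book")
book.setAttribute("isbn", "34909023")
root.appendChild(book)
author = doc.createElement("author")
author.appendChild(doc.createTextNode("dikatour"))
book.appendChild(author)

# indent: prefix for the root node; addindent: extra indent per nesting level;
# newl: line terminator; encoding: name written into the XML declaration.
buffer = io.StringIO()
doc.writexml(buffer, indent="", addindent="\t", newl="\n", encoding="utf-8")
xml_text = buffer.getvalue()
print(xml_text)
```

Passing newl="" and addindent="" instead puts the whole document on one line.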

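The final code is below (the program has little value in itself and only tests the idea; you would more likely define a simple class or function that serializes your data structure to an XML file):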
from xml.dom import minidom
import traceback

try:
    f = open("xmlstuff.xml", "w")

    try:
        doc = minidom.Document()

        rootNode = doc.createElement("root")
        doc.appendChild(rootNode)

        bookNode = doc.createElement("book")
        bookNode.setAttribute("isbn", "34909023")
        rootNode.appendChild(bookNode)

        authorNode = doc.createElement("author")
        bookNode.appendChild(authorNode)

        authorTextNode = doc.createTextNode("dikatour")
        authorNode.appendChild(authorTextNode)

        doc.writexml(f, "\t", "\t", "\n", "utf-8")
    except:
        traceback.print_exc()
    finally:
        f.close()

except IOError:
    print "open file failed"
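
The remark about wrapping this in a function that serializes a data structure could look roughly like the following Python 3 sketch. The function name dict_to_xml and the flat dict input are assumptions chosen for illustration, not part of the original post.

```python
from xml.dom import minidom

def dict_to_xml(root_name, fields):
    """Serialize a flat dict into an XML string, one child element per key."""
    doc = minidom.Document()
    root = doc.createElement(root_name)
    doc.appendChild(root)
    for key, value in fields.items():
        child = doc.createElement(key)
        child.appendChild(doc.createTextNode(str(value)))
        root.appendChild(child)
    # toxml() returns the whole document as one unindented string.
    return doc.toxml()

xml_string = dict_to_xml("book", {"isbn": "34909023", "author": "dikatour"})
print(xml_string)
```

Keys must be valid XML element names here; nested structures would need a recursive variant.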

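Summary:
    1. Goal (write a string of XML to a file) => obtain the XML string => DOM tree (minidom has a toxml method that outputs the DOM tree's XML as a string)
    2. Use the Library Reference in the Python 2.5 documentation (the manual installed together with Python), chapter 8 (Structured Markup Processing Tools). Consulting the manual matters a lot, as does reading a few concise Python books.

    3. Think it through. Once the logic is clear, even someone who, like me, knew nothing about manipulating XML with Python can get the feature done after a little research.

    4. Which goes to show how capable Python is :) gets job done..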

Reposted from: http://www.cnblogs.com/wangshide/archive/2011/10/29/2228936.html

Generating XML with Python

1. bookstore.py

#encoding:utf-8
'''
Generate an XML file from scratch as a DOM tree, following a given XML Schema.
'''
from xml.dom.minidom import Document

doc = Document()  # create the DOM document object

bookstore = doc.createElement('bookstore') # create the root element
bookstore.setAttribute('xmlns:xsi',"http://www.w3.org/2001/XMLSchema-instance") # declare the namespace
bookstore.setAttribute('xsi:noNamespaceSchemaLocation','bookstore.xsd') # reference the local XML Schema
doc.appendChild(bookstore)
############book:Python处理XML之Minidom################
book = doc.createElement('book')
book.setAttribute('genre','XML')
bookstore.appendChild(book)

title = doc.createElement('title')
title_text = doc.createTextNode('Python处理XML之Minidom') # write the element content
title.appendChild(title_text)
book.appendChild(title)

author = doc.createElement('author')
book.appendChild(author)
author_first_name = doc.createElement('first-name')
author_last_name  = doc.createElement('last-name')
author_first_name_text = doc.createTextNode('张')
author_last_name_text  = doc.createTextNode('三')
author.appendChild(author_first_name)
author.appendChild(author_last_name)
author_first_name.appendChild(author_first_name_text)
author_last_name.appendChild(author_last_name_text)
book.appendChild(author)

price = doc.createElement('price')
price_text = doc.createTextNode('28')
price.appendChild(price_text)
book.appendChild(price)
############book1:Python写网站之Django####################
book1 = doc.createElement('book')
book1.setAttribute('genre','Web')
bookstore.appendChild(book1)

title1 = doc.createElement('title')
title_text1 = doc.createTextNode('Python写网站之Django')
title1.appendChild(title_text1)
book1.appendChild(title1)

author1 = doc.createElement('author')
book1.appendChild(author1)
author_first_name1 = doc.createElement('first-name')
author_last_name1  = doc.createElement('last-name')
author_first_name_text1 = doc.createTextNode('李')
author_last_name_text1  = doc.createTextNode('四')
author1.appendChild(author_first_name1)
author1.appendChild(author_last_name1)
author_first_name1.appendChild(author_first_name_text1)
author_last_name1.appendChild(author_last_name_text1)
book1.appendChild(author1)

price1 = doc.createElement('price')
price_text1 = doc.createTextNode('40')
price1.appendChild(price_text1)
book1.appendChild(price1)

########### Write the DOM object doc to a file
f = open('bookstore.xml','w')
f.write(doc.toprettyxml(indent = '  '))
f.close()

2. bookstore.xsd

<?xml version="1.0" encoding="utf-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">

  <xsd:element name="bookstore" type="bookstoreType"/>

  <xsd:complexType name="bookstoreType">
    <xsd:sequence maxOccurs="unbounded">
      <xsd:element name="book" type="bookType"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="bookType">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="authorName"/>
      <xsd:element name="price" type="xsd:decimal"/>
    </xsd:sequence>
    <xsd:attribute name="genre" type="xsd:string"/>
  </xsd:complexType>

  <xsd:complexType name="authorName">
    <xsd:sequence>
      <xsd:element name="first-name" type="xsd:string"/>
      <xsd:element name="last-name" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>

3. The XML generated with Python minidom according to the XML Schema above

bookstore.xml

<?xml version="1.0" ?>
<bookstore xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="bookstore.xsd">
  <book genre="XML">
    <title>
      Python处理XML之Minidom
    </title>
    <author>
      <first-name>
        张
      </first-name>
      <last-name>
        三
      </last-name>
    </author>
    <price>
      28
    </price>
  </book>
  <book genre="Web">
    <title>
      Python写网站之Django
    </title>
    <author>
      <first-name>
        李
      </first-name>
      <last-name>
        四
      </last-name>
    </author>
    <price>
      40
    </price>
  </book>
</bookstore>
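
The bookstore script repeats the same create/append pattern for every text-bearing element. One way to shorten it is a small helper, sketched here in Python 3; the name add_text_element is an assumption for the example, and only the first book is rebuilt.

```python
from xml.dom.minidom import Document

def add_text_element(doc, parent, tag, text):
    """Create <tag>text</tag>, append it to parent, and return the new element."""
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(text))
    parent.appendChild(element)
    return element

doc = Document()
bookstore = doc.createElement("bookstore")
doc.appendChild(bookstore)

book = doc.createElement("book")
book.setAttribute("genre", "XML")
bookstore.appendChild(book)

add_text_element(doc, book, "title", "Python处理XML之Minidom")
author = doc.createElement("author")
book.appendChild(author)
add_text_element(doc, author, "first-name", "张")
add_text_element(doc, author, "last-name", "三")
add_text_element(doc, book, "price", "28")

# toxml() on an element serializes just that subtree, without indentation.
compact_xml = bookstore.toxml()
print(compact_xml)
```

The element order (title, author, price) follows the xsd:sequence in bookstore.xsd, which the schema requires.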

Official documentation:

http://docs.python.org/2/library/xml.dom.minidom.html

Local test code:

#!/usr/bin/env python
#coding: utf-8
'''
Created on 2013-10-12

@author: root
'''

import json
from xml.dom import minidom
from xml.dom.minidom import parse, parseString

def testjson():
    dd = {'id':'1001', 'result':False, 'errorcode':'101', 'errordesc':'Channel Authentication failed.'}
    return [json.dumps(dd)]

def testxml():
    #doc = minidom.Document()
    #rootNode = doc.createElement("SMS")
    #doc.appendChild(rootNode)
    
    
    try:
        
        try:
            doc = minidom.Document()
            
            rootNode = doc.createElement("root")
            doc.appendChild(rootNode)
            
            bookNode = doc.createElement("book")
            bookNode.setAttribute("isbn", "34909023")
            rootNode.appendChild(bookNode)
            
            authorNode = doc.createElement("author")
            bookNode.appendChild(authorNode)
            
            authorTextNode = doc.createTextNode("dikatour")
            authorNode.appendChild(authorTextNode)
            
            s = doc.toxml("utf-8")
            print type(s)
            print s
            s2 = doc.toprettyxml(indent = "\t",newl="\n", encoding="utf-8")
            print type(s2)
            print s2
        except Exception, e:
            print e
        finally:
            print 'false'
            
    except Exception:
        print "open file failed"
    
    #xml_str = '<?xml version="1.0" encoding="utf-8"?><SMS><RETURN>true</RETURN><ERROR></ERROR></SMS>'
    
    #dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
    #dom3 = parseString(xml_str)
    #print type(dom3), dom3
    
    #return [xml_str]
    pass

def testxml2():
    #doc = minidom.Document()
    #rootNode = doc.createElement("SMS")
    #doc.appendChild(rootNode)
    
    
    try:
        doc = minidom.Document()
        
        rootNode = doc.createElement("SMS")
        doc.appendChild(rootNode)
        
        node_return = doc.createElement('RETURN')
        node_return_text = doc.createTextNode('true') # write the element content
        node_return.appendChild(node_return_text)
        rootNode.appendChild(node_return)

        node_error = doc.createElement('ERROR')
        node_error_text = doc.createTextNode('') # write the element content
        node_error.appendChild(node_error_text)
        rootNode.appendChild(node_error)


        node_desc = doc.createElement('DESC')
        node_desc_text = doc.createTextNode('abcdefag设置命名空间') # write the element content
        node_desc.appendChild(node_desc_text)
        rootNode.appendChild(node_desc)
        
        
        
        
        s = doc.toxml("utf-8")
        print type(s)
        print s
        s2 = doc.toprettyxml(indent = "\t",newl="\n", encoding="utf-8")
        print type(s2)
        print s2
    except Exception, e:
        print e
    finally:
        print 'false'
    
    #xml_str = '<?xml version="1.0" encoding="utf-8"?><SMS><RETURN>true</RETURN><ERROR></ERROR></SMS>'
    
    #dom3 = parseString('<myxml>Some data<empty/> some more data</myxml>')
    #dom3 = parseString(xml_str)
    #print type(dom3), dom3
    
    #return [xml_str]
    pass

def main():
    s1 = testjson()
    print type(s1), s1
    
    testxml()
    testxml2()
    
    

if __name__ == '__main__':
    main()

Reposted from: http://my.oschina.net/mutour/blog/32364

Reading and writing XML with minidom in Python

Python 3.1 advanced [reading and writing XML with minidom]

http://www.cnblogs.com/itech/archive/2011/01/06/1924972.html

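I. XML support provided by Python
There are two industry-standard approaches to parsing XML: SAX and DOM. SAX (Simple API for XML) is event-driven: as the XML document is read sequentially, each element encountered triggers the corresponding handler function. DOM (Document Object Model) represents the whole XML document by building a tree; once the tree is built, you can traverse it and extract data through the interfaces DOM provides.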
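
To make the event-driven SAX model concrete, here is a minimal Python 3 sketch using the standard xml.sax module. The handler class NameCollector and the inline sample document are assumptions for the example.

```python
import xml.sax

class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_name = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "name":
            self._in_name = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_name:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name":
            self.names.append("".join(self._buffer).strip())
            self._in_name = False

handler = NameCollector()
sample = (b"<employees><employee><name>linux</name></employee>"
          b"<employee><name>windows</name></employee></employees>")
xml.sax.parseString(sample, handler)
print(handler.names)
```

Unlike DOM, nothing is kept except what the handler chooses to store, which is why SAX suits large inputs.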

Python also provides its own XML parsing approach, ElementTree, which is easier to use and faster than SAX and DOM.

Python's XML modules are:

1) xml.dom.minidom

2) xml.etree.ElementTree

3) xml.sax + xml.dom
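
For comparison with the minidom code later in this post, here is a short Python 3 sketch that reads the same kind of employee data with xml.etree.ElementTree. The inline XML string stands in for a file and is an assumption for the example.

```python
import xml.etree.ElementTree as ET

xml_text = """<employees>
  <employee><name>linux</name><age>30</age></employee>
  <employee><name>windows</name><age>20</age></employee>
</employees>"""

root = ET.fromstring(xml_text)
# findall() matches direct children by tag; findtext() returns a child's text.
employees = [(e.findtext("name"), int(e.findtext("age")))
             for e in root.findall("employee")]
print(employees)
```

To read from a file instead, ET.parse(path).getroot() gives the same kind of root element.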

II. XML example (employees.xml)

<?xml version="1.0" encoding="UTF-8"?>
<employees>
  <employee>
    <name>linux</name>
    <age>30</age>
  </employee>
  <employee>
    <name>windows</name>
    <age>20</age>
  </employee>
</employees>

III. Reading and writing XML with xml.dom.minidom

1) Parsing XML with xml.dom.minidom:

def TestMiniDom():
    from xml.dom import minidom
    doc = minidom.parse("employees.xml")

    # get root element: <employees/>
    root = doc.documentElement

    # get all children elements: <employee/> <employee/>
    employees = root.getElementsByTagName("employee")

    for employee in employees:
        print("-------------------------------------------")
        # element name : employee
        print(employee.nodeName)
        # element xml content : <employee><name>windows</name><age>20</age></employee>
        # basically equal to toprettyxml function
        print(employee.toxml())

        nameNode = employee.getElementsByTagName("name")[0]
        print(nameNode.childNodes)
        print(nameNode.nodeName + ":" + nameNode.childNodes[0].nodeValue)
        ageNode = employee.getElementsByTagName("age")[0]
        print(ageNode.childNodes)
        print(ageNode.nodeName + ":" + ageNode.childNodes[0].nodeValue)

        print("-------------------------------------------")
        # children nodes : \n is one text element
        # [
        # <DOM Text node "'\n    '">,
        # <DOM Element: name at 0xc9e490>,
        # <DOM Text node "'\n    '">,
        # <DOM Element: age at 0xc9e4f0>,
        # <DOM Text node "'\n  '">
        # ]
        for n in employee.childNodes:
            print(n)

TestMiniDom()

Output:

-------------------------------------------
employee
<employee>
    <name>linux</name>
    <age>30</age>
  </employee>
[<DOM Text node "'linux'">]
name:linux
[<DOM Text node "'30'">]
age:30
-------------------------------------------
<DOM Text node "'\n    '">
<DOM Element: name at 0xc9f590>
<DOM Text node "'\n    '">
<DOM Element: age at 0xc9f5f0>
<DOM Text node "'\n  '">
-------------------------------------------
employee
<employee>
    <name>windows</name>
    <age>20</age>
  </employee>
[<DOM Text node "'windows'">]
name:windows
[<DOM Text node "'20'">]
age:20
-------------------------------------------
<DOM Text node "'\n    '">
<DOM Element: name at 0xc9f6b0>
<DOM Text node "'\n    '">
<DOM Element: age at 0xc9f710>
<DOM Text node "'\n  '">

2) Generating XML with xml.dom.minidom:

def GenerateXml():
    import xml.dom.minidom
    impl = xml.dom.minidom.getDOMImplementation()
    dom = impl.createDocument(None, 'employees', None)
    root = dom.documentElement
    employee = dom.createElement('employee')
    root.appendChild(employee)

    nameE = dom.createElement('name')
    nameT = dom.createTextNode('linux')
    nameE.appendChild(nameT)
    employee.appendChild(nameE)

    ageE = dom.createElement('age')
    ageT = dom.createTextNode('30')
    ageE.appendChild(ageT)
    employee.appendChild(ageE)


    f = open('employees2.xml', 'w', encoding='utf-8')
    dom.writexml(f, addindent='  ', newl='\n', encoding='utf-8')
    f.close()

GenerateXml()

Output:

<?xml version="1.0" encoding="utf-8"?>
<employees>
  <employee>
    <name>
      linux
    </name>
    <age>
      30
    </age>
  </employee>
</employees>

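3) Points to note when using xml.dom.minidom

*parse() and createDocument() both return a DOM object;
*the DOM's documentElement attribute gives you the root element;
*the DOM is a tree made up of many nodes. An element is one kind of node and can contain child elements; a textNode is also a kind of node and is always a leaf;
*every node has nodeName, nodeValue, and nodeType attributes. nodeValue is the node's value; it is None for elements and carries the content for text nodes. To get the text content of a textNode you can use its .data attribute.
*nodeType is the node's type. The types currently defined are:

'ATTRIBUTE_NODE', 'CDATA_SECTION_NODE', 'COMMENT_NODE', 'DOCUMENT_FRAGMENT_NODE',
'DOCUMENT_NODE', 'DOCUMENT_TYPE_NODE', 'ELEMENT_NODE', 'ENTITY_NODE', 'ENTITY_REFERENCE_NODE',
'NOTATION_NODE', 'PROCESSING_INSTRUCTION_NODE', 'TEXT_NODE'

*getElementsByTagName() finds descendant elements by name;
*childNodes returns all child nodes. All text is represented as textNodes, including the '\n\r' and spaces between elements;
*in writexml(), addindent=' ' sets the indentation of child elements, newl='\n' sets the line break between elements, and encoding='utf-8' sets the encoding declared in the generated XML (<?xml version="1.0" encoding="utf-8"?>).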
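
The notes on childNodes, nodeType, and .data can be tried out with a short Python 3 sketch. The inline XML and the variable names are assumptions for the example.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    "<employee>\n  <name>linux</name>\n  <age>30</age>\n</employee>"
)
employee = doc.documentElement

# childNodes includes the whitespace between tags as text nodes.
all_children = len(employee.childNodes)

# Keep only real elements by checking nodeType.
elements = [n for n in employee.childNodes if n.nodeType == Node.ELEMENT_NODE]

# Each element's first child is its text node; .data holds the text.
record = {e.nodeName: e.firstChild.data for e in elements}
print(all_children, record)
```

Filtering on nodeType is a simple way to skip the whitespace text nodes that show up between elements in indented files.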

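Reposted from: http://blog.sina.com.cn/s/blog_714c124f01010y72.html

Encoding problems when parsing XML with Python

Running into encoding problems in Python is a painful experience.

When processing XML with Python, the first problem you hit is encoding.

Python's XML parser (expat, which minidom uses) does not support gb2312, so XML files declaring encoding="gb2312" or encoding="utf8" raise errors. The encoding of the file Python reads can itself also cause an exception; in that case you need to specify the encoding when opening the file. On top of that there is the Chinese text contained in the XML nodes. By default Python parses XML encoded as "UTF-8" and "UTF-16", hence the encoding error!

My fix here is fairly simple: just change the encoding declaration in the XML header. The methods are as follows:

Method one:

The file test.xml contains the following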

<?xml version="1.0" encoding="gbk"?>
<toplist>

…………….

</toplist>

Now parse the file's contents with Python.

Parsing with minidom

xmldoc = minidom.parse(file_name)

produces this error

xml.parsers.expat.ExpatError: unknown encoding: line 1, column 30

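A quick search shows that minidom does not support the gbk encoding, so a gbk-encoded file is bound to fail. So convert the file to utf-8

(command: iconv -f gbk -t utf-8 filename -o filename_new)

Parsing the new utf-8 encoded file still fails. Why?

The reason is the <?xml version="1.0" encoding="gbk"?> line: minidom reads this declaration at a low level, so it is not enough to convert the file to utf-8; the declaration must also be changed to <?xml version="1.0" encoding="utf-8"?>. Try again and everything works.

Here is some code that does this

from xml.dom import minidom

file_xml = open(file_name, "r").read()
# Point the declaration at utf-8, then transcode the gbk bytes to utf-8 (Python 2)
file_xml = file_xml.replace('<?xml version="1.0" encoding="gbk"?>', '<?xml version="1.0" encoding="utf-8"?>')
file_xml = unicode(file_xml, encoding='gbk').encode('utf-8')
xmldoc = minidom.parseString(file_xml)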
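
The same idea can be sketched for Python 3, where bytes and text are separate types. The function name parse_gbk_xml, the regular expression, and the sample document are assumptions made for this example rather than part of the original post.

```python
import re
from xml.dom import minidom

# Matches the encoding pseudo-attribute inside the XML declaration.
DECL_ENCODING = re.compile(r"""(<\?xml[^>]*encoding=)(["'])[^"']*\2""")

def parse_gbk_xml(raw_bytes):
    """Parse GBK-encoded XML bytes by transcoding them to UTF-8 first."""
    text = raw_bytes.decode("gbk")
    # Rewrite only the declared encoding name, keeping the original quote style.
    text = DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    return minidom.parseString(text.encode("utf-8"))

raw = '<?xml version="1.0" encoding="gbk"?><toplist><item>中文</item></toplist>'.encode("gbk")
doc = parse_gbk_xml(raw)
item_text = doc.getElementsByTagName("item")[0].firstChild.data
print(item_text)
```

Both steps from the explanation above are present: the bytes are transcoded, and the declaration is updated to match.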

Method two:

I found some material online and wrote a script that changes the XML file to whatever encoding you want.

#!/usr/bin/env python

import os, sys
import re

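def replaceXmlEncoding(filepath, oldEncoding='gb2312', newEncoding='utf-8'):
    # Read the whole file and swap the encoding name in its text.
    # Note: this only rewrites the name (for example in the XML declaration);
    # it does not transcode the file's bytes.
    f = open(filepath, mode='r')
    content = f.read()
    content = re.sub(oldEncoding, newEncoding, content)
    f.close()

    # Write the modified text back to the same file.
    f = open(filepath, mode='w')
    f.write(content)
    f.close()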

if __name__ == "__main__":
    replaceXmlEncoding('./ActivateAccount.xml')

Method three:

I found some related solutions on the official site and the official forum:

Encoding Declaration

[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.（http://www.w3.org/TR/2006/REC-xml11-20060816/#NT-EncodingDecl）

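I did not fully understand it; in essence it means changing the encoding declaration, which widens the set of formats the parser can recognize.

Method four:

The simplest approach is to change the XML to the parser's default encoding. But if you cannot change it, or there are too many files to change one by one, or the XML was generated in some other encoding by default, changing each file is a hassle. In those cases the methods above also solve the problem.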