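xmltodict: converting between XML and JSON
Source code: https://github.com/martinblech/xmltodict
Development work often calls for converting between strings, XML, JSON, and dict objects. xmltodict, together with the methods shown here, covers all of these conversions.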
XML file conversion workflow
Note: the code below only illustrates the logic and cannot be run as-is.
import os
import time
import json
import lxml
from lxml import etree
import xmltodict, sys, gc
# Parse the XML file incrementally, one element at a time
context = etree.iterparse(osmfile,tag=["node","way","relation"])
fast_iter(context, process_element, maxline)
...
# Convert the XML element to a string
elem_data = etree.tostring(elem)
# Build a dict object from the string
elem_dict = xmltodict.parse(elem_data)
# Produce a JSON string from the dict
elem_jsonStr = json.dumps(elem_dict)
# Produce a JSON object from the JSON string
json_obj = json.loads(elem_jsonStr)
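The fast_iter helper called above is not defined in this post. The following is a minimal sketch of the usual lxml pattern (process an element, then clear it and drop its already-processed siblings). The signature, the max_items parameter, and the sample document are assumptions for illustration, not the author's actual code.

```python
from io import BytesIO

from lxml import etree


def fast_iter(context, func, max_items=None):
    """Call func(elem) for each element from iterparse, freeing memory as we go."""
    count = 0
    for _event, elem in context:
        func(elem)
        count += 1
        # Drop the element's children and text once it has been processed.
        elem.clear()
        # Also drop already-processed siblings still attached to the parent.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
        if max_items is not None and count >= max_items:
            break
    return count


# Tiny OSM-like document standing in for a real osmfile (hypothetical data).
sample = (b"<osm><node id='1'/><bounds/>"
          b"<way id='2'><nd ref='1'/></way><node id='3'/></osm>")
seen = []
context = etree.iterparse(BytesIO(sample), events=("end",), tag=("node", "way"))
processed = fast_iter(context, lambda elem: seen.append((elem.tag, elem.get("id"))))
print(processed, seen)
```

The callback has to read everything it needs from the element before returning, because the element is cleared immediately afterwards.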
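Parsing XML incrementally
Reading structured XML data incrementally with etree (low resource usage): http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
Library for converting XML strings into JSON objects: https://github.com/martinblech/xmltodict
In its output, xmltodict.parse() prefixes attribute names with @ and stores element text under #text. These characters cause problems in Spark queries and need to be removed. The following setting does that: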
xmltodict.parse(elem_data,attr_prefix="",cdata_key="")
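To see what this setting changes, here is a small comparison on a made-up element. The sample XML and the alternative cdata_key="text" are assumptions for illustration. With cdata_key="" the element text should end up under an empty-string key, which may itself be awkward as a column name, so a plain word is shown as one possible alternative.

```python
import json

import xmltodict

# Hypothetical sample element with two attributes and some text.
xml = '<node id="1" lat="52.5">hello</node>'

# Default output: attributes get "@", text goes under "#text".
default = xmltodict.parse(xml)

# The setting from this post: no attribute prefix, empty text key.
stripped = xmltodict.parse(xml, attr_prefix="", cdata_key="")

# A variant (assumption): keep a readable key for the element text.
named = xmltodict.parse(xml, attr_prefix="", cdata_key="text")

print(json.dumps(default))
print(json.dumps(stripped))
print(json.dumps(named))
```

Without a prefix, an attribute and a child element that share a name map to the same key, so this setting fits best when the input schema has no such overlap.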
Encoding and recovering malformed XML files
As follows:
from io import BytesIO
magical_parser = lxml.etree.XMLParser(encoding='utf-8', recover=True)
tree = etree.parse(BytesIO(your_xml_string.encode('utf-8')), magical_parser) # or pass in an open file object
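A quick way to see what recover=True buys you is to feed both parsers a truncated document. This is only a sketch on a made-up input; how much of a damaged file can be salvaged depends on where and how it is broken.

```python
from io import BytesIO

from lxml import etree

# Hypothetical truncated document: the closing </root> is missing.
broken = b"<root><a>1</a><b>2</b>"

# The default (strict) parser rejects the document.
try:
    etree.parse(BytesIO(broken))
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# A recovering parser keeps whatever it managed to read.
parser = etree.XMLParser(encoding="utf-8", recover=True)
tree = etree.parse(BytesIO(broken), parser)
root = tree.getroot()

print(strict_ok, root.tag, root.findtext("a"))
```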
First convert the element to a string, then build a dict from it, and finally call json.dumps() to produce a JSON string.
elem_data = etree.tostring(elem)
elem_dict = xmltodict.parse(elem_data)
elem_jsonStr = json.dumps(elem_dict)
Calling json.loads(elem_jsonStr) then gives a JSON object (a Python dict) that you can work with programmatically.
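Put together, the chain element -> string -> dict -> JSON string -> object can be sketched as one small function. The function name element_to_json and the sample element are assumptions made for this example.

```python
import json

from lxml import etree
import xmltodict


def element_to_json(elem):
    """Serialize an lxml element to a JSON string via xmltodict."""
    elem_data = etree.tostring(elem)         # element -> bytes
    elem_dict = xmltodict.parse(elem_data)   # bytes -> dict
    return json.dumps(elem_dict)             # dict -> JSON string


# Hypothetical OSM-style element.
elem = etree.fromstring('<way id="7"><tag k="highway" v="primary"/></way>')
elem_json_str = element_to_json(elem)
json_obj = json.loads(elem_json_str)         # JSON string -> Python object

print(json_obj["way"]["tag"]["@v"])
```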
Using xmltodict
xmltodict is a Python module that makes working with XML feel like you are working with JSON, as in this "spec":
>>> print(json.dumps(xmltodict.parse("""
...  <mydocument has="an attribute">
...    <and>
...      <many>elements</many>
...      <many>more elements</many>
...    </and>
...    <plus a="complex">
...      element as well
...    </plus>
...  </mydocument>
...  """), indent=4))
{
    "mydocument": {
        "@has": "an attribute",
        "and": {
            "many": [
                "elements",
                "more elements"
            ]
        },
        "plus": {
            "@a": "complex",
            "#text": "element as well"
        }
    }
}
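One consequence of this mapping is worth a short sketch: "many" is a list above only because the tag repeats. With a single occurrence the value is a plain string, which can surprise code that always expects a list. The force_list parameter of parse() is one way to pin this down; the sample documents below are made up for illustration.

```python
import xmltodict

# Hypothetical inputs: the same tag occurring once and twice.
one = "<and><many>elements</many></and>"
two = "<and><many>elements</many><many>more elements</many></and>"

single = xmltodict.parse(one)["and"]["many"]
repeated = xmltodict.parse(two)["and"]["many"]

# force_list names the tags that should always be wrapped in a list.
forced = xmltodict.parse(one, force_list=("many",))["and"]["many"]

print(single)
print(repeated)
print(forced)
```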
Namespace support
By default, xmltodict does no XML namespace processing (it just treats namespace declarations as regular node attributes), but passing process_namespaces=True will make it expand namespaces for you:
>>> xml = """
... <root xmlns="http://defaultns.com/"
...       xmlns:a="http://a.com/"
...       xmlns:b="http://b.com/">
...   <x>1</x>
...   <a:y>2</a:y>
...   <b:z>3</b:z>
... </root>
... """
>>> xmltodict.parse(xml, process_namespaces=True) == {
... 'http://defaultns.com/:root': {
... 'http://defaultns.com/:x': '1',
... 'http://a.com/:y': '2',
... 'http://b.com/:z': '3',
... }
... }
True
It also lets you collapse certain namespaces to shorthand prefixes, or skip them altogether:
>>> namespaces = {
... 'http://defaultns.com/': None, # skip this namespace
... 'http://a.com/': 'ns_a', # collapse "http://a.com/" -> "ns_a"
... }
>>> xmltodict.parse(xml, process_namespaces=True, namespaces=namespaces) == {
... 'root': {
... 'x': '1',
... 'ns_a:y': '2',
... 'http://b.com/:z': '3',
... },
... }
True
Streaming mode
xmltodict is very fast (Expat-based) and has a streaming mode with a small memory footprint, suitable for big XML dumps like Discogs or Wikipedia:
>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>>
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...
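The same streaming call can be tried without downloading a dump. The sketch below uses a small in-memory document in place of discogs_artists.xml.gz; the sample data and the names list are assumptions for illustration. The callback is invoked once per element at depth 2 and has to return True for parsing to continue.

```python
from io import BytesIO

import xmltodict

# Hypothetical stand-in for a large dump: two <artist> items at depth 2.
xml = (b"<artists>"
       b"<artist><name>King Crimson</name></artist>"
       b"<artist><name>Chris Potter</name></artist>"
       b"</artists>")

names = []


def handle_artist(_path, artist):
    # Called once per depth-2 item; only that item is held in memory.
    names.append(artist["name"])
    return True  # returning a falsy value stops the parse


xmltodict.parse(BytesIO(xml), item_depth=2, item_callback=handle_artist)
print(names)
```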
It can also be used from the command line to pipe objects to a script like this:
import sys, marshal
while True:
    try:
        # marshal reads binary data, so use the underlying byte stream
        _, article = marshal.load(sys.stdin.buffer)
    except EOFError:
        break
    print(article['title'])
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | myscript.py
AccessibleComputing
Anarchism
AfghanistanHistory
AfghanistanGeography
AfghanistanPeople
AfghanistanCommunications
Autism
...
Or just cache the dicts so you don't have to parse that big XML file again. You do this only once:
$ cat enwiki-pages-articles.xml.bz2 | bunzip2 | xmltodict.py 2 | gzip > enwiki.dicts.gz
And you reuse the dicts with every script that needs them:
$ cat enwiki.dicts.gz | gunzip | script1.py
$ cat enwiki.dicts.gz | gunzip | script2.py
...
Roundtripping
You can also convert in the other direction, using the unparse() method:
>>> mydict = {
... 'response': {
... 'status': 'good',
... 'last_updated': '2014-02-16T23:10:12Z',
... }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<response>
	<status>good</status>
	<last_updated>2014-02-16T23:10:12Z</last_updated>
</response>
Text values for nodes can be specified with the cdata_key key in the Python dict, while node properties can be specified with the attr_prefix prefixed to the key name in the Python dict. The default value for attr_prefix is @ and the default value for cdata_key is #text.
>>> import xmltodict
>>>
>>> mydict = {
... 'text': {
... '@color':'red',
... '@stroke':'2',
... '#text':'This is a test'
... }
... }
>>> print(xmltodict.unparse(mydict, pretty=True))
<?xml version="1.0" encoding="utf-8"?>
<text color="red" stroke="2">This is a test</text>
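A simple way to check a roundtrip is to parse the generated XML again and compare it with the original dict. This is a sketch using the same dict as above; it leaves out pretty=True so that no extra whitespace enters the comparison.

```python
import xmltodict

mydict = {
    'text': {
        '@color': 'red',
        '@stroke': '2',
        '#text': 'This is a test',
    }
}

# dict -> XML string
xml_out = xmltodict.unparse(mydict)

# XML string -> dict again; should match what we started with
restored = xmltodict.parse(xml_out)

print(xml_out)
print(restored == mydict)
```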
Ok, how do I get it?
Using pypi
You just need to:
$ pip install xmltodict
RPM-based distro (Fedora, RHEL, …)
There is an official Fedora package for xmltodict.
$ sudo yum install python-xmltodict
Arch Linux
There is an official Arch Linux package for xmltodict.
$ sudo pacman -S python-xmltodict
Debian-based distro (Debian, Ubuntu, …)
There is an official Debian package for xmltodict.
$ sudo apt install python-xmltodict