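There is a good answer about using xml.etree.ElementTree.iterparse on large XML files. lxml has the same method. The key to stream parsing with iterparse is to manually clear and remove the nodes you have already processed; otherwise you will eventually run out of memory.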
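As a rough illustration of that clean-up step, here is a minimal sketch using only the standard library. The helper name iter_entries, the entry/a/b element names, and the sample foo values are assumptions made for the example, not something the linked answer prescribes.

```python
import io
import xml.etree.ElementTree as ET


def iter_entries(source):
    """Yield one dict per <entry>, dropping each element once it is consumed."""
    # Request 'start' events as well, so the first event hands us the root.
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)
    for event, elem in context:
        if event == 'end' and elem.tag == 'entry':
            yield {'a': elem.findtext('a'), 'b': elem.find('b').get('foo')}
            # Detach the processed <entry> from the root; without this the
            # in-memory tree keeps growing for as long as the input lasts.
            root.remove(elem)


# Hypothetical sample data; a file path or a binary file object works as well.
sample = io.BytesIO(
    b"<root>"
    b"<entry><a>value 0</a><b foo='x'/></entry>"
    b"<entry><a>value 1</a><b foo='y'/></entry>"
    b"</root>"
)
entries = list(iter_entries(sample))
print(entries)
```

Removing the element from its parent is what actually releases it; calling elem.clear() alone empties the element but leaves the empty shell attached to the root.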
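Another option is xml.sax. I find the official manual too formal and short on examples, so it is worth spelling out here. The default parser module, xml.sax.expatreader, implements the incremental parsing interface xml.sax.xmlreader.IncrementalParser. In other words, xml.sax.make_parser() returns a parser that is suitable for streams.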
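A quick way to check that claim is to feed the default parser a document in arbitrary pieces. This is only a sketch: TagCollector is a hypothetical handler written for the example, and the chunk boundaries are chosen to show that they may fall in the middle of a tag.

```python
import xml.sax
from xml.sax.xmlreader import IncrementalParser


class TagCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records element names as they open."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


parser = xml.sax.make_parser()
is_incremental = isinstance(parser, IncrementalParser)

collector = TagCollector()
parser.setContentHandler(collector)

# Chunk boundaries may fall anywhere, even inside a tag name.
for chunk in ('<root><en', 'try><a>value 0</a>', '</entry></root>'):
    parser.feed(chunk)
parser.close()  # tell the parser that the input is complete

print(is_incremental, collector.tags)
```

The handler callbacks fire as soon as the parser has seen enough input, so nothing needs to be buffered beyond the current chunk.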
For example, given an XML stream like:
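    <root>
      <entry><a>value 0</a><b foo="..."/></entry>
      <entry><a>value 1</a><b foo="..."/></entry>
      <entry><a>value 2</a><b foo="..."/></entry>
      ...
    </root>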
it can be handled as follows.
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import time
    import xml.sax
    import xml.sax.handler


    class StreamHandler(xml.sax.handler.ContentHandler):

        lastEntry = None
        lastName = None

        def startElement(self, name, attrs):
            self.lastName = name
            if name == 'entry':
                self.lastEntry = {}
            elif name != 'root':
                self.lastEntry[name] = {'attrs': attrs, 'content': ''}

        def endElement(self, name):
            # text that follows a closing tag does not belong to that element
            self.lastName = None
            if name == 'entry':
                print({
                    'a': self.lastEntry['a']['content'],
                    'b': self.lastEntry['b']['attrs'].getValue('foo')
                })
                self.lastEntry = None
            elif name == 'root':
                # the document is complete, so stop feeding the parser
                raise StopIteration

        def characters(self, content):
            # ignore text that is not inside a child element of an entry
            if self.lastEntry and self.lastName in self.lastEntry:
                self.lastEntry[self.lastName]['content'] += content


    if __name__ == '__main__':
        # use default ``xml.sax.expatreader``
        parser = xml.sax.make_parser()
        parser.setContentHandler(StreamHandler())

        # feed the parser with small chunks to simulate a stream
        with open('data.xml') as f:
            while True:
                buffer = f.read(16)
                if buffer:
                    try:
                        parser.feed(buffer)
                    except StopIteration:
                        break
                else:
                    # no new data yet, wait for more to be appended
                    time.sleep(2)

        # if you can provide a file-like object it's as simple as
        with open('data.xml') as f:
            try:
                parser.parse(f)
            except StopIteration:
                # raised by the handler when </root> is reached
                pass