XPath should be fast. You can reduce the number of XPath calls to one:
from lxml import etree

doc = etree.fromstring(xml)
btags = doc.xpath('//a/b')
for b in btags:
    print(b.text)
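To try the single-call approach in isolation, here is a minimal, self-contained sketch. The sample document and the `collect_b_texts` helper are made up for illustration; only `etree.fromstring` and `xpath` come from the snippet above.

```python
from lxml import etree

# Made-up sample document, purely for illustration.
SAMPLE_XML = b"""<root>
  <a><b>one</b><b>two</b></a>
  <a><b>three</b><c>ignored</c></a>
</root>"""


def collect_b_texts(xml_bytes):
    """Return the text of every <b> that is a direct child of an <a>."""
    doc = etree.fromstring(xml_bytes)
    # A single XPath evaluation over the whole tree, instead of several calls.
    return [b.text for b in doc.xpath('//a/b')]


if __name__ == "__main__":
    for text in collect_b_texts(SAMPLE_XML):
        print(text)
```

Results come back in document order, so the `<c>` element is skipped and the three `<b>` texts are printed one per line.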
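If that is not fast enough, you could try Liza Daly's fast_iter. It has the advantage of not requiring the entire XML to be parsed with etree.fromstring first, and parent nodes are thrown away once their children have been visited. Both of these help reduce memory requirements. Below is a modified version of fast_iter that is more aggressive about removing other elements that are no longer needed.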
import io
from lxml import etree

def fast_iter(context, func, *args, **kwargs):
    """
    fast_iter is useful if you need to free memory while iterating through a
    very large XML file.

    http://lxml.de/parsing.html#modifying-the-tree
    Based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context

def process_element(elt):
    print(elt.text)

# xml is expected to be a bytes object here, since it is wrapped in BytesIO
context = etree.iterparse(io.BytesIO(xml), events=('end',), tag='b')
fast_iter(context, process_element)
Liza Daly's article on parsing large XML files may also be useful reading for you. According to the article, lxml with fast_iter can be faster than cElementTree's iterparse (see Table 1).
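For reference, the cElementTree side of that comparison looks roughly like the sketch below. It assumes Python 3, where the separate cElementTree module no longer exists (it was removed in 3.9) and xml.etree.ElementTree uses the C accelerator automatically. The `iter_b_texts` name and the sample bytes are made up for illustration, and this is not a benchmark.

```python
import io
import xml.etree.ElementTree as ET


def iter_b_texts(xml_bytes):
    """Yield the text of each <b> element, clearing it once it is handled."""
    for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=('end',)):
        # The standard library's iterparse has no tag= filter, so check here.
        if elem.tag == 'b':
            yield elem.text
            # Drop the element's text, attributes and children.
            elem.clear()


if __name__ == "__main__":
    sample = b"<root><a><b>one</b><b>two</b></a><a><b>three</b></a></root>"
    for text in iter_b_texts(sample):
        print(text)
```

Standard-library elements have no getparent() or getprevious(), so the ancestor-pruning step of fast_iter has no direct equivalent here: the cleared, empty elements stay attached to their parents until the whole tree is released.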