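In short, whether you pick the start or the end event in events depends on which order you want:
"visit the outer elements first", or
"visit the inner elements first".
For example, take the following XML: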
<level-1>
    <level-2-1>
        <level-3-1></level-3-1>
        <level-3-2></level-3-2>
    </level-2-1>
    <level-2-2>
        <level-3-3></level-3-3>
        <level-3-4></level-3-4>
    </level-2-2>
</level-1>
If you parse with the start event:
from lxml.etree import iterparse

# Open in binary mode: lxml's iterparse reads bytes from file objects
with open('foo.xml', 'rb') as xml:
    for event, element in iterparse(xml, events=('start',)):
        print(element.tag)
you get the "outer elements first" result:
level-1
level-2-1
level-3-1
level-3-2
level-2-2
level-3-3
level-3-4
If you parse with the end event:
# Same import as before; again open the file in binary mode
with open('foo.xml', 'rb') as xml:
    for event, element in iterparse(xml, events=('end',)):
        print(element.tag)
you get the "inner elements first" result:
level-3-1
level-3-2
level-2-1
level-3-3
level-3-4
level-2-2
level-1
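The two orderings above can be reproduced without a file on disk. The following is a minimal sketch, assuming the standard-library xml.etree.ElementTree.iterparse (which should behave like lxml.etree.iterparse for this case) and an in-memory io.BytesIO buffer standing in for foo.xml. The helper name tags_in_order is made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

# The sample document from above, held in memory instead of foo.xml
XML = b"""<level-1>
  <level-2-1>
    <level-3-1></level-3-1>
    <level-3-2></level-3-2>
  </level-2-1>
  <level-2-2>
    <level-3-3></level-3-3>
    <level-3-4></level-3-4>
  </level-2-2>
</level-1>"""


def tags_in_order(event_name):
    """Return the tags in the order iterparse reports them for one event type."""
    source = io.BytesIO(XML)  # fresh buffer for each pass
    return [elem.tag for _event, elem in iterparse(source, events=(event_name,))]


start_order = tags_in_order("start")  # outer elements first
end_order = tags_in_order("end")      # inner elements first
print(start_order)
print(end_order)
```

The first list begins with level-1 because a start event fires as soon as the opening tag is read. The second list ends with level-1 because an end event fires only once the closing tag is reached.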
Take an XML document like this one as an example:
<?xml version="1.0"?>
<data>
    <country name="china">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="hangkong">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
If I change the parameter to events=('start', 'end'), the result looks like this:
<Element 'data' at 0x5d11828>
('start', <Element 'data' at 0x5d116a0>)
('start', <Element 'country' at 0x5d11c88>)
('start', <Element 'rank' at 0x5d11cc0>)
('end', <Element 'rank' at 0x5d11cc0>)
('start', <Element 'year' at 0x5d11d30>)
('end', <Element 'year' at 0x5d11d30>)
('start', <Element 'gdppc' at 0x5d11d68>)
('end', <Element 'gdppc' at 0x5d11d68>)
('start', <Element 'neighbor' at 0x5d11da0>)
('end', <Element 'neighbor' at 0x5d11da0>)
('start', <Element 'neighbor' at 0x5d11dd8>)
('end', <Element 'neighbor' at 0x5d11dd8>)
('end', <Element 'country' at 0x5d11c88>)
('start', <Element 'country' at 0x5d11e10>)
('start', <Element 'rank' at 0x5d11e48>)
('end', <Element 'rank' at 0x5d11e48>)
('start', <Element 'year' at 0x5d11e80>)
('end', <Element 'year' at 0x5d11e80>)
('start', <Element 'gdppc' at 0x5d11eb8>)
('end', <Element 'gdppc' at 0x5d11eb8>)
('start', <Element 'neighbor' at 0x5d11ef0>)
('end', <Element 'neighbor' at 0x5d11ef0>)
('end', <Element 'country' at 0x5d11e10>)
('start', <Element 'country' at 0x5d11f28>)
('start', <Element 'rank' at 0x5d11f60>)
('end', <Element 'rank' at 0x5d11f60>)
('start', <Element 'year' at 0x5d11f98>)
('end', <Element 'year' at 0x5d11f98>)
('start', <Element 'gdppc' at 0x5d11fd0>)
('end', <Element 'gdppc' at 0x5d11fd0>)
('start', <Element 'neighbor' at 0x43a3390>)
('end', <Element 'neighbor' at 0x43a3390>)
('start', <Element 'neighbor' at 0x43a3cc0>)
('end', <Element 'neighbor' at 0x43a3cc0>)
('end', <Element 'country' at 0x5d11f28>)
('end', <Element 'data' at 0x5d116a0>)
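start marks the opening of a tag and end marks its closing. When you pass ('start', 'end') together, iterparse yields a 'start' event for an element when it sees the opening <data> tag and an 'end' event for the same element when it sees the closing </data> tag.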
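Since every start is eventually matched by an end, the two events together are enough to track how deeply nested the parser currently is. The following is a minimal sketch, assuming the standard-library xml.etree.ElementTree.iterparse and a trimmed-down version of the sample document. The function name outline is made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

# Trimmed-down version of the sample document
XML = b'<data><country name="Panama"><rank>69</rank><year>2011</year></country></data>'


def outline(source):
    """Build an indented outline of tags by counting start and end events."""
    depth = 0
    lines = []
    for event, elem in iterparse(source, events=("start", "end")):
        if event == "start":
            lines.append("  " * depth + elem.tag)  # record the tag at the current depth
            depth += 1                             # now inside this element
        else:
            depth -= 1                             # closing tag seen: step back out
    return lines


tree_lines = outline(io.BytesIO(XML))
print("\n".join(tree_lines))
```

With only one of the two events this bookkeeping would not be possible, because the loop could not tell where an element begins or where it ends.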
P.S. If the events parameter is omitted, only 'end' events are reported, so by default the inner elements are visited first.
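The default fits a common streaming pattern: once an element's end event arrives, its children have been parsed, so it can be processed and then discarded. The following is a minimal sketch, assuming the standard-library xml.etree.ElementTree.iterparse and a reduced version of the sample data. The variable names ranks and seen_events are made up for illustration.

```python
import io
from xml.etree.ElementTree import iterparse

# Reduced version of the sample data
XML = b"""<data>
  <country name="china"><rank>2</rank></country>
  <country name="Panama"><rank>69</rank></country>
</data>"""

ranks = {}
seen_events = set()

# No events argument: only "end" events are reported
for event, elem in iterparse(io.BytesIO(XML)):
    seen_events.add(event)
    if elem.tag == "country":
        # At the end of <country>, its <rank> child is already complete
        ranks[elem.get("name")] = int(elem.findtext("rank"))
        elem.clear()  # drop the finished subtree to keep memory use low

print(seen_events)
print(ranks)
```

Calling clear() is optional for a small file like this one. It matters mainly for large documents, where keeping every parsed element would defeat the purpose of incremental parsing.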
iterparse returns an iterable stream of (event, element) tuples.
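Because the result is an ordinary iterable, the tuples can be pulled one at a time or consumed in a comprehension, not only in a for loop. The following is a minimal sketch, assuming the standard-library xml.etree.ElementTree.iterparse and a tiny made-up document.

```python
import io
from xml.etree.ElementTree import iterparse

source = io.BytesIO(b"<data><year>2008</year></data>")
stream = iter(iterparse(source, events=("start", "end")))

# Pull the first (event, element) tuple by hand
first_event, first_elem = next(stream)

# Consume the rest of the stream, keeping only the event name and tag
remaining = [(event, elem.tag) for event, elem in stream]

print(first_event, first_elem.tag)
print(remaining)
```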
Note: this post is based on answers from the Udacity (优达学城) forum. Thanks to andy.li and bbikks for their excellent answers.
For a detailed introduction to iterparse, see this document: iterparse Function
For an introduction to ElementTree, see this link: ElementTree official documentation (translation)