Consider this:

>>> xml = """<Content>
...  <Para>first</Para>
...  <Table><Para>second</Para></Table>
...  <Para>third</Para>
... </Content>"""
>>> import xml.etree.cElementTree as et
>>> page = et.fromstring(xml)
>>> for p in page.getiterator():
...     print "ppp", p.tag, repr(p.text)
...     for c in p:
...         print "ccc", c.tag, repr(c.text), p.tag
...
ppp Content '\n '
ccc Para 'first' Content
ccc Table None Content
ccc Para 'third' Content
ppp Para 'first'
ppp Table None
ccc Para 'second' Table
ppp Para 'second'
ppp Para 'third'
>>>
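Aside: list comprehensions are fantastic, right up until you need to work out exactly what is being iterated over :-)

getiterator is producing the "ppp" elements in the advertised (document) order. However, you are plucking your elements of interest out of the subsidiary "ccc" elements, and those do not come out in the order you want.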
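To make the ordering problem concrete, here is a minimal sketch in Python 3 (the session above is Python 2; `iter()` is the Python 3 spelling of `getiterator()`). The sample document is an assumption reconstructed from the output shown above, and the names `via_children` and `document_order` are purely illustrative. It collects the "Para" texts the way a nested loop over children would, then compares them with a plain document-order traversal.

```python
import xml.etree.ElementTree as ET

# Sample document: assumed to match the structure implied by the session output.
XML = """<Content>
 <Para>first</Para>
 <Table><Para>second</Para></Table>
 <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# Nested iteration: visit every element in document order, then look at its
# direct children. All of Content's children are seen before any of Table's.
via_children = [c.text for p in page.iter() for c in p if c.tag == "Para"]

# Plain document-order traversal of the Para elements themselves.
document_order = [e.text for e in page.iter("Para")]

print(via_children)    # ['first', 'third', 'second']
print(document_order)  # ['first', 'second', 'third']
```

The first list is out of order because "third" is a direct child of Content, so it is reached before the loop ever descends into Table.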
One solution is to do your own iteration:

>>> def process(elem, parent):
...     print elem.tag, repr(elem.text), parent.tag if parent is not None else None
...     for child in elem:
...         process(child, elem)
...
>>> process(page, None)
Content '\n ' None
Para 'first' Content
Table None Content
Para 'second' Table
Para 'third' Content
>>>
Now, as each "Para" element goes past, you can refer to its parent (if any).

This can be wrapped up nicely in a generator gadget:

>>> def iterate_with_parent(elem):
...     stack = []
...     while 1:
...         for child in reversed(elem):
...             stack.append((child, elem))
...         if not stack: return
...         elem, parent = stack.pop()
...         yield elem, parent
...
>>>
>>> showtag = lambda e: e.tag if e is not None else None
>>> showtext = lambda e: repr((e.text or '').rstrip())
>>> for e, p in iterate_with_parent(page):
...     print e.tag, showtext(e), showtag(p)
...
Para 'first' Content
Table '' Content
Para 'second' Table
Para 'third' Content
>>>
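For readers on Python 3, where the `print` statement and `cElementTree` are no longer available, the same generator might look like the sketch below. The sample document is again an assumption reconstructed from the output above, and `pairs` and `parent_map` are illustrative names rather than part of the original answer.

```python
import xml.etree.ElementTree as ET

def iterate_with_parent(root):
    """Yield (element, parent) for every descendant of root, in document order."""
    stack = []
    elem = root
    while True:
        # Push children in reverse so that the first child is popped first.
        for child in reversed(list(elem)):
            stack.append((child, elem))
        if not stack:
            return
        elem, parent = stack.pop()
        yield elem, parent

# Sample document: assumed to match the structure implied by the session output.
XML = """<Content>
 <Para>first</Para>
 <Table><Para>second</Para></Table>
 <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

pairs = [(e.tag, (e.text or "").strip(), p.tag) for e, p in iterate_with_parent(page)]
for tag, text, parent_tag in pairs:
    print(tag, repr(text), parent_tag)

# Alternative when order does not matter: a child -> parent lookup table.
parent_map = {child: parent for parent in page.iter() for child in parent}
```

As in the session above, the root element is never yielded, because it has no parent. The `parent_map` dictionary is a simpler option when you only need to look up the parent of an element you already hold; the generator is the better fit when you want elements and parents together in document order.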