python elementtree parent node: iterating over multiple (parent, child) nodes with Python ElementTree

Latest recommended article published on 2021-12-19 22:17:54

网络小魔王

Article tags: python elementtree parent node

The standard implementation of ElementTree for Python (2.6) does not provide pointers to parents from child nodes. Therefore, if parents are needed, it is suggested to loop over parents rather than children.

Suppose my XML is of the form:

<Content>
 <Para>first</Para>
 <Table><Para>second</Para></Table>
 <Para>third</Para>
</Content>

The following finds all "Para" nodes without considering parents:

(1) paras = [p for p in page.getiterator("Para")]

This (adapted from effbot) stores the parent by looping over them instead of the child nodes:

(2) paras = [(c,p) for p in page.getiterator() for c in p]

This makes perfect sense, and can be extended with a conditional to achieve the (supposedly) same result as (1), but with parent info added:

(3) paras = [(c,p) for p in page.getiterator() for c in p if c.tag == "Para"]

The ElementTree documentation suggests that the getiterator() method does a depth-first search. Running it without looking for the parent (1) yields:

first

second

third

However, extracting the text from paras in (3) yields:

first, Content>Para

third, Content>Para

second, Table>Para

This appears to be breadth-first.
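The difference in ordering can be reproduced with a short sketch. It assumes Python 3, where `Element.iter()` takes the place of `getiterator()` (which was removed in Python 3.9), and it assumes the sample document has the structure implied by the outputs in this thread.

```python
import xml.etree.ElementTree as ET

# Sample document, reconstructed from the outputs shown in this thread (an assumption).
XML = """<Content>
 <Para>first</Para>
 <Table><Para>second</Para></Table>
 <Para>third</Para>
</Content>"""

page = ET.fromstring(XML)

# (1) Document-order search for "Para" elements, with no parent information.
doc_order = [p.text for p in page.iter("Para")]

# (3) Visit every element as a potential parent and keep its "Para" children.
pairs = [(c.text, p.tag) for p in page.iter() for c in p if c.tag == "Para"]

print(doc_order)  # ['first', 'second', 'third']
print(pairs)      # [('first', 'Content'), ('third', 'Content'), ('second', 'Table')]
```

The outer loop in (3) still runs in document order; it is the inner loop over each parent's direct children that groups 'first' and 'third' together ahead of 'second'.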

This therefore raises two questions.

1. Is this correct and expected behaviour?

2. How do you extract (parent, child) tuples when the child must be of a certain type but the parent can be anything, while maintaining document order? I do not think running two loops and mapping the (parent, child) pairs generated by (3) onto the order generated by (1) is ideal.

Solution

Consider this:

>>> xml = """<Content>
...  <Para>first</Para>
...  <Table><Para>second</Para></Table>
...  <Para>third</Para>
... </Content>"""
>>> import xml.etree.cElementTree as et
>>> page = et.fromstring(xml)
>>> for p in page.getiterator():
...     print "ppp", p.tag, repr(p.text)
...     for c in p:
...         print "ccc", c.tag, repr(c.text), p.tag
...
ppp Content '\n '
ccc Para 'first' Content
ccc Table None Content
ccc Para 'third' Content
ppp Para 'first'
ppp Table None
ccc Para 'second' Table
ppp Para 'second'
ppp Para 'third'
>>>

Aside: list comprehensions are magnificent until you want to see exactly what is being iterated over :-)

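getiterator is producing the "ppp" elements in the advertised (document) order. However, you are plucking your elements of interest out of the subsidiary "ccc" elements, and those come out grouped by parent: both "Para" children of Content appear before Table is visited, so 'third' precedes 'second'.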

One solution is to do your own iteration:

>>> def process(elem, parent):
...     print elem.tag, repr(elem.text), parent.tag if parent is not None else None
...     for child in elem:
...         process(child, elem)
...
>>> process(page, None)
Content '\n ' None
Para 'first' Content
Table None Content
Para 'second' Table
Para 'third' Content
>>>

Now you can snarf "Para" elements each with a reference to its parent (if any) as they stream past.
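As a rough illustration of collecting those elements instead of printing them, here is a Python 3 sketch of the same recursive walk written as a generator. The name `walk_with_parent` is hypothetical, and the inline document is an assumed reconstruction of the sample XML.

```python
import xml.etree.ElementTree as ET

def walk_with_parent(elem, parent=None):
    """Yield (element, parent) pairs in document order, starting with the root."""
    yield elem, parent
    for child in elem:
        # Recurse into each child, passing the current element as its parent.
        yield from walk_with_parent(child, elem)

# Assumed sample document (whitespace omitted for brevity).
page = ET.fromstring(
    "<Content><Para>first</Para>"
    "<Table><Para>second</Para></Table>"
    "<Para>third</Para></Content>"
)

# Keep only "Para" elements; the parent may be any element.
paras = [(e, p) for e, p in walk_with_parent(page) if e.tag == "Para"]
print([(e.text, p.tag) for e, p in paras])
# [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

Because the walk is recursive, a very deeply nested document could hit Python's recursion limit; an explicit stack avoids that.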

This can be wrapped up nicely in a generator gadget:

>>> def iterate_with_parent(elem):
...     stack = []
...     while 1:
...         for child in reversed(elem):
...             stack.append((child, elem))
...         if not stack: return
...         elem, parent = stack.pop()
...         yield elem, parent
...
>>>
>>> showtag = lambda e: e.tag if e is not None else None
>>> showtext = lambda e: repr((e.text or '').rstrip())
>>> for e, p in iterate_with_parent(page):
...     print e.tag, showtext(e), showtag(p)
...
Para 'first' Content
Table '' Content
Para 'second' Table
Para 'third' Content
>>>
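For readers on Python 3, a possible port of the stack-based generator is sketched below. The optional `tag` parameter is an addition for illustration and not part of the original answer, and the inline document is an assumed reconstruction of the sample XML.

```python
import xml.etree.ElementTree as ET

def iterate_with_parent(root, tag=None):
    """Yield (element, parent) for every descendant of root, in document order.

    If tag is given (a hypothetical extra parameter), only elements with that
    tag are yielded. The root itself is never yielded, as in the generator above.
    """
    stack = [(child, root) for child in reversed(list(root))]
    while stack:
        elem, parent = stack.pop()
        if tag is None or elem.tag == tag:
            yield elem, parent
        # Push children in reverse so that the first child is popped first.
        stack.extend((child, elem) for child in reversed(list(elem)))

# Assumed sample document (whitespace omitted for brevity).
page = ET.fromstring(
    "<Content><Para>first</Para>"
    "<Table><Para>second</Para></Table>"
    "<Para>third</Para></Content>"
)

result = [(e.text, p.tag) for e, p in iterate_with_parent(page, "Para")]
print(result)
# [('first', 'Content'), ('second', 'Table'), ('third', 'Content')]
```

Filtering inside the generator keeps the traversal going through non-matching elements such as Table, so nested "Para" elements are still found with their immediate parent.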
