Iterating over and filtering XML files in Python: traversing all child tags and strings within an XML tag without specifying the child tag names...

Latest recommended article published on 2022-04-18 21:55:03

weixin_39649660

Article tags: iterating over and filtering XML files in Python

My question is a follow-up to the one here, but I'm not meant to use the answer section for add-on questions.

If I have part of an XML file like this:

<eligibility>
  <criteria>
    <textblock>
      Inclusion Criteria:
      - women undergoing cesarean section for any indication
      - literate in german language

      Exclusion Criteria:
      - history of keloids
      - previous transversal suprapubic scars
      - known patient hypersensitivity to any of the suture materials used in the protocol
      - a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic
      corticosteroid use)
    </textblock>
  </criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
  <maximum_age>45 Years</maximum_age>
</eligibility>

I want to pull out all of the strings in this eligibility section (i.e. the string in the textblock section and the strings in the gender, minimum age, maximum age and healthy volunteers sections).

Using the code above, I did this:

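import sys

from bs4 import BeautifulSoup

# Parse the file named on the command line with the lxml parser.
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'lxml')

eligibi = []

for eligibility in soup.find_all('eligibility'):
    # Assumes every eligibility tag has criteria/textblock and gender children.
    d = {'other_name': eligibility.criteria.textblock.string, 'gender': eligibility.gender.string}
    eligibi.append(d)

print(eligibi)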

My problem is I have many files. Sometimes the structure of the XML file might be:

eligibility -> criteria -> textblock -> text

eligibility -> other things (e.g. gender as above) -> text

eligibility -> text
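To make the difficulty concrete, here is a minimal sketch of how hard-coded lookups behave across the three shapes listed above. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages; the sample strings and the `fixed_lookup` helper are invented for illustration and are not from my real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one record per shape listed above.
SAMPLES = [
    "<eligibility><criteria><textblock>Inclusion Criteria: ...</textblock></criteria></eligibility>",
    "<eligibility><gender>Female</gender></eligibility>",
    "<eligibility>free text only</eligibility>",
]


def fixed_lookup(eligibility):
    """Look up two hard-coded paths, returning None where a path is absent."""
    return {
        "criteria": eligibility.findtext("criteria/textblock", default=None),
        "gender": eligibility.findtext("gender", default=None),
    }


results = [fixed_lookup(ET.fromstring(xml)) for xml in SAMPLES]
for row in results:
    print(row)
```

With fixed paths, each record only fills in the keys that happen to match, and the text of the third shape is not captured at all, which is why a lookup that does not depend on known tag names is needed.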

Is there a way to just 'take all of the sub-headings and their texts'?

So in the above example, the list/dictionary would contain:

{criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}

My problem is that, in reality, I am not going to know all the specific sub-tags of the eligibility tag, as each experiment could be different (e.g. maybe some say 'pregnant women accepted', 'drug history of XXX accepted', etc.).

So I just want, if I give it a tag name, it will give me all the sub-tags and text of those sub-tags in a dictionary.
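As a rough illustration of the behaviour I am after, the sketch below walks an element recursively and builds such a dictionary. It is only a sketch under assumptions: it uses the standard library's xml.etree.ElementTree, the `leaf_texts` name and the sample XML are hypothetical, keys are the space-joined tag path below the starting element, and a repeated path simply overwrites the earlier value.

```python
import xml.etree.ElementTree as ET


def leaf_texts(element):
    """Map each leaf descendant's tag path to its stripped text.

    Assumptions: paths are unique (a repeated path overwrites the earlier
    entry), and text that sits directly in an element which also has
    children is ignored.
    """
    found = {}

    def walk(node, parts):
        children = list(node)
        if not children:
            # Leaf: key is the path below the start element, or the
            # element's own tag when the start element itself is a leaf.
            key = " ".join(parts) or node.tag
            found[key] = (node.text or "").strip()
            return
        for child in children:
            walk(child, parts + [child.tag])

    walk(element, [])
    return found


# Hypothetical sample mirroring the eligibility section in the question.
SAMPLE = """
<eligibility>
  <criteria>
    <textblock>
      Inclusion Criteria: ...
    </textblock>
  </criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
  <maximum_age>45 Years</maximum_age>
</eligibility>
""".strip()

result = leaf_texts(ET.fromstring(SAMPLE))
print(result)
```

No tag names are hard-coded, so a file with extra or missing sub-tags just produces a dictionary with more or fewer keys.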

Extended XML for comment:

Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery

Klinikum Klagenfurt am Wörthersee

...and then the eligibility XML section above.

Solution

Since you have lxml installed, you can try the following (this code assumes that leaf elements within a given element, i.e. eligibility, are unique):

import sys

from lxml import etree

tree = etree.parse(sys.argv[1])

root = tree.getroot()

eligibi = []

for eligibility in root.xpath('//eligibility'):