Traversing and filtering XML files in Python: iterating over all sub-tags and strings in an XML tag without specifying the sub-tag names

My question is an add-on to an earlier question, but I'm not meant to use the answer section for add-on questions.

If I have part of an XML file like this:

<eligibility>
  <criteria>
    <textblock>
      Inclusion Criteria:

      - women undergoing cesarean section for any indication

      - literate in german language

      Exclusion Criteria:

      - history of keloids

      - previous transversal suprapubic scars

      - known patient hypersensitivity to any of the suture materials used in the protocol

      - a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic corticosteroid use)
    </textblock>
  </criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
  <maximum_age>45 Years</maximum_age>
  <healthy_volunteers>No</healthy_volunteers>
</eligibility>

I want to pull out all of the strings in this eligibility section (i.e. the string in the textblock section and the gender, minimum age, maximum age and healthy volunteers sections).

Using the code from that question, I did this:

import sys
from bs4 import BeautifulSoup

# Parse the XML file given as the first command-line argument
soup = BeautifulSoup(open(sys.argv[1], 'r'), 'lxml')
eligibi = []
for eligibility in soup.find_all('eligibility'):
    # Sub-tags are named explicitly here, which is the limitation in question
    d = {'other_name': eligibility.criteria.textblock.string, 'gender': eligibility.gender.string}
    eligibi.append(d)
print(eligibi)

My problem is I have many files. Sometimes the structure of the XML file might be:

eligibility -> criteria -> textblock -> text

eligibility -> other things (e.g. gender as above) -> text

eligibility -> text

Is there a way to just 'take all of the sub-headings and their texts'?

so in the above example, the list/dictionary would contain:

{criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}

My problem is, in reality, I am not going to know all the specific sub-tags of the eligibility tag, as each experiment could be different (e.g. maybe some say 'pregnant women accepted', 'drug history of XXX accepted' etc)

So what I want is this: if I give it a tag name, it gives me all the sub-tags and the text of those sub-tags in a dictionary.

Extended XML for comment:

Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery

Klinikum Klagenfurt am Wörthersee

...and then the eligibility XML section above.

Solution

Since you have lxml installed, you can try the following (this code assumes that leaf elements within a given element, i.e. eligibility, have unique tag names):

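import sys
from lxml import etree

# Parse the XML file given as the first command-line argument
tree = etree.parse(sys.argv[1])
root = tree.getroot()
eligibi = []
for eligibility in root.xpath('//eligibility'):
    d = {}
    # Leaf elements: descendants that have no child elements of their own
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)
print(eligibi)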
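If the uniqueness assumption does not hold, a later leaf with the same tag name overwrites an earlier one in the dictionary. The following is a minimal sketch of one way around that, not part of the original answer: it uses the standard library's xml.etree.ElementTree instead of lxml, keys each leaf by its path relative to the starting element (so the textblock appears as criteria/textblock), and stores the texts in lists. The function name leaf_texts_by_path, the sample string and the repeated condition tag are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Illustrative sample; the repeated <condition> tag is invented here
# to show what happens when leaf tag names are not unique.
SAMPLE = """<clinical_study>
  <eligibility>
    <criteria>
      <textblock>Inclusion Criteria: literate in german language</textblock>
    </criteria>
    <gender>Female</gender>
    <condition>first</condition>
    <condition>second</condition>
  </eligibility>
</clinical_study>"""


def leaf_texts_by_path(element):
    """Map each leaf's path (relative to `element`) to a list of its texts."""
    found = defaultdict(list)

    def walk(node, path):
        if len(node) == 0:
            # Leaf element: record its text, tolerating empty elements.
            found[path].append((node.text or "").strip())
            return
        for child in node:
            walk(child, path + "/" + child.tag)

    for child in element:
        walk(child, child.tag)
    return dict(found)


root = ET.fromstring(SAMPLE)
eligibi = [leaf_texts_by_path(e) for e in root.iter("eligibility")]
print(eligibi)
```

As with the XPath version, an eligibility element that contains only text and no child elements yields an empty dictionary here, and text sitting directly inside a non-leaf element is ignored, so that case would need separate handling.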

XPath explanation:

.//* : find all elements within the current eligibility element, regardless of depth (//) and tag name (*)

[not(*)] : filter the elements found by the previous step down to those that have no child elements, i.e. leaf elements
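To see the two steps separately, here is a small sketch that runs them one after the other on an inline sample. It deliberately swaps lxml for the standard library's xml.etree.ElementTree, whose limited XPath support handles .//* but not the not(*) predicate, so the second step is written as a plain Python filter on the number of child elements. The sample string and variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative sample, trimmed down from the eligibility section in the question.
SAMPLE = """<eligibility>
  <criteria>
    <textblock>Inclusion Criteria: literate in german language</textblock>
  </criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
</eligibility>"""

eligibility = ET.fromstring(SAMPLE)

# Step 1, the './/*' part: every descendant element, at any depth.
descendants = eligibility.findall(".//*")
all_tags = [e.tag for e in descendants]

# Step 2, the '[not(*)]' part: keep only elements with no child elements.
# len(element) is the number of direct child elements in ElementTree.
leaves = [e for e in descendants if len(e) == 0]
leaf_tags = [e.tag for e in leaves]

print(all_tags)   # ['criteria', 'textblock', 'gender', 'minimum_age']
print(leaf_tags)  # ['textblock', 'gender', 'minimum_age']
```

The criteria element is dropped in the second step because it has a child (textblock), which is exactly what the [not(*)] predicate does in the lxml version.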
