Pitfalls when parsing XML in Python with lxml and XPath

Latest recommended article published on 2023-09-26 17:15:19

weixin_45906169

Tags: python, xml

Article link: https://blog.csdn.net/weixin_45906169/article/details/124042572

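Project context:

I was parsing electronic medical record documents in CDA format. CDA documents are XML, and the attribute values of some nodes needed to be modified.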

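Problem description

While working out how to parse XML in Python I searched Baidu extensively, and the results were disappointing: the examples were either not detailed enough or were themselves misleading. In summary, I ran into two main problems.

1. Calling etree.fromstring(new_doc_content) raises an error:

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

2. XPath queries find nothing and return an empty result ([]).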

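Cause analysis:

1. The data came from a database query, so new_doc_content was a Python str (Unicode) that still began with an XML declaration containing an encoding. lxml refuses to parse such a str, so etree.fromstring(new_doc_content) has to be given a byte string instead.
2. CDA documents carry both an XML declaration and namespaces, so ordinary XPath expressions return nothing, or return some text while other nodes and attribute values stay out of reach. In XML that declares namespaces, an XPath expression has to include the namespace to match anything. The namespace is where my problem lay. Some of the material I found through Baidu writes the namespace mapping like this: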

ns = {"d" : "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.xpath("//d:loc", namespaces=ns)

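This is what led me astray. I debugged with this form over and over and never got any data back. Much of the material I found elsewhere used a similar form, and I overlooked the small differences between the examples. For instance, this form: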

url = root.xpath("//d:loc", namespaces={'d' : 'http://www.sitemaps.org/schemas/sitemap/0.9'})

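At first glance the only difference is whether the namespaces value is defined beforehand, so I did not look any further. I then worked around the problem by building foo_tree = etree.ElementTree(xml) and iterating over foo_tree.getroot() to modify the attributes. That worked, but I still wanted to locate nodes with XPath, which I knew from earlier web-scraping work to be convenient, so I came back to the problem. Looking again, I noticed that the two namespaces dictionaries used different quote characters, single versus double. I changed the double quotes to single quotes, and the query started returning nodes. The quote style cannot have been the real cause, however: Python treats 'text' and "text" as the same string. What does matter is that the URI in the namespaces dictionary exactly matches the namespace declared in the document. The sitemap URI above comes from the online examples, whereas a CDA document declares urn:hl7-org:v3 as its default namespace, and that is the URI used in the working sample code at the end of this post.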
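The point about quote style and namespace URIs can be checked with a minimal sketch. The tiny document in `doc` is a made-up fragment that only borrows the CDA default namespace; it is an assumption for illustration, not the real record. The sketch compares a single-quoted and a double-quoted namespaces dictionary, then runs the same query with no prefix, with the sitemap URI, and with the matching URI.

```python
from lxml import etree

# Hypothetical minimal document that uses the same default namespace as CDA.
doc = b'<ClinicalDocument xmlns="urn:hl7-org:v3"><title>demo</title></ClinicalDocument>'
root = etree.fromstring(doc)

# The default namespace is stored under the key None in nsmap.
default_ns = root.nsmap.get(None)

# Single- and double-quoted literals produce equal strings in Python.
ns_single = {'x': 'urn:hl7-org:v3'}
ns_double = {"x": "urn:hl7-org:v3"}
same_dicts = ns_single == ns_double

# No prefix: XPath looks for a <title> in no namespace.
no_prefix = root.xpath("//title")

# Prefix bound to a URI that the document does not use (the sitemap URI).
wrong_uri = root.xpath(
    "//d:title",
    namespaces={"d": "http://www.sitemaps.org/schemas/sitemap/0.9"},
)

# Prefix bound to the document's own namespace URI, in both quote styles.
found_single = root.xpath("//x:title", namespaces=ns_single)
found_double = root.xpath("//x:title", namespaces=ns_double)

print(default_ns)
print(same_dicts, no_prefix, wrong_uri)
print(len(found_single), len(found_double))
```

Reading root.nsmap first is a quick way to find out which URI the prefix in the namespaces dictionary has to be bound to. The prefix name itself (x, d, or anything else) is free to choose.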

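Solution:

1. Convert the str to a byte string:

etree.fromstring(new_doc_content.encode('utf-8'))

2. Pass a namespaces dictionary whose URI exactly matches the namespace declared in the document. (I originally put this fix down to replacing the double quotes in the dictionary with single quotes, but Python treats the two quote styles identically.)

url = root.xpath("//d:loc", namespaces={'d' : 'http://www.sitemaps.org/schemas/sitemap/0.9'})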
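The first fix can be tried in isolation with a small sketch. The string assigned to new_doc_content here is a hypothetical stand-in for the text read from the database; only the presence of the XML declaration with an encoding matters. The sketch first passes the str directly and records any ValueError, then parses the UTF-8 encoded bytes.

```python
from lxml import etree

# Hypothetical stand-in for the text read from the database: a str that
# begins with an XML declaration carrying an encoding.
new_doc_content = '<?xml version="1.0" encoding="UTF-8"?><root><item value="1"/></root>'

# Passing the str directly: record the error message if lxml rejects it.
try:
    etree.fromstring(new_doc_content)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Encoding to bytes lets the parser read the declared encoding itself.
root = etree.fromstring(new_doc_content.encode("utf-8"))

print(error_message)
print(root.tag)
```

The encoding passed to encode() should agree with the one named in the declaration; UTF-8 is used here because that is what the sample document declares.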

Sample XML:

<?xml version="1.0" encoding="UTF-8"?>
<ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:mif="urn:hl7-org:v3/mif" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 ..\sdschemas\SDA.xsd">
  <realmCode code="CN"/>
  <typeId root="2.16.840.1.113883.1.3" extension="POCD_MT000040"/>
  <templateId root="2.16.156.10011.2.1.1.33"/>
  <id root="2.16.156.10011.1.1" extension="545ED988-5235-45F1-BBFD-9326D74FAA43"/>
  <code code="00000" codeSystem="545ED988-5235-45F1-BBFD-9326D74FAA43" codeSystemName="卫生信息共享文档规范编码体系"/>
  <title>测试</title>
  <effectiveTime value="20220407090145"/>
  <confidentialityCode code="N" codeSystem="2.16.840.1.113883.5.25" codeSystemName="Confidentiality" displayName="正常访问保密级别"/>
  <languageCode code="zh-CN"/>
  <setId/>
  <versionNumber/>
  <recordTarget typeCode="RCT" contextControlCode="OP">
    <patientRole classCode="PAT">
      <id root="2.16.156.10011.1.11" extension="00000000"/>
      <id root="2.16.156.10011.1.12" extension="00000000"/>
      <id root="2.16.156.10011.1.24" extension="-"/>
      <patient classCode="PSN" determinerCode="INSTANCE">
        <name>XXX</name>
        <administrativeGenderCode code="1" displayName="男性" codeSystem="2.16.156.10011.2.3.3.4" codeSystemName="生理性别代码表(GB/T 2261.1)"/>
        <age value="0" unit="岁"/>
      </patient>
    </patientRole>
  </recordTarget>
</ClinicalDocument>

Sample Python:

from lxml import etree

# new_doc_content is the XML text (a str) queried from the database.
xml = etree.fromstring(new_doc_content.encode('utf-8'))
# The sample's default namespace is urn:hl7-org:v3, so each XPath step
# needs a prefix bound to that URI.
ns = {'x': 'urn:hl7-org:v3'}
effective_time = xml.xpath("//x:effectiveTime[@*]", namespaces=ns)
extension = xml.xpath('//x:recordTarget//x:patientRole/x:id[@extension]', namespaces=ns)
print(effective_time)
print(extension)
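Since the original goal was to modify attribute values, here is a hedged sketch of how that step might look once XPath finds the node. The document in `doc` is a cut-down, hypothetical version of the sample, and the replacement timestamp is an invented value; neither comes from the real project.

```python
from lxml import etree

NS = {"x": "urn:hl7-org:v3"}

# Cut-down, hypothetical version of the sample document.
doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<ClinicalDocument xmlns="urn:hl7-org:v3">'
    b'<effectiveTime value="20220407090145"/>'
    b'</ClinicalDocument>'
)
root = etree.fromstring(doc)

# Locate the element with XPath and overwrite its attribute.
for node in root.xpath("//x:effectiveTime[@value]", namespaces=NS):
    node.set("value", "20220408120000")  # hypothetical new value

# Attribute values can also be selected directly as strings.
values = root.xpath("//x:effectiveTime/@value", namespaces=NS)

# Serialize back to bytes, keeping an XML declaration.
updated = etree.tostring(root, encoding="UTF-8", xml_declaration=True)

print(values)
print(updated.decode("utf-8"))
```

Serializing to bytes with an explicit encoding keeps the declaration and the content consistent if the result is written back to storage.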
