python xml.dom.minidom.parse not well-formed error

Latest recommended article published 2024-07-11 09:09:04

煎饼皮皮侠

Category: python. Tags: xml, parse

Article link: https://blog.csdn.net/yuan882696yan/article/details/50392452

While parsing an XML file with Python, I ran into the following error:

ParseError: not well-formed (invalid token): line 1, column 17

The cause is the following fragment in the XML file:

<news>
 <title>2016年1月2日 if(window.yzq_d==null)window.yzq_d=new Object();
window.yzq_d['vYTLAWRhZD8-']='&U=13jt854h2%2fN%3dvYTLAWRhZD8-%2fC%3d300908984.301767463.303376606.311556217%2fD%3dULT%2fB%3d302054045';</title>
</news>

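The "&U" in it breaks the XML structure: in XML a bare "&" must begin an entity reference such as "&amp;", so the parser rejects "&U=" as not well-formed.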
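To make the cause easier to see, here is a minimal sketch that reproduces the failure on a shortened version of the fragment. The helper name `is_well_formed` and the sample strings are made up for illustration; note that `xml.dom.minidom` reports this problem as `xml.parsers.expat.ExpatError`.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if minidom can parse xml_text, False otherwise."""
    try:
        xml.dom.minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


# Hypothetical, shortened versions of the problematic <title> content.
bad = "<news><title>x='&U=13jt854h'</title></news>"
good = "<news><title>x='&amp;U=13jt854h'</title></news>"

print(is_well_formed(bad))   # the bare "&U" is rejected
print(is_well_formed(good))  # the escaped "&amp;U" parses
```

The only difference between the two strings is that the ampersand is written as "&amp;" in the second one.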

The original code looked like this:

import xml.dom.minidom

# Parse the file directly; this fails when the file is not well-formed XML.
DOMTree = xml.dom.minidom.parse(file_path)
collection = DOMTree.documentElement
categoryNameStr = collection.getAttribute("name")

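Because the special characters have to be filtered out before parsing, the code needs to read the file as text, escape them, and parse the resulting string: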

import codecs
import xml.dom.minidom

def replace_special_character(content):
    # Escape the bare "&U" so it no longer starts an invalid entity reference.
    content = content.replace("&U", "&amp;U")
    return content

# Read the whole file as UTF-8 text.
with codecs.open(readFileAddress, 'r', 'UTF-8') as datasource:
    xml_str = datasource.read()
xml_str = replace_special_character(xml_str)
# Parse the cleaned string instead of the file.
DOMTree = xml.dom.minidom.parseString(xml_str)
collection = DOMTree.documentElement
categoryNameStr = collection.getAttribute("name")
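The replacement above only handles the literal "&U". One possible generalization, sketched here as an assumption rather than a tested fix for every file, is to escape any "&" that does not already begin an entity or character reference. The names `_BARE_AMP`, `escape_bare_ampersands` and the sample string are hypothetical.

```python
import re
import xml.dom.minidom

# Matches "&" that does NOT begin a named (&amp;), decimal (&#38;)
# or hexadecimal (&#x26;) reference.
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(xml_text):
    """Replace every bare "&" with "&amp;" and leave existing references alone."""
    return _BARE_AMP.sub("&amp;", xml_text)


# Hypothetical sample mixing a bare "&" with references that are already valid.
sample = '<news name="demo"><title>a &U=1 &amp; b &#38; c</title></news>'
dom = xml.dom.minidom.parseString(escape_bare_ampersands(sample))
print(dom.documentElement.getAttribute("name"))
```

This sketch leaves anything shaped like "&name;" untouched, so an undefined named entity would still make the parse fail. It also does not repair other problems such as stray "<" characters in the text.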

If there is a better way to do this, suggestions are welcome.