While parsing XML with Python, I ran into the following error:
ParseError: not well-formed (invalid token): line 1, column 17
The cause is the following passage in the XML file:
<news>
<title>2016年1月2日 if(window.yzq_d==null)window.yzq_d=new Object();
window.yzq_d['vYTLAWRhZD8-']='&U=13jt854h2%2fN%3dvYTLAWRhZD8-%2fC%3d300908984.301767463.303376606.311556217%2fD%3dULT%2fB%3d302054045';</title>
</news>
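The "&U" in it breaks the XML structure: a bare "&" must begin an entity reference such as "&amp;", so the parser rejects it.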
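To see why the parser stops there, the failure can be reproduced on a tiny string. This is only a minimal sketch assuming Python 3: the two sample strings and the helper is_well_formed are made up for illustration and are not taken from the original file. Note that minidom reports the problem as xml.parsers.expat.ExpatError.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample strings, not taken from the original file.
bad_xml = "<news><title>a='&U=1'</title></news>"       # bare ampersand
good_xml = "<news><title>a='&amp;U=1'</title></news>"  # escaped ampersand

def is_well_formed(xml_text):
    """Return True if minidom can parse xml_text, False otherwise."""
    try:
        xml.dom.minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False

print(is_well_formed(bad_xml))
print(is_well_formed(good_xml))
```

Only the string with the escaped ampersand is accepted by the parser.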
The original code looked like this:
import xml.dom.minidom
DOMTree = xml.dom.minidom.parse(file_path)
collection = DOMTree.documentElement
categoryNameStr = collection.getAttribute("name")
Because the special character has to be filtered out first, the code needs to be changed to the following:
import codecs
import xml.dom.minidom

def replace_special_character(content):
    # Escape the bare ampersand so that "&U" becomes the valid text "&amp;U".
    content = content.replace("&U", "&amp;U")
    return content

# Read the whole file as UTF-8 text instead of letting minidom open it.
datasource = codecs.open(readFileAddress, 'r', 'UTF-8')
xml_str = ""
for line in datasource:
    xml_str += line
datasource.close()
xml_str = replace_special_character(xml_str)
DOMTree = xml.dom.minidom.parseString(xml_str)
collection = DOMTree.documentElement
categoryNameStr = collection.getAttribute("name")
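
The replacement above only handles the literal "&U". If other bare ampersands may show up in the data, one possible generalization is to escape every "&" that does not already start an entity reference. The following is a sketch under that assumption, for Python 3. The names BARE_AMP and escape_bare_ampersands, the regular expression, and the sample string are illustrative and not part of the original code. Text inside CDATA sections or comments would be rewritten as well, and undefined named entities would still fail to parse.

```python
import re
import xml.dom.minidom

# Matches "&" unless it begins a named (&amp;), decimal (&#38;)
# or hexadecimal (&#x26;) reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(content):
    """Replace every bare ampersand with the entity &amp;."""
    return BARE_AMP.sub("&amp;", content)

# Hypothetical sample input mixing bare and already-escaped ampersands.
sample = "<news name=\"demo\"><title>x='&U=13jt' &amp; y &#38; z</title></news>"
fixed = escape_bare_ampersands(sample)

dom = xml.dom.minidom.parseString(fixed)
print(dom.documentElement.getAttribute("name"))
```

Existing references such as "&amp;" are left untouched, so running the function twice on the same text does not double-escape it.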