Problems Encountered When Parsing XML

weixin_34114823

Original link: https://my.oschina.net/u/260244/blog/293747

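Troubleshooting process:

Extract one complete record from the import data package for debugging.

Following the error message, check whether the XF_SUMMARY element is really missing its closing tag. On visual inspection it looks fine.

Source file:

<XF_SUMMARY>XXXX related information</XF_SUMMARY>

After deleting the content of that element and importing again, the import still fails, but the error now points to a different element: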

2014-07-22 13:31:41 错误 [con.err] org.dom4j.DocumentException: Error on line 36 of document  : The element type "APPELLEE_POLITY" must be terminated by the matching end-tag "</APPELLEE_POLITY>". Nested exception: The element type "APPELLEE_POLITY" must be terminated by the matching end-tag "</APPELLEE_POLITY>".
2014-07-22 13:31:41 错误 [con.err] at org.dom4j.io.SAXReader.read(SAXReader.java:482)
2014-07-22 13:31:41 错误 [con.err] at org.dom4j.io.SAXReader.read(SAXReader.java:343)

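After repeating this until the data of every element flagged by the errors had been cleared, the import succeeded.

What is the cause? Why would tags that look intact be reported as missing their closing tags during parsing?

The output of the following code reveals the cause: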

// Imports needed: java.io.*, java.nio.file.Files, org.dom4j.Document,
// org.dom4j.DocumentException, org.dom4j.io.SAXReader
public static Document changerXMLCode(File xmlFile) throws IOException, DocumentException {
    SAXReader reader = new SAXReader();
    // Read the entire file into memory as raw bytes (no stream is left open).
    byte[] bytes = Files.readAllBytes(xmlFile.toPath());
    // Decode the bytes as GBK, the encoding our XML files are expected to use.
    String xmlDate = new String(bytes, "GBK");
    // Remove sequences that look like numeric character references ("&#...").
    xmlDate = xmlDate.replaceAll("&#[1-9]+|&#\\w{0,3};?", "");
    // Convert the string into a Document object.
    return reader.read(new ByteArrayInputStream(xmlDate.getBytes("GBK")));
}

Inspecting the value of xmlDate shows what the data actually looks like:

<?xml version="1.0" encoding="GBK"?><XF_JUBAO>
  <Jubao>
    <APPELLEE_SEX>鐢?/APPELLEE_SEX>
    <APPELLEE_NATION>姹夋棌</APPELLEE_NATION>
    <APPELLEE_POLITY>涓浗鍏变骇鍏氬厷鍛?/APPELLEE_POLITY>
    <XF_QUESTIONTYPE>宸ㄩ璐骇鏉ユ簮涓嶆槑</XF_QUESTIONTYPE>
    <APPELLEE_NAME>鐜嬪繝</APPELLEE_NAME>
    <APPELLEE_ADDR>娉板窞甯傚叴鍖栧競宸ュ晢閾惰</APPELLEE_ADDR>
  </Jubao>
</XF_JUBAO>

Some closing tags have indeed been destroyed by the garbled text, for example:

<APPELLEE_POLITY>涓浗鍏变骇鍏氬厷鍛?/APPELLEE_POLITY>

Root cause, finally identified: when the XML data package was exported, its encoding was not set to GBK.
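A diagnosis like this can be checked directly against the file bytes. The sketch below is an illustration under stated assumptions rather than part of the original fix: `EncodingCheck`, `decodesCleanly`, and `declaredEncoding` are hypothetical names, and the sample document is built in memory to simulate an export that declares GBK but writes UTF-8. It compares the encoding named in the XML declaration with what the bytes actually decode as, using strict decoders that report errors instead of substituting characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncodingCheck {
    private static final Pattern DECLARED =
        Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Returns true if the bytes decode under the given charset without errors. */
    public static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the encoding named in the XML declaration, or null if none is found. */
    public static String declaredEncoding(byte[] bytes) {
        // Assumes an ASCII-compatible encoding, so the prolog can be read as ISO-8859-1.
        int n = Math.min(bytes.length, 200);
        String head = new String(bytes, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"GBK\"?>"
                   + "<APPELLEE_SEX>\u7537</APPELLEE_SEX>";
        // Simulates the faulty export: GBK is declared, UTF-8 is written.
        byte[] exported = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println("declared: " + declaredEncoding(exported));
        System.out.println("valid UTF-8: " + decodesCleanly(exported, StandardCharsets.UTF_8));
        System.out.println("valid GBK: " + decodesCleanly(exported, Charset.forName("GBK")));
    }
}
```

A clean UTF-8 decode is only a strong hint when the file contains non-ASCII text, since pure ASCII content is valid under both encodings.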

To continue debugging, change the parsing code to use UTF-8:

// Imports needed: java.io.*, java.nio.file.Files, org.dom4j.Document,
// org.dom4j.DocumentException, org.dom4j.io.SAXReader
public static Document changerXMLCode(File xmlFile) throws IOException, DocumentException {
    SAXReader reader = new SAXReader();
    // Read the entire file into memory as raw bytes (no stream is left open).
    byte[] bytes = Files.readAllBytes(xmlFile.toPath());
    // Changed to utf-8: decode the bytes as UTF-8 instead of GBK.
    String xmlDate = new String(bytes, "utf-8");
    // Remove sequences that look like numeric character references ("&#...").
    xmlDate = xmlDate.replaceAll("&#[1-9]+|&#\\w{0,3};?", "");
    // Convert the string into a Document object (changed to utf-8 here as well).
    return reader.read(new ByteArrayInputStream(xmlDate.getBytes("utf-8")));
}

Import the data package again and inspect xmlDate:

<?xml version="1.0" encoding="GBK"?>
<XF_JUBAO>
  <Jubao>
    <APPELLEE_SEX>男</APPELLEE_SEX>
    <APPELLEE_NATION>汉族</APPELLEE_NATION>
    <APPELLEE_POLITY>职级</APPELLEE_POLITY>
    <XF_QUESTIONTYPE>问题</XF_QUESTIONTYPE>
    <APPELLEE_NAME>名称</APPELLEE_NAME>
    <APPELLEE_ADDR>地址</APPELLEE_ADDR>
  </Jubao>
</XF_JUBAO>

Everything now displays correctly, yet the import inexplicably fails again, with the same error message as before:

2014-07-22 13:52:01 错误 [con.err] org.dom4j.DocumentException: Error on line 36 of document  : The element type "APPELLEE_POLITY" must be terminated by the matching end-tag "</APPELLEE_POLITY>". Nested exception: The element type "APPELLEE_POLITY" must be terminated by the matching end-tag "</APPELLEE_POLITY>".
2014-07-22 13:52:01 错误 [con.err] at org.dom4j.io.SAXReader.read(SAXReader.java:482)
2014-07-22 13:52:01 错误 [con.err] at org.dom4j.io.SAXReader.read(SAXReader.java:343)

Why? Further debugging shows that the problem has come full circle.

The XML itself already declares GBK as its encoding: