Garbled Text (Character Encoding) Problems with dom4j

Latest recommended article published on 2024-04-01 01:34:53

iteye_15968

Tags: Eclipse XML SUN Security

1) Background

A long-running crawler that fetches XML suddenly started failing: garbled characters in the XML caused validation to fail.

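2) How the garbled text arises

Different sites return XML in different encodings: some use gb2312, others utf-8.
The crawler passed the byte stream from urlConnection.getInputStream() straight to SAXReader to build the Document.
Unfortunately SAXReader is not that capable here: it receives only a byte stream with no indication of its encoding, so it decodes the bytes with the system default encoding. That is where the problem lies.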
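The effect of decoding with the wrong charset can be reproduced without any network access. The following is a minimal sketch, not the crawler's code: the class and method names are made up for illustration, and ISO-8859-1 stands in for "whatever the system default happens to be", because only the standard charsets are guaranteed to exist on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] wire = text.getBytes(writeWith);
        return new String(wire, readWith);
    }

    public static void main(String[] args) {
        // \u4e2d\u6587 is a two-character Chinese string, escaped so the source file stays ASCII.
        String title = "<title>\u4e2d\u6587</title>";
        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(title, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(title));
        // Mismatched charsets: the non-ASCII characters are corrupted.
        System.out.println(roundTrip(title, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1).equals(title));
    }
}
```

The ASCII markup survives either way, which is why such documents still look like XML but no longer validate.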

3) How SAXReader handles a byte stream when no encoding is specified

org.gjt.xpp.sax2.Driver.parse(InputSource source)

if(encoding == null)
reader = new InputStreamReader(stream);

java.io.InputStreamReader

sd = StreamDecoder.forInputStreamReader(in, this, (String)null);
The charset name passed in is null.

sun.nio.cs.StreamDecoder.forInputStreamReader(InputStream in, Object lock, String charsetName)

String csn = charsetName;
if (csn == null)
    csn = Charset.defaultCharset().name();
So the default charset is looked up.

java.nio.charset.Charset.defaultCharset()

java.security.PrivilegedAction pa =
    new GetPropertyAction("file.encoding");
String csn = (String)AccessController.doPrivileged(pa);
Charset cs = lookup(csn);
if (cs != null)
    defaultCharset = cs;
else
    defaultCharset = forName("UTF-8");
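The lookup first consults the file.encoding system property (which can be set with -Dfile.encoding); if it was not set explicitly, its value is the operating system's default character encoding; if no matching charset can be found, UTF-8 is used.
When the program is run from Eclipse, Eclipse passes -Dfile.encoding itself, with the value set to your file encoding.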
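To see which charset a given JVM ends up with, the chain traced above can be observed from the outside. This is a small sketch using only java.io and java.nio; the class name is hypothetical, and the printed values depend on the machine and on how the JVM was launched.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class DefaultCharsetProbe {
    /** Returns the charset an InputStreamReader picks when none is given. */
    static Charset charsetOfPlainReader() {
        try (InputStreamReader reader = new InputStreamReader(new ByteArrayInputStream(new byte[0]))) {
            // getEncoding() may return a historical name such as "UTF8";
            // Charset.forName resolves such aliases to the canonical charset.
            return Charset.forName(reader.getEncoding());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("file.encoding     = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset    = " + Charset.defaultCharset());
        System.out.println("InputStreamReader = " + charsetOfPlainReader());
    }
}
```

Running it once from a terminal and once from the IDE is a quick way to check whether the two environments disagree.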

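4) How to determine the XML encoding

See com.sun.syndication.io.XmlReader

Check whether the first line of the file contains <?xml ... encoding="xx" ... ?>
Check whether the HTTP response header contains Content-Type: text/xml; charset=xx
Detect the BOM (byte order mark, e.g. the UTF-8 signature)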

Take the first 3 bytes:
UTF_16BE: 0xFE 0xFF
UTF_16LE: 0xFF 0xFE
UTF_8: 0xEF 0xBB 0xBF
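The BOM table above translates directly into a byte comparison. The sketch below is an illustration only (BomSniffer and its methods are made-up names, not part of XmlReader), and it covers just the three signatures listed; UTF-32 BOMs are deliberately not handled.

```java
final class BomSniffer {
    /** Returns the charset name implied by a leading BOM, or null when no BOM is present. */
    static String charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data, as unsigned values, with the signature. */
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<'};
        byte[] plain = {'<', '?', 'x', 'm'};
        System.out.println(charsetFromBom(utf8WithBom));
        System.out.println(charsetFromBom(plain));
    }
}
```

The `& 0xFF` mask matters because Java bytes are signed: 0xEF is stored as -17.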

Testing shows what actually happens (the trailing comments are the printed output):
// How the utf-16BE, utf-16LE, utf-16 and utf-8 encodings differ
System.out.println(Arrays.toString("<".getBytes("utf-16BE"))); // [0, 60]
System.out.println(Arrays.toString("<".getBytes("utf-16LE"))); // [60, 0]
System.out.println(Arrays.toString("<".getBytes("utf-16")));   // [-2, -1, 0, 60]
System.out.println(Arrays.toString("<".getBytes("utf-8")));    // [60]
// Is the BOM recognized when decoding?
byte[] b1 = new byte[]{-2, -1, 0, 60};
System.out.println(new String(b1, "UTF-16BE")); // <
System.out.println(new String(b1, "UTF-16"));   // <

byte[] b2 = new byte[]{-1, -2, 60, 0};
System.out.println(new String(b2, "UTF-16LE")); // ?<
System.out.println(new String(b2, "UTF-16"));   // <

byte[] b3 = new byte[]{-17, -69, -65, 60};
System.out.println(new String(b3, "UTF-8"));    // ?<

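In the original post the results above were color-coded, red for wrong and green for correct. The outputs with a leading ? are the wrong ones: the BOM came through as a character instead of being consumed.
So in Java the BOM serves only to tell the generic UTF-16 charset whether the data is big-endian or little-endian. The decoders do not use it to tell UTF-16BE, UTF-16LE, UTF-16 and UTF-8 apart.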
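Printed output can be misleading here, because a retained BOM decodes to U+FEFF, a zero-width character that some consoles show as ? and others do not show at all. A sketch that compares string lengths instead avoids that ambiguity; the class name is hypothetical, and the behavior relied on is the one described in the java.nio.charset.Charset documentation (only the generic UTF-16 charset interprets the byte order mark).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDecodeCheck {
    /** Number of chars produced when decoding bytes with cs; a retained BOM adds one. */
    static int decodedLength(byte[] bytes, Charset cs) {
        return new String(bytes, cs).length();
    }

    public static void main(String[] args) {
        byte[] be = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x3C}; // BOM + "<", big-endian
        byte[] le = {(byte) 0xFF, (byte) 0xFE, 0x3C, 0x00}; // BOM + "<", little-endian
        byte[] u8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x3C}; // BOM + "<", UTF-8

        // Length 1 means the BOM was consumed; length 2 means it was kept as U+FEFF.
        System.out.println("UTF-16   on BE bytes: " + decodedLength(be, StandardCharsets.UTF_16));
        System.out.println("UTF-16   on LE bytes: " + decodedLength(le, StandardCharsets.UTF_16));
        System.out.println("UTF-16BE on BE bytes: " + decodedLength(be, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE on LE bytes: " + decodedLength(le, StandardCharsets.UTF_16LE));
        System.out.println("UTF-8    on U8 bytes: " + decodedLength(u8, StandardCharsets.UTF_8));
    }
}
```

If a decoder keeps the BOM, the caller has to skip those bytes before handing the stream to a parser.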

Guessing

Take the first 4 bytes and check whether they match <?xm
UTF_16BE: 0x00 0x3C 0x00 0x3F
UTF_16LE: 0x3C 0x00 0x3F 0x00
UTF_8: 0x3C 0x3F 0x78 0x6D
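The guess can be written as a lookup on the first four bytes. This is a sketch under the assumptions of the table above, with a hypothetical class name. Note that the 3C 3F 78 6D pattern matches any ASCII-compatible encoding, gb2312 included, so a "UTF-8" result only means "read the declaration as ASCII to find the real encoding".

```java
final class XmlPrefixGuess {
    /** Guesses the encoding family from the bytes of "<?xm", or returns null if nothing matches. */
    static String guess(byte[] head) {
        if (head.length < 4) return null;
        int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF, b2 = head[2] & 0xFF, b3 = head[3] & 0xFF;
        if (b0 == 0x00 && b1 == 0x3C && b2 == 0x00 && b3 == 0x3F) return "UTF-16BE";
        if (b0 == 0x3C && b1 == 0x00 && b2 == 0x3F && b3 == 0x00) return "UTF-16LE";
        // Also true for gb2312, ISO-8859-1 and other ASCII-compatible encodings.
        if (b0 == 0x3C && b1 == 0x3F && b2 == 0x78 && b3 == 0x6D) return "UTF-8";
        return null;
    }

    public static void main(String[] args) {
        System.out.println(guess(new byte[]{0x00, 0x3C, 0x00, 0x3F}));
        System.out.println(guess(new byte[]{0x3C, 0x3F, 0x78, 0x6D}));
    }
}
```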

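5) The fix

Use the first method (read the encoding from the XML declaration):

Wrap the stream in a PushbackInputStream and read ahead a small amount of data (200 bytes).
Push the bytes back (PushbackInputStream.unread) so that the input stream stays complete.
Use a regular expression on the first line of the data to extract the encoding.
Construct new InputStreamReader(pis, encoding) and pass it to SAXReader, so that the parser no longer has to guess the encoding.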
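The four steps above could look roughly like the following. This is a sketch rather than the original crawler code: the class name, method names, regular expression and fallback parameter are all assumptions made for illustration. It only handles ASCII-compatible encodings such as gb2312 and utf-8; a UTF-16 document would not match the regex and would get the fallback.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlDeclEncoding {
    private static final int PEEK = 200; // read-ahead size mentioned in the text
    private static final Pattern ENCODING =
        Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Extracts the declared encoding from the first len bytes, or returns fallback. */
    static String sniff(byte[] head, int len, String fallback) {
        // ISO-8859-1 maps every byte to one char, so the ASCII declaration is decoded losslessly.
        String prefix = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : fallback;
    }

    /** Wraps the stream in a Reader that decodes with the declared encoding. */
    static Reader open(InputStream in, String fallback) {
        try {
            PushbackInputStream pis = new PushbackInputStream(in, PEEK);
            byte[] head = new byte[PEEK];
            int len = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (len < PEEK) {
                int n = pis.read(head, len, PEEK - len);
                if (n < 0) break;
                len += n;
            }
            if (len > 0) pis.unread(head, 0, len); // give the bytes back to the stream
            return new InputStreamReader(pis, sniff(head, len, fallback));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience for trying the sketch: decodes a whole in-memory document. */
    static String decode(byte[] xml, String fallback) {
        try (Reader reader = open(new ByteArrayInputStream(xml), fallback)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int n; (n = reader.read(buf)) != -1; ) sb.append(buf, 0, n);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With dom4j on the classpath, the returned Reader would be handed to the parser along the lines of `new SAXReader().read(XmlDeclEncoding.open(urlConnection.getInputStream(), "UTF-8"))`. The "UTF-8" fallback is a choice made here, not something the original specifies.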

6) The bug in org.dom4j.Document.asXML()

After the steps above the input is read correctly and the Document parses successfully, so why does org.dom4j.Document.asXML() still return garbled text?

Look at the code:
public String asXML() {
    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        XMLWriter writer = new XMLWriter(out, outputFormat);
        writer.write(this);
        return out.toString();
    }
    catch (IOException e) {
        throw new RuntimeException("IOException while generating textual representation: " + e.getMessage());
    }
}

6.1) Where is the problem?

out.toString()

java.io.ByteArrayOutputStream.toString()

return new String(buf, 0, count);

java.lang.String(byte bytes[], int offset, int length)

char[] v = StringCoding.decode(bytes, offset, length);

java.lang.StringCoding.decode(byte[] ba, int off, int len)

String csn = Charset.defaultCharset().name();
try {
return decode(csn, ba, off, len);
} catch (UnsupportedEncodingException x) {
warnUnsupportedCharset(csn);
}
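toString() decodes the bytes with the system default encoding, regardless of the encoding XMLWriter used to write them (the one configured in outputFormat), and that is what garbles the output.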
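One way around this, sketched below with the JDK alone, is to decode the buffer with the same charset that was used to fill it, via ByteArrayOutputStream.toString(String charsetName). The class and method names are hypothetical, and plain getBytes stands in for what XMLWriter would write; this is not a patch to dom4j itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

final class ExplicitToString {
    /** Writes text into a byte buffer in the given charset and decodes it with the same charset. */
    static String roundTrip(String text, String charsetName) {
        try {
            byte[] bytes = text.getBytes(charsetName); // stands in for the XMLWriter output
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(bytes, 0, bytes.length);
            // toString() with no argument would use the default charset here.
            return out.toString(charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<name>\u4e2d\u6587</name>";
        System.out.println(roundTrip(xml, "UTF-16BE").equals(xml));
        System.out.println(roundTrip(xml, "UTF-8").equals(xml));
    }
}
```

In dom4j code, an alternative that avoids bytes altogether is to write the Document through an XMLWriter constructed over a StringWriter, so that no byte-to-char decoding takes place at all.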