The Overlooked Double Quotes
A program that extracts HTML content from mht files contains the following code:
String strEncodng = getEncoding(bp1);
String strText = getHtmlText(bp1, strEncodng);
While processing one particular mht file, it failed with this error:
java.io.UnsupportedEncodingException: "unicode"
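My first guess was that the problem came from strEncodng being unicode: perhaps the encoding declared in the file itself was wrong, and substituting another one would help. UTF8 did not work; UTF16 did.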
String strEncodng = getEncoding(bp1);
// strEncodng = "UTF8"
strEncodng = "UTF16";
String strText = getHtmlText(bp1, strEncodng);
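Of course the program cannot stay like this, or other mht files would no longer be handled correctly. So I added a patch: when the encoding is unicode, change it to UTF16.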
String strEncodng = getEncoding(bp1);
if (strEncodng.equals("unicode")) {
    strEncodng = "UTF16";
}
String strText = getHtmlText(bp1, strEncodng);
I tested again and, strangely, got the same exception. So I added log output to see what was actually going on:
String strEncodng = getEncoding(bp1);
log.debug("strEncodng=" + strEncodng);
if (strEncodng.equals("unicode")) {
    strEncodng = "UTF16";
}
log.debug("strEncodng=" + strEncodng);
String strText = getHtmlText(bp1, strEncodng);
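Running it showed that the two log lines were identical: the if branch was never entered.
01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
Could strEncodng contain special characters that are invisible in the output? I added a few more lines to check: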
String strEncodng = getEncoding(bp1);
log.debug("strEncodng=" + strEncodng);
if (strEncodng.equals("unicode")) {
    strEncodng = "UTF16";
} else {
    log.debug("strEncodng.length=" + strEncodng.length());
    log.debug("strEncodng.contains=" + strEncodng.contains("unicode"));
    for (int i = 0; i < strEncodng.length(); ++i) {
        log.debug("strEncodng[" + i + "]=" + strEncodng.charAt(i) + " " + (int) strEncodng.charAt(i));
    }
}
log.debug("strEncodng=" + strEncodng);
String strText = getHtmlText(bp1, strEncodng);
Running it again:
01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
01:05:01.307 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng.length=9
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng.contains=true
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[0]=" 34
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[1]=u 117
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[2]=n 110
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[3]=i 105
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[4]=c 99
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[5]=o 111
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[6]=d 100
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[7]=e 101
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng[8]=" 34
01:05:01.308 DEBUG http-bio-8080-exec-64 hh1.common.mht.Html2MHTCompiler - strEncodng="unicode"
At last the cause was clear: strEncodng contains not just unicode but also the double quotes wrapping it. The length is 9 rather than 7, and the characters at both ends have code 34. Removing the double quotes should be enough, so I changed the code again:
String strEncodng = getEncoding(bp1);
strEncodng = strEncodng.replace("\"", "");
String strText = getHtmlText(bp1, strEncodng);
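The replace call above removes every double quote anywhere in the string. A slightly narrower variant, sketched below, strips only surrounding whitespace and a single pair of enclosing quotes. The class and method names (CharsetNames, unquote) are hypothetical and not part of the original program. It also rests on an assumption: getEncoding is not shown, but a likely source of the quotes is a MIME Content-Type header, where a parameter value may legally be written as a quoted string such as charset="unicode".

```java
final class CharsetNames {
    private CharsetNames() {}

    /**
     * Hypothetical helper: trims whitespace and removes one pair of
     * enclosing double quotes, if both are present. Returns null for null.
     */
    static String unquote(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim();
        boolean quoted = s.length() >= 2
                && s.charAt(0) == '"'
                && s.charAt(s.length() - 1) == '"';
        if (quoted) {
            // Drop the first and last character, then trim again in case
            // there was padding inside the quotes.
            s = s.substring(1, s.length() - 1).trim();
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(unquote("\"unicode\""));
        System.out.println(unquote("  utf-8 "));
    }
}
```

In the program above this would be used as strEncodng = CharsetNames.unquote(strEncodng); whether it is preferable to the plain replace depends on what other odd values getEncoding can return.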
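With the quotes removed, the file was processed correctly. It all came down to those overlooked double quotes. In hindsight, both the exception message and the log output had already hinted at it: the double quotes in "unicode" are part of the string itself.
PS: In actual testing, strEncodng can also be null, so adding a check for that is safer.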
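The point about the exception message can be checked in isolation. The sketch below is not taken from the original program; QuotedCharsetDemo and failureMessage are made-up names. It decodes a byte with a given charset name and returns the message of the resulting UnsupportedEncodingException. On the JDKs I am aware of, that message is the charset name exactly as it was passed in, so quotes in the message mean quotes in the name.

```java
import java.io.UnsupportedEncodingException;

class QuotedCharsetDemo {
    /**
     * Hypothetical helper: tries to decode one byte with the given charset
     * name. Returns the exception message on failure, or null on success.
     */
    static String failureMessage(String charsetName) {
        try {
            new String(new byte[] {65}, charsetName);
            return null;
        } catch (UnsupportedEncodingException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The name below is 9 characters long: quote, u-n-i-c-o-d-e, quote.
        System.out.println(failureMessage("\"unicode\""));
        // A well-formed, supported name decodes without an exception.
        System.out.println(failureMessage("UTF-16"));
    }
}
```

Seen this way, the error line java.io.UnsupportedEncodingException: "unicode" was already reporting the real value of strEncodng, quotes included.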
String strEncodng = getEncoding(bp1);
if (strEncodng == null) {
    strEncodng = "GBK";
} else {
    strEncodng = strEncodng.replace("\"", "");
}
String strText = getHtmlText(bp1, strEncodng);
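The null check and the quote removal can be folded into one helper that also verifies the name before it is used. What follows is only a sketch under assumptions: EncodingResolver and resolve are hypothetical names, and falling back when the name is unknown is a design choice of this example, not something the original program does. It relies on java.nio.charset.Charset.isSupported, which throws IllegalCharsetNameException for syntactically illegal names, hence the catch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class EncodingResolver {
    /**
     * Hypothetical helper: returns a usable charset name, or the fallback
     * when the raw value is null, empty, malformed, or not supported.
     */
    static String resolve(String raw, String fallback) {
        if (raw == null) {
            return fallback;
        }
        // Same cleanup as in the article: drop all double quotes.
        String name = raw.replace("\"", "").trim();
        if (name.isEmpty()) {
            return fallback;
        }
        try {
            return Charset.isSupported(name) ? name : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("\"UTF-16\"", "UTF-8"));
        System.out.println(resolve(null, "UTF-8"));
    }
}
```

To match the code above, the call would be strEncodng = EncodingResolver.resolve(getEncoding(bp1), "GBK"); this assumes GBK is available in the running JRE.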