I used to get garbled text (mojibake) all the time when fetching web pages. After digging through some references I finally understood how page encodings work, and today I'd like to share a way to have HttpClient detect a page's encoding automatically.
First, let's look at how browsers determine a page's encoding (source: http://every-best.iteye.com/blog/970861).
A browser determines the encoding in three ways:
1. The Content-Type HTTP header
2. The meta tag (two kinds of meta tag can declare an encoding)
3. The BOM (byte-order mark)
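The first source, the Content-Type header, carries the charset as a parameter, e.g. "text/html; charset=gb2312". As a minimal illustration (the class and method names here are my own, not from the article), extracting it can be sketched like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull the charset parameter out of a
// Content-Type header value such as "text/html; charset=gb2312".
public class ContentTypeCharset {
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    public static String fromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=gb2312")); // gb2312
        System.out.println(fromContentType("text/html"));                 // null
    }
}
```

In the HttpClient code later in this post, this job is done by EntityUtils.getContentCharSet(entity), so the sketch is only meant to show what that call is doing under the hood.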
When none of these three is present, the browser is supposed to default to US-ASCII rather than UTF-8; at least that is what the standard says, though browsers do not necessarily follow it:
Quote:
If the document does not start with a U+FEFF BYTE ORDER MARK (BOM) character, and if its encoding is not explicitly given by a Content-Type HTTP header, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element in the encoding declaration state.
When specifying an encoding, always make sure the name is one registered with IANA. For example, UTF-8 must be written exactly as UTF-8: some browsers will accept UTF8 (missing the hyphen), but on others it may fail.
IANA also makes it clear that encoding names are case-insensitive:
Quote:
However, no distinction is made between use of upper and lower case letters.
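On the Java side you can see both rules in action: Charset lookup is case-insensitive, and the JDK resolves its registered aliases to the canonical IANA name. (The behavior of the non-IANA alias "UTF8" is JDK-specific; I'm noting it here only as an observation, not something to rely on.)

```java
import java.nio.charset.Charset;

public class CharsetNames {
    public static void main(String[] args) {
        // Lookup is case-insensitive; name() returns the canonical IANA name.
        System.out.println(Charset.forName("utf-8").name()); // UTF-8
        System.out.println(Charset.forName("UTF-8").name()); // UTF-8
        // "UTF8" happens to be a JDK alias, but it is not the IANA name,
        // so it should never be emitted in a meta tag.
        System.out.println(Charset.isSupported("UTF8"));
    }
}
```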
Also, when the encoding is declared only via a meta tag (i.e. neither the HTTP header nor a BOM is present), it must be a superset of ASCII; this is the situation the original poster mentioned, where the meta tag may not be found at all.
The standard also forbids the following encodings, yet some browsers will happily parse UTF-7 and the like, which has led to real security problems:
UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB, the ISO-2022 family, the EBCDIC family, CESU-8, UTF-7, BOCU-1, SCSU
Finally, for XML documents (such as XHTML), the encoding should be given in the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
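For completeness, extracting the encoding from such an XML declaration can be sketched the same way as the meta-tag scan (again, the class name is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: read the encoding attribute out of an
// XML declaration like <?xml version="1.0" encoding="UTF-8"?>.
public class XmlDeclCharset {
    private static final Pattern DECL =
        Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w.:-]+)[\"']");

    public static String fromXmlDecl(String xml) {
        Matcher m = DECL.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
        System.out.println(fromXmlDecl(doc)); // UTF-8
    }
}
```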
With the theory out of the way, the code is straightforward.
public String readPage(String url) {
    String html = null;
    if (StringUtils.isBlank(url)) return null;
    if (!(url.startsWith("http://") || url.startsWith("https://"))) {
        url = "http://" + url;
    }
    HttpClient client = new DefaultHttpClient();
    setParameters(client);
    HttpResponse response = null;
    HttpContext httpContext = new BasicHttpContext();
    HttpGet get = new HttpGet(url);
    get.addHeader("Accept", "text/html");
    get.addHeader("Accept-Charset", "gb2312,utf-8");
    get.addHeader("Accept-Encoding", "gzip");
    get.addHeader("Accept-Language", "zh-cn,zh,en-US,en");
    get.addHeader("User-Agent", util.UserAgent.getUserAgent());
    HttpEntity entity = null;
    try {
        response = client.execute(get, httpContext);
        entity = response.getEntity();
        // Transparently unwrap gzip-compressed responses.
        Header header = entity.getContentEncoding();
        if (header != null) {
            for (HeaderElement codec : header.getElements()) {
                if (codec.getName().equalsIgnoreCase("gzip")) {
                    response.setEntity(new GzipDecompressingEntity(entity));
                }
            }
        }
        entity = response.getEntity();
        // Recover the final URL after redirects (realUrl is an instance field).
        HttpHost targetHost = (HttpHost) httpContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
        HttpUriRequest realRequest = (HttpUriRequest) httpContext.getAttribute(ExecutionContext.HTTP_REQUEST);
        realUrl = ExtractorUtil.connectUrl(targetHost.toString(), realRequest.toString());
        byte[] bytes = EntityUtils.toByteArray(entity);
        // 1) Try the charset from the Content-Type header;
        // 2) fall back to sniffing the meta tag / BOM from the raw bytes.
        String charset = EntityUtils.getContentCharSet(entity);
        if (StringUtils.isBlank(charset)) {
            charset = FileUtil.getHtmlCharset(bytes);
        }
        html = new String(bytes, charset);
        // Convert traditional-Chinese (Big5) pages to simplified Chinese.
        if (charset.equalsIgnoreCase("BIG5")) {
            html = ZHConverter.convert(html, ZHConverter.SIMPLIFIED);
        }
        EntityUtils.consume(entity);
    } catch (Exception e) {
        logger.debug(e.getMessage() + url);
        e.printStackTrace();
    }
    return html;
}
/**
 * Guess the charset from the byte-order mark (BOM) at the start of the page.
 * @param bytes the raw page bytes
 * @return the charset name implied by the BOM, "GBK" if no BOM is found,
 *         or null if there are too few bytes to check
 */
public static String getEncode(byte[] bytes) {
    String code = null;
    if (bytes == null || bytes.length < 2) {
        return code;
    }
    int p = ((int) bytes[0] & 0x00ff) << 8 | ((int) bytes[1] & 0x00ff);
    switch (p) {
        case 0xefbb:          // EF BB (BF): UTF-8 BOM
            code = "UTF-8";
            break;
        case 0xfffe:          // FF FE: UTF-16LE BOM ("Unicode" in Windows parlance)
            code = "Unicode";
            break;
        case 0xfeff:          // FE FF: UTF-16BE BOM
            code = "UTF-16BE";
            break;
        default:              // no BOM: assume GBK, a sensible default for Chinese pages
            code = "GBK";
    }
    return code;
}
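To sanity-check the BOM branch, here is a standalone variant of the same two-byte prefix check (renamed, and returning null instead of the GBK fallback, so it only reports an actual BOM):

```java
public class BomDemo {
    // Same two-byte prefix check as getEncode above, inlined for a quick demo.
    static String sniff(byte[] b) {
        if (b == null || b.length < 2) return null;
        int p = ((b[0] & 0xff) << 8) | (b[1] & 0xff);
        switch (p) {
            case 0xefbb: return "UTF-8";    // EF BB BF
            case 0xfffe: return "UTF-16LE"; // FF FE
            case 0xfeff: return "UTF-16BE"; // FE FF
            default:     return null;       // no BOM
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        System.out.println(sniff(utf8Bom));             // UTF-8
        System.out.println(sniff("plain".getBytes()));  // null
    }
}
```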
/**
 * Determine the page encoding:
 * 1. check whether the HTML meta tag carries charset information;
 * 2. otherwise fall back to the BOM.
 * @param bytes the raw page bytes
 * @return the detected charset name
 */
public static String getHtmlCharset(byte[] bytes) {
    // Decoding with the platform default charset is acceptable here because
    // we only scan for the ASCII-compatible meta tag.
    String content = new String(bytes);
    String charset = null;
    Pattern pattern = Pattern.compile(
            "<[mM][eE][tT][aA][^>]*([cC][Hh][Aa][Rr][Ss][Ee][Tt][\\s]*=[\\s\\\"']*)([\\w-]*)[^>]*>");
    Matcher matcher = pattern.matcher(content);
    if (matcher.find()) {
        charset = matcher.group(2);
    } else {
        charset = getEncode(bytes);
    }
    return charset;
}
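The spelled-out [mM][eE][tT][aA] character classes work, but the same scan reads more cleanly with Pattern.CASE_INSENSITIVE. An equivalent sketch (class name mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Equivalent to the meta-tag scan above, using the CASE_INSENSITIVE flag
// instead of per-letter character classes. Matches both the old
// http-equiv style and the HTML5 <meta charset=...> style.
public class MetaCharset {
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*charset\\s*=[\\s\"']*([\\w-]+)", Pattern.CASE_INSENSITIVE);

    static String fromHtml(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String legacy = "<META http-equiv=Content-Type content=\"text/html; charset=GBK\">";
        System.out.println(fromHtml(legacy));                // GBK
        System.out.println(fromHtml("<meta charset='utf-8'>")); // utf-8
    }
}
```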