I used to get garbled text (mojibake) all the time when fetching web pages. After digging through some references I finally understood how page encodings work, and today I'd like to share a way to have HttpClient detect a page's encoding automatically.
First, let's look at how browsers determine a page's encoding (source: http://every-best.iteye.com/blog/970861).
A browser determines the encoding in three ways:
1. The Content-Type HTTP header
2. The meta tag (two kinds of meta tag can declare an encoding)
3. The BOM (byte-order mark)
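The first source, the Content-Type header, carries the charset as a parameter, e.g. "text/html; charset=gb2312". As a minimal illustration (the class and method names here are my own, not from the article), extracting it can be sketched like this:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull the charset parameter out of a
// Content-Type header value such as "text/html; charset=gb2312".
public class ContentTypeCharset {
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    public static String fromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=gb2312")); // gb2312
        System.out.println(fromContentType("text/html"));                 // null
    }
}
```

In the HttpClient code later in this post, this job is done by EntityUtils.getContentCharSet(entity), so the sketch is only meant to show what that call is doing under the hood.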
When none of these three is present, the browser is supposed to default to US-ASCII rather than UTF-8; at least that is what the standard says, though browsers do not necessarily follow it:
Quote:
If the document does not start with a U+FEFF BYTE ORDER MARK (BOM) character, and if its encoding is not explicitly given by a Content-Type HTTP header, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element in the encoding declaration state.
When specifying an encoding, always make sure the name is one registered with IANA. For example, UTF-8 must be written exactly as UTF-8: some browsers will accept UTF8 (missing the hyphen), but on others it may fail.
IANA also makes it clear that encoding names are case-insensitive:
Quote:
However, no distinction is made between use of upper and lower case letters.
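On the Java side you can see both rules in action: Charset lookup is case-insensitive, and the JDK resolves its registered aliases to the canonical IANA name. (The behavior of the non-IANA alias "UTF8" is JDK-specific; I'm noting it here only as an observation, not something to rely on.)

```java
import java.nio.charset.Charset;

public class CharsetNames {
    public static void main(String[] args) {
        // Lookup is case-insensitive; name() returns the canonical IANA name.
        System.out.println(Charset.forName("utf-8").name()); // UTF-8
        System.out.println(Charset.forName("UTF-8").name()); // UTF-8
        // "UTF8" happens to be a JDK alias, but it is not the IANA name,
        // so it should never be emitted in a meta tag.
        System.out.println(Charset.isSupported("UTF8"));
    }
}
```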
Also, when the encoding is declared only via a meta tag (i.e. neither the HTTP header nor a BOM is present), it must be a superset of ASCII; this is the situation the original poster mentioned, where the meta tag may not be found at all.
The standard also forbids the following encodings, yet some browsers will happily parse UTF-7 and the like, which has led to real security problems:
UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB, the ISO-2022 family, the EBCDIC family, CESU-8, UTF-7, BOCU-1, SCSU
Finally, for XML documents (such as XHTML), the encoding should be given in the XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
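For completeness, extracting the encoding from such an XML declaration can be sketched the same way as the meta-tag scan (again, the class name is hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: read the encoding attribute out of an
// XML declaration like <?xml version="1.0" encoding="UTF-8"?>.
public class XmlDeclCharset {
    private static final Pattern DECL =
        Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w.:-]+)[\"']");

    public static String fromXmlDecl(String xml) {
        Matcher m = DECL.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>";
        System.out.println(fromXmlDecl(doc)); // UTF-8
    }
}
```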
With the theory out of the way, the code is straightforward.
public String readPage(String url) {
    String html = null;
    if (StringUtils.isBlank(url)) return null;
    if (!(url.startsWith("http://") || url.startsWith("https://"))) {
        url = "http://" + url;
    }
    HttpClient client = new DefaultHttpClient();
    setParameters(client);
    HttpResponse response = null;
    HttpContext httpContext = new BasicHttpContext();
    HttpGet get = new HttpGet(url);
    get.addHeader("Accept", "text/html");
    get.addHeader("Accept-Charset", "gb2312,utf-8");
    get.addHeader("Accept-Encoding", "gzip");
    get.addHeader("Accept-Language", "zh-cn,zh,en-US,en");
    get.addHeader("User-Agent", util.UserAgent.getUserAgent());
    HttpEntity entity = null;
    try {
        response = client.execute(get, httpContext);
        entity = response.getEntity();
        // Transparently unwrap gzip-compressed responses.
        Header header = entity.getContentEncoding();
        if (header != null) {
            for (HeaderElement codec : header.getElements()) {
                if (codec.getName().equalsIgnoreCase("gzip")) {
                    response.setEntity(new GzipDecompressingEntity(entity));
                }
            }
        }
        entity = response.getEntity();
        // Recover the final URL after redirects (realUrl is an instance field).
        HttpHost targetHost = (HttpHost) httpContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
        HttpUriRequest realRequest = (HttpUriRequest) httpContext.getAttribute(ExecutionContext.HTTP_REQUEST);
        realUrl = ExtractorUtil.connectUrl(targetHost.toString(), realRequest.toString());
        byte[] bytes = EntityUtils.toByteArray(entity);
        // 1) Try the charset from the Content-Type header;
        // 2) fall back to sniffing the meta tag / BOM from the raw bytes.
        String charset = EntityUtils.getContentCharSet(entity);
        if (StringUtils.isBlank(charset)) {
            charset = FileUtil.getHtmlCharset(bytes);
        }
        html = new String(bytes, charset);
        // Convert traditional-Chinese (Big5) pages to simplified Chinese.
        if (charset.equalsIgnoreCase("BIG5")) {
            html = ZHConverter.convert(html, ZHConverter.SIMPLIFIED);
        }
        EntityUtils.consume(entity);
    } catch (Exception e) {
        logger.debug(e.getMessage() + url);
        e.printStackTrace();
    }
    return html;
}
/**
 * Guess the charset from the byte-order mark (BOM) at the start of the page.
 * @param bytes the raw page bytes
 * @return the charset name implied by the BOM, "GBK" if no BOM is found,
 *         or null if there are too few bytes to check
 */
public static String getEncode(byte[] bytes) {
    String code = null;
    if (bytes == null || bytes.length < 2) {
        return code;
    }
    int p = ((int) bytes[0] & 0x00ff) << 8 | ((int) bytes[1] & 0x00ff);
    switch (p) {
        case 0xefbb:          // EF BB (BF): UTF-8 BOM
            code = "UTF-8";
            break;
        case 0xfffe:          // FF FE: UTF-16LE BOM ("Unicode" in Windows parlance)
            code = "Unicode";
            break;
        case 0xfeff:          // FE FF: UTF-16BE BOM
            code = "UTF-16BE";
            break;
        default:              // no BOM: assume GBK, a sensible default for Chinese pages
            code = "GBK";
    }
    return code;
}
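To sanity-check the BOM branch, here is a standalone variant of the same two-byte prefix check (renamed, and returning null instead of the GBK fallback, so it only reports an actual BOM):

```java
public class BomDemo {
    // Same two-byte prefix check as getEncode above, inlined for a quick demo.
    static String sniff(byte[] b) {
        if (b == null || b.length < 2) return null;
        int p = ((b[0] & 0xff) << 8) | (b[1] & 0xff);
        switch (p) {
            case 0xefbb: return "UTF-8";    // EF BB BF
            case 0xfffe: return "UTF-16LE"; // FF FE
            case 0xfeff: return "UTF-16BE"; // FE FF
            default:     return null;       // no BOM
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        System.out.println(sniff(utf8Bom));             // UTF-8
        System.out.println(sniff("plain".getBytes()));  // null
    }
}
```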
/**
 * Determine the page encoding:
 * 1. check whether the HTML meta tag carries charset information;
 * 2. otherwise fall back to the BOM.
 * @param bytes the raw page bytes
 * @return the detected charset name
 */
public static String getHtmlCharset(byte[] bytes) {
    // Decoding with the platform default charset is acceptable here because
    // we only scan for the ASCII-compatible meta tag.
    String content = new String(bytes);
    String charset = null;
    Pattern pattern = Pattern.compile(
            "<[mM][eE][tT][aA][^>]*([cC][Hh][Aa][Rr][Ss][Ee][Tt][\\s]*=[\\s\\\"']*)([\\w-]*)[^>]*>");
    Matcher matcher = pattern.matcher(content);
    if (matcher.find()) {
        charset = matcher.group(2);
    } else {
        charset = getEncode(bytes);
    }
    return charset;
}
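The spelled-out [mM][eE][tT][aA] character classes work, but the same scan reads more cleanly with Pattern.CASE_INSENSITIVE. An equivalent sketch (class name mine):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Equivalent to the meta-tag scan above, using the CASE_INSENSITIVE flag
// instead of per-letter character classes. Matches both the old
// http-equiv style and the HTML5 <meta charset=...> style.
public class MetaCharset {
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*charset\\s*=[\\s\"']*([\\w-]+)", Pattern.CASE_INSENSITIVE);

    static String fromHtml(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String legacy = "<META http-equiv=Content-Type content=\"text/html; charset=GBK\">";
        System.out.println(fromHtml(legacy));                // GBK
        System.out.println(fromHtml("<meta charset='utf-8'>")); // utf-8
    }
}
```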