A complete fix for garbled text in data fetched with HttpClient

阿泽财商会

Category: java. Tags: encoding, compression, utf-8, gzip, garbled text

Article link: https://blog.csdn.net/zzq900503/article/details/39203331

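There are several common ways to fix garbled text, plus two special cases.

First, make sure you are using the correct request method (POST or GET).

If the output is garbled, try each of the following approaches on its own: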

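// A response entity normally cannot be read twice, so use only ONE of these approaches per response.
if (HttpStatus.SC_OK == response.getStatusLine().getStatusCode()) {
    entity = response.getEntity();
    if (entity != null) {
        // Approach 1: pass the charset to EntityUtils (used only when the response declares none)
        System.out.println(EntityUtils.toString(entity, "GBK"));

        // Approach 2: assumes the body was decoded as ISO-8859-1 (the default when no charset
        // is declared), then re-decodes the same bytes as GBK
        String outstr = new String(EntityUtils.toString(entity).getBytes("ISO-8859-1"), "GBK");
        System.out.println(outstr);

        // Approach 3: the same as approach 2, written in two steps
        String responseString = EntityUtils.toString(entity);
        responseString = new String(responseString.getBytes("ISO-8859-1"), "GBK");
        System.out.println(responseString);
    }
}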

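None of these three approaches worked for me; the output stayed garbled.

Then I found another approach: set the charset on the client before executing the request.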
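One likely reason the re-decoding trick in approaches 2 and 3 fails is that it only works when the first decode was lossless. The sketch below uses only the JDK, not HttpClient, and the class and method names (`RoundTripDemo`, `redecode`) are made up for illustration. It shows that bytes wrongly decoded as ISO-8859-1 can be recovered, while bytes wrongly decoded as UTF-8 cannot, because malformed sequences are replaced during decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Decode raw bytes with the wrong charset, then apply the
    // "getBytes(wrong) and decode again" trick with the right one.
    static String redecode(byte[] raw, Charset wrong, Charset right) {
        String garbled = new String(raw, wrong);
        return new String(garbled.getBytes(wrong), right);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // the two characters of "Chinese text"
        byte[] raw = original.getBytes(GBK);

        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        System.out.println(original.equals(
                redecode(raw, StandardCharsets.ISO_8859_1, GBK))); // true

        // UTF-8 decoding replaces the malformed bytes, so the original is gone.
        System.out.println(original.equals(
                redecode(raw, StandardCharsets.UTF_8, GBK))); // false
    }
}
```

So if the body has already been decoded with a charset other than ISO-8859-1, for example one taken from the response header, re-encoding the string afterwards cannot repair it.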

client.getParams().setParameter("http.protocol.content-charset", "UTF-8");
this.response = client.execute(hp);

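If none of the above helps, you may be hitting one of the two special cases below.

Case 1: the response body is compressed

While extracting web pages with HttpClient, I noticed in a packet capture that the request headers included Accept-Encoding: gzip, deflate.

When the request carries this header, the server may compress the response body with gzip or deflate before returning it. So far I have only seen gzip in use. What the client receives is the compressed data, which often lightens the load on the server and reduces network traffic.

If the header is sent and you do not handle the compression, the output will be garbled. That is inevitable, because you are decoding compressed bytes as if they were text. The code below uses HttpClient's decompressing entities to handle responses to requests that carry Accept-Encoding: gzip, deflate, so the content can be read normally.

Add the following code: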

if (httpResponse.getStatusLine().getStatusCode() == 200) {
    HttpEntity httpEntity = httpResponse.getEntity();
    // Wrap the entity in a decompressing entity when the server compressed the body
    if (httpEntity != null && httpEntity.getContentEncoding() != null) {
        String contentEncoding = httpEntity.getContentEncoding().getValue();
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            httpEntity = new GzipDecompressingEntity(httpEntity);
        } else if ("deflate".equalsIgnoreCase(contentEncoding)) {
            httpEntity = new DeflateDecompressingEntity(httpEntity);
        }
    }
    result = EntityUtils.toString(httpEntity, encode); // read the response body as a string
    // System.out.println(result);
}
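To see why the body has to be decompressed before any charset is applied, here is a minimal JDK-only sketch. It does not use HttpClient, and `GzipBodyDemo`, `gzip`, `looksGzipped` and `decodeBody` are hypothetical names chosen for this example. It detects gzip by its two magic bytes purely for demonstration; in a real client the Content-Encoding header, as checked in the code above, is the better signal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipBodyDemo {
    // Simulate what a server does to the body when it answers with gzip.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    // A gzip stream always starts with the magic bytes 0x1f 0x8b.
    static boolean looksGzipped(byte[] body) {
        return body.length >= 2
                && (body[0] & 0xff) == 0x1f
                && (body[1] & 0xff) == 0x8b;
    }

    // Decompress first if needed, and only then decode bytes into text.
    static String decodeBody(byte[] body, Charset charset) throws IOException {
        if (!looksGzipped(body)) {
            return new String(body, charset);
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        }
    }

    public static void main(String[] args) throws IOException {
        Charset gbk = Charset.forName("GBK");
        byte[] body = gzip("\u4e2d\u6587 page".getBytes(gbk));
        System.out.println(new String(body, gbk)); // compressed bytes decoded as text: garbage
        System.out.println(decodeBody(body, gbk)); // decompressed first, then decoded
    }
}
```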

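Case 2: sometimes System.out.println(EntityUtils.toString(entity,"GBK")) decodes most of the text correctly, but a small part is still garbled.

This happens because the charset we passed in was not actually used.

Here is the EntityUtils source from httpcore-4.2.4.jar: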

 public static String toString(
            final HttpEntity entity, final String defaultCharset) throws IOException, ParseException {
        return toString(entity, defaultCharset != null ? Charset.forName(defaultCharset) : null);
    }

public static String toString(
            final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
        if (entity == null) {
            throw new IllegalArgumentException("HTTP entity may not be null");
        }
        InputStream instream = entity.getContent();
        if (instream == null) {
            return null;
        }
        try {
            if (entity.getContentLength() > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("HTTP entity too large to be buffered in memory");
            }
            int i = (int)entity.getContentLength();
            if (i < 0) {
                i = 4096;
            }
            Charset charset = null;
            try {
                ContentType contentType = ContentType.get(entity);
                if (contentType != null) {
                    charset = contentType.getCharset();
                }
            } catch (final UnsupportedCharsetException ex) {
                throw new UnsupportedEncodingException(ex.getMessage());
            }
            if (charset == null) {
                charset = defaultCharset;
            }
            if (charset == null) {
                charset = HTTP.DEF_CONTENT_CHARSET;
            }
            Reader reader = new InputStreamReader(instream, charset);
            CharArrayBuffer buffer = new CharArrayBuffer(i);
            char[] tmp = new char[1024];
            int l;
            while((l = reader.read(tmp)) != -1) {
                buffer.append(tmp, 0, l);
            }
            return buffer.toString();
        } finally {
            instream.close();
        }
    }

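Notice that it first reads the charset declared in the response's Content-Type header, and if one is present, it uses that instead of the charset we passed in.

However, sometimes a site's header declares gb2312 while the page is actually encoded in gbk. GBK is a superset of GB2312, so any character that exists only in GBK comes out garbled.

So we need to rewrite the method above with the part that reads the header charset commented out.

The method I ended up using is as follows: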
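The precedence used by that source can be condensed into a few lines. This is a simplified JDK-only sketch, not the library's code: `CharsetPrecedenceDemo`, `charsetFromContentType` and `resolve` are hypothetical names, and the header parsing is deliberately naive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class CharsetPrecedenceDemo {
    // Naive parser for values such as "text/html; charset=gb2312".
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return null;
    }

    // Same order as in the quoted source: header charset first, then the
    // caller's default, then ISO-8859-1 (the value of HTTP.DEF_CONTENT_CHARSET).
    static Charset resolve(String contentType, Charset defaultCharset) {
        Charset charset = charsetFromContentType(contentType);
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = StandardCharsets.ISO_8859_1;
        }
        return charset;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // The header wins, so the GBK argument is ignored here.
        System.out.println(resolve("text/html; charset=gb2312", gbk));
        // No charset in the header, so the GBK argument is used.
        System.out.println(resolve("text/html", gbk));
    }
}
```

The second argument is only a fallback, which is why passing "GBK" has no effect when the server declares a charset of its own.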

result = enCodetoString(httpEntity, encode);// 取出应答字符串

 public static String enCodetoString(
	            final HttpEntity entity, final String defaultCharset) throws IOException, ParseException {
	        return enCodetoStringDo(entity, defaultCharset != null ? Charset.forName(defaultCharset) : null);
	    }
	
	  public static String enCodetoStringDo(
	            final HttpEntity entity, Charset defaultCharset) throws IOException, ParseException {	
	        if (entity == null) {
	            throw new IllegalArgumentException("HTTP entity may not be null");
	        }
	        InputStream instream = entity.getContent();
	        if (instream == null) {
	            return null;
	        }
	        try {
	            if (entity.getContentLength() > Integer.MAX_VALUE) {
	                throw new IllegalArgumentException("HTTP entity too large to be buffered in memory");
	            }
	            int i = (int)entity.getContentLength();
	            if (i < 0) {
	                i = 4096;
	            }
	            Charset charset = null;
	            try {
//	                ContentType contentType = ContentType.get(entity);
//	                if (contentType != null) {
//	                    charset = contentType.getCharset();
//	                }
	            } catch (final UnsupportedCharsetException ex) {
	                throw new UnsupportedEncodingException(ex.getMessage());
	            }
	            if (charset == null) {
	                charset = defaultCharset;
	            }
	            if (charset == null) {
	                charset = HTTP.DEF_CONTENT_CHARSET;
	            }
	            Reader reader = new InputStreamReader(instream, charset);
	            CharArrayBuffer buffer = new CharArrayBuffer(i);
	            char[] tmp = new char[1024];
	            int l;
	            while((l = reader.read(tmp)) != -1) {
	                buffer.append(tmp, 0, l);
	            }
	            return buffer.toString();
	        } finally {
	            instream.close();
	        }
	    }

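Here is also a quick way to check how a given character is encoded and decoded:

// Print the bytes that gbk produces for a character
System.out.println(Arrays.toString("堎".getBytes(Charset.forName("gbk"))));

// Decode a byte pair as gb2312 to compare the result with the original character
System.out.println(new String(new byte[]{-120, -39}, Charset.forName("gb2312")));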
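The same check can be done without printing bytes by asking each charset's encoder whether it supports a character. A minimal sketch, with a hypothetical class name `GbkVsGb2312Demo` and a test character chosen for this example: the traditional character U+9F8D, which GBK covers and the simplified-only GB2312 set does not.

```java
import java.nio.charset.Charset;

class GbkVsGb2312Demo {
    // True if the named charset has a byte sequence for the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char common = '\u4e2d';      // a common character, present in both sets
        char traditional = '\u9f8d'; // a traditional character, present in GBK only

        System.out.println(canEncode("GB2312", common));      // true
        System.out.println(canEncode("GBK", common));         // true
        System.out.println(canEncode("GB2312", traditional)); // false
        System.out.println(canEncode("GBK", traditional));    // true
    }
}
```

Running this against a character that shows up garbled is a quick way to tell whether a page labelled gb2312 really needs gbk.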