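A while ago I wrote some code that simulates a login and then scrapes page data, and I got badly bitten by a garbled-text problem.
Here is what happened: everything worked fine in my local Windows development environment, but as soon as the code ran on a Linux server the scraped pages came back garbled. I tried all kinds of transcoding and post-processing and none of it helped. In the end I had to set up a development environment on Ubuntu Linux before I could sort it out. I still haven't looked into exactly why the two environments behave differently.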
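One thing that can differ between a Windows desktop and a Linux server is the JVM's default charset, which on older JDKs follows the OS locale. Whether that was the cause here is not established; the following is only a minimal diagnostic sketch, using nothing but the JDK, that could be run on both machines to compare them. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical diagnostic helper; not part of the original scraper.
class DefaultCharsetCheck {

    // Returns true when encoding the sample with the platform default charset
    // produces exactly the same bytes as encoding it with UTF-8.
    static boolean defaultMatchesUtf8(String sample) {
        byte[] viaDefault = sample.getBytes(); // uses the platform default charset
        byte[] viaUtf8 = sample.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(viaDefault, viaUtf8);
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is two Chinese characters, written as escapes so the
        // source file's own encoding cannot affect the result.
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("matches UTF-8   = " + defaultMatchesUtf8("\u4e2d\u6587"));
    }
}
```

If the two machines print different values, any code path that converts between bytes and strings without naming a charset will behave differently on each.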
The code after the fix looks like this:
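DefaultHttpClient httpclient = new DefaultHttpClient();
// Browser-compatible cookie handling, so the session cookie set at login is kept.
HttpClientParams.setCookiePolicy(httpclient.getParams(), CookiePolicy.BROWSER_COMPATIBILITY);
HttpResponse response = null;

// Step 1: log in by POSTing the credentials as a UTF-8 encoded form.
HttpPost httpost = new HttpPost(loginUrl);
List<NameValuePair> nvps = new ArrayList<NameValuePair>();
nvps.add(new BasicNameValuePair("userid", "your-username"));
nvps.add(new BasicNameValuePair("password", "your-password"));
httpost.setEntity(new UrlEncodedFormEntity(nvps, "utf-8"));
response = httpclient.execute(httpost);
// Discard the login response body so the connection can be reused (null-safe).
EntityUtils.consume(response.getEntity());
//cookies = httpclient.getCookieStore().getCookies();

// Step 2: query the report with the same client, which still holds the login cookies.
List<NameValuePair> nvpsReport = new ArrayList<NameValuePair>();
nvpsReport.add(new BasicNameValuePair("perporty1", "value1"));
nvpsReport.add(new BasicNameValuePair("perporty2", "value2"));
HttpPost httpostReport = new HttpPost(queryPersonalReportUrl);
// URL_CHARACTER is a charset constant defined elsewhere in the project.
httpostReport.setEntity(new UrlEncodedFormEntity(nvpsReport, URL_CHARACTER));
response = httpclient.execute(httpostReport);
HttpEntity entity = response.getEntity();
// "utf-8" is only the fallback: a charset declared in the response's Content-Type wins.
String result = EntityUtils.toString(entity, "utf-8");
// toString already closes the content stream; this is just a null-safe safeguard.
EntityUtils.consume(entity);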
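To make the failure mode concrete, here is a small JDK-only sketch of what happens when response bytes are decoded with the wrong charset. It does not use HttpClient and is not taken from the original code; the class and method names are hypothetical, and it assumes the server really sends UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class MojibakeSketch {

    // Decodes raw body bytes with the charset the caller believes is correct.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // Undoes a wrong ISO-8859-1 decode. This works only because ISO-8859-1 maps
    // every byte to exactly one char, so the original bytes can be recovered.
    static String repairLatin1AsUtf8(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] body = original.getBytes(StandardCharsets.UTF_8); // what a UTF-8 server would send

        String wrong = decode(body, StandardCharsets.ISO_8859_1);
        String right = decode(body, StandardCharsets.UTF_8);

        System.out.println(wrong.equals(original));                     // false
        System.out.println(right.equals(original));                     // true
        System.out.println(repairLatin1AsUtf8(wrong).equals(original)); // true
    }
}
```

With HttpClient, the raw bytes could be obtained with EntityUtils.toByteArray(entity) and then decoded with an explicitly chosen charset, which sidesteps whatever charset the response header declares. That is an alternative to the toString call above, not what the original code does.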
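The key point is how the encoding of the fetched data is handled. I took a look at the source code of EntityUtils.toString, shown below: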
public static String toString(
        final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
    Args.notNull(entity, "Entity");
    final InputStream instream = entity.getContent();
    if (instream == null) {
        return null;
    }
    try {
        Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                "HTTP entity too large to be buffered in memory");
        int i = (int)entity.getContentLength();
        if (i < 0) {
            i = 4096;
        }
        Charset charset = null;
        try {
            final ContentType contentType = ContentType.get(entity);
            if (contentType != null) {
                charset = contentType.getCharset();
            }
        } catch (final UnsupportedCharsetException ex) {
            if (defaultCharset == null) {
                throw new UnsupportedEncodingException(ex.getMessage());
            }
        }
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = HTTP.DEF_CONTENT_CHARSET;
        }
        final Reader reader = new InputStreamReader(instream, charset);
        final CharArrayBuffer buffer = new CharArrayBuffer(i);
        final char[] tmp = new char[1024];
        int l;
        while((l = reader.read(tmp)) != -1) {
            buffer.append(tmp, 0, l);
        }
        return buffer.toString();
    } finally {
        instream.close();
    }
}
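Reading the method above, the charset is chosen in a fixed order: first the charset declared in the response's Content-Type, then the caller's default, and finally HTTP.DEF_CONTENT_CHARSET (ISO-8859-1 in HttpClient 4.x). The sketch below mirrors that order using only the JDK. It is a simplified stand-in rather than the library's code: the class, the method names, and the naive header parsing are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

// Hypothetical helper that imitates the fallback order of the method above.
class CharsetResolutionSketch {

    // Extracts the charset parameter from a Content-Type header value.
    // Returns null when the header is absent or has no charset parameter.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name); // throws if the name is unknown
            }
        }
        return null;
    }

    // Step 1: header charset. Step 2: caller's default. Step 3: ISO-8859-1.
    static Charset resolve(String contentType, Charset defaultCharset) {
        Charset charset = null;
        try {
            charset = fromContentType(contentType);
        } catch (UnsupportedCharsetException ex) {
            // As in the method above: fail only when there is no default to fall back on.
            if (defaultCharset == null) {
                throw ex;
            }
        }
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = StandardCharsets.ISO_8859_1;
        }
        return charset;
    }

    public static void main(String[] args) {
        // The header wins even though the caller asked for UTF-8.
        System.out.println(resolve("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        // No charset in the header, so the caller's default is used.
        System.out.println(resolve("text/html", StandardCharsets.UTF_8));
        // Nothing declared and no default: falls back to ISO-8859-1.
        System.out.println(resolve("text/html", null));
    }
}
```

The practical consequence is that passing "utf-8" to toString does not force UTF-8 decoding. If the server declares a charset that does not match the bytes it actually sends, the declared one is still used and the text can come out garbled.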