Accurately Detecting Web Page Encoding with HttpClient

Latest recommended article published 2020-08-29 17:27:08

silence1214

Tags: string, algorithm, byte, class, stream, browser

Article link: https://blog.csdn.net/silence1214/article/details/6106005


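I have recently been using HttpClient to crawl web pages, mostly Chinese sites, whose encodings are not known in advance. HttpClient determines the encoding from the response's Content-Type header, but some servers do not send a charset there at all, and HttpClient then falls back to ISO-8859-1. Searching online, I found a write-up of how browsers auto-detect a page's encoding (by someone from Beijing, as it happens), along with his Java implementation of the algorithm. To avoid large changes to code I had already written, I extended HttpClient's GetMethod (the class I was already using) and changed its charset detection. The complete code follows; it uses the chardet.jar library.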
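To make the failure mode concrete, here is a minimal sketch of what the ISO-8859-1 fallback does to a Chinese page. It uses only the JDK; the class and method names are made up for illustration, and the page is assumed to be UTF-8 encoded (a GBK page would garble in the same way).

```java
import java.nio.charset.StandardCharsets;

final class DefaultCharsetDemo {

    /** Decodes the body as ISO-8859-1, i.e. the fallback used when no charset is declared. */
    static String decodeWithFallback(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    /** Recovers the raw bytes from a string that was wrongly decoded as ISO-8859-1. */
    static byte[] undoFallback(String garbled) {
        // ISO-8859-1 maps every byte to exactly one char, so the mapping is reversible.
        return garbled.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two CJK characters
        // Assumption for this demo: the server sent the page as UTF-8.
        byte[] body = original.getBytes(StandardCharsets.UTF_8);

        String garbled = decodeWithFallback(body);
        System.out.println(garbled.equals(original)); // false

        // Once the real charset is known, the text can be decoded correctly.
        String repaired = new String(undoFallback(garbled), StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // true
    }
}
```

The fallback itself loses no bytes; the problem is purely that the wrong charset name is used for decoding, which is why finding the right name is enough to fix it.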

The first class wraps chardet.jar; the detection algorithm itself comes from the web:

```java
package com.baseframework.support;

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;
import org.mozilla.intl.chardet.nsPSMDetector;

/**
 * Detects the character encoding of a byte stream.
 *
 * @author sunyanan
 */
public class CharsetDetector {

    private static final CharsetDetector INSTANCE = new CharsetDetector();

    private CharsetDetector() {
    }

    public static CharsetDetector getInstance() {
        return INSTANCE;
    }

    /** Detects the charset using the Chinese language hint. */
    public String[] detectChineseCharset(InputStream in) throws IOException {
        return detect(in, nsPSMDetector.CHINESE);
    }

    /** Detects the charset without restricting the language. */
    public String[] detectAllCharset(InputStream in) throws IOException {
        return detect(in, nsPSMDetector.ALL);
    }

    private String[] detect(InputStream in, int lang) throws IOException {
        // Per-call state: the singleton is shared, so nothing is kept in fields.
        final String[] detected = new String[1];

        nsDetector det = new nsDetector(lang);
        // Notify() is called when a matching charset is found.
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                detected[0] = charset;
            }
        });

        boolean isAscii = true;
        BufferedInputStream imp = new BufferedInputStream(in);
        try {
            byte[] buf = new byte[1024];
            int len;
            while ((len = imp.read(buf, 0, buf.length)) != -1) {
                // Check whether the stream has been ASCII-only so far.
                if (isAscii) {
                    isAscii = det.isAscii(buf, len);
                }
                // Feed the detector once non-ASCII bytes appear; stop when it is done.
                if (!isAscii && det.DoIt(buf, len, false)) {
                    break;
                }
            }
        } finally {
            imp.close(); // also closes the wrapped stream
        }
        det.DataEnd();

        if (isAscii) {
            return new String[] { "ASCII" };
        }
        if (detected[0] != null) {
            return new String[] { detected[0] };
        }
        // No definite match: return the detector's list of probable charsets.
        return det.getProbableCharsets();
    }
}
```
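Both detectChineseCharset and detectAllCharset return an array of candidate names rather than a single answer, and nothing guarantees that every name is one the JVM can decode. The following is a small sketch of how a caller might pick a usable candidate. It uses only the JDK; the class name CharsetCandidates and the method firstSupported are hypothetical helpers, not part of chardet.jar.

```java
import java.nio.charset.Charset;

final class CharsetCandidates {

    /** Returns the first candidate the JVM can decode, or the fallback if none qualifies. */
    static Charset firstSupported(String[] candidates, Charset fallback) {
        if (candidates != null) {
            for (String name : candidates) {
                try {
                    if (name != null && Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // The name is not even syntactically legal: try the next candidate.
                }
            }
        }
        return fallback;
    }
}
```

A caller could then decode the page with `new String(bodyBytes, CharsetCandidates.firstSupported(names, StandardCharsets.ISO_8859_1))`, where `bodyBytes` and `names` stand for the response body and the detector's result.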

Next is the extension of GetMethod:

```java
package com.baseframework.httpcient;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.baseframework.support.CharsetDetector;

/**
 * Extends the standard org.apache.commons.httpclient.methods.GetMethod to
 * override the parent class's charset detection. The detection itself is
 * delegated to an open-source jar.
 *
 * @author sunyanan
 */
public class GetMethodForCharset extends GetMethod {

    private static final Log log = LogFactory.getLog(GetMethodForCharset.class);

    public GetMethodForCharset() {
        super();
    }

    public GetMethodForCharset(String uri) {
        super(uri);
    }

    /**
     * The override that matters: detect the charset from the body when the
     * headers only yield the default.
     */
    @Override
    public String getResponseCharSet() {
        String charset = getContentCharSet(getResponseHeader("Content-Type"));
        // ISO-8859-1 is what we get by default when no charset is declared,
        // so only run the detector in that case.
        if ("ISO-8859-1".equalsIgnoreCase(charset)) {
            try {
                // getResponseBody() buffers the body, so it can still be read afterwards.
                byte[] body = getResponseBody();
                if (body != null) {
                    String[] cs = CharsetDetector.getInstance()
                            .detectAllCharset(new ByteArrayInputStream(body));
                    // Ignore results the JVM cannot decode (for example a no-match placeholder).
                    if (cs != null && cs.length > 0 && Charset.isSupported(cs[0])) {
                        charset = cs[0];
                    }
                }
            } catch (IOException e) {
                log.warn("charset detection failed, keeping " + charset, e);
            }
        }
        log.debug("charset used: " + charset);
        return charset;
    }
}
```
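One consequence of the override above is that the detector runs whenever the resolved charset is ISO-8859-1, including the case where the server explicitly declared it. If that distinction matters, the Content-Type value can be inspected directly. The following is a rough sketch using only the JDK; ContentTypeCharset, declaredCharset and needsDetection are hypothetical names, and the parsing is deliberately simplified rather than a full implementation of the header grammar.

```java
final class ContentTypeCharset {

    /** Returns the charset parameter of a Content-Type value, or null if none is declared. */
    static String declaredCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Detection is only needed when the server declared no charset at all. */
    static boolean needsDetection(String contentType) {
        return declaredCharset(contentType) == null;
    }
}
```

With this check, `text/html` would trigger detection while `text/html; charset=ISO-8859-1` would not. Whether to trust such a declaration is a judgment call, since some servers send a wrong charset.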
