How to Detect a Web Page's Encoding in Java

Godric42

Article link: https://blog.csdn.net/Godric42/article/details/37049131

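Problem

When we write a program to collect material from the internet, such as industry news or user information, we always run into the question of page encoding. If the encoding is parsed incorrectly, or the encoding information is ignored, the text comes out garbled. The familiar "garbled Chinese" problem is one example, and it typically shows up as a run of question marks.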

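Approach

According to the HTTP specification (RFC 2616, section 14.17: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html), the Content-Type header of the server's HTTP response can carry the encoding in its charset parameter: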

Content-Type: text/html; charset=utf-8

The encoding can also be read from the HTML metadata. In HTML 4.01 (the form used by most websites at the time of writing), it is written as follows:

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

In HTML5 it is more concise:

<meta charset="ISO-8859-1">

As long as the HTTP response header returned by the server or the HTML itself declares the encoding, identifying it is straightforward.
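The two sources above can be combined into a single lookup: check the HTTP header first, then fall back to the `<meta>` tag. The following is only a minimal sketch using the JDK alone. The class name `CharsetSniffer`, its method names, the regular expressions, and the 4096-byte scan limit are all illustrative assumptions, not code from any of the projects discussed in this article.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for a declared charset in the header, then in the HTML. */
final class CharsetSniffer {

    // Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches a whole <meta ...> tag so the charset search stays inside meta elements.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if none is declared. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset declared in the first matching meta tag, or null. */
    static String fromMeta(byte[] html) {
        // Assumption: the declaration sits near the top of the document.
        int limit = Math.min(html.length, 4096);
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives intact.
        String head = new String(html, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET.matcher(tag.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    /** Header first, then meta; null means neither source declared an encoding. */
    static String detect(String contentType, byte[] html) {
        String charset = fromContentType(contentType);
        return charset != null ? charset : fromMeta(html);
    }
}
```

Decoding the bytes as ISO-8859-1 is only a trick for searching the markup; the page text itself still has to be decoded again with whatever charset `detect` returns.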

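Reference implementations

Let's look at how some open-source projects implement page encoding detection. Only Java code is covered here.
The code is taken from the following open-source projects:
Webmagic: https://github.com/code4craft/webmagic/
Nutch: https://github.com/apache/nutch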

Code 1: Nutch parses the encoding from the Content-Type header

  /**
   * Parse the character encoding from the specified content type header.
   * If the content type is null, or there is no explicit character encoding,
   * <code>null</code> is returned.
   * <br />
   * This method was copied from org.apache.catalina.util.RequestUtil,
   * which is licensed under the Apache License, Version 2.0 (the "License").
   *
   * @param contentType a content type header
   */
  public static String parseCharacterEncoding(String contentType) {
    if (contentType == null)
      return (null);
    int start = contentType.indexOf("charset=");
    if (start < 0)
      return (null);
    String encoding = contentType.substring(start + 8);
    int end = encoding.indexOf(';');
    if (end >= 0)
      encoding = encoding.substring(0, end);
    encoding = encoding.trim();
    if ((encoding.length() > 2) && (encoding.startsWith("\""))
      && (encoding.endsWith("\"")))
      encoding = encoding.substring(1, encoding.length() - 1);
    return (encoding.trim());
  }
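To see how this method behaves on real header values, the sketch below wraps a copy of it in a small class and calls it with a few inputs. The class name `ContentTypeDemo` and the sample headers are made up for illustration. Note that the search for `charset=` is case-sensitive, so a header written as `Charset=UTF-8` yields null.

```java
/** Hypothetical demo class; parseCharacterEncoding is a copy of the method shown above. */
final class ContentTypeDemo {

    static String parseCharacterEncoding(String contentType) {
        if (contentType == null)
            return null;
        int start = contentType.indexOf("charset=");
        if (start < 0)
            return null;
        String encoding = contentType.substring(start + 8);
        int end = encoding.indexOf(';');
        if (end >= 0)
            encoding = encoding.substring(0, end);
        encoding = encoding.trim();
        // Strip one pair of surrounding double quotes, if present.
        if (encoding.length() > 2 && encoding.startsWith("\"") && encoding.endsWith("\""))
            encoding = encoding.substring(1, encoding.length() - 1);
        return encoding.trim();
    }

    public static void main(String[] args) {
        System.out.println(parseCharacterEncoding("text/html; charset=utf-8"));        // utf-8
        System.out.println(parseCharacterEncoding("text/html; charset=\"GBK\"; x=1")); // GBK
        System.out.println(parseCharacterEncoding("text/html"));                       // null
        System.out.println(parseCharacterEncoding("text/html; Charset=UTF-8"));        // null
    }
}
```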

Code 2: webmagic parses the encoding from the Content-Type header

private static final Pattern patternForCharset = Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)");

public static String getCharset(String contentType) {
    Matcher matcher = patternForCharset.matcher(contentType);
    if (matcher.find()) {
        String charset = matcher.group(1);
        if (Charset.isSupported(charset)) {
            return charset;
        }
    }
    return null;
}
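Two edge cases are worth keeping in mind with this approach: passing a null header to `Pattern.matcher` throws a NullPointerException, and `Charset.isSupported` throws IllegalCharsetNameException when the captured name is empty or contains illegal characters. The following is a hedged sketch of a more defensive variant. The class name `SafeCharsetParser` is hypothetical, and the case-insensitive flag and the `+` quantifier are my own changes, not part of webmagic.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, more defensive variant of the regex-based header parser. */
final class SafeCharsetParser {

    // Same idea as the pattern above, but case-insensitive and requiring a non-empty name.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*['\"]*([^\\s;'\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns a charset name this JVM supports, or null if none can be extracted. */
    static String getCharset(String contentType) {
        if (contentType == null) {
            return null; // a response may carry no Content-Type header at all
        }
        Matcher matcher = CHARSET.matcher(contentType);
        if (!matcher.find()) {
            return null;
        }
        String name = matcher.group(1);
        try {
            return Charset.isSupported(name) ? name : null;
        } catch (IllegalCharsetNameException e) {
            return null; // the header declared a syntactically illegal charset name
        }
    }
}
```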

Code 3: webmagic parses the page encoding from the HTML metadata

protected String getHtmlCharset(HttpResponse httpResponse, byte[] contentBytes) throws IOException {
    String charset;
    // charset
    // 1. encoding in http header Content-Type
    String value = httpResponse.getEntity().getContentType().getValue();
    charset = UrlUtils.getCharset(value);
    if (StringUtils.isNotBlank(charset)) {
        logger.debug("Auto get charset: {}", charset);
        return charset;
    }
    // use default charset to decode first time
    Charset defaultCharset = Charset.defaultCharset();
    String content = new String(contentBytes, defaultCharset.name());
    // 2. charset in meta
    if (StringUtils.isNotEmpty(content)) {
        Document document = Jsoup.parse(content);
        Elements links = document.select("meta");
        for (Element link : links) {
            // 2.1. html4.01 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            String metaContent = link.attr("content");
            String metaCharset = link.attr("charset");
            if (metaContent.indexOf("charset") != -1) {
                metaContent = metaContent.substring(metaContent.indexOf("charset"), metaContent.length());
                charset = metaContent.split("=")[1];
                break;
            }
            // 2.2. html5 <meta charset="UTF-8" />
            else if (StringUtils.isNotEmpty(metaCharset)) {
                charset = metaCharset;
                break;
            }
        }
    }
    logger.debug("Auto get charset: {}", charset);
    // 3. todo use tools as cpdetector for content decode
    return charset;
}
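Note that `getHtmlCharset` can return null when neither source declares an encoding, and that the name taken from the meta tag is not checked against the charsets the JVM supports. The caller therefore still needs a safe way to turn the result into text. The following is a minimal sketch of that last step under those assumptions. The class name `PageDecoder` and the idea of passing an explicit fallback are illustrative, not part of webmagic.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper that turns raw page bytes into text once a charset name is known. */
final class PageDecoder {

    /** Decodes with the named charset, or with the fallback if the name is missing or unusable. */
    static String decode(byte[] contentBytes, String charsetName, Charset fallback) {
        Charset charset = fallback;
        if (charsetName != null && !charsetName.trim().isEmpty()) {
            try {
                charset = Charset.forName(charsetName.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Keep the fallback: the page declared a name this JVM cannot use.
            }
        }
        return new String(contentBytes, charset);
    }
}
```

A call such as `PageDecoder.decode(contentBytes, charset, StandardCharsets.UTF_8)` would then decode the page with the detected charset and quietly use UTF-8 when detection fails; which fallback is appropriate depends on the sites being crawled.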

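Why is there no code here showing how Nutch extracts the encoding from the HTML? Because Nutch does not implement that feature itself. Instead, it implements an encoding-guessing mechanism, built mainly on IBM's ICU4J. For details, see the following class:
https://github.com/apache/nutch/blob/adbccc4827fee28782522a3451f7e643ab449d00/src/java/org/apache/nutch/util/EncodingDetector.java
and the ICU4J documentation.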

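Why does Nutch take this approach? Because even though a specification exists, pages that violate it are common: the encoding in the HTTP header may disagree with the one declared in the HTML, or the encoding information may be missing altogether. In those cases the page encoding has to be guessed, which makes Nutch's implementation the more general one.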
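To give a feel for what "guessing" can mean, here is a deliberately crude sketch that needs only the JDK: it checks whether the bytes are well-formed UTF-8 and otherwise returns a caller-supplied fallback. This is not how ICU4J or Nutch work; it is a much simpler stand-in, and the class name `EncodingGuesser` and the fallback parameter are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical, very simple encoding guesser based on strict UTF-8 validation. */
final class EncodingGuesser {

    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of substituting
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses UTF-8 when the bytes are valid UTF-8, otherwise returns the fallback. */
    static Charset guess(byte[] bytes, Charset fallback) {
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : fallback;
    }
}
```

The check has clear limits: pure ASCII content always passes as UTF-8, a download cut off in the middle of a multi-byte character is reported as invalid, and the fallback (for example GBK for Chinese pages) is still a guess. A statistical detector such as the one in ICU4J is the better choice when these cases matter.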