Checking whether a byte array is UTF-8 encoded in Java

gold_zwj

Article link: https://blog.csdn.net/zwjyyy1203/article/details/82187239

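1. Use the juniversalchardet toolkit (published under the com.googlecode group ID; it is a Java port of Mozilla's universalchardet encoding detector) and add the Maven dependency: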

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

2. Define a shared utility method:

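// Requires: import org.mozilla.universalchardet.UniversalDetector;
// Returns the detected charset name (for example "UTF-8"),
// or null if the detector could not reach a decision.
public static String guessEncoding(byte[] bytes) {
    UniversalDetector detector = new UniversalDetector(null);
    // Feed the whole array to the detector in one call
    detector.handleData(bytes, 0, bytes.length);
    // Signal end of input so the detector can finalize its guess
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding;
}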
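
The detector above makes a statistical guess. When the only question is the one in the title (is this byte array UTF-8 or not), a strict decode with the JDK alone can serve as a cross-check. The following is a minimal sketch of that different technique, not part of juniversalchardet; the class name `Utf8Check` and method name `isValidUtf8` are hypothetical. Note that it only tests well-formedness: pure ASCII passes, and short inputs in another encoding can occasionally happen to be valid UTF-8 as well.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 well-formedness check using only the JDK.
class Utf8Check {

    // Returns true only if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of silently
                // substituting the replacement character U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two CJK characters encoded as UTF-8
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // "cafe" with an accented e in ISO-8859-1: the last byte is 0xE9,
        // which starts a multi-byte UTF-8 sequence that never completes
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

One reasonable way to combine the two is to accept "UTF-8" from `guessEncoding` only when this strict check also passes, and to treat a disagreement as "unknown".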

import java.io.IOException;
import java.nio.charset.Charset;

import org.apache.commons.lang3.StringUtils; // assumed: Apache Commons Lang 3
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class CharsetUtils {

    private static final Logger logger = LoggerFactory.getLogger(CharsetUtils.class);

    public static String detectCharset(String contentType, byte[] contentBytes) throws IOException {
        String charset = null;
        // 1. Charset declared in the HTTP Content-Type header.
        //    UrlUtils.getCharset is the helper listed after this class.
        if (StringUtils.isNotBlank(contentType)) {
            charset = UrlUtils.getCharset(contentType);
            if (StringUtils.isNotBlank(charset)) {
                logger.debug("Auto get charset: {}", charset);
                return charset;
            }
        }
        // Decode once with the platform default charset so the markup can be parsed
        Charset defaultCharset = Charset.defaultCharset();
        String content = new String(contentBytes, defaultCharset);
        // 2. Charset declared in a <meta> tag
        if (StringUtils.isNotEmpty(content)) {
            Document document = Jsoup.parse(content);
            Elements metas = document.select("meta");
            for (Element meta : metas) {
                String metaContent = meta.attr("content");
                String metaCharset = meta.attr("charset");
                // 2.1 HTML 4.01: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
                int index = metaContent.indexOf("charset");
                if (index != -1) {
                    String[] parts = metaContent.substring(index).split("=");
                    // Guard against "charset" appearing without a value
                    if (parts.length > 1) {
                        charset = parts[1].trim();
                        break;
                    }
                }
                // 2.2 HTML5: <meta charset="UTF-8" />
                else if (StringUtils.isNotEmpty(metaCharset)) {
                    charset = metaCharset;
                    break;
                }
            }
        }
        if (StringUtils.isNotBlank(charset)) {
            logger.debug("Auto get charset: {}", charset);
            return charset;
        }
        // 3. Nothing declared: fall back to byte-level detection with
        //    guessEncoding (the method from step 2, placed in this class).
        //    The result may still be null.
        charset = guessEncoding(contentBytes);
        return charset;
    }

}
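
Because the detection chain can end with no answer (the detector may return null) or with a name the JVM does not support, the caller needs a defined fallback when turning the bytes into a String. A minimal sketch of that last step, assuming UTF-8 is an acceptable default; the class name `DecodeWithFallback` and method name `decode` are hypothetical and not part of the original code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode bytes with a detected charset name,
// falling back to UTF-8 (an assumption) when the name is unusable.
class DecodeWithFallback {

    static String decode(byte[] bytes, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName.trim());
            } catch (IllegalArgumentException e) {
                // Covers both IllegalCharsetNameException and
                // UnsupportedCharsetException; keep the UTF-8 default.
            }
        }
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] body = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        // A null name stands in for "detection found nothing"
        System.out.println(decode(body, null).length());
        System.out.println(decode(body, "utf-8").length());
    }
}
```

In the class above, the `charsetName` argument would be the return value of `detectCharset`.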

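// Requires: java.nio.charset.Charset, java.nio.charset.IllegalCharsetNameException,
//           java.util.regex.Matcher, java.util.regex.Pattern
private static final Pattern patternForCharset = Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)", Pattern.CASE_INSENSITIVE);

// Extracts the charset from a Content-Type value; returns null if absent or unsupported.
public static String getCharset(String contentType) {
    if (contentType == null) {
        return null;
    }
    Matcher matcher = patternForCharset.matcher(contentType);
    if (matcher.find()) {
        String charset = matcher.group(1);
        try {
            // Charset.isSupported throws on empty or malformed names
            if (!charset.isEmpty() && Charset.isSupported(charset)) {
                return charset;
            }
        } catch (IllegalCharsetNameException e) {
            // Malformed name in the header: treat it as not found
        }
    }
    return null;
}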
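
To see what the pattern above captures, here is a small standalone sketch that runs it against a few sample header values. The class name `ContentTypeCharsetDemo`, the method name `extract`, and the sample headers are made up for illustration; unlike `getCharset`, this demo returns the raw captured text and deliberately skips the `Charset.isSupported` check.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the Content-Type charset pattern.
class ContentTypeCharsetDemo {

    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*['\"]*([^\\s;'\"]*)", Pattern.CASE_INSENSITIVE);

    // Returns the raw charset token from the header value, or null if none is present.
    static String extract(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher matcher = CHARSET_PATTERN.matcher(contentType);
        if (matcher.find() && !matcher.group(1).isEmpty()) {
            return matcher.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "text/html; charset=UTF-8",      // plain form
            "text/html; Charset=\"gbk\"",    // quoted value, mixed-case key
            "text/html"                      // no charset parameter
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + extract(sample));
        }
    }
}
```

The match is case-insensitive and tolerates optional quotes and whitespace around the equals sign, which covers the common variations seen in real headers.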
