Detecting File Encoding in Java (with ZIP Support)

Author: 话说一物降一物

Category: java. Tags: java, file encoding, cpdetector

Article link: https://blog.csdn.net/u014052432/article/details/79243496

Preface:

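At work I recently ran into the following problem: a file-upload feature had to import the contents of the files packed inside a zip archive. The archive is read with the ZipFile and ZipEntry classes from Apache's ant.jar, and these classes cannot tell which encoding each file inside the zip uses, so Chinese text came out garbled during parsing. After some research I solved this with the third-party tool cpDetector, and this post records the approach.

For more sophisticated file-encoding detection you can use the open-source project cpdetector, available at http://cpdetector.sourceforge.net. The library is small, only about 500 KB. cpDetector is based on statistical principles, so its result is not guaranteed to be correct. The code that uses this library to detect the encoding of a text file is given further down.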

Prerequisites

- Required jars: cpdetector_1.0.10.jar, antlr-2.7.4.jar, chardet-1.0.jar, jargs-1.0.jar
- Source package: cpdetector_1.0.10_binary.zip
- Related material: https://www.cnblogs.com/king1302217/p/4003060.html

Implementation

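A problem I hit along the way: nearly every example I found online works directly on a File object, and none of them handles zip archives. On top of that, the ZipEntry objects inside an Apache ZipFile cannot be addressed by URL, so the URL-based call does not apply to them. Reading the underlying cpdetector code showed that the following overload can be used instead:
    **charset = detector.detectCodepage(bis, Integer.MAX_VALUE); // the key line for zip entries**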
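
The per-entry flow described above can be sketched roughly as follows. This is only a sketch under some assumptions: it uses the JDK's `java.util.zip.ZipFile` instead of the Ant `ZipFile` from this article so that it runs without extra jars, and the `detector` parameter is a hypothetical stand-in for a method such as `FileCharsetDetector.getFileEncode(InputStream)`. The class name `ZipEntryEncodings` and the method `detectAll` are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: maps every file entry of a zip to a detected charset name.
class ZipEntryEncodings {

    /**
     * Walks all non-directory entries and hands each one to the detector
     * as a mark-capable stream. The detector is an assumption of this sketch;
     * it may close the stream itself (closing it twice is harmless).
     */
    static Map<String, String> detectAll(Path zipPath, Function<InputStream, String> detector) {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directories carry no content to inspect
                }
                // Wrap the entry stream so that mark()/reset() are available.
                try (InputStream in = new BufferedInputStream(zip.getInputStream(entry))) {
                    result.put(entry.getName(), detector.apply(in));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

With cpdetector on the classpath, a call could look like `ZipEntryEncodings.detectAll(path, FileCharsetDetector::getFileEncode)`. Because detection consumes the stream, the entry has to be opened again with `getInputStream` before its content is actually read with the detected charset.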

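Note:
    The InputStream returned directly by zipFile.getInputStream(zipEntry) does not support mark(), but cpdetector relies on that method internally. Looking further, I found that the underlying code already handles a similar case: a stream that does not support mark() only needs to be wrapped in a BufferedInputStream, which does support it.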
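
The mark()/reset() behavior mentioned in the note can be checked with the JDK alone. The sketch below uses a hypothetical `NoMarkStream` class as a stand-in for a zip entry stream without mark() support, and shows that after wrapping it in a BufferedInputStream the same leading bytes can be read more than once. All names in it (`MarkSupportDemo`, `ensureMarkSupported`, `peek`) are illustrative and not part of cpdetector.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class MarkSupportDemo {

    /** Hypothetical stand-in for a stream that does not support mark(). */
    static final class NoMarkStream extends FilterInputStream {
        NoMarkStream(byte[] data) {
            super(new ByteArrayInputStream(data));
        }
        @Override public boolean markSupported() { return false; }
        @Override public synchronized void mark(int readLimit) { /* intentionally a no-op */ }
        @Override public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /** Wraps the stream in a BufferedInputStream only if it cannot rewind already. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n leading bytes, then rewinds so the next reader starts over. */
    static byte[] peek(InputStream in, int n) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark()");
        }
        try {
            in.mark(n);
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int read = in.read(buf, total, n - total);
                if (read < 0) {
                    break; // end of stream reached before n bytes
                }
                total += read;
            }
            in.reset();
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling `peek` twice on the wrapped stream returns the same bytes both times, which is the property a detector needs when several detection strategies inspect the same input one after another.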


The complete code is as follows:


import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.ParsingDetector;
import info.monitorenter.cpdetector.io.UnicodeDetector;
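/**
 * 1. cpDetector ships with several common detector implementations, whose instances are
 *    registered through the add method:
 *    ParsingDetector, JChardetFacade, ASCIIDetector, UnicodeDetector.
 * 2. The detector follows a "first detector to return a non-null result wins" rule.
 * 3. cpDetector is based on statistical principles and is not guaranteed to be correct.
 */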
public class FileCharsetDetector {
    private static final Logger logger = LoggerFactory.getLogger(FileCharsetDetector.class);

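    /**
     * Detects the encoding of a stream with the third-party open-source library cpdetector.
     * The stream is closed after detection.
     *
     * @param is
     *            the input stream to examine
     * @return the detected charset name ("ISO_8859_1" when US-ASCII is detected),
     *         or "GBK" when nothing is detected
     */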
    public static String getFileEncode(InputStream is) {
        // begin: the key part for handling files inside a zip
        BufferedInputStream bis = null;
        if (is instanceof BufferedInputStream) {
            bis = (BufferedInputStream) is;
        } else {
            bis = new BufferedInputStream(is);
        }
        //   end

        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

        detector.add(new ParsingDetector(false));
        detector.add(UnicodeDetector.getInstance());
        detector.add(JChardetFacade.getInstance()); // uses classes from chardet.jar internally
        detector.add(ASCIIDetector.getInstance());

        Charset charset = null;
        try {
            charset = detector.detectCodepage(bis, Integer.MAX_VALUE); // the key line for zip entries
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        } finally {
            if (bis != null) {
                try {
                    bis.close();
                } catch (IOException e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }

        // Default to GBK
        String charsetName = "GBK";
        if (charset != null) {
            if (charset.name().equals("US-ASCII")) {
                charsetName = "ISO_8859_1";
            } else {
                charsetName = charset.name();
            }
        }
        return charsetName;
    }


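    /**
     * Detects the encoding of a file on disk with cpdetector.
     *
     * @param file
     *            the file to examine
     * @return the detected charset name ("ISO_8859_1" when US-ASCII is detected),
     *         or "GBK" when nothing is detected
     */
    public static String getFileEncode(File file) {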
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

        detector.add(new ParsingDetector(false));
        detector.add(UnicodeDetector.getInstance());
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());

        Charset charset = null;
        try {
            charset = detector.detectCodepage(file.toURI().toURL());
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        } 

        // Default to GBK
        String charsetName = "GBK";
        if (charset != null) {
            if (charset.name().equals("US-ASCII")) {
                charsetName = "ISO_8859_1";
            } else {
                charsetName = charset.name();
            }
        }
        return charsetName;
    }

}
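
The "first non-null result wins" rule from the class comment can be illustrated without cpdetector. What follows is a sketch of the principle only, not the library's implementation: `DetectorChain`, `bomDetector` and `asciiDetector` are hypothetical names, and the two detectors are deliberately simplistic (a UTF-8 byte order mark check and a "no byte above 0x7F" check).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of a detector chain; not cpdetector code.
class DetectorChain {

    /** Returns UTF-8 if the data starts with the UTF-8 byte order mark, else null. */
    static Charset bomDetector(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? StandardCharsets.UTF_8 : null;
    }

    /** Returns US-ASCII if every byte is in the 7-bit range, else null. */
    static Charset asciiDetector(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return null; // high bit set, so this is not plain ASCII
            }
        }
        return StandardCharsets.US_ASCII;
    }

    /**
     * Asks each detector in order and returns the first non-null answer.
     * If no detector recognizes the data, the caller-supplied fallback is used,
     * much like the GBK default in the article's code.
     */
    static Charset detect(byte[] data,
                          List<Function<byte[], Charset>> detectors,
                          Charset fallback) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset found = detector.apply(data);
            if (found != null) {
                return found;
            }
        }
        return fallback;
    }
}
```

The order of the list matters under this rule: a detector placed earlier shadows every later one whenever it returns a result, which is one reason to put the more specific detectors first and the more permissive ones last.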