Java获取文本文件字符编码的两种方法

最新推荐文章于 2023-10-21 11:07:58 发布

weixin_33890526

最新推荐文章于 2023-10-21 11:07:58 发布

阅读量1.2k

点赞数

文章标签： java python

原文链接：https://my.oschina.net/nivalsoul/blog/791243

版权

2019独角兽企业重金招聘Python工程师标准>>>

Java判断文本文件字符编码的两种方法：1、通过文件流的前面部分字节判断；2、通过cpdetector库提供的监听方法来判断。

1、取文件流方式

    public static String codeString(String fileName) throws Exception {
        BufferedInputStream bin = new BufferedInputStream(new FileInputStream(fileName));
        int p = (bin.read() << 8) + bin.read();
        bin.close();
        String code = null;
 
        switch (p) {
        case 0xefbb:
            code = "UTF-8";
            break;
        case 0xfffe:
            code = "Unicode";
            break;
        case 0xfeff:
            code = "UTF-16BE";
            break;
        default:
            code = "GBK";
        }
 
        return code;
    }

该方法一般情况是可以正常运行的，但对有些文件却不生效，不能获取正确的编码，故而可采取如下方法。

2、使用cpdetector库

使用Cpdetector jar包检测文件编码需要依赖antlr-2.7.4.jar、chardet-1.0.jar、jargs-1.0.jar三个jar包，可以到官网http://cpdetector.sourceforge.net下载。

详细的使用可以参考官网，简单的代码示例如下：

	/**
	 * <div>
	 * 利用第三方开源包cpdetector获取文件编码格式.<br/>
	 * --1、cpDetector内置了一些常用的探测实现类,这些探测实现类的实例可以通过add方法加进来,
	 *   如:ParsingDetector、 JChardetFacade、ASCIIDetector、UnicodeDetector. <br/>
	 * --2、detector按照“谁最先返回非空的探测结果,就以该结果为准”的原则. <br/>
	 * --3、cpDetector是基于统计学原理的,不保证完全正确.<br/>
	 * </div>
	 * @param filePath
	 * @return 返回文件编码类型：GBK、UTF-8、UTF-16BE、ISO_8859_1
	 * @throws Exception 
	 */
	public static String getFileCharset(String filePath) throws Exception {
		CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
		/*ParsingDetector可用于检查HTML、XML等文件或字符流的编码,
		 * 构造方法中的参数用于指示是否显示探测过程的详细信息，为false不显示。
	    */
		detector.add(new ParsingDetector(false));
		/*JChardetFacade封装了由Mozilla组织提供的JChardet，它可以完成大多数文件的编码测定。
		 * 所以，一般有了这个探测器就可满足大多数项目的要求，如果你还不放心，可以再多加几个探测器，
		 * 比如下面的ASCIIDetector、UnicodeDetector等。
        */
		detector.add(JChardetFacade.getInstance());
		detector.add(ASCIIDetector.getInstance());
		detector.add(UnicodeDetector.getInstance());
		Charset charset = null;
		File file = new File(filePath);
		try {
			//charset = detector.detectCodepage(file.toURI().toURL());
			InputStream is = new BufferedInputStream(new FileInputStream(filePath));
			charset = detector.detectCodepage(is, 8);
		} catch (Exception e) {
			e.printStackTrace();
			throw e;
		}

		String charsetName = "GBK";
		if (charset != null) {
			if (charset.name().equals("US-ASCII")) {
				charsetName = "ISO_8859_1";
			} else if (charset.name().startsWith("UTF")) {
				charsetName = charset.name();// 例如:UTF-8,UTF-16BE.
			}
		}
		return charsetName;
	}

转载于:https://my.oschina.net/nivalsoul/blog/791243