Detecting a Text File's Encoding with cpdetector

Latest recommended article published 2018-07-29 15:10:14

fanday

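A plain text file carries no explicit record of its own encoding (at most an optional byte-order mark), yet sometimes we have to find out which encoding a file uses. What then?

1. Reinvent the wheel: inspect the bytes against the rules of each encoding and work out which one the file uses.

This is hard to get right and demands solid knowledge of character encodings.

2. Use an existing tool.

I found this one online: cpdetector. Judging from its own description, it seems to have been written originally for crawling HTML pages whose encoding is unknown; some of its methods take a URL directly and determine the encoding of the resource behind it.

Here is a simple example: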

package encoding;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.ByteOrderMarkDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;

public class CPDetectorTest {
	public static void main(String[] args) {
		System.out.println(getEncoding(new File("c:/test.txt")));
	}

	/** Returns the name of the detected charset, or null if the file could not be read. */
	public static String getEncoding(File document) {
		CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

		// Register the built-in detectors that the proxy delegates to.
		detector.add(new ByteOrderMarkDetector());
		detector.add(ASCIIDetector.getInstance());
		detector.add(UnicodeDetector.getInstance());

		try {
			// File.toURL() is deprecated and does not escape special characters;
			// go through toURI() instead.
			Charset charset = detector.detectCodepage(document.toURI().toURL());
			// Guard against a null result before calling toString().
			return charset == null ? null : charset.toString();
		} catch (IOException e) {
			// MalformedURLException is a subclass of IOException, so one catch covers both.
			e.printStackTrace();
			return null;
		}
	}
}

Note these three lines:

detector.add(new ByteOrderMarkDetector()); 
detector.add(ASCIIDetector.getInstance());
detector.add(UnicodeDetector.getInstance());

They register cpdetector's built-in detectors; the class names indicate which character sets each one can recognize.
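To make concrete what a byte-order-mark check involves, here is a minimal sketch using only the JDK. It is not cpdetector's implementation; the class name `BomSniffer` and method `detectBom` are hypothetical, and the sketch covers only the UTF-8 and UTF-16 marks (UTF-32 marks are deliberately left out).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of cpdetector.
class BomSniffer {
    /** Returns the charset implied by a leading byte-order mark, or null if none is found. */
    static Charset detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return null; // no recognized mark: the file may still be in any encoding
    }

    // Compares the first bytes of data with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A null result only means that no mark was present, which is the common case for files saved without a BOM, so a check like this can be no more than a first step.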

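With only these detectors registered, the CPDetectorTest code cannot detect encodings such as gb2312. I did not look closely to see whether cpdetector ships a detector for gb2312 and similar encodings.

If the encoding cannot be detected, the result is "void", the name of the placeholder charset that cpdetector returns for an unknown encoding.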
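When detection comes back empty, one possible fallback is to try a short list of candidate encodings and keep the first one that decodes the bytes without error. The sketch below uses only the JDK and is a heuristic, not a cpdetector feature; the class name `StrictDecodeGuess`, the method `guess`, and the candidate list are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of cpdetector.
class StrictDecodeGuess {
    /** Returns the first candidate that decodes the bytes without error, or null if none does. */
    static Charset guess(byte[] data, String... candidateNames) {
        for (String name : candidateNames) {
            // Some charsets (GBK, for example) may be missing on a trimmed-down runtime.
            if (!Charset.isSupported(name)) continue;
            Charset candidate = Charset.forName(name);
            try {
                // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return candidate;
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next candidate.
            }
        }
        return null;
    }
}
```

A call might look like `StrictDecodeGuess.guess(bytes, "UTF-8", "GBK")`. Order matters: stricter encodings such as UTF-8 should come first, because permissive ones accept almost any byte sequence, and a clean decode shows only that the bytes are valid in that charset, not that the author actually used it.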
