java检测文本(字节流)的编码方式

需求:

某文件或者某字节流要检测他的编码格式。

实现:

基于jchardet

<dependency>
	<groupId>net.sourceforge.jchardet</groupId>
	<artifactId>jchardet</artifactId>
	<version>1.0</version>
</dependency>

 

代码如下:

public class DetectorUtils {
	private DetectorUtils() {
	}

	static class ChineseCharsetDetectionObserver implements
			nsICharsetDetectionObserver {
		private boolean found = false;
		private String result;

		public void Notify(String charset) {
			found = true;
			result = charset;
		}

		public ChineseCharsetDetectionObserver(boolean found, String result) {
			super();
			this.found = found;
			this.result = result;
		}

		public boolean isFound() {
			return found;
		}

		public String getResult() {
			return result;
		}

	}

	public static String[] detectChineseCharset(InputStream in)
			throws Exception {
		String[] prob=null;
		BufferedInputStream imp = null;
		try {
			boolean found = false;
			String result = Charsets.UTF_8.toString();
			int lang = nsPSMDetector.CHINESE;
			nsDetector det = new nsDetector(lang);
			ChineseCharsetDetectionObserver detectionObserver = new ChineseCharsetDetectionObserver(
					found, result);
			det.Init(detectionObserver);
			imp = new BufferedInputStream(in);
			byte[] buf = new byte[1024];
			int len;
			boolean isAscii = true;
			while ((len = imp.read(buf, 0, buf.length)) != -1) {
				if (isAscii)
					isAscii = det.isAscii(buf, len);
				if (!isAscii) {
					if (det.DoIt(buf, len, false))
						break;
				}
			}

			det.DataEnd();
			boolean isFound = detectionObserver.isFound();
			if (isAscii) {
				isFound = true;
				prob = new String[] { "ASCII" };
			} else if (isFound) {
				prob = new String[] { detectionObserver.getResult() };
			} else {
				prob = det.getProbableCharsets();
			}
			return prob;
		} finally {
			IOUtils.closeQuietly(imp);
			IOUtils.closeQuietly(in);
		}
	}
}

测试:

		String file = "C:/3737001.xml";
		String[] probableSet = DetectorUtils.detectChineseCharset(new FileInputStream(file));
		for (String charset : probableSet) {
			System.out.println(charset);
		}

 依赖的jar参见附件

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值