A Simple Way to Automatically Detect a File's Encoding

zhjw1006

Category: Java. Tags: byte, file, exception, jar, testing, null

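Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.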

Article link: https://blog.csdn.net/zhjw1006/article/details/7996525

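A while ago, while working on an article, I needed to read files. Because the files used different encodings, I kept having to change the encoding that the program passed to the reader: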

    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), encoding));
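As a minimal sketch of that call in runnable form: the class name `ReadWithCharset` and the method `readAll` are made up for illustration, and the charset name is assumed to be known already. It mainly shows where the encoding enters the reader chain and how try-with-resources closes the stream.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original article.
final class ReadWithCharset {
    /** Reads the whole file as text, decoding its bytes with the given charset name. */
    static String readAll(File file, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), Charset.forName(charsetName)))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

If the charset name is wrong, the call usually still succeeds but returns garbled text, which is why guessing the encoding first is useful.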

I looked up some material online and am summarizing it here so that it is easy to find when I need it again. My notes are as follows:

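UTF-16 little endian (often labeled "Unicode"): the first two bytes are FF FE;

UTF-16 big endian ("Unicode big endian"): the first two bytes are FE FF;

UTF-8: the first three bytes are EF BB BF (only when the file was saved with a BOM).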
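The byte-order marks listed above can be checked with a few comparisons. The following is only a sketch: `BomSniffer` and `detect` are illustrative names, and it assumes the caller has already read the first few bytes of the file into an array. It returns `null` when no BOM is found, which is common for UTF-8 files written without one.

```java
// Hypothetical helper, not part of the original article.
final class BomSniffer {
    /** Returns the charset name implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF; // mask to get the unsigned value
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        }
        return null; // no BOM: the encoding has to be guessed some other way
    }
}
```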

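Method 1. Since the encodings I mostly deal with are UTF-8 and GBK, a check like the following is often enough (note that it only recognizes UTF-8 files that start with a BOM):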

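    File file = new File(fileName);
    byte[] b = new byte[3];
    int n;
    // try-with-resources closes the stream even if read() throws
    try (InputStream ios = new FileInputStream(file)) {
        n = ios.read(b);
    }
    String encode;
    // File header: -17, -69, -65 are 0xEF, 0xBB, 0xBF as signed bytes (the UTF-8 BOM)
    if (n == 3 && b[0] == -17 && b[1] == -69 && b[2] == -65) {
        encode = "UTF-8";
        System.out.println(file.getName() + ": encoding is UTF-8");
    } else {
        encode = "GBK";
        System.out.println(file.getName() + ": may be GBK, or may be some other encoding.");
    }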
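The negative numbers in the check come from Java's `byte` being signed. As a small illustration (the class name `SignedByteDemo` and the method `unsigned` are made up here), masking with `0xFF` recovers the familiar hex values of the UTF-8 BOM:

```java
// Hypothetical demo, not part of the original article.
final class SignedByteDemo {
    /** Converts a signed Java byte to its unsigned value in the range 0..255. */
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        for (byte b : bom) {
            // prints: -17 -> 0xEF, then -69 -> 0xBB, then -65 -> 0xBF
            System.out.println(b + " -> 0x" + Integer.toHexString(unsigned(b)).toUpperCase());
        }
    }
}
```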

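Method 2. See http://www.cppblog.com/biao/archive/2009/11/04/100130.aspx

This one is more thorough. Basically, anything Method 1 can identify, this method can identify as well. When there is no BOM, it scans the bytes for sequences that look like UTF-8 and otherwise falls back to GBK.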

    // Needs java.io.BufferedInputStream, java.io.File and java.io.FileInputStream.
    public static String get_charset(File file) {
        String charset = "GBK"; // default when nothing else matches
        byte[] first3Bytes = new byte[3];
        try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
            boolean checked = false;
            bis.mark(3); // read limit must cover the 3 bytes read before reset()
            int read = bis.read(first3Bytes, 0, 3);
            if (read == -1)
                return charset; // empty file
            if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                charset = "UTF-16LE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                charset = "UTF-16BE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                    && first3Bytes[2] == (byte) 0xBF) {
                charset = "UTF-8";
                checked = true;
            }
            bis.reset();
            if (!checked) {
                // No BOM: scan for a byte pattern that looks like multi-byte UTF-8
                while ((read = bis.read()) != -1) {
                    if (read >= 0xF0)
                        break;
                    if (0x80 <= read && read <= 0xBF) // a lone byte in 0x80-0xBF is treated as GBK
                        break;
                    if (0xC0 <= read && read <= 0xDF) {
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF)
                            // two-byte sequence (0xC0-0xDF)(0x80-0xBF); may also occur in GB encodings
                            continue;
                        else
                            break;
                    } else if (0xE0 <= read && read <= 0xEF) { // can still be wrong, but less likely
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            read = bis.read();
                            if (0x80 <= read && read <= 0xBF) {
                                charset = "UTF-8";
                                break;
                            } else
                                break;
                        } else
                            break;
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return charset;
    }
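A different way to approach the no-BOM case, shown here only as a sketch and not as the method from the linked article, is to let the JDK's own decoder validate the bytes strictly. The names `Utf8Check`, `isValidUtf8`, and `guess` are made up for illustration, and the sketch assumes the file content (or a reasonably large prefix cut at a character boundary) is already in memory. Some GBK byte sequences happen to be valid UTF-8 as well, so this remains a heuristic.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original article.
final class Utf8Check {
    /** Returns true if the bytes are well-formed UTF-8 (pure ASCII also passes). */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Guesses "UTF-8" when the bytes decode cleanly, otherwise returns the fallback name. */
    static String guess(byte[] data, String fallback) {
        return isValidUtf8(data) ? "UTF-8" : fallback;
    }
}
```

For example, `Utf8Check.guess(bytes, "GBK")` mirrors the default used in Method 2, with the whole input checked rather than only the first multi-byte sequence.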

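Method 3. See http://blog.sina.com.cn/s/blog_904e7b150100zvcv.html

This method needs the cpdetector jar. I used cpdetector_1.0.10.jar, which also requires importing the antlr and chardet jars.

In most cases the library identifies the encoding fairly accurately, but in my tests it failed to recognize GBK-encoded files (my testing may not have been thorough enough).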

    public static String getFileEncode(File file) {
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        // Several encoding detectors can be registered below
        detector.add(new ParsingDetector(false));
        detector.add(JChardetFacade.getInstance());
        detector.add(ASCIIDetector.getInstance());
        detector.add(UnicodeDetector.getInstance());
        Charset charSet = null;
        try {
            charSet = detector.detectCodepage(file.toURI().toURL());
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        // null means detection failed or an exception was thrown
        if (charSet != null) {
            return charSet.name();
        } else {
            return null;
        }
    }
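Since `getFileEncode` can return `null`, the caller needs a fallback before opening the file. The sketch below uses only the JDK; `DetectedReader`, `resolve`, and `open` are illustrative names, and the choice of fallback charset is left to the caller as an assumption about the data.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original article.
final class DetectedReader {
    /** Resolves a detected charset name, using the fallback when it is null or unsupported. */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.forName(detectedName);
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names
            return fallback;
        }
    }

    /** Opens a reader on the file with the resolved charset; the caller closes it. */
    static BufferedReader open(File file, String detectedName, Charset fallback) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), resolve(detectedName, fallback)));
    }
}
```

A call such as `DetectedReader.open(file, getFileEncode(file), Charset.forName("GBK"))` would then tie the detection result back to the reader from the beginning of the article.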
