Simple Automatic Detection of a File's Encoding


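A while back, one of my articles involved reading files. Because the files came in different encodings, I kept having to adjust the charset the program used to read them.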

BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(new File(fileName)), charsetName));

         I found some material online and am summarizing it here so that it is easy to look up later. My notes:

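        Unicode (UTF-16 little endian): the first two bytes are FF FE

        Unicode big endian (UTF-16 big endian): the first two bytes are FE FF

        UTF-8: the first three bytes are EF BB BF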
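
The byte-order marks listed above can be checked directly against the first bytes of a file. Below is a minimal sketch of such a lookup; the class name `BomSniffer` and the method `fromBom` are hypothetical names chosen for illustration, not something from the material I collected. It returns null when no BOM is present, which is common for UTF-8 files and normal for GBK files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the original post): maps a leading BOM to a charset.
final class BomSniffer {
    private BomSniffer() {}

    /** Returns the charset indicated by a BOM at the start of head, or null if there is none. */
    static Charset fromBom(byte[] head) {
        // Check the 3-byte UTF-8 mark first, then the two 2-byte UTF-16 marks.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        // Note: FF FE is also the start of the UTF-32LE mark (FF FE 00 00); that case is ignored here.
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        return null;
    }

    /** Compares the unsigned values of the leading bytes of data with prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

Masking with `& 0xFF` avoids the signed-byte values (-17, -69, -65) that show up when Java bytes are compared as plain ints.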

        Method 1. Since the encodings I mostly deal with are UTF-8 and GBK, the following check is often all that is needed:

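         File file = new File(fileName);
         InputStream ios = null;
         byte[] b = new byte[3];
         // Read the first three bytes of the file
         ios = new FileInputStream(file);
         ios.read(b);
         ios.close();
         String encode;
         // -17, -69, -65 are the signed-byte values of EF BB BF, the UTF-8 file header (BOM)
         if (b[0] == -17 && b[1] == -69 && b[2] == -65) {
             encode = "UTF-8";
             System.out.println(file.getName() + ": encoding is UTF-8");
         } else {
             encode = "GBK";
             System.out.println(file.getName() + ": may be GBK, or may be some other encoding.");
         }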
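
Once an encoding has been chosen, the file still has to be decoded with it. One detail worth knowing: Java's UTF-8 decoder does not drop the BOM, so a file that starts with EF BB BF decodes to text whose first character is U+FEFF. The sketch below strips that character after decoding; `DecodedText` and `decode` are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper (not from the original post): decode bytes and drop a leading BOM.
final class DecodedText {
    private DecodedText() {}

    /** Decodes data with the given charset and removes a leading U+FEFF if one is present. */
    static String decode(byte[] data, Charset charset) {
        String text = new String(data, charset);
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }
}
```

With the encoding name produced by a check like Method 1, a call could look like `DecodedText.decode(Files.readAllBytes(file.toPath()), Charset.forName(encode))`. This reads the whole file into memory, so it suits small files only.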

Method 2. See http://www.cppblog.com/biao/archive/2009/11/04/100130.aspx

         This one is more detailed. Basically, whatever Method 1 can identify, this method can identify too.

         import java.io.BufferedInputStream;
         import java.io.File;
         import java.io.FileInputStream;

         public static String get_charset(File file) {
             String charset = "GBK"; // default when nothing else matches
             byte[] first3Bytes = new byte[3];
             // try-with-resources closes the stream on every path, including the early return
             try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
                 boolean checked = false;
                 bis.mark(3); // allow reset() after reading the three header bytes
                 int read = bis.read(first3Bytes, 0, 3);
                 if (read == -1)
                     return charset;
                 if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                     charset = "UTF-16LE";
                     checked = true;
                 } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                     charset = "UTF-16BE";
                     checked = true;
                 } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB && first3Bytes[2] == (byte) 0xBF) {
                     charset = "UTF-8";
                     checked = true;
                 }
                 bis.reset();
                 if (!checked) {
                     // No BOM: scan the bytes and look for a UTF-8 three-byte sequence
                     int loc = 0;
                     while ((read = bis.read()) != -1) {
                         loc++;
                         if (read >= 0xF0)
                             break;
                         if (0x80 <= read && read <= 0xBF) // a lone byte of 0xBF or below also counts as GBK
                             break;
                         if (0xC0 <= read && read <= 0xDF) {
                             read = bis.read();
                             if (0x80 <= read && read <= 0xBF) // two bytes (0xC0-0xDF)(0x80-0xBF) may also be GB-encoded
                                 continue;
                             else
                                 break;
                         } else if (0xE0 <= read && read <= 0xEF) { // can still be wrong, but the odds are small
                             read = bis.read();
                             if (0x80 <= read && read <= 0xBF) {
                                 read = bis.read();
                                 if (0x80 <= read && read <= 0xBF) {
                                     charset = "UTF-8";
                                     break;
                                 } else
                                     break;
                             } else
                                 break;
                         }
                     }
                     // System.out.println(loc + " " + Integer.toHexString(read));
                 }
             } catch (Exception e) {
                 e.printStackTrace();
             }
             return charset;
         }
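
Method 2 stops at the first three-byte sequence that looks like UTF-8, which is why it can occasionally misjudge. A different technique, not the one from the linked article, is to let the JDK's own decoder validate the whole byte array: a `CharsetDecoder` configured with `CodingErrorAction.REPORT` throws on the first malformed sequence. The sketch below assumes the content fits in memory and that the only alternative to UTF-8 is GBK; `Utf8Check`, `isValidUtf8` and `guess` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the original post): strict UTF-8 validation.
final class Utf8Check {
    private Utf8Check() {}

    /** Returns true if data decodes as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Assumption for this sketch: anything that is not valid UTF-8 is treated as GBK. */
    static String guess(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

Pure ASCII content is valid in both encodings, so this sketch reports it as UTF-8. For ASCII-only files either answer decodes to the same text.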

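Method 3. See http://blog.sina.com.cn/s/blog_904e7b150100zvcv.html

         This method needs the cpdetector jar. I used cpdetector_1.0.10.jar, and using it also requires importing two more jars, antlr and chardet.

         In most cases this library identifies the encoding fairly accurately, but in my tests it failed to identify GBK-encoded files (my testing may not have been rigorous enough).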

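         import java.io.File;
         import java.io.IOException;
         import java.net.MalformedURLException;
         import java.nio.charset.Charset;

         import info.monitorenter.cpdetector.io.ASCIIDetector;
         import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
         import info.monitorenter.cpdetector.io.JChardetFacade;
         import info.monitorenter.cpdetector.io.ParsingDetector;
         import info.monitorenter.cpdetector.io.UnicodeDetector;

         public static String getFileEncode(File file) {
             CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
             // Several detectors can be added here; the proxy tries them in order
             detector.add(new ParsingDetector(false));
             detector.add(JChardetFacade.getInstance());
             detector.add(ASCIIDetector.getInstance());
             detector.add(UnicodeDetector.getInstance());

             Charset charSet = null;
             try {
                 charSet = detector.detectCodepage(file.toURI().toURL());
             } catch (MalformedURLException e) {
                 e.printStackTrace();
             } catch (IOException e) {
                 e.printStackTrace();
             }

             // null means detection failed or threw an exception
             if (charSet != null) {
                 return charSet.name();
             } else {
                 return null;
             }
         }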
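
Since getFileEncode can return null, and a detector may report a name the running JVM does not support, the caller needs a fallback before opening a reader. The following is a minimal sketch of that step using only the JDK; `CharsetFallback` and `resolve` are hypothetical names, and the choice of fallback charset is left to the caller.

```java
import java.nio.charset.Charset;

// Hypothetical helper (not from the original post): turn a detected name into a usable Charset.
final class CharsetFallback {
    private CharsetFallback() {}

    /** Returns the charset for detectedName, or fallback if the name is null, illegal, or unsupported. */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(detectedName) ? Charset.forName(detectedName) : fallback;
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException (a subclass) is thrown for syntactically invalid names
            return fallback;
        }
    }
}
```

A call such as `CharsetFallback.resolve(getFileEncode(file), Charset.forName("GBK"))` would then fall back to GBK, matching the default used by the first two methods.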



