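A text file's encoding can often be identified from its first few bytes, the byte order mark (BOM). The common signatures are:
ANSI: no signature (on Simplified Chinese Windows this usually means GBK);
Unicode (UTF-16 little endian): the first two bytes are FF FE;
Unicode big endian (UTF-16 big endian): the first two bytes are FE FF;
UTF-8: the first three bytes are EF BB BF (the BOM is optional, so many UTF-8 files carry no signature at all).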
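The signatures listed above are simply the character U+FEFF encoded in each format. As a small illustration, the sketch below encodes that character with each charset and prints the resulting bytes in hex. The class name `BomBytes` and the helper `bomHex` are made up for this example and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical name): shows where the BOM byte patterns come from.
public class BomBytes {

    // Encodes U+FEFF with the given charset and returns the bytes as upper-case hex.
    public static String bomHex(Charset charset) {
        byte[] bytes = "\uFEFF".getBytes(charset);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println("UTF-16LE: " + bomHex(StandardCharsets.UTF_16LE)); // UTF-16LE: FFFE
        System.out.println("UTF-16BE: " + bomHex(StandardCharsets.UTF_16BE)); // UTF-16BE: FEFF
        System.out.println("UTF-8:    " + bomHex(StandardCharsets.UTF_8));    // UTF-8:    EFBBBF
    }
}
```

Note that the UTF-8 signature is three bytes long, which is why a detector has to read three bytes rather than two.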
With these differences understood, the detection code is straightforward:
    import java.io.BufferedInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public static String get_charset(File file) {
        String charset = "GBK"; // default when neither a BOM nor a UTF-8 pattern is found
        byte[] first3Bytes = new byte[3];
        try (BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file))) {
            boolean checked = false;
            bis.mark(3); // allow reset() after reading up to three BOM bytes
            int read = bis.read(first3Bytes, 0, 3);
            if (read == -1) return charset; // empty file
            if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte) 0xFE) {
                charset = "UTF-16LE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xFE && first3Bytes[1] == (byte) 0xFF) {
                charset = "UTF-16BE";
                checked = true;
            } else if (first3Bytes[0] == (byte) 0xEF && first3Bytes[1] == (byte) 0xBB
                    && first3Bytes[2] == (byte) 0xBF) {
                charset = "UTF-8";
                checked = true;
            }
            bis.reset();
            if (!checked) {
                // No BOM: scan for the first non-ASCII byte and guess from its pattern.
                while ((read = bis.read()) != -1) {
                    if (read >= 0xF0) break; // four-byte lead bytes are not treated as UTF-8 here
                    // A lone byte in 0x80-0xBF cannot start a UTF-8 sequence: treat as GBK.
                    if (0x80 <= read && read <= 0xBF) break;
                    if (0xC0 <= read && read <= 0xDF) {
                        read = bis.read();
                        // Two-byte pattern (0xC0-0xDF)(0x80-0xBF) can also occur in GB
                        // encodings, so it proves nothing; keep scanning.
                        if (0x80 <= read && read <= 0xBF) continue;
                        else break;
                    } else if (0xE0 <= read && read <= 0xEF) {
                        // Three-byte pattern: may still be wrong, but the odds are small.
                        read = bis.read();
                        if (0x80 <= read && read <= 0xBF) {
                            read = bis.read();
                            if (0x80 <= read && read <= 0xBF) {
                                charset = "UTF-8";
                                break;
                            } else break;
                        } else break;
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return charset;
    }
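The byte-pattern scan above decides on the first multi-byte sequence it meets, so it can misjudge files without a BOM. One alternative, which is not what the original code does, is to let the JDK's strict UTF-8 decoder validate the bytes and fall back to GBK when decoding fails. The sketch below assumes the data is already in memory; the names `Utf8Check` and `isValidUtf8` are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative alternative (hypothetical names): strict UTF-8 validation.
public class Utf8Check {

    // Returns true if the whole array decodes as well-formed UTF-8.
    public static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\u4E2D\u6587" is the two-character string meaning "Chinese text".
        byte[] utf8 = "\u4E2D\u6587".getBytes(StandardCharsets.UTF_8);
        // The same two characters encoded in GBK: D6 D0 CE C4.
        byte[] gbk = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        System.out.println(isValidUtf8(utf8)); // true
        System.out.println(isValidUtf8(gbk));  // false
    }
}
```

This is still a heuristic: pure ASCII data passes as valid UTF-8, and a short GBK sequence can happen to be well-formed UTF-8. If only a prefix of a file is checked, the prefix may end in the middle of a multi-byte character and be rejected for that reason alone.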
From: http://ajava.org/code/I18N/14816.html