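In Java projects that have to work across several platforms, you sometimes need to determine which encoding a text file uses before you process it.
If you do not know the encoding, you are likely to end up with garbled text.
Of course, one approach is to agree on the encoding in advance, for example by allowing only UTF-8, but that is somewhat inconvenient.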
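To make the garbling problem concrete, here is a minimal sketch that uses only the JDK. The class name `MojibakeDemo` and the helper `roundTrip` are made up for this illustration. It encodes a short Chinese string as UTF-8 and then decodes the same bytes once with the right charset and once with the wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of juniversalchardet.
class MojibakeDemo {

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset writeWith, Charset readWith) {
        byte[] bytes = text.getBytes(writeWith);
        return new String(bytes, readWith);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so that the
        // encoding of this source file does not matter.
        String original = "\u7f16\u7801";

        // Same charset on both sides: the text survives.
        String ok = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);

        // Wrong charset when reading: every UTF-8 byte becomes its own character.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("matches original: " + ok.equals(original));
        System.out.println("garbled length:   " + garbled.length());
    }
}
```

Agreeing on UTF-8 up front corresponds to the first call. The second call is what happens when the reader simply guesses wrong.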
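I recently came across a small tool for detecting encodings. It works quite well, so here is a quick recommendation.
juniversalchardet: http://code.google.com/p/juniversalchardet/
It can recognize the following encodings: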
- Chinese
  - ISO-2022-CN
  - BIG5
  - EUC-TW
  - GB18030
  - HZ-GB-2312
- Cyrillic
  - ISO-8859-5
  - KOI8-R
  - WINDOWS-1251
  - MACCYRILLIC
  - IBM866
  - IBM855
- Greek
  - ISO-8859-7
  - WINDOWS-1253
- Hebrew
  - ISO-8859-8
  - WINDOWS-1255
- Japanese
  - ISO-2022-JP
  - SHIFT_JIS
  - EUC-JP
- Korean
  - ISO-2022-KR
  - EUC-KR
- Unicode
  - UTF-8
  - UTF-16BE / UTF-16LE
  - UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143
- Others
  - WINDOWS-1252
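The names above are what the detector reports. Whether the running JVM actually provides a charset under a given name depends on the Java installation, so it may be worth checking before opening a reader. The following is a minimal sketch using only `java.nio.charset.Charset`. The class `CharsetResolver`, the method `resolve`, and the fallback policy are assumptions made for this example and are not part of juniversalchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of juniversalchardet.
class CharsetResolver {

    /**
     * Maps a detected encoding name to a Charset, or returns the fallback
     * when the name is null, malformed, or unknown to this JVM.
     */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback; // nothing was detected
        }
        try {
            if (Charset.isSupported(detectedName)) {
                return Charset.forName(detectedName);
            }
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8;
        System.out.println(resolve("UTF-16BE", fallback));
        System.out.println(resolve(null, fallback));
        System.out.println(resolve("X-NOT-A-REAL-CHARSET", fallback));
    }
}
```

Falling back to UTF-8 is only one possible policy. Depending on the application, reporting an error to the user may be the better choice.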
Test code:
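    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.mozilla.universalchardet.UniversalDetector;

    public class DetectEncoding {
        public static void main(String[] args) throws IOException {
            byte[] buf = new byte[4096];
            // String fileName = args[0];
            String fileName = "d:/test.txt";

            // (1) Create the detector; null means no listener is registered.
            UniversalDetector detector = new UniversalDetector(null);

            // (2) Feed the file to the detector until it is done or the data runs out.
            try (FileInputStream fis = new FileInputStream(fileName)) {
                int nread;
                while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                    detector.handleData(buf, 0, nread);
                }
            }

            // (3) Tell the detector that there is no more data.
            detector.dataEnd();

            // (4) Fetch the result; it is null when nothing could be detected.
            String encoding = detector.getDetectedCharset();

            // (5) Reset the detector so that it can be reused.
            detector.reset();

            if (encoding == null) {
                System.out.println("No encoding detected.");
                return; // a null charset name would make InputStreamReader throw
            }
            System.out.println("Detected encoding = " + encoding);

            // Read the file again as text, this time with the detected encoding.
            StringBuilder content = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(fileName), encoding))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // readLine() strips the line terminator, so add one back.
                    content.append(line).append(System.lineSeparator());
                }
            }
            System.out.print(content);
        }
    }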
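The detector's answer is a guess based on the bytes it has seen, so a cheap sanity check before trusting it can be useful. One option, sketched below with JDK classes only, is to decode the bytes strictly and treat any malformed or unmappable input as a mismatch. This is a separate technique, not something juniversalchardet does for you, and the names `StrictDecodeCheck` and `decodesCleanly` are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of juniversalchardet.
class StrictDecodeCheck {

    /** Returns true if every byte sequence in data is valid for the given charset. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "\u7f16\u7801".getBytes(StandardCharsets.UTF_8);
        byte[] brokenUtf8 = {(byte) 0xC3, (byte) 0x28}; // lead byte without a valid continuation

        System.out.println(decodesCleanly(utf8Bytes, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(brokenUtf8, StandardCharsets.UTF_8));
        System.out.println(decodesCleanly(utf8Bytes, StandardCharsets.US_ASCII));
    }
}
```

Keep in mind that a clean decode does not prove the encoding is right. Single-byte charsets such as ISO-8859-1 accept any byte sequence, so this check can only rule candidates out.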