juniversalchardet
It is open source and hosted on GitHub.
To use it, first add the Maven dependency:
<dependency>
<groupId>com.github.albfernandez</groupId>
<artifactId>juniversalchardet</artifactId>
<version>2.3.0</version>
</dependency>
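import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

// The class name is illustrative; the original snippet showed only main().
public class DetectEncoding {
    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];
        // (1) Create the detector.
        UniversalDetector detector = new UniversalDetector();
        // try-with-resources closes the stream even if reading fails.
        try (InputStream fis = new FileInputStream("E:\\test1")) {
            // (2) Feed data until the detector is confident or the file ends.
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }
        // (3) Tell the detector that no more data is coming.
        detector.dataEnd();
        // (4) Read the result; null means no encoding could be determined.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }
        // (5) Reset the detector before reusing it for another input.
        detector.reset();
    }
}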
This detects the file's encoding. The following encodings are currently supported:
- Chinese
  - ISO-2022-CN
  - BIG-5
  - EUC-TW
  - HZ-GB-2312
- Cyrillic
  - ISO-8859-5
  - KOI8-R
  - WINDOWS-1251
  - MACCYRILLIC
  - IBM866
  - IBM855
- Greek
  - ISO-8859-7
  - WINDOWS-1253
- Hebrew
  - ISO-8859-8
  - WINDOWS-1255
- Japanese
  - ISO-2022-JP
  - Shift_JIS
  - EUC-JP
- Korean
  - ISO-2022-KR
  - EUC-KR
- Unicode
  - UTF-8
  - UTF-16BE / UTF-16LE
  - UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143
- Others
  - WINDOWS-1252
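
The detector returns an encoding name as a plain string, and not every name in the list above is necessarily known to the JVM you run on (the unusual UCS-4 byte orders are likely candidates). Before decoding with the detected name, it may be worth checking it first. The following is a minimal sketch using only the JDK; the class `CharsetResolver`, the method `resolve`, and the choice of fallback are hypothetical and not part of juniversalchardet.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
public class CharsetResolver {

    /**
     * Maps a detected encoding name to a Charset. Returns the fallback when
     * the name is null (nothing detected), syntactically illegal, or not
     * supported by the running JVM.
     */
    public static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(detectedName)
                    ? Charset.forName(detectedName)
                    : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Sample bytes standing in for file content; "h\u00e9llo" encoded as UTF-8.
        byte[] data = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Pretend the detector reported "UTF-8" for these bytes.
        Charset cs = resolve("UTF-8", StandardCharsets.ISO_8859_1);
        System.out.println(new String(data, cs));

        // A null result from the detector falls back to the supplied default.
        System.out.println(resolve(null, StandardCharsets.ISO_8859_1));
    }
}
```

In real use, the first argument would be the value returned by `detector.getDetectedCharset()`. Which fallback is appropriate depends on where your files come from, so treat ISO-8859-1 here as a placeholder.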