A Small Tool for Detecting File Encodings: UniversalDetector

Article link: https://blog.csdn.net/baidu_29609961/article/details/85276538

juniversalchardet

The library is open source and hosted on GitHub.

To use it, first add the Maven dependency:

<dependency>
    <groupId>com.github.albfernandez</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>2.3.0</version>
</dependency>

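import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.mozilla.universalchardet.UniversalDetector;

// The class name is only a placeholder for this example.
public class EncodingDetectDemo {

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[4096];

        // (1) Create the detector.
        UniversalDetector detector = new UniversalDetector();

        // (2) Feed the file to the detector chunk by chunk, stopping early
        //     once it is confident. try-with-resources closes the stream.
        try (InputStream fis = new FileInputStream("E:\\test1")) {
            int nread;
            while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }
        }

        // (3) Tell the detector that there is no more data.
        detector.dataEnd();

        // (4) Read the result; null means no encoding could be determined.
        String encoding = detector.getDetectedCharset();
        if (encoding != null) {
            System.out.println("Detected encoding = " + encoding);
        } else {
            System.out.println("No encoding detected.");
        }

        // (5) Reset the detector so that it can be reused.
        detector.reset();
    }
}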
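
The detector only returns a charset name as a String (or null), so the name still has to be mapped to a java.nio.charset.Charset before the file can be decoded. The following is a minimal sketch that uses only the JDK. The class DetectedCharsetDecoder and its methods resolve and decode are hypothetical names made up for this example, and falling back to UTF-8 is an assumption, not something the library prescribes. The fallback matters because some names in the supported list (for example X-ISO-10646-UCS-4-3412) may not be available as a Charset on a given JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
final class DetectedCharsetDecoder {

    /**
     * Maps a detected charset name to a Charset. Returns the fallback when
     * the name is null, malformed, or not supported by the running JVM.
     */
    static Charset resolve(String detectedName, Charset fallback) {
        if (detectedName == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(detectedName)
                    ? Charset.forName(detectedName)
                    : fallback;
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException for syntactically invalid names.
            return fallback;
        }
    }

    /** Decodes the bytes with the detected charset, or UTF-8 as an assumed default. */
    static String decode(byte[] data, String detectedName) {
        return new String(data, resolve(detectedName, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Stand-in for file content; in practice these bytes come from the file
        // and the name comes from detector.getDetectedCharset().
        byte[] data = "caf\u00e9".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decode(data, "UTF-16LE"));

        // A name this JVM does not know falls back instead of throwing.
        System.out.println(resolve("X-ISO-10646-UCS-4-3412", StandardCharsets.UTF_8));
    }
}
```

Resolving the name in one place keeps the "no encoding detected" case and the "unsupported name" case on the same code path, so the caller never has to handle a null or an exception.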

This is enough to determine a file's encoding. The library currently supports the following encodings:

Chinese
- ISO-2022-CN
- BIG-5
- EUC-TW
- GB18030
- HZ-GB-2312
Cyrillic
- ISO-8859-5
- KOI8-R
- WINDOWS-1251
- MACCYRILLIC
- IBM866
- IBM855
Greek
- ISO-8859-7
- WINDOWS-1253
Hebrew
- ISO-8859-8
- WINDOWS-1255
Japanese
- ISO-2022-JP
- Shift_JIS
- EUC-JP
Korean
- ISO-2022-KR
- EUC-KR
Unicode
- UTF-8
- UTF-16BE / UTF-16LE
- UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-3412 / X-ISO-10646-UCS-4-2143
Others
- WINDOWS-1252
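
The Unicode entries in the list differ only in byte order, which is easiest to see from the byte order mark (BOM, the code point U+FEFF) that may open a file. The sketch below shows how each BOM byte pattern corresponds to one of the names above. It is an illustration written for this article, not the library's own code: the class BomSniffer and its method sniff are hypothetical, and real detection has to handle files without any BOM as well.

```java
// Hypothetical helper that illustrates BOM patterns; not part of juniversalchardet.
final class BomSniffer {

    /** Returns the unsigned byte at index i, or -1 if the array is too short. */
    private static int at(byte[] b, int i) {
        return i < b.length ? b[i] & 0xFF : -1;
    }

    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String sniff(byte[] b) {
        int b0 = at(b, 0), b1 = at(b, 1), b2 = at(b, 2), b3 = at(b, 3);

        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) return "UTF-8";

        // Four-byte patterns are checked before the two-byte ones, because
        // FF FE 00 00 and FE FF 00 00 also start with a UTF-16 BOM.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) return "UTF-32BE";
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) return "UTF-32LE";
        if (b0 == 0xFE && b1 == 0xFF && b2 == 0x00 && b3 == 0x00) return "X-ISO-10646-UCS-4-3412";
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFF && b3 == 0xFE) return "X-ISO-10646-UCS-4-2143";

        if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
        if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";

        return null; // no BOM: the encoding has to be inferred from the content
    }

    public static void main(String[] args) {
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] plainAscii = {'h', 'i'};
        System.out.println(sniff(utf8WithBom));
        System.out.println(sniff(plainAscii));
    }
}
```

Note that FF FE 00 00 is ambiguous in principle: it could also be a UTF-16LE BOM followed by a NUL character. This sketch resolves it in favor of UTF-32LE, which is a simplifying assumption.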