Detecting File Encoding from the File Header

_狼_

Original article: https://blog.csdn.net/nrs12345/article/details/86746991

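Overview

Several commonly used file encodings:

ANSI
Unicode
UTF-8
GB2312

This article covers only how to verify a file's encoding, not how the encodings are defined.

Creating the files

Create four files in different encodings, named unicode.txt, ansi.txt, utf8.txt and utf8bom.txt. Each contains the single character “一” and is converted to its target encoding with Notepad++. (On a Chinese-locale Windows system, the ANSI code page is GBK, a superset of GB2312, so ansi.txt is the GB2312 file.)

Reading the raw bytes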

import org.apache.commons.io.FileUtils;

import java.io.File;
import java.io.IOException;

public class FileLearning {

    /** Reads the whole file as raw bytes and prints them as hex. */
    public void binary(String filepath) throws IOException {
        File file = new File(filepath);
        byte[] data = FileUtils.readFileToByteArray(file);
        System.out.println(bytesToHex(data));
    }

    /**
     * Converts a byte array to a hexadecimal string.
     * @param bytes the byte array to convert
     * @return the lowercase hex string, two digits per byte
     */
    public static String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask with 0xFF so negative bytes are not sign-extended
            String hex = Integer.toHexString(b & 0xFF);
            if (hex.length() < 2) {
                sb.append('0'); // pad single-digit values to two digits
            }
            sb.append(hex);
        }
        return sb.toString();
    }
}
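
Only the first few bytes matter for a header check, so the whole file does not have to be loaded. The following is a minimal sketch that uses only the JDK (Java 9 or later for `readNBytes`) instead of Commons IO. The class name `HeadBytes` and the method `headHex` are hypothetical names chosen for illustration; they are not part of the original code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class HeadBytes {

    /** Returns up to {@code limit} leading bytes of the file as lowercase hex. */
    static String headHex(Path path, int limit) throws IOException {
        byte[] head = new byte[limit];
        int n;
        try (InputStream in = Files.newInputStream(path)) {
            // Reads until 'limit' bytes are available or the file ends
            n = in.readNBytes(head, 0, limit);
        }
        StringBuilder sb = new StringBuilder(n * 2);
        for (int i = 0; i < n; i++) {
            sb.append(String.format("%02x", head[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: HeadBytes <file>");
            return;
        }
        // Four bytes are enough to see any of the markers discussed here
        System.out.println(headHex(Paths.get(args[0]), 4));
    }
}
```

For a file shorter than the requested limit, the method simply returns the hex of whatever bytes exist.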

Unit test

fileLearning.binary("unicode.txt");
fileLearning.binary("ansi.txt");
fileLearning.binary("utf8.txt");
fileLearning.binary("utf8bom.txt");

Output:

fffe004e
d2bb
e4b880
efbbbfe4b880

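Analysis

1. ANSI: the file consists of just two bytes, “D2 BB”, which is exactly the ANSI (GB2312) code for “一”. The bytes are stored in the same order as the code is written. GB2312 is a byte-oriented encoding, so it has no big-endian or little-endian variants.
2. Unicode: the file is four bytes, “FF FE 00 4E”. The leading “FF FE” indicates little-endian storage, and the actual code is 4E00. Saved in big-endian format (“Unicode big endian”), the file would be the four bytes “FE FF 4E 00”, where “FE FF” indicates big-endian storage.
3. UTF-8 BOM: the file is six bytes, “EF BB BF E4 B8 80”. The first three bytes, “EF BB BF”, mark the file as UTF-8, and the last three, “E4 B8 80”, are the encoding of “一”, stored in the same order as the encoding itself.
4. UTF-8: the file has no “EF BB BF” marker identifying it as UTF-8. It contains only the three bytes “E4 B8 80”.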
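
The byte values in the analysis can be cross-checked without creating any files by encoding “一” in memory. This is a small sketch, and the class name `EncodeDemo` is hypothetical. Note that Java's `UTF_16LE`, `UTF_16BE` and `UTF_8` charsets do not emit a BOM when encoding, so only the payload bytes appear. GB2312 is not among the charsets every JVM is required to support, so the sketch checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {

    /** Formats bytes as lowercase hex, two digits per byte. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e00"; // the character 一 (U+4E00)
        System.out.println("UTF-16LE: " + hex(s.getBytes(StandardCharsets.UTF_16LE))); // 004e
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4e00
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));    // e4b880
        // Optional charset: only encode if this JVM provides it
        if (Charset.isSupported("GB2312")) {
            System.out.println("GB2312:   " + hex(s.getBytes(Charset.forName("GB2312"))));
        }
    }
}
```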

For details on the encodings, see https://www.cnblogs.com/gavin-num1/p/5170247.html
Lookup table for Chinese Unicode code points: http://www.chi2ko.com/tool/CJK.htm

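Conclusion

Files starting with FF FE or FE FF are Unicode (UTF-16) files.
Files starting with EF BB BF are UTF-8 BOM files.
GB2312 and UTF-8 without a BOM cannot be told apart from the file header.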
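
These rules translate directly into a header check. The following is a minimal sketch; the class name `BomSniffer` and the method `detectBom` are hypothetical. It covers only the three markers discussed in this article. One known limitation: the UTF-32LE byte order mark is FF FE 00 00, which also starts with FF FE, and this sketch would report such a file as UTF-16LE.

```java
class BomSniffer {

    /** Returns the encoding implied by a leading BOM, or null if there is none. */
    static String detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        // No BOM: may be UTF-8 without BOM, GB2312, or something else
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] unicode = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x4E};
        byte[] ansi = {(byte) 0xD2, (byte) 0xBB};
        System.out.println(detectBom(unicode)); // UTF-16LE
        System.out.println(detectBom(ansi));    // null
    }
}
```

A null result only says that the header carries no marker; telling GB2312 from BOM-less UTF-8 still requires looking at the content.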

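Additional notes

A pitfall I hit while testing on Windows: after saving ansi.txt with the content “一” and reopening it, the file was treated as UTF-8 and showed “h”. The likely cause is that the two bytes D2 BB also happen to form a well-formed UTF-8 sequence, which decodes to U+04BB (“һ”, a Cyrillic letter that looks like h).

I recommend not testing with a single character, because the result is not accurate enough. After I replaced “一” with “大家好”, the results were more accurate.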
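
Why a longer sample helps can be shown with a strict UTF-8 validity check. This is a sketch, and the class name `Utf8Check` and the method `isValidUtf8` are hypothetical. The GB2312 bytes for “一” (D2 BB) pass as well-formed UTF-8, while the GB2312 bytes assumed here for “大家好” (B4 F3 BC D2 BA C3) do not, because B4 cannot start a UTF-8 sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    /** True if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed input: not UTF-8
        }
    }

    public static void main(String[] args) {
        // GB2312 bytes of "一"
        byte[] one = {(byte) 0xD2, (byte) 0xBB};
        // GB2312 bytes of "大家好" (assumed values, see lead-in)
        byte[] hello = {(byte) 0xB4, (byte) 0xF3, (byte) 0xBC,
                        (byte) 0xD2, (byte) 0xBA, (byte) 0xC3};
        System.out.println(isValidUtf8(one));   // true
        System.out.println(isValidUtf8(hello)); // false
        // D2 BB read as UTF-8 is U+04BB, which looks like "h"
        System.out.println(new String(one, StandardCharsets.UTF_8).equals("\u04BB")); // true
    }
}
```

A check like this can rule UTF-8 out, but passing it does not prove the file is UTF-8, which is exactly the single-character problem described above.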

To determine a file's encoding more accurately, use a third-party library such as JUniversalCharDet.

Maven dependency

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

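Detection method (taken from a snippet found online)

// Requires java.io.*, org.mozilla.universalchardet.UniversalDetector,
// a StringUtils class (for example Apache Commons Lang) and a logger field.
public String getCharset(String filepath) throws FileNotFoundException {

    UniversalDetector detector = new UniversalDetector(null);
    // try-with-resources makes sure the stream is closed
    try (InputStream is = new FileInputStream(new File(filepath))) {
        byte[] bytes = new byte[1024];
        int nread;
        // Feed the detector until it is confident or the file ends
        while ((nread = is.read(bytes)) > 0 && !detector.isDone()) {
            detector.handleData(bytes, 0, nread);
        }
    } catch (FileNotFoundException e) {
        throw e; // a missing file is reported to the caller
    } catch (IOException e) {
        logger.info("failed to read file for charset detection", e);
    }
    detector.dataEnd();
    String encode = detector.getDetectedCharset();
    // Fall back to UTF-8 when nothing was detected
    if (StringUtils.isEmpty(encode)) {
        encode = "UTF-8";
    }
    detector.reset();
    return encode;
}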

Test

@Test
public void test02() throws FileNotFoundException {

    String unicodeCharset = fileLearning.getCharset("E:\\unicode.txt");
    String ansiCharset = fileLearning.getCharset("E:\\ansi.txt");
    String utf8Charset = fileLearning.getCharset("E:\\utf8.txt");
    String utf8bomCharset = fileLearning.getCharset("E:\\utf8bom.txt");
    
    System.out.println("unicode charset: " + unicodeCharset);
    System.out.println("ansi charset: " + ansiCharset);
    System.out.println("utf8 charset: " + utf8Charset);
    System.out.println("utf8bom charset: " + utf8bomCharset);
}

Output

unicode charset: UTF-16LE
ansi charset: WINDOWS-1252
utf8 charset: UTF-8
utf8bom charset: UTF-8

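The results are fairly reliable: the Unicode file and both UTF-8 files are identified correctly. The ANSI file, however, is reported as WINDOWS-1252 rather than GB2312, so very short samples in a legacy encoding can still be misdetected.

Corrections are welcome!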
