On the Encodings Windows Notepad Uses When Saving Files

siege

Categories: JAVA, Character Encoding

Article link: https://blog.csdn.net/u010999240/article/details/71836108

When Notepad on Windows saves a text file, you can choose among several encodings, and each one produces different bytes on disk. For example:

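Type 123 in Notepad and save it with the default encoding, ANSI. ANSI here means the system's default code page, which is GBK on Simplified Chinese Windows. Opening the file in a hex viewer shows these bytes: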

31 32 33

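Save as "Unicode". Strictly speaking this label is wrong: Unicode is a character set, not an encoding scheme. The encoding Windows actually uses here is UTF-16LE, and Notepad writes the little-endian byte order mark (BOM), the bytes FF FE, at the start of the file, so the bytes are: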

FF FE 31 00 32 00 33 00

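Save as "Unicode big endian". This label is wrong for the same reason: the encoding actually used is UTF-16BE, and Notepad writes the big-endian BOM, the bytes FE FF, at the start of the file, so the bytes are: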

FE FF 00 31 00 32 00 33

Save as "UTF-8". This label is also imprecise: a UTF-8 file normally carries no BOM, whereas the file Notepad writes here starts with the UTF-8 BOM (EF BB BF), so the bytes are:

EF BB BF 31 32 33
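
As a rough cross-check of the four byte dumps above, the following sketch rebuilds them in Java. The class and method names (`NotepadEncodings`, `withBom`, `hex`) are made up for illustration. The BOM is prepended by hand because Java's `UTF-16LE`, `UTF-16BE` and `UTF-8` encoders do not write one. The sketch assumes GBK is available in the runtime and falls back to US-ASCII otherwise, which gives the same bytes for plain digits.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NotepadEncodings {
    // Hypothetical helper: prepends the given BOM bytes (possibly none) to the encoded text.
    static byte[] withBom(byte[] bom, String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(bom, 0, bom.length);
        byte[] body = text.getBytes(charset);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: GBK may be missing on a trimmed-down runtime; US-ASCII encodes digits identically.
        Charset ansi = Charset.isSupported("GBK") ? Charset.forName("GBK") : StandardCharsets.US_ASCII;
        byte[] none = {};
        byte[] bomLe = { (byte) 0xFF, (byte) 0xFE };
        byte[] bomBe = { (byte) 0xFE, (byte) 0xFF };
        byte[] bomUtf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

        System.out.println(hex(withBom(none, "123", ansi)));                           // 31 32 33
        System.out.println(hex(withBom(bomLe, "123", StandardCharsets.UTF_16LE)));     // FF FE 31 00 32 00 33 00
        System.out.println(hex(withBom(bomBe, "123", StandardCharsets.UTF_16BE)));     // FE FF 00 31 00 32 00 33
        System.out.println(hex(withBom(bomUtf8, "123", StandardCharsets.UTF_8)));      // EF BB BF 31 32 33
    }
}
```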

Here is an example of a problem caused by the BOM:

package test;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Arrays;

public class Test1 {
    public static void main(String[] args) throws IOException {
        String myString = "";
        byte[] bytes = new byte[10];
        int readCount = 0;
        try (FileOutputStream outputStream = new FileOutputStream("D:\\test\\hello.txt")) {
            outputStream.write(new byte[] { -2, -1, 0, 0x31, 0, 0x32, 0, 0x33 }); // FE FF 00 31 00 32 00 33
            // try-with-resources closes the stream, so no explicit flush() or close() is needed.
        } catch (IOException e) {
            e.printStackTrace(); // report a failed write instead of silently swallowing it
        }
        try (FileInputStream reader = new FileInputStream("D:\\test\\hello.txt")) {
            while ((readCount = reader.read(bytes, 0, 10)) != -1) {
                myString += new String(bytes, 0, readCount, "UTF-16BE"); // FE FF is decoded as the character U+FEFF
                System.out.println(Arrays.toString(bytes));
                System.out.println(myString);
                System.out.println(Integer.parseInt(myString)); // throws NumberFormatException
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

In this example the program writes the following bytes:

FE FF 00 31 00 32 00 33

It then reads them back as UTF-16BE. Converting the resulting string to a number fails, and the output is:

[-2, -1, 0, 49, 0, 50, 0, 51, 0, 0]
123
java.lang.NumberFormatException: For input string: "123"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at test.Test1.main(Test1.java:24)

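The real cause is the BOM. The error is hard to spot because the printed string looks exactly like "123". In fact the decoded string begins with the invisible character U+FEFF, so it is four characters long rather than three. When a string looks right but behaves wrong, inspect its bytes, and the problem shows up quickly.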
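
One possible way to deal with this, sketched below, is to drop a leading U+FEFF before parsing. `BomStrip` and `stripBom` are hypothetical names, not something from the original program, and the byte array stands in for the contents of hello.txt. The sketch also shows that Java's generic `UTF-16` charset consumes the BOM while decoding, unlike the explicit `UTF-16BE` charset.

```java
import java.nio.charset.StandardCharsets;

class BomStrip {
    static final char BOM = '\uFEFF';

    // Hypothetical helper: removes one leading U+FEFF, if present.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // Same bytes the example above writes to the file: FE FF 00 31 00 32 00 33.
        byte[] fileBytes = { (byte) 0xFE, (byte) 0xFF, 0, 0x31, 0, 0x32, 0, 0x33 };

        // UTF-16BE keeps the BOM as an ordinary character.
        String raw = new String(fileBytes, StandardCharsets.UTF_16BE);
        System.out.println(raw.length());                        // 4
        System.out.println(Integer.toHexString(raw.charAt(0)));  // feff
        System.out.println(Integer.parseInt(stripBom(raw)));     // 123

        // The generic UTF-16 decoder uses the BOM to pick the byte order and drops it.
        String viaUtf16 = new String(fileBytes, StandardCharsets.UTF_16);
        System.out.println(Integer.parseInt(viaUtf16));          // 123
    }
}
```

Stripping by hand keeps the decoding charset explicit. Letting `UTF-16` handle the BOM is shorter, but it only helps when the file really does start with one of the two UTF-16 BOMs.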

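Finally, a note on the BOM itself. The encodings above each start with different BOM bytes, but the BOM is really a single Unicode character, U+FEFF. Each encoding serializes that character to different bytes, yet they all decode to the same Unicode code point.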

The byte order mark (BOM) is a Unicode character, U+FEFF Byte order mark (BOM), whose appearance as a magic number at the start of a text stream can signal several things to a program consuming the text.

The following code demonstrates this:

package test;

import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        // The BOM as serialized by each encoding.
        byte[] a = new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF }; // UTF-8
        byte[] b = new byte[] { (byte) 0xFE, (byte) 0xFF };              // UTF-16BE
        byte[] c = new byte[] { (byte) 0xFF, (byte) 0xFE };              // UTF-16LE
        String aString = new String(a, 0, 3, "UTF-8");
        String bString = new String(b, 0, 2, "UTF-16BE");
        String cString = new String(c, 0, 2, "UTF-16LE");
        // Each string is the single character U+FEFF, so every line prints [-17, -69, -65].
        System.out.println(Arrays.toString(aString.getBytes("UTF-8")));
        System.out.println(Arrays.toString(bString.getBytes("UTF-8")));
        System.out.println(Arrays.toString(cString.getBytes("UTF-8")));
    }
}
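
Since the BOM acts as a magic number at the start of a stream, a reader can use it to guess the encoding before decoding. The sketch below is one minimal way to do that for the three BOMs discussed in this article. `BomSniffer`, `detect` and `startsWith` are hypothetical names, and other BOMs (for example the UTF-32 ones, whose little-endian form also begins with FF FE) are deliberately not handled.

```java
class BomSniffer {
    // Returns the charset name implied by a leading BOM, or null if none of the three is found.
    // Assumption: only the UTF-8, UTF-16BE and UTF-16LE BOMs from this article are considered.
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return "UTF-8";
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return "UTF-16BE";
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return "UTF-16LE";
        }
        return null;
    }

    // Compares the first bytes of data, treated as unsigned values, against the prefix.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[] { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x31 })); // UTF-8
        System.out.println(detect(new byte[] { (byte) 0xFE, (byte) 0xFF, 0, 0x31 }));           // UTF-16BE
        System.out.println(detect(new byte[] { (byte) 0xFF, (byte) 0xFE, 0x31, 0 }));           // UTF-16LE
        System.out.println(detect(new byte[] { 0x31, 0x32, 0x33 }));                            // null
    }
}
```

A file without a BOM, such as the ANSI file above, yields null, so the caller still needs a default encoding for that case.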