UTF-8 的BOM带来的麻烦

最新推荐文章于 2024-09-02 12:28:21 发布

jazwoo

最新推荐文章于 2024-09-02 12:28:21 发布

阅读量1k

点赞数 1

分类专栏： java

本文链接：https://blog.csdn.net/jazywoo123/article/details/8670361

版权

java 专栏收录该内容

67 篇文章 0 订阅

订阅专栏

生成的html文件，用记事本打开中文显示正常，可是用浏览器就是乱码。

后来发现虽然是utf-8的编码格式，但是是UTF-8 with BOM

public void htmlWrite(String charsetName) {
        try {
            out = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(outFileName), "UTF-8"));
            out.write(res);
            out.flush();
            if (out != null) {
                out.close();
            }
        } catch (Exception e) {
            try {
                if (out != null) {
                    out.close();
                }
            } catch (IOException e1) {
                System.out.print("write errors!" + e);
            }
            System.out.print("write errors!" + e);
        }
    }

这样生成了一个html，运行就会出现问题

UTF的字节序和BOM
UTF-8以字节为编码单元，没有字节序的问题。UTF-16以两个字节为编码单元，在解释一个UTF-16文本前，首先要弄清楚每个编码单元的字节序。例如收到一个“奎”的Unicode编码是594E，“乙”的Unicode编码是4E59。如果我们收到UTF-16字节流“594E”，那么这是“奎”还是“乙”？

Unicode规范中推荐的标记字节顺序的方法是BOM。BOM不是“Bill Of Material”的BOM表，而是Byte Order Mark。BOM是一个有点小聪明的想法：

在UCS编码中有一个叫做"ZERO WIDTH NO-BREAK SPACE"的字符，它的编码是FEFF。而FFFE在UCS中是不存在的字符，所以不应该出现在实际传输中。UCS规范建议我们在传输字节流前，先传输字符"ZERO WIDTH NO-BREAK SPACE"。

这样如果接收者收到FEFF，就表明这个字节流是Big-Endian的；如果收到FFFE，就表明这个字节流是Little-Endian的。因此字符"ZERO WIDTH NO-BREAK SPACE"又被称作BOM。

UTF-8不需要BOM来表明字节顺序，但可以用BOM来表明编码方式。字符"ZERO WIDTH NO-BREAK SPACE"的UTF-8编码是EF BB BF（读者可以用我们前面介绍的编码方法验证一下）。所以如果接收者收到以EF BB BF开头的字节流，就知道这是UTF-8编码了。

Windows就是使用BOM来标记文本文件的编码方式的。

原来BOM是在文件的开始加了几个字节作为标记。有了这个标记，一些协议和系统才能识别。

UTF-8
8-bit encoded Unicode. neé UTF8. Optional marker on front of file: EF BB BF for reading. Unfortunately, OutputStreamWriter does not automatically insert the marker on writing. Notepad can't read the file without this marker. Now the question is, how do you get that marker in there? You can't just emit the bytes EF BB BF since they will be encoded and changed. However, the solution is quite simple. prw.write( '\ufeff' ); at the head of the file. This will be encoded as EF BB BF.
DataOutputStreams have a binary length count in front of each string. Endianness does not apply to 8-bit encodings. Java DataOutputStream and ObjectOutputStream uses a slight variant of kosher UTF-8. To aid with compatibility with C in JNI, the null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls. Only the 1-byte, 2-byte, and 3-byte formats are used. Supplementary characters, (above 0xffff), are represented in the form of surrogate pairs (a pair of encoded 16 bit characters in a special range), rather than directly encoding the character.

prw.write( '\ufeff' );就是这个。
代码变为：

public void htmlWrite(String charsetName) {
        try {
            out = new BufferedWriter(new OutputStreamWriter(
                        new FileOutputStream(outFileName), "UTF-8"));
            out.write('\ufeff');
            out.write(res);
            out.flush();
            if (out != null) {
                out.close();
            }
        } catch (Exception e) {
            try {
                if (out != null) {
                    out.close();
                }
            } catch (IOException e1) {
                System.out.print("write errors!" + e);
            }
            System.out.print("write errors!" + e);
        }
    }

问题解决。