The Byte Order Mark Problem in UTF-8 Encoded Files

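A couple of days ago a colleague handed me a SQL Server database script to run, and it failed with a syntax error. Yet when I pasted the file's contents into SQL Server Management Studio, everything ran fine. That was baffling. After a lot of checking, I found that the culprit was the BOM (Byte Order Mark) of UTF-8 encoding.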

The following is quoted from Wikipedia:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.

The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

The Unicode Standard does permit the BOM in UTF-8,[2] but does not require or recommend its use.[3] Byte order has no meaning in UTF-8[4] so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.

Many Windows programs (including Windows Notepad) add BOMs to UTF-8 files by default[citation needed].

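Because Unicode can be encoded in 16-bit or 32-bit units, a computer processing such text needs to know the byte order, and the BOM exists to mark the byte order of the byte stream. Byte order, however, is a meaningless concept for UTF-8, so as a byte-order indicator the BOM is equally meaningless there. The Unicode Standard nevertheless permits a BOM in UTF-8. It sits at the very start of the file and is represented by the three bytes 0xEF, 0xBB, 0xBF.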
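As a minimal sketch of the byte layout described above, the following Python snippet checks whether a byte string begins with the UTF-8 BOM. The helper name has_utf8_bom and the sample SQL text are hypothetical, chosen only for illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 BOM bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# Encoding the code point U+FEFF as UTF-8 yields exactly the three BOM bytes.
bom_bytes = "\ufeff".encode("utf-8")

with_bom = bom_bytes + b"SELECT 1;"   # hypothetical script content with a BOM
without_bom = b"SELECT 1;"            # the same content without a BOM

print(has_utf8_bom(with_bom))
print(has_utf8_bom(without_bom))
```

The check looks only at the first three bytes, so it says nothing about whether the rest of the data is valid UTF-8.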

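Using the meaningless BOM is not recommended for UTF-8, yet many Windows programs, Windows Notepad among them, save UTF-8 files with a BOM, that is, with the three bytes 0xEF 0xBB 0xBF prepended to the file.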

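So when editing UTF-8 files, I recommend not using Notepad or similar editors. The saved file is still UTF-8, but it is no longer the same UTF-8 it was before saving, and that can cause encoding-related problems when the file is used, just as described at the start of this post.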
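One way to see why a consumer might choke on such a file: the sketch below decodes the same bytes with Python's "utf-8" and "utf-8-sig" codecs. The sample SQL text is hypothetical, and this is only an illustration of the general effect, not a reproduction of how SQL Server tools read files.

```python
# Bytes as a BOM-adding editor might save them (hypothetical content).
raw = b"\xef\xbb\xbfSELECT 1;"

# The plain "utf-8" codec keeps the BOM as an invisible first character, U+FEFF.
text_plain = raw.decode("utf-8")

# The "utf-8-sig" codec drops a leading BOM if one is present.
text_sig = raw.decode("utf-8-sig")

print(len(text_plain), len(text_sig))
print(text_plain.startswith("SELECT"), text_sig.startswith("SELECT"))
```

A parser that receives the first string sees an unexpected character before the first keyword, which is consistent with a syntax error that disappears when the visible text is copied and pasted.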

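To remove the BOM from a UTF-8 file: in Notepad++, choose "Encoding in UTF-8 without BOM" from the Encoding menu. Alternatively, delete the first three bytes of the file with any hex editor. Or, simpler still, use Vim's BOM setting for UTF-8 files (:set nobomb, then save with :w).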
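The hex-editor approach can also be scripted. Here is a minimal sketch in Python that removes the three leading bytes only when they are actually the UTF-8 BOM. The function name strip_utf8_bom is hypothetical, and the sketch reads the whole file into memory, which assumes the file is reasonably small.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    file_path = Path(path)
    data = file_path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file without its first three bytes (EF BB BF).
    file_path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

For example, calling strip_utf8_bom("script.sql") (a hypothetical file name) would rewrite the file only if it begins with the BOM. It is worth keeping a backup, since the file is modified in place.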
