The BOM Problem

vannachen

Category: Database. Tags: db2, jdbc, ibm, search, windows

Article link: https://blog.csdn.net/vannachen/article/details/1417404

The Unicode website introduces the BOM as follows:

For details, see http://www.unicode.org/faq/utf_bom.html#BOM. A second explanation found online follows the FAQ excerpt.

Byte Order Mark (BOM) FAQ

Q: What is a BOM?

A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. [AF]

Q: Where is a BOM useful?

A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format; it can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding, and furthermore, it acts as a signature for the specific encoding form used. [MD] & [AF]

Q: What does ‘endian’ mean?

A: Data types longer than a byte can be stored in computer memory with the most significant byte (MSB) first or last. The former is called big-endian, the latter little-endian. When data is exchanged in the same byte order as it was in the memory of the originating system, it may appear to be in the wrong byte order on the receiving system. In that situation, a BOM would look like 0xFFFE, which is a noncharacter, allowing the receiving system to apply byte reversal before processing the data. UTF-8 is byte oriented and therefore does not have that issue. Nevertheless, an initial BOM might be useful to identify the datastream as UTF-8. [AF]
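
The byte reversal described in this answer can be reproduced in a few lines. The following is a minimal Python sketch, not part of the FAQ; the helper name `read_u16` is made up for illustration. It serializes U+FEFF in both byte orders and shows what a receiver sees when it assumes the wrong one.

```python
import struct

BOM = 0xFEFF  # code point of the byte order mark

# Serialize the 16-bit value in both byte orders.
big_endian_bytes = struct.pack(">H", BOM)     # bytes FE FF
little_endian_bytes = struct.pack("<H", BOM)  # bytes FF FE


def read_u16(data: bytes, byte_order: str) -> int:
    """Interpret the first two bytes as one 16-bit unit.

    byte_order is '>' for big-endian or '<' for little-endian
    (hypothetical helper, for illustration only).
    """
    return struct.unpack(byte_order + "H", data[:2])[0]


# A little-endian receiver reading big-endian data sees the swapped value.
misread = read_u16(big_endian_bytes, "<")
print(hex(misread))  # 0xfffe, a noncharacter, so the receiver knows to swap

# Reading with the matching byte order gives the BOM back.
print(hex(read_u16(little_endian_bytes, "<")))  # 0xfeff
```

Because 0xFFFE can never be a legitimate character, seeing it at the start of a stream is an unambiguous signal that every 16-bit unit needs its bytes swapped.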

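UCS defines a character named "ZERO WIDTH NO-BREAK SPACE" whose code point is U+FEFF. U+FFFE, by contrast, is a noncharacter in UCS, so it should never appear in real data interchange. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then sees the bytes FE FF, the stream is big-endian; if it sees FF FE, the stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.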

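UTF-8 does not need a BOM to indicate byte order, but a BOM can still be used to identify the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream starting with EF BB BF knows that the stream is UTF-8.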

This is exactly how Windows marks the encoding of text files: with a BOM.
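
The signature idea above can be sketched in Python using the BOM constants from the standard `codecs` module. The function name `sniff_bom` is an assumption made for this example, and the check is only a heuristic: data without a BOM returns `None`, and UTF-16LE text whose first character after the BOM is U+0000 would be mistaken for UTF-32LE.

```python
import codecs

# Longer signatures come first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    return None


# U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
print("\ufeff".encode("utf-8").hex(" "))         # ef bb bf
print(sniff_bom("\ufeff中文".encode("utf-8")))    # utf-8
print(sniff_bom(b"\xfe\xff\x4e\x2d"))            # utf-16-be
print(sniff_bom(b"plain ASCII"))                 # None
```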

Now, back to the main topic!

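Today, while using IBM DB2's Net Search Extender, I found that English and German text worked fine, but as soon as the DEF file contained Chinese characters, db2extth reported that it could not compile the file. It was only while I was going through the manual that my partner wrote me this note: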

Important: The UTF-8 BOM must be removed from the definition file before compilation. "db2extth" cannot process the BOM and reports it as an error in the input. (Translated from the German original.)

So that was it!
I then looked up material on the BOM online and found the passages quoted above. Now for the hands-on part:

I rewrote the DEF file, saved it in UltraEdit as "UTF-8 without BOM", and compiled it again: it passed! Using the NSE thesaurus, the search returned results!
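
The same cleanup can be done without an editor. The following Python sketch is an alternative to the UltraEdit step, not what was actually used here; the function name `strip_utf8_bom` and the file name in the usage comment are hypothetical. It removes a leading EF BB BF from a file in place and leaves files without a BOM untouched.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: str) -> bool:
    """Remove a leading UTF-8 BOM from the file in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    file_path = Path(path)
    data = file_path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do, the file is already BOM-free
    # Write back everything after the three signature bytes.
    file_path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True


# Usage (hypothetical file name):
#   strip_utf8_bom("thesaurus.def")
```

Working on raw bytes rather than decoded text means the rest of the file is written back unchanged, whatever it contains.
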
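Later I wondered: if the Chinese characters in my DEF file were retrieved from DB2 via JDBC, would that cause a problem? Would some kind of conversion to BOM-free UTF-8 still be needed? So I tried it, and it worked without any problem at all! Looking through the BOM FAQ again, I found the following: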

Q: Why wouldn’t I always use a protocol that requires a BOM?

A: Where the data is typed, such as a field in a database, a BOM is unnecessary. In particular, if a text data stream is marked as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted. Any FEFF would be interpreted as a ZWNBSP.

Do not tag every string in a database or set of fields with a BOM, since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have precisely the same content, but not be binary-equal (where one is prefaced by a BOM). [MD]
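
The two drawbacks named in this answer are easy to see with plain strings. This is a small Python sketch with made-up sample values; `strip_leading_bom` is a hypothetical helper, not something from the FAQ or from DB2.

```python
BOM = "\ufeff"

plain = "中文"
tagged = BOM + "中文"  # same visible content, prefixed with a BOM

# Identical content for a human reader, but not equal as data.
print(plain == tagged)                                  # False
print(plain.encode("utf-8") == tagged.encode("utf-8"))  # False

# Concatenation leaves a stray U+FEFF in the middle of the result.
joined = tagged + tagged
print(joined.count(BOM))  # 2


def strip_leading_bom(text: str) -> str:
    """Drop a single leading U+FEFF, if present (hypothetical helper)."""
    return text[1:] if text.startswith(BOM) else text


print(strip_leading_bom(tagged) == plain)  # True
```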

Ha, and once again everything suddenly became clear!

That's all!
