A Brief Analysis of Unicode Encoding

Latest recommended article published on 2023-09-06 17:20:15

sunrock

Category: JAVA  Tags: character transformation integer basic encoding table

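1. Origin
While learning Java I kept running into encodings such as UTF-8 and UTF-16, and the books I was reading did not explain them. That prompted me to look into these encodings myself.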

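2. Big-endian and little-endian
Before discussing the encodings, we need two concepts: big-endian and little-endian.
The word "endian" comes from Gulliver's Travels. The civil war in Lilliput broke out over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). The dispute caused six rebellions, in which one emperor lost his life and another lost his crown.
In computing, big-endian and little-endian are the two ways a CPU can lay out a multi-byte number. Big-endian stores the most significant byte first, which matches the way we normally read numbers. Little-endian does the reverse and stores the most significant byte last.
For example, the Unicode code point of "A" is 0x0041. Written to a file as a 16-bit value, big-endian order gives the bytes 00 41, while little-endian order gives 41 00.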
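The "A" example can be reproduced with java.nio.ByteBuffer, which lets the caller choose the byte order. This is only a small sketch; the class name EndianSketch and the helpers toBytes and hex are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianSketch {
    // Serializes one 16-bit value in the requested byte order.
    static byte[] toBytes(char value, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(2).order(order);
        buf.putChar(value);
        return buf.array();
    }

    // Formats bytes as space-separated two-digit hex, e.g. "00 41".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(toBytes('A', ByteOrder.BIG_ENDIAN)));    // 00 41
        System.out.println(hex(toBytes('A', ByteOrder.LITTLE_ENDIAN))); // 41 00
    }
}
```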

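3. Unicode, UCS and UTF
Like the familiar ASCII, ISO 8859, Unicode and UCS (Universal Character Set) are all character tables in which every character is assigned a unique character number. They differ only in the size of their character sets, which are ordered ASCII < ISO 8859 < Unicode < UCS. The tables are also backward compatible: a character that appears in several of them has the same number in each (for ISO 8859 this holds strictly for ISO 8859-1, i.e. Latin-1), and each later standard simply supports more characters. Note that "Unicode" here means Unicode 1.0. So what is Unicode 1.0, and are there other versions?
To answer that we need some history. Two organizations originally set out, independently, to design a universal character set: the International Organization for Standardization (ISO) and an association of software vendors, the Unicode Consortium (unicode.org). ISO developed the ISO 10646 project, which is what we usually call UCS, and the Unicode Consortium developed the Unicode project. The code space of ISO 10646 as used today is U+0000-U+10FFFF (its UCS-4 form originally allowed larger values, see section 7), whereas the code space of Unicode was U+0000-U+FFFF. That is Unicode 1.0, and it is what people commonly mean by "Unicode".
Around 1991 both sides recognized that the world did not need two incompatible character sets. They began merging their work and cooperating on a single code table. The outcome was Unicode 2.0, which uses the same repertoire and code points as ISO 10646-1.
UCS defines which number represents each character and how many bytes are used for it. How those numbers are serialized for transmission is defined by the UTF (UCS Transformation Format) specifications; common ones include UTF-8, UTF-7 and UTF-16. Since Java mainly uses UTF-8 and UTF-16, we concentrate on these two and mention UTF-32, UCS-2 and UCS-4 along the way.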
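4. UTF-16
We start with UTF-16, which is also the encoding of the char type in JDK 5.0. First a definition:
Code Point: in the ISO 10646 coding scheme, the integer value assigned to a character is called its code point.
In the Unicode standard a code point is written as U+ followed by a hexadecimal number; for example, the code point of A is U+0041. The code space of ISO 10646 (that is, of Unicode 2.0) is U+0000-U+10FFFF. It is divided into 17 groups (planes) of 65536 code points each. The first plane is called the Basic Multilingual Plane (BMP); it is the character set defined by Unicode 1.0 and ranges from U+0000 to U+FFFF. Characters whose code points lie in U+10000-U+10FFFF are called supplementary characters.
A second definition:
Code Unit: a code unit is the 16-bit value that UTF-16 uses as its basic unit of encoding. A code point in the Basic Multilingual Plane is represented directly by one such 16-bit value.
Clearly, a character in the Basic Multilingual Plane can be represented by a single code unit. So how are supplementary characters represented?
Some background first: within the BMP range U+0000 to U+FFFF there is a reserved block (U+D800 to U+DFFF) in which no characters are defined. This block is called the surrogates area. A supplementary character is represented by two code units. Let U be the code point of a supplementary character and let U' = U - 0x10000. Since U <= 0x10FFFF, we have U' <= 0xFFFFF, which fits in exactly 20 bits, so each of the two code units stores 10 bits. How are those 10 bits placed? UTF-16 specifies that the first code unit lies in the range 0xD800-0xDBFF and the second in the range 0xDC00-0xDFFF. To encode, initialize two 16-bit integers to 0xD800 and 0xDC00, then put the high 10 bits of the 20-bit value into the low 10 bits of the first code unit and the low 10 bits into the low 10 bits of the second code unit. That completes the encoding of a supplementary character.
The procedure for encoding an ISO 10646 character number as UTF-16 is therefore: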
Let U be the character number, no greater than 0x10FFFF.

   1) If U < 0x10000, encode U as a 16-bit unsigned integer and
      terminate.

   2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
      U' must be less than or equal to 0xFFFFF. That is, U' can be
      represented in 20 bits.

   3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
      0xDC00, respectively. These integers each have 10 bits free to
      encode the character value, for a total of 20 bits.

   4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
      bits of W1 and the 10 low-order bits of U' to the 10 low-order
      bits of W2. Terminate.
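The four encoding steps translate almost line by line into Java. The sketch below is one possible transcription, not the JDK's own implementation; the class name Utf16EncodeSketch and the method encode are hypothetical, and the range check at the top is an added assumption about how to treat invalid input. Like the quoted procedure, it does not reject values inside the surrogates area.

```java
class Utf16EncodeSketch {
    // Encodes one character number U as one or two UTF-16 code units.
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF) {
            throw new IllegalArgumentException("not a character number: " + u);
        }
        // Step 1: BMP values are stored directly in a single code unit.
        if (u < 0x10000) {
            return new char[] { (char) u };
        }
        // Step 2: U' fits in 20 bits.
        int uPrime = u - 0x10000;
        // Steps 3 and 4: start from 0xD800 / 0xDC00 and fill the low 10 bits.
        char w1 = (char) (0xD800 | (uPrime >>> 10));   // high 10 bits of U'
        char w2 = (char) (0xDC00 | (uPrime & 0x3FF));  // low 10 bits of U'
        return new char[] { w1, w2 };
    }

    public static void main(String[] args) {
        for (char c : encode(0x10400)) {
            System.out.printf("%04X ", (int) c);       // D801 DC00
        }
        System.out.println();
    }
}
```

For valid code points the result should agree with the standard library method Character.toChars(int), which makes a convenient cross-check.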
Correspondingly, the procedure for decoding UTF-16 into an ISO 10646 character number is:
Let W1 be the next 16-bit integer in the sequence of integers
representing the text. Let W2 be the (eventual) next integer
following W1.

   1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
      of W1. Terminate.

   2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
      is in error and no valid character can be obtained using W1.
      Terminate.

   3) If there is no W2 (that is, the sequence ends with W1), or if W2
      is not between 0xDC00 and 0xDFFF, the sequence is in error.
      Terminate.

   4) Construct a 20-bit unsigned integer U', taking the 10 low-order
      bits of W1 as its 10 high-order bits and the 10 low-order bits of
      W2 as its 10 low-order bits.

   5) Add 0x10000 to U' to obtain the character value U. Terminate.
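The decoding steps can be sketched the same way. The class Utf16DecodeSketch and the method decodeAt are hypothetical names, and returning -1 for an ill-formed sequence is just a convention chosen for this example; the quoted procedure only says that the sequence is in error.

```java
class Utf16DecodeSketch {
    // Decodes the character that starts at index i.
    // Returns the character number, or -1 if the sequence is in error.
    static int decodeAt(char[] units, int i) {
        int w1 = units[i];
        // Step 1: outside the surrogates area, W1 is the character itself.
        if (w1 < 0xD800 || w1 > 0xDFFF) {
            return w1;
        }
        // Step 2: a sequence must not start with a second-half code unit.
        if (w1 > 0xDBFF) {
            return -1;
        }
        // Step 3: W2 must exist and lie in 0xDC00-0xDFFF.
        if (i + 1 >= units.length) {
            return -1;
        }
        int w2 = units[i + 1];
        if (w2 < 0xDC00 || w2 > 0xDFFF) {
            return -1;
        }
        // Step 4: rebuild the 20-bit U' from the two 10-bit halves.
        int uPrime = ((w1 & 0x3FF) << 10) | (w2 & 0x3FF);
        // Step 5: add back the offset.
        return uPrime + 0x10000;
    }

    public static void main(String[] args) {
        char[] text = { '\uD801', '\uDC00' };
        System.out.printf("U+%X%n", decodeAt(text, 0));  // U+10400
    }
}
```

For well-formed input the result should match Character.codePointAt(char[], int) from the standard library.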
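This shows that restricting the two code units to fixed ranges lets us quickly tell whether a character consists of one code unit or two, and also whether a given code unit is the first or the second half of a supplementary character.
At this point we should also discuss UCS-2, which is often confused with UTF-16. UCS-2 is an encoding defined by ISO 10646 that uses 16 bits per character, but its code space is only U+0000-U+FFFF, i.e. the Basic Multilingual Plane, so it cannot represent supplementary characters.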
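5. UTF-8
The characters that UTF-8 encodes all come from the range U+0000-U+10FFFF. Encoding a character in UTF-8 produces 1 to 4 octets (8-bit groups), in the following form:
Character number range (hexadecimal)     UTF-8 octet sequence (binary)
0000 0000-0000 007F                      0xxxxxxx
0000 0080-0000 07FF                      110xxxxx 10xxxxxx
0000 0800-0000 FFFF                      1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF                      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

From the table we can see that in the range 00-7F ASCII and UTF-8 produce the same encoding. For every other range, the number of leading 1 bits in the first octet of the UTF-8 sequence equals the number of octets in the sequence, and every following octet begins with "10".
The procedure for encoding a character as UTF-8 is: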
   1. Determine the number of octets required from the character number
      and the first column of the table above. It is important to note
      that the rows of the table are mutually exclusive, i.e., there is
      only one valid way to encode a given character.

   2. Prepare the high-order bits of the octets as per the second
      column of the table.

   3. Fill in the bits marked x from the bits of the character number,
      expressed in binary. Start by putting the lowest-order bit of
      the character number in the lowest-order position of the last
      octet of the sequence, then put the next higher-order bit of the
      character number in the next higher-order position of that octet,
      etc. When the x bits of the last octet are filled in, move on to
      the next to last octet, then to the preceding one, etc. until all
      x bits are filled in.
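For example, the Unicode code point of the character "汉" is 6C49. 6C49 lies between 0800 and FFFF, so the 3-octet template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary and grouped as 4 + 6 + 6 bits, 6C49 is 0110 110001 001001. Substituting these bits for the x positions in order gives 11100110 10110001 10001001, that is, E6 B1 89.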
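The table and the three encoding steps can be written out with shifts and masks. The following is a minimal sketch rather than a production encoder; Utf8EncodeSketch and encode are hypothetical names. It rejects values in the surrogates area, which RFC 3629 forbids in UTF-8, although the steps quoted above do not mention that check.

```java
class Utf8EncodeSketch {
    // Encodes one character number as 1 to 4 UTF-8 octets.
    static byte[] encode(int u) {
        if (u < 0 || u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) {
            throw new IllegalArgumentException("cannot encode: " + u);
        }
        if (u <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) u };
        }
        if (u <= 0x7FF) {                      // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (u >>> 6)),
                (byte) (0x80 | (u & 0x3F))
            };
        }
        if (u <= 0xFFFF) {                     // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (u >>> 12)),
                (byte) (0x80 | ((u >>> 6) & 0x3F)),
                (byte) (0x80 | (u & 0x3F))
            };
        }
        return new byte[] {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            (byte) (0xF0 | (u >>> 18)),
            (byte) (0x80 | ((u >>> 12) & 0x3F)),
            (byte) (0x80 | ((u >>> 6) & 0x3F)),
            (byte) (0x80 | (u & 0x3F))
        };
    }

    public static void main(String[] args) {
        for (byte b : encode(0x6C49)) {
            System.out.printf("%02X ", b & 0xFF);   // E6 B1 89
        }
        System.out.println();
    }
}
```

A quick cross-check is to compare the result with "\u6C49".getBytes(StandardCharsets.UTF_8) from the standard library.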
Correspondingly, the procedure for decoding UTF-8 is:
   1. Initialize a binary number with all bits set to 0. Up to 21 bits
      may be needed.

   2. Determine which bits encode the character number from the number
      of octets in the sequence and the second column of the table
      above (the bits marked x).

   3. Distribute the bits from the sequence to the binary number, first
      the lower-order bits from the last octet of the sequence and
      proceeding to the left until no x bits are left. The binary
      number is now equal to the character number.
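Decoding can be sketched as follows. The quoted steps fill the number from the last octet towards the first; this sketch walks left to right and shifts as it goes, which yields the same value. Utf8DecodeSketch and decode are hypothetical names, the method expects exactly one complete sequence, and it deliberately leaves out the stricter checks a real decoder needs (overlong forms, surrogates, values above 0x10FFFF).

```java
class Utf8DecodeSketch {
    // Decodes exactly one UTF-8 octet sequence into its character number.
    static int decode(byte[] seq) {
        if (seq.length == 0) {
            throw new IllegalArgumentException("empty sequence");
        }
        int first = seq[0] & 0xFF;
        int length;
        int value;   // step 1: the number being assembled (up to 21 bits)
        // Step 2: the lead octet tells us the length and its own x bits.
        if ((first & 0x80) == 0x00) {
            length = 1; value = first;          // 0xxxxxxx
        } else if ((first & 0xE0) == 0xC0) {
            length = 2; value = first & 0x1F;   // 110xxxxx
        } else if ((first & 0xF0) == 0xE0) {
            length = 3; value = first & 0x0F;   // 1110xxxx
        } else if ((first & 0xF8) == 0xF0) {
            length = 4; value = first & 0x07;   // 11110xxx
        } else {
            throw new IllegalArgumentException("invalid first octet");
        }
        if (seq.length != length) {
            throw new IllegalArgumentException("wrong number of octets");
        }
        // Step 3: append the 6 x bits of each following 10xxxxxx octet.
        for (int i = 1; i < length; i++) {
            int b = seq[i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                throw new IllegalArgumentException("invalid continuation octet");
            }
            value = (value << 6) | (b & 0x3F);
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] han = { (byte) 0xE6, (byte) 0xB1, (byte) 0x89 };
        System.out.printf("U+%04X%n", decode(han));   // U+6C49
    }
}
```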
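6. Byte order of UTF and the BOM
UTF-8 uses a single byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text we must know the byte order of each code unit. For example, the code point of "奎" is 594E and the code point of "乙" is 4E59. If we receive the UTF-16 byte stream 59 4E, is it "奎" or "乙"?
The method the Unicode standard recommends for marking byte order is the BOM, the Byte Order Mark. The BOM is a rather clever idea: UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" whose code point is FEFF, while FFFE is not a character in UCS and so should never appear in real data. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.
If the receiver then sees FE FF, the stream is big-endian; if it sees FF FE, the stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to signal the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding procedure described earlier). So a receiver that sees a byte stream starting with EF BB BF knows that it is UTF-8.
Windows uses the BOM in exactly this way to mark the encoding of text files.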
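A receiver could check for the three byte patterns described in this section roughly as follows. This is a simplified sketch: BomSketch and detect are hypothetical names, and the string "unknown" is just a placeholder for the case where no BOM is present. It also ignores UTF-32, whose little-endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 little-endian one. The main method additionally shows the "奎" versus "乙" ambiguity using the JDK's UTF_16BE and UTF_16LE charsets.

```java
import java.nio.charset.StandardCharsets;

class BomSketch {
    // Guesses the encoding from a leading BOM, if there is one.
    static String detect(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";   // FE FF: big-endian
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";   // FF FE: little-endian
        }
        return "unknown";        // no BOM found
    }

    public static void main(String[] args) {
        // The same two bytes read as different characters in each byte order.
        byte[] ambiguous = { 0x59, 0x4E };
        String asBig = new String(ambiguous, StandardCharsets.UTF_16BE);
        String asLittle = new String(ambiguous, StandardCharsets.UTF_16LE);
        System.out.printf("BE: U+%04X, LE: U+%04X%n",
                (int) asBig.charAt(0), (int) asLittle.charAt(0)); // BE: U+594E, LE: U+4E59

        byte[] withBom = { (byte) 0xFF, (byte) 0xFE, 0x59, 0x4E };
        System.out.println(detect(withBom));                      // UTF-16LE
    }
}
```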
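7. UTF-32
To talk about UTF-32 we first have to mention UCS-4. Like UCS-2, UCS-4 is an encoding form defined by ISO 10646. It encodes each character in 32 bits, and its code space runs from 0x0000 to 0x7FFFFFFF (the most significant bit must be 0). The range actually used by ISO 10646, however, is only 0x0000-0x10FFFF, and the standard states that future extensions will not go beyond it. The UCS-4 code space is therefore far larger than needed, which led to UTF-32: it still uses 32 bits per character, but its code space is only 0x0000-0x10FFFF.
Using 4 bytes for every character wastes a lot of space, which is why these two encodings are rarely used. Four bytes per character does bring one advantage, though: every character has the same length, whereas characters in both UTF-16 and UTF-8 are variable-length, and fixed length makes character processing more convenient.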
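The fixed-length versus variable-length trade-off is easy to see from Java. In the sketch below an int array of code points stands in for UTF-32 text, one 32-bit element per character; Utf32Sketch and toCodePoints are hypothetical names, and String.codePoints() requires Java 8 or later.

```java
class Utf32Sketch {
    // One int per character: effectively the UTF-32 view of the string.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A" followed by the supplementary character U+10400.
        String s = new StringBuilder("A").appendCodePoint(0x10400).toString();

        System.out.println(s.length());                      // 3 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 2 characters

        int[] utf32 = toCodePoints(s);
        // With fixed-length units the n-th character is simply utf32[n].
        System.out.printf("U+%X%n", utf32[1]);               // U+10400
    }
}
```

In the UTF-16 string, finding the second character means inspecting the units before it; in the int array it is a plain index, at the cost of 4 bytes per character.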
References:
[1] 程序员趣味读物：谈谈Unicode编码 (Fun reading for programmers: on Unicode encoding)
<http://www.pconline.com.cn/pcedu/empolder/gj/other/0505/616631_1.html>
[2] RFC 3629 - UTF-8, a transformation format of ISO 10646
<http://www.faqs.org/rfcs/rfc3629.html>
[3] RFC 2781 - UTF-16, an encoding of ISO 10646
<http://www.faqs.org/rfcs/rfc2781.html>
[4] Wikipedia
<http://en.wikipedia.org/wiki>
