A First Look at the Unicode Character Set

In the first version of UCS, 34203 different characters are included. Of these, 21204 are ideographic characters used in Chinese, Japanese and Korean, and 6656 are Korean Hangul syllabograms. To guarantee that the coding space will not be filled up even in the future (2 octets give only 65536 different character positions), a 4-octet form of UCS (UCS-4) is also defined.

The 65536 positions in the 2-octet form of UCS are divided into 256 rows with 256 cells in each. The first octet of a character representation gives the row number, the second the cell number. The first row, row 0, contains exactly the same characters as ISO/IEC 8859-1. The first 128 characters are thus the ASCII characters. The octet representing an ISO/IEC 8859-1 character is easily transformed to its representation in UCS by putting a 0 octet in front of it. UCS includes the same control characters as ISO/IEC 8859, and these are also in row 0. An overview of the content of all rows is found in the annex.

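Row 0 of UCS-2 (and of UTF-16) is exactly the same as ISO/IEC 8859-1, and its first 128 characters are the ASCII characters. Converting WE8ISO8859P1 characters to UCS-2 therefore only requires putting a zero byte in front of each one. This is also why the approach in my previous post, about garbled text from the OraOLEDB driver on an Oracle database with an English character set, works.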
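The zero-byte rule above can be checked with a short sketch. The helper name `latin1_to_ucs2_be` is made up for illustration, and big-endian byte order is an assumption chosen to match the "row octet first, cell octet second" description; on a little-endian layout the zero byte would follow each character instead.

```python
def latin1_to_ucs2_be(data: bytes) -> bytes:
    """Widen ISO/IEC 8859-1 bytes to big-endian UCS-2 (hypothetical helper)."""
    out = bytearray()
    for octet in data:
        out.append(0x00)   # row number: every Latin-1 character lives in row 0
        out.append(octet)  # cell number: the original Latin-1 octet, unchanged
    return bytes(out)


text = "café"
latin1_bytes = text.encode("latin-1")
ucs2_bytes = latin1_to_ucs2_be(latin1_bytes)

# Python's own big-endian UTF-16 encoder produces the same bytes for row 0.
print(ucs2_bytes == text.encode("utf-16-be"))
```

This only holds for characters in row 0; anything outside ISO/IEC 8859-1 needs a real conversion table or codec.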

The first 128 characters of UTF-8 are also the ASCII characters, each encoded as the same single byte.

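UTF-16 comes in two lengths: characters in the BMP (Basic Multilingual Plane) take only two bytes, while supplementary characters need four bytes.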

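At the time of writing, the latest version of Unicode is 5.0. New versions only add characters on top of the existing set and do not change the encoding of characters already assigned. Besides UTF-8 there are UTF-16 and UTF-32; Windows uses UTF-16, and some Unix systems use UTF-32.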

UTF-8 encoding scheme:
   UCS-4 range (hex.)     UTF-8 octet sequence (binary)
   0000 0000-0000 007F    0xxxxxxx
   0000 0080-0000 07FF    110xxxxx 10xxxxxx
   0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
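A single-byte character has its first bit set to 0. If the top two bits of the first byte are 1 followed by a 0, the character consists of two bytes, and so on: the number of leading 1 bits gives the length of the sequence, and every following byte starts with 10. ASCII characters are encoded in UTF-8 exactly as in ASCII, while Chinese characters usually need 3 to 4 bytes (3 for those in the BMP, 4 for supplementary ones). The 5- and 6-octet rows come from the original UCS-4 definition; the current UTF-8 definition (RFC 3629) stops at U+10FFFF, so at most 4 bytes are used in practice.
Because a lead byte can never be mistaken for a continuation byte, the character boundaries are easy to find again when a byte is corrupted or lost.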
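The bit patterns in the table can be turned into a small encoder, shown here as a minimal sketch rather than a production codec. The names `utf8_encode` and `char_starts` are hypothetical; the encoder follows the table literally (including the 5- and 6-octet rows) and does not reject surrogate code points, which a strict implementation would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the bit patterns from the table (sketch)."""
    if cp < 0:
        raise ValueError("code point must be non-negative")
    if cp <= 0x7F:
        return bytes([cp])  # 0xxxxxxx: identical to ASCII
    # (upper bound of the range, sequence length, lead-byte prefix bits)
    for limit, length, prefix in (
        (0x7FF, 2, 0xC0),        # 110xxxxx
        (0xFFFF, 3, 0xE0),       # 1110xxxx
        (0x1FFFFF, 4, 0xF0),     # 11110xxx
        (0x3FFFFFF, 5, 0xF8),    # 111110xx
        (0x7FFFFFFF, 6, 0xFC),   # 1111110x
    ):
        if cp <= limit:
            break
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # 10xxxxxx: low six bits
        cp >>= 6
    return bytes([prefix | cp] + tail[::-1])


def char_starts(data: bytes) -> list:
    """Offsets of bytes that can begin a character (anything except 10xxxxxx)."""
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


for ch in "Aé中":
    print(ch, utf8_encode(ord(ch)).hex(), ch.encode("utf-8").hex())
print(char_starts("a中b".encode("utf-8")))
```

`char_starts` illustrates the resynchronization property: after damage, a decoder only has to skip bytes of the form 10xxxxxx to reach the next character boundary.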

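UTF-16 is similar:
In a four-byte UTF-16 character, the first two bytes form a value in the range 0xD800-0xDBFF (high surrogate) and the last two bytes a value in the range 0xDC00-0xDFFF (low surrogate).
Because the two ranges do not overlap with each other or with ordinary BMP characters, the character boundaries can still be determined when one 16-bit unit is lost. A single lost byte, however, shifts the alignment of every 16-bit unit after it.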
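A minimal sketch of how a supplementary character maps onto the two surrogate ranges, and how a single 16-bit unit can be classified on its own. The function names `to_surrogates` and `classify_unit` are invented for this illustration.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogate units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000            # 20 significant bits remain
    high = 0xD800 | (offset >> 10)   # top 10 bits -> 0xD800-0xDBFF
    low = 0xDC00 | (offset & 0x3FF)  # low 10 bits -> 0xDC00-0xDFFF
    return high, low


def classify_unit(unit: int) -> str:
    """Tell from the value alone what role a 16-bit unit plays."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"


high, low = to_surrogates(0x1F600)
encoded = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(hex(high), hex(low))
print(encoded == chr(0x1F600).encode("utf-16-be"))
print(classify_unit(0x4E2D), "/", classify_unit(high), "/", classify_unit(low))
```

Since a unit's role is visible from its value alone, a decoder that meets a low surrogate without a preceding high surrogate knows a unit is missing and can resume at the next unit.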

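Variable-length encodings such as GBK are more troublesome: the value ranges of the first and second bytes overlap, so one wrong byte can turn a long run of the following text into garbage.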
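The difference can be seen by dropping one byte from the same text in both encodings. This is only an illustrative sketch: the sample string and the helper `drop_first_byte` are assumptions, and it relies on the `gbk` codec that ships with standard CPython.

```python
def drop_first_byte(text: str, encoding: str) -> str:
    """Encode, lose the first byte, and decode what is left (hypothetical helper)."""
    data = text.encode(encoding)
    # errors="replace" substitutes U+FFFD for anything that cannot be decoded
    return data[1:].decode(encoding, errors="replace")


text = "中文字符"
gbk_damaged = drop_first_byte(text, "gbk")
utf8_damaged = drop_first_byte(text, "utf-8")

print(repr(gbk_damaged))
print(repr(utf8_damaged))
print("UTF-8 recovered the tail:", utf8_damaged.endswith("文字符"))
print("GBK recovered the tail:", gbk_damaged.endswith("文字符"))
```

In the UTF-8 case only the damaged character is lost, because the orphaned continuation bytes are recognizable as such. In the GBK case every remaining byte pairs up with the wrong neighbour, so the misreading continues through the rest of the string.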