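Part 1: The rules:
1. 0x00 ~ 0x7F (00 and 7F included): the three encodings are identical.
2. 0x80 ~ 0xBF (80 and BF included)
Here: ISO8859-1 = (Unicode char)Unicode = (unsigned char)(utf-8[1]), where utf-8[1] is the second byte of the two-byte UTF-8 sequence and the first byte, utf-8[0], is 0xC2.
3. 0xC0 ~ 0xFF (C0 and FF included)
Here: ISO8859-1 = (Unicode char)Unicode = (unsigned char)(utf-8[1] + 0x40), where the first byte, utf-8[0], is 0xC3.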
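The three rules can be sketched in Python as follows. This is only an illustration: the function names `latin1_byte_to_utf8` and `utf8_to_latin1_byte` are made up for this example, and the lead bytes 0xC2 and 0xC3 are taken from the table in Part 3. The final loop compares the hand-written rules with Python's built-in `latin-1` and `utf-8` codecs.

```python
def latin1_byte_to_utf8(b: int) -> bytes:
    """Map one ISO-8859-1 byte to its UTF-8 bytes using the three range rules."""
    if not 0x00 <= b <= 0xFF:
        raise ValueError("not a single byte value")
    if b <= 0x7F:
        return bytes([b])             # rule 1: identical in all three encodings
    if b <= 0xBF:
        return bytes([0xC2, b])       # rule 2: utf-8[1] equals the Latin-1 byte
    return bytes([0xC3, b - 0x40])    # rule 3: utf-8[1] + 0x40 equals the Latin-1 byte


def utf8_to_latin1_byte(seq: bytes) -> int:
    """Inverse mapping; rejects anything that is not a Latin-1 character."""
    if len(seq) == 1 and seq[0] <= 0x7F:
        return seq[0]
    if len(seq) == 2 and 0x80 <= seq[1] <= 0xBF:
        if seq[0] == 0xC2:
            return seq[1]
        if seq[0] == 0xC3:
            return seq[1] + 0x40
    raise ValueError("not the UTF-8 encoding of a Latin-1 character")


# Cross-check every byte value against the built-in codecs.
for value in range(256):
    expected = bytes([value]).decode("latin-1").encode("utf-8")
    assert latin1_byte_to_utf8(value) == expected
    assert utf8_to_latin1_byte(expected) == value
```

The Unicode code point never needs a separate step here, because for ISO-8859-1 it is numerically equal to the byte value.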
Part 2: The details:
The reason: a code point in the range 000080 - 0007FF (1920 code points, each encoded as two bytes in UTF-8) is converted to UTF-8 as follows
Code range (hex) | Scalar value (binary) | UTF-8 (binary/hex) | Notes |
---|---|---|---|
000000 - 00007F (128 codes) | 00000000 00000000 0zzzzzzz (seven z) | 0zzzzzzz (00-7F) (seven z) | ASCII range; the byte starts with 0 |
000080 - 0007FF (1920 codes) | 00000000 00000yyy yyzzzzzz (three y; two y; six z) | 110yyyyy (C0-DF) 10zzzzzz (80-BF) (five y; six z) | The first byte starts with 110; the following byte starts with 10 |
000800 - 00D7FF, 00E000 - 00FFFF (61440 codes) [Note 1] | 00000000 xxxxyyyy yyzzzzzz (four x; four y; two y; six z) | 1110xxxx (E0-EF) 10yyyyyy 10zzzzzz (four x; six y; six z) | The first byte starts with 1110; the following bytes start with 10 |
010000 - 10FFFF (1048576 codes) | 000wwwxx xxxxyyyy yyzzzzzz (three w; two x; four x; four y; two y; six z) | 11110www (F0-F7) 10xxxxxx 10yyyyyy 10zzzzzz (three w; six x; six y; six z) | The first byte starts with 11110; the following bytes start with 10 |
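- Note 1: Unicode assigns no characters in the range D800-DFFF. The Basic Multilingual Plane reserves this range for UTF-16 surrogates, which extend UTF-16 to the supplementary planes (two UTF-16 code units together represent one supplementary-plane character). An encoder can of course be made to produce values in this range, but in Unicode they do not represent any legal value.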
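For example, the Unicode code point of the Hebrew letter aleph (א) is U+05D0. It is converted to UTF-8 as follows:
- It falls in the range U+0080 to U+07FF; the table shows that this range uses two bytes, 110yyyyy 10zzzzzz.
- Hexadecimal 0x05D0 written in binary is 101-1101-0000.
- These 11 bits are placed, in order, into the "y" and "z" positions: 11010111 10010000.
- The result is two bytes, written in hexadecimal as 0xD7 0x90. This is the UTF-8 encoding of the character aleph (א).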
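The aleph steps above can be reproduced with a few bit operations. The sketch below is a minimal illustration, with `encode_two_byte` as a hypothetical helper name; it only handles the two-byte range U+0080 to U+07FF.

```python
def encode_two_byte(cp: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110yyyyy 10zzzzzz."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point is outside the two-byte range")
    y = cp >> 6      # upper five of the 11 bits
    z = cp & 0x3F    # lower six bits
    return bytes([0b11000000 | y, 0b10000000 | z])


aleph = encode_two_byte(0x05D0)
print(" ".join(f"{b:08b}" for b in aleph))  # 11010111 10010000
print(" ".join(f"{b:02X}" for b in aleph))  # D7 90
```

The same arithmetic explains the rules in Part 1: for code points 0x80 to 0xBF the five y bits are 00010, giving the lead byte 0xC2, and for 0xC0 to 0xFF they are 00011, giving 0xC3 with a second byte 0x40 lower than the code point.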
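So the first 128 characters (US-ASCII) need only one byte. The next 1920 characters need two bytes; these cover Latin letters with diacritics and the Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic alphabets. The remaining characters of the Basic Multilingual Plane use three bytes, and all other characters use four bytes.
The same scheme can handle a much larger number of characters. The original specification allowed sequences of up to 6 bytes, covering 31 bits (the original limit of the Universal Character Set). However, in November 2003 RFC 3629 redefined UTF-8 so that it may only use the range formally defined by Unicode, U+0000 to U+10FFFF. Under these rules, the following byte values can never appear in a legal UTF-8 sequence: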
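As a rough sketch, the one- to four-byte layouts from the table can be combined into a single encoder. The name `encode_utf8` is chosen only for this example; in practice Python's built-in `str.encode("utf-8")` does the same job, and it is used below only to cross-check the result.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value following the four table rows."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates D800-DFFF are not scalar values")
    if cp <= 0x7F:                       # 0zzzzzzz
        return bytes([cp])
    if cp <= 0x7FF:                      # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample per byte length, compared with the built-in codec.
for sample in (0x41, 0x05D0, 0x20AC, 0x1F600):
    assert encode_utf8(sample) == chr(sample).encode("utf-8")
```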
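Code (binary) | Code (hex) | Notes |
---|---|---|
1100000x | C0, C1 | Overlong encoding: lead byte of a two-byte sequence, but the code point is <= 127 |
1111111x | FE, FF | Unreachable: lead byte of a 7- or 8-byte sequence |
111110xx 1111110x | F8, F9, FA, FB, FC, FD | Restricted by RFC 3629: lead byte of a 5- or 6-byte sequence |
11110101 1111011x | F5, F6, F7 | Restricted by RFC 3629: lead byte for code points above 10FFFF |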
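A quick scan for these 13 byte values can be sketched as follows. `FORBIDDEN_BYTES` and `find_forbidden_bytes` are names invented for this example. Note that this is only a necessary condition: data free of these bytes may still be malformed UTF-8 for other reasons, so a full validator would have to check the sequence structure as well.

```python
# C0, C1 and F5..FF, collected from the table above.
FORBIDDEN_BYTES = frozenset([0xC0, 0xC1]) | frozenset(range(0xF5, 0x100))


def find_forbidden_bytes(data: bytes) -> list:
    """Return (offset, byte) pairs for bytes that never occur in legal UTF-8."""
    return [(i, b) for i, b in enumerate(data) if b in FORBIDDEN_BYTES]


# "ÿ" is 0xFF in ISO-8859-1 but 0xC3 0xBF in UTF-8.
print(find_forbidden_bytes("ÿ".encode("utf-8")))    # []
print(find_forbidden_bytes("ÿ".encode("latin-1")))  # [(0, 255)]
```

A hit from this check is one hint that a file containing bytes such as 0xFE or 0xFF is ISO-8859-1 (or some other single-byte encoding) rather than UTF-8.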
Part 3: The table:
Charset | ISO-8859-1 |
ISO-8859-1 | UTF-8 | Unicode | Character | Name |
---|---|---|---|---|
0xFF | 0xC3 0xBF | 0x00FF | ÿ | LATIN SMALL LETTER Y WITH DIAERESIS |
0xFE | 0xC3 0xBE | 0x00FE | þ | LATIN SMALL LETTER THORN (Icelandic) |
0xFD | 0xC3 0xBD | 0x00FD | ý | LATIN SMALL LETTER Y WITH ACUTE |
0xFC | 0xC3 0xBC | 0x00FC | ü | LATIN SMALL LETTER U WITH DIAERESIS |
0xFB | 0xC3 0xBB | 0x00FB | û | LATIN SMALL LETTER U WITH CIRCUMFLEX |
0xFA | 0xC3 0xBA | 0x00FA | ú | LATIN SMALL LETTER U WITH ACUTE |
0xF9 | 0xC3 0xB9 | 0x00F9 | ù | LATIN SMALL LETTER U WITH GRAVE |
0xF8 | 0xC3 0xB8 | 0x00F8 | ø | LATIN SMALL LETTER O WITH STROKE |
0xF7 | 0xC3 0xB7 | 0x00F7 | ÷ | DIVISION SIGN |
0xF6 | 0xC3 0xB6 | 0x00F6 | ö | LATIN SMALL LETTER O WITH DIAERESIS |
0xF5 | 0xC3 0xB5 | 0x00F5 | õ | LATIN SMALL LETTER O WITH TILDE |
0xF4 | 0xC3 0xB4 | 0x00F4 | ô | LATIN SMALL LETTER O WITH CIRCUMFLEX |
0xF3 | 0xC3 0xB3 | 0x00F3 | ó | LATIN SMALL LETTER O WITH ACUTE |
0xF2 | 0xC3 0xB2 | 0x00F2 | ò | LATIN SMALL LETTER O WITH GRAVE |
0xF1 | 0xC3 0xB1 | 0x00F1 | ñ | LATIN SMALL LETTER N WITH TILDE |
0xF0 | 0xC3 0xB0 | 0x00F0 | ð | LATIN SMALL LETTER ETH (Icelandic) |
0xEF | 0xC3 0xAF | 0x00EF | ï | LATIN SMALL LETTER I WITH DIAERESIS |
0xEE | 0xC3 0xAE | 0x00EE | î | LATIN SMALL LETTER I WITH CIRCUMFLEX |
0xED | 0xC3 0xAD | 0x00ED | í | LATIN SMALL LETTER I WITH ACUTE |
0xEC | 0xC3 0xAC | 0x00EC | ì | LATIN SMALL LETTER I WITH GRAVE |
0xEB | 0xC3 0xAB | 0x00EB | ë | LATIN SMALL LETTER E WITH DIAERESIS |
0xEA | 0xC3 0xAA | 0x00EA | ê | LATIN SMALL LETTER E WITH CIRCUMFLEX |
0xE9 | 0xC3 0xA9 | 0x00E9 | é | LATIN SMALL LETTER E WITH ACUTE |
0xE8 | 0xC3 0xA8 | 0x00E8 | è | LATIN SMALL LETTER E WITH GRAVE |
0xE7 | 0xC3 0xA7 | 0x00E7 | ç | LATIN SMALL LETTER C WITH CEDILLA |
0xE6 | 0xC3 0xA6 | 0x00E6 | æ | LATIN SMALL LETTER AE |
0xE5 | 0xC3 0xA5 | 0x00E5 | å | LATIN SMALL LETTER A WITH RING ABOVE |
0xE4 | 0xC3 0xA4 | 0x00E4 | ä | LATIN SMALL LETTER A WITH DIAERESIS |
0xE3 | 0xC3 0xA3 | 0x00E3 | ã | LATIN SMALL LETTER A WITH TILDE |
0xE2 | 0xC3 0xA2 | 0x00E2 | â | LATIN SMALL LETTER A WITH CIRCUMFLEX |
0xE1 | 0xC3 0xA1 | 0x00E1 | á | LATIN SMALL LETTER A WITH ACUTE |
0xE0 | 0xC3 0xA0 | 0x00E0 | à | LATIN SMALL LETTER A WITH GRAVE |
0xDF | 0xC3 0x9F | 0x00DF | ß | LATIN SMALL LETTER SHARP S (German) |
0xDE | 0xC3 0x9E | 0x00DE | Þ | LATIN CAPITAL LETTER THORN (Icelandic) |
0xDD | 0xC3 0x9D | 0x00DD | Ý | LATIN CAPITAL LETTER Y WITH ACUTE |
0xDC | 0xC3 0x9C | 0x00DC | Ü | LATIN CAPITAL LETTER U WITH DIAERESIS |
0xDB | 0xC3 0x9B | 0x00DB | Û | LATIN CAPITAL LETTER U WITH CIRCUMFLEX |
0xDA | 0xC3 0x9A | 0x00DA | Ú | LATIN CAPITAL LETTER U WITH ACUTE |
0xD9 | 0xC3 0x99 | 0x00D9 | Ù | LATIN CAPITAL LETTER U WITH GRAVE |
0xD8 | 0xC3 0x98 | 0x00D8 | Ø | LATIN CAPITAL LETTER O WITH STROKE |
0xD7 | 0xC3 0x97 | 0x00D7 | × | MULTIPLICATION SIGN |
0xD6 | 0xC3 0x96 | 0x00D6 | Ö | LATIN CAPITAL LETTER O WITH DIAERESIS |
0xD5 | 0xC3 0x95 | 0x00D5 | Õ | LATIN CAPITAL LETTER O WITH TILDE |
0xD4 | 0xC3 0x94 | 0x00D4 | Ô | LATIN CAPITAL LETTER O WITH CIRCUMFLEX |
0xD3 | 0xC3 0x93 | 0x00D3 | Ó | LATIN CAPITAL LETTER O WITH ACUTE |
0xD2 | 0xC3 0x92 | 0x00D2 | Ò | LATIN CAPITAL LETTER O WITH GRAVE |
0xD1 | 0xC3 0x91 | 0x00D1 | Ñ | LATIN CAPITAL LETTER N WITH TILDE |
0xD0 | 0xC3 0x90 | 0x00D0 | Ð | LATIN CAPITAL LETTER ETH (Icelandic) |
0xCF | 0xC3 0x8F | 0x00CF | Ï | LATIN CAPITAL LETTER I WITH DIAERESIS |
0xCE | 0xC3 0x8E | 0x00CE | Î | LATIN CAPITAL LETTER I WITH CIRCUMFLEX |
0xCD | 0xC3 0x8D | 0x00CD | Í | LATIN CAPITAL LETTER I WITH ACUTE |
0xCC | 0xC3 0x8C | 0x00CC | Ì | LATIN CAPITAL LETTER I WITH GRAVE |
0xCB | 0xC3 0x8B | 0x00CB | Ë | LATIN CAPITAL LETTER E WITH DIAERESIS |
0xCA | 0xC3 0x8A | 0x00CA | Ê | LATIN CAPITAL LETTER E WITH CIRCUMFLEX |
0xC9 | 0xC3 0x89 | 0x00C9 | É | LATIN CAPITAL LETTER E WITH ACUTE |
0xC8 | 0xC3 0x88 | 0x00C8 | È | LATIN CAPITAL LETTER E WITH GRAVE |
0xC7 | 0xC3 0x87 | 0x00C7 | Ç | LATIN CAPITAL LETTER C WITH CEDILLA |
0xC6 | 0xC3 0x86 | 0x00C6 | Æ | LATIN CAPITAL LETTER AE |
0xC5 | 0xC3 0x85 | 0x00C5 | Å | LATIN CAPITAL LETTER A WITH RING ABOVE |
0xC4 | 0xC3 0x84 | 0x00C4 | Ä | LATIN CAPITAL LETTER A WITH DIAERESIS |
0xC3 | 0xC3 0x83 | 0x00C3 | Ã | LATIN CAPITAL LETTER A WITH TILDE |
0xC2 | 0xC3 0x82 | 0x00C2 | Â | LATIN CAPITAL LETTER A WITH CIRCUMFLEX |
0xC1 | 0xC3 0x81 | 0x00C1 | Á | LATIN CAPITAL LETTER A WITH ACUTE |
0xC0 | 0xC3 0x80 | 0x00C0 | À | LATIN CAPITAL LETTER A WITH GRAVE |
0xBF | 0xC2 0xBF | 0x00BF | ¿ | INVERTED QUESTION MARK |
0xBE | 0xC2 0xBE | 0x00BE | ¾ | VULGAR FRACTION THREE QUARTERS |
0xBD | 0xC2 0xBD | 0x00BD | ½ | VULGAR FRACTION ONE HALF |
0xBC | 0xC2 0xBC | 0x00BC | ¼ | VULGAR FRACTION ONE QUARTER |
0xBB | 0xC2 0xBB | 0x00BB | » | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK |
0xBA | 0xC2 0xBA | 0x00BA | º | MASCULINE ORDINAL INDICATOR |
0xB9 | 0xC2 0xB9 | 0x00B9 | ¹ | SUPERSCRIPT ONE |
0xB8 | 0xC2 0xB8 | 0x00B8 | ¸ | CEDILLA |
0xB7 | 0xC2 0xB7 | 0x00B7 |