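1. Introduction to Encoding
1.1 Concepts in Brief: Character, Character Set, Coded Character Set, Code Point, Code Unit, and Character Encoding Scheme
First, we need to be clear about a few concepts: character, character set, coded character set, code point, code unit, and character encoding scheme.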
A character is the smallest abstract unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.
A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.
A coded character set is a character set in which each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the hexadecimal number 0041 and the symbol "€" the hexadecimal number 20AC. The Unicode standard always uses hexadecimal numbers and writes them with the prefix "U+", so the number for "A" is written as "U+0041".
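As a small illustration of the mapping from characters to numbers, the sketch below uses Python's built-in `ord` and `chr` to look up the Unicode number of a character and to format it in "U+" notation. The helper name `u_plus` is made up for this example and is not part of any standard.

```python
def u_plus(ch: str) -> str:
    """Format the Unicode number of a single character in U+XXXX notation."""
    # ord() returns the number assigned to the character;
    # 04X pads it to at least four uppercase hexadecimal digits.
    return f"U+{ord(ch):04X}"


print(u_plus("A"))   # U+0041
print(u_plus("€"))   # U+20AC
print(chr(0x20AC))   # the reverse direction: number to character, prints €
```

Characters whose number is above U+FFFF simply get more than four digits with this formatting, which matches how such numbers are usually written.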
Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these 1,114,112 code points.
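The difference between a valid code point and an assigned one can be checked with a short sketch. It assumes Python's bundled `unicodedata` module, whose data reflects a newer Unicode version than 4.0, so any counts derived from it will not match the 96,382 figure. The helper names `is_valid_code_point` and `is_assigned` are hypothetical, and "assigned" is taken here to mean "general category is not Cn", which is a simplification.

```python
import sys
import unicodedata

MAX_CODE_POINT = 0x10FFFF               # upper end of the Unicode code point range
TOTAL_CODE_POINTS = MAX_CODE_POINT + 1  # 1,114,112 code points in total


def is_valid_code_point(cp: int) -> bool:
    """True if cp lies in the range U+0000 to U+10FFFF."""
    return 0 <= cp <= MAX_CODE_POINT


def is_assigned(cp: int) -> bool:
    """True if this Python's Unicode database gives cp a category other than Cn."""
    return unicodedata.category(chr(cp)) != "Cn"


print(sys.maxunicode == MAX_CODE_POINT)  # Python uses the same upper bound
print(is_valid_code_point(0x110000))     # outside the valid range
print(is_assigned(0x0041))               # "A" is valid and assigned
print(is_assigned(0x0378))               # valid, but with no character assigned
```

Calling `chr` with a number outside the valid range, such as `chr(0x110000)`, raises `ValueError`, which is why `is_assigned` should only be called on valid code points.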