Translated Repost | A History of Computer Character Encodings (ANSI, GBK, GB2312, GB18030, UNICODE, UTF-8)

Contents

1 Main Text

2 English Original

3 Original Source


1 Main Text

These encoding keywords (ANSI, GBK, GB2312, GB18030, UNICODE, UTF-8) come up frequently. Although I list them together, that does not mean they are things of the same kind. This section is quoted from the Internet with slight modifications; the original source is unknown, so it cannot be credited.

A long, long time ago, a group of people decided to use 8 transistors that could be switched on and off, combined into different states, to represent everything in the world. They called this a "byte". Later they built machines that could process these bytes; once switched on, the machines could use bytes to form many states, and the states kept changing. They called these machines "computers".

At first, computers were used only in the United States. An 8-bit byte can form a total of 256 (2 to the 8th power) different states. The 32 states numbered from 0 upward were set aside for special purposes: whenever a terminal or printer received one of these pre-agreed bytes, it performed an agreed-upon action. For example, on 0x0A a terminal moves to a new line, on 0x07 it beeps at the user, and on 0x1B a printer prints inverted text or a terminal switches to colored letters. They saw that this was good, so they called these byte states below 0x20 "control codes".

They then assigned the space, punctuation marks, digits, and uppercase and lowercase letters to consecutive byte states, numbering them up to 127, so that computers could store English text with different bytes. Everyone thought this was good, so the scheme became known as the ANSI "ASCII" encoding (American Standard Code for Information Interchange). At that time, all the computers in the world used the same ASCII scheme to store English text.

Later, computers spread more and more widely. To store their own scripts on the computer, countries around the world decided to use the byte states left unencoded after 127 to represent their new letters and symbols, and they also added many shapes needed for drawing tables, such as horizontal lines, vertical lines, and crosses, numbering them all the way up to the last state, 255. The characters from 128 to 255 became known as the "extended character set". After that, this numbering scheme could no longer accommodate any more symbols.

When the Chinese got computers, there were no byte states left that could represent Chinese characters, yet more than 6,000 commonly used characters had to be stored. So the Chinese did their own development: they simply dropped the odd symbols after 127 and made a rule that a character smaller than 127 keeps its original meaning, but two bytes that are both greater than 127 and appear together represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which allows more than 7,000 simplified Chinese characters to be combined. These codes also include mathematical symbols, Roman and Greek letters, and Japanese kana, and even the digits, punctuation, and letters that already existed in ASCII were re-encoded as two-byte codes. These are the so-called "full-width" characters, while those below 127 are called "half-width" characters.

The Chinese thought this was good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.

But China has far too many Chinese characters, and the scheme soon ran out of room, so the low byte was no longer required to be an extended code after 127: as long as the first byte is greater than 127, it always marks the start of a Chinese character, and the byte that follows belongs to the extended character set. This expanded scheme is called the GBK standard. GBK contains everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols. Later, ethnic minorities also needed to use computers, so thousands of additional minority-script characters were added and GBK was extended into GB18030. From then on, the culture of the Chinese nation could be carried forward in the computer age.
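To make the byte rule concrete, here is a minimal sketch (Python, added for this repost, not part of the original article) that walks a GBK byte stream: a byte below 128 is an ordinary half-width ASCII character, while a byte above 127 is taken together with the byte that follows it as one Chinese character.

```python
# -*- coding: utf-8 -*-
# Minimal sketch: split a GBK byte stream into half-width ASCII bytes
# and two-byte Chinese characters, following the rule described above.

def split_gbk(data: bytes):
    units = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:            # below 128: ordinary ASCII / half-width
            units.append(data[i:i + 1])
            i += 1
        else:                         # first byte > 127: start of a two-byte character
            units.append(data[i:i + 2])
            i += 2
    return units

raw = "GB2312汉字".encode("gbk")      # b'GB2312\xba\xba\xd7\xd6'
for unit in split_gbk(raw):
    print(unit.hex(), "->", unit.decode("gbk"))
```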

Because every country devised its own encoding standard the way China did, no one understood anyone else's encoding and no one supported anyone else's either. At the time, Chinese users who wanted to see Chinese characters on their computers had to install a "Chinese character system" to handle the display and input of Chinese characters, and installing the wrong character system turned the display into garbage. What could be done? At this point an international body called ISO (the International Organization for Standardization) decided to tackle the problem. Their approach was simple: scrap all the regional encoding schemes and build a new code covering every culture, every letter, and every symbol on Earth! They planned to call it the "Universal Multiple-Octet Coded Character Set", UCS for short, also known as "UNICODE".

When UNICODE was first drawn up, computer memory had grown enormously and space was no longer an issue, so ISO simply stipulated that every character must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters of ASCII, UNICODE keeps the original code values unchanged and merely widens them from 8 bits to 16, while the characters of all other scripts and languages are re-encoded from scratch. Since a "half-width" English character needs only the low 8 bits and the high 8 bits are always zero, this generous scheme wastes twice the space when storing English text.

However, UNICODE was not designed to stay compatible with any existing encoding scheme, so GBK and UNICODE lay out the internal codes of Chinese characters completely differently. There is no simple arithmetic that converts text from UNICODE to another encoding or back; the conversion has to be done by table lookup. With two bytes per character, UNICODE can combine 65,536 different characters, enough to cover the scripts and symbols of the whole world.

UNICODE arrived together with the rise of computer networks, and how to transmit UNICODE over a network became a question that had to be answered. This led to a number of transmission-oriented UTF (UCS Transformation Format) standards: as the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits it 16 bits at a time. For the sake of reliable transmission, the UTF forms are not a direct byte-for-byte copy of the UNICODE values; some algorithmic rules are needed to convert between them.
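As a quick illustration of the last three paragraphs (a Python sketch added for this repost, not from the original text): the same character has unrelated code values in GBK and UNICODE, UTF-16 stores the 16-bit value directly, UTF-8 derives its bytes by rule, and a plain English letter in 16-bit UNICODE carries an always-zero high byte.

```python
ch = "汉"                              # Unicode code point U+6C49
print(hex(ord(ch)))                    # 0x6c49
print(ch.encode("gbk").hex())          # baba   - GBK code; no arithmetic relation to 6c49
print(ch.encode("utf-16-be").hex())    # 6c49   - UTF-16 stores the 16-bit value as-is
print(ch.encode("utf-8").hex())        # e6b189 - UTF-8 bytes derived by rule (see below)

# The "wasted space" for English in 16-bit UNICODE: the high byte is always 00.
print("A".encode("utf-16-be").hex())   # 0041
```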

After reading this, I believe you have a clearer picture of how these encodings relate. A brief summary:

  ● By extending and reworking ASCII for Chinese, the Chinese produced the GB2312 code, which can represent more than 6,000 commonly used Chinese characters.
  ● There are far more Chinese characters than that, including traditional characters and various symbols, so the GBK encoding was created; it contains all of GB2312 and adds many extensions.
  ● China is a multi-ethnic country and almost every ethnic group has its own writing system, so to represent those characters GBK was extended further into the GB18030 code.
  ● Like China, every country encoded its own language, so all kinds of encodings appeared, and without the matching encoding installed you could not interpret what text in that encoding was meant to say.
  ● Finally, an organization called ISO could not stand it any longer. Together they created one code, UNICODE, large enough to hold every script and symbol in the world. As long as a computer has a UNICODE encoding system, you only need to save the file and any other computer can interpret it as UNICODE, whatever script it contains.
  ● For transmitting UNICODE over the network, two standards appeared, UTF-8 and UTF-16, which transmit in units of 8 and 16 bits respectively.

So people may ask: since UTF-8 can store so many characters and symbols, why do so many people in China still use GBK and similar encodings? Because UTF-8 text is larger and takes more space: a Chinese character costs three bytes in UTF-8 where GBK needs two, so if most of your target users are Chinese, GBK-style encodings are also usable. But on today's machines disk space is dirt cheap and the overhead is negligible, so the recommendation is that all web pages use one uniform encoding: UTF-8.
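A small check of the size argument (a Python sketch added for this repost): each Chinese character costs two bytes in GBK but three in UTF-8.

```python
text = "汉字编码" * 100                 # 400 Chinese characters
print(len(text.encode("gbk")))         # 800  - 2 bytes per character
print(len(text.encode("utf-8")))       # 1200 - 3 bytes per character
```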

On the problem that Notepad cannot correctly save the word "联通" (Unicom) on its own

Create a new text document, type the word "联通" in it, and save. When you open it again, the original "联通" has turned into two garbled characters.

This problem is caused by a collision between the GB2312 encoding and the UTF-8 encoding. Here is an excerpt of the UNICODE-to-UTF-8 conversion rules, taken from the Internet:

Unicode range (hex)      UTF-8 byte template

0000 – 007F              0xxxxxxx
0080 – 07FF              110xxxxx 10xxxxxx
0800 – FFFF              1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of the character "汉" is 6C49. 6C49 falls between 0800 and FFFF, so the 3-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 1100 0100 1001; splitting this bit stream according to the three-byte template into 0110 110001 001001 and substituting the pieces for the x's in turn yields 1110-0110 10-110001 10-001001, that is, E6 B1 89, which is its UTF-8 encoding.
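The same computation can be checked mechanically. This is a Python sketch added for this repost; it encodes U+6C49 by hand using the 3-byte template and compares the result with the built-in encoder.

```python
cp = 0x6C49                            # Unicode code point of "汉"
b1 = 0xE0 | (cp >> 12)                 # 1110xxxx
b2 = 0x80 | ((cp >> 6) & 0x3F)         # 10xxxxxx
b3 = 0x80 | (cp & 0x3F)                # 10xxxxxx
print(bytes([b1, b2, b3]).hex())       # e6b189
print("汉".encode("utf-8").hex())      # e6b189 - matches the built-in encoder
```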

When you create a new text file, Notepad's default encoding is ANSI, and if you type Chinese characters under ANSI you are actually using the GB-series encodings. Under this encoding, the internal code of "联通" is:

c1 1100 0001

aa 1010 1010

cd 1100 1101

a8 1010 1000

Notice anything? The first two bytes, and the third and fourth bytes, begin with "110" and "10" respectively, exactly matching the two-byte template in the UTF-8 rules. So when you open the file again, Notepad mistakes it for a UTF-8 encoded file. Strip the 110 from the first byte and the 10 from the second and we get "00001 101010"; aligning the bits and adding leading zeros gives "0000 0000 0110 1010", which unfortunately is UNICODE 006A, the lowercase letter "j". The next two bytes decoded as UTF-8 give 0368, which does not map to any meaningful character. That is why a file containing only the word "联通" cannot be displayed correctly in Notepad.
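The byte patterns can be reproduced outside Notepad with a short Python sketch (added for this repost). Note that a strict modern UTF-8 decoder actually rejects 0xC1 as an over-long lead byte; the point here is only that the bit patterns match the 110xxxxx/10xxxxxx template that Notepad's old detection heuristic keyed on.

```python
raw = "联通".encode("gbk")
print(raw.hex())                       # c1aacda8
for b in raw:
    print(f"{b:02x} {b:08b}")          # lead bytes start with 110, trailing bytes with 10
print(raw.decode("utf-8", errors="replace"))   # not valid strict UTF-8; decodes to garbage
```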

Many other problems radiate out from this one. A common question is: I saved the file in encoding XX, so why does it open as encoding YY every time? The reason is that although you saved it as XX, when the system tries to identify the file it mis-detects it as YY and therefore displays it as YY. To avoid this problem, Microsoft came up with something called the BOM header.

On the BOM header of a file.

When software such as the Notepad that ships with WINDOWS saves a UTF-8 encoded file, it inserts three invisible characters (0xEF 0xBB 0xBF, the BOM) at the beginning of the file. This hidden string lets editors such as Notepad recognize that the file is encoded in UTF-8, which avoids the mis-detection problem above. For ordinary files this causes no trouble.

But it has drawbacks, especially for web pages. PHP does not ignore the BOM, so when such a file is read, included, or referenced, PHP treats the BOM as part of the text at the start of the file, and by the nature of an embedded language those characters are output directly. As a result, even if the page's top padding is set to 0, the page cannot sit flush against the top of the browser, because those 3 characters sit at the start of the HTML. If you find unexplained blank space on a web page, the file very likely has a BOM header. When you hit this kind of problem, save the file without the BOM header!
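A small demonstration of the BOM (a Python sketch added for this repost; the file name bom.txt is arbitrary): the "utf-8-sig" codec writes the three BOM bytes, and a reader that is not BOM-aware sees them as text.

```python
with open("bom.txt", "w", encoding="utf-8-sig") as f:   # utf-8-sig prepends EF BB BF
    f.write("汉")

data = open("bom.txt", "rb").read()
print(data.hex())                      # efbbbfe6b189 - BOM followed by the character
print(repr(data.decode("utf-8")))      # '\ufeff汉' - plain utf-8 keeps the BOM as text
print(repr(data.decode("utf-8-sig")))  # '汉'       - utf-8-sig strips it
```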

An IE6 bug when loading CSS files: when the encoding of the HTML file differs from that of the CSS file it loads, IE6 fails to read the CSS file, so the HTML page appears unstyled. As far as I have observed, this problem only ever appears in IE6 and never in other browsers. The fix is simply to save the CSS file in the same encoding as the HTML file.

2 English Original

        These coding keywords are relatively common. Although I put them together, it does not mean that these things are equivalent. The content of this part is quoted from the Internet with slight modification. The source of the original text is unknown, so it cannot be credited.
        A long, long time ago, a group of people decided to use 8 transistors that can be turned on and off to combine them into different states to represent everything in the world. They called this "byte". Later, they built some machines that could process these bytes. When the machines were started, they could use bytes to compose many states. The states began to change. They called this machine a "computer."
        Initially, computers were only used in the United States. A total of 256 (2 to the 8th power) different states can be represented by an eight-bit byte. They set aside the 32 states numbered from 0 upward for special purposes: once the agreed bytes were passed to the terminal or printer, agreed actions were carried out. When it encounters 0x0A, the terminal wraps to a new line; when it encounters 0x07, the terminal beeps at the user; and when it encounters 0x1B, the printer prints inverted text or the terminal displays letters in color. They saw that this was good, so they called these byte states below 0x20 "control codes".
They also represented the spaces, punctuation marks, numbers, and uppercase and lowercase letters with consecutive byte states, numbering them up to No. 127, so that the computer could use different bytes to store English text. Everyone felt good when they saw this, so everyone called this scheme the ANSI "ASCII" code (American Standard Code for Information Interchange). At that time, all computers in the world used the same ASCII scheme to save English text.
        Later, the use of computers became more and more widespread. In order to save their own text on the computer, countries around the world decided to use the byte states after 127 to represent these new letters and symbols, and also added many horizontal lines, vertical lines, crosses and other shapes needed when drawing tables, numbering them all the way to the last state, 255. The character set from 128 to 255 is called the "extended character set". However, the original numbering method could no longer accommodate more codes.
        When Chinese people got computers, there were no byte states left that could be used to represent Chinese characters, and more than 6,000 commonly used Chinese characters needed to be saved. So the Chinese people did their own development and simply cancelled the strange symbols after 127. The rule: the meaning of a character smaller than 127 is the same as before, but when two characters larger than 127 appear together, they represent one Chinese character. The first byte (called the high byte) is used from 0xA1 to 0xF7, and the second byte (the low byte) from 0xA1 to 0xFE, so more than 7,000 simplified Chinese characters can be combined. In these codes, mathematical symbols, Roman and Greek letters, and Japanese kana were also included. Even the numbers, punctuation, and letters that were originally in ASCII were all re-encoded into two-byte codes. These are the so-called "full-width" characters, and those below 127 are called "half-width" characters.
        The Chinese people thought this was very good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
        However, there are too many Chinese characters in China, and later the codes were not enough, so the low byte was no longer required to be an extended code after 127. As long as the first byte is greater than 127, it always indicates that this is the beginning of a Chinese character, and what follows is content of the extended character set. As a result, the expanded coding scheme is called the GBK standard. GBK includes all the contents of GB2312, and at the same time nearly 20,000 new Chinese characters (including traditional characters) and symbols were added. Later, ethnic minorities also needed to use computers, so thousands of new minority-script characters were added and GBK was expanded into GB18030. From then on, the culture of the Chinese nation could be passed on in the computer age.
        Because at that time, all countries developed their own coding standards like China, and as a result, no one knew each other's coding, and no one supported others' coding. At that time, Chinese people wanted to display Chinese characters on their computers, so they had to install a "Chinese character system" to deal with the display and input of Chinese characters. If the wrong character system was installed, the display would be messed up. What to do? At this moment, an international organization called ISO (International Organization for Standardization) decided to tackle this problem. The method they adopted is very simple: abolish all regional coding schemes, and rebuild a code that includes all cultures, all letters and symbols on the earth! They plan to call it "Universal Multiple-Octet Coded Character Set", or UCS for short, or "UNICODE".
When UNICODE was first formulated, the memory capacity of computers had grown greatly, and space was no longer a problem. So ISO directly stipulated that two bytes, that is, 16 bits, must be used to represent all characters uniformly. For those "half-width" characters in ASCII, UNICODE keeps their original encoding unchanged and only expands their length from the original 8 bits to 16 bits, while the characters of other cultures and languages are all re-encoded. Since the "half-width" English symbols only need the lower 8 bits and the upper 8 bits are always 0, this generous scheme wastes twice as much space when saving English text.
However, UNICODE did not consider maintaining compatibility with any existing encoding scheme when it was formulated. This makes GBK and UNICODE completely different in the internal code layout of Chinese characters. There is no simple arithmetic method to convert text between UNICODE and another encoding; the conversion must be performed by looking up tables. UNICODE represents one character with two bytes, so it can combine 65,536 different characters in total, enough to cover all the cultural symbols in the world.
When UNICODE arrived, it came together with the rise of computer networks. How UNICODE is transmitted over the network is also a problem that must be considered, so many transmission-oriented UTF (UCS Transformation Format) standards appeared. As the name suggests, UTF-8 transmits data 8 bits at a time, and UTF-16 transmits 16 bits at a time. For the reliability of transmission, there is no direct byte-for-byte correspondence from UNICODE to UTF; some algorithms and rules are required for the conversion.
        After reading these, I believe you have a clearer understanding of these coding relationships. Let me briefly summarize:

  ● The Chinese people produced the GB2312 code through the expansion and transformation of ASCII for Chinese, which can represent more than 6,000 commonly used Chinese characters.
  ● There are too many Chinese characters, including traditional characters and various symbols, so the GBK encoding was produced, which includes the encoding in GB2312 and at the same time adds many extensions.
  ● China is a multi-ethnic country, and almost every ethnic group has its own independent writing system. In order to express those characters, the GBK code was further expanded into the GB18030 code.
  ● Like China, every country encoded its own language, so a variety of encodings appeared. If you do not install the corresponding encoding, you cannot interpret what text in that encoding is meant to express.
  ● Finally, an organization called ISO could not stand it anymore. Together, they created one code, UNICODE, which is very large, big enough to hold any script and symbol in the world. Therefore, as long as there is a UNICODE encoding system on the computer, no matter what kind of text it is, you only need to save the file and other computers can interpret it as UNICODE encoding.
  ● For UNICODE in network transmission, two standards appeared, UTF-8 and UTF-16, transmitting in units of 8 and 16 bits respectively.

        So people may have a question: since UTF-8 can store so many characters and symbols, why do so many people in China still use GBK and other encodings? Because encodings such as UTF-8 produce larger files and occupy more space; if the majority of the target users are Chinese, encodings such as GBK can also be used. However, on today's computers hard disks are dirt cheap, and the performance of the computer makes this overhead negligible. Therefore, it is recommended that all web pages use one uniform encoding: UTF-8.
        Regarding the problem that Notepad cannot correctly save the word "联通" (Unicom) on its own
        After you create a new text document, type the word "联通" in it and save it. When you open it again, the original "联通" will have become two garbled characters.
        This problem is caused by the encoding collision between GB2312 encoding and UTF8 encoding. A section of conversion rules from UNICODE to UTF8 is drawn from the Internet:
 Unicode range (hex)      UTF-8 byte template
 0000 – 007F              0xxxxxxx
 0080 – 07FF              110xxxxx 10xxxxxx
 0800 – FFFF              1110xxxx 10xxxxxx 10xxxxxx
        For example, the Unicode code of the character "汉" is 6C49. 6C49 is between 0800 and FFFF, so the 3-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Write 6C49 in binary as 0110 1100 0100 1001, divide this bit stream into 0110 110001 001001 according to the three-byte template, and substitute the pieces for the x in the template in turn to get 1110-0110 10-110001 10-001001, that is, E6 B1 89. This is its UTF-8 encoding.
        When you create a new text file, the default encoding of Notepad is ANSI. If you enter Chinese characters under ANSI, they are actually stored in the GB-series encoding. Under this encoding, the internal code of "联通" is:
c1 1100 0001
aa 1010 1010
cd 1100 1101
a8 1010 1000
        Have you noticed? The first two bytes and the third and fourth bytes begin with "110" and "10" respectively, which exactly matches the two-byte template in the UTF-8 rules, so when you open the file again, Notepad mistakenly assumes that this is a UTF-8 encoded file. Remove the 110 of the first byte and the 10 of the second byte, and we get "00001 101010". Then align the bits and add leading zeros to get "0000 0000 0110 1010"; unfortunately, this is UNICODE 006A, the lowercase letter "j", and the next two bytes decoded as UTF-8 give 0368, which is not a meaningful character. This is the reason why a file containing only the word "联通" cannot be displayed normally in Notepad.
        Many problems radiate out from this one. A more common question is: I saved the file in encoding XX, so why is it still in encoding YY every time I open it? The reason is that although you saved it as XX, when the system recognized the file it mis-detected it as YY, so it is still displayed as YY. In order to avoid this problem, Microsoft came up with something called the BOM header.
        Regarding the issue of the BOM header of the file.
        When software such as the Notepad that comes with WINDOWS saves a UTF-8 encoded file, three invisible characters (0xEF 0xBB 0xBF, namely the BOM) are inserted at the beginning of the file. It is a string of hidden characters used to let editors such as Notepad recognize that the file is encoded in UTF-8. This way the mis-detection problem can be avoided. For general files, this does not cause any trouble.
        Doing so has disadvantages, especially in web pages. PHP does not ignore the BOM, so when reading, including or referencing these files, it will use the BOM as part of the text at the beginning of the file. According to the characteristics of the embedded language, this string of characters will be directly executed (displayed). As a result, even if the top padding of the page is set to 0, the entire web page cannot be close to the top of the browser because there are these 3 characters at the beginning of the html. If you find unknown blanks in the webpage, it is likely that the file has a BOM header. When you encounter this kind of problem, do not include the BOM header when saving the file!

        IE6 bug when loading CSS files: when the encoding of the HTML file is inconsistent with that of the CSS file it loads, IE6 cannot read the CSS file, that is, the HTML page has no style. As far as I have observed, this problem has never appeared in other browsers, only in IE6. Just save the CSS file in the same encoding as the HTML file.

3 Original Source

The history of computer coding (ANSI, GBK, GB2312, GB18030, UNICODE, UTF-8)
