The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line “??? ??? ??? ???”?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing: IT’S NOT THAT HARD.

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.
[Image: ASCII table]
Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
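If you want to poke at these numbers yourself, a couple of lines of Python will do it (Python isn't used anywhere in this article; it's just a convenient calculator here):

print(ord(' '))        # 32
print(ord('A'))        # 65
print((127).bit_length())    # 7 -- the whole ASCII range fits in 7 bits
print(repr(chr(7)))          # '\x07', the BEL control character that beeps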

Translator's note (斐夷所非)

“dim bulbs”: slow-witted, not especially bright people. “Dim” suggests low intelligence and “bulb” stands for a person's head, so calling the WordStar developers “the dim bulbs” is a humorous, sarcastic jab at them.

And all was good, assuming you were an English speaker.
Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners’. In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.
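You can still reproduce the résumé mishap today, because Python ships codecs for these old code pages (cp437 is the original IBM PC OEM set, cp862 the Hebrew DOS one; this is just an illustration):

b = bytes([130])                   # the byte 0x82
print(b.decode('cp437'))           # é  -- Western OEM code page
print(b.decode('cp862'))           # ג  -- Hebrew DOS code page: the letter Gimel
print('résumés'.encode('cp437').decode('cp862'))   # rגsumגs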

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few “multilingual” code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the “double byte character set” in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows’ AnsiNext and AnsiPrev which knew how to deal with the whole mess.
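Shift-JIS is a typical DBCS and also ships as a Python codec, so the problem is easy to demonstrate (AnsiNext and AnsiPrev themselves are Win32 calls and aren't shown here):

s = 'aあb'                          # 'あ' is a Japanese hiragana character
data = s.encode('shift_jis')
print(len(s), len(data))            # 3 characters, but 4 bytes
print(data)                         # b'a\x82\xa0b' -- 'あ' took two bytes
# Landing in the middle of b'\x82\xa0' you cannot tell, from that byte alone,
# whether you are on a lead byte, a trail byte, or a plain one-byte character.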

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:
A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:
A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.
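In Python terms (purely as an illustration), ord() hands you the code point as a plain number:

print(hex(ord('A')))    # 0x41  -> U+0041, LATIN CAPITAL LETTER A
print(hex(ord('ع')))    # 0x639 -> U+0639, ARABIC LETTER AIN
print('\u0639')         # ع -- the same code point written as an escape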

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.


OK, so say we have a string:
Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.
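You can ask for exactly those numbers (again just a Python illustration; note that nothing has been encoded yet, these are the abstract code points):

print(['U+%04X' % ord(c) for c in 'Hello'])
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']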

Encodings

That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
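Here are both byte orders, and both byte order marks, spelled out with Python's codecs module (illustration only):

import codecs

print('Hello'.encode('utf-16-be'))   # b'\x00H\x00e\x00l\x00l\x00o' -- 00 48 00 65 ...
print('Hello'.encode('utf-16-le'))   # b'H\x00e\x00l\x00l\x00o\x00' -- 48 00 65 00 ...
print(codecs.BOM_UTF16_BE)           # b'\xfe\xff' -- the FE FF marker
print(codecs.BOM_UTF16_LE)           # b'\xff\xfe' -- FF FE: the bytes were swapped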

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.
[Image: How UTF-8 works]
This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
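A few encodings make the point (Python illustration; how many bytes a character needs depends only on how big its code point is):

print('Hello'.encode('utf-8'))   # b'Hello' -- 48 65 6C 6C 6F, byte-for-byte the same as ASCII
print('é'.encode('utf-8'))       # b'\xc3\xa9'      -- U+00E9 needs 2 bytes
print('ع'.encode('utf-8'))       # b'\xd8\xb9'      -- U+0639 needs 2 bytes
print('你'.encode('utf-8'))      # b'\xe4\xbd\xa0'  -- U+4F60 needs 3 bytes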

So far I’ve told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it’s high-endian UCS-2 or low-endian UCS-2. And there’s the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.

And in fact now that you’re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
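Python's errors='replace' option shows exactly that question-mark behaviour (a sketch; by default Python raises an error rather than silently substituting):

print('Héllo'.encode('ascii', errors='replace'))     # b'H?llo'
print('Привет'.encode('latin-1', errors='replace'))  # b'??????'
print('Привет'.encode('utf-8'))                      # b'\xd0\x9f...' -- every code point survives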

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.
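Here is the symptom reproduced in a couple of lines (illustration only); the bytes are fine, it is the assumed encoding that is wrong:

data = 'é'.encode('utf-8')      # the bytes c3 a9
print(data.decode('utf-8'))     # é   -- decoded with the right assumption
print(data.decode('cp1252'))    # Ã©  -- same bytes, wrong assumption: classic mojibake
print('Привет'.encode('cp1251').decode('cp1252'))   # Ïðèâåò -- Cyrillic read as Western European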

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form:

Content-Type: text/plain; charset="UTF-8"


For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.
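A minimal sketch of that idea using Python's standard http.server module (a hypothetical handler, not anything from the article; Python is used here only because it is compact):

from http.server import BaseHTTPRequestHandler, HTTPServer

class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = '<html><body><p>Здравей, свят!</p></body></html>'.encode('utf-8')
        self.send_response(200)
        # The encoding is declared in a response header, sent before the HTML itself.
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# HTTPServer(('localhost', 8000), PageHandler).serve_forever()  # uncomment to try it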

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don’t.
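That frequency-based guessing is essentially what charset-detection libraries do today; as a rough sketch (chardet is a third-party package, not something the article mentions, and its output is only a guess):

import chardet  # pip install chardet

data = 'Здравейте, колеги! Как сте днес?'.encode('cp1251')
print(chardet.detect(data))
# Something like {'encoding': 'windows-1251', 'confidence': 0.9, 'language': 'Bulgarian'}
# -- a statistical guess with a confidence score, not a substitute for a declared charset.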

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

This article is getting rather long, and I can’t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you’ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.

Translator's note (斐夷所非)

“antibiotics instead of leeches and spells”

The author compares knowing about character encodings and Unicode to antibiotics in modern medicine, and not knowing to the medieval practice of treating illness with leeches and spells.
Once you have learned about character encodings and Unicode, you can solve programming problems scientifically, as if with antibiotics, instead of resorting to unscientific methods (leeches and spells) like a medieval doctor.

