A discussion of NUL-terminated strings.
Did Ken, Dennis, and Brian choose wrong with NUL-terminated text strings?
IT both drives and implements the modern Western-style economy. Thus, we regularly see headlines about staggeringly large amounts of money connected with IT mistakes. Which IT or CS decision has resulted in the most expensive mistake?
Not long ago, a fair number of pundits were doing a lot of hand waving about the financial implications of Sony’s troubles with its PlayStation Network, but an event like that does not count here. In my school days, I talked with an inspector from The Guinness Book of World Records who explained that for something to be “a true record,” it could not be a mere accident; there had to be direct causation starting with human intent (i.e., we stuffed 26 high school students into our music teacher’s Volkswagen Beetle and closed the doors).
Sony (probably) did not intend to see how big a mess it could make with the least attention to security, so this and other such examples of false economy will not qualify. Another candidate could be IBM’s choice of Bill Gates over Gary Kildall to supply the operating system for its personal computer. The damage from this decision is still accumulating at breakneck speed, with StuxNet and the OOXML perversion of the ISO standardization process being exemplary bookends for how far and wide the damage spreads. But that was not really an IT or CS decision. It was a business decision that, as far as history has been able to uncover, centered on Kildall’s decision not to accept IBM’s nondisclosure demands.
A better example would be the decision for MS-DOS to invent its own directory/filename separator, using the backslash (\) rather than the forward slash (/) that Unix used or the period that DEC used in its operating systems. Apart from the actual damage being relatively modest, however, this does not qualify as a good example either, because it was not a real decision selecting a true preference. IBM had decided to use the slash for command flags, eliminating Unix as a precedent, and the period was used between filename and filename extension, making it impossible to follow DEC’s example.
Space exploration history offers a pool of well-publicized and expensive mistakes, but interestingly, I didn’t find any valid candidates there. Fortran syntax errors and space shuttle computer synchronization mistakes do not qualify for lack of intent. Running one part of a project in imperial units and the other in metric is a “random act of management” that has nothing to do with CS or IT.
The best candidate I have been able to come up with is the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end? This is a decision that the dynamic trio of Ken Thompson, Dennis Ritchie, and Brian Kernighan must have made one day in the early 1970s, and they had full freedom to choose either way. I have not found any record of the decision, which I admit is a weak point in its candidacy: I do not have proof that it was a conscious decision.
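To make the two options concrete, here is a minimal sketch in C. The `counted_str` type and the helper names are hypothetical, invented for illustration; nothing like them is part of the C language or of the original article. It only shows how the length is obtained under each representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type: address + length. Not part of C. */
struct counted_str {
    const char *addr;  /* first byte of the text */
    size_t      len;   /* number of bytes, stored next to the address */
};

/* Wrap a NUL-terminated C string; this is the one place we have to scan. */
static struct counted_str counted_from_cstr(const char *s)
{
    assert(s != NULL);
    struct counted_str cs = { s, strlen(s) };
    return cs;
}

/* Length of a counted string: a single field read. */
static size_t counted_len(struct counted_str cs)
{
    return cs.len;
}

/* Length of a NUL-terminated string: walk until the magic marker. */
static size_t terminated_len(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

With the counted form the length travels with the string; with the terminated form every consumer that needs the length has to rediscover it by scanning.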
As far as I can determine from my research, however, the address + length format was preferred by the majority of programming languages at the time, whereas the address + magic_marker format was used mostly in assembly programs. As the C language was a development from assembly to a portable high-level language, I have a hard time believing that Ken, Dennis, and Brian gave it no thought at all.
Using an address + length format would cost one more byte of overhead than an address + magic_marker format, and their PDP computer had limited core memory. In other words, this could have been a perfectly typical and rational IT or CS decision, like the many similar decisions we all make every day; but this one had quite atypical economic consequences.
Hardware Development Costs
Initially, Unix had little impact on hardware and instruction set design. The CPUs that offered string manipulation instructions (for example, Z-80 and DEC VAX) did so in terms of the far more widespread adr+len model. Once Unix and C gained traction, however, the terminated string appeared on the radar as an optimization target, and CPU designers started to add instructions to deal with them. One example is the Logical String Assist instructions IBM added to the ES/9000 520-based processors in 1992. [1]
Adding instructions to a CPU is not cheap, and it happens only when there are tangible and quantifiable monetary reasons to do so.
Performance Costs
IBM added instructions to operate on NUL-terminated strings because its customers spent expensive CPU cycles handling such strings. That bit of information, however, does not tell us if fewer CPU cycles would have been required if a ptr+len format had been used.
Thinking a bit about virtual memory systems settles that question for us. Optimizing the movement of a known-length string of bytes can take advantage of the full width of memory buses and cache lines, without ever touching a memory location that is not part of the source or destination string.
One example is FreeBSD’s libc, where the bcopy(3)/memcpy(3) implementation will move as much data as possible in chunks of “unsigned long,” typically 32 or 64 bits, and then “mop up any trailing bytes” as the comment describes it, with byte-wide operations. [2]
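The word-then-bytes strategy can be sketched roughly as follows. This is not FreeBSD's code: `copy_known_length` is a hypothetical helper, it assumes the two buffers do not overlap, and it uses a fixed-size `memcpy` as a portable way to express a single word-wide load or store.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not FreeBSD's implementation.
 * Copies len bytes between non-overlapping buffers. */
static void copy_known_length(void *dst, const void *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned long word;

    /* Move whole words while at least one full word remains. */
    while (len >= sizeof word) {
        memcpy(&word, s, sizeof word);   /* one word-wide load  */
        memcpy(d, &word, sizeof word);   /* one word-wide store */
        s += sizeof word;
        d += sizeof word;
        len -= sizeof word;
    }

    /* Mop up any trailing bytes with byte-wide operations. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

Because `len` is known up front, the loop never reads a byte outside the source buffer, which is exactly what a terminator-based copy cannot promise when it reads in wide units.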
If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a VM (virtual memory) page and the next VM page is not defined, this would cause the process to die from an unwarranted “page not present” fault.
Of course, it is possible to write code to detect that corner case before engaging the optimized code path, but this adds a relatively high fixed cost to all string moves just to catch this unlikely corner case—not a profitable tradeoff by any means.
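A guard of that kind might look something like the sketch below. The 4096-byte page size and the name `read_crosses_page` are assumptions made for illustration; real code would ask the operating system for the page size, and a real string routine would have to run a test like this before every wide read near a page end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration only; real code would query the system. */
#define SKETCH_PAGE_SIZE 4096u

/* Would reading `width` bytes starting at `addr` spill into the next page? */
static int read_crosses_page(uintptr_t addr, size_t width)
{
    assert(width > 0 && width <= SKETCH_PAGE_SIZE);
    uintptr_t offset = addr % SKETCH_PAGE_SIZE;   /* position within the page */
    return offset + width > SKETCH_PAGE_SIZE;
}
```

Only when the check says the read stays inside the current page is it safe to fetch a whole word while looking for the NUL; otherwise the code must fall back to byte-wide reads.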
If we have out-of-band knowledge of the strings, things are different.
Compiler Development Cost
One thing a compiler often knows about a string is its length, particularly if it is a constant string. This allows the compiler to emit a call to the faster memcpy(3) even though the programmer used strcpy(3) in the source code.
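As a rough illustration of that substitution, the two hypothetical functions below produce the same result. The first is what the programmer writes; the second is the kind of call a compiler may emit on its own, since the size of a string literal (terminator included) is a compile-time constant. The function names and the literal are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What the programmer wrote: the copy has to look for the terminating NUL. */
static void greet_strcpy(char *dst, size_t dstsize)
{
    assert(dstsize >= sizeof "hello, world");
    strcpy(dst, "hello, world");
}

/* What a compiler may emit instead: sizeof on the literal counts the
 * 12 characters plus the NUL, so no scanning is needed at run time. */
static void greet_memcpy(char *dst, size_t dstsize)
{
    assert(dstsize >= sizeof "hello, world");
    memcpy(dst, "hello, world", sizeof "hello, world");
}
```

The optimization is only available because the compiler has out-of-band knowledge of the length; for a string that arrives at run time it has nothing to work with.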
Deeper code inspection by the compiler allows more advanced optimizations, some of them very clever, but only if somebody has written the code for the compiler to do it. The development of compiler optimizations has historically been neither easy nor cheap, but obviously Apple is hoping this will change with LLVM (Low-level Virtual Machine), where optimizers seem to come en gros.
The downside of heavy-duty compiler optimization—in particular, optimizations that take holistic views of the source code and rearrange it in large-scale operations—is that the programmer has to be really careful that the source code specifies his or her complete intention precisely. A programmer who worked with the compilers on the Convex C3800 series supercomputers related his experience as “having to program as if the compiler was my ex-wife’s lawyer.”
Security Costs
Even if your compiler does not have hostile intent, source code should be written to hold up to attack, and the NUL-terminated string has a dismal record in this respect. Utter security disasters such as gets(3), which “assume the buffer will be large enough,” are a problem “we have relatively under control.” [3]
Getting it under control, however, takes additions to compilers that would complain if the gets(3) function were called. Despite 15 years of attention, over- and under-running string buffers is still a preferred attack vector for criminals, and far too often it pays off.
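For contrast with gets(3), here is a minimal sketch of a bounded line read built on the standard fgets(3), which is told how large the buffer is. The wrapper `read_line_bounded` is a hypothetical helper, not something the article proposes.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read at most size-1 bytes of one line into buf.
 * Returns the number of bytes stored, or -1 on end of file or error. */
static long read_line_bounded(FILE *in, char *buf, size_t size)
{
    assert(in != NULL && buf != NULL && size > 0);

    if (fgets(buf, (int)size, in) == NULL)
        return -1;

    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == '\n')
        buf[--n] = '\0';   /* drop the newline that fgets keeps */
    return (long)n;
}
```

An over-long line is cut off at the buffer size instead of running past the end of the buffer; the rest of the line stays in the stream for the next read.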
Mitigation of these risks has been added at all levels. Long-missed no-execute bits have been added to CPUs’ memory management hardware; operating systems and compilers have added address-space randomization, often at high costs in performance; and static and dynamic analyses of programs have soaked up countless hours, trying to find out if the byzantine diagnostics were real bugs or clever programming.
Yet, absolutely nobody would be surprised if Sony’s troubles were revealed to start with a buffer overflow or false NUL-termination assumption.
Slashdot Sensation Prevention Section
We learn from our mistakes, so let me say for the record, before somebody comes up with a catchy but totally misleading Internet headline for this article, that there is absolutely no way Ken, Dennis, and Brian could have foreseen the full consequences of their choice some 30 years ago, and they disclaimed all warranties back then. For all I know, it took at least 15 years before anybody realized why this subtle decision was a bad idea, and few, if any, of my own IT decisions have stood up that long.
In other words, Ken, Dennis, and Brian did the right thing.
But That Doesn’t Solve the Problem
To a lot of people, C is a dead language, and ${lang} is the language of the future, for ever-changing transient values of ${lang}. The reality of the situation is that all other languages today directly or indirectly sit on top of the Posix API and the NUL-terminated string of C.
When your Java, Python, Ruby, or Haskell program opens a file, its runtime environment passes the filename as a NUL-terminated string to open(3), and when it resolves queue.acm.org to an IP number, it passes the host name as a NUL-terminated string to getaddrinfo(3). As long as you keep doing that, you retain all the advantages when running your programs on a PDP/11, and all of the disadvantages if you run them on anything else.
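A runtime that stores counted strings internally has to do something like the following at that boundary. This is a sketch under assumptions: `to_c_string` is a hypothetical helper, and rejecting embedded NUL bytes is one possible policy, chosen here because the C callee would otherwise silently treat the first NUL as the end of the name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical runtime helper: build a NUL-terminated copy of a counted
 * string for a C API. Returns NULL if the text contains an embedded NUL
 * or if allocation fails. The caller frees the result. */
static char *to_c_string(const char *addr, size_t len)
{
    assert(addr != NULL || len == 0);

    if (len > 0 && memchr(addr, '\0', len) != NULL)
        return NULL;               /* the callee would see a truncated name */

    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    if (len > 0)
        memcpy(copy, addr, len);
    copy[len] = '\0';              /* append the terminator the C API expects */
    return copy;
}
```

The extra allocation, copy, and scan happen on every call into the C API, which is one way the costs described above leak into languages that never chose the terminated representation themselves.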
I could write a straw man API proposal here, suggest representations, operations, and error-handling strategies, and I am quite certain that it would be a perfectly good waste of a nice afternoon. Experience shows that such proposals go nowhere because backwards compatibility with the PDP/11 and the finite number of programs written are much more important than the ability to write the potentially infinite number of programs in the future in an efficient and secure way.
Thus, the costs of the Ken, Dennis, and Brian decision will keep accumulating, like the dust that over the centuries has almost buried the monuments of ancient Rome.
References
1. Computer Business Review. 1992. Partitioning and Escon enhancements for top-end ES/9000s; http://www.cbronline.com/news/ibm_announcements_71
2. ViewVC. 2007. Contents of /head/lib/libc/string/bcopy.c; http://svnweb.freebsd.org/base/head/lib/libc/string/bcopy.c?view=markup
3. Wikipedia. 2011. Lifeboat sketch; http://en.wikipedia.org/wiki/Lifeboat_sketch
via: The Most Expensive One-byte Mistake, ACM Queue, Poul-Henning Kamp, July 25, 2011