Windows Via C/C++: Learning While Translating (2), Working with Characters and Strings, Part 1

Direwolf

第二章操作字符和字符串(Working with Characters and Strings)

Overview 概述

Windows has always offered support to help developers localize their applications. An application can get country-specific information from various functions and can examine Control Panel settings to determine the user's preferences. Windows even supports different fonts for our applications. Last but not least, in Windows Vista, Unicode 5.0 is now supported. (Read "Extend The Global Reach Of Your Applications With Unicode 5.0" at http://msdn.microsoft.com/msdnmag/issues/07/01/Unicode/default.aspx for a high-level presentation of Unicode 5.0.)

Windows一直为开发者进行应用程序本地化提供支持。应用程序可以通过各种函数获取特定国家/地区的信息，并能通过检查控制面板设置来确定用户的首选项。Windows甚至为应用程序提供不同字体的支持。最后但同样重要的是，Windows Vista现已支持Unicode 5.0。（可阅读"Extend The Global Reach Of Your Applications With Unicode 5.0"了解Unicode 5.0的概要介绍，网址：http://msdn.microsoft.com/msdnmag/issues/07/01/Unicode/default.aspx）

Buffer overrun errors (which are typical when manipulating character strings) have become a vector for security attacks against applications and even against parts of the operating system. In previous years, Microsoft put forth a lot of internal and external efforts to raise the security bar in the Windows world. The second part of this chapter presents new functions provided by Microsoft in the C run-time library. You should use these new functions to protect your code against buffer over-runs when manipulating strings.

缓冲区溢出错误（在操作字符串时很常见）已成为针对应用程序乃至操作系统某些部分进行安全攻击的途径。过去几年，微软在内部和外部做了大量努力来提高Windows环境的安全水平。本章的第二部分介绍了微软在C运行时库中提供的新函数，在处理字符串时应该使用这些新函数来防止代码发生缓冲区溢出。
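The idea behind size-aware string functions can be sketched in portable C++. The excerpt does not name the new C run-time functions, so `copy_string_checked` below is a hypothetical helper written only for illustration, not a CRT function. It assumes the caller passes the true size of the destination buffer, and it refuses to copy when the source plus its terminating zero would not fit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of any library): copies src into dest only
// when the characters plus the terminating zero fit in destSize bytes.
// On failure dest is left as an empty string when that is possible.
bool copy_string_checked(char* dest, std::size_t destSize, const char* src) {
    if (dest == nullptr || destSize == 0) return false;
    if (src == nullptr) {
        dest[0] = '\0';
        return false;
    }
    std::size_t len = std::strlen(src);
    if (len >= destSize) {   // no room for the characters plus the terminator
        dest[0] = '\0';
        return false;
    }
    std::memcpy(dest, src, len + 1);   // copy the terminator as well
    return true;
}

// Small self-check showing both the accepted and the rejected case.
void overrun_demo() {
    char buffer[8];
    assert(copy_string_checked(buffer, sizeof(buffer), "short"));
    assert(std::strcmp(buffer, "short") == 0);
    assert(!copy_string_checked(buffer, sizeof(buffer), "far too long"));
    assert(buffer[0] == '\0');
}
```

An unchecked `strcpy` into the same 8-byte buffer would write past its end for the second string; the checked version reports the failure instead.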

If you have a code base that is non-Unicode, you'll be best served by moving that code base to Unicode, as this will improve your application's execution performance as well as prepare it for localization. It will also help when interoperating with COM and the .NET Framework.

如果你的代码库尚未使用Unicode，最好将其迁移到Unicode，因为这样既能提升应用程序的执行性能，又能为本地化做好准备。这样做也有助于与COM和.NET Framework进行互操作。

Character Encodings 字符编码

The real problem with localization has always been manipulating different character sets. For years, most of us have been coding text strings as a series of single-byte characters with a zero at the end. This is second nature to us. When we call strlen, it returns the number of characters in a zero-terminated array of ANSI single-byte characters.

本地化的真正难题一直在于对不同字符集的操作。多年来，我们大多数人一直将文本字符串编码为一系列单字节字符，并在末尾加上零字符('\0')，这已成为我们的习惯。当调用strlen函数时，它返回以零结尾的ANSI单字节字符数组中的字符个数。
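A minimal sketch of why the single-byte habit breaks down: `strlen` counts bytes up to the terminating zero, which equals the character count only when every character occupies one byte. The example assumes a UTF-8 encoded string purely for illustration (the passage itself discusses ANSI strings), and `count_utf8_code_points` is a hypothetical helper that assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
std::size_t count_utf8_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// "café" with the final letter stored as the two UTF-8 bytes 0xC3 0xA9.
const char kCafe[] = "caf\xC3\xA9";

void strlen_demo() {
    assert(std::strlen("cafe") == 4);             // single-byte text: bytes == characters
    assert(std::strlen(kCafe) == 5);              // strlen reports bytes, not characters
    assert(count_utf8_code_points(kCafe) == 4);   // four characters to a reader
}
```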

Unicode is a standard founded by Apple and Xerox in 1988. In 1991, a consortium was created to develop and promote Unicode. The consortium consists of companies such as Apple, Compaq, Hewlett-Packard, IBM, Microsoft, Oracle, Silicon Graphics, Sybase, Unisys, and Xerox. (A complete and updated list of consortium members is available at http://www.Unicode.org.) This group of companies is responsible for maintaining the Unicode standard. The full description of Unicode can be found in The Unicode Standard, published by Addison-Wesley. (This book is available through http://www.Unicode.org.)

1988年，苹果(Apple)和施乐(Xerox)公司创建了Unicode标准。1991年，一个旨在发展和推广Unicode的协会成立，该协会由苹果(Apple)、康柏(Compaq)、惠普(Hewlett-Packard)、IBM、微软(Microsoft)、Oracle、SGI(Silicon Graphics)、Sybase、Unisys及施乐(Xerox)等多家公司组成（协会成员的最新完整列表参见http://www.Unicode.org）。这些公司负责维护Unicode标准。Unicode的完整描述请参考Addison-Wesley出版的《The Unicode Standard》（此书可通过http://www.Unicode.org获得）。

In Windows Vista, each Unicode character is encoded using UTF-16 (where UTF is an acronym for Unicode Transformation Format). UTF-16 encodes each character as 2 bytes (or 16 bits). In this book, when we talk about Unicode, we are always referring to UTF-16 encoding unless we state otherwise. Windows uses UTF-16 because characters from most languages used throughout the world can easily be represented via a 16-bit value, allowing programs to easily traverse a string and calculate its length. However, 16 bits is not enough to represent all characters from certain languages. For these languages, UTF-16 supports surrogates, which are a way of using 32 bits (or 4 bytes) to represent a single character. Because few applications need to represent the characters of these languages, UTF-16 is a good compromise between saving space and providing ease of coding. Note that the .NET Framework always encodes all characters and strings using UTF-16, so using UTF-16 in your Windows application will improve performance and reduce memory consumption if you need to pass characters or strings between native and managed code.

Windows Vista中的Unicode字符均采用UTF-16编码（UTF即Unicode Transformation Format）。UTF-16将每个字符编码为2字节（16位）。本书中讨论到Unicode，如果没有特殊说明，均指UTF-16编码。Windows采用UTF-16，是因为全世界范围使用的绝大多数语言的字符，都能通过16位值来表示，这使得程序能够容易地遍历字符串并计算其长度。然而，16位并不足以表示某些语言的所有字符。对于这些语言，UTF-16支持替代(surrogates)，即用32位值（4字节）来表示单个字符。由于只有极少数应用程序需要表示这些语言的字符，所以UTF-16是节省空间和简化编码之间很好的折衷方案。注意.NET Framework对所有字符和字符串都采用UTF-16编码，因此在Windows应用程序中，当需要在本机代码(native)和托管代码间传递字符和字符串时，采用UTF-16编码会提升性能并减少内存消耗。
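The surrogate mechanism described above can be sketched with standard C++11 `char16_t` strings. Both functions are hypothetical helpers written for illustration; they assume valid code points and well-formed UTF-16, and they use no Windows API. A code point below 0x10000 fits in one 16-bit unit, while a larger one is split into a high surrogate (0xD800 to 0xDBFF) and a low surrogate (0xDC00 to 0xDFFF).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: appends one Unicode code point to a UTF-16 string.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not itself a surrogate).
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));                      // one 16-bit unit
    } else {
        cp -= 0x10000;                                                 // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));     // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));   // low surrogate
    }
}

// Hypothetical helper: counts characters (code points) in well-formed UTF-16
// by counting every unit except low surrogates, so each pair counts once.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) ++n;
    }
    return n;
}

void utf16_demo() {
    std::u16string s;
    append_utf16(s, 0x0041);    // U+0041 LATIN CAPITAL LETTER A
    append_utf16(s, 0x1F600);   // U+1F600, outside the 16-bit range
    assert(s.size() == 3);      // three 16-bit units in storage
    assert(s[1] == 0xD83D && s[2] == 0xDE00);
    assert(count_code_points(s) == 2);   // but only two characters
}
```

The demo shows why the length in 16-bit units and the number of characters can differ once surrogates appear.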

There are other UTF standards for representing characters, including the following ones:

UTF-8 UTF-8 encodes some characters as 1 byte, some characters as 2 bytes, some characters as 3 bytes, and some characters as 4 bytes. Characters with a value below 0x0080 are compressed to 1 byte, which works very well for characters used in the United States. Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages. Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages. Finally, surrogate pairs are written out as 4 bytes. UTF-8 is an extremely popular encoding format, but it's less efficient than UTF-16 if you encode many characters with values of 0x0800 or above.

UTF-32 UTF-32 encodes every character as 4 bytes. This encoding is useful when you want to write a simple algorithm to traverse characters (used in any language) and you don't want to have to deal with characters taking a variable number of bytes. For example, with UTF-32, you do not need to think about surrogates because every character is 4 bytes. Obviously, UTF-32 is not an efficient encoding format in terms of memory usage. Therefore, it's rarely used for saving or transmitting strings to a file or network. This encoding format is typically used inside the program itself.

以下是表示字符的其他UTF标准：

UTF-8 UTF-8将一些字符编码为1字节，一些编码为2字节，一些编码为3字节，还有一些编码为4字节。值低于0x0080的字符被压缩为1字节，非常适合美国所使用的字符；值介于0x0080和0x07FF之间的字符被转换为2字节，适合欧洲及中东地区的语言；值大于等于0x0800的字符被转换为3字节，适合东亚地区的语言；最后，替代对(surrogate pairs)被编码为4字节。UTF-8是极为流行的编码格式，但是当要对许多值大于等于0x0800的字符进行编码时，其效率低于UTF-16。

UTF-32 UTF-32将每个字符编码为4字节。当你只想写一个简单算法遍历字符（任何语言中所使用的字符），并且不想处理字符所占字节数可变的问题时，这种编码方式很有用。例如，使用UTF-32无需考虑替代(surrogates)，因为每个字符都是4字节。显然，就内存使用而言，UTF-32不是一种高效的编码方式。因此，它很少被用于将字符串保存到文件或通过网络传输。这种编码格式通常用于程序内部。
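The UTF-8 byte counts listed above follow directly from the value ranges, which a short encoder makes concrete. `encode_utf8` is a hypothetical helper for illustration only; it assumes a valid code point and performs no error checking.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // below 0x0080: 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 0x0080 to 0x07FF: 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 0x0800 to 0xFFFF: 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // surrogate-pair range in UTF-16: 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

void utf8_demo() {
    assert(encode_utf8(0x41).size() == 1);            // U+0041, ASCII letter
    assert(encode_utf8(0xE9).size() == 2);            // U+00E9, Latin letter with accent
    assert(encode_utf8(0x4E2D) == "\xE4\xB8\xAD");    // U+4E2D, a Chinese character
    assert(encode_utf8(0x1F600).size() == 4);         // U+1F600, beyond 16 bits
}
```

The third case illustrates the efficiency remark in the text: U+4E2D takes 3 bytes in UTF-8 but a single 2-byte unit in UTF-16, while UTF-32 would spend 4 bytes on every one of these characters.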

Currently, Unicode code points are defined for the Arabic, Chinese bopomofo, Cyrillic (Russian), Greek, Hebrew, Japanese kana, Korean hangul, and Latin (English) alphabets (called scripts) and more. Each version of Unicode brings new characters in existing scripts and even new scripts such as Phoenician (an ancient Mediterranean alphabet). A large number of punctuation marks, mathematical symbols, technical symbols, arrows, dingbats, diacritics, and other characters are also included in the character sets. The 65,536 code points that can be addressed with a single 16-bit value are divided into regions. Table 2-1 shows some of the regions and the characters that are assigned to them.

目前，Unicode已为阿拉伯语、汉语注音符号(bopomofo)、西里尔字母（俄语所使用的字母）、希腊语、希伯来语、日语假名、韩文谚文(hangul)、拉丁（英文）字母（这些字母表称为文字体系）等定义了代码点(code points，指符号在字符表中的位置)。每个Unicode版本都会为现有文字体系引入新字符，甚至引入像腓尼基文（一种古老的地中海文字）这样的新文字体系。大量的标点符号、数学符号、技术符号、箭头符号、装饰符号(dingbats)、变音符号(diacritics)以及其他符号，也包含在字符集中。可用单个16位值表示的这65,536个代码点被分成多个区块。表2-1列出了其中的一些区块以及分配给它们的字符。

Table 2-1: Unicode Character Sets and Alphabets Unicode字符集和字母表

16-Bit Code	Characters	16-Bit Code	Alphabet/Scripts
0000-007F	ASCII	0300-036F	Generic diacritical marks 一般变音符/附加符号
0080-00FF	Latin1 characters	0400-04FF	Cyrillic 西里尔字母
0100-017F	European Latin	0530-058F	Armenian 亚美尼亚语
0180-01FF	Extended Latin	0590-05FF	Hebrew 希伯来语
0250-02AF	Standard phonetic 标准语音（音标）	0600-06FF	Arabic 阿拉伯语
02B0-02FF	Modified letters	0900-097F	Devanagari 梵文字母