First translated article: Working With Unicode in C++

 
Working With Unicode in C++
Translator: selong
Translation date: 2006-6-9
Because the Pocket PC generally requires character string parameters to be in Unicode, you may encounter a great many errors when you first port code to the platform. This article will help you over the bumps associated with working with Unicode in your C++ applications. The information also applies to porting non-Unicode applications to Unicode on Microsoft® Windows NT® and Microsoft Windows® 2000.
 
What You Need
Microsoft eMbedded Visual C++® 3.0

Languages Supported
Any language supported by Microsoft eMbedded Visual C++ 3.0
 
Using Unicode Character Strings
On the Pocket PC platform, Unicode characters are 16-bit (double-byte) integers, which means that each character in a string of text can have one of 65,536 (2^16) values. In contrast, single-byte ANSI characters (ASCII plus an 8-bit extension, the English default for Windows 95/98/ME) use 8 bits and can have only 256 different values for each character in a string. While 256 values are enough for English and other Latin-based languages, they are far too few for several Asian, Arabic, and other languages. For this reason, desktop operating systems such as Windows 95/98/ME were released in different versions for different languages. The 16-bit Unicode standard used by the Pocket PC provides codes for nearly 39,000 characters from the world's alphabets, ideograph sets, and symbol collections (and still has room for 18,000 more). Because most of the Pocket PC kernel, user, and graphics application programming interfaces (APIs) require string parameters to be passed as Unicode character strings (encoded in little-endian 16-bit format, commonly referred to as UCS-2 or UTF-16), you will need to take the following steps in your application source code:
1. Wrap all character string literals in either the _T() macro or the TEXT() macro. In a Unicode build, these cause the literals to be compiled as wide (16-bit) strings.
2. Use TCHAR instead of char and unsigned char when dealing with individual characters and when allocating character arrays. (Note that in a non-Unicode OS such as Windows 98/ME, a TCHAR is a single byte, so your source code will be portable to these as well.)
3. Use LPTSTR for TCHAR pointers and LPCTSTR for constant TCHAR pointers.
4. When you are copying strings or memory containing strings, never assume that characters are 1 byte each. You shouldn't assume they are 2 bytes each either. Instead, use sizeof(TCHAR) so that your code works in either kind of build.
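The advice about sizeof(TCHAR) can be illustrated with a small portable sketch. It is only an illustration under stated assumptions: TCHAR comes from the Windows headers, so the code below uses a template parameter CharT as a stand-in for it, and the helper names StringLength, StringBytes, and CopyString are hypothetical, not part of any Microsoft API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// CharT stands in for TCHAR (an assumption for illustration): it is char in
// a single-byte build and wchar_t in a Unicode build.

// Count the characters before the terminating null, whatever their width.
template <typename CharT>
std::size_t StringLength(const CharT* s)
{
    std::size_t n = 0;
    while (s[n] != CharT())
        ++n;
    return n;
}

// Size of the string in bytes, including its terminator. The width comes
// from sizeof(CharT) and is never hard-coded as 1 or 2.
template <typename CharT>
std::size_t StringBytes(const CharT* s)
{
    return (StringLength(s) + 1) * sizeof(CharT);
}

// Copy src into dest, where destCount is the capacity of dest in characters.
template <typename CharT>
void CopyString(CharT* dest, std::size_t destCount, const CharT* src)
{
    const std::size_t bytes = StringBytes(src);
    assert(bytes <= destCount * sizeof(CharT)); // destination must be large enough
    std::memcpy(dest, src, bytes);
}
```

The same call, for instance CopyString(buffer, 8, text), then works for both char and wchar_t buffers. Note that wchar_t is 2 bytes with Microsoft compilers but 4 bytes on many other platforms, which is one more reason not to assume a fixed character width.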
Converting Between Unicode and Single Byte Characters

There may be occasions when you need to convert from single byte to Unicode or vice versa, such as when legacy source code requires a single-byte character string. To convert between the two:
1. Write two functions, one called ConvertTToC() and the other called ConvertCToT(). Each function accepts a destination pointer and a source pointer.
2. In the body of each function, walk the source string and cast each character to the character type of the destination string. Your code should look something like this:
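    // Convert a Unicode (TCHAR) string to an ANSI string.
    // pszDest must have room for _tcslen(pszSrc) + 1 characters.
    void ConvertTToC(CHAR* pszDest, const TCHAR* pszSrc)
    {
        size_t nLen = _tcslen(pszSrc);      // measure the source once
        for (size_t i = 0; i < nLen; i++)
            pszDest[i] = (CHAR) pszSrc[i];  // keeps only the low-order byte
        pszDest[nLen] = '\0';               // terminate the destination
    }

    // Convert an ANSI string to a Unicode (TCHAR) string.
    // pszDest must have room for strlen(pszSrc) + 1 characters.
    void ConvertCToT(TCHAR* pszDest, const CHAR* pszSrc)
    {
        size_t nLen = strlen(pszSrc);       // measure the source once
        for (size_t i = 0; i < nLen; i++)
            pszDest[i] = (TCHAR) (unsigned char) pszSrc[i];  // avoid sign extension
        pszDest[nLen] = _T('\0');           // terminate the destination
    }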
As you can see, the functions are nearly identical; they differ mainly in the string-length function they call (_tcslen() versus strlen()).
3. Keep in mind that converting from TCHAR to CHAR discards the high-order byte of each character. If you do not plan for your application to be used with languages that need character values above 255, this has no effect. But it can have a very bad effect on strings containing characters greater than 255: two different characters that share the same low-order byte, such as U+0141 and U+0041, both become 0x41, and once they have been converted to single byte there is no way to distinguish them.

Figure: Problem converting TCHAR into single byte character.
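One way to guard against the silent loss described above is to check each character before narrowing it. The following is a minimal sketch, not the article's method: it uses the standard type char16_t as a stand-in for a 16-bit TCHAR, and the function name NarrowChecked and its '?' replacement policy are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Narrow a null-terminated 16-bit string into an 8-bit buffer of destCount
// characters. Returns true only if the whole string was copied and every
// character fit in a single byte. Characters above 0xFF are replaced with
// '?' so that the loss is visible instead of producing a different letter.
bool NarrowChecked(char* dest, std::size_t destCount, const char16_t* src)
{
    assert(dest != nullptr && src != nullptr && destCount > 0);

    bool lossless = true;
    std::size_t i = 0;
    for (; src[i] != 0 && i + 1 < destCount; ++i) {
        if (src[i] > 0xFF) {
            dest[i] = '?';       // the high-order byte would have been dropped
            lossless = false;
        } else {
            dest[i] = static_cast<char>(static_cast<unsigned char>(src[i]));
        }
    }
    dest[i] = '\0';

    if (src[i] != 0)
        lossless = false;        // the source did not fit in the destination
    return lossless;
}
```

A caller can then decide what to do when the function returns false, for example report an error rather than store the damaged string.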
 
Working with BSTR Objects
When working with character strings in COM objects, you will be required to pass and receive character strings as BSTR (Binary String) objects. There are Microsoft Win32® APIs for creating and working with BSTR objects, as well as an ATL class called CComBSTR if you are using the Microsoft ATL libraries. If you choose to use the Win32 APIs, here are the steps you should follow to create, use, and clean up your BSTR objects:
1. Create the BSTR using the SysAllocString() API.
2. If you need to change the contents of the BSTR object, resize the buffer with the SysReAllocString() API if needed.
3. When you are done with the object, call the SysFreeString() API to release its memory.
Conclusion
By supporting Unicode and understanding the differences between single byte and Unicode characters, your application will be ready to accept character strings in any language.
 