Chapter 2: Unicode Encoding Study Notes

Article link: https://blog.csdn.net/lin810921141/article/details/16351633

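We all know the ASCII character encoding, but it is very much an American standard. The rest of the world finds its 128 character codes far too few; Chinese alone has more than 20,000 characters. That is why Unicode, an encoding that uses 16 bits to represent one character, came along.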

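The "History of Character Sets" section highlights three encodings: ASCII, DBCS, and Unicode. ASCII is already familiar. My understanding of DBCS (double-byte character set) is this: it still works in 8-bit units, but certain designated codes (called lead bytes) signal that the following 8-bit value (the trail byte) combines with them to form a single character.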

The first 128 of these codes are ASCII. However, some of the codes in the higher 128 are always followed by a second byte. The two bytes together (called a lead byte and a trail byte) define a single character.
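To make the lead byte / trail byte idea concrete, here is a minimal sketch that counts characters (not bytes) in a DBCS string. The function names are made up for illustration, and the lead-byte ranges are an assumption: they are the ranges usually quoted for Shift-JIS (code page 932), and other DBCS code pages use different ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lead-byte test. The ranges below are the ones commonly
   quoted for Shift-JIS (code page 932); other DBCS code pages differ. */
static int is_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a NUL-terminated DBCS string. */
size_t dbcs_char_count(const char *s)
{
    const unsigned char *p;
    size_t count = 0;

    assert(s != NULL);
    p = (const unsigned char *)s;

    while (*p != 0) {
        /* A lead byte consumes the trail byte that follows it, unless
           the string is cut off right after the lead byte. */
        if (is_lead_byte(*p) && p[1] != 0)
            p += 2;
        else
            p += 1;
        count++;
    }
    return count;
}
```

The point of the sketch is that the byte count and the character count no longer agree once a lead byte appears, which is exactly what makes DBCS strings awkward to process.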

Unicode, by contrast, uses 16 bits to represent each character.

Wide Characters and C

ASCII (8-bit) characters: char, with strlen() for the length

Unicode (wide) characters: wchar_t

Examples: wchar_t c=L'a'; or wchar_t c=L'中'; (the L prefix tells the compiler that the character that follows is a wide character, occupying 2 bytes)

wchar_t *p=L"Hello!"; wchar_t a[]=L"Hello!"; (here sizeof(a) is 14: each of the 6 characters takes 2 bytes, and the terminating 0 also takes 2 bytes)
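The size arithmetic can be checked with a short sketch. The function names are made up for illustration. The figure 14 assumes a 2-byte wchar_t, as with Windows compilers; many Unix compilers use a 4-byte wchar_t, so the code expresses the result in terms of sizeof(wchar_t) rather than hard-coding it.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by the array, including the terminating L'\0'. */
size_t hello_array_bytes(void)
{
    wchar_t a[] = L"Hello!";

    /* 6 characters plus 1 terminator, each sizeof(wchar_t) bytes wide.
       With a 2-byte wchar_t this is the 14 mentioned above. */
    assert(sizeof(a) == 7 * sizeof(wchar_t));
    return sizeof(a);
}

/* Number of elements in the array: 6 characters plus the terminator. */
size_t hello_array_elements(void)
{
    wchar_t a[] = L"Hello!";
    return sizeof(a) / sizeof(a[0]);
}

/* Size of a pointer to the same literal; it says nothing about the
   string's length. */
size_t hello_pointer_bytes(void)
{
    const wchar_t *p = L"Hello!";
    return sizeof(p);
}
```

Note that sizeof only gives the string's size for the array form; for the pointer form it gives the size of the pointer itself.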

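As an aside, consider: wchar_t c = 'A' ;
The variable c holds the 16-bit value 0x0041, which is the Unicode representation of the letter A. However, because Intel microprocessors store multibyte values least-significant byte first, the bytes are actually stored in memory in the order 0x41, 0x00, not 0x00, 0x41.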

So L"Hello!" is stored in memory by Intel processors like so:
48 00 65 00 6C 00 6C 00 6F 00 21 00
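The byte order can be observed directly. This is a small sketch with made-up function names; it uses uint16_t rather than wchar_t so that the 16-bit width holds on every platform, not only on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the byte found at position index (0 or 1) when a 16-bit value
   is laid out in memory. */
unsigned char byte_in_memory(uint16_t value, size_t index)
{
    unsigned char bytes[sizeof value];

    assert(index < sizeof value);
    memcpy(bytes, &value, sizeof value);   /* copy the raw representation */
    return bytes[index];
}

/* 1 on a little-endian machine (such as Intel x86), 0 otherwise. */
int is_little_endian(void)
{
    return byte_in_memory(0x0001, 0) == 0x01;
}
```

On an Intel machine, byte_in_memory(0x0041, 0) yields 0x41 and byte_in_memory(0x0041, 1) yields 0x00, matching the 48 00 65 00 ... pattern of low byte first.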

wcslen() returns the length of a wide-character string.
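A short sketch of why the wide version is needed. The function names are made up for illustration. The second function deliberately misreads a wide string as a narrow one, which is the mistake you make by passing it to strlen through a cast.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Length in characters, counted by the wide-character function. */
size_t wide_length(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}

/* What strlen reports if the same memory is (mis)read as a narrow
   string: it stops at the first zero byte it meets. */
size_t narrow_view_length(const wchar_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For L"Hello!", wide_length gives 6. On a little-endian machine narrow_view_length gives 1, because the byte 0x48 for 'H' is immediately followed by a zero byte.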

Unifying 8-bit and 16-bit characters: TCHAR (defined in the header file TCHAR.H)

If an identifier named _UNICODE is defined and the TCHAR.H header file is included in your program, _tcslen is defined to be wcslen :
#define _tcslen wcslen
If _UNICODE isn't defined, _tcslen is defined to be strlen :
#define _tcslen strlen
And so on.

If the _UNICODE identifier is defined, TCHAR is wchar_t :
typedef wchar_t TCHAR ;
Otherwise, TCHAR is simply a char :
typedef char TCHAR ;

If the _UNICODE identifier is defined, a macro called __T is defined like this:
#define __T(x) L##x

If the _UNICODE identifier is not defined, the __T macro is simply defined in the following way:
#define __T(x) x
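The whole mechanism can be imitated in portable C. In the sketch below, MY_UNICODE, MY_TCHAR, MY_T, MY_T_IMPL and my_tcslen are hypothetical stand-ins for _UNICODE, TCHAR, _T, __T and _tcslen; they are not the real TCHAR.H definitions, only an illustration of the same pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for _UNICODE. Comment this line out to build
   the 8-bit variant of everything below. */
#define MY_UNICODE

#ifdef MY_UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T_IMPL(x)  L##x       /* paste the L prefix onto the literal */
#define my_tcslen     wcslen
#else
typedef char MY_TCHAR;
#define MY_T_IMPL(x)  x          /* leave the literal as it is */
#define my_tcslen     strlen
#endif

/* Second macro layer, so that an argument which is itself a macro is
   expanded before the L prefix is pasted on. */
#define MY_T(x) MY_T_IMPL(x)

/* The same source line compiles against either character width. */
size_t tchar_length(const MY_TCHAR *s)
{
    assert(s != NULL);
    return my_tcslen(s);
}
```

A call such as tchar_length(MY_T("Hello!")) returns 6 in both builds; only the type of the literal and the function actually called change.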

Wide Characters and Windows

In Windows programming, don't use char and wchar_t directly. windows.h includes winnt.h, and winnt.h contains many definitions:

typedef char CHAR ;
typedef wchar_t WCHAR ;

Pointers to 8-bit character strings:

typedef CHAR * PCHAR, * LPCH, * PCH, * NPSTR, * LPSTR, * PSTR ;
typedef CONST CHAR * LPCCH, * PCCH, * LPCSTR, * PCSTR ;

Pointers to 16-bit character strings:

typedef WCHAR * PWCHAR, * LPWCH, * PWCH, * NWPSTR, * LPWSTR, * PWSTR ;
typedef CONST WCHAR * LPCWCH, * PCWCH, * LPCWSTR, * PCWSTR ;

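winnt.h also defines TCHAR, together with a __TEXT macro that takes one of two forms:
(1) #define __TEXT(quote) L##quote (used with WCHAR, when UNICODE is defined)
(2) #define __TEXT(quote) quote (used with CHAR, otherwise)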

#define TEXT(quote) __TEXT(quote)

Most string-related functions also come in 8-bit and 16-bit versions. For example, MessageBox is split into MessageBoxA and MessageBoxW:

WINUSERAPI int WINAPI MessageBoxA (HWND hWnd, LPCSTR lpText,LPCSTR lpCaption, UINT uType) ;

WINUSERAPI int WINAPI MessageBoxW (HWND hWnd, LPCWSTR lpText,LPCWSTR lpCaption, UINT uType) ;

There is a single unified name as well, though:

#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif

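The last part covers the fact that a Windows program cannot use formatted-output functions such as printf. It can, however, use sprintf to format the output into a buffer and then display that buffer with MessageBox. In fact, you can write a MessageBox function with formatted output yourself; the idea is exactly the one just described, built on sprintf.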
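Here is a minimal sketch of that idea. The names format_message and show_message are made up for illustration and are not taken from the book. Two assumptions: it uses vsnprintf, the bounded variadic cousin of sprintf, so that the buffer cannot overflow, and it calls the 8-bit MessageBoxA only when compiled for Windows, falling back to standard error elsewhere so that the sketch stays portable.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Format printf-style arguments into buffer. Returns what vsnprintf
   returns: the length the full text would have, or a negative value
   on error. */
int format_message(char *buffer, size_t size, const char *format, ...)
{
    va_list args;
    int written;

    assert(buffer != NULL && size > 0 && format != NULL);

    va_start(args, format);
    written = vsnprintf(buffer, size, format, args);  /* bounded by size */
    va_end(args);
    return written;
}

/* Display step: a message box on Windows, standard error elsewhere. */
void show_message(const char *caption, const char *text)
{
#ifdef _WIN32
    MessageBoxA(NULL, text, caption, MB_OK);
#else
    fprintf(stderr, "[%s] %s\n", caption, text);
#endif
}
```

Typical use is to format first, for example format_message(buf, sizeof buf, "%d x %d", 640, 480), and then pass buf to show_message. Keeping the two steps separate means the formatting part can be exercised without a message box appearing.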

For the details I have to point to someone else's write-up, which is thorough and specific: http://www.360doc.com/content/13/0205/15/7991404_264373779.shtml
