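On Windows, transcoding is done with two functions: MultiByteToWideChar and WideCharToMultiByte.
Once you have text in UTF-16, converting it to a native (multibyte) encoding requires a codepage parameter, which tells the function how to map the UTF-16 code units to that encoding.
The problem is that we don't know which code page to use (and it gets worse when the UTF-16 text mixes several languages).
So you have to consult the Unicode Subset Bitfields to decide on the code page. The table below is what I found on MSDN.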
Bit | Unicode | Description |
0 | 0020 - 007e | Basic Latin |
1 | 00a0 - 00ff | Latin-1 Supplement |
2 | 0100 - 017f | Latin Extended-A |
3 | 0180 - 024f | Latin Extended-B |
4 | 0250 - 02af | IPA Extensions |
5 | 02b0 - 02ff | Spacing Modifier Letters |
6 | 0300 - 036f | Combining Diacritical Marks |
7 | 0370 - 03ff | Basic Greek |
8 | | Reserved |
9 | 0400 - 04ff | Cyrillic |
10 | 0530 - 058f | Armenian |
11 | 0590 - 05ff | Basic Hebrew |
12 | | Reserved |
13 | 0600 - 06ff | Basic Arabic |
14 | | Reserved |
15 | 0900 - 097f | Devanagari |
16 | 0980 - 09ff | Bengali |
17 | 0a00 - 0a7f | Gurmukhi |
18 | 0a80 - 0aff | Gujarati |
19 | 0b00 - 0b7f | Oriya |
20 | 0b80 - 0bff | Tamil |
21 | 0c00 - 0c7f | Telugu |
22 | 0c80 - 0cff | Kannada |
23 | 0d00 - 0d7f | Malayalam |
24 | 0e00 - 0e7f | Thai |
25 | 0e80 - 0eff | Lao |
26 | 10a0 - 10ff | Basic Georgian |
27 | | Reserved |
28 | 1100 - 11ff | Hangul Jamo |
29 | 1e00 - 1eff | Latin Extended Additional |
30 | 1f00 - 1fff | Greek Extended |
31 | 2000 - 206f | General Punctuation |
32 | 2070 - 209f | Subscripts and Superscripts |
33 | 20a0 - 20cf | Currency Symbols |
34 | 20d0 - 20ff | Combining Diacritical Marks for Symbols |
35 | 2100 - 214f | Letter-like Symbols |
36 | 2150 - 218f | Number Forms |
37 | 2190 - 21ff | Arrows |
38 | 2200 - 22ff | Mathematical Operators |
39 | 2300 - 23ff | Miscellaneous Technical |
40 | 2400 - 243f | Control Pictures |
41 | 2440 - 245f | Optical Character Recognition |
42 | 2460 - 24ff | Enclosed Alphanumerics |
43 | 2500 - 257f | Box Drawing |
44 | 2580 - 259f | Block Elements |
45 | 25a0 - 25ff | Geometric Shapes |
46 | 2600 - 26ff | Miscellaneous Symbols |
47 | 2700 - 27bf | Dingbats |
48 | 3000 - 303f | Chinese, Japanese, and Korean (CJK) Symbols and Punctuation |
49 | 3040 - 309f | Hiragana |
50 | 30a0 - 30ff | Katakana |
51 | 3100 - 312f | Bopomofo |
52 | 3130 - 318f | Hangul Compatibility Jamo |
53 | 3190 - 319f | CJK Miscellaneous |
54 | 3200 - 32ff | Enclosed CJK Letters and Months |
55 | 3300 - 33ff | CJK Compatibility |
56 | ac00 - d7a3 | Hangul |
57 | d800 - dfff | Surrogates. Note that setting this bit implies that there is at least one codepoint beyond the Basic Multilingual Plane that is supported by this font. |
58 | | Reserved |
59 | 4e00 - 9fff | CJK Unified Ideographs |
60 | e000 - f8ff | Private Use Area |
61 | f900 - faff | CJK Compatibility Ideographs |
62 | fb00 - fb4f | Alphabetic Presentation Forms |
63 | fb50 - fdff | Arabic Presentation Forms-A |
64 | fe20 - fe2f | Combining Half Marks |
65 | fe30 - fe4f | CJK Compatibility Forms |
66 | fe50 - fe6f | Small Form Variants |
67 | fe70 - fefe | Arabic Presentation Forms-B |
68 | ff00 - ffef | Halfwidth and Fullwidth Forms |
69 | fff0 - fffd | Specials |
70 | 0f00 - 0fcf | Tibetan |
71 | 0700 - 074f | Syriac |
72 | 0780 - 07bf | Thaana |
73 | 0d80 - 0dff | Sinhala |
74 | 1000 - 109f | Myanmar |
75 | 1200 - 12bf | Ethiopic |
76 | 13a0 - 13ff | Cherokee |
77 | 1400 - 14df | Canadian Aboriginal Syllabics |
78 | 1680 - 169f | Ogham |
79 | 16a0 - 16ff | Runic |
80 | 1780 - 17ff | Khmer |
81 | 1800 - 18af | Mongolian |
82 | 2800 - 28ff | Braille |
83 | a000 - a48c | Yi |
84-122 | | Reserved |
123 | | Windows 2000/XP: Layout progress: horizontal from right to left |
124 | | Windows 2000/XP: Layout progress: vertical before horizontal |
125 | | Windows 2000/XP: Layout progress: vertical bottom to top |
126 | | Reserved; must be 0 |
127 | | Reserved; must be 1 |
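To make the table concrete, here is a minimal C sketch of testing one of these bits. It assumes the 128 bits arrive as four 32-bit words, with bit 0 in the lowest bit of the first word (the layout used by the fsUsb member of the Win32 FONTSIGNATURE structure). The helper name usb_bit_is_set and the enum constants are made up for illustration and are not part of any Windows header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A few bit numbers taken from the table above (names are hypothetical). */
enum {
    USB_BASIC_LATIN = 0,
    USB_CYRILLIC = 9,
    USB_THAI = 24,
    USB_CJK_UNIFIED_IDEOGRAPHS = 59
};

/* Return 1 if the given bit (0..127) is set in the 128-bit field, else 0.
 * Assumption: usb[0] holds bits 0-31, usb[1] holds bits 32-63, and so on. */
int usb_bit_is_set(const uint32_t usb[4], unsigned bit)
{
    assert(usb != NULL);
    if (bit > 127)
        return 0;                      /* out of range: treat as not set */
    return (int)((usb[bit / 32] >> (bit % 32)) & 1u);
}
```

With a field whose first word is 0x00000201 and whose second word is 0x08000000, this reports Basic Latin, Cyrillic and CJK Unified Ideographs as set and Thai as not set.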
The following table lists the main code pages.
ANSI Code-Page Identifiers
Identifier | Meaning |
874 | Thai |
932 | Japanese |
936 | Chinese (PRC, Singapore) |
949 | Korean |
950 | Chinese (Taiwan; Hong Kong SAR, PRC) |
1200 | Unicode (BMP of ISO 10646) |
1250 | Windows 3.1 Eastern European |
1251 | Windows 3.1 Cyrillic |
1252 | Windows 3.1 Latin 1 (US, Western Europe) |
1253 | Windows 3.1 Greek |
1254 | Windows 3.1 Turkish |
1255 | Hebrew |
1256 | Arabic |
1257 | Baltic |
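You can enumerate the installed code pages with EnumSystemCodePages.
It looks complicated, and the two tables are hard to match against each other.
Based on some references and my own experiments, I worked out part of the mapping: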
Unicode subrange | Description | Codepage |
0x0000-0x007F | Basic Latin | 0 (CP_ACP) |
0x0080-0x00FF | Latin-1 Supplement | 1252 |
0x0100-0x017F | Latin Extended-A | 1250 |
0x0180-0x024F | Latin Extended-B | ??? |
|
|
|
0x0370-0x03FF | Basic Greek | 1253 |
0x0E00-0x0E7F | Thai | 874 |
0x0590-0x05FF | Basic Hebrew | 1255 |
0x0600-0x06FF | Basic Arabic | 1256 |
That is as far as I got.
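The partial mapping above can be turned into a small lookup. This is only a sketch. The struct, the table name and the function codepage_for_utf16 are hypothetical. The rows follow the summary table, with range boundaries taken from the standard Unicode block limits. Ranges the table leaves open, such as Latin Extended-B, are simply reported as "not found".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the summary table: an inclusive range and its code page. */
struct cp_range {
    uint16_t first;
    uint16_t last;
    unsigned codepage;
};

/* 0 stands for CP_ACP, as in the table. */
static const struct cp_range cp_table[] = {
    { 0x0000, 0x007F, 0    },  /* Basic Latin */
    { 0x0080, 0x00FF, 1252 },  /* Latin-1 Supplement */
    { 0x0100, 0x017F, 1250 },  /* Latin Extended-A */
    { 0x0370, 0x03FF, 1253 },  /* Basic Greek */
    { 0x0590, 0x05FF, 1255 },  /* Basic Hebrew */
    { 0x0600, 0x06FF, 1256 },  /* Basic Arabic */
    { 0x0E00, 0x0E7F, 874  },  /* Thai */
};

/* Look up the code page for one UTF-16 code unit.
 * Returns 1 and stores the code page on a hit, 0 if no row covers it. */
int codepage_for_utf16(uint16_t unit, unsigned *codepage)
{
    size_t i;
    assert(codepage != NULL);
    for (i = 0; i < sizeof cp_table / sizeof cp_table[0]; i++) {
        if (unit >= cp_table[i].first && unit <= cp_table[i].last) {
            *codepage = cp_table[i].codepage;
            return 1;
        }
    }
    return 0;
}
```

The value found this way would then be passed as the CodePage argument of WideCharToMultiByte. Text that mixes several ranges would have to be split into runs first.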
Transcoding on Linux
For transcoding on Linux I found the iconv family of functions, which is simple enough to use.
First, open: iconv_t iconv_open(const char *tocode, const char *fromcode);
Then convert: size_t iconv(iconv_t cd,
char **inbuf, size_t *inbytesleft,
char **outbuf, size_t *outbytesleft);
Finally, close: int iconv_close(iconv_t cd);
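Pay particular attention to the arguments of iconv_open: the target encoding (tocode) comes first and the source encoding (fromcode) second, and the two are easy to mix up.
The codes here are much like the code pages on Windows.
Run iconv --list to show all the codes it supports.
The simple approach is to put CP in front of the Windows code page number,
for example, code page 1250 becomes the code "CP1250".
Simple as it is, my carelessness (I passed the two codes in the wrong order) still cost me time.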
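The open, convert, close sequence above can be sketched as one helper. This is a minimal sketch, not a complete solution: the function name convert_bytes is made up, the whole input is converted in a single call, and a too-small output buffer is simply treated as failure. On glibc iconv is part of libc. Some other systems need -liconv when linking.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Convert inlen bytes from the encoding `from` into the encoding `to`.
 * Returns the number of bytes written to out, or (size_t)-1 on failure. */
size_t convert_bytes(const char *to, const char *from,
                     const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = (char *)in;     /* iconv takes char **; the input bytes are not modified */
    char *outp = out;
    size_t inleft = inlen, outleft = outcap;
    size_t rc;

    assert(to != NULL && from != NULL && in != NULL && out != NULL);

    cd = iconv_open(to, from);  /* target first, source second */
    if (cd == (iconv_t)-1)
        return (size_t)-1;      /* unknown code name or unsupported pair */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;
    return outcap - outleft;
}
```

For instance, convert_bytes("UTF-8", "CP1250", in, 1, out, sizeof out) with the single input byte 0xE9 should produce the two bytes 0xC3 0xA9. Swapping the first two arguments reproduces exactly the mistake described above.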
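There is one more question I have not found the answer to.
When iconv is called, inbuf is a char*, and for a multibyte encoding the bytes are stored high-order byte first.
For example, "尽" is 0xBEA1 in GB2312, and in inbuf it is stored as
inbuf[0] = 0xBE; inbuf[1] = 0xA1;
But when the encoding is UTF-16, the bytes in iconv's inbuf are stored low-order byte first.
For example, "尽" is 0x5C3D in UTF-16, and in inbuf it is stored as
inbuf[0] = 0x3D; inbuf[1] = 0x5C;
I can't figure out why. Big endian versus little endian is supposed to be a property of the CPU, so what is going on here? In the end I had to pick out the Unicode data and handle it separately.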
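A likely explanation, offered as an assumption rather than a verified answer: GB2312 is defined as a sequence of bytes, so the lead byte always comes first whatever the CPU. UTF-16 is defined as 16-bit code units, and a byte order only appears when those units are written to memory. The plain name "UTF-16" leaves that order to a byte order mark or to the implementation's default, while the names "UTF-16LE" and "UTF-16BE", which iconv implementations commonly accept, pin it down. The sketch below shows the two layouts for a single code unit. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize one UTF-16 code unit low-order byte first (the "UTF-16LE" layout). */
void put_utf16le(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit & 0xFF);
    out[1] = (unsigned char)(unit >> 8);
}

/* Serialize one UTF-16 code unit high-order byte first (the "UTF-16BE" layout). */
void put_utf16be(uint16_t unit, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(unit >> 8);
    out[1] = (unsigned char)(unit & 0xFF);
}
```

For 0x5C3D, put_utf16le yields 0x3D 0x5C, which matches what was observed in inbuf, and put_utf16be yields 0x5C 0x3D. Passing an explicit "UTF-16LE" or "UTF-16BE" to iconv_open should therefore make the byte order predictable without special-casing the Unicode data.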