Unicode Encoding Problems

thisisll

Article link: https://blog.csdn.net/thisisll/article/details/723602

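On Windows, transcoding is done with two functions: MultiByteToWideChar and WideCharToMultiByte.

Once you have UTF-16 text, converting it to a native multibyte encoding requires a code page parameter, which determines how the UTF-16 code units map to bytes in the target encoding.

The trouble is that we do not know which code page to use (and it gets worse when the UTF-16 text mixes several languages).

So one has to consult the Unicode Subset Bitfields to determine the code page. The table below is what I found on MSDN.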

Bit	Unicode subrange	Description
0	0020 - 007e	Basic Latin
1	00a0 - 00ff	Latin-1 Supplement
2	0100 - 017f	Latin Extended-A
3	0180 - 024f	Latin Extended-B
4	0250 - 02af	IPA Extensions
5	02b0 - 02ff	Spacing Modifier Letters
6	0300 - 036f	Combining Diacritical Marks
7	0370 - 03ff	Basic Greek
8		Reserved
9	0400 - 04ff	Cyrillic
10	0530 - 058f	Armenian
11	0590 - 05ff	Basic Hebrew
12		Reserved
13	0600 - 06ff	Basic Arabic
14		Reserved
15	0900 - 097f	Devanagari
16	0980 - 09ff	Bengali
17	0a00 - 0a7f	Gurmukhi
18	0a80 - 0aff	Gujarati
19	0b00 - 0b7f	Oriya
20	0b80 - 0bff	Tamil
21	0c00 - 0c7f	Telugu
22	0c80 - 0cff	Kannada
23	0d00 - 0d7f	Malayalam
24	0e00 - 0e7f	Thai
25	0e80 - 0eff	Lao
26	10a0 - 10ff	Basic Georgian
27		Reserved
28	1100 - 11ff	Hangul Jamo
29	1e00 - 1eff	Latin Extended Additional
30	1f00 - 1fff	Greek Extended
31	2000 - 206f	General Punctuation
32	2070 - 209f	Subscripts and Superscripts
33	20a0 - 20cf	Currency Symbols
34	20d0 - 20ff	Combining Diacritical Marks for Symbols
35	2100 - 214f	Letter-like Symbols
36	2150 - 218f	Number Forms
37	2190 - 21ff	Arrows
38	2200 - 22ff	Mathematical Operators
39	2300 - 23ff	Miscellaneous Technical
40	2400 - 243f	Control Pictures
41	2440 - 245f	Optical Character Recognition
42	2460 - 24ff	Enclosed Alphanumerics
43	2500 - 257f	Box Drawing
44	2580 - 259f	Block Elements
45	25a0 - 25ff	Geometric Shapes
46	2600 - 26ff	Miscellaneous Symbols
47	2700 - 27bf	Dingbats
48	3000 - 303f	Chinese, Japanese, and Korean (CJK) Symbols and Punctuation
49	3040 - 309f	Hiragana
50	30a0 - 30ff	Katakana
51	3100 - 312f, 31a0 - 31bf	Bopomofo, Extended Bopomofo
52	3130 - 318f	Hangul Compatibility Jamo
53	3190 - 319f	CJK Miscellaneous
54	3200 - 32ff	Enclosed CJK Letters and Months
55	3300 - 33ff	CJK Compatibility
56	ac00 - d7a3	Hangul
57	d800 - dfff	Surrogates. Note that setting this bit implies that there is at least one codepoint beyond the Basic Multilingual Plane that is supported by this font.
58		Reserved
59	4e00 - 9fff, 2e80 - 2eff, 2f00 - 2fdf, 2ff0 - 2fff, 3400 - 4dbf	CJK Unified Ideographs, CJK Radicals Supplement, Kangxi Radicals, Ideographic Description, CJK Unified Ideograph Extension A
60	e000 - f8ff	Private Use Area
61	f900 - faff	CJK Compatibility Ideographs
62	fb00 - fb4f	Alphabetic Presentation Forms
63	fb50 - fdff	Arabic Presentation Forms-A
64	fe20 - fe2f	Combining Half Marks
65	fe30 - fe4f	CJK Compatibility Forms
66	fe50 - fe6f	Small Form Variants
67	fe70 - fefe	Arabic Presentation Forms-B
68	ff00 - ffef	Halfwidth and Fullwidth Forms
69	fff0 - fffd	Specials
70	0f00 - 0fcf	Tibetan
71	0700 - 074f	Syriac
72	0780 - 07bf	Thaana
73	0d80 - 0dff	Sinhala
74	1000 - 109f	Myanmar
75	1200 - 12bf	Ethiopic
76	13a0 - 13ff	Cherokee
77	1400 - 14df	Canadian Aboriginal Syllabics
78	1680 - 169f	Ogham
79	16a0 - 16ff	Runic
80	1780 - 17ff	Khmer
81	1800 - 18af	Mongolian
82	2800 - 28ff	Braille
83	a000 - a48c	Yi, Yi Radicals
84-122		Reserved
123		Windows 2000/XP: Layout progress: horizontal from right to left
124		Windows 2000/XP: Layout progress: vertical before horizontal
125		Windows 2000/XP: Layout progress: vertical bottom to top
126		Reserved; must be 0
127		Reserved; must be 1
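
To make the bit numbering in this table concrete, here is a minimal sketch of testing a single bit. It assumes the 128 bits are stored as four 32-bit words, least significant word first, which is how I understand the fsUsb[4] member of the Win32 FONTSIGNATURE structure to be laid out. The function name usb_bit_is_set is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if Unicode Subset Bitfield number `bit` (0..127) is set,
 * otherwise 0. `usb` holds the 128 bits as four 32-bit words, lowest
 * word first (assumed to match FONTSIGNATURE.fsUsb). Hypothetical helper. */
int usb_bit_is_set(const uint32_t usb[4], int bit)
{
    assert(bit >= 0 && bit < 128);
    /* bit / 32 selects the word, bit % 32 selects the bit inside it. */
    return (int)((usb[bit / 32] >> (bit % 32)) & 1u);
}
```

For example, bit 59 (CJK Unified Ideographs) lives in the second word, at position 59 - 32 = 27.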

The table below lists the main code pages.

ANSI Code-Page Identifiers

Identifier	Meaning
874	Thai
932	Japanese
936	Chinese (PRC, Singapore)
949	Korean
950	Chinese (Taiwan; Hong Kong SAR, PRC)
1200	Unicode (BMP of ISO 10646)
1250	Windows 3.1 Eastern European
1251	Windows 3.1 Cyrillic
1252	Windows 3.1 Latin 1 (US, Western Europe)
1253	Windows 3.1 Greek
1254	Windows 3.1 Turkish
1255	Hebrew
1256	Arabic
1257	Baltic

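Code pages can be enumerated with EnumSystemCodePages.

It looks complicated, and the two tables seem hard to match up.

Based on some references and my own experiments, I worked out part of the mapping: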

Unicode subrange	Description	Codepage
0x0000-0x007F	Basic Latin	0 (CP_ACP)
0x0080-0x00FF	Latin-1 Supplement	1252
0x0100-0x017F	Latin Extended-A	1250
0x0180-0x024F	Latin Extended-B	???
0x0370-0x03FF	Basic Greek	1253
0x0E00-0x0E7F	Thai	874
0x0590-0x05FF	Basic Hebrew	1255
0x0600-0x06FF	Basic Arabic	1256

That is about as far as I got.
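
The partial mapping can be written down as a lookup function. This is only a sketch of the experimental table, not a complete or authoritative mapping: the function name codepage_for_utf16_unit is made up, the Arabic range is bounded at 0x06FF as in the MSDN subrange table, and every range the table does not cover (including Latin Extended-B) returns -1.

```c
#include <stdint.h>

/* Candidate ANSI code page for one UTF-16 code unit, following the
 * partial experimental mapping. Returns 0 (the value of CP_ACP) for
 * Basic Latin and -1 for ranges with no known mapping.
 * Hypothetical helper, not a Windows API. */
int codepage_for_utf16_unit(uint16_t u)
{
    if (u <= 0x007F) return 0;                     /* Basic Latin: CP_ACP */
    if (u <= 0x00FF) return 1252;                  /* Latin-1 Supplement  */
    if (u <= 0x017F) return 1250;                  /* Latin Extended-A    */
    if (u >= 0x0370 && u <= 0x03FF) return 1253;   /* Basic Greek         */
    if (u >= 0x0590 && u <= 0x05FF) return 1255;   /* Basic Hebrew        */
    if (u >= 0x0600 && u <= 0x06FF) return 1256;   /* Basic Arabic        */
    if (u >= 0x0E00 && u <= 0x0E7F) return 874;    /* Thai                */
    return -1;                                     /* not covered         */
}
```

A caller could run this over each code unit of a string and split the text wherever the result changes, which is one way to handle input that mixes several languages.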

Transcoding on Linux

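For transcoding on Linux I found the iconv family of functions, which is fairly simple to use.

First open a conversion descriptor: iconv_t iconv_open(const char *tocode, const char *fromcode);

Then convert: size_t iconv(iconv_t cd,

                     char **inbuf, size_t *inbytesleft,

                     char **outbuf, size_t *outbytesleft);

Finally close it: int iconv_close(iconv_t cd);

Pay particular attention to the arguments of iconv_open. The two encoding names are easy to mix up: the destination (tocode) comes first and the source (fromcode) second.

These encoding names play much the same role as code pages on Windows.

The command iconv --list prints every encoding name it supports.

In simple cases you can just put CP in front of the Windows code page number,

for example, code page 1250 becomes "CP1250".

Simple as it is, my carelessness (I passed the two names in the wrong order) still cost me a fair amount of time.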
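
The open, convert, close sequence can be wrapped in one function. The following is a minimal sketch rather than a full implementation: the name convert_encoding is made up, it converts a whole buffer in one call, and it treats any error (unknown encoding name, invalid input, output buffer too small) as a plain failure.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Converts `inlen` bytes at `in` from encoding `fromcode` to encoding
 * `tocode`. Writes at most `outcap` bytes to `out` and returns the number
 * of bytes written, or (size_t)-1 on any failure. Hypothetical helper. */
size_t convert_encoding(const char *tocode, const char *fromcode,
                        const char *in, size_t inlen,
                        char *out, size_t outcap)
{
    assert(tocode && fromcode && in && out);

    /* Destination encoding is the FIRST argument, source is the second. */
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inptr = (char *)in;   /* iconv takes char **, hence the cast */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft); /* flush any shift state */
    iconv_close(cd);

    if (rc == (size_t)-1 || inleft != 0)
        return (size_t)-1;
    return outcap - outleft;
}
```

With this wrapper, converting CP1250 text to UTF-8 would be convert_encoding("UTF-8", "CP1250", ...), assuming the local iconv supports both names. Swapping the first two arguments is exactly the mistake described above.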

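There is one more problem that I have not found an answer to.

When calling iconv, inbuf is a char *. For a multibyte encoding, the bytes in it are stored with the high (lead) byte first.

For example, the GB2312 code for “尽” is 0xBEA1, and in inbuf it should be stored as

inbuf[0] = 0xBE; inbuf[1] = 0xA1;

But when the encoding is UTF-16, the bytes passed to iconv in inbuf are in the opposite order, low byte first.

For example, the UTF-16 code for “尽” is 0x5C3D, and in inbuf it was stored as

inbuf[0] = 0x3D; inbuf[1] = 0x5C;

I do not understand why. Big-endian versus little-endian is a matter of the CPU, so what is going on here? In the end I had to pick out the Unicode part and handle it separately.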
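
A likely explanation, offered as my reading rather than something the post confirms: GB2312 is defined as a sequence of bytes, so its byte order is fixed, whereas UTF-16 is defined in 16-bit units, and a buffer of 16-bit integers written on a little-endian CPU has the low byte first. iconv accepts the names "UTF-16LE" and "UTF-16BE", so the byte order can be stated explicitly instead of being left to a default. The sketch below (the helper name utf16_bytes_to_utf8 is made up) decodes the same character from both byte orders.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Decodes UTF-16 bytes to UTF-8. `fromcode` should name the byte order
 * explicitly: "UTF-16LE" or "UTF-16BE". Returns the number of bytes
 * written to `out`, or (size_t)-1 on failure. Hypothetical helper. */
size_t utf16_bytes_to_utf8(const char *fromcode,
                           const unsigned char *in, size_t inlen,
                           char *out, size_t outcap)
{
    assert(fromcode && in && out);

    iconv_t cd = iconv_open("UTF-8", fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inptr = (char *)in;   /* iconv takes char **, hence the cast */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

Passing the bytes 0x3D 0x5C with "UTF-16LE" and the bytes 0x5C 0x3D with "UTF-16BE" should both yield the UTF-8 bytes 0xE5 0xB0 0xBD for “尽”, so naming the byte order avoids having to reorder the input by hand.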
