Unicode encoding problems

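On Windows, transcoding is done with two functions: MultiByteToWideChar and WideCharToMultiByte.

Once you have the text in UTF-16, converting it to a native multibyte encoding requires the codepage parameter. The code page is what maps the UTF-16 code units onto the native encoding during conversion.

The trouble is that we do not know which code page to use (and it gets worse when the UTF-16 text mixes several languages).

So you have to look at the Unicode Subset Bitfields to work out the code page. The table below is the one I found on MSDN.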

Bit      Unicode subrange  Description

0        0020 - 007e       Basic Latin
1        00a0 - 00ff       Latin-1 Supplement
2        0100 - 017f       Latin Extended-A
3        0180 - 024f       Latin Extended-B
4        0250 - 02af       IPA Extensions
5        02b0 - 02ff       Spacing Modifier Letters
6        0300 - 036f       Combining Diacritical Marks
7        0370 - 03ff       Basic Greek
8                          Reserved
9        0400 - 04ff       Cyrillic
10       0530 - 058f       Armenian
11       0590 - 05ff       Basic Hebrew
12                         Reserved
13       0600 - 06ff       Basic Arabic
14                         Reserved
15       0900 - 097f       Devanagari
16       0980 - 09ff       Bengali
17       0a00 - 0a7f       Gurmukhi
18       0a80 - 0aff       Gujarati
19       0b00 - 0b7f       Oriya
20       0b80 - 0bff       Tamil
21       0c00 - 0c7f       Telugu
22       0c80 - 0cff       Kannada
23       0d00 - 0d7f       Malayalam
24       0e00 - 0e7f       Thai
25       0e80 - 0eff       Lao
26       10a0 - 10ff       Basic Georgian
27                         Reserved
28       1100 - 11ff       Hangul Jamo
29       1e00 - 1eff       Latin Extended Additional
30       1f00 - 1fff       Greek Extended
31       2000 - 206f       General Punctuation
32       2070 - 209f       Subscripts and Superscripts
33       20a0 - 20cf       Currency Symbols
34       20d0 - 20ff       Combining Diacritical Marks for Symbols
35       2100 - 214f       Letter-like Symbols
36       2150 - 218f       Number Forms
37       2190 - 21ff       Arrows
38       2200 - 22ff       Mathematical Operators
39       2300 - 23ff       Miscellaneous Technical
40       2400 - 243f       Control Pictures
41       2440 - 245f       Optical Character Recognition
42       2460 - 24ff       Enclosed Alphanumerics
43       2500 - 257f       Box Drawing
44       2580 - 259f       Block Elements
45       25a0 - 25ff       Geometric Shapes
46       2600 - 26ff       Miscellaneous Symbols
47       2700 - 27bf       Dingbats
48       3000 - 303f       Chinese, Japanese, and Korean (CJK) Symbols and Punctuation
49       3040 - 309f       Hiragana
50       30a0 - 30ff       Katakana
51       3100 - 312f       Bopomofo
         31a0 - 31bf       Extended Bopomofo
52       3130 - 318f       Hangul Compatibility Jamo
53       3190 - 319f       CJK Miscellaneous
54       3200 - 32ff       Enclosed CJK Letters and Months
55       3300 - 33ff       CJK Compatibility
56       ac00 - d7a3       Hangul
57       d800 - dfff       Surrogates. Note that setting this bit implies that there is at least one codepoint beyond the Basic Multilingual Plane that is supported by this font.
58                         Reserved
59       4e00 - 9fff       CJK Unified Ideographs
         2e80 - 2eff       CJK Radicals Supplement
         2f00 - 2fdf       Kangxi Radicals
         2ff0 - 2fff       Ideographic Description
         3400 - 4dbf       CJK Unified Ideograph Extension A
60       e000 - f8ff       Private Use Area
61       f900 - faff       CJK Compatibility Ideographs
62       fb00 - fb4f       Alphabetic Presentation Forms
63       fb50 - fdff       Arabic Presentation Forms-A
64       fe20 - fe2f       Combining Half Marks
65       fe30 - fe4f       CJK Compatibility Forms
66       fe50 - fe6f       Small Form Variants
67       fe70 - fefe       Arabic Presentation Forms-B
68       ff00 - ffef       Halfwidth and Fullwidth Forms
69       fff0 - fffd       Specials
70       0f00 - 0fcf       Tibetan
71       0700 - 074f       Syriac
72       0780 - 07bf       Thaana
73       0d80 - 0dff       Sinhala
74       1000 - 109f       Myanmar
75       1200 - 12bf       Ethiopic
76       13a0 - 13ff       Cherokee
77       1400 - 14df       Canadian Aboriginal Syllabics
78       1680 - 169f       Ogham
79       16a0 - 16ff       Runic
80       1780 - 17ff       Khmer
81       1800 - 18af       Mongolian
82       2800 - 28ff       Braille
83       a000 - a48c       Yi, Yi Radicals
84-122                     Reserved
123                        Windows 2000/XP: Layout progress: horizontal from right to left
124                        Windows 2000/XP: Layout progress: vertical before horizontal
125                        Windows 2000/XP: Layout progress: vertical bottom to top
126                        Reserved; must be 0
127                        Reserved; must be 1
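
To make the bit numbers above concrete: on Windows these 128 bits arrive as the four-DWORD fsUsb member of the FONTSIGNATURE structure. The sketch below is portable C, not Windows code. The function name usb_bit_is_set is hypothetical, and it assumes that bit 0 is the least significant bit of the first 32-bit word.

```c
#include <stdint.h>

/* Hypothetical helper (not a Windows API): tests one bit of a 128-bit
 * Unicode-subset bitfield stored as four 32-bit words.
 * Assumption: bit 0 is the least significant bit of usb[0]. */
int usb_bit_is_set(const uint32_t usb[4], unsigned bit)
{
    if (bit > 127)
        return 0;                       /* outside the 128-bit field */
    return (int)((usb[bit / 32] >> (bit % 32)) & 1u);
}
```

With this, testing for Cyrillic coverage would be usb_bit_is_set(usb, 9), and testing for CJK Unified Ideographs would be usb_bit_is_set(usb, 59).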

 

The table below lists the main code pages.

ANSI Code-Page Identifiers

Identifier  Meaning

874         Thai
932         Japanese
936         Chinese (PRC, Singapore)
949         Korean
950         Chinese (Taiwan; Hong Kong SAR, PRC)
1200        Unicode (BMP of ISO 10646)
1250        Windows 3.1 Eastern European
1251        Windows 3.1 Cyrillic
1252        Windows 3.1 Latin 1 (US, Western Europe)
1253        Windows 3.1 Greek
1254        Windows 3.1 Turkish
1255        Hebrew
1256        Arabic
1257        Baltic

You can enumerate the code pages with EnumSystemCodePages.

 

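It looks complicated, and the two tables seem hard to line up.

From some references and my own experiments, I worked out part of the mapping: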

Unicode subrange  Description         Codepage

0x0000-0x007F     Basic Latin         0 (CP_ACP)
0x0080-0x00FF     Latin-1 Supplement  1252
0x0100-0x017F     Latin Extended-A    1250
0x0180-0x024F     Latin Extended-B    ???
0x0370-0x03FF     Basic Greek         1253
0x0E00-0x0E7F     Thai                874
0x0590-0x05FF     Basic Hebrew        1255
0x0600-0x06FF     Basic Arabic        1256

That is about as far as I could get.
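
The partial mapping can be written down as a small lookup, shown here only as a sketch. The function name codepage_for_utf16_unit and the CODEPAGE_UNKNOWN sentinel are hypothetical. The sketch takes Latin-1 Supplement as 0x0080-0x00FF and Basic Arabic as 0x0600-0x06FF, matching the MSDN table, and it reports every range I have no answer for (including Latin Extended-B) as unknown.

```c
#include <stdint.h>

#define CODEPAGE_UNKNOWN (-1)   /* hypothetical sentinel: no mapping found */

/* Hypothetical helper: returns the code page suggested by the table above
 * for a single UTF-16 code unit, or CODEPAGE_UNKNOWN. */
int codepage_for_utf16_unit(uint16_t u)
{
    if (u <= 0x007F)                return 0;     /* Basic Latin: CP_ACP */
    if (u <= 0x00FF)                return 1252;  /* Latin-1 Supplement  */
    if (u <= 0x017F)                return 1250;  /* Latin Extended-A    */
    if (u >= 0x0370 && u <= 0x03FF) return 1253;  /* Basic Greek         */
    if (u >= 0x0590 && u <= 0x05FF) return 1255;  /* Basic Hebrew        */
    if (u >= 0x0600 && u <= 0x06FF) return 1256;  /* Basic Arabic        */
    if (u >= 0x0E00 && u <= 0x0E7F) return 874;   /* Thai                */
    return CODEPAGE_UNKNOWN;                      /* everything else     */
}
```

A string that mixes languages would have to be split into runs that share a code page before each run is handed to WideCharToMultiByte, which is exactly the awkward part.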

 

 

Transcoding on Linux

For transcoding on Linux I found the iconv family of functions, which turned out to be simple to use.

Open first: iconv_t iconv_open(const char *tocode, const char *fromcode);
Then convert: size_t iconv(iconv_t cd,
                     char **inbuf, size_t *inbytesleft,
                     char **outbuf, size_t *outbytesleft);
Finally close: int iconv_close(iconv_t cd);
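
The three calls fit together roughly as follows. This is a minimal sketch rather than a complete implementation: the helper name convert_bytes is hypothetical, it converts the whole input in one call, and it treats any error (including an output buffer that is too small) as failure.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts inlen bytes of `in` from `fromcode` to
 * `tocode`. Returns the number of bytes written to `out`, or (size_t)-1. */
size_t convert_bytes(const char *tocode, const char *fromcode,
                     const char *in, size_t inlen,
                     char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);

    /* Step 1: open. Note the order: target encoding first, source second. */
    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* Step 2: convert. iconv advances the pointers and shrinks the counters. */
    char *inptr = (char *)in;   /* iconv takes char **, the input is only read */
    char *outptr = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);

    /* Flush any pending shift state (matters only for stateful encodings). */
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outptr, &outleft);

    /* Step 3: close, whether or not the conversion succeeded. */
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

A call such as convert_bytes("UTF-8", "GB2312", src, srclen, dst, sizeof dst) would then convert GB2312 text to UTF-8, assuming the local iconv knows both names.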

 

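Pay particular attention to the arguments of iconv_open: the two codes are easy to get wrong, because the target (tocode) comes first and the source (fromcode) second.

The codes here closely resemble code pages on Windows.

The command iconv --list shows every code it supports.

The simple rule is to prefix the Windows code page number with CP.

For example, code page 1250 becomes the code "CP1250".

Simple as it is, I still paid for my carelessness in lost time: I had passed the two codes in the wrong order.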
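
The naming rule and the argument order can be captured in one small helper, sketched here under assumptions: the name open_codepage_to_utf8 is hypothetical, UTF-8 is chosen as the target only for illustration, and not every Windows code page number is guaranteed to exist under a "CP" name in a given iconv build.

```c
#include <assert.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: opens a converter from a Windows code page to UTF-8.
 * Returns (iconv_t)-1 if iconv does not know the resulting "CPnnnn" name. */
iconv_t open_codepage_to_utf8(unsigned codepage)
{
    char fromcode[16];
    int n = snprintf(fromcode, sizeof fromcode, "CP%u", codepage);
    assert(n > 0 && (size_t)n < sizeof fromcode);
    (void)n;   /* keeps the build warning-free when NDEBUG is defined */

    /* tocode first, fromcode second: swapping them is the easy mistake. */
    return iconv_open("UTF-8", fromcode);
}
```

Checking the return value against (iconv_t)-1 before converting catches both a misspelled code and a code page the local iconv does not ship.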

 

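There is one more question I have not found the answer to.

When iconv is called, inbuf is normally a char*, and the bytes in it are stored high byte first.

For example, the GB2312 encoding of "尽" is 0xBEA1, so inbuf should hold

inbuf[0] = 0xBE; inbuf[1] = 0xA1;

But when the encoding is UTF-16, the bytes that go into iconv through inbuf are in the opposite order, low byte first.

For example, the UTF-16 encoding of "尽" is 0x5C3D, and inbuf ended up holding

      inbuf[0] = 0x3D; inbuf[1] = 0x5C;

I cannot work out why. Big endian and little endian are a property of the CPU, so what is going on here? In the end I had to pick the Unicode part out and handle it separately.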
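
One likely explanation, offered as a sketch and not as a definitive answer: GB2312 is defined as a sequence of bytes, so its order is fixed on every machine, whereas UTF-16 is defined in 16-bit code units, and the code name decides how each unit is split into bytes. iconv accepts the explicit names "UTF-16LE" and "UTF-16BE", while a plain "UTF-16" leaves the choice to the implementation or to a byte order mark. The helper name gb2312_to_utf16 below is hypothetical. The example produces UTF-16 as output, and the same byte layouts apply when UTF-16 is the input.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts GB2312 bytes to the UTF-16 variant named by
 * utf16_name ("UTF-16LE" or "UTF-16BE"). Returns bytes written or (size_t)-1. */
size_t gb2312_to_utf16(const char *utf16_name,
                       const char *in, size_t inlen,
                       unsigned char *out, size_t outcap)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open(utf16_name, "GB2312");   /* tocode, fromcode */
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inptr = (char *)in;       /* the input bytes are only read */
    char *outptr = (char *)out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    return rc == (size_t)-1 ? (size_t)-1 : outcap - outleft;
}
```

Feeding it the two GB2312 bytes 0xBE 0xA1 with "UTF-16LE" should give the low-byte-first layout seen above, and "UTF-16BE" the reverse. A buffer copied straight out of memory on a little-endian CPU is in the "UTF-16LE" layout, so naming the byte order explicitly in iconv_open may remove the need to treat the Unicode data as a special case.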
