A Brief Look at Unicode Internal Codes

A while ago I wrote a program that fetches the contact list from Hotmail, and found that what came back were codes like &#39123;

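I expected to get the Chinese character "飓", but what came back was &#39123;. After searching online, I learned that this is the Unicode internal code (the code point, written in decimal) of the character, and that there is a huge table mapping these codes to Chinese characters.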

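To convert a code back into a character, do I really have to load that table and look it up? That would be a real hassle. A little analysis showed the rule relating the two, and it is simple. The UTF-16 encoding I got for a Chinese character was 4 bytes long. A character in the Basic Multilingual Plane takes only 2 bytes in UTF-16, so the first 2 bytes are presumably a byte-order mark added by the encoder. Take the last two bytes, interpret them as a big-endian unsigned 16-bit integer, and you get the code.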
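The forward direction described above can be sketched in Python. This is a minimal illustration, not the author's original program: the helper names `utf16_with_bom` and `code_from_utf16` are hypothetical, and the big-endian byte-order mark is prepended explicitly on the assumption that this is where the two extra bytes came from.

```python
import codecs


def utf16_with_bom(ch: str) -> bytes:
    """Encode one BMP character as big-endian UTF-16 with a leading BOM.

    Assumption: the 4-byte result seen in the article is BOM + one code unit.
    """
    return codecs.BOM_UTF16_BE + ch.encode("utf-16-be")


def code_from_utf16(data: bytes) -> int:
    """Interpret the last two bytes as a big-endian unsigned 16-bit integer."""
    return int.from_bytes(data[-2:], byteorder="big", signed=False)


encoded = utf16_with_bom("飓")
code = code_from_utf16(encoded)
print(encoded.hex())       # feff98d3
print(code, f"&#{code};")  # 39123 &#39123;
```

For a character in the Basic Multilingual Plane the result equals `ord(ch)`, so no lookup table is needed.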

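What I actually need is to go from the code to the character, which is simply the reverse of the process above. Once that process is understood, the reverse conversion is easy.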

For example: &#39123; -> decimal 39123 -> hexadecimal 98d3 -> byte array [-104, -45] (as signed bytes) -> decoded as UTF-16 gives "飓"
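The reverse conversion in the example above can be sketched as follows. The function names `signed_bytes` and `decode_ncr` are hypothetical, and the sketch only covers decimal references within the Basic Multilingual Plane; anything else is left untouched.

```python
import re


def signed_bytes(code: int) -> list:
    """Split a 16-bit code into two big-endian bytes, shown as signed values."""
    raw = code.to_bytes(2, "big")
    return [b - 256 if b > 127 else b for b in raw]


def decode_ncr(text: str) -> str:
    """Replace decimal references such as &#39123; with the character itself."""

    def _replace(match):
        code = int(match.group(1))
        # Outside this sketch's scope: supplementary planes and lone surrogates.
        if code > 0xFFFF or 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        # number -> two big-endian bytes -> decode as UTF-16
        return code.to_bytes(2, "big").decode("utf-16-be")

    return re.sub(r"&#(\d+);", _replace, text)


print(signed_bytes(39123))     # [-104, -45]
print(decode_ncr("&#39123;"))  # 飓
```

In practice Python's standard `html.unescape` already handles numeric character references, so the manual byte handling is only needed to illustrate the mapping.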

The same approach works for the Unicode codes of other special characters.


Summary of Unicode Internal Code Issues

02-14

I asked about this a while ago. I have since worked out a few things and would like to share them.

On Windows, conversion is done with the two functions MultiByteToWideChar and WideCharToMultiByte.
Once you have UTF-16 text, converting it to an internal (multibyte) code requires a codepage parameter, which maps the UTF-16 codes to the corresponding internal codes.
But we don't know which code page to use (and it gets worse when the UTF-16 text mixes several languages).
So you have to look at the Unicode Subset Bitfields to determine the code page. The table below is what I found on MSDN.

Bit      Unicode subrange   Description
0        0020 - 007e        Basic Latin
1        00a0 - 00ff        Latin-1 Supplement
2        0100 - 017f        Latin Extended-A
3        0180 - 024f        Latin Extended-B
4        0250 - 02af        IPA Extensions
5        02b0 - 02ff        Spacing Modifier Letters
6        0300 - 036f        Combining Diacritical Marks
7        0370 - 03ff        Basic Greek
8                           Reserved
9        0400 - 04ff        Cyrillic
10       0530 - 058f        Armenian
11       0590 - 05ff        Basic Hebrew
12                          Reserved
13       0600 - 06ff        Basic Arabic
14                          Reserved
15       0900 - 097f        Devanagari
16       0980 - 09ff        Bengali
17       0a00 - 0a7f        Gurmukhi
18       0a80 - 0aff        Gujarati
19       0b00 - 0b7f        Oriya
20       0b80 - 0bff        Tamil
21       0c00 - 0c7f        Telugu
22       0c80 - 0cff        Kannada
23       0d00 - 0d7f        Malayalam
24       0e00 - 0e7f        Thai
25       0e80 - 0eff        Lao
26       10a0 - 10ff        Basic Georgian
27                          Reserved
28       1100 - 11ff        Hangul Jamo
29       1e00 - 1eff        Latin Extended Additional
30       1f00 - 1fff        Greek Extended
31       2000 - 206f        General Punctuation
32       2070 - 209f        Subscripts and Superscripts
33       20a0 - 20cf        Currency Symbols
34       20d0 - 20ff        Combining Diacritical Marks for Symbols
35       2100 - 214f        Letter-like Symbols
36       2150 - 218f        Number Forms
37       2190 - 21ff        Arrows
38       2200 - 22ff        Mathematical Operators
39       2300 - 23ff        Miscellaneous Technical
40       2400 - 243f        Control Pictures
41       2440 - 245f        Optical Character Recognition
42       2460 - 24ff        Enclosed Alphanumerics
43       2500 - 257f        Box Drawing
44       2580 - 259f        Block Elements
45       25a0 - 25ff        Geometric Shapes
46       2600 - 26ff        Miscellaneous Symbols
47       2700 - 27bf        Dingbats
48       3000 - 303f        Chinese, Japanese, and Korean (CJK) Symbols and Punctuation
49       3040 - 309f        Hiragana
50       30a0 - 30ff        Katakana
51       3100 - 312f        Bopomofo
         31a0 - 31bf        Extended Bopomofo
52       3130 - 318f        Hangul Compatibility Jamo
53       3190 - 319f        CJK Miscellaneous
54       3200 - 32ff        Enclosed CJK Letters and Months
55       3300 - 33ff        CJK Compatibility
56       ac00 - d7a3        Hangul
57       d800 - dfff        Surrogates. Note that setting this bit implies that there is at least one codepoint beyond the Basic Multilingual Plane that is supported by this font.
58                          Reserved
59       4e00 - 9fff        CJK Unified Ideographs
         2e80 - 2eff        CJK Radicals Supplement
         2f00 - 2fdf        Kangxi Radicals
         2ff0 - 2fff        Ideographic Description
         3400 - 4dbf        CJK Unified Ideograph Extension A
60       e000 - f8ff        Private Use Area
61       f900 - faff        CJK Compatibility Ideographs
62       fb00 - fb4f        Alphabetic Presentation Forms
63       fb50 - fdff        Arabic Presentation Forms-A
64       fe20 - fe2f        Combining Half Marks
65       fe30 - fe4f        CJK Compatibility Forms
66       fe50 - fe6f        Small Form Variants
67       fe70 - fefe        Arabic Presentation Forms-B
68       ff00 - ffef        Halfwidth and Fullwidth Forms
69       fff0 - fffd        Specials
70       0f00 - 0fcf        Tibetan
71       0700 - 074f        Syriac
72       0780 - 07bf        Thaana
73       0d80 - 0dff        Sinhala
74       1000 - 109f        Myanmar
75       1200 - 12bf        Ethiopic
76       13a0 - 13ff        Cherokee
77       1400 - 14df        Canadian Aboriginal Syllabics
78       1680 - 169f        Ogham
79       16a0 - 16ff        Runic
80       1780 - 17ff        Khmer
81       1800 - 18af        Mongolian
82       2800 - 28ff        Braille
83       a000 - a48c        Yi; Yi Radicals
84-122                      Reserved
123                         Windows 2000/XP: Layout progress: horizontal from right to left
124                         Windows 2000/XP: Layout progress: vertical before horizontal
125                         Windows 2000/XP: Layout progress: vertical bottom to top
126                         Reserved; must be 0
127                         Reserved; must be 1

The table below lists the main code pages.

ANSI Code-Page Identifiers
Identifier   Meaning
874          Thai
932          Japanese
936          Chinese (PRC, Singapore)
949          Korean
950          Chinese (Taiwan; Hong Kong SAR, PRC)
1200         Unicode (BMP of ISO 10646)
1250         Windows 3.1 Eastern European
1251         Windows 3.1 Cyrillic
1252         Windows 3.1 Latin 1 (US, Western Europe)
1253         Windows 3.1 Greek
1254         Windows 3.1 Turkish
1255         Hebrew
1256         Arabic
1257         Baltic

You can enumerate the code pages with EnumSystemCodePages.

It looks complicated, and the two tables seem hard to map onto each other.
Based on some references and my own experiments, I worked out part of the mapping:

Unicode subrange   Description          Codepage
0x0000-0x007F      Basic Latin          0 (CP_ACP)
0x0080-0x00FF      Latin-1 Supplement   1252
0x0100-0x017F      Latin Extended-A     1250
0x0180-0x024F      Latin Extended-B     ???

0x0370-0x03FF      Basic Greek          1253
0x0E00-0x0E7F      Thai                 874
0x0590-0x05FF      Basic Hebrew         1255
0x0600-0x07FF      Basic Arabic         1256

That is about as far as I could get.


Conversion on Linux
For conversion on Linux I found the iconv family of functions, which is simple enough to use.
First open a conversion descriptor:
    iconv_t iconv_open(const char *tocode, const char *fromcode);
Then convert:
    size_t iconv(iconv_t cd,
                 char **inbuf, size_t *inbytesleft,
                 char **outbuf, size_t *outbytesleft);
Finally close it:
    int iconv_close(iconv_t cd);

Pay particular attention to the arguments of iconv_open: the target code comes first and the source code second, and the two are easy to mix up.
These codes are much like code pages on Windows.
The command iconv --list shows every code it supports.
The simple approach is to put CP in front of the Windows code page number,
e.g. for code page 1250 the code is "CP1250".
Simple as it is, I still paid in time for my carelessness (I passed the two codes in the wrong order).

There is one more question I have not found an answer to.
When calling iconv, inbuf is normally a char*, and the data in it is stored high byte first.
For example, the GB2312 encoding of "尽" is 0xBEA1, which is stored in inbuf as
    inbuf[0] = 0xBE; inbuf[1] = 0xA1;
But when the encoding is UTF-16, the bytes in the inbuf passed to iconv are low byte first.
For example, the UTF-16 encoding of "尽" is 0x5C3D, which ends up in inbuf as
    inbuf[0] = 0x3D; inbuf[1] = 0x5C;
I can't work out why. Big endian versus little endian is a property of the CPU, so what is going on here? In the end I had to pick out the Unicode data and handle it separately.
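One likely explanation for the open question about byte order, offered as a sketch rather than a definitive answer: GB2312 is defined as a sequence of bytes, so its order is fixed, whereas UTF-16 is defined in 16-bit code units, which must be serialized as either little endian or big endian. A 16-bit value 0x5C3D held natively on a little-endian CPU sits in memory as 3D 5C. The Python snippet below only illustrates the three byte layouts. The helper `byte_layouts` is hypothetical, and the codec names are Python's own, not iconv's. iconv accepts the explicit names "UTF-16LE" and "UTF-16BE", which avoids depending on the native order.

```python
def byte_layouts(ch: str) -> dict:
    """Return the raw bytes of one character in three encodings."""
    return {
        "GB2312": ch.encode("gb2312"),       # byte-oriented: lead byte first
        "UTF-16LE": ch.encode("utf-16-le"),  # 16-bit unit, low byte first
        "UTF-16BE": ch.encode("utf-16-be"),  # 16-bit unit, high byte first
    }


layouts = byte_layouts("尽")
for name, data in layouts.items():
    print(name, data.hex())
# GB2312 bea1
# UTF-16LE 3d5c
# UTF-16BE 5c3d
```

The UTF-16LE row matches the inbuf contents reported above, which is consistent with the data having been produced on a little-endian machine.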
