Converting between Unicode, UTF-8 and ANSI, with notes on MultiByteToWideChar, WideCharToMultiByte and related encoding topics

pirate97

Category: MFC/SDK/C++. Posted 2010-05-18 20:53.


Converting between ANSI, Unicode (UTF-16) and UTF-8 with the Win32 API:

 qp::StringW Global::AnsiToUnicode(const char* buf)  
 {  
     int len = ::MultiByteToWideChar(CP_ACP, 0, buf, -1, NULL, 0);  
     if (len == 0) return L"";  
   
     std::vector<wchar_t> unicode(len);  
     ::MultiByteToWideChar(CP_ACP, 0, buf, -1, &unicode[0], len);  
   
     return &unicode[0];  
 }  
   
 qp::StringA Global::UnicodeToAnsi(const wchar_t* buf)  
 {  
     int len = ::WideCharToMultiByte(CP_ACP, 0, buf, -1, NULL, 0, NULL, NULL);  
     if (len == 0) return "";  
   
     std::vector<char> ansi(len);  
     ::WideCharToMultiByte(CP_ACP, 0, buf, -1, &ansi[0], len, NULL, NULL);  
   
     return &ansi[0];  
 }  
   
 qp::StringW Global::Utf8ToUnicode(const char* buf)  
 {  
     int len = ::MultiByteToWideChar(CP_UTF8, 0, buf, -1, NULL, 0);  
     if (len == 0) return L"";  
   
     std::vector<wchar_t> unicode(len);  
     ::MultiByteToWideChar(CP_UTF8, 0, buf, -1, &unicode[0], len);  
   
     return &unicode[0];  
 }  
   
 qp::StringA Global::UnicodeToUtf8(const wchar_t* buf)  
 {  
     int len = ::WideCharToMultiByte(CP_UTF8, 0, buf, -1, NULL, 0, NULL, NULL);  
     if (len == 0) return "";  
   
     std::vector<char> utf8(len);  
     ::WideCharToMultiByte(CP_UTF8, 0, buf, -1, &utf8[0], len, NULL, NULL);  
   
     return &utf8[0];  
 }  

A detailed guide to MultiByteToWideChar and WideCharToMultiByte
//========================================================================
//TITLE:
//    A detailed guide to MultiByteToWideChar and WideCharToMultiByte
//AUTHOR:
//    norains
//DATE:
//    First edition: Monday 25-December-2006
//    Expanded edition: Wednesday 27-December-2006
//    Revised edition: Wednesday 14-March-2007 (corrected the earlier faulty examples)
//Environment:
// EVC4.0 + Standard SDK
//========================================================================

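1. Usage in detail

Before we start, a brief word on narrow and wide characters.
A narrow character is stored in 8-bit units. ASCII is the typical example, and multibyte encodings such as GBK use one or two such units per character. A wide character, as the name suggests, is stored in 16-bit units on Windows; the typical example is Unicode (UTF-16).

For more on ASCII and Unicode under Windows, see two classic books: 《windows 程序设计》 (Programming Windows) and 《windows 核心编程》 (Windows core programming). Both explain the two kinds of characters in some detail.

Converting between wide and multibyte strings is a common stumbling block, but once the key points are clear it is straightforward.
Let's begin.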

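This is the multibyte string we want to convert:
char sText[20] = {"多字节字符串!OK!"};

We need to know how many array elements the converted wide string requires. In this example we could simply declare an array of 20*2 wide characters, and it would work fine. But as the multibyte string grows to thousands or tens of thousands of characters, the wasted memory grows with it, so sizing the wide array at twice the length of the multibyte string is not a good idea.
  Fortunately, we can find out exactly how much space is needed.
  Call MultiByteToWideChar() with the fourth parameter (cbMultiByte) set to -1, meaning the input is null-terminated, and with the last two parameters set to NULL and 0. The function then returns the number of wide characters required, including the terminating null:
  DWORD dwNum = MultiByteToWideChar (CP_ACP, 0, sText, -1, NULL, 0);
 
  Next, allocate an array of that size:
  wchar_t *pwText;
  pwText = new wchar_t[dwNum];
  if(!pwText)
  {
   // allocation failed: report the error and stop here; there is nothing to delete
  }
 
  Now we can convert. The source string is in the system ANSI code page (CP_ACP) and the result is a wide string:
  MultiByteToWideChar (CP_ACP, 0, sText, -1, pwText, dwNum);
 
  Finally, remember to free the memory when you are done with it:
  delete []pwText;
 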
 
 
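  Likewise, converting a wide string to a multibyte string looks like this:
  wchar_t wText[20] = {L"宽字符转换实例!OK!"};
  // CP_ACP matches the code page used for the reverse conversion; the flags are 0
  // and the last two parameters (default character, used-default flag) are NULL
  DWORD dwNum = WideCharToMultiByte(CP_ACP,0,wText,-1,NULL,0,NULL,NULL);
  char *psText;
  psText = new char[dwNum];
  if(!psText)
  {
   // allocation failed: report the error and stop here; there is nothing to delete
  }
  WideCharToMultiByte (CP_ACP,0,wText,-1,psText,dwNum,NULL,NULL);
  delete []psText;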
 
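   If the buffer has already been allocated and the strings are short enough that the wasted space does not matter, is there a simpler way to convert between narrow and wide strings?
   The Win32 API has no function that does exactly this, but we can write our own wrappers: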
     
  //-------------------------------------------------------------------------------------
  //Description:
  // This function maps a character string to a wide-character (Unicode) string
  //
  //Parameters:
  // lpcszStr: [in] Pointer to the character string to be converted
  // lpwszStr: [out] Pointer to a buffer that receives the translated string.
  // dwSize: [in] Size of the buffer, in wide characters
  //
  //Return Values:
  // TRUE: Succeed
  // FALSE: Failed
  //
  //Example:
  // MByteToWChar(szA,szW,sizeof(szW)/sizeof(szW[0]));
  //---------------------------------------------------------------------------------------
  BOOL MByteToWChar(LPCSTR lpcszStr, LPWSTR lpwszStr, DWORD dwSize)
  {
    // Get the required size of the buffer that receives the Unicode
    // string.
    DWORD dwMinSize;
    dwMinSize = MultiByteToWideChar (CP_ACP, 0, lpcszStr, -1, NULL, 0);
 
    if(dwSize < dwMinSize)
    {
     return FALSE;
    }
 
    
    // Convert the multibyte string to Unicode.
    MultiByteToWideChar (CP_ACP, 0, lpcszStr, -1, lpwszStr, dwMinSize);  
    return TRUE;
  }
 
  //-------------------------------------------------------------------------------------
  //Description:
  // This function maps a wide-character string to a new character string
  //
  //Parameters:
  // lpcwszStr: [in] Pointer to the character string to be converted
  // lpszStr: [out] Pointer to a buffer that receives the translated string.
  // dwSize: [in] Size of the buffer
  //
  //Return Values:
  // TRUE: Succeed
  // FALSE: Failed
  //
  //Example:
  // WCharToMByte(szW,szA,sizeof(szA)/sizeof(szA[0]));
  //---------------------------------------------------------------------------------------
  BOOL WCharToMByte(LPCWSTR lpcwszStr, LPSTR lpszStr, DWORD dwSize)
  {
   DWORD dwMinSize;
   // CP_ACP matches MByteToWChar above; flags are 0 and the last two parameters are NULL
   dwMinSize = WideCharToMultiByte(CP_ACP,0,lpcwszStr,-1,NULL,0,NULL,NULL);
   if(dwSize < dwMinSize)
   {
    return FALSE;
   }
   WideCharToMultiByte(CP_ACP,0,lpcwszStr,-1,lpszStr,dwSize,NULL,NULL);
   return TRUE;
  }
 
 
  Usage is simple, for example:
  wchar_t wText[10] = {L"函数示例"};
  char sText[20]= {0};
  WCharToMByte(wText,sText,sizeof(sText)/sizeof(sText[0]));
  MByteToWChar(sText,wText,sizeof(wText)/sizeof(wText[0]));
 
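  The drawback of these two functions is that they cannot allocate memory dynamically: the caller supplies fixed buffers, which wastes memory when they are sized for very long strings. The advantage is that, when the wasted space does not matter, converting short strings is very convenient.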

 
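2. Garbled output from MultiByteToWideChar()

  Some readers may have noticed that under the standard WinCE 4.2 or WinCE 5.0 SDK emulator this function does not work properly: the converted text is all garbage, and changing the MultiByteToWideChar() parameters does not help.
  This is not a code problem. The root cause lies in the customized operating system image: if its default language is not Chinese, this is what happens. The standard SDK defaults to English, so the problem is bound to appear. It cannot be fixed by simply changing "Default Language" under "Regional Settings" in Control Panel; the default language must be set to "Chinese" when the system image is built.
  The setting is located at:
  Platform -> Setting... -> locale -> default language. Select "Chinese" and rebuild.
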
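   CString my_strEditA=_T(""),my_strEditB=_T(""),my_strEditC=_T("");
   my_strlength = my_Base64Msg.GetLength();
   char   *pstra = new char[my_strlength + 1];   // +1 for the terminating null
   for(i=0;i<my_strlength;i++)
   {
    *(pstra+i)=(char)my_Base64Msg[i];
   }
   *(pstra+my_strlength) = '\0';                 // terminate before using it as a C string
   my_testA = pstra;

  // Base64 decode function; see the author's other post
   my_testB = decode(my_testA);

  // Get the length of the converted string
   DWORD dwNum = MultiByteToWideChar (CP_ACP, 0, my_testB, -1, NULL, 0);
   wchar_t *pwText;
   pwText = new wchar_t[dwNum];
   if(!pwText)
   {
    // allocation failed: report the error and stop here; there is nothing to delete
   }

  // Convert to Unicode.
  // CP_ACP selects the system ANSI code page, which is 936 on Simplified Chinese Windows
   MultiByteToWideChar (CP_ACP, 0 , my_testB, -1, pwText, dwNum);
   m_EditMail += pwText;
   delete []pwText;
   delete []pstra;                               // release the Base64 copy once nothing refers to it
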
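/**************************************************************
*   Purpose:      convert a Unicode string to a GB2312 string
*   Parameters:   unistr   source string
*                 gbstr    destination string
*                 msg_len  length of the source string, or -1 to have it computed
*   Return value: none
**************************************************************/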
void str_unic_decode( unsigned short *unistr, unsigned char *gbstr, int msg_len)    
{    
    int   i;    
    int   index;    
    unsigned   short   ch;    
    unsigned   char   str[2];    
      
    msg_len   =   msg_len==-1?  str_unic_len(unistr)   :   msg_len;    
    for(i=0,index=0;   i<msg_len;   i++)    
    {    
        ch   =   UNICODE_TO_GB2312[unistr[i]]; // table lookup
        str[0]   =   ch   &   0xff;    
        str[1]   =   ch>>8   &   0xff;    
        if(str[1]   >   0xa0)    
        {    
            gbstr[index++]   =   str[0];    
            gbstr[index++]   =   str[1];    
        }    
        else    
        {    
            gbstr[index++]   =   str[0];    
        }    
    }    
}

  
  
   
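UTF-8
Now that Unicode is clear, what is UTF-8, and why does it exist?
Converting ASCII to UCS-2 simply pads each character with a 0x00 byte. Text in such an encoding therefore contains bytes that look like control characters, such as '\0' or '/', and these cause serious errors in UNIX and in some C functions. So UCS-2 is clearly unsuitable as an external encoding for Unicode.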
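To make the zero-byte problem concrete, here is a minimal sketch that is not part of the original article. The helper name `visible_length` is hypothetical; it reports how many bytes a byte-oriented C routine such as strlen would see before hitting the first zero byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Number of bytes a byte-oriented C string function would "see" in a buffer
// of n bytes: everything up to, but not including, the first zero byte.
// Hypothetical helper for illustration only.
std::size_t visible_length(const char* bytes, std::size_t n) {
    assert(bytes != nullptr);
    // memchr bounds the search, so the buffer needs no terminator of its own.
    const void* nul = std::memchr(bytes, 0, n);
    if (nul == nullptr) {
        return n;
    }
    return static_cast<std::size_t>(static_cast<const char*>(nul) - bytes);
}
```

For the little-endian UCS-2 bytes of "AB" (41 00 42 00) the function returns 1, because the padding byte after 'A' looks like a terminator; for the UTF-8 bytes 41 42 it returns 2.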
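That is why UTF-8 was created. How does UTF-8 encode text, and how does it solve the problems of UCS-2?
Example:
E4 BD A0        11100100 10111101 10100000
This is the UTF-8 encoding of "你".
4F 60           01001111 01100000
This is the Unicode code point of "你" (U+4F60).
Following the UTF-8 encoding rules, mask off the marker bits: xxxx0100 xx111101 xx100000
Concatenating the bits other than x gives 0100 111101 100000, which is the Unicode code point of "你".
Note the three leading 1 bits of the first byte: they indicate that the whole UTF-8 sequence is three bytes long.
After UTF-8 encoding, the sensitive bytes can no longer appear inside a multi-byte character, because every byte of a multi-byte sequence has its high bit set to 1.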
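The decomposition above can be sketched in a few lines of C++. This is an illustrative helper, not code from the original article; the name `decode_utf8_3` is made up, and it handles only the three-byte form used in the example.

```cpp
#include <cassert>
#include <cstdint>

// Decode one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
// into a code point by stripping the marker bits and joining the rest.
// It does not reject overlong forms or surrogates.
std::uint32_t decode_utf8_3(unsigned char b0, unsigned char b1, unsigned char b2) {
    assert((b0 & 0xF0) == 0xE0);                          // lead byte: 1110xxxx
    assert((b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80);   // continuation: 10xxxxxx
    return (static_cast<std::uint32_t>(b0 & 0x0F) << 12) |
           (static_cast<std::uint32_t>(b1 & 0x3F) << 6) |
            static_cast<std::uint32_t>(b2 & 0x3F);
}
```

Calling `decode_utf8_3(0xE4, 0xBD, 0xA0)` yields 0x4F60, matching the worked example for "你".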

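The mapping between Unicode code points and UTF-8 is:
U-00000000 - U-0000007F: 0xxxxxxx       // no leading 1: one byte
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx  // two leading 1s: two bytes
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx // three leading 1s: three bytes
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx // and so on
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The five- and six-byte forms come from the original UTF-8 definition. RFC 3629 restricts UTF-8 to U+10FFFF, so current encoders emit at most four bytes.
To convert a Unicode code point to UTF-8, fill its bits into the x positions of the matching row, most significant bits first.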
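As a sketch of the table in code (an illustration, not the original author's implementation), the function below fills a code point's bits into the x positions. The name `encode_utf8` is hypothetical. It covers only the one- to four-byte rows, which is the range RFC 3629 allows, and it does not reject surrogate code points.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);  // upper limit of Unicode
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

`encode_utf8(0x4F60)` produces the bytes E4 BD A0, the same three bytes shown for "你" in the example.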


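So Unicode and UTF-8 are related by a simple algorithmic rule, while Unicode and GB2312 are not. Converting between Unicode and GB2312 therefore requires a lookup table; like converting between the solar and lunar calendars, it cannot be done by calculation alone.

Because there is no algorithm for converting between Unicode and the legacy national encodings, the conversion has to be done by table lookup. Writing the conversion code yourself is most worthwhile on embedded systems, where it offers the best portability and the smallest code size. The main technical problems to solve are:

·Obtaining the required conversion tables

·Implementing a fast search of the code table (needed only when converting from Unicode or UTF-8 to GB; it is simply a binary search)

·Telling Chinese characters from Western characters in the string being converted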

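Because binary search requires a table sorted in advance, the forward and reverse conversions each need their own table. The tables can be taken from open-source software, or you can write a small program to generate them.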
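A minimal sketch of the binary-search lookup, under stated assumptions: the struct `UniToGb`, the array `kUniToGb` and the function `unicode_to_gb` are hypothetical names, and the table is a four-entry excerpt for illustration. A real table would hold every GB2312 character, sorted by Unicode value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

struct UniToGb {
    std::uint16_t unicode;  // Unicode code point (BMP only)
    std::uint16_t gb;       // GB2312 code, lead byte in the high 8 bits
};

// Illustrative excerpt, sorted by the unicode field.
static const UniToGb kUniToGb[] = {
    {0x4E2D, 0xD6D0},  // 中
    {0x4F60, 0xC4E3},  // 你
    {0x554A, 0xB0A1},  // 啊
    {0x597D, 0xBAC3},  // 好
};

static bool by_unicode(const UniToGb& a, const UniToGb& b) {
    return a.unicode < b.unicode;
}

// Returns the GB2312 code for u, or 0 if u is not in the table.
std::uint16_t unicode_to_gb(std::uint16_t u) {
    const UniToGb* first = std::begin(kUniToGb);
    const UniToGb* last = std::end(kUniToGb);
    assert(std::is_sorted(first, last, by_unicode));  // binary search needs sorted input
    const UniToGb* it = std::lower_bound(
        first, last, u,
        [](const UniToGb& e, std::uint16_t key) { return e.unicode < key; });
    return (it != last && it->unicode == u) ? it->gb : 0;
}
```

The reverse direction needs no search when its table is indexed directly by the GB2312 row and column, which is why only one direction depends on a sorted table.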

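In a non-Unicode string, Western letters take one byte per character while Chinese characters take two, so the two must be handled differently during conversion. The way to tell them apart is described earlier in this article.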
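The part of the article that describes the test is not reproduced here, so the sketch below relies on an assumption that is standard for GB2312 in its usual EUC-CN form: a byte below 0x80 is a single-byte ASCII character, and any byte at or above 0x80 starts a two-byte character. The function name `gb2312_char_count` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Count the characters in a GB2312 byte string of n bytes.
// Assumption: bytes < 0x80 are ASCII (one byte per character);
// bytes >= 0x80 are the lead byte of a two-byte character.
std::size_t gb2312_char_count(const unsigned char* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < n) {
        // Step over one character; a truncated trailing lead byte
        // is still counted once and ends the loop.
        i += (s[i] < 0x80) ? 1 : 2;
        ++count;
    }
    return count;
}
```

For the bytes C4 E3 BA C3 21 ("你好!" in GB2312) the function counts three characters, although the string is five bytes long.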

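When converting GB2312 to Unicode, the table is laid out by GB2312 row and cell, so the row and column indices can be computed directly from the GB code of the character:

Row = MSB - 0xA0 - 16

Col = LSB - 0xA0

The table starts at the Chinese character area, whose first character is "啊", so the first row is row 16 rather than row 0, and that offset must be subtracted from the row value. With the row and column in hand, the Unicode value is read directly from the table:

Unicode = CODE_LUT[Row][Col];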
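The two formulas can be written as a small helper. This is a sketch with hypothetical names (`GbIndex`, `gb2312_to_index`); the range checks assume the 72-row by 96-column table described in this article, that is, lead bytes 0xB0 to 0xF7.

```cpp
#include <cassert>

struct GbIndex {
    int row;  // index into the first dimension of the table
    int col;  // index into the second dimension of the table
};

// Map the two bytes of a GB2312 Chinese character to table indices,
// using Row = MSB - 0xA0 - 16 and Col = LSB - 0xA0.
GbIndex gb2312_to_index(unsigned char msb, unsigned char lsb) {
    assert(msb >= 0xB0 && msb <= 0xF7);  // Chinese character area, rows 16..87
    assert(lsb >= 0xA1 && lsb <= 0xFE);  // cells 1..94
    return GbIndex{msb - 0xA0 - 16, lsb - 0xA0};
}
```

For "啊" (bytes B0 A1) this gives row 0, column 1, so column 0 of every row is never used by a valid character.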

I was not satisfied with the conversion tables I found online today, so I wrote a program to generate a new one. The program is small; the complete source follows:
  
// UnicodeCvt.cpp  
// by mxh0506, 20081102  
  
#include "stdafx.h"  
#include <stdio.h>  
#include <string.h>  
#include <windows.h>  
  
  
  
int _tmain(int argc, _TCHAR* argv[])  
{  
     wchar_t wstr[8];  
     char szBuff[8];  
     FILE *fp;  
     unsigned char rowCode,colCode;  
     char szStr[64];  
     char szErr[16];  
     strcpy(szErr,"/*XX*/0x0000,");  
     // Plain text mode: everything below is written as narrow characters with fwrite
     fp = fopen("GB2Uni_LUT.h", "w");  
     if( fp ){  
         strcpy( szStr, "unsigned short Unicode[72][96]={\n");   
         fwrite(szStr,1,strlen(szStr),fp);          
         szBuff[2] = 0;  
         for( unsigned char row = 0; row < 72; row++){               
             for( unsigned char col = 0; col < 96; col++){                   
                 // Rebuild the two GB2312 bytes for this table cell
                 rowCode = (row + 16) + 0xA0;                  
                 colCode = col + 0xA0;                  
                 szBuff[0] = rowCode;                  
                 szBuff[1] = colCode;                  
                 if( MultiByteToWideChar(CP_THREAD_ACP,MB_ERR_INVALID_CHARS,szBuff,2,wstr,8)){                       
                     sprintf(szStr,"/*%s%X*/0x%X,",szBuff,*((unsigned short*)szBuff),wstr[0]);                      
                     fwrite(szStr,1,strlen(szStr),fp);                       
                 }else{                      
                     // Not a valid character: emit a placeholder entry
                     fwrite( szErr, 1, 13, fp );                      
                 }                  
             }              
             fwrite( "\n", 1, 1, fp );               
         }           
         strcpy( szStr, "};\n");           
         fwrite( szStr, 1, strlen( szStr ), fp );           
         fclose(fp);           
     }  
       
     return 0;  
       
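}

Later testing showed that the table this program generates is missing a few braces, but it is otherwise correct. Anyone interested can add them. You can also try starting the table at row 0 instead of row 16 and see what happens.

Converting between GB2312 and Unicode

Two functions for converting between GB2312 and Unicode. They are rather crude: the string to convert must be shorter than 256 bytes.
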
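#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Despite the names, the "Unicode" side of both functions is UTF-8.
   The caller's "to" buffer must hold at least 256 bytes. */
static int
_convertCharSetFromGBKToUnicode(char *from, char *to)
{
        iconv_t h;
        char tmp_from[256] = { '\0' };
        char tmp_to[256] = { '\0' };
        char *p_from = tmp_from;
        char *p_to = tmp_to;
        size_t size_from, size_to, written;
        strncpy(p_from, from, sizeof(tmp_from)-1);
        size_from = strlen(p_from);
        size_to = sizeof(tmp_to) - 1;           /* keep room for the terminator */
        if ((h = iconv_open("UTF-8", "GBK")) == (iconv_t)-1)
                return -1;
        if (iconv(h, &p_from, &size_from, &p_to, &size_to) == (size_t)-1) {
                iconv_close(h);
                return -1;
        }
        iconv_close(h);
        /* size_to now holds the space left, not the number of bytes written */
        written = sizeof(tmp_to) - 1 - size_to;
        printf("GBK Code : %s, UNICODE Code : %s %lu\n", tmp_from, tmp_to, (unsigned long)written);
        memcpy(to, tmp_to, written + 1);        /* copy the result and its terminator */
        return 0;
}

static int
_convertCharSetFromUnicodeToGBK(char *from, char *to)
{
        iconv_t h;
        char tmp_from[256] = { '\0' };
        char tmp_to[256] = { '\0' };
        char *p_from = tmp_from;
        char *p_to = tmp_to;
        size_t size_from, size_to, written;
        strncpy(p_from, from, sizeof(tmp_from)-1);
        size_from = strlen(p_from);
        size_to = sizeof(tmp_to) - 1;           /* keep room for the terminator */
        if ((h = iconv_open("GBK", "UTF-8")) == (iconv_t)-1)
                return -1;
        if (iconv(h, &p_from, &size_from, &p_to, &size_to) == (size_t)-1) {
                iconv_close(h);
                return -1;
        }
        iconv_close(h);
        written = sizeof(tmp_to) - 1 - size_to;
        printf("UNICODE Code : %s, GBK Code : %s %lu\n", tmp_from, tmp_to, (unsigned long)written);
        memcpy(to, tmp_to, written + 1);
        return 0;
}
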
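Converting character sets on Linux (UTF-8 conversion). Following this article, I successfully converted UTF-8 to GB in a GTK program on Linux.
On Linux, encoding conversion can be done either in code with the iconv family of functions or with the iconv command. The command works on files: it converts a given file from one encoding to another.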
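I. Converting encodings with the iconv functions
The iconv functions are declared in iconv.h, which must be included first.
#include <iconv.h>
The family has three functions, with these prototypes:
(1) iconv_t iconv_open(const char *tocode, const char *fromcode);
This function specifies which two encodings to convert between: tocode is the target encoding and fromcode is the source encoding. It returns a conversion descriptor used by the other two functions, or (iconv_t)-1 on failure.
(2) size_t iconv(iconv_t cd,char **inbuf,size_t *inbytesleft,char **outbuf,size_t *outbytesleft);
This function reads characters from inbuf, converts them, and writes them to outbuf. inbytesleft tracks the number of input bytes not yet converted, and outbytesleft tracks the space remaining in the output buffer, also in bytes. Both pointers and both counters are updated as the conversion proceeds.
(3) int iconv_close(iconv_t cd);
This function closes the conversion descriptor and releases its resources.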
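The three calls fit together as in the sketch below. It is a minimal illustration, not code from the original article: the wrapper name `iconv_convert` is hypothetical, the output buffer size (four bytes per input byte plus four) is a rough assumption that suits common encodings, and the extra flush call that stateful encodings need is left out. It assumes an iconv whose second parameter is `char **`, as in glibc.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <string>

// Convert input from encoding `from` to encoding `to`.
// Returns false if the descriptor cannot be opened or the conversion fails.
bool iconv_convert(const char* from, const char* to,
                   const std::string& input, std::string& output) {
    assert(from != nullptr && to != nullptr);
    iconv_t cd = iconv_open(to, from);          // note the order: target first
    if (cd == (iconv_t)-1) {
        return false;
    }
    std::string in_copy = input;                // iconv wants a non-const char**
    char* in = &in_copy[0];
    std::size_t in_left = in_copy.size();
    output.assign(input.size() * 4 + 4, '\0');  // assumed worst-case expansion
    char* out = &output[0];
    std::size_t out_left = output.size();

    std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) {
        return false;
    }
    // out_left is the space remaining, so the bytes written are the difference.
    output.resize(output.size() - out_left);
    return true;
}
```

The point to notice is that outbytesleft counts what is left, not what was written; the number of bytes produced is the buffer size minus that value.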
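Example 1: a conversion program in C

/* f.c : sample C program for encoding conversion */
/* Reposter's note: my own working version was based on this f.c. */
#include <iconv.h>
#include <stdio.h>
#include <string.h>
#define OUTLEN 255

int code_convert(const char *from_charset,const char *to_charset,char *inbuf,size_t inlen,char *outbuf,size_t outlen);
int u2g(char *inbuf,size_t inlen,char *outbuf,size_t outlen);
int g2u(char *inbuf,size_t inlen,char *outbuf,size_t outlen);

int main(void)
{
/* UTF-8 bytes of "正在安装". The reposted listing showed these bytes misread as GBK,
   and the reposter noted that the text did not look like UTF-8 for "正在安装". */
char in_utf8[] = "\xE6\xAD\xA3\xE5\x9C\xA8\xE5\xAE\x89\xE8\xA3\x85";
char in_gb2312[] = "正在安装";   /* assumes this source file is saved as GB2312 */
char out[OUTLEN];
int rc;

/* UTF-8 to GB2312 */
rc = u2g(in_utf8,strlen(in_utf8),out,OUTLEN);
printf("unicode-->gb2312 rc=%d out=%s\n",rc,out);
/* GB2312 to UTF-8 */
rc = g2u(in_gb2312,strlen(in_gb2312),out,OUTLEN);
printf("gb2312-->unicode rc=%d out=%s\n",rc,out);
return 0;
}
/* Convert from one encoding to another */
int code_convert(const char *from_charset,const char *to_charset,char *inbuf,size_t inlen,char *outbuf,size_t outlen)
{
iconv_t cd;
size_t rc;

if (outlen==0) return -1;
cd = iconv_open(to_charset,from_charset);
if (cd==(iconv_t)-1) return -1;
memset(outbuf,0,outlen);
outlen--;   /* keep the last byte as the terminator */
rc = iconv(cd,&inbuf,&inlen,&outbuf,&outlen);
iconv_close(cd);
return rc==(size_t)-1 ? -1 : 0;
}
/* UTF-8 to GB2312 */
int u2g(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
{
return code_convert("utf-8","gb2312",inbuf,inlen,outbuf,outlen);
}
/* GB2312 to UTF-8 */
int g2u(char *inbuf,size_t inlen,char *outbuf,size_t outlen)
{
return code_convert("gb2312","utf-8",inbuf,inlen,outbuf,outlen);
}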

Example 2: a conversion program in C++

/* f.cpp : sample C++ program for encoding conversion */
#include <iconv.h>
#include <cstring>
#include <iostream>

#define OUTLEN 255

using namespace std;

// Encoding conversion class
class CodeConverter {
private:
iconv_t cd;
// Not copyable: the descriptor must be closed exactly once
CodeConverter(const CodeConverter&);
CodeConverter& operator=(const CodeConverter&);
public:
// Constructor
CodeConverter(const char *from_charset,const char *to_charset) {
cd = iconv_open(to_charset,from_charset);
}

// Destructor
~CodeConverter() {
if (cd != (iconv_t)-1) iconv_close(cd);
}

// Convert; returns 0 on success, -1 on failure
int convert(char *inbuf,size_t inlen,char *outbuf,size_t outlen) {
if (cd == (iconv_t)-1 || outlen == 0) return -1;
memset(outbuf,0,outlen);
outlen--;   // keep the last byte as the terminator
return iconv(cd,&inbuf,&inlen,&outbuf,&outlen) == (size_t)-1 ? -1 : 0;
}
};

int main(int argc, char **argv)
{
// UTF-8 bytes of "正在安装"
char in_utf8[] = "\xE6\xAD\xA3\xE5\x9C\xA8\xE5\xAE\x89\xE8\xA3\x85";
char in_gb2312[] = "正在安装";   // assumes this source file is saved as GB2312
char out[OUTLEN];

// utf-8-->gb2312
CodeConverter cc("utf-8","gb2312");
cc.convert(in_utf8,strlen(in_utf8),out,OUTLEN);
cout << "utf-8-->gb2312 in=" << in_utf8 << ",out=" << out << endl;

// gb2312-->utf-8
CodeConverter cc2("gb2312","utf-8");
cc2.convert(in_gb2312,strlen(in_gb2312),out,OUTLEN);
cout << "gb2312-->utf-8 in=" << in_gb2312 << ",out=" << out << endl;
return 0;
}


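II. Converting encodings with the iconv command

The iconv command converts the encoding of the given files. It writes to standard output by default, or to a specified output file.

Usage: iconv [OPTION...] [FILE...]

The following options are available:

Input/output format specification:
-f, --from-code=NAME encoding of the original text
-t, --to-code=NAME encoding for output

Information:
-l, --list list all known character sets

Output control:
-c omit invalid characters from output
-o, --output=FILE output file
-s, --silent suppress warnings
--verbose print progress information

-?, --help give this help list
--usage give a short usage message
-V, --version print program version

Example:
iconv -f utf-8 -t gb2312 aaa.txt >bbb.txt
This command reads aaa.txt, converts it from UTF-8 to GB2312, and redirects the output to bbb.txt.
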
Converting between encodings with iconv (Big5 to GB2312)

http://www.freebsd.org/ports/converters.html

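Overview
iconv is a library that converts between encodings using Unicode as the intermediate form. It covers most encodings in use around the world, for example ASCII, GB2312, GBK, GB18030, BIG5, UTF-8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE and UTF-7, as well as encodings for Thai, Japanese, Korean, Western European and other languages. Below we show how to use iconv to convert Big5 to GB2312; with a small change the same code handles any pair of encodings that iconv supports.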

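Download
libiconv is the GNU implementation of iconv and can be downloaded from http://www.gnu.org/software/libiconv/
The win32 build of iconv can be downloaded from http://gnuwin32.sourceforge.net/packages/libiconv.htm

SVN source
Some further demo code is available from my SVN repository:
http://xcyber.googlecode.com/svn/trunk/Convert/

Demo code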

/****************************************************************************
*   Big5ToGB2312 - Convert Big5 encoding file to GB2312 encoding file
*   File:
*     Big5ToGb2312.c
*   Description:
*     Convert Big5 encoding file to GB2312 encoding file using iconv library
*   Author:
*     XCyber    email:XCyber@sohu.com
*   Date:
*     August 7, 2008
*   Other:
*     visit http://www.gnu.org/software/libiconv/ for more help of iconv
***************************************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include "../iconv-1.9.2.win32/include/iconv.h"

//#pragma comment(lib, "../iconv-1.9.2.win32/lib/iconv.lib")   // using iconv dynamic-link lib, iconv.dll
#pragma comment(lib, "../iconv-1.9.2.win32/lib/iconv_a.lib")   // using iconv static lib

#define BUFFER_SIZE 1024    //BUFFER_SIZE must >= 2

void usage()
{
    printf("\nBig5ToGB2312 - Convert Big5 encoding file to GB2312 encoding file\n");
    printf("XCyber@sohu.com on August 7, 2008\n");
    printf("   Usage:\n");
    printf("       Big5ToGB2312 [Big5 file(in)]   [GB2312 file(out)]\n\n");
}

int main(int argc, char* argv[])
{
    FILE * pSrcFile = NULL;
    FILE * pDstFile = NULL;

    char szSrcBuf[BUFFER_SIZE];
    char szDstBuf[BUFFER_SIZE];

    size_t nSrc   = 0;
    size_t nDst   = 0;
    size_t nRead = 0;
    size_t nRet   = 0;

    char *pSrcBuf = szSrcBuf;
    char *pDstBuf = szDstBuf;

    iconv_t icv;
    int argument = 1;

    //check input arguments
    if(argc != 3)
    {
        usage();
        return -1;
    }

    //binary mode: the bytes must reach iconv unchanged
    pSrcFile = fopen(argv[1],"rb");
    if(pSrcFile == NULL)
    {
        printf("can't open source file!\n");
        return -1;
    }

    pDstFile = fopen(argv[2],"wb");
    if(pDstFile == NULL)
    {
        printf("can't open destination file!\n");
        fclose(pSrcFile);
        return -1;
    }

    //initialize iconv routine, perform conversion from BIG5 to GB2312
    //TODO: if you want to perform other type of conversion, e.g. GB2312->BIG5, GB2312->UTF-8 ...
    //just change following two parameters of iconv_open()
    icv = iconv_open("GB2312","BIG5");
    if(icv == (iconv_t)-1)
    {
        printf("can't initialize iconv routine!\n");
        return -1;
    }

    //enable "illegal sequence discard and continue" feature, so that if met illegal sequence,
    //conversion will continue instead of being terminated
    if(iconvctl (icv ,ICONV_SET_DISCARD_ILSEQ,&argument) != 0)
    {
        printf("can't enable \"illegal sequence discard and continue\" feature!\n");
        return -1;
    }

    while(!feof(pSrcFile))
    {
        pSrcBuf = szSrcBuf;
        pDstBuf = szDstBuf;
        nDst = BUFFER_SIZE;

        // read data from source file
        nRead = fread(szSrcBuf + nSrc,sizeof(char),BUFFER_SIZE - nSrc,pSrcFile);
        if(nRead == 0)
            break;

        // the amount of data to be converted should include previous left data and current read data
        nSrc = nSrc + nRead;

        //perform conversion
        nRet = iconv(icv,(const char**)&pSrcBuf,&nSrc,&pDstBuf,&nDst);

        if(nRet == (size_t)-1)
        {
            // include all case of errno: E2BIG, EILSEQ, EINVAL
            //      E2BIG: There is not sufficient room at *outbuf.
            //      EILSEQ: An invalid multibyte sequence has been encountered in the input.
            //      EINVAL: An incomplete multibyte sequence has been encountered in the input
            // move the left data to the head of szSrcBuf in order to link it with the next data block
            memmove(szSrcBuf,pSrcBuf,nSrc);
        }

        //write data to destination file
        fwrite(szDstBuf,sizeof(char),BUFFER_SIZE - nDst,pDstFile);
    }
    iconv_close(icv);
    fclose(pSrcFile);
    fclose(pDstFile);

    printf("conversion complete.\n");

    return 0;
}

Converting a GB2312 string to a UTF-8 string; the code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iconv.h>


int main(void)
{
    char src[] = "魅影追击和歌姬"; /* string to convert; assumes this source file is saved as GB2312 */
    unsigned char dst[256] = {0}; /* converted content */
    char buf[1024] = {0}; /* converted content formatted as text */
    size_t src_len = strlen(src);
    size_t dst_len = sizeof(dst) - 1; /* keep the last byte as the terminator */
    size_t dst_used;
    char *in = src;
    char *out = (char *)dst;

    iconv_t cd;
    size_t i;
    size_t j;

    cd = iconv_open("UTF-8", "GB2312"); /* convert from GB2312 to UTF-8 */
    if ((iconv_t)-1 == cd)
    {
        return -1;
    }

    printf("src: %s\n", src);
    if (iconv(cd, &in, &src_len, &out, &dst_len) == (size_t)-1) /* perform the conversion */
    {
        iconv_close(cd);
        return -1;
    }
    dst_used = sizeof(dst) - 1 - dst_len; /* bytes actually written */

    /* format the converted content as a %XX%XX... string */
    printf("dst: ");
    j = 0;
    for (i = 0; i < dst_used; i++)
    {
        printf("%.2X ", dst[i]);
        buf[j++] = '%';
        snprintf(buf + j, 3, "%.2X", dst[i]);
        j += 2;
    }
    printf("\n");
    printf("buf: %s\n", buf);

    iconv_close(cd); /* clean up */
    return 0;
}

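Building iconv with VS2008

iconv is a widely used open-source library for character set conversion. Its home page is http://www.gnu.org/software/libiconv/

Version 1.11.1 is the last release that supports building with MSVC; 1.12 and later support only MinGW and Cygwin. Here is how I built iconv with VS2008.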



1. Download version 1.11 of libiconv

2. Add this line to srclib/progname.h:

   #define EXEEXT ".exe"

3. Rename srclib/stdint_.h to srclib/stdint.h and remove every '@' character from it

4. Make the following changes to srclib/Makefile.msvc:

    1) Add width.obj to the definition of OBJS=

    2) Add the following rule:

        width.obj : width.c

        $(CC) $(INCLUDES) $(CFLAGS) -c width.c

5. Build the DLL or the LIB with the following commands

   nmake -f Makefile.msvc NO_NLS=1 DLL=1 MFLAGS=-MD PREFIX="c:/lib_x86" IIPREFIX="c:/lib_x86"

   nmake -f Makefile.msvc NO_NLS=1 DLL=1 MFLAGS=-MD install PREFIX="c:/lib_x86" IIPREFIX="c:/lib_x86" 

   or

   nmake -f Makefile.msvc NO_NLS=1 MFLAGS=-MD PREFIX="c:/slib_x86" IIPREFIX="c:/slib_x86"

   nmake -f Makefile.msvc NO_NLS=1 MFLAGS=-MD install PREFIX="c:/slib_x86" IIPREFIX="c:/slib_x86"

   The paths in PREFIX and IIPREFIX must be absolute paths

6. When the build finishes, the output is in the ./lib_x86 directory

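Converting between ANSI, UTF-8 and wide characters

#include <stdlib.h>
#include <string.h>
/* Also include the header listed after these functions (include guard _H_CONVERT).
   It pulls in <windows.h> and declares every function and macro used here. */

const char* AToU(LPCSTR str)
{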
    const wchar_t* wStr = 0;
    const char* ret = 0;
    if(!str)
    {
        return 0;
    }
    wStr = AToW(str);
    ret = WToU(wStr);
    FREE_BUFF(wStr);
    return ret;
}
 
const wchar_t* AToW(LPCSTR str)
{
    int needSize = 0;
    wchar_t* ret = 0;
    if(!str)
    {
        return 0;
    }
    needSize = MultiByteToWideChar(GetACP(), 0, str, -1, NULL, 0);
    ret = (wchar_t*)malloc((needSize + 1) * sizeof(wchar_t));
    memset(ret, 0, (needSize + 1) * sizeof(wchar_t));
    MultiByteToWideChar(GetACP(), 0, str, -1, ret, needSize);
    return ret;
}
 
const char* UToA(LPCSTR str)
{
    const wchar_t* wStr = 0;
    const char* ret = 0;
    if(!str)
    {
        return NULL;
    }
    wStr = UToW(str);
    ret = WToA(wStr);
    FREE_BUFF(wStr);
    return ret;
}
 
const wchar_t* UToW(LPCSTR str)
{
    int needSize = 0;
    wchar_t* ret = 0;
    if(!str)
    {
        return NULL;
    }
    needSize = MultiByteToWideChar(CP_UTF8, 0, str, -1, NULL, 0);
    ret = (wchar_t*)malloc((needSize + 1) * sizeof(wchar_t));
    memset(ret, 0, (needSize + 1) * sizeof(wchar_t));
    MultiByteToWideChar(CP_UTF8, 0, str, -1, ret, needSize);
    return ret;
}
 
const char* WToA(LPCWSTR str)
{
    int needSize = 0;
    char* ret = 0;
    if(!str)
    {
        return NULL;
    }
    needSize = WideCharToMultiByte(GetACP(), 0, str, -1, NULL, 0, NULL, NULL);
    ret = (char*)malloc(needSize + 1);
    memset(ret, 0, needSize + 1);
    WideCharToMultiByte(GetACP(), 0, str, -1, ret, needSize, NULL, NULL);
    return ret;
}
 
const char* WToU(LPCWSTR str)
{
    int needSize = 0;
    char* ret = 0;
    if(!str)
    {
        return NULL;
    }
    needSize = WideCharToMultiByte(CP_UTF8, 0, str, -1, NULL, 0, NULL, NULL);
    ret = (char*)malloc(needSize + 1);
    memset(ret, 0, needSize + 1);
    WideCharToMultiByte(CP_UTF8, 0, str, -1, ret, needSize, NULL, NULL);
    return ret;
}
 
void FreeBuffer(void* buff)
{
    free(buff);
}

#ifndef _H_CONVERT
#define _H_CONVERT
 
#ifdef __cplusplus
extern "C"
{
#endif
 
#include <windows.h>
 
#define FREE_BUFF(a) FreeBuffer((void*)a)
 
    // ANSI to UTF-8
    const char* AToU(LPCSTR);
    // ANSI to wide characters
    const wchar_t* AToW(LPCSTR);
    // UTF-8 to ANSI
    const char* UToA(LPCSTR);
    // UTF-8 to wide characters
    const wchar_t* UToW(LPCSTR);
    // Wide characters to ANSI
    const char* WToA(LPCWSTR);
    // Wide characters to UTF-8
    const char* WToU(LPCWSTR);
    // Every pointer returned by the functions above must be released with this function
    void FreeBuffer(void*);
 
#ifdef __cplusplus
}
#endif

#endif /* _H_CONVERT */
