Summary of converting between UTF-8, ISO-8859-1, and GBK

woshiyisang

Article link: https://blog.csdn.net/woshiyisang/article/details/10821479

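1. UTF-8 and GBK can be converted to each other, and UTF-8 covers more than GBK. Loosely speaking, GBK is a subset of UTF-8 in terms of the characters it can represent (the byte values themselves are different): every GBK character has a UTF-8 form, but not every UTF-8 character has a GBK form.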

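2. Converting between UTF-8 and ISO-8859-1: ISO-8859-1 can always be converted to UTF-8, but UTF-8 cannot in general be converted to ISO-8859-1. A simple way to think about it is a narrowing type cast: converting a high-precision type to a low-precision one loses data. The real reason is that Chinese characters encoded in UTF-8 have no matching position in the ISO-8859-1 code table. Note that some APIs (Java's, for instance) also accept the spelling ISO8859_1 for ISO-8859-1.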

3. Converting between GBK and ISO-8859-1: the same reasoning applies as for UTF-8 and ISO-8859-1.

That is my understanding so far. In a program on Linux, GBK, UTF-8, and Big5 can be converted directly with iconv, but ISO-8859-1 still needs some investigation.

1. Using mbstowcs and related functions

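(If compilation stops with "converting to execution character set: Invalid or incomplete multibyte or wide character", as with this file:

$ cat ws.cc
#include <string>
#include <iostream>
int main()
{
        std::wstring wstr = L"世界你好！";
        std::wcout << wstr << std::endl;
        return 0;
}

$ g++ ws.cc -o ws

the cause is that the source file was edited on Windows and then uploaded to Linux to be compiled. Chinese-language Windows saves files as GBK by default, while the compiler here reads the source as UTF-8, so the file has to be saved as UTF-8 or converted first, for example:

iconv -f GBK -t utf-8 gg.cpp >ggu.cpp
)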

Test program. When the conversion fails it reports:

wcstombs : Invalid or incomplete multibyte or wide character

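#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main()
{
    wchar_t wcstr[20] = L"字符测试123abc";

    // The locale selects the multibyte encoding used by wcstombs and mbstowcs.
    char* pLocale = setlocale(LC_ALL, "zh_CN.gb2312");
    if (!pLocale) { printf("set locale error\n"); return -1; }

    // Dump the wide characters, including the terminating L'\0'.
    int len = wcslen(wcstr) + 1;
    printf("len = %d \n", len);
    for (int i = 0; i < len; i++)
        printf("0x%08x ", (unsigned int)wcstr[i]);
    printf("\n");

    // Wide string -> multibyte string in the locale's encoding.
    // sizeof(wcstr) bytes is more than enough for this test string.
    char str[sizeof(wcstr)] = {0};
    size_t n = wcstombs(str, wcstr, sizeof(str));
    if (n == (size_t)-1)
    {
        perror("wcstombs ");
        exit(-1);
    }
    printf("n = %zu\n", n);
    // char is signed on this platform, so bytes >= 0x80 print sign-extended
    // (0xffffffd7 is the byte 0xd7).
    for (size_t i = 0; i < n + 1; i++)
        printf("0x%08x ", (unsigned int)str[i]);
    printf("\n");

    // Multibyte string -> wide string again.
    wchar_t wch[50] = {0};
    size_t m = mbstowcs(wch, str, 50);
    if (m == (size_t)-1) { perror("Converting"); exit(-1); }
    printf("m=%zu\n", m);
    for (size_t i = 0; i < m + 1; i++)
        printf("0x%08x ", (unsigned int)wch[i]);
    printf("\n");

    return 0;
}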

iconv -f GBK -t utf-8 mbstowcs.cpp >mbstowcsu.cpp

g++ mbstowcsu.cpp

Output:

len = 11
0x00005b57 0x00007b26 0x00006d4b 0x00008bd5 0x00000031 0x00000032 0x00000033 0x00000061 0x00000062 0x00000063 0x00000000
n = 14
0xffffffd7 0xffffffd6 0xffffffb7 0xfffffffb 0xffffffb2 0xffffffe2 0xffffffca 0xffffffd4 0x00000031 0x00000032 0x00000033 0x00000061 0x00000062 0x00000063 0x00000000
m=10
0x00005b57 0x00007b26 0x00006d4b 0x00008bd5 0x00000031 0x00000032 0x00000033 0x00000061 0x00000062 0x00000063 0x00000000

size_t wcstombs(char *dest, const wchar_t *src, size_t n);

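The program appears to convert the Unicode (wide-character) data to GBK and then successfully back again. (Run locale -a to see which locale names are available.)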

The call char* pLocale=setlocale( LC_ALL,"zh_CN.gb2312"); determines which multibyte encoding wcstombs and mbstowcs convert to and from.

Switching the locale to other values:

char* pLocale=setlocale( LC_ALL,"yi_US.utf8");

char* pLocale=setlocale( LC_ALL,"cy_GB.iso885914"); fails: wcstombs returns -1

char* pLocale=setlocale( LC_ALL,"da_DK.iso88591"); fails: wcstombs returns -1

When the input is changed to contain only digits and letters, the program runs fine.

Thoughts:

Unicode can encode Chinese characters, but those characters have no corresponding code in the Western European encodings.

Yet in Java this kind of conversion apparently succeeds?? http://blog.csdn.net/qinysong/article/details/1179489

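Here the Chinese string "字符测试123abc" is held as Unicode, and the attempt to turn it into an ISO-8859-1 byte string with wcstombs is the step that fails. With

char* pLocale=setlocale( LC_ALL,"zu_ZA.iso88591");

the output is: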

len = 11
0x00005b57 0x00007b26 0x00006d4b 0x00008bd5 0x00000031 0x00000032 0x00000033 0x00000061 0x00000062 0x00000063 0x00000000
wcstombs : ????????????????

Summary:

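Converting Chinese text to a Western European encoding fails because Chinese characters have no matching position in the ISO-8859-1 code table, so neither iconv nor wcstombs can convert them.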

http://www.cnblogs.com/hnrainll/archive/2011/05/07/2039700.html

An introduction to setlocale

http://bbs.csdn.net/topics/350110238

http://blog.sina.com.cn/s/blog_4b4b54da01015iu5.html

Converting char to wchar_t and wchar_t to char (Unicode to ANSI)

Characters, bytes, and encodings: basic principles

http://www.regexlab.com/zh/encoding.htm#instances
