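Chinese text has three common national encodings: gb2312 (the compact character set), gbk, and gb18030. The three character sets nest as follows:
{ { { gb2312 } gbk } gb18030 } In other words, every gb2312 character is also in gbk with the same byte value, and the same holds for gbk inside gb18030, so a single gb18030 --> utf-8 converter can convert all three character sets to utf-8.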
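The nesting can be seen directly in the byte layout. The following is a minimal sketch, assuming the commonly published byte ranges for the three encodings (gb2312 two-byte area 0xA1-0xF7 / 0xA1-0xFE, gbk two-byte area 0x81-0xFE / 0x40-0xFE except 0x7F, gb18030 four-byte form with digits 0x30-0x39 in the second and fourth bytes). The helper names `gbCharLength` and `inGb2312Range` are hypothetical and not part of CodeCovert.hpp.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte length of the character that starts at s[pos],
// assuming s is gb18030 (and therefore also valid for gbk / gb2312 input).
// Returns 0 for anything this sketch does not recognise.
static std::size_t gbCharLength(const std::string& s, std::size_t pos = 0)
{
    // Read a byte as 0..255, or -1 when the index is past the end.
    auto byteAt = [&s](std::size_t i) -> int {
        return i < s.size() ? static_cast<unsigned char>(s[i]) : -1;
    };

    const int b1 = byteAt(pos);
    if (b1 < 0) { return 0; }
    if (b1 <= 0x7F) { return 1; }                 // ASCII
    if (b1 < 0x81 || b1 > 0xFE) { return 0; }     // not a valid lead byte here

    const int b2 = byteAt(pos + 1);
    if (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) { // two-byte form (gbk area)
        return 2;
    }
    if (b2 >= 0x30 && b2 <= 0x39) {               // possible four-byte form
        const int b3 = byteAt(pos + 2);
        const int b4 = byteAt(pos + 3);
        if (b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39) {
            return 4;                             // gb18030-only extension
        }
    }
    return 0;
}

// True when a two-byte character lies inside the gb2312 code area,
// which is a sub-rectangle of the gbk two-byte area.
static bool inGb2312Range(unsigned char lead, unsigned char trail)
{
    return lead >= 0xA1 && lead <= 0xF7 && trail >= 0xA1 && trail <= 0xFE;
}
```

A two-byte character that passes `inGb2312Range` is valid in all three encodings, a two-byte character that fails it needs at least gbk, and a four-byte character needs gb18030.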
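On Windows this is done with the system API. The approach is to convert the input encoding to Unicode (UTF-16) first, and then convert that to the target encoding: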
MultiByteToWideChar(CP_ACP, 0, gb2312_addr, -1, temp_addr, n);
WideCharToMultiByte(CP_UTF8, 0, temp_addr, -1, utf8_addr, n, NULL, NULL);
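CP_ACP is the ANSI code page, i.e. the system default character set. Running chcp in a DOS prompt returns 936 (gbk), which indicates that the system uses the gbk code page (strictly speaking chcp reports the console code page, which is also 936 on a Simplified Chinese system). As a result, characters that gb18030 adds beyond gbk cannot be handled through CP_ACP on Windows; the GB18030-2000 support package has to be installed (see the screenshots in the verification section below).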
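As a possible alternative to relying on CP_ACP, both API calls accept an explicit code page number, and Windows lists 936 for gbk and 54936 for gb18030 among its code page identifiers. The sketch below is only an illustration under that assumption: the helper name `convertCodePage` is hypothetical, and whether 54936 is usable depends on the Windows version and installed support. It also asks each API for the required size first (by passing a zero-length output buffer) instead of guessing a buffer size.

```cpp
#ifdef _WIN32
#include <Windows.h>
#include <string>

// Hypothetical helper: convert `input` from code page `from` to code page `to`
// by going through UTF-16. Returns false if either step fails.
static bool convertCodePage(UINT from, UINT to,
                            const std::string& input, std::string& output)
{
    output.clear();
    if (input.empty()) { return true; }
    const int inLen = static_cast<int>(input.size());

    // Pass 1: an output size of 0 makes the API return the required length.
    const int wideLen = MultiByteToWideChar(from, 0, input.data(), inLen, NULL, 0);
    if (wideLen <= 0) { return false; }
    std::wstring wide(static_cast<size_t>(wideLen), L'\0');
    if (MultiByteToWideChar(from, 0, input.data(), inLen, &wide[0], wideLen) <= 0) {
        return false;
    }

    // Pass 2: same size query, then the real conversion to the target code page.
    const int outLen = WideCharToMultiByte(to, 0, wide.data(), wideLen,
                                           NULL, 0, NULL, NULL);
    if (outLen <= 0) { return false; }
    output.resize(static_cast<size_t>(outLen));
    return WideCharToMultiByte(to, 0, wide.data(), wideLen,
                               &output[0], outLen, NULL, NULL) > 0;
}
#endif // _WIN32
```

For example, `convertCodePage(54936, CP_UTF8, gbText, utf8Text)` would request a gb18030 to utf-8 conversion, while `convertCodePage(936, CP_UTF8, gbText, utf8Text)` requests gbk explicitly. Because the input length is passed explicitly, no terminating null is written into the result.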
On Linux, use iconv to do the conversion:
iconv_t cd = iconv_open("utf-8", "GBK");
size_t ret = iconv(cd, &pSource, &charInPutLen, &pTemp, &charOutPutLen);
iconv_close(cd);
// Function reference: https://linux.die.net/man/3/iconv
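Two details of iconv are easy to miss in the three-line snippet above: it does not null-terminate the output, and it returns (size_t)-1 with errno set to E2BIG when the output buffer fills up. The following is a minimal sketch of a loop that handles both, assuming a stateless encoding pair such as gbk and utf-8. The helper name `iconvConvert` and the 256-byte chunk size are illustrative choices, not part of CodeCovert.hpp.

```cpp
#include <cerrno>
#include <iconv.h>
#include <string>

// Hypothetical helper: convert `input` from encoding `from` to encoding `to`.
// Returns false if the converter cannot be opened or the input is invalid.
static bool iconvConvert(const char* from, const char* to,
                         const std::string& input, std::string& output)
{
    output.clear();
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t)(-1)) { return false; }

    char* in = const_cast<char*>(input.data());  // iconv only reads from it
    size_t inLeft = input.size();
    char chunk[256];
    bool ok = true;

    while (inLeft > 0) {
        char* out = chunk;
        size_t outLeft = sizeof(chunk);
        const size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        const int err = errno;  // save before any other call can change it
        // Append exactly the bytes written; iconv adds no terminator.
        output.append(chunk, sizeof(chunk) - outLeft);
        if (rc == (size_t)(-1) && err != E2BIG) {
            ok = false;  // EILSEQ or EINVAL: invalid or truncated input
            break;
        }
        // On E2BIG the loop simply continues with an empty chunk.
    }
    iconv_close(cd);
    return ok;
}
```

With this shape the caller does not need to estimate the output size in advance, for example `iconvConvert("GBK", "utf-8", gbText, utf8Text)`.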
Full implementation, CodeCovert.hpp:
// CodeCovert.hpp
// Chinese character encoding conversion
#pragma once
#include <stdio.h>
#include <string>
#ifdef WIN32
#include <WinSock2.h>
#include <Windows.h>
static constexpr UINT UTF8 = CP_UTF8;
static constexpr UINT GB2312 = CP_ACP;
static constexpr UINT GBK = CP_ACP;
static constexpr UINT GB18030 = CP_ACP;
static int codeConvert(UINT from, UINT to, std::string &source, std::string &output)
{
output.clear();
if (source.empty()) { return 0; }
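// One extra slot: with an input length of -1 the API also writes the terminating null
const int wideSize = static_cast<int>(source.size()) + 1;
WCHAR *temp = new WCHAR[wideSize];
// Allocate (size + 1) * 2 bytes so the output cannot run out of space; a Chinese character takes 2 to 4 bytes
const int outSize = (static_cast<int>(source.size()) + 1) * 2;
char *out = new char[outSize];
// Both calls return 0 on failure
int ret = MultiByteToWideChar(from, 0, source.c_str(), -1, temp, wideSize);
if (ret > 0) { ret = WideCharToMultiByte(to, 0, temp, -1, out, outSize, NULL, NULL); }
if (ret > 0) { output = out; }
delete[]out;
delete[]temp;
return ret > 0 ? 0 : -1;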
}
#else
#include <iconv.h>
static const char* UTF8 = "utf-8";
static const char* GB2312 = "gb2312";
static const char* GBK = "gbk";
static const char* GB18030 = "gb18030";
static int codeConvert(const char* from, const char* to, std::string &source, std::string &output)
{
output.clear();
if (source.empty()) { return 0; }
// Function reference: https://linux.die.net/man/3/iconv
iconv_t cd = iconv_open(to, from);
if (cd == (iconv_t)(-1))
{
printf("open convert : %s -> %s error.\n", from, to);
return -1;
}
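size_t inLen = source.size();
size_t outLen = (source.size() + 1) * 2;
char *in = (char*)source.data();
char *out = new char[outLen];
char *pout = out; // iconv advances this pointer as it writes
size_t ret = iconv(cd, &in, &inLen, &pout, &outLen);
iconv_close(cd);
// iconv does not null-terminate, so copy exactly the bytes that were written
if (ret != (size_t)(-1)) { output.assign(out, pout - out); }
delete[]out;
return ret == (size_t)(-1) ? -1 : 0;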
}
#endif // WIN32
// Any GB* encoding to utf-8
static int gbxToUtf8(std::string &source, std::string &output)
{
return codeConvert(GB18030, UTF8, source, output);
}
static int utf8ToGb2312(std::string &source, std::string &output)
{
return codeConvert(UTF8, GB2312, source, output);
}
static int utf8ToGbk(std::string &source, std::string &output)
{
return codeConvert(UTF8, GBK, source, output);
}
static int utf8ToGb18030(std::string &source, std::string &output)
{
return codeConvert(UTF8, GB18030, source, output);
}
Test code:
#define _CRT_SECURE_NO_WARNINGS // suppress the fopen warning
#include "CodeCovert.hpp"
int main()
{
FILE *fp = fopen("utf8.txt", "rb");
if (fp == NULL) { printf("open utf8.txt error.\n"); return -1; }
char buf[1024];
size_t len = fread(buf, 1, sizeof(buf) - 1, fp);
fclose(fp);
std::string gb2312;
std::string utf8(buf, len);
printf("read utf8 string : %s\n", utf8.data());
utf8ToGb2312(utf8, gb2312);
printf("convert utf8 to gb2312 : %s\n", gb2312.data());
gbxToUtf8(gb2312, utf8);
printf("convert gb2312 to utf8 : %s\n", utf8.data());
return 0;
}
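Output:
Windows (Windows always converts using the ANSI code page, gbk):
Linux:
Verification:
Enter the characters 【月玥،】 at https://www.qqxiuzi.cn/bianma/zifuji.php and you can see that gb2312, gbk and gb18030 nest as described. You can also see that gb2312 and gbk use 2 bytes per character, while gb18030 extends this to 4 bytes. (big5 is the Traditional Chinese character set used in Taiwan; Unicode is the unified international character encoding.)
When these 3 characters are saved to a file:
Because the system ANSI encoding is actually gbk (it was gb2312 before Win95), 【،】 cannot be saved, so a prompt says the file must be converted to Unicode.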
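To compare the byte lengths seen on the website with what a conversion actually produced, it can help to print the raw bytes rather than the text, since a terminal may not render every encoding. A minimal sketch follows; the helper name `toHex` is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: render each byte of `bytes` as two uppercase hex
// digits, separated by single spaces.
static std::string toHex(const std::string& bytes)
{
    std::string hex;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        std::snprintf(buf, sizeof(buf), "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (i > 0) { hex += ' '; }
        hex += buf;
    }
    return hex;
}
```

For instance, `printf("%s\n", toHex(gb2312).c_str());` in the test program would show two hex pairs per gb2312 or gbk character and four per gb18030-only character.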