The Differences Between ANSI, Unicode, and UTF-8, and How to Convert Between Them

江左周郎

Article link: https://blog.csdn.net/Best_ZYJ/article/details/79011502

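Let's start with a small experiment:

In a folder, take a txt file containing the sentence “今天的天气非常好” ("The weather is very nice today") and save it three times, once each with the ANSI, Unicode, and UTF-8 encodings. Then right-click the folder and choose "Search (E)…".

Search for “天气” ("weather"). The ANSI and Unicode files are found; the UTF-8 file is not.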

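Why:

1. A Chinese edition of Windows uses an ANSI code page by default, and new txt files are saved in it by default, so the search finds the ANSI file.

2. "Unicode" here means UTF-16, the Unicode encoding that Windows uses natively, so the search finds that file too.

3. UTF-8 is a different, variable-length encoding of the same Unicode characters. It acts as a "bridge" encoding when Unicode text travels over the network (mainly in web pages), where it can reduce the amount of data for ASCII-heavy text. The operating system's search in this experiment did not decode it, so the UTF-8 file was not found.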
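
To see why a byte-oriented search can miss the UTF-8 file, compare the bytes of “天气” (U+5929 U+6C14) in UTF-16 little endian and in UTF-8. The sketch below does not claim to show how Windows search is implemented; `ContainsBytes` is a hypothetical helper that does a naive byte-for-byte search, which is the behavior assumed here for illustration.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (not a Windows API): naive byte-for-byte substring search.
bool ContainsBytes(const std::string& haystack, const std::string& needle) {
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end()) != haystack.end();
}

// The same two characters, written out byte by byte in two encodings.
const std::string kWeatherUtf16Le("\x29\x59\x14\x6C", 4);        // UTF-16 LE
const std::string kWeatherUtf8("\xE5\xA4\xA9\xE6\xB0\x94", 6);   // UTF-8
```

With these definitions, `ContainsBytes(kWeatherUtf8, kWeatherUtf16Le)` is false: a search that only looks for the UTF-16 bytes cannot match text stored as UTF-8, even though both byte sequences represent the same characters.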

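One common model treats UTF-8 purely as a transfer format:

Endpoint (Unicode) -- Transmission (UTF-8) -- Endpoint (Unicode)

Later, however, many web developers began writing their pages directly in UTF-8:

Endpoint (UTF-8) -- Transmission (UTF-8) -- Endpoint (UTF-8)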

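That is why the encoding shown in the browser is "Unicode (UTF-8)". Listing the two names side by side like this has led many users, and quite a few programmers, to believe that Unicode = UTF-8. They are not the same thing: Unicode assigns a number (a code point) to each character, and UTF-8 is one way of encoding those numbers as bytes. Writing pages directly in UTF-8 is not an error, though. UTF-8 is a standard encoding of Unicode; early browsers handled it poorly or not at all, and browsers, Microsoft's included, added support as its use spread.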

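The point is that web developers extended the use of UTF-8 well beyond transmission, but they had little influence on the developers of document formats. On Windows, Word documents and some other internationally used document formats have historically kept to UTF-16 (which Windows calls "Unicode") rather than UTF-8.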

For example, the Unicode code point of “严” is U+4E25, while its UTF-8 encoding is the three bytes E4 B8 A5. The two are not the same.
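
The step from the code point 4E25 to the bytes E4 B8 A5 follows the standard UTF-8 bit layout. The function below is a minimal sketch of that layout; the name `EncodeUtf8` is an assumption made for this example, and it expects a valid code point rather than validating arbitrary input.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF and not a surrogate).
std::string EncodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For 0x4E25 (binary 0100 111000 100101) the three-byte branch applies: 1110 0100, 10 111000, 10 100101, which is E4 B8 A5.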

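Files (txt and xml) created on Chinese and Japanese editions of Windows are both labeled ANSI, but "ANSI" stands for a different code page on each: GB2312 (more precisely its superset GBK, code page 936) on a Simplified Chinese system, and Shift-JIS (code page 932) on a Japanese system. The various ANSI code pages are incompatible with one another, so when information is exchanged internationally, text in two such languages cannot be stored in the same ANSI-encoded document.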

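Conclusion: for documents that must cross languages (txt and xml), use a Unicode encoding rather than ANSI. Both the operating system and browsers understand "Unicode" (UTF-16), and browsers also understand UTF-8. Some parts of the operating system, however, only handle the "Unicode" (UTF-16) form.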

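The difference between "Unicode" and "Unicode big endian": when you eat an egg, do you start from the small end or the big end? The two options differ only in byte order. "Unicode" is UTF-16 little endian (low byte first), and "Unicode big endian" is UTF-16 big endian (high byte first). For example, “严” (U+4E25) is stored as 25 4E in the former and 4E 25 in the latter. On Windows, just go with the flow and use "Unicode".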
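
The byte-order difference can be sketched in a few lines. The function names `ToUtf16Le` and `ToUtf16Be` are assumptions for this example; each one serializes a single 16-bit code unit, so characters outside the Basic Multilingual Plane (which need two code units) are out of scope here.

```cpp
#include <array>
#include <cstdint>

// Serialize one UTF-16 code unit with the low byte first ("Unicode").
std::array<std::uint8_t, 2> ToUtf16Le(std::uint16_t unit) {
    return {static_cast<std::uint8_t>(unit & 0xFF),
            static_cast<std::uint8_t>(unit >> 8)};
}

// Serialize one UTF-16 code unit with the high byte first ("Unicode big endian").
std::array<std::uint8_t, 2> ToUtf16Be(std::uint16_t unit) {
    return {static_cast<std::uint8_t>(unit >> 8),
            static_cast<std::uint8_t>(unit & 0xFF)};
}
```

Calling `ToUtf16Le(0x4E25)` gives the bytes 25 4E, and `ToUtf16Be(0x4E25)` gives 4E 25.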

Converting between ANSI, Unicode, and UTF-8:

	#include <windows.h>
	#include <string>
	using namespace std;

	// Note: "Ascii" in these names means ANSI, i.e. the system's active code page (CP_ACP).

	wstring AsciiToUnicode(const string& str) {
		// First call: get the required length in wide characters (including the terminating null)
		int unicodeLen = MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, nullptr, 0);
		if (unicodeLen <= 0) return wstring();  // conversion failed
		// Allocate the buffer
		wstring ret_str(unicodeLen, L'\0');
		// Second call: convert into the buffer
		MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, &ret_str[0], unicodeLen);
		ret_str.resize(unicodeLen - 1);  // drop the terminating null
		return ret_str;
	}

	string UnicodeToAscii(const wstring& wstr) {
		// First call: get the required length in bytes (including the terminating null)
		int ansiLen = WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, nullptr, 0, nullptr, nullptr);
		if (ansiLen <= 0) return string();  // conversion failed
		// Allocate the buffer
		string ret_str(ansiLen, '\0');
		// Second call: convert into the buffer
		WideCharToMultiByte(CP_ACP, 0, wstr.c_str(), -1, &ret_str[0], ansiLen, nullptr, nullptr);
		ret_str.resize(ansiLen - 1);  // drop the terminating null
		return ret_str;
	}

	wstring Utf8ToUnicode(const string& str) {
		// First call: get the required length in wide characters (including the terminating null)
		int unicodeLen = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, nullptr, 0);
		if (unicodeLen <= 0) return wstring();  // conversion failed
		// Allocate the buffer
		wstring ret_str(unicodeLen, L'\0');
		// Second call: convert into the buffer
		MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &ret_str[0], unicodeLen);
		ret_str.resize(unicodeLen - 1);  // drop the terminating null
		return ret_str;
	}

	string UnicodeToUtf8(const wstring& wstr) {
		// First call: get the required length in bytes (including the terminating null)
		int utf8Len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, nullptr, 0, nullptr, nullptr);
		if (utf8Len <= 0) return string();  // conversion failed
		// Allocate the buffer
		string ret_str(utf8Len, '\0');
		// Second call: convert into the buffer
		WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &ret_str[0], utf8Len, nullptr, nullptr);
		ret_str.resize(utf8Len - 1);  // drop the terminating null
		return ret_str;
	}

	// ANSI -> UTF-8, going through UTF-16
	string AsciiToUtf8(const string& str)
	{
		return UnicodeToUtf8(AsciiToUnicode(str));
	}

	// UTF-8 -> ANSI, going through UTF-16
	string Utf8ToAscii(const string& str)
	{
		return UnicodeToAscii(Utf8ToUnicode(str));
	}

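Note: avoid passing UTF-8 strings that contain Chinese characters to Windows system APIs; the text will turn into garbage. Convert it to Unicode (UTF-16) first.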
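
The conversion functions above rely on the Win32 API. As a rough, portable illustration of what a UTF-8 to UTF-16 conversion involves, here is a hand-written sketch. The name `Utf8ToUtf16` and the choice to replace malformed bytes with U+FFFD are assumptions made for this example; it is not a description of how `MultiByteToWideChar` works internally, and the two may differ on malformed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 into UTF-16 code units. Any malformed byte is replaced by
// U+FFFD and decoding resumes at the next byte.
std::u16string Utf8ToUtf16(const std::string& in) {
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // code point being assembled
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        std::size_t extra = 0;     // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else { out += kReplacement; ++i; continue; }

        // All continuation bytes must be present and look like 10xxxxxx.
        bool ok = i + extra < in.size();
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values above U+10FFFF.
        if (ok && (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))) {
            ok = false;
        }
        if (!ok) { out += kReplacement; ++i; continue; }

        i += extra + 1;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Characters above U+FFFF become a surrogate pair in UTF-16.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

Passing the bytes E4 B8 A5 returns the single code unit 0x4E25, which matches the “严” example earlier in the article.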