A Summary of Several Questions About Unicode

hailongchang

Article link: https://blog.csdn.net/hailongchang/article/details/1483568

What is Unicode?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

The above is the definition of Unicode given on www.unicode.org.

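Here are my questions:

1: Windows 2000 and later operating systems already use Unicode, so why is there still a concept of regions, and why are there still code page settings?

2: The total number of Chinese characters is very large and surely exceeds the range Unicode can represent, so why is Unicode said to meet the needs of all written languages?

3: Since C supports Unicode, why must the correct code page be set before the correct result is displayed?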

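#define _UNICODE   /* makes TCHAR expand to wchar_t in <tchar.h> */
#include <stdio.h>
#include <tchar.h>
#include <locale.h>

int main(int argc, char *argv[])
{
    const TCHAR *s = _TEXT("你好!");

    /* If this line is commented out, the program no longer displays the text correctly. */
    setlocale(LC_ALL, "CHS");

    /* In Microsoft's printf, %S takes a wide-character string. */
    printf("%S", s);
    return 0;
}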

After consulting some references, I resolved all of the questions above. The summary follows.

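I: Basic concepts

1: Internal code

A character has to be encoded before a computer can process it. The default encoding a computer uses is its internal code, so the internal code is the character encoding used inside the operating system. In early operating systems the internal code was language-specific. Today Windows uses Unicode throughout internally and relies on code pages to accommodate individual languages, so the notion of an internal code has become blurry. Microsoft generally calls the encoding specified by the default code page the internal code, and in particular contexts also says that its internal code is Unicode.

Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.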

2: Code page

A code page is a character encoding for the script of a particular language. It can be understood as the internal code described above.

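The Windows kernel now uses Unicode, so at the kernel level it can support all the world's written languages. However, a large body of existing programs and documents use a language-specific encoding such as GBK, and Windows cannot simply drop support for those encodings and switch everything to Unicode. Windows therefore uses code pages to accommodate individual countries and regions. As noted above, a code page is just a character encoding for a particular script, so Windows can display a file in a given encoding correctly by applying the matching code page. At the same time, because the internal code of Windows is Unicode, it can support multiple code pages simultaneously. As long as a file declares which encoding it uses and the user has the corresponding code page installed, Windows can display it correctly. An HTML file, for example, can specify its charset.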

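3: Unicode

Unicode is also a character encoding scheme, but one designed by international organizations to accommodate the writing systems of all the world's languages. The closely related ISO/IEC 10646 standard calls its character set the "Universal Multiple-Octet Coded Character Set", or UCS for short, and the two are kept synchronized, which is why UCS is sometimes loosely read as "Unicode Character Set". UCS only specifies which number is assigned to each character; it does not specify how that number is transmitted or stored.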

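For communication, therefore, a separate byte-level encoding is needed. Text that contains only ASCII characters can be sent as plain ASCII; otherwise a UTF encoding is used, where UTF stands for "UCS Transformation Format". The main UTF encodings are UTF-8 and UTF-16.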

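As for the first question: Windows does use Unicode internally, but text in a language-specific byte encoding cannot be displayed correctly unless Windows knows how to interpret those bytes, and that is the job of a code page. A code page is a character set that defines the mapping between byte codes and characters, whereas Unicode only defines the code assigned to each character. The third question is really the same as the first: printf produces byte output, so the wide string has to be converted to bytes using the code page selected by the current locale, and in the default "C" locale the Chinese characters cannot be converted.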

For the second question, see the following article.

http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0645. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+FEC9 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

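In short, each character corresponds to one code point. A code point identifies an abstract character rather than a particular shape, so the same code point can be drawn with different glyphs: characters shared by different writing systems, such as the Han ideographs used in both Chinese and Japanese, may use the same code yet look different when displayed. Unicode is also not limited to two bytes. Code points run from U+0000 to U+10FFFF, more than a million in total, which leaves ample room for Chinese characters.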