The relationship between Unicode, UTF-8, and UTF-16

weixin_34349320

Original article: https://my.oschina.net/u/2525142/blog/618484

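1. Why Unicode is needed

In the early days of computing there was only ASCII, which covers unaccented Latin letters, digits, punctuation, and a set of control characters. Today a single document can mix many languages, for example: English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ, and more scripts may be added in the future. Computers need to display the characters of all of these languages, so a single character set that covers every language is necessary. That is why Unicode was created.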

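2. A brief introduction to Unicode

Unicode is a character set that aims to cover the characters of every language in the world. It assigns each character a unique number, officially called a code point. One major advantage of Unicode is that its first 128 code points are identical to ASCII and its first 256 code points are identical to ISO-8859-1. Most commonly used characters lie in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), so their code points fit in 16 bits.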
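As a small illustration of what a code point is, the following Python sketch prints the numbers behind a few characters and checks the overlap with ASCII and ISO-8859-1 (called "latin-1" in Python). The sample characters and variable names are chosen here for illustration only.

```python
# Code points are plain integers; ord() and chr() convert between
# a character and its number.
samples = ["A", "\u00c0", "\u6c49", "\u20ac"]  # A, À, 汉, €
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    print(f"{ch!r} -> U+{cp:04X}")

# Decoding a single byte with the ASCII or Latin-1 codec yields the
# code point with the same numeric value.
ascii_matches = all(bytes([i]).decode("ascii") == chr(i) for i in range(128))
latin1_matches = all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))
print(ascii_matches, latin1_matches)
```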

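3. Why encodings such as UTF-8 and UTF-16 are needed

Unicode only assigns a number to each character. It does not say how that number is stored as bytes in memory, in a file, or on the network. A set of rules that maps code points to byte sequences, and back, is called an encoding. UTF-8 and UTF-16 are different encodings of the same Unicode code points. GBK is also an encoding, but of its own Chinese character set rather than of all of Unicode.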
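To make the idea of an encoding concrete, here is a minimal Python sketch that turns one character into bytes under three encodings. The character 汉 is an arbitrary choice for illustration; the codec names are the ones shipped with Python's standard library.

```python
text = "\u6c49"  # the character 汉, code point U+6C49

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark
gbk_bytes = text.encode("gbk")

def hex_string(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)

for name, data in [("UTF-8", utf8_bytes), ("UTF-16BE", utf16_bytes), ("GBK", gbk_bytes)]:
    print(f"{name:9}{hex_string(data)}")

# Decoding with the matching codec recovers the same character every time.
decoded = {
    utf8_bytes.decode("utf-8"),
    utf16_bytes.decode("utf-16-be"),
    gbk_bytes.decode("gbk"),
}
print(decoded == {text})
```

The same character produces three different byte sequences, which is why a reader has to know the encoding before the bytes can be interpreted.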

4. Differences between UTF-8 and UTF-16

1. Comparison by storage size:

UTF-8:
1 byte: standard ASCII
2 bytes: Arabic, Hebrew, and most European scripts (most notably excluding Georgian)
3 bytes: the rest of the BMP
4 bytes: all remaining Unicode characters (the supplementary planes)

UTF-16:
2 bytes: the BMP
4 bytes: all remaining Unicode characters (the supplementary planes)
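The size classes above can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample characters are assumptions made for this illustration; "utf-16-be" is used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))

# One sample character per size class, written as escapes to avoid ambiguity.
samples = [
    ("A", "ASCII"),
    ("\u00e9", "Latin letter with accent"),
    ("\u0639", "Arabic"),
    ("\u10d0", "Georgian"),
    ("\u6c49", "CJK, inside the BMP"),
    ("\U00024b62", "supplementary plane"),
]

sizes = {ch: encoded_sizes(ch) for ch, _ in samples}
for ch, label in samples:
    utf8_len, utf16_len = sizes[ch]
    print(f"U+{ord(ch):05X} {label:26} UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```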

Examples:

UTF-8 encoding:
00100100 for "$" (one byte)
11000010 10100010 for "¢" (two bytes)
11100010 10000010 10101100 for "€" (three bytes)

UTF-16 encoding:
00000000 00100100 for "$" (one 16-bit unit)
11011000 01010010 11011111 01100010 for "𤭢" (U+24B62, two 16-bit units)
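The bit patterns in these examples can be reproduced with a few lines of Python. The helper `bits` is a hypothetical name introduced here. The two-unit UTF-16 pattern corresponds to the code point U+24B62, which lies outside the BMP.

```python
def bits(data):
    """Render bytes as space-separated groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in data)

utf8_dollar = bits("$".encode("utf-8"))
utf8_cent = bits("\u00a2".encode("utf-8"))         # ¢
utf8_euro = bits("\u20ac".encode("utf-8"))         # €
utf16_dollar = bits("$".encode("utf-16-be"))
utf16_supp = bits("\U00024b62".encode("utf-16-be"))  # needs a surrogate pair

for label, value in [
    ("UTF-8  $", utf8_dollar),
    ("UTF-8  cent", utf8_cent),
    ("UTF-8  euro", utf8_euro),
    ("UTF-16 $", utf16_dollar),
    ("UTF-16 U+24B62", utf16_supp),
]:
    print(f"{label:15} {value}")
```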

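5. Advantages and disadvantages of UTF-8 and UTF-16

Both UTF-8 and UTF-16 are variable-length encodings. UTF-8 uses 8-bit code units, so a character takes at least 1 byte; UTF-16 uses 16-bit code units, so a character takes at least 2 bytes.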

UTF-8 advantages:
1. It is backward compatible with ASCII (US-ASCII): ASCII text is already valid UTF-8.
2. It produces no null bytes except for the character U+0000 itself, which allows the use of null-terminated strings and brings a great deal of backward compatibility.
3. It is independent of byte order, so you do not have to worry about big-endian versus little-endian issues.
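These three properties can be observed directly in Python. The sketch below is illustrative only; the sample strings and variable names are assumptions.

```python
# 1. ASCII text encodes to the same bytes in ASCII and in UTF-8.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")

# 2. UTF-8 output contains no zero byte unless the text contains U+0000,
#    while UTF-16 pads every ASCII character with a zero byte.
mixed = "h\u00e9llo, \u4e16\u754c"  # "héllo, 世界"
nul_in_utf8 = 0 in mixed.encode("utf-8")
nul_in_utf16 = 0 in mixed.encode("utf-16-le")

# 3. UTF-8 has a single byte sequence per character; UTF-16 has two,
#    depending on byte order.
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")

print(ascii_compatible, nul_in_utf8, nul_in_utf16)
print(utf16_le, utf16_be)
```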

UTF-8 disadvantages:
1. Many common characters have different lengths, which makes indexing by code point and counting code points much slower.
2. Even though byte order does not matter, UTF-8 text sometimes still carries a BOM (byte order mark) to signal that it is encoded in UTF-8. The BOM breaks compatibility with ASCII-only software even when the text contains nothing but ASCII characters. Microsoft software (such as Notepad) is especially fond of adding a BOM to UTF-8.
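Both drawbacks can be sketched in Python. The function `count_code_points` is a hypothetical helper written for this illustration, not a library call; it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx. The "utf-8-sig" codec and `codecs.BOM_UTF8` are part of the standard library.

```python
import codecs

def count_code_points(data):
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Every byte has to be inspected, so the cost grows with the byte length.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)

encoded = "a\u20ac\U00024b62".encode("utf-8")  # 1 + 3 + 4 bytes
byte_length = len(encoded)
code_point_count = count_code_points(encoded)

# A UTF-8 BOM is the three bytes EF BB BF in front of the text.
with_bom = "abc".encode("utf-8-sig")
starts_with_bom = with_bom.startswith(codecs.BOM_UTF8)

# Pure ASCII content, yet an ASCII-only decoder rejects it because of the BOM.
try:
    with_bom.decode("ascii")
    ascii_accepts_bom = True
except UnicodeDecodeError:
    ascii_accepts_bom = False

kept = with_bom.decode("utf-8")          # BOM survives as U+FEFF
stripped = with_bom.decode("utf-8-sig")  # BOM removed

print(byte_length, code_point_count)
print(starts_with_bom, ascii_accepts_bom, repr(kept), repr(stripped))
```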

UTF-16 advantages:
1. BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and counting code points when the text contains no supplementary characters.
2. Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, so the total length is always divisible by two and a 16-bit char can be used as the primitive component of a string.
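A minimal Python sketch of these two points follows. The helpers `utf16_units` and `bmp_char_at` are hypothetical names for this illustration, and `bmp_char_at` is only correct under the stated assumption that the text contains no supplementary characters.

```python
import struct

def utf16_units(text):
    """Return the text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))

def bmp_char_at(units, index):
    """Constant-time indexing; valid only if the text is BMP-only."""
    return chr(units[index])

bmp_text = "\u041f\u0440\u0438\u0432\u0435\u0442, \u4e16\u754c"  # "Привет, 世界"
bmp_units = utf16_units(bmp_text)
one_unit_per_char = len(bmp_units) == len(bmp_text)
ninth_char = bmp_char_at(bmp_units, 8)  # the character 世

# A supplementary character takes two units, so the byte length stays even.
supp_units = utf16_units("a\U00024b62")
supp_byte_length = len("a\U00024b62".encode("utf-16-le"))

print(one_unit_per_char, ninth_char)
print([hex(unit) for unit in supp_units], supp_byte_length)
```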

UTF-16 disadvantages:
1. US-ASCII strings contain lots of null bytes, which means byte-oriented null-terminated strings cannot be used and a lot of memory is wasted.
2. Treating it as a fixed-length encoding "mostly works" in many common scenarios (especially in the US, the EU, countries with Cyrillic alphabets, Israel, Arab countries, Iran, and many others), which often leads to broken support where it does not. Programmers have to be aware of surrogate pairs and handle them properly in the cases where it matters.
3. It is variable length, so counting or indexing code points is costly, though less so than with UTF-8.
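The surrogate-pair handling mentioned above can be sketched as follows. Both functions are hypothetical helpers written for this illustration; the arithmetic follows the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two halves of 10 bits each.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def count_code_points_utf16(units):
    """Count code points in well-formed UTF-16 by skipping low surrogates."""
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)

pair = to_surrogate_pair(0x24B62)

# "a" followed by U+24B62: three code units but only two code points.
units = [0x0061, *pair]
unit_count = len(units)
code_point_count = count_code_points_utf16(units)

# ASCII text in UTF-16 is half zero bytes.
ascii_as_utf16 = "abc".encode("utf-16-le")

print([hex(unit) for unit in pair], unit_count, code_point_count)
print(ascii_as_utf16)
```

Code that reports `unit_count` as the string length is an instance of the "mostly works" trap: it is correct until the first supplementary character appears.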

As a rule of thumb, UTF-16 is often preferred for in-memory representation, while UTF-8 is very well suited to text files and network protocols.

Reference examples:

"A" in ASCII is hex 0x41; in UTF-8 it is also 0x41; in UTF-16 it is 0x0041.
"À" in Latin-1 is 0xC0; in UTF-8 it is 0xC3 0x80; in UTF-16 it is 0x00C0.
The Tibetan letter ཨ in UTF-8 is 0xE0 0xBD 0xA8; in UTF-16 it is 0x0F68.
This character*: http://www.fileformat.info/info/... in UTF-8 is 0xF0 0xA0 0x80 0x8B; in UTF-16 it is 0xD840 0xDC0B.
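The byte values listed above can be cross-checked with Python's built-in codecs. This is only a verification sketch; the variable names are illustrative, and the last character is obtained by decoding the listed UTF-8 bytes rather than by assuming what the truncated link points to.

```python
# (character, expected UTF-8 bytes, expected UTF-16 big-endian bytes)
examples = [
    ("A", b"\x41", b"\x00\x41"),
    ("\u00c0", b"\xc3\x80", b"\x00\xc0"),      # À
    ("\u0f68", b"\xe0\xbd\xa8", b"\x0f\x68"),  # Tibetan letter ཨ
]

checks = [
    (ch.encode("utf-8") == utf8, ch.encode("utf-16-be") == utf16)
    for ch, utf8, utf16 in examples
]

# Single-byte legacy encodings from the first two examples.
ascii_a = "A".encode("ascii")
latin1_a_grave = "\u00c0".encode("latin-1")

# Decode the four UTF-8 bytes of the last example, then re-encode as UTF-16.
last_char = b"\xf0\xa0\x80\x8b".decode("utf-8")
last_utf16 = last_char.encode("utf-16-be")

print(checks)
print(ascii_a, latin1_a_grave, f"U+{ord(last_char):05X}", last_utf16)
```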