How Many Bytes Does a Chinese Character Occupy in UTF-8?

Latest recommended article published on 2020-07-10 22:14:30

lein_wang

Category: PHP. Tags: bytes occupied by Chinese characters in UTF-8


Source: http://en.wikipedia.org/wiki/UTF-8#Description

This table shows UTF-8 as it has been defined since 2003 (the x characters are replaced by the bits of the code point):

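UTF-8 (2003)

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1   | Byte 2   | Byte 3   | Byte 4   |
|-----------------|---------------------|------------------|-----------------|----------|----------|----------|----------|
| 1               | 7                   | U+0000           | U+007F          | 0xxxxxxx |          |          |          |
| 2               | 11                  | U+0080           | U+07FF          | 110xxxxx | 10xxxxxx |          |          |
| 3               | 16                  | U+0800           | U+FFFF          | 1110xxxx | 10xxxxxx | 10xxxxxx |          |
| 4               | 21                  | U+10000          | U+10FFFF        | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
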
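
The four rows of the table can be turned into a small hand-written encoder. The following is only an illustrative sketch in Python, not how any particular library implements it; the function name `utf8_encode` is made up for this example, and the surrogate check is an extra rule from RFC 3629 that the table itself does not show.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, one branch per table row (illustrative)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the table: RFC 3629 excludes surrogate code points.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        # Row 1: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Row 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


if __name__ == "__main__":
    # "A", "é" and "汉" written as escapes to avoid source-encoding issues.
    for ch in ("A", "\u00e9", "\u6c49"):
        encoded = utf8_encode(ord(ch))
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
```

For the Chinese character 汉 (U+6C49), which falls in the third row, this produces the three bytes E6 B1 89, matching Python's built-in `"汉".encode("utf-8")`.
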
The salient features of this scheme are as follows:

Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that ASCII text is valid UTF-8, and UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.
Clear distinction between multi-byte and single-byte characters: Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position. Thus, no bytes representing ASCII characters appear in multi-byte sequences.
Clear indication of byte sequence length: As in UTF-1, the first byte indicates the number of bytes in the sequence. Unlike in UTF-1, for multi-byte sequences this is simply the number of high-order 1s in the leading byte.
Prefix property: From the sequence length indication in the first byte (for both UTF-1 and UTF-8), a reader also knows where a sequence ends, which implies that no valid sequence is a prefix of any other. This means that a reader reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication.
Self-synchronization: Unlike in UTF-1, the high-order bits of every byte determine the type of byte: single bytes (0xxxxxxx), leading bytes (11...xxx), and continuation bytes (10xxxxxx) do not share values. The start of a character can be found by backing up at most 3 bytes (5 bytes before the RFC 3629 restriction). Together with the prefix property, this makes the scheme self-synchronizing.
Code structure: The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in the leading byte, lower-order bits in subsequent continuation bytes. The number of bytes in the encoding must be the minimum required to hold all the significant bits of the code point.
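
Two of the features above, the length indication in the leading byte and self-synchronization, can be sketched in a few lines of Python. The helper names `sequence_length`, `is_continuation`, and `char_start` are hypothetical and chosen only for this illustration. The code inspects bit patterns only and does not perform full validation (for example, it does not reject overlong encodings).

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a leading byte's bit pattern."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx
        return 4
    raise ValueError(f"0x{lead:02X} is not a leading byte")


def char_start(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the character containing it.

    At most 3 continuation bytes are skipped, as described in the text.
    """
    start = index
    steps = 0
    while is_continuation(data[start]):
        if start == 0 or steps == 3:
            raise ValueError("no leading byte found within 3 bytes")
        start -= 1
        steps += 1
    return start


if __name__ == "__main__":
    # "a", then U+6C49 (3 bytes), then U+20000 (4 bytes).
    data = "a\u6c49\U00020000".encode("utf-8")
    for i in range(len(data)):
        s = char_start(data, i)
        print(i, "-> starts at", s, "length", sequence_length(data[s]))
```

Because continuation bytes never look like leading or single bytes, a reader dropped at an arbitrary offset in a stream can resynchronize by scanning backward (or forward) until it sees a byte that is not of the form 10xxxxxx.
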
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[13] including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
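
To check the byte counts from the paragraph above against concrete characters, one can simply measure the encoded length. This is a small sketch using Python's built-in `str.encode`; the helper name `utf8_len` and the sample characters are chosen only for illustration.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Sample characters, one or more per byte-length class.
samples = [
    ("A", "ASCII letter"),
    ("\u00e9", "Latin small e with acute"),
    ("\u6c49", "common Chinese character (han)"),
    ("\u5b57", "common Chinese character (zi)"),
    ("\U00020000", "rare CJK character outside the BMP"),
    ("\U0001F600", "emoji"),
]

if __name__ == "__main__":
    for ch, description in samples:
        print(f"U+{ord(ch):04X} {description}: {utf8_len(ch)} byte(s)")
```

Common Chinese characters such as 汉 and 字 sit in the Basic Multilingual Plane and take 3 bytes each, so the two-character string 汉字 occupies 6 bytes. Rare CJK characters in the supplementary planes, such as U+20000, take 4 bytes.
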