How many bytes does a Chinese character occupy in UTF-8?

Source: http://en.wikipedia.org/wiki/UTF-8#Description

This table shows UTF-8 as it has been defined since 2003 (the x characters are replaced by the bits of the code point):


UTF-8 (2003)

Number of bytes  Bits for code point  First code point  Last code point  Byte 1    Byte 2    Byte 3    Byte 4
1                7                    U+0000            U+007F           0xxxxxxx
2                11                   U+0080            U+07FF           110xxxxx  10xxxxxx
3                16                   U+0800            U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
4                21                   U+10000           U+10FFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
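The bit layout in the table can be followed by hand. Below is a minimal sketch in Python; the function name `utf8_encode` is hypothetical and used only for illustration, and in practice the built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the bit layout in the table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+4E2D (中) lies in the U+0800..U+FFFF row, so it takes three bytes.
print(utf8_encode(0x4E2D).hex(" "))
```

For U+4E2D the result matches what `"中".encode("utf-8")` returns, which is a quick way to sanity-check the sketch.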
The salient features of this scheme are as follows:


Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0. This means that ASCII text is valid UTF-8, and UTF-8 can be used for parsers expecting 8-bit extended ASCII even if they are not designed for UTF-8.
Clear distinction between multi-byte and single-byte characters: Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes all have '10' in the high-order position. Thus, no bytes representing ASCII characters appear in multi-byte sequences.
Clear indication of byte sequence length: Like in UTF-1, the first byte indicates the number of bytes in the sequence. Unlike in UTF-1, for multi-byte sequences it is simply the number of high-order 1s in the leading byte.
Prefix property: From the sequence length indication in the first byte (for both UTF-1 and UTF-8), a reader also knows where a sequence ends, which implies that no valid sequence is a prefix of any other. This means that a reader reading from a stream can instantaneously decode each individual fully received sequence, without first having to wait for either the first byte of a next sequence or an end-of-stream indication.
Self-synchronization: Unlike in UTF-1, the high-order bits of every byte determine the type of byte: single bytes (0xxxxxxx), leading bytes (11...xxx), and continuation bytes (10xxxxxx) do not share values. The start of a character can be found by backing up at most 3 bytes (5 bytes before the RFC 3629 restriction). Together with the prefix property, this makes the scheme self-synchronizing.
Code structure: The remaining bits of the encoding are used for the bits of the code point being encoded, padded with high-order 0s if necessary. The high-order bits go in the leading byte, lower-order bits in subsequent continuation bytes. The number of bytes in the encoding must be the minimum required to hold all the significant bits of the code point.
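The length indication and self-synchronization properties above can be sketched as follows. The helper names (`sequence_length`, `is_continuation`, `char_start`) are assumptions made for this example. The sketch classifies bytes by bit pattern only and does not reject overlong or otherwise invalid sequences.

```python
def sequence_length(lead: int) -> int:
    """Return the sequence length announced by a leading byte."""
    if lead < 0x80:            # 0xxxxxxx: single-byte (ASCII)
        return 1
    if 0xC0 <= lead < 0xE0:    # 110xxxxx
        return 2
    if 0xE0 <= lead < 0xF0:    # 1110xxxx
        return 3
    if 0xF0 <= lead < 0xF8:    # 11110xxx
        return 4
    raise ValueError("not a leading byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Back up from any byte offset to the first byte of its character."""
    start = index
    while start > 0 and is_continuation(data[start]):
        start -= 1
    return start


data = "a\u4e2d\U00020000".encode("utf-8")  # 1 + 3 + 4 bytes
start = char_start(data, 6)                 # offset 6 is inside the 4-byte character
print(start, sequence_length(data[start]))
```

On well-formed input the loop in `char_start` steps back at most three times, which is the bound stated in the self-synchronization item.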
The first 128 characters (US-ASCII) need one byte. The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, Thaana and N'Ko alphabets, as well as Combining Diacritical Marks. Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[13] including most Chinese, Japanese and Korean characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).
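To check the byte counts for specific characters, including the Chinese characters the title asks about, a short Python sketch is enough. The helper name `utf8_length` and the sample characters are chosen for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    "A",           # US-ASCII
    "\u00e9",      # é, Latin letter outside ASCII
    "\u6c49",      # 汉, common Chinese character in the Basic Multilingual Plane
    "\u5b57",      # 字, common Chinese character in the Basic Multilingual Plane
    "\U00020000",  # CJK Extension B character, outside the BMP
    "\U0001F600",  # emoji, outside the BMP
]

for ch in samples:
    print(f"U+{ord(ch):04X} {ch} -> {utf8_length(ch)} byte(s)")
```

Common Chinese characters such as 汉 and 字 come out at three bytes, while the less common CJK characters in the supplementary planes need four.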
