An Introduction to Unicode and UTF-8

Unicode is an industry standard for consistent encoding of written text.

There are lots of character sets which are used by computers, but Unicode is the first of its kind to aim to support every single written language on earth (and beyond!).

Its aim is to provide a unique number to identify every character for every language, on any platform.

Unicode maps every character to a specific number, called a code point. A code point is written as U+<hex-code> and ranges from U+0000 to U+10FFFF.

An example code point looks like this: U+004F. It always identifies the same character (here, the Latin capital letter O); what depends on the character encoding is how that code point is stored as bytes.
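
As a small illustration, the U+ notation is just a hexadecimal integer with a prefix. The Python sketch below converts in both directions; the helper names `to_code_point` and `from_code_point` are made up for this example and are not part of any library.

```python
def to_code_point(ch: str) -> str:
    """Format one character as U+XXXX, using at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_code_point(label: str) -> str:
    """Turn a label such as 'U+004F' back into the character it names."""
    return chr(int(label[2:], 16))


print(to_code_point("O"))            # U+004F
print(from_code_point("U+004F"))     # O
print(to_code_point("\U0001F436"))   # U+1F436 (the dog face emoji)
```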

Unicode defines several character encodings, the most widely used being UTF-8, UTF-16 and UTF-32.

UTF-8 is definitely the most popular encoding in the Unicode family, especially on the Web. This document is written in UTF-8, for example.

At the time of writing, more than 135,000 characters are defined, with room for more than 1.1 million code points.

Scripts

All the characters Unicode supports are grouped into sections called scripts.

There is a script for every writing system:

  • Latin (contains the ASCII letters plus the other Western characters)
  • Korean
  • Old Hungarian
  • Hebrew
  • Greek
  • Armenian
  • …and so on!

The full list is defined in the ISO 15924 standard.

See more on scripts: https://en.wikipedia.org/wiki/Script_(Unicode)

Planes

In addition to scripts, there is another way that Unicode organizes its characters: planes.

Instead of grouping characters by writing system, planes group them by code point value:

Plane   Range
0       U+0000 - U+FFFF
1       U+10000 - U+1FFFF
2       U+20000 - U+2FFFF
14      U+E0000 - U+EFFFF
15      U+F0000 - U+FFFFF
16      U+100000 - U+10FFFF

There are 17 planes.

The first one is special: it is called the Basic Multilingual Plane, or BMP, and contains most characters and symbols in modern use, including the Latin, Cyrillic and Greek scripts.

The other 16 planes are informally called astral planes (the standard calls them supplementary planes). It is worth noting that planes 4 to 13 have no characters assigned yet; plane 3 was also empty until Unicode 13.0 (2020).

The code points contained in astral planes are called astral code points.

Astral code points are all code points from U+10000 upward.
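
Because every plane spans exactly 0x10000 code points, the plane number can be read straight off the code point. A minimal Python sketch, where `PLANE_SIZE`, `plane_of` and `is_astral` are names invented for this example:

```python
PLANE_SIZE = 0x10000  # every plane holds 65,536 code points


def plane_of(ch: str) -> int:
    """Return the plane number (0 to 16) that a single character lives in."""
    return ord(ch) // PLANE_SIZE


def is_astral(ch: str) -> bool:
    """True for any character outside the Basic Multilingual Plane."""
    return plane_of(ch) > 0


# A, e with acute accent, dog face emoji, and the last valid code point.
for ch in ("A", "\u00e9", "\U0001F436", chr(0x10FFFF)):
    print(f"U+{ord(ch):04X} -> plane {plane_of(ch)}, astral: {is_astral(ch)}")
```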

Code units

Code points are stored internally as code units. A code unit is the smallest group of bits an encoding uses to represent a character, and its length varies depending on the character encoding.

UTF-32 uses a 32-bit code unit.

UTF-8 uses an 8-bit code unit, and UTF-16 uses a 16-bit code unit. If a code point does not fit in a single code unit, it is represented by 2 code units (or more, in UTF-8).
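
To make the difference concrete, the following Python sketch counts how many code units a character needs in each encoding. The `code_units` helper is hypothetical; it relies only on the standard `str.encode` method, using the little-endian codec names so that no byte order mark is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units each encoding needs for one character."""
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 1 byte per code unit
        "UTF-16": len(ch.encode("utf-16-le")) // 2,  # 2 bytes per code unit
        "UTF-32": len(ch.encode("utf-32-le")) // 4,  # 4 bytes per code unit
    }


# A, e with acute accent, euro sign, dog face emoji.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F436"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only the emoji, which lives in an astral plane, needs two UTF-16 code units; every character needs exactly one UTF-32 code unit.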

Graphemes

A grapheme is a symbol that represents a unit of a writing system. It's basically your idea of a character and what it should look like.

Glyphs

A glyph is a graphic representation of a grapheme: how it is visually displayed on screen, the actual appearance on the display.

Sequences

Unicode lets you combine different characters to form a grapheme.

Accented characters are one example: the letter é can be expressed as a combination of the letter e (U+0065) and the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):

"U+0065U+0301" ➡️ "é"

U+0301 in this case is what is known as a combining mark: a character that applies to the one before it to form a different grapheme.
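
The sequence can be inspected with Python's standard `unicodedata` module. This is only a sketch; the variable names are arbitrary.

```python
import unicodedata

sequence = "e\u0301"  # letter e followed by COMBINING ACUTE ACCENT

# Two code points, even though the sequence is displayed as a single grapheme.
print(len(sequence))  # 2

names = [unicodedata.name(ch) for ch in sequence]
print(names)  # ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# A non-zero combining class marks a character that attaches to the previous one.
is_combining = [unicodedata.combining(ch) != 0 for ch in sequence]
print(is_combining)  # [False, True]
```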

Normalization

A character can sometimes be represented using different combinations of code points.

Accented characters are again a good example: the letter é can be expressed either as the single code point U+00E9 or as the letter e (U+0065) followed by the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):

U+00E9       ➡️ "é"
U+0065U+0301 ➡️ "é"

The normalization process analyzes a string for these kinds of ambiguities and generates a string that uses the canonical representation of every character.

Without normalization, strings that look identical to the eye are considered different because their internal representations differ.
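
A minimal Python sketch of the problem and its fix, using `unicodedata.normalize` from the standard library. The `same_text` helper is an assumption of this example, and NFC is just one of the normalization forms that could be chosen for the comparison.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the NFC form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False: different code points
print(same_text(composed, decomposed))  # True: same canonical form

# NFD goes the other way and splits the precomposed character apart.
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```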

Emojis

Most emojis are Unicode astral plane characters. They provide a way to show pictures on your screen without using actual image files, just font glyphs.

As an example, the 🐶 symbol is encoded as U+1F436.

The first 128 characters

The first 128 characters of Unicode are the same as the ASCII character set.

The first 32 characters, U+0000 to U+001F (0-31), are called control codes.

They are a legacy of the past, and most of them are now obsolete. They were used for teletype machines, something that existed before the fax.

Characters from U+0020 (32) to U+007E (126) contain digits, letters and some symbols:

Unicode   ASCII code   Glyph
U+0020    32           (space)
U+0021    33           !
U+0022    34           "
U+0023    35           #
U+0024    36           $
U+0025    37           %
U+0026    38           &
U+0027    39           '
U+0028    40           (
U+0029    41           )
U+002A    42           *
U+002B    43           +
U+002C    44           ,
U+002D    45           -
U+002E    46           .
U+002F    47           /
U+0030    48           0
U+0031    49           1
U+0032    50           2
U+0033    51           3
U+0034    52           4
U+0035    53           5
U+0036    54           6
U+0037    55           7
U+0038    56           8
U+0039    57           9
U+003A    58           :
U+003B    59           ;
U+003C    60           <
U+003D    61           =
U+003E    62           >
U+003F    63           ?
U+0040    64           @
U+0041    65           A
U+0042    66           B
U+0043    67           C
U+0044    68           D
U+0045    69           E
U+0046    70           F
U+0047    71           G
U+0048    72           H
U+0049    73           I
U+004A    74           J
U+004B    75           K
U+004C    76           L
U+004D    77           M
U+004E    78           N
U+004F    79           O
U+0050    80           P
U+0051    81           Q
U+0052    82           R
U+0053    83           S
U+0054    84           T
U+0055    85           U
U+0056    86           V
U+0057    87           W
U+0058    88           X
U+0059    89           Y
U+005A    90           Z
U+005B    91           [
U+005C    92           \
U+005D    93           ]
U+005E    94           ^
U+005F    95           _
U+0060    96           `
U+0061    97           a
U+0062    98           b
U+0063    99           c
U+0064    100          d
U+0065    101          e
U+0066    102          f
U+0067    103          g
U+0068    104          h
U+0069    105          i
U+006A    106          j
U+006B    107          k
U+006C    108          l
U+006D    109          m
U+006E    110          n
U+006F    111          o
U+0070    112          p
U+0071    113          q
U+0072    114          r
U+0073    115          s
U+0074    116          t
U+0075    117          u
U+0076    118          v
U+0077    119          w
U+0078    120          x
U+0079    121          y
U+007A    122          z
U+007B    123          {
U+007C    124          |
U+007D    125          }
U+007E    126          ~

  • Digits go from U+0030 to U+0039
  • Uppercase letters go from U+0041 to U+005A
  • Lowercase letters go from U+0061 to U+007A

U+007F (127) is the delete character, which is also a control code.
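
The ranges above translate directly into comparisons on the code point. The Python sketch below is illustrative only: `classify_ascii` and its labels are invented for this example.

```python
def classify_ascii(ch: str) -> str:
    """Label one character according to the ASCII ranges listed above."""
    cp = ord(ch)
    if cp > 0x7F:
        return "not ASCII"
    if cp <= 0x1F or cp == 0x7F:
        return "control code"
    if 0x30 <= cp <= 0x39:
        return "digit"
    if 0x41 <= cp <= 0x5A:
        return "uppercase letter"
    if 0x61 <= cp <= 0x7A:
        return "lowercase letter"
    return "space or symbol"  # everything else from U+0020 to U+007E


for ch in ("\n", "7", "Q", "q", "~", "\u00e9"):
    print(f"U+{ord(ch):04X}: {classify_ascii(ch)}")
```

A side effect of this layout is that each uppercase letter sits exactly 0x20 below its lowercase counterpart, so `chr(ord("A") + 0x20)` gives `"a"`.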

Everything after that is outside the realm of ASCII and is part of Unicode exclusively.

You can find the whole list on Wikipedia: https://en.wikipedia.org/wiki/List_of_Unicode_characters

Unicode encodings

UTF-8

UTF-8 is a variable-width character encoding that can encode every character covered by Unicode, using 1 to 4 bytes of 8 bits each.

It was originally designed by Ken Thompson and Rob Pike in 1992. Those names will be familiar to anyone with an interest in the Go programming language, as they were also two of its original creators.

It’s recommended by the W3C as the default encoding for HTML files, and stats indicate that it was used on 91.3% of all web pages as of April 2018.

At the time of its introduction, ASCII was the most popular character encoding in the western world. In ASCII every letter, digit and symbol was assigned a number. Each number was stored in a single byte, but because ASCII defines only 7 bits it could represent at most 128 characters, and that was enough for English text.

UTF-8 was designed to be backward compatible with ASCII. This was very important for its adoption, as ASCII was much older (1963) and widespread, and moving to UTF-8 came almost transparently.

The first 128 characters of UTF-8 map exactly to ASCII. Why 128? Because ASCII uses a 7-bit encoding, which allows up to 128 combinations. Why 7 bits? We now take 8 bits for granted, but back when ASCII was conceived, 7-bit systems were popular as well.

Being 100% compatible with ASCII also makes UTF-8 very efficient for Western text, because the most frequently used characters in Western languages are encoded with just 1 byte.
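
The compatibility is easy to check with Python's built-in codecs. In this sketch the variable names are arbitrary; it compares the bytes produced by the `ascii` and `utf-8` codecs for all 128 ASCII characters.

```python
text = "Hello, Unicode!"
print(text.encode("ascii") == text.encode("utf-8"))  # True

# The same holds for all 128 ASCII characters at once.
all_ascii = "".join(chr(i) for i in range(128))
ascii_bytes = all_ascii.encode("ascii")
utf8_bytes = all_ascii.encode("utf-8")
print(ascii_bytes == utf8_bytes)  # True
print(len(utf8_bytes))            # 128, one byte per character

# A character outside ASCII needs more than one byte, each with the high bit set.
e_acute = "\u00e9".encode("utf-8")
print(e_acute.hex())  # c3a9
```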

Here is how many bytes each range of code points needs:

Number of bytes   Start     End
1                 U+0000    U+007F
2                 U+0080    U+07FF
3                 U+0800    U+FFFF
4                 U+10000   U+10FFFF
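
The table can be turned into a tiny hand-written encoder. The Python sketch below follows the four ranges; the bit patterns in the comments come from the UTF-8 definition rather than from this article, and `utf8_encode` is a name made up for the example. It skips checks a real encoder would make, such as rejecting surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the byte ranges in the table."""
    if cp <= 0x7F:      # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# Check the first and last code point of every row against Python's own codec.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
for cp in boundaries:
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}: {len(encoded)} byte(s), "
          f"matches built-in: {encoded == chr(cp).encode('utf-8')}")
```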

Remember that in ASCII the characters were encoded as numbers? The letter A is represented in ASCII by the number 65; its Unicode code point is U+0041, and UTF-8 stores it as the single byte 0x41, which is the same value.

Why not U+0065, you ask? Because Unicode code points are written in hexadecimal: decimal 10 is written U+000A, and so on (basically, you have a set of 16 digits instead of 10).
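
The decimal and hexadecimal views of the same number can be compared in a few lines of Python. This is only a sketch, and the variable names are arbitrary.

```python
a_decimal = ord("A")
print(a_decimal)               # 65
print(hex(a_decimal))          # 0x41
a_label = f"U+{a_decimal:04X}"
print(a_label)                 # U+0041

# Read as hexadecimal, "65" is a different number and a different character.
other = int("65", 16)
print(other, chr(other))       # 101 e

ten_label = f"U+{10:04X}"
print(ten_label)               # U+000A
```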

Take a look at this video, which brilliantly explains this UTF-8 and ASCII compatibility.

UTF-16

UTF-16 is another very popular Unicode encoding. For example, it’s how Java represents characters internally. It’s also one of the 2 encodings JavaScript uses internally, along with UCS-2. It’s used by many other systems as well, such as Windows.

UTF-16 is a variable-length encoding, like UTF-8, but it uses 2 bytes (16 bits) as the minimum for any character. As such, it is not backward compatible with ASCII.

Code points in the Basic Multilingual Plane (BMP) are stored using 2 bytes. Code points in astral planes are stored using 4 bytes, as a pair of 16-bit code units called a surrogate pair.
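
As a rough illustration of how an astral code point becomes two 16-bit code units, here is a Python sketch. The constants 0xD800 and 0xDC00 come from the UTF-16 definition rather than from this article, and `utf16_code_units` is a hypothetical helper that does not validate its input.

```python
def utf16_code_units(cp: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if cp <= 0xFFFF:
        return [cp]                   # BMP: a single code unit
    offset = cp - 0x10000             # 20 bits remain for astral code points
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


print([f"{u:04X}" for u in utf16_code_units(0x0041)])   # ['0041']
print([f"{u:04X}" for u in utf16_code_units(0x1F436)])  # ['D83D', 'DC36']
```

The result for U+1F436 matches what `"\U0001F436".encode("utf-16-be")` produces with Python's built-in codec.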

UTF-32

UTF-8 uses a minimum of 1 byte, UTF-16 uses a minimum of 2 bytes.

UTF-32 always uses 4 bytes, without optimizing for space, and as such it wastes a lot of storage and bandwidth.

This constraint makes it faster to operate on, because there is less to check: you can assume 4 bytes for every code point.
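
The fixed width means the n-th code point can be found by simple arithmetic instead of scanning from the start. A minimal Python sketch, assuming little-endian UTF-32 data without a byte order mark; `char_at` is a name invented for this example.

```python
def char_at(utf32_data: bytes, index: int) -> str:
    """Read the code point at `index` directly, without scanning from the start."""
    start = index * 4  # every code point occupies exactly 4 bytes
    return utf32_data[start:start + 4].decode("utf-32-le")


text = "a\U0001F436\u00e9"  # a, dog face emoji, e with acute accent
data = text.encode("utf-32-le")

print(len(data))                            # 12 bytes for 3 code points
print(char_at(data, 1) == "\U0001F436")     # True
print(char_at(data, 2) == "\u00e9")         # True
```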

It’s not as popular as UTF-8 and UTF-16, but it has its applications.

Translated from: https://flaviocopes.com/unicode/
