Introduction to Unicode and UTF-8

cuk0051


Original article: https://flaviocopes.com/unicode/

Unicode is an industry standard for consistent encoding of written text.

There are lots of character sets which are used by computers, but Unicode is the first of its kind to aim to support every single written language on earth (and beyond!).

Its aim is to provide a unique number to identify every character for every language, on any platform.

Unicode maps every character to a specific code, called a code point. A code point takes the form U+<hex-code>, ranging from U+0000 to U+10FFFF.

An example code point looks like this: U+004F, which identifies the Latin capital letter O. The code point itself is fixed; how it is stored as bytes depends on the character encoding used.

Unicode defines several character encodings, the most used being UTF-8, UTF-16 and UTF-32.

UTF-8 is definitely the most popular encoding in the Unicode family, especially on the Web. This document is written in UTF-8, for example.

Currently more than 135,000 different characters are implemented, with space for more than 1.1 million.

Scripts

All the Unicode supported characters are grouped into sections called scripts.

There is a script for each writing system:

Latin (contains all of ASCII plus the other Western characters)
Korean
Old Hungarian
Hebrew
Greek
Armenian
…and so on!

The full list is defined in the ISO 15924 standard.

See more on scripts: https://en.wikipedia.org/wiki/Script_(Unicode)

Planes

In addition to scripts, there is another way that Unicode organizes its characters: planes.

Instead of grouping characters by type, planes group them by code point value:

Plane	Range
0	U+0000 - U+FFFF
1	U+10000 - U+1FFFF
2	U+20000 - U+2FFFF
…	…
14	U+E0000 - U+EFFFF
15	U+F0000 - U+FFFFF
16	U+100000 - U+10FFFF

There are 17 planes.
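
Since every plane holds 0x10000 (65,536) code points, the plane of a code point can be read off by discarding its low 16 bits. The following is a minimal sketch of that idea; the helper name `plane_of` is made up for illustration and is not part of any library.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0-16) that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    # Each plane spans 0x10000 code points, so the plane number
    # is whatever remains after dropping the low 16 bits.
    return code_point >> 16


print(plane_of(ord("A")))   # U+0041 sits in plane 0, the first plane
print(plane_of(0x1F436))    # U+1F436 sits in plane 1
print(plane_of(0x10FFFF))   # the last code point sits in plane 16
```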

The first is special: it is called the Basic Multilingual Plane, or BMP, and contains most modern characters and symbols, including those of the Latin, Cyrillic, and Greek scripts.

The other 16 planes are called astral planes. Planes 4 to 13 are currently unassigned; plane 3 was also empty until Unicode 13.0 (2020) assigned CJK ideographs to it.

The code points contained in astral planes are called astral code points.

Astral code points are all code points from U+10000 upward.

Code units

Code points are stored internally as code units. A code unit is the fixed-size group of bits an encoding uses as its building block, and its length varies depending on the character encoding.

UTF-32 uses a 32-bit code unit.

UTF-8 uses an 8-bit code unit, and UTF-16 uses a 16-bit code unit. If a code point needs a larger size, it will be represented by 2 (or more, in UTF-8) code units.
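
To make the difference concrete, here is a small sketch that counts the code units a string needs in each encoding. It relies only on Python's built-in `str.encode`; the function name `code_unit_counts` is an assumption made for this example. The little-endian codec names are used so that no byte order mark is added to the output.

```python
def code_unit_counts(text: str) -> dict:
    """Count how many code units `text` needs in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }


# A, é (U+00E9), the euro sign (U+20AC), and the dog emoji (U+1F436)
for ch in ["A", "\u00e9", "\u20ac", "\U0001F436"]:
    print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

The emoji, for example, needs four UTF-8 code units, two UTF-16 code units, and a single UTF-32 code unit.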

Graphemes

A grapheme is a symbol that represents a unit of a writing system. It’s basically your idea of a character and what it should look like.

Glyphs

A glyph is a graphic representation of a grapheme: how it is visually displayed on screen, the actual appearance on the display.

Sequences

Unicode lets you combine different characters to form a grapheme.

Accented characters are one example: the letter é can be expressed by combining the letter e (U+0065) with the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):

"U+0065U+0301" ➡️ "é"

U+0301 in this case is what is described as a combining mark, one character that applies to the previous one to form a different grapheme.
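
The e plus accent sequence can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch; how the string renders depends on your terminal and font.

```python
import unicodedata

sequence = "\u0065\u0301"  # "e" followed by COMBINING ACUTE ACCENT

print(sequence)       # usually rendered as a single grapheme
print(len(sequence))  # 2: Python counts code points, not graphemes

for ch in sequence:
    # Show each code point together with its official Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# combining() returns a non-zero class for combining marks, 0 otherwise.
print(unicodedata.combining("\u0301") != 0)  # True
print(unicodedata.combining("e") != 0)       # False
```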

Normalization

A character can sometimes be represented using different combinations of code points.

Accented characters are again an example: the letter é can be expressed both as U+00E9 and as the letter e (U+0065) followed by the Unicode character named “COMBINING ACUTE ACCENT” (U+0301):

U+00E9       ➡️ "é"
U+0065U+0301 ➡️ "é"

The normalization process analyzes a string for this kind of ambiguity and generates a string with the canonical representation of every character.

Without normalization, strings that look identical to the eye are considered different, because their internal representations differ.
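
The two spellings of é can be compared in Python as a quick check. The sketch uses `unicodedata.normalize` from the standard library with the NFC (composed) and NFD (decomposed) normalization forms; the variable names are chosen for this example only.

```python
import unicodedata

precomposed = "\u00e9"       # é as a single code point
decomposed = "\u0065\u0301"  # e + COMBINING ACUTE ACCENT

# The strings look the same but hold different code points.
print(precomposed == decomposed)  # False

# NFC composes sequences, NFD decomposes them.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True
```

Normalizing both sides to the same form before comparing is what makes the two strings equal.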

Emojis

Most emojis are Unicode astral plane characters (a few older symbols live in the BMP). They provide a way to show images on screen without using real image files, just font glyphs.

As an example, the 🐶 symbol is encoded as U+1F436.
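
A short sketch to confirm this in Python, using only built-in functions. The emoji is written as an escape sequence so the example does not depend on the file's own encoding.

```python
dog = "\U0001F436"  # the dog face emoji

print(f"U+{ord(dog):X}")               # U+1F436
print(ord(dog) > 0xFFFF)               # True: outside the BMP
print(dog.encode("utf-8").hex())       # f09f90b6 (4 bytes)
print(dog.encode("utf-16-be").hex())   # d83ddc36 (two 16-bit code units)
```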

The first 128 characters

The first 128 characters of Unicode are the same as the ASCII character set.

The first 32 characters, U+0000-U+001F (0-31) are called Control Codes.

They are an inheritance from the past and most of them are now obsolete. They were used for teletype machines, something that existed before the fax.

Characters from U+0020 (32) to U+007E (126) contain numbers, letters and some symbols:

Unicode	ASCII code	Glyph
U+0020	32	(space)
U+0021	33	!
U+0022	34	"
U+0023	35	#
U+0024	36	$
U+0025	37	%
U+0026	38	&
U+0027	39	'
U+0028	40	(
U+0029	41	)
U+002A	42	*
U+002B	43	+
U+002C	44	,
U+002D	45	-
U+002E	46	.
U+002F	47	/
U+0030	48	0
U+0031	49	1
U+0032	50	2
U+0033	51	3
U+0034	52	4
U+0035	53	5
U+0036	54	6
U+0037	55	7
U+0038	56	8
U+0039	57	9
U+003A	58	:
U+003B	59	;
U+003C	60	<
U+003D	61	=
U+003E	62	>
U+003F	63	?
U+0040	64	@
U+0041	65	A
U+0042	66	B
U+0043	67	C
U+0044	68	D
U+0045	69	E
U+0046	70	F
U+0047	71	G
U+0048	72	H
U+0049	73	I
U+004A	74	J
U+004B	75	K
U+004C	76	L
U+004D	77	M
U+004E	78	N
U+004F	79	O
U+0050	80	P
U+0051	81	Q
U+0052	82	R
U+0053	83	S
U+0054	84	T
U+0055	85	U
U+0056	86	V
U+0057	87	W
U+0058	88	X
U+0059	89	Y
U+005A	90	Z
U+005B	91	[
U+005C	92	\
U+005D	93	]
U+005E	94	^
U+005F	95	_
U+0060	96	`
U+0061	97	a
U+0062	98	b
U+0063	99	c
U+0064	100	d
U+0065	101	e
U+0066	102	f
U+0067	103	g
U+0068	104	h
U+0069	105	i
U+006A	106	j
U+006B	107	k
U+006C	108	l
U+006D	109	m
U+006E	110	n
U+006F	111	o
U+0070	112	p
U+0071	113	q
U+0072	114	r
U+0073	115	s
U+0074	116	t
U+0075	117	u
U+0076	118	v
U+0077	119	w
U+0078	120	x
U+0079	121	y
U+007A	122	z
U+007B	123	{
U+007C	124	|
U+007D	125	}
U+007E	126	~

Digits go from U+0030 to U+0039
Uppercase letters go from U+0041 to U+005A
Lowercase letters go from U+0061 to U+007A
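
These ranges are enough to classify any ASCII character by its code point alone. The sketch below is one way to write that; the function name `ascii_class` and the category labels are invented for this example.

```python
def ascii_class(ch: str) -> str:
    """Classify a single character using the ASCII code point ranges."""
    cp = ord(ch)
    if 0x30 <= cp <= 0x39:
        return "digit"
    if 0x41 <= cp <= 0x5A:
        return "uppercase letter"
    if 0x61 <= cp <= 0x7A:
        return "lowercase letter"
    if cp <= 0x1F or cp == 0x7F:
        return "control code"   # U+0000-U+001F, plus delete (U+007F)
    if cp <= 0x7E:
        return "space or symbol"
    return "not ASCII"


for ch in ["7", "Q", "q", " ", "\n", "\u00e9"]:
    print(f"U+{ord(ch):04X}", ascii_class(ch))
```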

U+007F (127) is the delete character.

Everything beyond this point is outside the realm of ASCII and belongs to Unicode exclusively.

You can find the whole list on Wikipedia: https://en.wikipedia.org/wiki/List_of_Unicode_characters

Unicode encodings

UTF-8

UTF-8 is a variable-width character encoding that can encode every character covered by Unicode, using one to four 8-bit bytes.

It was originally designed by Ken Thompson and Rob Pike in 1992. Those names are familiar to those with any interest in the Go programming language, as they were two of the original creators of that as well.

It is recommended by the W3C as the default encoding for HTML files, and statistics indicate that it was used on 91.3% of all web pages as of April 2018.

At the time UTF-8 was introduced, ASCII was the most popular character encoding in the western world. In ASCII, every letter, digit, and symbol was assigned a number. Because that number was limited to 7 bits, ASCII could represent at most 128 characters, which was enough for English text.

UTF-8 was designed to be backward compatible with ASCII. This was very important for its adoption, as ASCII was much older (1963) and widespread, and moving to UTF-8 came almost transparently.

The first 128 characters of UTF-8 map exactly to ASCII. Why 128? Because ASCII uses a 7-bit encoding, which allows up to 128 combinations. Why 7 bits? We now take 8 bits for granted, but back when ASCII was conceived, 7-bit systems were popular as well.

Being 100% compatible with ASCII makes UTF-8 also very efficient, because the most frequently used characters in the western languages are encoded with 1 byte only.
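
The compatibility is easy to verify: encoding a pure ASCII string as ASCII and as UTF-8 should give identical bytes. A minimal check in Python, with a sample string chosen arbitrarily:

```python
text = "Hello, UTF-8!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)       # True: identical byte sequences
print(len(utf8_bytes) == len(text))    # True: one byte per character

# A non-ASCII character such as é (U+00E9) takes more than one byte.
print(len("\u00e9".encode("utf-8")))   # 2
```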

Here is how many bytes UTF-8 uses for each range of code points:

Number of bytes	Start	End
1	`U+0000`	`U+007F`
2	`U+0080`	`U+07FF`
3	`U+0800`	`U+FFFF`
4	`U+10000`	`U+10FFFF`
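
The table can be turned into a hand-written encoder. This is a teaching sketch, not a replacement for Python's built-in `str.encode`; the function name `utf8_encode` is hypothetical, and the bit patterns in the comments are the standard UTF-8 byte layouts. The loop at the end compares the result against the built-in encoder.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8, following the ranges in the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        # This range is reserved for UTF-16 and is not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:      # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC, 0x1F436):
    # Cross-check each result against Python's own encoder.
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```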

Remember that in ASCII the characters were encoded as numbers? The letter A, represented in ASCII by the number 65, has the Unicode code point U+0041, which UTF-8 stores as the single byte 0x41.

Why not U+0065, you ask? Because Unicode code points are written in hexadecimal: 65 in decimal is 41 in hexadecimal, and decimal 10 is written U+000A (you have a set of 16 digits instead of 10).
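
The decimal and hexadecimal views of the same number can be compared directly in Python. This sketch uses only built-ins (`ord`, `chr`, `hex`, `int`, and format specifiers):

```python
print(ord("A"))              # 65, the decimal value
print(hex(ord("A")))         # 0x41, the same value in hexadecimal
print(f"U+{ord('A'):04X}")   # U+0041, the usual code point notation
print(int("0041", 16))       # 65, converting the hex digits back

print(chr(0x0065))           # e: U+0065 is decimal 101, not 65
print(f"U+{10:04X}")         # U+000A: decimal 10 in code point notation
```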

Take a look at this video, which brilliantly explains this UTF-8 and ASCII compatibility.

UTF-16

UTF-16 is another very popular Unicode encoding. For example, it’s how Java internally represents any character. It’s also one of the 2 encodings JavaScript uses internally, along with UCS-2. It’s used by many other systems as well, like Windows.

UTF-16 is a variable-length encoding, like UTF-8, but uses 2 bytes (16 bits) as the minimum for any character. As such, it is not backward compatible with ASCII.

Code points in the Basic Multilingual Plane (BMP) are stored using 2 bytes. Code points in astral planes are stored using 4 bytes.
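
The 4-byte form consists of two 16-bit code units, commonly called a surrogate pair. The sketch below shows the arithmetic under that scheme; the function name `utf16_code_units` is an assumption for this example, and it does not validate its input beyond the BMP check.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the 16-bit code units UTF-16 uses for one code point."""
    if code_point <= 0xFFFF:
        return [code_point]          # BMP: one code unit (2 bytes)
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return [high, low]               # astral: two code units (4 bytes)


print([hex(u) for u in utf16_code_units(0x0041)])   # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F436)])  # ['0xd83d', '0xdc36']
```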

UTF-32

UTF-8 uses a minimum of 1 byte, UTF-16 uses a minimum of 2 bytes.

UTF-32 always uses 4 bytes, without optimizing for space usage, and as such it wastes a lot of bandwidth.

This constraint makes it faster to operate on, because there is less to check: you can assume 4 bytes for every character.
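
With a fixed width, the position of the n-th code point in the byte stream is simply n times 4. Here is a rough sketch of that lookup on raw UTF-32 bytes; the helper `code_point_at` is hypothetical, and the little-endian codec is chosen so that no byte order mark shifts the offsets.

```python
def code_point_at(data: bytes, index: int) -> int:
    """Read the index-th code point straight from UTF-32-LE bytes."""
    start = index * 4  # fixed width: no scanning of earlier bytes needed
    return int.from_bytes(data[start:start + 4], "little")


text = "A\u00e9\U0001F436"          # A, é, and the dog emoji
utf32 = text.encode("utf-32-le")    # little-endian, no byte order mark

print(len(utf32))                     # 12: always 4 bytes per code point
print(hex(code_point_at(utf32, 2)))   # 0x1f436
print(len(text.encode("utf-8")))      # 7: 1 + 2 + 4 bytes for the same text
```

The same lookup on UTF-8 bytes would need to walk through the earlier characters first, since each may take a different number of bytes.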

It’s not as popular as UTF-8 and UTF-16, but it has its applications.
