An Explanation of Unicode Character Encoding

For a computer to be able to store text and numbers that humans can understand, there needs to be a code that transforms characters into numbers. The Unicode standard defines such a code by using character encoding.

Character encoding matters because it lets every device display the same information. A custom character encoding scheme might work brilliantly on one computer, but problems will occur when you send that same text to someone else. The receiving computer won't know what you're talking about unless it understands the encoding scheme too.

Character Encoding

All character encoding does is assign a number to every character that can be used. You could make a character encoding right now.

For example, I could say that the letter A becomes the number 13, a=14, 1=33, #=123, and so on.
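
As a rough illustration of such a home-made scheme, the sketch below hard-codes the four assignments mentioned above. The class name ToyEncoding, the TABLE map, and the encode method are hypothetical names invented for this example, and Map.of assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// A toy, made-up character encoding using the numbers from the text above.
class ToyEncoding {
    // Hypothetical code table: only these four characters are defined.
    static final Map<Character, Integer> TABLE =
            Map.of('A', 13, 'a', 14, '1', 33, '#', 123);

    // Turn each character of the text into its assigned number.
    static List<Integer> encode(String text) {
        List<Integer> codes = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Integer code = TABLE.get(c);
            if (code == null) {
                throw new IllegalArgumentException("No code defined for: " + c);
            }
            codes.add(code);
        }
        return codes;
    }

    public static void main(String[] args) {
        System.out.println(encode("Aa1#")); // prints [13, 14, 33, 123]
    }
}
```

A computer that receives [13, 14, 33, 123] without the same table has no way to turn the numbers back into "Aa1#".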

This is where industry-wide standards come in. If the whole computer industry uses the same character encoding scheme, every computer can display the same characters.

What Is Unicode?

ASCII (American Standard Code for Information Interchange) became the first widespread encoding scheme. However, it's limited to only 128 character definitions. This is fine for the most common English characters, numbers, and punctuation, but is a bit limiting for the rest of the world.

Naturally, the rest of the world wanted the same kind of encoding scheme for their characters too. However, for a while, depending on where you were, a different character might have been displayed for the same ASCII code.

In the end, the other parts of the world began creating their own encoding schemes, and things started to get a little bit confusing. Not only were the encoding schemes of different lengths, but programs also needed to figure out which encoding scheme they were supposed to use.

It became apparent that a new character encoding scheme was needed, which is when the Unicode standard was created. The objective of Unicode is to unify all the different encoding schemes so that the confusion between computers can be limited as much as possible.

These days, the Unicode standard defines values for over 128,000 characters, which can be browsed on the Unicode Consortium website. It has several character encoding forms:

  • UTF-8: Uses only one byte (8 bits) to encode English (ASCII) characters, and a sequence of two to four bytes to encode other characters. UTF-8 is widely used in email systems and on the internet.

  • UTF-16: Uses two bytes (16 bits) to encode the most commonly used characters. If needed, the additional characters can be represented by a pair of 16-bit numbers.

  • UTF-32: Uses four bytes (32 bits) to encode the characters. It became apparent that as the Unicode standard grew, a 16-bit number was too small to represent all the characters. UTF-32 is capable of representing every Unicode character as one number.

Note: UTF stands for Unicode Transformation Format.
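
To make the size differences between the three forms concrete, here is a small sketch that counts the bytes a few sample characters occupy in each one. The class name EncodingSizes and the helper byteCount are made up for this example. It also assumes the runtime provides a charset named "UTF-32BE", which OpenJDK does but which is not one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSizes {
    // Number of bytes the given text occupies in the given charset.
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: a "UTF-32BE" charset is available on this runtime.
        Charset utf32 = Charset.forName("UTF-32BE");
        // Samples: U+0041, U+00E9, U+4E2D, and U+1D160 (written as a surrogate pair).
        String[] samples = {"A", "\u00E9", "\u4E2D", "\uD834\uDD60"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0),
                    byteCount(s, StandardCharsets.UTF_8),
                    byteCount(s, StandardCharsets.UTF_16BE),
                    byteCount(s, utf32));
        }
    }
}
```

The big-endian variants are used so that no byte order mark is added to the counts. UTF-8 grows from one to four bytes across the samples, UTF-16 uses two bytes until the last sample needs four, and UTF-32 always uses four.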

Code Points

A code point is the value that a character is given in the Unicode standard. The values according to Unicode are written as hexadecimal numbers and have a prefix of U+.

For example, to encode the characters we looked at earlier:

  • A is U+0041

  • a is U+0061

  • 1 is U+0031

  • # is U+0023
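
The U+ notation is simply the code point's number printed in hexadecimal, padded to at least four digits. A minimal sketch, with the class name CodePointNotation and the method notation invented for this example:

```java
class CodePointNotation {
    // Format a code point in U+XXXX notation (at least four hex digits).
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'a', '1', '#'}) {
            System.out.println(c + " is " + notation(c));
        }
        // prints, one per line:
        // A is U+0041
        // a is U+0061
        // 1 is U+0031
        // # is U+0023
    }
}
```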

These code points are split into 17 different sections called planes, identified by numbers 0 through 16. Each plane holds 65,536 code points. The first plane, 0, holds the most commonly used characters and is known as the Basic Multilingual Plane (BMP).
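
Because every plane holds exactly 65,536 (0x10000) code points, the plane number can be read off by discarding the low 16 bits of a code point. The sketch below illustrates that arithmetic; the class name Planes and the method planeOf are hypothetical names for this example.

```java
class Planes {
    // Each plane holds 0x10000 (65,536) code points, so the plane number
    // is what remains after dropping the low 16 bits.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        System.out.println(17 * 65_536);                   // prints 1114112
        System.out.println(Character.MAX_CODE_POINT + 1);  // prints 1114112
        System.out.println(planeOf('A'));                  // prints 0 (the BMP)
        System.out.println(planeOf(0x1D160));              // prints 1
        System.out.println(planeOf(0x10FFFF));             // prints 16
    }
}
```

Seventeen planes of 65,536 code points give 1,114,112 possible code points in total, which is why the highest code point is U+10FFFF.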

Code Units

The encoding schemes are made up of code units: fixed-size numbers (8 bits in UTF-8, 16 bits in UTF-16, and 32 bits in UTF-32) that, alone or in sequence, identify a code point and therefore where a character is positioned on a plane.

Consider UTF-16 as an example. Each 16-bit number is a code unit, and one or two code units can be transformed into a code point. For instance, the musical eighth note symbol has a code point of U+1D160 and lives on the second plane of the Unicode standard (plane 1, the Supplementary Multilingual Plane). It is encoded using the combination of the two 16-bit code units 0xD834 and 0xDD60.
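
The arithmetic behind that pair of code units can be sketched as follows. The class SurrogatePairs and its two methods are hypothetical helpers written for this example; in real code the JDK's Character.toChars and Character.toCodePoint do the same job.

```java
import java.util.Arrays;

class SurrogatePairs {
    // Split a supplementary code point (above U+FFFF) into two UTF-16 code units.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    // Reverse the split: combine two code units back into one code point.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] units = toSurrogates(0x1D160);
        System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]); // prints D834 DD60
        System.out.printf("U+%X%n", toCodePoint(units[0], units[1]));     // prints U+1D160
        // The hand-written split agrees with the JDK's own conversion.
        System.out.println(Arrays.equals(units, Character.toChars(0x1D160))); // prints true
    }
}
```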

For the BMP, the values of the code points and code units are identical. This allows a shortcut for UTF-16 that saves a lot of storage space. It only needs to use one 16-bit number to represent those characters.

How Does Java Use Unicode?

Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. Back then, it was felt that 16 bits would be more than enough to encode all the characters that would ever be needed. With that in mind, Java was designed to use UTF-16. The char data type was originally used to represent a 16-bit Unicode code point.

Since Java SE 5.0, a char represents a UTF-16 code unit. It makes little difference for representing characters that are in the Basic Multilingual Plane because the value of the code unit is the same as the code point. However, it does mean that for the characters on the other planes, two chars are needed.

The important thing to remember is that a single char data type can no longer represent all the Unicode characters.
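
In practice this shows up as a gap between the number of chars in a String and the number of characters it holds. The sketch below is one way to see it; the class name CharVersusCodePoint and its two helper methods are invented for this example, while length, codePointCount, and codePoints are standard String methods.

```java
class CharVersusCodePoint {
    // Number of UTF-16 code units (chars) in the text.
    static int codeUnits(String text) {
        return text.length();
    }

    // Number of Unicode code points in the text.
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "A" (in the BMP) followed by U+1D160 (outside the BMP).
        String text = "A" + new String(Character.toChars(0x1D160));

        System.out.println(codeUnits(text));   // prints 3
        System.out.println(codePoints(text));  // prints 2

        // Iterating by code point keeps the surrogate pair together.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        // prints U+0041 then U+1D160
    }
}
```

Looping over the string with charAt would instead visit three chars, two of which are surrogate halves rather than whole characters.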

Translated from: https://www.thoughtco.com/what-is-unicode-2034272
