What's the difference between ASCII and Unicode?

Translated from: What's the difference between ASCII and Unicode?

Can I know the exact difference between Unicode and ASCII?

ASCII has a total of 128 characters (256 in the extended set).

Is there any size specification for Unicode characters?


#1

Reference: https://stackoom.com/question/1Ibzu/ASCII和Unicode有什么区别


#2

ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines fewer than 2^21 characters, which, similarly, map to numbers below 2^21 (though not all numbers are currently assigned, and some are reserved).

Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".
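The overlap can be checked directly. The following is a minimal sketch in Python (my choice of language, not the original answer's); the variable names are made up for illustration.

```python
# ord() gives the Unicode code point of a character; chr() is the inverse.
code_point_of_A = ord("A")

# Decode each of the 128 ASCII byte values and compare the resulting
# character's Unicode code point with the original ASCII number.
ascii_matches_unicode = all(
    ord(bytes([n]).decode("ascii")) == n for n in range(128)
)

print(code_point_of_A)        # 65
print(ascii_matches_unicode)  # True
```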

Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.


#3

ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).

Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, a small number of code points (66) have been made permanent noncharacters (i.e. never used to encode any character), and most code points are not yet assigned.
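The figure 1,114,112 is not arbitrary. As a sketch of the arithmetic, assuming the usual description of the code space as 17 planes of 65,536 positions each (the planes are not mentioned in the answer above), it can be reproduced in Python:

```python
import sys

PLANES = 17            # plane 0 (the BMP) plus 16 supplementary planes
PLANE_SIZE = 2 ** 16   # 65,536 code positions per plane

code_positions = PLANES * PLANE_SIZE
highest_code_point = code_positions - 1

print(f"{code_positions:,}")                 # 1,114,112
print(hex(highest_code_point))               # 0x10ffff
print(sys.maxunicode == highest_code_point)  # True on Python 3.3+
```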

The only things that ASCII and Unicode have in common are:
1) They are character codes.
2) The first 128 code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.

Sometimes, however, Unicode is characterized (even in the Unicode standard!) as "wide ASCII". This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposed to using different codes in different systems and applications and for different languages.

Unicode as such defines only the "logical size" of characters: each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.


#4

ASCII has 128 code points, 0 through 127. It fits in a single 8-bit byte, so the values 128 through 255 tended to be used for other characters, with incompatible choices, causing the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guesses another code page.
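A small sketch of the code page problem, in Python. The two code pages used here (ISO 8859-1 and IBM code page 437) are my own picks for illustration; any two incompatible 8-bit code pages would show the same effect.

```python
# One byte with the high bit set, as it might appear in a legacy text file.
raw = bytes([0xE9])

# The same byte decoded under two different 8-bit code pages.
as_latin1 = raw.decode("latin-1")  # ISO 8859-1
as_cp437 = raw.decode("cp437")     # original IBM PC code page

print(repr(as_latin1))        # 'é'
print(repr(as_cp437))         # a different character entirely
print(as_latin1 == as_cp437)  # False
```

Nothing in the byte itself says which reading is right, which is why a wrong guess produces garbage rather than an error.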

Unicode came about to solve this disaster. Version 1 started out with 65,536 code points, commonly encoded in 16 bits. It was later extended in version 2 to 1.1 million code points. At the time of writing, the current version is 6.3, which uses 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.

Encoding in 16 bits was common when v2 came around, used for example by Microsoft and Apple operating systems and by language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16 bits: an encoding called UTF-16, a variable-length encoding where one code point takes either 2 or 4 bytes. The original v1 code points take 2 bytes; the added ones take 4.
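How a code point beyond the original 16-bit range becomes two 16-bit units can be sketched as follows. The function name `utf16_code_units` is hypothetical, and the sketch skips validation (it does not reject the reserved surrogate range itself); the result is cross-checked against Python's built-in encoder.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if code_point < 0x10000:
        # Original 16-bit range: one code unit (2 bytes).
        return [code_point]
    # Supplementary range: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]

for ch in ("A", "\u20ac", "\U0001F600"):
    units = utf16_code_units(ord(ch))
    encoded = b"".join(u.to_bytes(2, "big") for u in units)
    assert encoded == ch.encode("utf-16-be")  # agrees with the built-in codec
    print(f"U+{ord(ch):04X} -> {[hex(u) for u in units]} ({len(encoded)} bytes)")
```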

Another very common variable-length encoding, used in *nix operating systems and tools, is UTF-8. A code point takes between 1 and 4 bytes: the original ASCII codes take 1 byte, the rest take more. The only fixed-length encoding is UTF-32, which takes 4 bytes per code point. It is not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, which are widely ignored.
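The 1-to-4-byte rule follows from how UTF-8 packs a code point's bits into bytes. Below is a hand-written sketch of that bit layout; `utf8_encode` is a hypothetical helper for illustration only (real code should simply call `str.encode("utf-8")`), and it does no validation of the surrogate range.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 (no surrogate-range validation)."""
    if code_point < 0x80:        # up to 7 bits: same byte as ASCII
        return bytes([code_point])
    if code_point < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")  # matches the built-in
    print(f"U+{ord(ch):04X}: {len(utf8_encode(ord(ch)))} byte(s)")
```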

An issue with the UTF-16/32 encodings is that the order of the bytes depends on the endianness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
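A short Python sketch of what the byte-order difference looks like for a single character, and what happens when a reader assumes the wrong order:

```python
text = "A"  # U+0041

big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

print(big_endian.hex())     # 0041
print(little_endian.hex())  # 4100

# Decoding with the wrong byte order silently yields a different character.
wrong = big_endian.decode("utf-16-le")
print(f"U+{ord(wrong):04X}")  # U+4100
```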

Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers about which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special code point (U+FEFF, zero width no-break space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianness and is neutral to a text rendering engine. Unfortunately it is optional, and many programmers claim their right to omit it, so accidents are still pretty common.
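Reading a BOM can be sketched with the constants in Python's standard `codecs` module. The helper name `sniff_bom` and its return shape are made up for this example; it only recognizes streams that actually start with a BOM and makes no guess otherwise.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding)                        # utf-16-le
print(sample[skip:].decode(encoding))  # hi
```

The ordering of the table matters: testing the shorter UTF-16 marks first would misreport a UTF-32-LE stream as UTF-16-LE.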


#5

ASCII defines 128 characters, whereas Unicode contains a repertoire of more than 120,000 characters.


#6

Understanding why ASCII and Unicode were created in the first place helped me understand the differences between the two.

ASCII, Origins

As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*, which means that we can represent at most 128 characters.

Wait, 7 bits? But why not 1 byte (8 bits)?

The last (8th) bit was used as a parity bit to detect errors. This was relevant years ago.
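To make the parity idea concrete, here is a minimal sketch in Python. It assumes even parity (the answer does not say whether even or odd parity was meant), and the function names are hypothetical.

```python
def add_even_parity(code: int) -> int:
    """Return an 8-bit value: 7-bit ASCII code plus an even-parity top bit."""
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    parity = ones % 2          # 1 if the count of 1-bits is odd
    return (parity << 7) | code

def parity_ok(byte: int) -> bool:
    """True if the 8-bit value contains an even number of 1-bits."""
    return bin(byte).count("1") % 2 == 0

sent = add_even_parity(ord("C"))  # 'C' = 1000011 has three 1-bits
print(format(sent, "08b"))        # 11000011
print(parity_ok(sent))            # True
print(parity_ok(sent ^ 0b100))    # False: a single flipped bit is detected
```

A parity bit catches any single-bit error but cannot say which bit flipped, and two flipped bits cancel out.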

Most ASCII characters are printable characters such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.

See below the binary representation of a few characters in ASCII:

0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)
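The rows above can be reproduced with a few lines of Python, shown here as an illustrative sketch; `rows` is just a name chosen for this example.

```python
rows = []
for ch in ("%", "A", "B", "C", "\r"):
    code = ord(ch)
    bits = format(code, "07b")  # zero-padded to seven binary digits
    rows.append((bits, code))
    print(f"{bits} -> {ch!r} ({code})")

# Going the other way: parse the bit string and look up the character.
print(chr(int("1000001", 2)))  # A
```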

See the full ASCII table over here.

ASCII was meant for English only.

What? Why English only? So many languages out there!

Because the center of the computer industry was in the USA at that time. As a consequence, they didn't need to support accents or other marks such as á, ü, ç, ñ, etc. (aka diacritics).

ASCII Extended

Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é" in French, for example). Just using one extra bit doubled the size of the original ASCII table, mapping up to 256 characters (2^8 = 256) instead of 2^7 (128) as before.

10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
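The numbers 130 and 160 depend on which 8-bit table is used. As far as I can tell they correspond to IBM code page 437 (that attribution is mine, not the answer's); in ISO 8859-1 the same letters sit at other positions. A quick Python check:

```python
# Decode the two byte values from the table above using code page 437.
decoded = {value: bytes([value]).decode("cp437") for value in (130, 160)}
for value, ch in decoded.items():
    print(format(value, "08b"), "->", ch)

# In ISO 8859-1 (Latin-1) the same letters have different numbers.
latin1_e_acute = "\u00e9".encode("latin-1")[0]
latin1_a_acute = "\u00e1".encode("latin-1")[0]
print(latin1_e_acute)  # 233
print(latin1_a_acute)  # 225
```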

This "ASCII extended to 8 bits instead of 7" is often referred to simply as "extended ASCII" or "8-bit ASCII".

As @Tom pointed out in his comment, there is no such thing as "extended ASCII", yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example ISO 8859-1, also called ISO Latin-1.

Unicode, The Rise

ASCII Extended solves the problem for languages that are based on the Latin alphabet... but what about the others that need a completely different alphabet? Greek? Russian? Chinese and the like?

We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic number of characters (see this table).

You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.

Encodings: UTF-8 vs UTF-16 vs UTF-32

This answer does a pretty good job at explaining the basics:

  • UTF-8 and UTF-16 are variable-length encodings.
  • In UTF-8, a character occupies a minimum of 8 bits.
  • In UTF-16, a character occupies a minimum of 16 bits.
  • UTF-32 is a fixed-length encoding of 32 bits.

UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.
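The compatibility only goes one way, which a short Python sketch makes visible (the sample strings are arbitrary):

```python
text = "Hello, ASCII!"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)    # True: byte-for-byte identical
print(ascii_bytes.decode("utf-8"))  # Hello, ASCII!

# The reverse does not hold once a non-ASCII character appears.
try:
    "caf\u00e9".encode("utf-8").decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError as exc:
    decoded_as_ascii = False
    print("not ASCII:", exc.reason)
```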

Mnemonics:

  • UTF-8: minimum 8 bits.
  • UTF-16: minimum 16 bits.
  • UTF-32: minimum and maximum 32 bits.
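The mnemonics can be checked by measuring a few characters in each encoding. This is a sketch with arbitrarily chosen sample characters; `bits` is a hypothetical helper, and the `-le` codec variants are used so that no byte order mark is counted.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def bits(ch: str, encoding: str) -> int:
    """Number of bits one character occupies in the given encoding."""
    return len(ch.encode(encoding)) * 8

for ch in samples:
    print(f"U+{ord(ch):04X}",
          bits(ch, "utf-8"), bits(ch, "utf-16-le"), bits(ch, "utf-32-le"))
```

Only the UTF-32 column stays constant; the other two grow as the code point gets larger.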

Note:

Why 2^7?

This is obvious for some, but just in case. We have seven slots available, each filled with either 0 or 1 (binary code). Each slot can take two values. If we have seven slots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
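For anyone who prefers to count rather than multiply, the seven-wheel lock can be enumerated directly. A small sketch using Python's standard `itertools.product`:

```python
from itertools import product

# Every way of filling seven slots with 0 or 1.
combinations = ["".join(slots) for slots in product("01", repeat=7)]

print(len(combinations))  # 128
print(combinations[0])    # 0000000
print(combinations[65])   # 1000001, the code for 'A'
print(combinations[-1])   # 1111111
```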

Source: Wikipedia, this great blog post, and Mocki, where I initially posted this summary.
