What is the difference between UTF-8 and ISO-8859-1?

This article is translated from: What is the difference between UTF-8 and ISO-8859-1?

What is the difference between UTF-8 and ISO-8859-1?


#1

Reference: https://stackoom.com/question/TZhR/UTF-和ISO-有什么区别


#2

ISO-8859-1 is a legacy standard from the 1980s. It can represent only 256 characters, so it is suitable only for some Western languages. Even for many supported languages, some characters are missing. If you create a text file in this encoding and try to copy/paste some Chinese characters into it, you will see weird results. So in other words, don't use it. Unicode has taken over the world and UTF-8 is pretty much the standard these days, unless you have some legacy reasons (like HTTP headers, which need to be compatible with everything).
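The "weird results" with Chinese text can be reproduced directly. This is a minimal Python sketch; the sample string and the variable names are assumptions chosen for illustration, not part of the original answer.

```python
# Sample text is an assumption for illustration: two Chinese characters.
text = "中文"

# ISO-8859-1 has no byte for these characters, so strict encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# With errors="replace", each unsupported character silently becomes "?".
lossy = text.encode("iso-8859-1", errors="replace")

# UTF-8 can represent the same text without loss.
utf8 = text.encode("utf-8")

print(latin1_ok, lossy, utf8)
```

Editors that save in ISO-8859-1 typically take the lossy route, which is roughly where the question marks or garbage characters come from.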


#3

UTF

UTF is a family of multi-byte encoding schemes that represent Unicode code points. The original design left room for up to 2^31 [roughly 2 billion] code points, although Unicode is now capped at U+10FFFF. UTF-8 is a flexible encoding system that uses between 1 and 4 bytes per code point, which gives room for 2^21 [roughly 2 million] values and so covers that whole range.

Long story short: any character with a code point/ordinal of 127 or below, aka 7-bit-safe ASCII, is represented by the same 1-byte sequence as in most other single-byte encodings. Any character with a code point above 127 is represented by a sequence of two or more bytes, with the particulars of the encoding best explained here.
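The 1-to-4-byte behaviour is easy to observe. This is a small Python sketch; the four sample characters are assumptions picked to hit each sequence length.

```python
# One sample character per UTF-8 sequence length (chosen for illustration).
samples = ["A", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, how many bytes UTF-8 needs, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

# Number of bytes UTF-8 uses for each sample, in order.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# A 7-bit ASCII character is the same single byte in ASCII and in UTF-8.
ascii_matches_utf8 = "A".encode("ascii") == "A".encode("utf-8")
```

Only the first sample stays at one byte; everything above code point 127 grows to two, three, or four bytes.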

ISO-8859

ISO-8859 is a family of single-byte encoding schemes used to represent alphabets that fit within the range of 128 to 255. These various alphabets are defined as "parts" in the format ISO-8859-n, the most familiar of these likely being ISO-8859-1, aka 'Latin-1'. As with UTF-8, 7-bit-safe ASCII remains unaffected regardless of the encoding family used.

The drawback to this encoding scheme is its inability to accommodate languages composed of more than 128 symbols, or to safely display more than one family of symbols at one time. As well, ISO-8859 encodings have fallen out of favor with the rise of UTF. The ISO "Working Group" in charge of it disbanded in 2004, leaving maintenance up to its parent subcommittee.
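To make the "one family of symbols at a time" limitation concrete, the sketch below decodes the same byte under three ISO-8859 parts. The byte value and the choice of parts are assumptions for illustration; the codec names are the ones Python's standard library ships with.

```python
# A single high byte; its meaning depends entirely on which part is assumed.
raw = b"\xe9"

# Three parts of the family: Latin-1, Cyrillic, Greek.
parts = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]

# Same byte, three different characters.
decoded = {name: raw.decode(name) for name in parts}

# Bytes in the 7-bit ASCII range decode identically under every part.
ascii_same = {name: b"abc".decode(name) for name in parts}

for name in parts:
    print(name, repr(decoded[name]), repr(ascii_same[name]))
```

Nothing in the byte stream itself says which part applies, so a file mixing, say, French and Russian text cannot be represented in any single part.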


#4

My reason for researching this question was to find out in what way they are compatible. The Latin1 charset (iso-8859) is 100% compatible with being stored in a utf8 datastore. All ascii chars will be stored as a single byte; the extended-ascii chars (128 to 255) will be stored as two bytes each.

Going the other way, from utf8 to the Latin1 charset, may or may not work. If there are any chars beyond extended-ascii 255 (code points above U+00FF), they cannot be stored in a Latin1 datastore.
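Both directions can be checked in a few lines. This is a hedged Python sketch: `fits_in_latin1` is a hypothetical helper introduced here, not something from any database API.

```python
def fits_in_latin1(text: str) -> bool:
    """Return True if every character of text can be encoded as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


# Latin1 -> utf8: take every possible Latin-1 byte and re-encode it as UTF-8.
all_latin1 = bytes(range(256)).decode("iso-8859-1")
as_utf8 = all_latin1.encode("utf-8")

# Converting back recovers the original bytes, so this direction is lossless.
round_trip = as_utf8.decode("utf-8").encode("iso-8859-1")

# utf8 -> Latin1: works only while every code point is 255 or below.
print(len(as_utf8), fits_in_latin1("café"), fits_in_latin1("€"))
```

The UTF-8 form is longer than 256 bytes because the upper 128 characters take two bytes each.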


#5

From another perspective, files that both the unicode and ascii encodings fail to read because they have a byte 0xc0 in them seem to get read properly by iso-8859-1. That is because 0xc0 is outside ASCII's 7-bit range and is never valid in UTF-8, whereas ISO-8859-1 assigns a character to every byte value. The caveat, of course, is that the file shouldn't actually contain unicode (multi-byte) characters.
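A short Python sketch of that behaviour follows. The byte string stands in for hypothetical file contents, and `try_decode` is a helper name made up for this example.

```python
# Hypothetical file contents containing the problematic byte 0xc0.
data = b"caf\xc0"


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII and UTF-8 both reject 0xc0; ISO-8859-1 accepts every byte value.
results = {enc: try_decode(data, enc) for enc in ("ascii", "utf-8", "iso-8859-1")}
print(results)

# The caveat: real UTF-8 data also "decodes" as ISO-8859-1, but wrongly.
mojibake = "©".encode("utf-8").decode("iso-8859-1")
print(mojibake)
```

Because ISO-8859-1 never raises a decoding error, a successful read says nothing about whether the file was really written in that encoding.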


#6

  • ASCII: 7 bits. 128 code points.

  • ISO-8859-1: 8 bits. 256 code points.

  • UTF-8: 8-32 bits (1-4 bytes). 1,112,064 code points.

Both ISO-8859-1 and UTF-8 are backwards compatible with ASCII, but UTF-8 is not backwards compatible with ISO-8859-1:

#!/usr/bin/env python3

c = chr(0xa9)
print(c)
print(c.encode('utf-8'))
print(c.encode('iso-8859-1'))

Output:

©
b'\xc2\xa9'
b'\xa9'
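The 1,112,064 figure for UTF-8 in the list above can be derived rather than memorized: it is the full Unicode range minus the surrogate block, which UTF-8 is not allowed to encode. A minimal sketch of that arithmetic, with constant names chosen here for illustration:

```python
# Unicode code points run from U+0000 to U+10FFFF inclusive.
TOTAL = 0x10FFFF + 1

# U+D800..U+DFFF are surrogates, reserved for UTF-16 and excluded from UTF-8.
SURROGATES = 0xDFFF - 0xD800 + 1

# What remains is the set of code points UTF-8 can encode.
encodable = TOTAL - SURROGATES
print(TOTAL, SURROGATES, encodable)

# Python's strict UTF-8 codec enforces the exclusion.
try:
    chr(0xD800).encode("utf-8")
    surrogate_encodes = True
except UnicodeEncodeError:
    surrogate_encodes = False
```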