What do character set and collation mean exactly?

This article is translated from: What does character set and collation mean exactly?

I can read the MySQL documentation and it's pretty clear. But how does one decide which character set to use? On what data does collation have an effect?

I'm asking for an explanation of the two and how to choose them.


#1

Reference: https://stackoom.com/question/1QmP/字符集和整理意味着什么


#2

From the MySQL docs:

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.

Now, suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. Now, what we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): "compare the encodings." We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.
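The imaginary four-letter character set can be sketched in a few lines of Python. This is only an illustration of the two rule sets described above; the names `TOY_CHARSET`, `binary_key`, and `case_insensitive_key` are made up for this sketch and have nothing to do with how MySQL implements collations.

```python
# Toy character set from the example: each symbol maps to its encoding.
TOY_CHARSET = {"A": 0, "B": 1, "a": 2, "b": 3}

def binary_key(text):
    """Binary collation: compare strings by their raw encodings."""
    return [TOY_CHARSET[ch] for ch in text]

def case_insensitive_key(text):
    """Case-insensitive collation: (1) fold 'a'/'b' to 'A'/'B', (2) compare encodings."""
    return [TOY_CHARSET[ch.upper()] for ch in text]

words = ["b", "A", "a", "B"]

# Binary collation orders strictly by encoding: 0, 1, 2, 3.
binary_sorted = sorted(words, key=binary_key)
print(binary_sorted)   # ['A', 'B', 'a', 'b']

# Case-insensitive collation treats 'A'/'a' and 'B'/'b' as ties;
# Python's stable sort keeps tied items in their input order.
folded_sorted = sorted(words, key=case_insensitive_key)
print(folded_sorted)   # ['A', 'a', 'b', 'B']
```

The same data sorts differently depending on which key function is used, and that is all "choosing a collation" amounts to in this toy model.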

In real life, most character sets have many characters: not just 'A' and 'B' but whole alphabets, sometimes multiple alphabets or eastern writing systems with thousands of characters, along with many special symbols and punctuation marks. Also in real life, most collations have many rules: not just case insensitivity but also accent insensitivity (an "accent" is a mark attached to a character as in German 'ö') and multiple-character mappings (such as the rule that 'ö' = 'OE' in one of the two German collations).
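A multiple-character mapping such as 'ö' = 'OE' can be approximated with a sort key that expands certain letters before comparing. The sketch below is a rough stand-in, assuming precomposed (NFC) input; the `EXPANSIONS` table and `phonebook_key` function are hypothetical and cover only a handful of letters, so this is not the actual rule set of any MySQL collation.

```python
# Hypothetical expansion table, loosely modelled on a German "phone book" ordering.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def phonebook_key(text):
    """Fold case, then expand each mapped letter into its two-letter equivalent."""
    folded = text.lower()
    return "".join(EXPANSIONS.get(ch, ch) for ch in folded)

names = ["Gz", "Göthe", "Gof"]

# Plain code-point order puts 'ö' (U+00F6) after 'z'.
codepoint_sorted = sorted(names)
print(codepoint_sorted)   # ['Gof', 'Gz', 'Göthe']

# With the expansion, 'Göthe' compares as 'goethe' and sorts before 'gof'.
phonebook_sorted = sorted(names, key=phonebook_key)
print(phonebook_sorted)   # ['Göthe', 'Gof', 'Gz']

# Under this key the two spellings are indistinguishable.
print(phonebook_key("Göthe") == phonebook_key("Goethe"))   # True
```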


#3

A character encoding is a way to encode characters so that they fit in memory. That is, if the charset is ISO-8859-15, the euro symbol, €, will be encoded as 0xa4, and in UTF-8, it will be 0xe282ac.
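The two byte sequences are easy to check with Python's built-in codecs. This is just a quick verification sketch; the variable names are arbitrary.

```python
euro = "€"

# Same character, two different encodings.
latin9_bytes = euro.encode("iso-8859-15")
utf8_bytes = euro.encode("utf-8")

print(latin9_bytes.hex())   # a4
print(utf8_bytes.hex())     # e282ac

# Decoding with the matching encoding gives the character back.
print(latin9_bytes.decode("iso-8859-15") == utf8_bytes.decode("utf-8"))   # True
```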

The collation defines how characters are compared. In latin9 there are letters such as e é è ê f. Sorted by their binary representation, they come out as e f è é ê. If the collation is set to, for example, French, you get them in the order you would expect: e é è ê all compare as equal, and then comes f.
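The difference between the two orderings can be reproduced in Python. The accent-insensitive key below simply strips combining marks after Unicode decomposition; that is an assumption made for illustration and only approximates what a real French collation does.

```python
import unicodedata

letters = ["e", "é", "è", "ê", "f"]

def binary_key(ch):
    """Compare by the byte value in ISO-8859-15 (latin9)."""
    return ch.encode("iso-8859-15")

def accent_insensitive_key(ch):
    """Decompose the character and drop combining accents, so 'é' compares as 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Byte order: e=0x65, f=0x66, è=0xE8, é=0xE9, ê=0xEA.
binary_order = sorted(letters, key=binary_key)
print(binary_order)   # ['e', 'f', 'è', 'é', 'ê']

# All four e-variants tie (stable sort keeps their input order), then f.
accent_order = sorted(letters, key=accent_insensitive_key)
print(accent_order)   # ['e', 'é', 'è', 'ê', 'f']
```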


#4

A character set is a subset of all written glyphs. A character encoding specifies how those characters are mapped to numeric values. Some character encodings, like UTF-8 and UTF-16, can encode any character in the Universal Character Set. Others, like US-ASCII or ISO-8859-1, can only encode a small subset, since they use 7 and 8 bits per character, respectively. Because many standards specify both a character set and a character encoding, the term "character set" is often substituted freely for "character encoding".
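To see which encodings cover which characters, one can simply try to encode a few samples. The helper `can_encode` and the sample list are made up for this sketch.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = ["A", "é", "€", "漢"]

coverage = {
    enc: [can_encode(s, enc) for s in samples]
    for enc in ("ascii", "iso-8859-1", "utf-8", "utf-16")
}

for enc, flags in coverage.items():
    print(enc, flags)
# ascii [True, False, False, False]
# iso-8859-1 [True, True, False, False]
# utf-8 [True, True, True, True]
# utf-16 [True, True, True, True]
```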

A collation comprises rules that specify how characters can be compared for sorting. Collation rules can be locale-specific: the proper order of two characters varies from language to language.

Choosing a character set and collation comes down to whether your application is internationalized or not. If not, what locale are you targeting?

In order to choose what character set you want to support, you have to consider your application. If you are storing user-supplied input, it might be hard to foresee all the locales in which your software will eventually be used. To support them all, it might be best to support the UCS (Unicode) from the start. However, there is a cost to this: many western European characters will now require two bytes of storage per character instead of one.
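The storage cost is easy to measure. The sketch below assumes the Unicode encoding in question is UTF-8 and uses an arbitrary sample string.

```python
text = "déjà vu"   # 7 characters, 2 of them accented

latin1_size = len(text.encode("iso-8859-1"))
utf8_size = len(text.encode("utf-8"))

print(len(text))     # 7
print(latin1_size)   # 7  (one byte per character)
print(utf8_size)     # 9  (each accented letter takes two bytes)
```

Plain ASCII letters still take one byte each in UTF-8, so the overhead depends on how much of the text is accented.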

Choosing the right collation can help performance if your database uses the collation to create an index, and later uses that index to provide sorted results. However, since collation rules are often locale-specific, that index will be worthless if you need to sort results according to the rules of another locale.
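As a small stand-in for the database case, the sketch below uses SQLite through Python's `sqlite3` module rather than MySQL, so the collation names (`BINARY`, `NOCASE`) are SQLite's, and the table and index names are invented. It builds an index under one collation and then sorts under two different ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people (name) VALUES (?)",
    [("b",), ("A",), ("a",), ("B",)],
)

# The index is built in case-insensitive order, so it can only help
# queries that ask for that same ordering.
conn.execute("CREATE INDEX idx_people_nocase ON people (name COLLATE NOCASE)")

binary_rows = [
    row[0]
    for row in conn.execute("SELECT name FROM people ORDER BY name COLLATE BINARY")
]
nocase_rows = [
    row[0]
    for row in conn.execute("SELECT name FROM people ORDER BY name COLLATE NOCASE")
]

print(binary_rows)   # ['A', 'B', 'a', 'b']
# Under NOCASE, 'A'/'a' and 'B'/'b' are ties, so only the folded order is fixed.
print([name.lower() for name in nocase_rows])   # ['a', 'a', 'b', 'b']
conn.close()
```

Whether the planner actually uses the index for a given query can be inspected with SQLite's `EXPLAIN QUERY PLAN`; MySQL has its own `EXPLAIN` for the same purpose.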


#5

I recommend utf8mb4_unicode_ci, which sorts and compares based on the Unicode standard and can sort accurately in many languages.
