文章直通车:
utf8mb4 和 utf8
utf8mb4排序规则
一、先了解下 utf8mb4 和 utf8
参考MySQL文档:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8: An alias for utf8mb3.
Note
The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
UTF-8是使用1~4个字节,一种变长的编码格式。(字符编码 )
mb4即 most bytes 4,使用4个字节来表示完整的UTF-8。而MySQL中的utf8是utfmb3,只有三个字节,节省空间但不能表达全部的UTF-8(比如emoji表情),只能支持“基本多文种平面”(Basic Multilingual Plane,BMP)。
所以推荐使用utf8mb4。
二、utf8mb4排序规则:utf8mb4_unicode_ci、utf8mb4_general_ci、utf8mb4_bin
utf8mb4_unicode_ci 和 utf8mb4_general_ci 的对比:
To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect of this in comparisons or searches, see Section 10.8.6, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true for utf8_general_ci:
ß = s
Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):
ß = ss
MySQL implements utf8 language-specific collations if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8 collations.
utf8_general_ci also is satisfactory for both German and French, except that ß is equal to s, and not to ss. If this is acceptable for your application, you should use utf8_general_ci because it is faster. If this is not acceptable (for example, if you require German dictionary order), use utf8_unicode_ci because it is more accurate.
utf8mb4_general_ci, utf8mb4_unicode_ci:ci即case insensitive,不区分大小写。
准确性:
utf8mb4_unicode_ci 是基于标准的Unicode来排序和比较,能够在各种语言之间精确排序
utf8mb4_general_ci 没有实现Unicode排序规则,在遇到某些特殊语言或者字符集,排序结果可能不一致。但是,在绝大多数情况下,这些特殊字符的顺序并不需要那么精确
性能:
utf8mb4_general_ci 在比较和排序的时候更快
utf8mb4_unicode_ci 在特殊情况下,Unicode排序规则为了能够处理特殊字符的情况,实现了略微复杂的排序算法。但是在绝大多数情况下发,不会发生此类复杂比较。相比选择哪一种collation,使用者更应该关心字符集与排序规则在db里需要统一。
utf8mb4_bin:将字符串每个字符用二进制数据编译存储,区分大小写,而且可以存二进制的内容。
参考:
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
https://dev.mysql.com/doc/refman/8.0/en/charset-collation-effect.html