MySQL utf8mb4 Collations

Latest recommended article published on 2024-09-05 11:37:12

world_ding

Category: Technical Exchange. Tags: MySQL, utf8mb4, utf8mb4_unicode_ci, utf8mb4_general_ci

Article link: https://blog.csdn.net/world_ding/article/details/96447413

Contents:

utf8mb4 and utf8
utf8mb4 collations

1. First, utf8mb4 and utf8

From the MySQL documentation:

utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8: An alias for utf8mb3.

Note

The utf8mb3 character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

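UTF-8 is a variable-length character encoding that uses 1 to 4 bytes per character.

mb4 is read here as "most bytes 4": at most 4 bytes per character, which is enough to represent all of UTF-8. The utf8 character set in MySQL is actually utf8mb3, which uses at most 3 bytes per character. That can save space, but it cannot represent every UTF-8 character (emoji, for example); it only covers the Basic Multilingual Plane (BMP).

utf8mb4 is therefore the recommended choice.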
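The byte counts above can be checked outside MySQL. The following is a minimal Python sketch, not MySQL code; the helper names `utf8_length` and `fits_utf8mb3` are made up for this illustration. It assumes that "fits in utf8mb3" means "every code point is in the BMP (at most U+FFFF)", which matches the description above.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_utf8mb3(text: str) -> bool:
    """True if every character is in the BMP (code point <= U+FFFF),
    i.e. needs at most 3 bytes in UTF-8. Hypothetical helper for illustration."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# One sample per UTF-8 length: ASCII, Latin with accent, CJK, emoji.
samples = ["A", "é", "中", "😀"]
for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), fits utf8mb3: {fits_utf8mb3(ch)}")
```

Only the emoji needs 4 bytes, so it is the one sample that a utf8mb3 column cannot hold.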

2. utf8mb4 collations: utf8mb4_unicode_ci, utf8mb4_general_ci, utf8mb4_bin

utf8mb4_unicode_ci compared with utf8mb4_general_ci:

To further illustrate, the following equalities hold in both utf8_general_ci and utf8_unicode_ci (for the effect of this in comparisons or searches, see Section 10.8.6, “Examples of the Effect of Collation”):

Ä = A

Ö = O

Ü = U

A difference between the collations is that this is true for utf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):

ß = ss

MySQL implements utf8 language-specific collations if the ordering with utf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8 collations.

utf8_general_ci also is satisfactory for both German and French, except that ß is equal to s, and not to ss. If this is acceptable for your application, you should use utf8_general_ci because it is faster. If this is not acceptable (for example, if you require German dictionary order), use utf8_unicode_ci because it is more accurate.

In utf8mb4_general_ci and utf8mb4_unicode_ci, the ci suffix stands for case-insensitive: comparisons ignore letter case.
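The equalities quoted above (Ä = A in both collations, ß = s versus ß = ss) can be imitated with a toy model in Python. This is only a sketch of the idea, not MySQL's actual algorithm or weight tables: `general_like_key` assumes a strictly one-character-to-one-character mapping, while `unicode_like_key` allows one character to expand into several. Both function names are invented for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters and drop combining marks (Ä -> A)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def general_like_key(text: str) -> str:
    """Toy model of a general_ci-style comparison key: each input
    character maps to exactly one output character, so ß becomes S."""
    out = []
    for ch in text:
        if ch == "ß":
            out.append("S")  # one-to-one mapping: ß compares equal to s
        else:
            out.append(strip_accents(ch).upper()[:1] or ch)
    return "".join(out)


def unicode_like_key(text: str) -> str:
    """Toy model of a unicode_ci-style comparison key: expansions are
    allowed, and str.casefold() expands ß to ss."""
    return strip_accents(text.casefold()).upper()


print(general_like_key("Ä") == general_like_key("a"))    # case and accent ignored
print(general_like_key("ß") == general_like_key("s"))    # ß = s
print(unicode_like_key("ß") == unicode_like_key("ss"))   # ß = ss
```

The one-to-one restriction is what keeps the first model simple, and it is also why it cannot treat ß as ss.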

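Accuracy:

utf8mb4_unicode_ci sorts and compares according to the Unicode standard, so it orders text accurately across languages.
utf8mb4_general_ci does not implement the Unicode collation rules, so for some languages or characters its ordering may not be what you expect. In the vast majority of cases, however, the order of those special characters does not need to be that precise.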

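Performance:

utf8mb4_general_ci is faster at comparing and sorting.
utf8mb4_unicode_ci uses a somewhat more complex algorithm so that it can handle special characters, but in the vast majority of cases those complex comparisons never come up. Rather than worrying about which collation to choose, it matters more that the character set and collation are consistent across the database.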
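One way to check that consistency is to list every text column whose character set or collation differs from the one you settled on. The sketch below is Python, not MySQL code. It assumes the rows were already fetched from `information_schema.COLUMNS` (the query is shown in a comment); the function name, the sample rows, and the choice of utf8mb4_unicode_ci as the target are all hypothetical.

```python
# Rows are assumed to come from a query such as:
#   SELECT TABLE_NAME, COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
#   FROM information_schema.COLUMNS
#   WHERE TABLE_SCHEMA = 'your_database';
# Non-text columns have NULL (None) in the last two fields.


def find_collation_mismatches(rows,
                              expected_charset="utf8mb4",
                              expected_collation="utf8mb4_unicode_ci"):
    """Return the text columns that do not use the expected charset/collation."""
    mismatches = []
    for table, column, charset, collation in rows:
        if charset is None:
            continue  # numeric, date, etc.: no character set to compare
        if charset != expected_charset or collation != expected_collation:
            mismatches.append((table, column, charset, collation))
    return mismatches


# Hypothetical sample data for illustration only.
sample_rows = [
    ("users", "id", None, None),
    ("users", "name", "utf8mb4", "utf8mb4_unicode_ci"),
    ("users", "email", "utf8mb4", "utf8mb4_general_ci"),
    ("orders", "note", "utf8mb3", "utf8mb3_general_ci"),
]

for table, column, charset, collation in find_collation_mismatches(sample_rows):
    print(f"{table}.{column}: {charset} / {collation}")
```

With the sample data, `users.email` and `orders.note` are reported, since they are the columns that would be compared under different rules from the rest.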

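utf8mb4_bin: compares strings by the binary value of each character, so it is case-sensitive. It is still a character collation: the column holds utf8mb4 text, and arbitrary binary data belongs in a binary type such as VARBINARY or BLOB.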
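The effect of a binary comparison can be approximated in Python by sorting on the encoded bytes, next to a case-insensitive sort for contrast. This is an analogy under the assumption that comparing UTF-8 bytes gives the same order as comparing character values; it does not call MySQL, and the word list is made up.

```python
words = ["banana", "Apple", "apple", "Banana"]

# Binary-style order: compare the raw UTF-8 bytes, so every uppercase
# ASCII letter (0x41-0x5A) sorts before every lowercase one (0x61-0x7A).
bin_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Case-insensitive order: fold case before comparing, so "Apple" and
# "apple" are equal and keep their original relative order.
ci_order = sorted(words, key=str.casefold)

print(bin_order)  # ['Apple', 'Banana', 'apple', 'banana']
print(ci_order)   # ['Apple', 'apple', 'banana', 'Banana']

# Equality differs in the same way.
print("a".encode("utf-8") == "A".encode("utf-8"))  # False
print("a".casefold() == "A".casefold())            # True
```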

References: