mysql兼容emoj_mysql 兼容 emoji

How to support full Unicode in MySQL databases

Published 30th July 2012 · tagged

with

Alternative title: The things we do to store

U+01F4A9 PILE OF POO (💩)

correctly.

Are you using MySQL’s utf8 charset

in your databases? In this write-up I’ll explain why you should

switch toutf8mb4 instead,

and how to do it.

UTF-8

The UTF-8 encoding can represent every symbol

in the Unicode character set, which ranges from U+000000 to

U+10FFFF. That’s 1,114,111 possible symbols. (Not all of these

Unicode code points have been assigned characters yet, but that

doesn’t stop UTF-8 from being able to encode them.)

UTF-8 is a variable-width encoding; it encodes each symbol using

one to four 8-bit bytes. Symbols with lower numerical code point

values are encoded using fewer bytes. This way, UTF-8 is optimized

for the common case where ASCII characters and

other BMP

symbols (whose code points range from U+000000

to U+00FFFF) are used — while still allowing astral symbols (whose

code points range from U+010000 to U+10FFFF) to be stored.

MySQL’s utf8

For a long time, I was using MySQL’s utf8 charset

for databases, tables, and columns, assuming it mapped to the UTF-8

encoding described above. By using utf8, I’d be able to

store any symbol I want in my database — or so I thought.

While writing about JavaScript’s

internal character encoding, I noticed that there was no way to

insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆) symbol into the

MySQL database behind this site. The column I was trying to update

had the utf8_unicode_ci collation,

and the connection charset was set to utf8.

mysql> SET NAMES utf8; # just to emphasize that the connection charset is set to `utf8`

Query OK, 0 rows affected (0.00 sec)

mysql> UPDATE database_name.table_name SET column_name = 'foo𝌆bar' WHERE id = 9001;

Query OK, 1 row affected, 1 warning (0.00 sec)

Rows matched: 1 Changed: 1 Warnings: 1

mysql> SELECT column_name FROM database_name.table_name WHERE id = 9001;

+-------------+

| column_name |

+-------------+

| foo |

+-------------+

1 row in set (0.00 sec)

The content got truncated at the first astral

Unicode symbol, in this case 𝌆 —

so, attempting to insertfoo𝌆bar actually

inserted foo instead,

resulting in data loss. MySQL returned a warning message, too:

mysql> SHOW WARNINGS;

+---------+------+------------------------------------------------------------------------------+

| Level | Code | Message |

+---------+------+------------------------------------------------------------------------------+

| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 |

+---------+------+------------------------------------------------------------------------------+

1 row in set (0.00 sec)

Turns out MySQL’s utf8 charset

only partially implements proper UTF-8 encoding. It can only store

UTF-8-encoded symbols that consist of one to three bytes; encoded

symbols that take up four bytes aren’t supported.

Since astral symbols (whose code points range from U+010000 to

U+10FFFF) each consist of four bytes in UTF-8, you cannot store

them using MySQL’s utf8 implementation.

This doesn’t just affect the 𝌆 character,

but more important symbols like U+01F4A9 PILE OF POO (💩) as well. In

total, that’s 1,048,575 possible code points you can’t use. In

fact, MySQL’s utf8 only

allows you to store 5.88% (0x00FFFF/0x10FFFF) of

all possible Unicode code points. Proper UTF-8 can encode 100% of

all Unicode code points.

TL;DR MySQL’s utf8 encoding

is awkwardly named, as it’s different from proper UTF-8 encoding.

It doesn’t offer full Unicode support.

MySQL’s utf8mb4

Luckily, MySQL

5.5.3 (released in early 2010) introduced a

new encoding called utf8mb4 which

maps to proper UTF-8 and thus fully supports Unicode, including

astral symbols.

Switching from MySQL’s utf8 to utf8mb4

Step 1: Create a backup

Create a backup of all the databases on the server you want to

upgrade. Safety first!

Step 2: Upgrade the MySQL server

Upgrade the MySQL server to v5.5.3+, or ask your server

administrator to do it for you.

Step 3: Modify databases, tables, and columns

Change the character set and collation properties of the databases,

tables, and columns to use utf8mb4instead

of utf8.

# For each database:ALTER DATABASE database_name CHARACTER SET=utf8mb4 COLLATE=utf8mb4_unicode_ci;# For each table:ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;# For each column:ALTER TABLE table_name CHANGE column_name column_name VARCHAR(191)CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;# (Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a `VARCHAR` column.)

Since utf8mb4 is

fully backwards compatible with utf8,

no mojibake or

other forms of data loss should occur. (But you

have a

backup, right?)

Step 4: Check the maximum length of columns and index keys

This is probably the most tedious part of the whole upgrading

process.

When converting from utf8 to utf8mb4, the maximum

length of a column or index key is unchanged in terms

of bytes. Therefore, it is smaller in

terms of characters, because the maximum length of a character is

now four bytes instead of three.

For example, a TINYTEXT column

can hold up to 255 bytes, which correlates to 85 three-byte or 63

four-byte characters. Let’s say you have

a TINYTEXT column

that uses utf8 but must

be able to contain more than 63 characters. Given this requirement,

you can’t convert this column to utf8mb4 unless

you also change the data type to a longer type such

as TEXT —

because if you’d try to fill it with four-byte characters, you’d

only be able to enter 63 characters, but not more.

The same goes for index keys. The InnoDB storage

engine has a maximum index length of 767 bytes, so forutf8 or utf8mb4 columns,

you can index a maximum of 255 or 191 characters, respectively. If

you currently have utf8 columns

with indexes longer than 191 characters, you will need to index a

smaller number of characters when

using utf8mb4. (Because of

this, I had to change some indexed VARCHAR(255) columns

toVARCHAR(191).)

Step 5: Modify connection, client, and server character sets

In your application code, set the connection character set

to utf8mb4. This can be

done by simply replacing any variants

of SET NAMES

utf8 with SET NAMES utf8mb4. If

your old SET

NAMES statement specified the

collation, make sure to change that as well,

e.g. SET NAMES utf8 COLLATE

utf8_unicode_ci becomes SET NAMES utf8mb4 COLLATE

utf8mb4_unicode_ci.

Make sure to set the client and server character set as well. I

have the following in my

[client]default-character-set=utf8mb4[mysql]default-character-set=utf8mb4[mysqld]character-set-client-handshake=FALSE

character-set-server=utf8mb4

collation-server=utf8mb4_unicode_ci

You can easily confirm these settings work correctly:

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';

+--------------------------+--------------------+

| Variable_name | Value |

+--------------------------+--------------------+

| character_set_client | utf8mb4 |

| character_set_connection | utf8mb4 |

| character_set_database | utf8mb4 |

| character_set_filesystem | binary |

| character_set_results | utf8mb4 |

| character_set_server | utf8mb4 |

| character_set_system | utf8 |

| collation_connection | utf8mb4_unicode_ci |

| collation_database | utf8mb4_unicode_ci |

| collation_server | utf8mb4_unicode_ci |

+--------------------------+--------------------+

10 rows in set (0.00 sec)

As you can see, all the relevant options are set

to utf8mb4, except

for character_set_filesystem which

should be binary unless

you’re on a file system that supports multi-byte UTF-8-encoded

characters in file names, and character_set_system which

is always

Note: The default character set

and collation can be configured at

Step 6: Repair and optimize all tables

After upgrading the MySQL server and making the necessary changes

explained above, make sure to repair and optimize all databases and

tables. I didn’t do this right away after upgrading (I didn’t think

it was necessary, as everything seemed to work fine at first

glance), and ran into some weird bugs

where UPDATE statements

didn’t have any effect, even though no errors were thrown.

You could run the following MySQL queries for each table you want

to repair and optimize:

# For each tableREPAIR TABLE table_name;OPTIMIZE TABLE table_name;

Luckily, this can easily be done in one go using the

command-line mysqlcheck utility:

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

This will prompt for the root user’s password, after which all

tables in all databases will be repaired and optimized.

Summary

Never use utf8 in

MySQL — always use utf8mb4 instead.

Updating your databases and code might take some time, but it’s

definitely worth the effort. Why would you arbitrarily limit the

set of symbols that can be used in your database? Why would you

lose data every time a user enters an astral symbol as part of a

comment or message or whatever it is you store in your database?

There’s no reason not to strive for full Unicode support

everywhere. Do the right thing, and

use utf8mb4. 🍻

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值