Best Practices for Importing MySQL Data into Elasticsearch

Latest recommended article published on 2024-08-05 14:21:54

senlin1202

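1. Introduction

Elasticsearch (ES) can be used for full-text search, log analysis, metrics analysis, APM, and many other scenarios. It is easy to deploy, and scaling out or handling failures later on is straightforward. To a certain extent, ES delivers on the hope of one system covering multiple scenarios, which greatly reduces the operational cost of running several specialized systems (ES is not a cure-all, of course; it cannot serve workloads that need transactions, for example). Thanks to this generality and ease of use, ES has grown explosively since its first release in 2010, is widely used across all kinds of internet companies and business scenarios, and has risen to 8th place in the DB-Engines database ranking.

Many users want to import data from MySQL into ES but cannot find a good way to do it. This article presents a simple, quick approach for syncing MySQL data to ES.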

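2. Tool overview --- go-mysql-elasticsearch

go-mysql-elasticsearch is an open-source, high-performance tool for syncing MySQL data to ES. It is written in Go and is very easy to build and use. Its principle is simple: it first uses mysqldump to take a snapshot of the current MySQL data, then uses the binlog name and position recorded at that moment to pick up incremental changes, and finally builds RESTful API requests from the binlog events to write the data into ES. Detailed usage steps follow.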
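To make the last step of that pipeline more concrete, here is a minimal Python sketch of how row changes could be turned into the body of an Elasticsearch _bulk request. It is not the tool's actual code (the tool is written in Go). The event format, the function name bulk_lines, and the choice to treat updates as full-document "index" actions are all assumptions made for illustration; the index and type names are taken from the configuration used later in this article.

```python
import json


def bulk_lines(events, index="test_index", doc_type="doc"):
    """Turn row-change events into the NDJSON body of an ES _bulk request.

    `events` is a hypothetical list of (action, row) tuples, where action is
    "insert", "update" or "delete" and row is a dict that contains an "id" key.
    """
    lines = []
    for action, row in events:
        # The primary key becomes the document _id, as in the results shown later.
        meta = {"_index": index, "_type": doc_type, "_id": str(row["id"])}
        if action == "delete":
            lines.append(json.dumps({"delete": meta}))
        else:
            # Assumption: inserts and updates are both sent as "index" actions.
            doc = {key: value for key, value in row.items() if key != "id"}
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(doc))
    # Every line of a bulk body, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = [
        ("insert", {"id": 1, "host_ip": "192.168.1.1", "region": "beijing"}),
        ("delete", {"id": 2}),
    ]
    print(bulk_lines(sample), end="")
```

The "_type" entry matches the ES version used in this article; newer ES releases no longer accept it, so it would have to be dropped there.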

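3. Steps for syncing MySQL data to ES

3.1 Building sample data in MySQL

Since you need to import MySQL data into ES, installing MySQL itself needs no explanation. For completeness, this walkthrough starts from loading the sample data: I wrote a small tool in Go that generates sample records and loads them into MySQL. The table structure is as follows: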

mysql> desc test_table;
+-----------+-------------+------+-----+---------+----------------+
| Field     | Type        | Null | Key | Default | Extra          |
+-----------+-------------+------+-----+---------+----------------+
| id        | int(11)     | NO   | PRI | NULL    | auto_increment |
+-----------+-------------+------+-----+---------+----------------+

The above creates a table named test_table. 2000 sample rows were then loaded into it; some of them are shown below:

mysql> select * from test_table;
+------+------------+-----------+-------------+-----------+
| id   | timestamp  | cpu_usage | host_ip     | region    |
+------+------------+-----------+-------------+-----------+
| 1    | 1527676339 | 0.23      | 192.168.1.1 | beijing   |
| 2    | 1527676399 | 0.78      | 192.168.1.2 | shanghai  |
| 3    | 1527676459 | 0.2       | 192.168.1.3 | guangzhou |
| 4    | 1527676519 | 0.47      | 192.168.1.4 | shanghai  |
| 5    | 1527676579 | 0.13      | 192.168.1.5 | beijing   |
| 6    | 1527676639 | 0.15      | 192.168.1.1 | beijing   |
| 7    | 1527676699 | 0.07      | 192.168.1.2 | shanghai  |
| 8    | 1527676759 | 0.17      | 192.168.1.3 | guangzhou |
| 9    | 1527676819 | 0.94      | 192.168.1.4 | shanghai  |
| 10   | 1527676879 | 0.06      | 192.168.1.5 | beijing   |

At this point, the sample data on the MySQL side is ready.
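The generator tool itself is not shown in this article. As a rough stand-in, the Python sketch below produces rows with the same shape as the sample above and renders them as an INSERT statement. The column names are inferred from the field names that appear in the ES results later on, and the one-minute timestamp step and the rotation of five hosts are read off the ten rows shown; the function names, the random cpu_usage values, and the fixed seed are assumptions for illustration only.

```python
import random

# Host/region pairs in the order in which they repeat in the sample rows.
HOSTS = [
    ("192.168.1.1", "beijing"),
    ("192.168.1.2", "shanghai"),
    ("192.168.1.3", "guangzhou"),
    ("192.168.1.4", "shanghai"),
    ("192.168.1.5", "beijing"),
]


def sample_rows(count=2000, start_ts=1527676339, step=60, seed=0):
    """Generate (id, timestamp, cpu_usage, host_ip, region) tuples."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    rows = []
    for i in range(count):
        host_ip, region = HOSTS[i % len(HOSTS)]
        cpu_usage = round(rng.random(), 2)  # hypothetical load value in [0, 1]
        rows.append((i + 1, start_ts + i * step, cpu_usage, host_ip, region))
    return rows


def insert_statement(rows, table="test_table"):
    """Render the rows as one multi-row INSERT statement."""
    values = ",\n".join("(%d, %d, %.2f, '%s', '%s')" % row for row in rows)
    columns = "(id, `timestamp`, cpu_usage, host_ip, region)"
    return "INSERT INTO %s %s VALUES\n%s;" % (table, columns, values)


if __name__ == "__main__":
    print(insert_statement(sample_rows(count=5)))
```

The generated statement can be piped into the mysql client; for larger volumes, splitting the rows into several smaller INSERT statements would be the safer choice.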

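3.2 Using go-mysql-elasticsearch

Because go-mysql-elasticsearch is written in Go, install Go first; the project requires version 1.6 or later. Installing Go is simple; see the official documentation (download: https://golang.org/dl/, install: https://golang.org/doc/install#install). Then install go-mysql-elasticsearch with the following steps: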

$ go get github.com/siddontang/go-mysql-elasticsearch
$ cd $GOPATH/src/github.com/siddontang/go-mysql-elasticsearch
$ make

Once the tool is built, it needs a sensible configuration before it can be used. A sample configuration with explanatory comments follows:

# Note: the default configuration file of go-mysql-elasticsearch is go-mysql-elasticsearch/etc/river.toml

# MySQL address, user and password
# user must have replication privilege in MySQL.
my_addr = "127.0.0.1:3306"
my_user = "root"
my_pass = "123456"
my_charset = "utf8"

# Set true when elasticsearch use https
#es_https = false

# ES address
es_addr = "9.6.174.42:13982"

# If ES has authentication enabled, set the user name and password
#es_user = "root"
#es_pass = "changeme"

# Path to store data, like master.info, if not set or empty,
# we must use this to support breakpoint resume syncing.
# TODO: support other storage, like etcd.
data_dir = "./var" # stores the binlog name and position

# Inner Http status address
stat_addr = "127.0.0.1:12800"

# pseudo server id like a slave
server_id = 1001

# mysql or mariadb
flavor = "mysql"

# mysqldump execution path
# if not set or empty, ignore mysqldump.
mysqldump = "mysqldump"

# minimal items to be inserted in one bulk
bulk_size = 512

# force flush the pending requests if we don't have enough items >= bulk_size
flush_bulk_time = "200ms"

# Ignore table without primary key
skip_no_pk_table = true # note: go-mysql-elasticsearch will ...

# MySQL data source
[[source]]
schema = "mysql_es"

# Only below tables will be synced into ES.
# "t_[0-9]{4}" is a wildcard table format, you can use it if you have many sub tables, like table_0000 - table_1023
# I don't think it is necessary to sync all tables in a database.
tables = ["test_table*"]

[[rule]]
schema = "mysql_es" # MySQL database name
table = "test_table" # MySQL table name
index = "test_index" # index name in ES
type = "doc" # document type

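The configuration above is the one used for this test; for more advanced needs, consult the official documentation and configure accordingly. With the configuration in place, run go-mysql-elasticsearch as follows: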

$ ./bin/go-mysql-elasticsearch -config=./etc/river.toml
2018/05/31 21:43:44 INFO create BinlogSyncer with config {1001 mysql 127.0.0.1 3306 root utf8 false false <nil> false false 0 0s 0s 0}
2018/05/31 21:43:44 INFO run status http server 127.0.0.1:12800
2018/05/31 21:43:44 INFO skip dump, use last binlog replication pos (mysql-bin.000002, 194296) or GTID %!s(<nil>)
2018/05/31 21:43:44 INFO begin to sync binlog from position (mysql-bin.000002, 194296)
2018/05/31 21:43:44 INFO register slave for master server 127.0.0.1:3306
2018/05/31 21:43:44 INFO start sync binlog at binlog file (mysql-bin.000002, 194296)
2018/05/31 21:43:44 INFO rotate to (mysql-bin.000002, 194296)
2018/05/31 21:43:44 INFO rotate binlog to (mysql-bin.000002, 194296)
2018/05/31 21:43:44 INFO save position (mysql-bin.000002, 194296)

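Note that go-mysql-elasticsearch relies on the binlog, and the binlog must be in row-based format. It also uses the canal component to sync data (canal emulates the MySQL slave protocol: it poses as a MySQL slave and sends the dump command to the MySQL master). MySQL must therefore be configured with the following parameters: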

# The parameters below must be set, otherwise you will run into trouble
log_bin=mysql-bin
binlog_format = ROW
server-id=1
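Before starting the sync, it can help to confirm that the running MySQL instance really picked up these settings. The Python sketch below checks a dictionary of server variables for the three requirements. How the dictionary is obtained is left open: it is assumed to come from the output of SHOW VARIABLES, and the function name check_binlog_settings is hypothetical. The default tool_server_id of 1001 is the server_id from the river.toml above; it must differ from the MySQL server's own id because replication peers need unique ids.

```python
def check_binlog_settings(variables, tool_server_id=1001):
    """Return a list of problems found in a dict of MySQL server variables.

    `variables` is assumed to map variable names to values, for example as
    collected from the rows of SHOW VARIABLES.
    """
    problems = []
    if str(variables.get("log_bin", "")).upper() not in ("ON", "1"):
        problems.append("log_bin is not enabled")
    if str(variables.get("binlog_format", "")).upper() != "ROW":
        problems.append("binlog_format must be ROW")
    server_id = int(variables.get("server_id", 0))
    if server_id == 0:
        problems.append("server_id is not set")
    elif server_id == tool_server_id:
        problems.append("server_id clashes with the sync tool's server_id")
    return problems


if __name__ == "__main__":
    current = {"log_bin": "ON", "binlog_format": "STATEMENT", "server_id": "1"}
    for problem in check_binlog_settings(current):
        print("problem:", problem)
```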

Now let's check whether the MySQL data was imported into ES successfully:

# Request:
GET test_index/_search?size=1000
{
  "sort": [
    {
      "timestamp": {
        "order": "desc"
      }
    }
  ],
  "docvalue_fields": ["timestamp", "host_ip", "region", "cpu_usage"]
}

# Response:
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2000,
    "max_score": null,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "2000",
        "_score": null,
        "fields": {
          "host_ip": [
            "192.168.1.5"
          ],
          "region": [
            "beijing"
          ],
          "cpu_usage": [
            0.05000000074505806
          ],
          "timestamp": [
            1527807286000
          ]
        },
        "sort": [
          1527807286000
        ]
      },
      ......
    ]
  }
}

The total shows that all 2000 rows were imported, so the MySQL import succeeded.
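One detail in the response may look odd: cpu_usage comes back as 0.05000000074505806 rather than a two-decimal value such as 0.05. A likely explanation, assuming the field is stored in ES as a 32-bit float (the index mapping is not shown in this article, so this is an assumption), is single-precision rounding. The short Python sketch below reproduces the effect by round-tripping a value through 32-bit storage; the helper name as_float32 is made up for this example.

```python
import struct


def as_float32(value):
    """Round-trip a Python float through 32-bit IEEE 754 storage."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


if __name__ == "__main__":
    for original in (0.05, 0.56, 0.58):
        print(original, "->", as_float32(original))
```

If exact decimal values matter downstream, mapping the field explicitly to a wider or scaled numeric type before the first sync is one option to consider.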

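For projects that shard data across multiple tables, a wildcard can be used to match them. Suppose we need to sync two tables, test_table and test_table1, into the same Elasticsearch index; just change the rule configuration above to: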

[[rule]]
schema = "mysql_es" # MySQL database name
table = "test_table*" # MySQL table name; this value must also appear in tables under source
index = "test_index" # index name in Elasticsearch
type = "doc" # document type
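To get a feel for which tables such a rule would cover, the Python sketch below applies the pattern to a list of table names. It rests on an assumption that should be checked against the tool's documentation: that a table value containing special characters is treated as a regular expression and matched without anchoring. Both function names are hypothetical. The second helper mirrors the constraint stated in the comment above, namely that the rule's table value has to appear in the tables list of the source section.

```python
import re


def matching_tables(pattern, tables):
    """Return the table names that a wildcard rule would cover.

    Assumption: the pattern is applied as an unanchored regular expression.
    """
    regex = re.compile(pattern)
    return [name for name in tables if regex.search(name)]


def rule_is_covered(rule_table, source_tables):
    """Check that the rule's table value is listed verbatim under source."""
    return rule_table in source_tables


if __name__ == "__main__":
    existing = ["test_table", "test_table1", "users"]
    print(matching_tables("test_table*", existing))
    print(rule_is_covered("test_table*", ["test_table*"]))
```

Under that assumption a loose pattern can match more tables than intended, so a stricter expression such as the "t_[0-9]{4}" form from the configuration comments is worth considering for real sharded tables.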

To verify that the configuration takes effect, I created another table in MySQL, test_table1, and inserted three test rows:

mysql> select * from test_table1;
+------+------------+-----------+-------------+-----------+
| id   | timestamp  | cpu_usage | host_ip     | region    |
+------+------------+-----------+-------------+-----------+
| 3333 | 1528960639 | 0.55      | 192.168.1.2 | chongqing |
| 4444 | 1528960649 | 0.56      | 192.168.1.3 | chengdu   |
| 5555 | 1528960649 | 0.58      | 192.168.1.6 | shenzhen  |
+------+------------+-----------+-------------+-----------+

Then clear the var directory under go-mysql-elasticsearch, restart the program, and look at the data synced to ES again:

"hits": {
  "total": 2003,
  "max_score": null,
  "hits": [
    {
      "_index": "test_index",
      "_type": "doc",
      "_id": "5555",
      "_score": null,
      "fields": {
        "host_ip": [
          "192.168.1.6"
        ],
        "region": [
          "shenzhen"
        ],
        "cpu_usage": [
          0.5799999833106995
        ],
        "timestamp": [
          1528960649000
        ]
      },
      "sort": [
        1528960649000
      ]
    },
    {
      "_index": "test_index",
      "_type": "doc",
      "_id": "4444",
      "_score": null,
      "fields": {
        "host_ip": [
          "192.168.1.3"
        ],
        "region": [
          "chengdu"
        ],
        "cpu_usage": [
          0.5600000023841858
        ],
        "timestamp": [
          1528960649000
        ]
      },
      "sort": [
        1528960649000
      ]
    },
    ... ...

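The result shows 2003 documents in ES, so the data from both test_table and test_table1 has been synced to ES.

Summary

As shown, with go-mysql-elasticsearch we only need to write rules in a configuration file to sync data from MySQL to ES conveniently. The examples above are simple ones; for more advanced needs, see the official go-mysql-elasticsearch documentation.

Besides the tool covered here, two others are worth recommending. One is py-mysql-elasticsearch-sync, written in Python, which works on a similar principle to go-mysql-elasticsearch: both use the binlog to sync data. For installation and usage, see its documentation at https://github.com/zhongbiaodev/py-mysql-elasticsearch-sync. The other is Logstash; syncing data with Logstash requires two plugins, logstash-input-jdbc and logstash-output-elasticsearch. For details, see the official documentation: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html and https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html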