DataX: Synchronizing Data Between MySQL and Oracle, HDFS, and PostgreSQL
Chapter 1 Demo Design
Architecture diagram:
Four examples are prepared; the principle and pattern are basically the same in each case:
MySQL — data —> HDFS
HDFS — data —> MySQL
Oracle — data —> MySQL
PostgreSQL — data —> Hive
Chapter 2 Quick Deployment
2.1 Official Links
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source: https://github.com/alibaba/DataX
2.2 Prerequisites
Linux (CentOS 7.6)
JDK (1.8 or later; 1.8 recommended)
Python (Python 2.6.x recommended)
2.3 Installation
1) Upload the downloaded datax.tar.gz to /opt/software on the server
[root@node1 software]# ll
total 789564
-rw-r--r-- 1 root root 808511380 Jul 28 16:49 datax.tar.gz
2) Extract datax.tar.gz to /opt/module
[root@node1 software]$ tar -zxvf datax.tar.gz -C /opt/module/
3) Run the self-check script
[root@node1 bin]$ cd /opt/module/datax/bin/
[root@node1 bin]$ python datax.py /opt/module/datax/job/job.json
Chapter 3 Demo Cases
3.1 DataX Example 1: Reading MySQL Data into HDFS
3.1.1 View the official template
[root@node1 ~]$ python /opt/module/datax/bin/datax.py -r mysqlreader -w hdfswriter
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [],
                        "compress": "",
                        "defaultFS": "",
                        "fieldDelimiter": "",
                        "fileName": "",
                        "fileType": "",
                        "path": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
If your database has no password set, you need to modify the DataX source code: download the source, make the change shown below (it disables the empty-password check), and build it into datax-common-0.0.1-SNAPSHOT.jar.
Then replace datax/lib/datax-common-0.0.1-SNAPSHOT.jar with the newly built jar.
3.1.2 Prepare the data
1) Create the alert_current table in MySQL and load test data
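The exact schema of alert_current is not shown here; a minimal sketch consistent with the columns and types declared in the job configuration below (types inferred from the hdfswriter column list, not from the real table) might look like:

```sql
-- Hypothetical DDL sketch: column names come from the job file below,
-- types are inferred from the hdfswriter "column" declarations.
CREATE TABLE alert_current (
    alert_id           BIGINT NOT NULL,
    definition_id      BIGINT,
    history_id         BIGINT,
    maintenance_state  VARCHAR(255),
    original_timestamp BIGINT,
    latest_timestamp   BIGINT,
    occurrences        BIGINT,
    firmness           VARCHAR(255)
);
```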
3.1.3 Write the job configuration
[root@node1 job]# vi Mysql2Hdfs.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "alert_id",
                            "definition_id",
                            "history_id",
                            "maintenance_state",
                            "original_timestamp",
                            "latest_timestamp",
                            "occurrences",
                            "firmness"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://192.168.20.45:3306/sqooptest"
                                ],
                                "table": [
                                    "alert_current"
                                ]
                            }
                        ],
                        "username": "root",
                        "password": ""
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://192.168.20.1:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "alert_current.txt",
                        "fileType": "text",
                        "path": "/user/datax",
                        "writeMode": "append",
                        "column": [
                            {"name": "alert_id", "type": "bigint"},
                            {"name": "definition_id", "type": "bigint"},
                            {"name": "history_id", "type": "bigint"},
                            {"name": "maintenance_state", "type": "string"},
                            {"name": "original_timestamp", "type": "bigint"},
                            {"name": "latest_timestamp", "type": "bigint"},
                            {"name": "occurrences", "type": "bigint"},
                            {"name": "firmness", "type": "string"}
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
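Before submitting a job it is worth sanity-checking that the reader and writer declare the same number of columns, since a mismatch typically fails only at runtime. A small standalone check (a hypothetical helper, not part of DataX) could be:

```python
import json

def check_column_counts(job_path):
    """Return (reader_count, writer_count) for the first content entry
    of a DataX job file; the two counts should match."""
    with open(job_path) as f:
        job = json.load(f)
    content = job["job"]["content"][0]
    n_reader = len(content["reader"]["parameter"]["column"])
    n_writer = len(content["writer"]["parameter"]["column"])
    return n_reader, n_writer
```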
3.1.4 Run the job
[root@node1 datax]# bin/datax.py ./job/Mysql2Hdfs.json
3.1.5 Check HDFS
Note: when HdfsWriter actually runs, it appends a random suffix to the configured file name, so each writer thread produces its own actual file.
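The note above can be illustrated with a short sketch. The exact suffix format HdfsWriter uses may differ; this only shows the idea of deriving one unique file per writer thread from the configured fileName:

```python
import uuid

def thread_file_names(base_name, num_threads):
    """Illustrative only: give each writer thread a unique file name by
    appending a random suffix to the configured fileName. The real
    HdfsWriter suffix format may differ from this sketch."""
    return ["%s__%s" % (base_name, uuid.uuid4().hex) for _ in range(num_threads)]
```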
3.2 DataX Example 2: Reading HDFS Data into MySQL
Export the HDFS data produced by example 3.1 into the alert_current_copy table in MySQL.
3.2.1 Write the job configuration
[root@node1 job]# vi Hdfs2Mysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": ["*"],
                        "defaultFS": "hdfs://192.168.20.1:8020",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t",
                        "fileType": "text",
                        "path": "/user/datax/*"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "alert_id",
                            "definition_id",
                            "history_id",
                            "maintenance_state",
                            "original_timestamp",
                            "latest_timestamp",
                            "occurrences",
                            "firmness"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.20.1:3306/sqooptest",
                                "table": [
                                    "alert_current_copy"
                                ]
                            }
                        ],
                        "password": "******",
                        "username": "root",
                        "writeMode": "update"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
3.2.2 View the results
Check the alert_current_copy table in the MySQL database
3.3 DataX Example 3: Synchronizing Data from Oracle to MySQL
3.3.1 Preparation
1) On node 192.168.20.2, pick a reasonably large table in Oracle, LineItem (719 MB), and check its row count:
select count(*) from lineitem;
2) On node 192.168.20.1, create the LINEITEM table in the datax_test database in MySQL
create table LINEITEM (
    l_orderkey      integer not null,
    l_partkey       integer not null,
    l_suppkey       integer not null,
    l_linenumber    integer not null,
    l_quantity      decimal not null,
    l_extendedprice decimal(15,2) not null,
    l_discount      decimal(15,2) not null,
    l_tax           decimal(15,2) not null,
    l_returnflag    varchar(1) not null,
    l_linestatus    varchar(1) not null,
    l_shipdate      timestamp not null,
    l_commitdate    timestamp not null,
    l_receiptdate   timestamp not null,
    l_shipinstruct  varchar(25) not null,
    l_shipmode      varchar(10) not null,
    l_comment       varchar(44) not null
);
3.3.2 Write the job configuration
[root@node1 job]# vi oracle2Mysql.json
{
    "core": {
        "transport": {
            "channel": {
                "speed": {
                    "record": -1,
                    "byte": -1,
                    "batchSize": 2048
                }
            }
        }
    },
    "job": {
        "content": [
            {
                "reader": {
                    "name": "oraclereader",
                    "parameter": {
                        "splitPk": "l_orderkey",
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:oracle:thin:@192.168.20.2:1521:orcl"],
                                "table": ["lineitem"]
                            }
                        ],
                        "password": "******",
                        "username": "username"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "blockSizeInMB": 128,
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.20.1:3307/datax_test",
                                "table": ["LINEITEM"]
                            }
                        ],
                        "password": "******",
                        "preSql": [],
                        "session": [],
                        "username": "root",
                        "writeMode": "update"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "30"
            }
        }
    }
}
3.3.3 Run and check the result
[root@node1 datax]# bin/datax.py ./job/oracle2Mysql.json
The speed was only about 2 MB/s, so the following optimizations were applied:
1) Add the writer parameter "blockSizeInMB": 128
2) Add the reader parameter "splitPk": "l_orderkey"
3) Increase the channel value
After re-running, the speed reached 6-7 MB/s, as shown below.
The whole job took 917 s in total:
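As a rough cross-check of the numbers above, the average end-to-end throughput can be computed from the table size and total elapsed time. Note that this whole-job average includes startup and task scheduling, so it can be well below the instantaneous speed DataX prints in its log:

```python
def avg_throughput_mb_s(size_mb, seconds):
    """Average end-to-end throughput over a whole job, in MB/s."""
    return size_mb / seconds

# Figures from this run: a 719 MB source table, 917 s total.
avg = avg_throughput_mb_s(719, 917)
print(round(avg, 2))  # about 0.78 MB/s averaged over the whole job
```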
JVM tuning:
[root@node1 bin]# vi datax.py
DEFAULT_JVM = "-Xms70g -Xmx70g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=%s/log" % (DATAX_HOME)
3.4 DataX Example 4: Synchronizing Data from Postgres to Hive
Straight to the configuration file:
{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            },
            "errorLimit": {
                "record": 0,
                "percentage": 0.02
            }
        },
        "content": [
            {
                "reader": {
                    "name": "postgresqlreader",
                    "parameter": {
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:postgresql://192.168.20.1:5432/testdb"],
                                "table": ["stu"]
                            }
                        ],
                        "password": "*******",
                        "username": "postgres",
                        "column": ["id", "name"]
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {"name": "id", "type": "int"},
                            {"name": "name", "type": "string"}
                        ],
                        "compress": "",
                        "defaultFS": "hdfs://192.168.20.1:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "stu",
                        "fileType": "text",
                        "path": "/griffin/data/batch/stu",
                        "writeMode": "append"
                    }
                }
            }
        ]
    }
}
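The hdfswriter only lands delimited text files on the HDFS path; for Hive to actually query them, a table over that path with the same columns and delimiter must already exist. A sketch of such a table (the external-table choice and the table name are assumptions, not taken from the original setup):

```sql
-- Hypothetical Hive DDL matching the job file above: same columns,
-- same '\t' delimiter, same HDFS path.
CREATE EXTERNAL TABLE IF NOT EXISTS stu (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/griffin/data/batch/stu';
```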
Summary:
1. These examples ran on an 8-core virtual machine with 32 GB of RAM, so there was limited room for tuning.
2. With a MySQL target, the update writeMode gives incremental synchronization directly; HDFS/Hive targets cannot do this, and a shell script is needed to achieve the same effect, which is a little more cumbersome.
3. The DataX source is open; in these examples the MySQL password check needed to be disabled, which was done by modifying the source, and overall that was still quite convenient.
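The incremental-sync script mentioned in point 2 usually boils down to templating a where condition into the reader configuration for each run. A minimal sketch, assuming latest_timestamp serves as the watermark column (the helper name and window logic are illustrative, not part of DataX):

```python
import json

def make_incremental_job(template_path, out_path, low_ts, high_ts):
    """Hypothetical helper: load a DataX job template, inject a where
    clause on an assumed watermark column (latest_timestamp), and write
    the concrete job file for this run. A cron job would compute the
    window and then invoke bin/datax.py on the generated file."""
    with open(template_path) as f:
        job = json.load(f)
    reader = job["job"]["content"][0]["reader"]["parameter"]
    reader["where"] = "latest_timestamp >= %d and latest_timestamp < %d" % (low_ts, high_ts)
    with open(out_path, "w") as f:
        json.dump(job, f, indent=2)
```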