Preface
One day at work my company decided to migrate our data onto a new project. When I discovered that the database held more than 4,000 tables, I was stunned: migrating them by hand would cost an enormous amount of time and effort. Fortunately, we can use DataX, Alibaba's open-source tool, to solve this problem (see section 1.2 for its design).
After reading this article you will understand:
- What DataX is
- The design of DataX
- DataX's framework design and how it runs
- How to get started quickly

1. Overview
1.1 What is DataX
DataX is an open-source offline synchronization tool for heterogeneous data sources from Alibaba. It provides stable and efficient data synchronization between a wide range of heterogeneous data sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.

1.2 The Design of DataX
To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped topology, with DataX acting as the central hub that connects every data source. When a new data source needs to be supported, only the connector between that source and DataX has to be built, and it can then synchronize seamlessly with every data source already supported. For example, with N sources and M targets, a mesh needs on the order of N × M dedicated links, while the star topology only needs N reader plugins and M writer plugins.

1.3 Framework Design

- Reader: the data collection module. It reads data from the source and hands it to the Framework.
- Writer: the data writing module. It continuously pulls data from the Framework and writes it to the target.
- Framework: connects the Reader and the Writer, acting as the data transfer channel between them and handling core concerns such as buffering, flow control, concurrency, and data type conversion. A conceptual sketch of this reader/channel/writer pattern follows the list.
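To make the division of labor concrete, here is a minimal, self-contained Python sketch of the Reader → Framework (channel) → Writer hand-off. It only illustrates the buffering and back-pressure idea, it is not DataX's actual implementation, and the class names are invented for this example; in DataX itself the channel additionally enforces speed limits and collects transfer statistics.

import queue
import threading

class StreamReader:
    """Toy reader: produces a fixed number of records (like sliceRecordCount)."""
    def __init__(self, channel, record_count):
        self.channel = channel
        self.record_count = record_count

    def start(self):
        for i in range(self.record_count):
            self.channel.put((i, "hello,DataX"))   # push records into the channel
        self.channel.put(None)                     # sentinel: no more data

class StreamWriter:
    """Toy writer: pulls records from the channel and prints them."""
    def __init__(self, channel):
        self.channel = channel

    def start(self):
        while True:
            record = self.channel.get()
            if record is None:                     # reader has finished
                break
            print(record)

# The bounded queue plays the role of the Framework's channel: it buffers
# records and applies back-pressure (flow control) when it is full.
channel = queue.Queue(maxsize=32)
reader = StreamReader(channel, record_count=10)
writer = StreamWriter(channel)

t_reader = threading.Thread(target=reader.start)
t_writer = threading.Thread(target=writer.start)
t_reader.start()
t_writer.start()
t_reader.join()
t_writer.join()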
1.4 How It Runs

- Job: the management node of a single job. It is responsible for data cleanup, splitting the job into sub-tasks, and monitoring and managing the TaskGroups.
- Task: produced by splitting the Job, it is the smallest unit of a DataX job; each Task synchronizes one portion of the data.
- Schedule: assembles Tasks into TaskGroups; a single TaskGroup runs 5 Tasks concurrently by default.
- TaskGroup: responsible for starting its Tasks (see the worked example below).
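As a rough illustration of the scheduling arithmetic, assuming the default of 5 concurrent Tasks per TaskGroup, the number of TaskGroups is simply the desired total concurrency (channel count) divided by 5. The helper below is a hypothetical sketch of that calculation, not DataX's scheduler code.

import math

TASKS_PER_TASKGROUP = 5  # default concurrency of a single TaskGroup

def plan(task_count, channel_count):
    """Estimate how many TaskGroups a job needs and how its Tasks are spread."""
    taskgroup_count = math.ceil(channel_count / TASKS_PER_TASKGROUP)
    tasks_per_group = math.ceil(task_count / taskgroup_count)
    return taskgroup_count, tasks_per_group

# Example: a job split into 100 Tasks with a desired concurrency of 20 channels
# -> 20 / 5 = 4 TaskGroups, each managing about 25 Tasks (5 running at a time).
print(plan(100, 20))   # (4, 25)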
1.5 Data Sources Supported by DataX
| Type | Data Source | Reader | Writer |
|---|---|---|---|
| RDBMS (relational databases) | MySQL | √ | √ |
| | Oracle | √ | √ |
| | SQLServer | √ | √ |
| | PostgreSQL | √ | √ |
| | DRDS | √ | √ |
| | Generic RDBMS (any relational database) | √ | √ |
| Alibaba Cloud data warehouse storage | ODPS | √ | √ |
| | ADS | | √ |
| | OSS | √ | √ |
| | OCS | √ | √ |
| NoSQL data stores | OTS | √ | √ |
| | HBase 0.94 | √ | √ |
| | HBase 1.1 | √ | √ |
| | Phoenix 4.x | √ | √ |
| | Phoenix 5.x | √ | √ |
| | MongoDB | √ | √ |
| | Hive | √ | √ |
| | Cassandra | √ | √ |
| Unstructured data stores | TxtFile | √ | √ |
| | FTP | √ | √ |
| | HDFS | √ | √ |
| | Elasticsearch | | √ |
| Time series databases | OpenTSDB | √ | |
| | TSDB | √ | √ |
2. Quick Start
2.1 Environment Setup
Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source code: https://github.com/alibaba/DataX
Requirements: Linux, JDK (1.8 or above), and Python.
1) Upload the downloaded datax.tar.gz to any node; here it is uploaded to /export/soft on node01.
2) Extract it into /export/servers/:
[root@node01 soft]# tar -zxvf datax.tar.gz -C ../servers/
3) Run the self-check job:
[root@node01 soft]# cd /export/servers/datax
[root@node01 datax]# ll
总用量 36
drwxr-xr-x 2 62265 users 4096 10月 2 14:42 bin
drwxr-xr-x 2 62265 users 4096 10月 2 14:42 conf
drwxr-xr-x 2 62265 users 4096 10月 2 14:42 job
drwxr-xr-x 2 62265 users 4096 10月 2 14:42 lib
drwxr-xr-x 3 root root 4096 8月 8 22:39 log
drwxr-xr-x 3 root root 4096 8月 8 22:39 log_perf
drwxr-xr-x 4 62265 users 4096 10月 12 2019 plugin
drwxr-xr-x 2 62265 users 4096 10月 2 14:42 script
drwxr-xr-x 2 62265 users 4096 10月 2 14:42 tmp
[root@node01 datax]# bin/datax.py job/job.json
If you see the following output, your environment is set up correctly.

2.2 Reading Data from a Stream and Printing It to the Console
View the template provided by DataX:
[root@node01 datax]# python bin/datax.py -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the streamreader document:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
Please refer to the streamwriter document:
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"column": [],
"sliceRecordCount": ""
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
Modify the configuration based on the template:
[root@node01 datax]# vim job/stream2_stream.json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 1
}
}
}
}
Run the job (partial log output shown):
[root@node01 datax]# bin/datax.py job/stream2_stream.json
2020-10-02 15:31:00.714 [job-0] INFO JobContainer - PerfTrace not enable!
2020-10-02 15:31:00.714 [job-0] INFO StandAloneJobContainerCommunicator - Total 10 records, 130 bytes | Speed 13B/s, 1 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.006s | Percentage 100.00%
2020-10-02 15:31:00.715 [job-0] INFO JobContainer -
任务启动时刻 : 2020-10-02 15:30:50
任务结束时刻 : 2020-10-02 15:31:00
任务总计耗时 : 10s
任务平均流量 : 13B/s
记录写入速度 : 1rec/s
读出记录总数 : 10
读写失败总数 : 0
2.3 Reading Data from MySQL and Writing It to HDFS
Preparation
Create the database and table, and load some test data:
create database test;
use test;
create table c_s(
id varchar(100) null,
c_id int null,
s_id varchar(20) null
);
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 1, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 2, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 3, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 5, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 6, '201967');
View the template provided by DataX:
[root@node01 datax]# bin/datax.py -r mysqlreader -w hdfswriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the mysqlreader document:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
Please refer to the hdfswriter document:
https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": [],
"table": []
}
],
"password": "",
"username": "",
"where": ""
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [],
"compress": "",
"defaultFS": "",
"fieldDelimiter": "",
"fileName": "",
"fileType": "",
"path": "",
"writeMode": ""
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
Modify the configuration based on the official template:
[root@node01 datax]# vim job/mysqlToHDFS.json
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [
"id",
"c_id",
"s_id"
],
"connection": [
{
"jdbcUrl": [
"jdbc:mysql://node02:3306/test"
],
"table": [
"c_s"
]
}
],
"password": "123456",
"username": "root"
}
},
"writer": {
"name": "hdfswriter",
"parameter": {
"column": [
{
"name": "id",
"type": "string"
},
{
"name": "c_id",
"type": "int"
},
{
"name": "s_id",
"type": "string"
}
],
"defaultFS": "hdfs://node01:8020",
"fieldDelimiter": "\t",
"fileName": "c_s.txt",
"fileType": "text",
"path": "/",
"writeMode": "append"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
For the meaning of the mysqlreader parameters (jdbcUrl, table, column, username, password, where) and the hdfswriter parameters (defaultFS, fileType, path, fileName, column, writeMode, fieldDelimiter, compress), refer to the mysqlreader and hdfswriter documents linked above.
Run the job:
[root@node01 datax]# bin/datax.py job/mysqlToHDFS.json
2020-10-02 16:12:16.358 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /export/servers/datax/hook
2020-10-02 16:12:16.359 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.245s | 0.245s | 0.245s
PS Scavenge | 1 | 1 | 1 | 0.155s | 0.155s | 0.155s
2020-10-02 16:12:16.359 [job-0] INFO JobContainer - PerfTrace not enable!
2020-10-02 16:12:16.359 [job-0] INFO StandAloneJobContainerCommunicator - Total 5 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-10-02 16:12:16.360 [job-0] INFO JobContainer -
任务启动时刻 : 2020-10-02 16:12:04
任务结束时刻 : 2020-10-02 16:12:16
任务总计耗时 : 12s
任务平均流量 : 5B/s
记录写入速度 : 0rec/s
读出记录总数 : 5
读写失败总数 : 0
2.4 Reading Data from HDFS and Writing It to MySQL
Preparation
create database if not exists test;
use test;
create table c_s2(
id varchar(100) null,
c_id int null,
s_id varchar(20) null
);
View the template provided by DataX:
[root@node01 datax]# bin/datax.py -r hdfsreader -w mysqlwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the hdfsreader document:
https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md
Please refer to the mysqlwriter document:
https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [],
"defaultFS": "",
"encoding": "UTF-8",
"fieldDelimiter": ",",
"fileType": "orc",
"path": ""
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": "",
"table": []
}
],
"password": "",
"preSql": [],
"session": [],
"username": "",
"writeMode": ""
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
Modify the configuration based on the official template:
[root@node01 datax]# vim job/hdfsTomysql.json
{
"job": {
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"column": [
"*"
],
"defaultFS": "hdfs://node01:8020",
"encoding": "UTF-8",
"fieldDelimiter": "\t",
"fileType": "text",
"path": "/c_s.txt"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [
"id",
"c_id",
"s_id"
],
"connection": [
{
"jdbcUrl": "jdbc:mysql://node02:3306/test",
"table": [
"c_s2"
]
}
],
"password": "123456",
"username": "root",
"writeMode": "replace"
}
}
}
],
"setting": {
"speed": {
"channel": "1"
}
}
}
}
Run the job:
[root@node01 datax]# bin/datax.py job/hdfsTomysql.json
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.026s | 0.026s | 0.026s
PS Scavenge | 1 | 1 | 1 | 0.015s | 0.015s | 0.015s
2020-10-02 16:57:13.152 [job-0] INFO JobContainer - PerfTrace not enable!
2020-10-02 16:57:13.152 [job-0] INFO StandAloneJobContainerCommunicator - Total 5 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.033s | Percentage 100.00%
2020-10-02 16:57:13.153 [job-0] INFO JobContainer -
任务启动时刻 : 2020-10-02 16:57:02
任务结束时刻 : 2020-10-02 16:57:13
任务总计耗时 : 11s
任务平均流量 : 5B/s
记录写入速度 : 0rec/s
读出记录总数 : 5
读写失败总数 : 0
3. Summary
That wraps up today's post. We walked through syncing data from MySQL to HDFS and from HDFS back to MySQL; to connect other systems, refer to the table in 1.5 and adapt the configuration to your situation. Careful readers will have noticed a pain point: do we really have to hand-write a JSON config for every single job? For that problem there is DataX-Web, a community project that lets you configure synchronization jobs through a web page, which greatly lowers the learning cost. A simple way to script many jobs yourself is sketched below. See you next time~
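Before reaching for DataX-Web, a small script can already remove most of the repetition when you have many tables to migrate. The sketch below is a minimal, hypothetical example (the install path, table list, columns, and credentials are placeholders taken from the examples above): it renders one MySQL-to-HDFS job JSON per table and runs each one through datax.py. It is not part of DataX itself.

import json
import subprocess

DATAX_HOME = "/export/servers/datax"   # assumed install path from 2.1

# Placeholder: map each table to its columns (name, target type),
# mirroring the hand-written c_s job from 2.3.
TABLES = {
    "c_s": [("id", "string"), ("c_id", "int"), ("s_id", "string")],
}

def build_job(table, columns):
    """Build a mysqlreader -> hdfswriter job config for one table."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [name for name, _ in columns],
                        "connection": [{
                            "jdbcUrl": ["jdbc:mysql://node02:3306/test"],
                            "table": [table],
                        }],
                        "username": "root",
                        "password": "123456",
                    },
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [{"name": n, "type": t} for n, t in columns],
                        "defaultFS": "hdfs://node01:8020",
                        "fieldDelimiter": "\t",
                        "fileName": table + ".txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append",
                    },
                },
            }],
            "setting": {"speed": {"channel": 1}},
        }
    }

for table, columns in TABLES.items():
    job_file = "%s/job/%s_to_hdfs.json" % (DATAX_HOME, table)
    with open(job_file, "w") as f:
        json.dump(build_job(table, columns), f, indent=2)
    # Run DataX on the generated job, exactly as in the examples above.
    subprocess.check_call(["python", DATAX_HOME + "/bin/datax.py", job_file])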
