DataX Quick Start: A Detailed Hands-On Guide


Preface

One day at work, my company decided to migrate our data onto a new project. When I discovered that the database held more than 4,000 tables, I froze for a moment: migrating that much data by hand would consume an enormous amount of time and manpower. Fortunately, Alibaba's open-source DataX is built to solve exactly this problem (see "DataX's Design" below).

After reading this article, you will understand:

  • What DataX is
  • DataX's design
  • DataX's framework design and runtime principles
  • How to get started quickly


1. Overview

1.1 What is DataX

DataX is an open-source offline data synchronization tool for heterogeneous data sources, created by Alibaba. It provides stable, efficient synchronization between a wide range of heterogeneous sources, including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.


1.2 DataX's Design

To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point sync links into a star-shaped topology, with DataX in the middle as the transport hub connecting every data source. To add a new data source, you only need to connect it to DataX, and it can immediately synchronize seamlessly with every source already supported.

1.3 Framework Design

  • Reader: the data collection module; reads data from the source and sends it to the Framework.
  • Writer: the data writing module; continuously pulls data from the Framework and writes it to the destination.
  • Framework: connects Reader and Writer, serving as the data channel between them and handling core concerns such as buffering, flow control, concurrency, and data conversion.

1.4 Runtime Principles

  • Job: the management node of a single run; responsible for data cleanup, sub-task splitting, and TaskGroup monitoring.
  • Task: produced by splitting a Job; the smallest unit of a DataX run, with each Task synchronizing one slice of the data.
  • Schedule: assembles Tasks into TaskGroups; a single TaskGroup runs 5 Tasks concurrently, so a job split across 20 channels yields 20 / 5 = 4 TaskGroups (see the config sketch below).
  • TaskGroup: responsible for starting its Tasks.
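The split is driven by the channel value under setting.speed in the job file, which you will see in every example below. A minimal sketch (the channel count of 20 is hypothetical, chosen to make the arithmetic above concrete; with DataX's default of 5 channels per TaskGroup it yields 4 TaskGroups):

{
    "job": {
        "setting": {
            "speed": {
                "channel": 20
            }
        }
    }
}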

1.5 Data Sources Supported by DataX

| Type | Data Source | Reader | Writer |
| --- | --- | --- | --- |
| RDBMS (relational databases) | MySQL | √ | √ |
| | Oracle | √ | √ |
| | SQL Server | √ | √ |
| | PostgreSQL | √ | √ |
| | DRDS | √ | √ |
| | Generic RDBMS (any relational database) | √ | √ |
| Alibaba Cloud data warehouses | ODPS | √ | √ |
| | ADS | | √ |
| | OSS | √ | √ |
| | OCS | | √ |
| NoSQL data stores | OTS | √ | √ |
| | HBase 0.94 | √ | √ |
| | HBase 1.1 | √ | √ |
| | Phoenix 4.x | √ | √ |
| | Phoenix 5.x | √ | √ |
| | MongoDB | √ | √ |
| | Hive | √ | √ |
| | Cassandra | √ | √ |
| Unstructured data stores | TxtFile | √ | √ |
| | FTP | √ | √ |
| | HDFS | √ | √ |
| | Elasticsearch | | √ |
| Time-series databases | OpenTSDB | √ | |
| | TSDB | | √ |

(Per-plugin documentation is linked from the DataX README on GitHub.)

2. Quick Start

2.1 Environment Setup

Download: http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Source: https://github.com/alibaba/DataX
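If the node has internet access, you can also fetch the tarball directly on the server (a convenience step, assuming wget is installed; the URL is the download address above):

[root@node01 soft]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz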

Requirements:

  • Linux
  • JDK (1.8 or later; 1.8 recommended)
  • Python (Python 2.6.x recommended)
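Before installing, it is worth confirming the JDK and Python versions with the standard version flags (nothing DataX-specific here):

[root@node01 soft]# java -version
[root@node01 soft]# python -V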

Installation:

1) Upload the downloaded datax.tar.gz to any node of the cluster; here I upload it to /export/soft on node01.
2) Extract it to /export/servers/:

[root@node01 soft]# tar -zxvf datax.tar.gz  -C ../servers/

3) Run the self-check job:

[root@node01 soft]# cd /export/servers/datax
[root@node01 datax]# ll
total 36
drwxr-xr-x 2 62265 users 4096 Oct  2 14:42 bin
drwxr-xr-x 2 62265 users 4096 Oct  2 14:42 conf
drwxr-xr-x 2 62265 users 4096 Oct  2 14:42 job
drwxr-xr-x 2 62265 users 4096 Oct  2 14:42 lib
drwxr-xr-x 3 root  root  4096 Aug  8 22:39 log
drwxr-xr-x 3 root  root  4096 Aug  8 22:39 log_perf
drwxr-xr-x 4 62265 users 4096 Oct 12  2019 plugin
drwxr-xr-x 2 62265 users 4096 Oct  2 14:42 script
drwxr-xr-x 2 62265 users 4096 Oct  2 14:42 tmp
[root@node01 datax]# bin/datax.py   job/job.json

If the self-check job finishes with a statistics summary reporting 0 failed records, your environment is working.

2.2 Reading from a Stream and Printing to the Console

View the official template:

[root@node01 datax]# python  bin/datax.py  -r streamreader -w streamwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the streamreader document:
     https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

Please refer to the streamwriter document:
     https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md

Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [],
                        "sliceRecordCount": ""
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Modify the configuration based on the template:

[root@node01 datax]# vim job/stream2_stream.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1
            }
        }
    }
}

Run the job. streamreader emits sliceRecordCount records per channel, so with channel = 1 this prints 10 records in total (partial log):

[root@node01 datax]# bin/datax.py  job/stream2_stream.json	
2020-10-02 15:31:00.714 [job-0] INFO  JobContainer - PerfTrace not enable!
2020-10-02 15:31:00.714 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 130 bytes | Speed 13B/s, 1 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.006s | Percentage 100.00%
2020-10-02 15:31:00.715 [job-0] INFO  JobContainer -
Job start time                  : 2020-10-02 15:30:50
Job end time                    : 2020-10-02 15:31:00
Total elapsed time              :                 10s
Average throughput              :               13B/s
Record write speed              :              1rec/s
Total records read              :                  10
Total read/write failures       :                   0

2.3 Reading from MySQL and Writing to HDFS

Preparation: create the database and table, then load some test data:

create database test;
use test;
create table c_s(
    id   varchar(100) null,
    c_id int          null,
    s_id varchar(20)  null
);
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 1, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 2, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 3, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 5, '201967');
INSERT INTO test.c_s (id, c_id, s_id) VALUES ('123', 6, '201967');
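Optionally, confirm the five rows are in place before wiring up the job (assuming the mysql command-line client is available; host and credentials match the job config below):

[root@node01 datax]# mysql -h node02 -uroot -p123456 -e "select * from test.c_s;"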

View the official template:

[root@node01 datax]# bin/datax.py -r mysqlreader -w hdfswriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the mysqlreader document:
     https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md

Please refer to the hdfswriter document:
     https://github.com/alibaba/DataX/blob/master/hdfswriter/doc/hdfswriter.md

Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [],
                        "compress": "",
                        "defaultFS": "",
                        "fieldDelimiter": "",
                        "fileName": "",
                        "fileType": "",
                        "path": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Modify the official template as follows:

[root@node01 datax]# vim job/mysqlToHDFS.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [
                            "id",
                            "c_id",
                            "s_id"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": [
                                    "jdbc:mysql://node02:3306/test"
                                ],
                                "table": [
                                    "c_s"
                                ]
                            }
                        ],
                        "password": "123456",
                        "username": "root"
                    }
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "column": [
                            {
                                "name": "id",
                                "type": "string"
                            },
                            {
                                "name": "c_id",
                                "type": "int"
                            },
                            {
                                "name": "s_id",
                                "type": "string"
                            }
                        ],
                        "defaultFS": "hdfs://node01:8020",
                        "fieldDelimiter": "\t",
                        "fileName": "c_s.txt",
                        "fileType": "text",
                        "path": "/",
                        "writeMode": "append"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}
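One tip once this job works: DataX job files may contain ${variable} placeholders that are filled in at run time via the -p"-Dname=value" option of datax.py. A hedged sketch, since the where clause and the sid variable are illustrative additions rather than part of the job above:

# add  "where": "s_id = '${sid}'"  to the mysqlreader parameters, then run:
[root@node01 datax]# python bin/datax.py -p"-Dsid=201967" job/mysqlToHDFS.json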

[Figure: mysqlreader parameter reference]
[Figure: hdfswriter parameter reference]
Run the job:

[root@node01 datax]# bin/datax.py  job/mysqlToHDFS.json
2020-10-02 16:12:16.358 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /export/servers/datax/hook
2020-10-02 16:12:16.359 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%


         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 PS MarkSweep         | 1                  | 1                  | 1                  | 0.245s             | 0.245s             | 0.245s
                 PS Scavenge          | 1                  | 1                  | 1                  | 0.155s             | 0.155s             | 0.155s

2020-10-02 16:12:16.359 [job-0] INFO  JobContainer - PerfTrace not enable!
2020-10-02 16:12:16.359 [job-0] INFO  StandAloneJobContainerCommunicator - Total 5 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2020-10-02 16:12:16.360 [job-0] INFO  JobContainer -
Job start time                  : 2020-10-02 16:12:04
Job end time                    : 2020-10-02 16:12:16
Total elapsed time              :                 12s
Average throughput              :                5B/s
Record write speed              :              0rec/s
Total records read              :                   5
Total read/write failures       :                   0
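You can check the result on HDFS before moving on. Note that hdfswriter appends a random suffix to fileName for each writer thread, so match the file with a wildcard:

[root@node01 datax]# hdfs dfs -ls /
[root@node01 datax]# hdfs dfs -cat /c_s.txt__*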

2.4 Reading from HDFS and Writing to MySQL

Preparation: create the target table (the test database already exists from section 2.3):

use test;
create table c_s2(
    id   varchar(100) null,
    c_id int          null,
    s_id varchar(20)  null
);

View the official template:

[root@node01 datax]# bin/datax.py -r hdfsreader -w mysqlwriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the hdfsreader document:
     https://github.com/alibaba/DataX/blob/master/hdfsreader/doc/hdfsreader.md

Please refer to the mysqlwriter document:
     https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md

Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [],
                        "defaultFS": "",
                        "encoding": "UTF-8",
                        "fieldDelimiter": ",",
                        "fileType": "orc",
                        "path": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "username": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}

Modify the official template as follows:

[root@node01 datax]# vim job/hdfsTomysql.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "column": [
                            "*"
                        ],
                        "defaultFS": "hdfs://node01:8020",
                        "encoding": "UTF-8",
                        "fieldDelimiter": "\t",
                        "fileType": "text",
                        "path": "/c_s.txt"
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [
                            "id",
                            "c_id",
                            "s_id"
                        ],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://node02:3306/test",
                                "table": [
                                    "c_s2"
                                ]
                            }
                        ],
                        "password": "123456",
                        "username": "root",
                        "writeMode": "replace"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "1"
            }
        }
    }
}

Run the job. Note that writeMode "replace" maps to MySQL's REPLACE INTO, which only deduplicates on a primary or unique key; since c_s2 defines neither, it behaves like a plain insert here:

[root@node01 datax]# bin/datax.py job/hdfsTomysql.json

         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%


         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 PS MarkSweep         | 1                  | 1                  | 1                  | 0.026s             | 0.026s             | 0.026s
                 PS Scavenge          | 1                  | 1                  | 1                  | 0.015s             | 0.015s             | 0.015s

2020-10-02 16:57:13.152 [job-0] INFO  JobContainer - PerfTrace not enable!
2020-10-02 16:57:13.152 [job-0] INFO  StandAloneJobContainerCommunicator - Total 5 records, 50 bytes | Speed 5B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.033s | Percentage 100.00%
2020-10-02 16:57:13.153 [job-0] INFO  JobContainer -
Job start time                  : 2020-10-02 16:57:02
Job end time                    : 2020-10-02 16:57:13
Total elapsed time              :                 11s
Average throughput              :                5B/s
Record write speed              :              0rec/s
Total records read              :                   5
Total read/write failures       :                   0
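Finally, confirm the rows arrived in MySQL (same client assumption as in section 2.3):

[root@node01 datax]# mysql -h node02 -uroot -p123456 -e "select * from test.c_s2;"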

3. Summary

That wraps up today's walkthrough. We synchronized data from MySQL to HDFS and from HDFS back to MySQL; the same pattern adapts to any of the data sources listed in section 1.5. And if you are wondering whether you really have to hand-write a JSON file for every single job: you don't. The community project DataX-Web puts a web UI on top of DataX, so synchronization jobs can be configured and run from a page, which greatly reduces the learning cost. See you next time!
