DataX Usage

DataX / Addax

1. Introduction

Addax (formerly known as DataX) is a general-purpose data collection tool that supports mainstream databases, including Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto (Trino), PostgreSQL, and SQL Server.

2. Framework design

Addax is an offline data synchronization framework built on a framework + plugin architecture.
Reading from and writing to data sources are abstracted into Reader and Writer plugins.

  • Reader: the data collection module. It reads data from the source and sends it to the Framework.

  • Writer: the data writing module. It continuously pulls data from the Framework and writes it to the destination.

  • Framework: connects the Reader and the Writer and serves as the data channel between them, handling core concerns such as buffering, flow control, concurrency, and data conversion.
    The Addax framework exposes a simple interface for interacting with plugins and a simple plugin registration mechanism, so adding a single plugin is enough to connect a new data source to all the others. A minimal job skeleton illustrating this Reader/Writer pairing is sketched below.
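
The following skeleton is not taken from the original article; the plugin names and parameters are placeholders to be filled in from the plugin table in section 4. It only shows the shape of a job: one Reader and one Writer plugged into the framework.

    {
      "job": {
        "setting": {
          "speed": { "channel": 1 }
        },
        "content": [
          {
            "reader": { "name": "<readerPluginName>", "parameter": { } },
            "writer": { "name": "<writerPluginName>", "parameter": { } }
          }
        ]
      }
    }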

3. Requirements

  • JDK 1.8+
  • Python 2.7+ or Python 3.7+

4. Documentation (supported data sources)

database/filesystem | read | write | plugin (reader/writer)                  | memo
------------------- | ---- | ----- | --------------------------------------- | ----
Cassandra           | ✓    | ✓     | cassandrareader/cassandrawriter         |
ClickHouse          | ✓    | ✓     | clickhousereader/clickhousewriter       |
DB2                 | ✓    | ✓     | rdbmsreader/rdbmswriter                 | not fully tested
DBF                 | ✓    | ✓     | dbfreader/dbfwriter                     |
ElasticSearch       | ✓    | ✓     | elasticsearchreader/elasticsearchwriter | originally from @Kestrong
Excel               | ✓    | ✓     | excelreader/excelwriter                 |
FTP                 | ✓    | ✓     | ftpreader/ftpwriter                     |
HBase 1.x (API)     | ✓    | ✓     | hbase11xreader/hbase11xwriter           | uses the HBase API
HBase 1.x (SQL)     | ✓    | ✓     | hbase11xsqlreader/hbase11xsqlwriter     | via Phoenix
HBase 2.x (API)     | ✓    | ✗     | hbase20xreader                          | uses the HBase API
HBase 2.x (SQL)     | ✓    | ✓     | hbase20xsqlreader/hbase20xsqlwriter     | via Phoenix
HDFS                | ✓    | ✓     | hdfsreader/hdfswriter                   | HDFS 2.0 or later
Hive                | ✓    | ✗     | hivereader                              |
HTTP                | ✓    | ✗     | httpreader                              | RESTful APIs
Greenplum           | ✓    | ✓     | postgresqlreader/greenplumwriter        |
InfluxDB            | ✓    | ✓     | influxdbreader/influxdbwriter           | InfluxDB 1.x only
InfluxDB2           | ✓    | ✓     | influxdb2reader/influxdb2writer         | InfluxDB 2.0 or later only
JSON                | ✓    | ✗     | jsonfilereader                          |
Kudu                | ✓    | ✓     | kudureader/kuduwriter                   |
MongoDB             | ✓    | ✓     | mongodbreader/mongodbwriter             |
MySQL/MariaDB       | ✓    | ✓     | mysqlreader/mysqlwriter                 |
Oracle              | ✓    | ✓     | oraclereader/oraclewriter               |
PostgreSQL          | ✓    | ✓     | postgresqlreader/postgresqlwriter       |
Trino               | ✓    | ✓     | rdbmsreader/rdbmswriter                 | Trino
Redis               | ✓    | ✓     | redisreader/rediswriter                 |
SQLite              | ✓    | ✓     | sqlitereader/sqlitewriter               |
SQL Server          | ✓    | ✓     | sqlserverreader/sqlserverwriter         |
TDengine            | ✓    | ✓     | tdenginereader/tdenginewriter           | TDengine
TDH Inceptor2       | ✓    | ✓     | rdbmsreader/rdbmswriter                 | Transwarp TDH 5.1 or later
TEXT                | ✓    | ✓     | textfilereader/textfilewriter           |
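
Which of these plugins are actually bundled with a given release can be checked by listing the plugin directories of the installation. A quick check, assuming the standard plugin directory layout under the /opt/addax-4.0.8 installation used in the next section:

    ls /opt/addax-4.0.8/plugin/reader/
    ls /opt/addax-4.0.8/plugin/writer/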

5. Installation steps

Reference link

1. Install the JDK
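
The original gives no command for this step; on Debian/Ubuntu (the distribution implied by the apt commands in the next step), one possible way is:

sudo apt install openjdk-8-jdk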

2. Install Python 2.7

sudo apt install python2.7

sudo apt install python-pip

sudo apt install python3-pip
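
To confirm the requirements from section 3 are met, the installed versions can be checked:

java -version
python2.7 --version    # or: python3 --version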

3. Extract addax-4.0.8.tar.gz

sudo tar -zxvf addax-4.0.8.tar.gz -C /opt/

4. Run the test job

 cd /opt/addax-4.0.8/
 sudo ./bin/addax.sh ./job/job.json

If there are no errors, the output should look similar to the following:

  ___      _     _
 / _ \    | |   | |
/ /_\ \ __| | __| | __ ___  __
|  _  |/ _` |/ _` |/ _` \ \/ /
| | | | (_| | (_| | (_| |>  <
\_| |_/\__,_|\__,_|\__,_/_/\_\

:: Addax version ::    (v4.0.3-SNAPSHOT)

2021-08-23 13:45:17.199 [        main] INFO  VMInfo               - VMInfo# operatingSystem class => com.sun.management.internal.OperatingSystemImpl
2021-08-23 13:45:17.223 [        main] INFO  Engine               -
{
    "content":
        {
            "reader":{
                "parameter":{
                    "column":[
                        {
                            "type":"string",
                            "value":"addax"
                        },
                        {
                            "type":"long",
                            "value":19890604
                        },
                        {
                            "type":"date",
                            "value":"1989-06-04 00:00:00"
                        },
                        {
                            "type":"bool",
                            "value":true
                        }
                    ],
                    "sliceRecordCount":10
                },
                "name":"streamreader"
            },
            "writer":{
                "parameter":{
                    "print":true
                },
                "name":"streamwriter"
            }
        }
}

2021-08-23 13:45:17.238 [        main] INFO  PerfTrace            - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-08-23 13:45:17.239 [        main] INFO  JobContainer         - Addax jobContainer starts job.
2021-08-23 13:45:17.240 [        main] INFO  JobContainer         - Set jobId = 0
2021-08-23 13:45:17.250 [       job-0] INFO  JobContainer         - Addax Reader.Job [streamreader] do prepare work .
2021-08-23 13:45:17.250 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] do prepare work .
2021-08-23 13:45:17.251 [       job-0] INFO  JobContainer         - Job set Channel-Number to 1 channels.
2021-08-23 13:45:17.251 [       job-0] INFO  JobContainer         - Addax Reader.Job [streamreader] splits to [1] tasks.
2021-08-23 13:45:17.252 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] splits to [1] tasks.
2021-08-23 13:45:17.276 [       job-0] INFO  JobContainer         - Scheduler starts [1] taskGroups.
2021-08-23 13:45:17.282 [ taskGroup-0] INFO  TaskGroupContainer   - taskGroupId=[0] start [1] channels for [1] tasks.
2021-08-23 13:45:17.287 [ taskGroup-0] INFO  Channel              - Channel set byte_speed_limit to -1, No bps activated.
2021-08-23 13:45:17.288 [ taskGroup-0] INFO  Channel              - Channel set record_speed_limit to -1, No tps activated.
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
addax   19890604    1989-06-04 00:00:00 true
2021-08-23 13:45:20.295 [       job-0] INFO  AbstractScheduler    - Scheduler accomplished all tasks.
2021-08-23 13:45:20.296 [       job-0] INFO  JobContainer         - Addax Writer.Job [streamwriter] do post work.
2021-08-23 13:45:20.297 [       job-0] INFO  JobContainer         - Addax Reader.Job [streamreader] do post work.
2021-08-23 13:45:20.302 [       job-0] INFO  JobContainer         - PerfTrace not enable!
2021-08-23 13:45:20.305 [       job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 220 bytes | Speed 73B/s, 3 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.011s | Percentage 100.00%
2021-08-23 13:45:20.307 [       job-0] INFO  JobContainer         -
任务启动时刻                    : 2021-08-23 13:45:17
任务结束时刻                    : 2021-08-23 13:45:20
任务总计耗时                    :                  3s
任务平均流量                    :               73B/s
记录写入速度                    :              3rec/s
读出记录总数                    :                  10
读写失败总数                    :                   0

5. Example: MySQL-to-MySQL synchronization

  • Create two databases
  • Create a user table in each database
  • Insert a few rows into the table in the first database
  • Write the JSON job configuration (a sketch of the database setup and the run command follows the configuration below)
    {
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "user"
                                ],
                                "jdbcUrl": [
                                    "jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8"
                                ]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": "root",
                        "password": "123456",
                        "column": [
                            "id",
                            "name"
                        ],
                        "connection": [
                            {
                                "table": [
                                    "user"
                                ],
                                "jdbcUrl": "jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8"
                            }
                        ]
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 1,
                "byte": 104857600
            },
            "errorLimit": {
                "record": 10,
                "percentage": 0.05
            }
        }
    }
}
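
For reference, the database setup described in the bullets above and the command to run the job might look like the following. This is a sketch under assumptions not stated in the original: the two databases are named mysqlreader and mysqlwriter to match the jdbcUrl values in the configuration, the user table has only id and name columns, the sample rows are made up, and the job file name mysql2mysql.json is arbitrary.

    mysql -uroot -p123456 <<'SQL'
    CREATE DATABASE mysqlreader;
    CREATE DATABASE mysqlwriter;
    CREATE TABLE mysqlreader.user (id INT PRIMARY KEY, name VARCHAR(50));
    CREATE TABLE mysqlwriter.user (id INT PRIMARY KEY, name VARCHAR(50));
    INSERT INTO mysqlreader.user VALUES (1, 'tom'), (2, 'jerry'), (3, 'spike');
    SQL

    # save the configuration above as job/mysql2mysql.json, then run it
    cd /opt/addax-4.0.8/
    sudo ./bin/addax.sh ./job/mysql2mysql.json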

  • setting: defines the control parameters for this job, such as the number of concurrent channels, the maximum number of error records, and the maximum error rate. This section is optional. The configurable items are:
    {
  "setting": {
    "speed": {
      "byte": -1,
      "record": 100,
      "channel": 1
    },
    "errorLimit": {
      "record": 0,
      "percentage": 0.02
    }
  }
}
  • reader: configures everything needed to read data from the source; required.
  • writer: configures everything needed to write data to the destination; required.
  • transformer: data transformation rules. Configure this only when the data read needs to be transformed before it is written; otherwise it can be omitted.
    • name: the plugin name, which determines the database type (e.g. mysqlreader, mysqlwriter)
    • username: the database user name
    • password: the database password
    • column: the list of columns to read or write
    • connection: table: the table name(s)
      jdbcUrl: the JDBC connection URL
  • Execution result (note that this run was executed with DataX rather than Addax, as the banner below shows; see also section 8):
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2022-02-23 14:14:05.602 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2022-02-23 14:14:05.610 [main] INFO  Engine - the machine info  =>

        osInfo: Private Build 1.8 25.312-b07
        jvmInfo:        Linux amd64 5.13.0-30-generic
        cpu num:        1

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [Copy, MarkSweepCompact]

        MEMORY_NAME                    | allocation_size                | init_size
        Eden Space                     | 273.06MB                       | 273.06MB
        Code Cache                     | 240.00MB                       | 2.44MB
        Survivor Space                 | 34.13MB                        | 34.13MB
        Compressed Class Space         | 1,024.00MB                     | 0.00MB
        Metaspace                      | -0.00MB                        | 0.00MB
        Tenured Gen                    | 682.69MB                       | 682.69MB


2022-02-23 14:14:05.629 [main] INFO  Engine -
{
        "content":[
                {
                        "reader":{
                                "name":"mysqlreader",
                                "parameter":{
                                        "column":[
                                                "id",
                                                "name"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":[
                                                                "jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8"
                                                        ],
                                                        "table":[
                                                                "user"
                                                        ]
                                                }
                                        ],
                                        "password":"******",
                                        "username":"root"
                                }
                        },
                        "writer":{
                                "name":"mysqlwriter",
                                "parameter":{
                                        "column":[
                                                "id",
                                                "name"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":"jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8",
                                                        "table":[
                                                                "user"
                                                        ]
                                                }
                                        ],
                                        "password":"******",
                                        "username":"root"
                                }
                        }
                }
        ],
        "setting":{
                "errorLimit":{
                        "percentage":0.05,
                        "record":10
                },
                "speed":{
                        "byte":104857600,
                        "channel":1
                }
        }
}

2022-02-23 14:14:05.653 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2022-02-23 14:14:05.654 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2022-02-23 14:14:05.655 [main] INFO  JobContainer - DataX jobContainer starts job.
2022-02-23 14:14:05.661 [main] INFO  JobContainer - Set jobId = 0
2022-02-23 14:14:06.006 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2022-02-23 14:14:06.027 [job-0] INFO  OriginalConfPretreatmentUtil - table:[user] has columns:[id,name].
2022-02-23 14:14:06.241 [job-0] INFO  OriginalConfPretreatmentUtil - table:[user] all columns:[
id,name
].
2022-02-23 14:14:06.251 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (id,name) VALUES(?,?)
], which jdbcUrl like:[jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
2022-02-23 14:14:06.251 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2022-02-23 14:14:06.255 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2022-02-23 14:14:06.257 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2022-02-23 14:14:06.257 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2022-02-23 14:14:06.258 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 104857600 bytes.
2022-02-23 14:14:06.265 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2022-02-23 14:14:06.267 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2022-02-23 14:14:06.288 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2022-02-23 14:14:06.298 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2022-02-23 14:14:06.299 [job-0] INFO  JobContainer - Running by standalone Mode.
2022-02-23 14:14:06.318 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2022-02-23 14:14:06.324 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2022-02-23 14:14:06.326 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2022-02-23 14:14:06.346 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2022-02-23 14:14:06.353 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select id,name from user
] jdbcUrl:[jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2022-02-23 14:14:06.380 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select id,name from user
] jdbcUrl:[jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2022-02-23 14:14:06.747 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[409]ms
2022-02-23 14:14:06.748 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2022-02-23 14:14:16.333 [job-0] INFO  StandAloneJobContainerCommunicator - Total 3 records, 6 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-02-23 14:14:16.334 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2022-02-23 14:14:16.335 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2022-02-23 14:14:16.335 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2022-02-23 14:14:16.336 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2022-02-23 14:14:16.337 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2022-02-23 14:14:16.342 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%


         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 Copy                 | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s
                 MarkSweepCompact     | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s

2022-02-23 14:14:16.343 [job-0] INFO  JobContainer - PerfTrace not enable!
2022-02-23 14:14:16.344 [job-0] INFO  StandAloneJobContainerCommunicator - Total 3 records, 6 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-02-23 14:14:16.350 [job-0] INFO  JobContainer -
任务启动时刻                    : 2022-02-23 14:14:05
任务结束时刻                    : 2022-02-23 14:14:16
任务总计耗时                    :                 10s
任务平均流量                    :                0B/s
记录写入速度                    :              0rec/s
读出记录总数                    :                   3
读写失败总数                    :                   0

6. Migrating between different data sources

  • Changes required in the job configuration (see the sketch after this list):
    • name
    • jdbcUrl
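
For example, to write the same data to PostgreSQL instead of MySQL, only the writer's name and jdbcUrl (plus the credentials, if they differ) change; the rest of the job file from section 5 stays the same. A sketch of the replaced writer section, assuming a local PostgreSQL instance with a database named target that already contains a matching user table:

    "writer": {
        "name": "postgresqlwriter",
        "parameter": {
            "username": "postgres",
            "password": "123456",
            "column": [ "id", "name" ],
            "connection": [
                {
                    "table": [ "user" ],
                    "jdbcUrl": "jdbc:postgresql://127.0.0.1:5432/target"
                }
            ]
        }
    }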

7. Multi-table sync and scheduled incremental multi-table sync

Reference link

  • The referenced material uses a shell script plus a template file to implement batch synchronization.
  • In practice, the column attribute in the template file also needs to be passed in as a variable (see the sketch after this list).
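
A minimal sketch of that approach, using assumptions not taken from the referenced material: the template file job_template.json contains the placeholders @TABLE@ and @COLUMNS@, tables.txt lists one table per line as <table>:<column list>, and sed substitutes the placeholders before each run. All file names and placeholder markers here are hypothetical.

    #!/bin/bash
    # Batch sync: generate one concrete job file per table from a template and run it.
    ADDAX_HOME=/opt/addax-4.0.8

    # tables.txt, one entry per line, e.g.:
    #   user:"id","name"
    while IFS=':' read -r table columns; do
        # substitute the table name and the column list into the template
        sed -e "s/@TABLE@/${table}/g" \
            -e "s/@COLUMNS@/${columns}/g" \
            job_template.json > "job_${table}.json"

        "${ADDAX_HOME}/bin/addax.sh" "job_${table}.json"
    done < tables.txt

    # For scheduled incremental sync, the same script can be invoked from cron,
    # and a timestamp variable can be substituted into the template's WHERE clause
    # in the same way as @COLUMNS@.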

8. Open issues

1. Running on Windows fails

2. Addax run fails

When using Addax, the database connection cannot be established, while the same job file runs normally under DataX.
