A Concise Guide to the DataX Sync Tool

1. Downloading the tool

DataX download URL: https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz

[root@node3 data]# wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
--2024-04-08 11:53:32--  https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Resolving datax-opensource.oss-cn-hangzhou.aliyuncs.com (datax-opensource.oss-cn-hangzhou.aliyuncs.com)... 47.110.177.48
Connecting to datax-opensource.oss-cn-hangzhou.aliyuncs.com (datax-opensource.oss-cn-hangzhou.aliyuncs.com)|47.110.177.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 853734462 (814M) [application/gzip]
Saving to: 'datax.tar.gz'

100%[===========================================================================================================================================================>] 853,734,462 41.3MB/s   in 21s

2024-04-08 11:53:53 (38.9 MB/s) - 'datax.tar.gz' saved [853734462/853734462]

2. Extracting the package and setting permissions

[root@node3 data]# tar -zxvf datax.tar.gz
[root@node3 data]# chmod 755 -R datax
[root@node3 data]# ll /data/datax
total 4
drwxr-xr-x 2 502 games  118 Apr  8 11:56 bin
drwxr-xr-x 2 502 games  136 Apr  8 11:56 conf
drwxr-xr-x 2 502 games   40 Apr  8 11:56 job
drwxr-xr-x 2 502 games 4096 Apr  8 11:56 lib
drwxr-xr-x 4 502 games   42 Dec  9  2021 log
drwxr-xr-x 4 502 games   42 Dec  9  2021 log_perf
drwxr-xr-x 4 502 games   66 Apr  8 11:56 plugin
drwxr-xr-x 2 502 games   42 Apr  8 11:56 script
drwxr-xr-x 2 502 games   44 Oct 12  2019 tmp

3. DataX help

[root@node3 bin]# python datax.py

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Usage: datax.py [options] job-url-or-path

Options:
  -h, --help            show this help message and exit

  Product Env Options:
    Normal user use these options to set jvm parameters, job runtime mode
    etc. Make sure these options can be used in Product Env.

    -j <jvm parameters>, --jvm=<jvm parameters>
                        Set jvm parameters if necessary.
    --jobid=<job unique id>
                        Set job unique id when running by Distribute/Local
                        Mode.
    -m <job runtime mode>, --mode=<job runtime mode>
                        Set job runtime mode such as: standalone, local,
                        distribute. Default mode is standalone.
    -p <parameter used in job config>, --params=<parameter used in job config>
                        Set job parameter, eg: the source tableName you want
                        to set it by command, then you can use like this:
                        -p"-DtableName=your-table-name", if you have mutiple
                        parameters: -p"-DtableName=your-table-name
                        -DcolumnName=your-column-name".Note: you should config
                        in you job tableName with ${tableName}.
    -r <parameter used in view job config[reader] template>, --reader=<parameter used in view job config[reader] template>
                        View job config[reader] template, eg:
                        mysqlreader,streamreader
    -w <parameter used in view job config[writer] template>, --writer=<parameter used in view job config[writer] template>
                        View job config[writer] template, eg:
                        mysqlwriter,streamwriter

  Develop/Debug Options:
    Developer use these options to trace more details of DataX.

    -d, --debug         Set to remote debug mode.
    --loglevel=<log level>
                        Set log level such as: debug, info, all etc.

4. Generating a sync job configuration file

For the available reader and writer plugins, see: https://github.com/alibaba/DataX/blob/master/introduction.md
Example: MySQL to MySQL

[root@node3 bin]# python datax.py -r mysqlreader -w mysqlwriter

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


Please refer to the mysqlreader document:
     https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md

Please refer to the mysqlwriter document:
     https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md

Please save the following configuration as a json file and  use
     python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.

{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "username": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
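The empty template can also be filled in from a script when many similar jobs need to be produced. The sketch below builds a minimal mysql-to-mysql job dict matching the template above; the connection URLs, table names, and credentials in the example are placeholder assumptions, not values from this article.

```python
import json

def build_m2m_job(src_url, dst_url, src_table, dst_table,
                  user, password, channels=1):
    """Build a minimal mysqlreader -> mysqlwriter DataX job dict.

    Mirrors the template emitted by `datax.py -r mysqlreader -w mysqlwriter`:
    note that the reader's jdbcUrl is a list while the writer's is a string.
    """
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [{"jdbcUrl": [src_url],
                                        "table": [src_table]}],
                        "username": user,
                        "password": password,
                        "where": "",
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [{"jdbcUrl": dst_url,
                                        "table": [dst_table]}],
                        "username": user,
                        "password": password,
                        "preSql": [],
                        "session": [],
                        "writeMode": "insert",
                    },
                },
            }],
            # channel count controls parallelism; the template shows it as a string
            "setting": {"speed": {"channel": str(channels)}},
        }
    }

# Example (placeholder values): serialize a job for datax.py to run
job = build_m2m_job(
    "jdbc:mysql://127.0.0.1:3306/test?characterEncoding=utf8&useSSL=false",
    "jdbc:mysql://127.0.0.1:3306/test?characterEncoding=utf8&useSSL=false",
    "orders", "orders_copy", "root", "password",
)
job_text = json.dumps(job, indent=4)
```

Writing `job_text` to a file such as `m2m.json` gives something `datax.py` can run directly.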

5. Launching a DataX sync job

python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
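In practice the launch command is often wrapped in a small helper so that failures surface as exit codes. The sketch below assembles the command line shown above; the default `datax_home` path and the parameter dict are illustrative assumptions.

```python
import os
import subprocess

def datax_cmd(job_json, datax_home="/data/datax", jvm=None, params=None):
    """Assemble the datax.py command line for a job file."""
    cmd = ["python", os.path.join(datax_home, "bin", "datax.py")]
    if jvm:                      # e.g. "-Xms1G -Xmx1G"
        cmd.append("--jvm=%s" % jvm)
    if params:                   # e.g. {"O_TABLE": "orders"}
        cmd += ["-p", " ".join("-D%s=%s" % kv for kv in params.items())]
    cmd.append(job_json)
    return cmd

def run_datax(job_json, **kwargs):
    """Run a DataX job; returns the process exit code (0 on success)."""
    return subprocess.call(datax_cmd(job_json, **kwargs))
```

For example, `run_datax("m2m.json", params={"O_TABLE": "orders"})` launches the job and lets a scheduler decide on retry or alerting based on the return code.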

6. Adjusting the JVM options of a DataX run

[root@node3 bin]# python datax.py --jvm="-Xms12G -Xmx15G" test.json

7. Running a job with parameters

The sync job configuration file:

[root@node3 bin]# cat m2m_parm.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false"],
                                "table": ["$O_TABLE"]
                            }
                        ],
                        "password": "password",
                        "username": "root",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.122.28:3306/test?rewriteBatchedStatements=true&characterEncoding=utf8&useSSL=false",
                                "table": ["$T_TABLE"]
                            }
                        ],
                        "password": "password",
                        "preSql": [],
                        "session": [],
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "3"
            }
        }
    }
}
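Here `$O_TABLE` and `$T_TABLE` are placeholders that DataX resolves from the `-p"-Dname=value"` arguments before the job runs. The function below approximates that substitution step to make the mechanism concrete; it is a simplified illustration, not DataX's actual implementation (which the help text documents via the `${name}` form).

```python
import re

# Placeholder pattern: matches ${name} or $name
_VAR = re.compile(r"\$\{(\w+)\}|\$(\w+)")

def substitute_params(job_text, params):
    """Replace $name / ${name} placeholders in a job file's text.

    Unknown placeholders are left untouched, so a missing -p parameter
    is easy to spot in the resolved configuration.
    """
    def repl(m):
        name = m.group(1) or m.group(2)
        return str(params.get(name, m.group(0)))
    return _VAR.sub(repl, job_text)
```

For instance, `substitute_params('"table": ["$O_TABLE"]', {"O_TABLE": "orders"})` yields `'"table": ["orders"]'`, which matches the resolved configuration DataX prints at startup below.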

Run the test sync job:

# Run the job by passing parameters
[root@node3 bin]# python datax.py -p"-DO_TABLE=orders -DT_TABLE=orders_test" m2m_parm.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2024-04-08 12:20:08.705 [main] INFO  MessageSource - JVM TimeZone: GMT+08:00, Locale: zh_CN
2024-04-08 12:20:08.708 [main] INFO  MessageSource - use Locale: zh_CN timeZone: sun.util.calendar.ZoneInfo[id="GMT+08:00",offset=28800000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]
2024-04-08 12:20:08.724 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2024-04-08 12:20:08.731 [main] INFO  Engine - the machine info  =>

        osInfo: Linux amd64 3.10.0-1160.el7.x86_64
        jvmInfo:        Oracle Corporation 1.8 25.144-b01
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size
        PS Eden Space                  | 1,536.50MB                     | 512.50MB
        Code Cache                     | 240.00MB                       | 2.44MB
        Compressed Class Space         | 1,024.00MB                     | 0.00MB
        PS Survivor Space              | 85.00MB                        | 85.00MB
        PS Old Gen                     | 3,413.50MB                     | 1,365.50MB
        Metaspace                      | -0.00MB                        | 0.00MB


2024-04-08 12:20:08.750 [main] INFO  Engine -
{
        "content":[
                {
                        "reader":{
                                "name":"mysqlreader",
                                "parameter":{
                                        "column":[
                                                "*"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":[
                                                                "jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false"
                                                        ],
                                                        "table":[
                                                                "orders"
                                                        ]
                                                }
                                        ],
                                        "password":"********",
                                        "username":"root",
                                        "where":""
                                }
                        },
                        "writer":{
                                "name":"mysqlwriter",
                                "parameter":{
                                        "column":[
                                                "*"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":"jdbc:mysql://192.168.122.28:3306/test?rewriteBatchedStatements=true&characterEncoding=utf8&useSSL=false",
                                                        "table":[
                                                                "orders_test"
                                                        ]
                                                }
                                        ],
                                        "password":"********",
                                        "preSql":[

                                        ],
                                        "session":[

                                        ],
                                        "username":"root",
                                        "writeMode":"insert"
                                }
                        }
                }
        ],
        "setting":{
                "speed":{
                        "channel":"3"
                }
        }
}

2024-04-08 12:20:08.779 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false
2024-04-08 12:20:08.779 [main] INFO  JobContainer - DataX jobContainer starts job.
2024-04-08 12:20:08.780 [main] INFO  JobContainer - Set jobId = 0
2024-04-08 12:20:09.439 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2024-04-08 12:20:09.441 [job-0] WARN  OriginalConfPretreatmentUtil - Your column configuration carries some risk: the columns to read are not configured, so changes to the table's column count or types may affect job correctness or cause failures. Please check and adjust your configuration.
2024-04-08 12:20:09.793 [job-0] INFO  OriginalConfPretreatmentUtil - table:[orders_test] all columns:[
order_id,customer_id,order_date,total_amount
].
2024-04-08 12:20:09.793 [job-0] WARN  OriginalConfPretreatmentUtil - Your column configuration carries some risk: the columns to write are configured as *, so changes to the table's column count or types may affect job correctness or cause failures. Please check and adjust your configuration.
2024-04-08 12:20:09.795 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
insert INTO %s (order_id,customer_id,order_date,total_amount) VALUES(?,?,?,?)
], which jdbcUrl like:[jdbc:mysql://192.168.122.28:3306/test?rewriteBatchedStatements=true&characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true&tinyInt1isBit=false]
2024-04-08 12:20:09.795 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2024-04-08 12:20:09.796 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2024-04-08 12:20:09.796 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2024-04-08 12:20:09.797 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2024-04-08 12:20:09.797 [job-0] INFO  JobContainer - Job set Channel-Number to 3 channels.
2024-04-08 12:20:09.802 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2024-04-08 12:20:09.802 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2024-04-08 12:20:09.835 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2024-04-08 12:20:09.838 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2024-04-08 12:20:09.841 [job-0] INFO  JobContainer - Running by standalone Mode.
2024-04-08 12:20:09.856 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2024-04-08 12:20:09.861 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2024-04-08 12:20:09.862 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2024-04-08 12:20:09.873 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2024-04-08 12:20:09.880 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select * from orders
] jdbcUrl:[jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2024-04-08 12:20:09.907 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select * from orders
] jdbcUrl:[jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2024-04-08 12:20:09.974 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[104]ms
2024-04-08 12:20:09.975 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2024-04-08 12:20:19.872 [job-0] INFO  StandAloneJobContainerCommunicator - Total 7 records, 130 bytes | Speed 13B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-04-08 12:20:19.872 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2024-04-08 12:20:19.873 [job-0] INFO  JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2024-04-08 12:20:19.873 [job-0] INFO  JobContainer - DataX Reader.Job [mysqlreader] do post work.
2024-04-08 12:20:19.873 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2024-04-08 12:20:19.874 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2024-04-08 12:20:19.876 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%


         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s
                 PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s

2024-04-08 12:20:19.876 [job-0] INFO  JobContainer - PerfTrace not enable!
2024-04-08 12:20:19.877 [job-0] INFO  StandAloneJobContainerCommunicator - Total 7 records, 130 bytes | Speed 13B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-04-08 12:20:19.878 [job-0] INFO  JobContainer -
Job start time                  : 2024-04-08 12:20:08
Job end time                    : 2024-04-08 12:20:19
Total elapsed time              :                 11s
Average throughput              :               13B/s
Record write speed              :              0rec/s
Total records read              :                   7
Total read/write failures       :                   0

8. How to add support for a new database

DataX provides a generic RDBMS plugin built on JDBC, so most traditional relational databases, such as GBase 8a and GBase 8s, can be supported through it.

Enabling read support

(1) Copy the corresponding JDBC driver jar into the ${DATAX_HOME}/plugin/reader/rdbmsreader/libs directory
(2) Edit the ${DATAX_HOME}/plugin/reader/rdbmsreader/plugin.json configuration file

[root@node3 rdbmsreader]# cat plugin.json
{
    "name": "rdbmsreader",
    "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader",
    "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
    "developer": "alibaba",
    "drivers":[ "com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver", "com.ibm.db2.jcc.DB2Driver","com.gbase.jdbc.Driver"]
}
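Since registering a database is just an edit to the `drivers` array, the step can be scripted for repeatable deployments. The helper below is a sketch; the path layout follows the listing above, and `com.gbase.jdbc.Driver` stands in for whatever driver class you are adding.

```python
import json

def add_driver(plugin_json_path, driver_class):
    """Append a JDBC driver class to a plugin.json 'drivers' list, idempotently."""
    with open(plugin_json_path) as f:
        cfg = json.load(f)
    drivers = cfg.setdefault("drivers", [])
    if driver_class not in drivers:   # avoid duplicates on repeated runs
        drivers.append(driver_class)
    with open(plugin_json_path, "w") as f:
        json.dump(cfg, f, indent=4)
    return cfg
```

Remember that the driver jar itself must still be copied into the plugin's libs directory; editing plugin.json alone is not enough.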

Enabling write support

(1) Copy the corresponding JDBC driver jar into the ${DATAX_HOME}/plugin/writer/rdbmswriter/libs directory
(2) Edit the ${DATAX_HOME}/plugin/writer/rdbmswriter/plugin.json configuration file

[root@node3 rdbmswriter]# cat plugin.json
{
    "name": "rdbmswriter",
    "class": "com.alibaba.datax.plugin.reader.rdbmswriter.RdbmsWriter",
    "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
    "developer": "alibaba",
    "drivers":["com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver", "com.ibm.db2.jcc.DB2Driver","com.gbase.jdbc.Driver"]
}