A Concise Guide to the DataX Sync Tool
1. Downloading the tool
DataX download URL: https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@node3 data]# wget https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
--2024-04-08 11:53:32-- https://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
Resolving datax-opensource.oss-cn-hangzhou.aliyuncs.com (datax-opensource.oss-cn-hangzhou.aliyuncs.com)... 47.110.177.48
Connecting to datax-opensource.oss-cn-hangzhou.aliyuncs.com (datax-opensource.oss-cn-hangzhou.aliyuncs.com)|47.110.177.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 853734462 (814M) [application/gzip]
Saving to: 'datax.tar.gz'
100%[===========================================================================================================================================================>] 853,734,462 41.3MB/s   in 21s
2024-04-08 11:53:53 (38.9 MB/s) - 'datax.tar.gz' saved [853734462/853734462]
2. Extracting the package and setting permissions
[root@node3 data]# tar -zxvf datax.tar.gz
[root@node3 data]# chmod 755 -R datax
[root@node3 data]# ll /data/datax
total 4
drwxr-xr-x 2 502 games  118 Apr  8 11:56 bin
drwxr-xr-x 2 502 games  136 Apr  8 11:56 conf
drwxr-xr-x 2 502 games   40 Apr  8 11:56 job
drwxr-xr-x 2 502 games 4096 Apr  8 11:56 lib
drwxr-xr-x 4 502 games   42 Dec  9  2021 log
drwxr-xr-x 4 502 games   42 Dec  9  2021 log_perf
drwxr-xr-x 4 502 games   66 Apr  8 11:56 plugin
drwxr-xr-x 2 502 games   42 Apr  8 11:56 script
drwxr-xr-x 2 502 games   44 Oct 12  2019 tmp
3. DataX help
[root@node3 bin]# python datax.py
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Usage: datax.py [options] job-url-or-path
Options:
  -h, --help            show this help message and exit

  Product Env Options:
    Normal user use these options to set jvm parameters, job runtime mode
    etc. Make sure these options can be used in Product Env.

    -j <jvm parameters>, --jvm=<jvm parameters>
                        Set jvm parameters if necessary.
    --jobid=<job unique id>
                        Set job unique id when running by Distribute/Local
                        Mode.
    -m <job runtime mode>, --mode=<job runtime mode>
                        Set job runtime mode such as: standalone, local,
                        distribute. Default mode is standalone.
    -p <parameter used in job config>, --params=<parameter used in job config>
                        Set job parameter, eg: the source tableName you want
                        to set it by command, then you can use like this:
                        -p"-DtableName=your-table-name", if you have mutiple
                        parameters: -p"-DtableName=your-table-name
                        -DcolumnName=your-column-name".Note: you should config
                        in you job tableName with ${tableName}.
    -r <parameter used in view job config[reader] template>, --reader=<parameter used in view job config[reader] template>
                        View job config[reader] template, eg:
                        mysqlreader,streamreader
    -w <parameter used in view job config[writer] template>, --writer=<parameter used in view job config[writer] template>
                        View job config[writer] template, eg:
                        mysqlwriter,streamwriter

  Develop/Debug Options:
    Developer use these options to trace more details of DataX.

    -d, --debug         Set to remote debug mode.
    --loglevel=<log level>
                        Set log level such as: debug, info, all etc.
4. Generating a sync job configuration file
For the available reader and writer plugins, see: https://github.com/alibaba/DataX/blob/master/introduction.md
Example: MySQL to MySQL
[root@node3 bin]# python datax.py -r mysqlreader -w mysqlwriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the mysqlreader document:
https://github.com/alibaba/DataX/blob/master/mysqlreader/doc/mysqlreader.md
Please refer to the mysqlwriter document:
https://github.com/alibaba/DataX/blob/master/mysqlwriter/doc/mysqlwriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": [],
                                "table": []
                            }
                        ],
                        "password": "",
                        "username": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [
                            {
                                "jdbcUrl": "",
                                "table": []
                            }
                        ],
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "username": "",
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": ""
            }
        }
    }
}
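The generator prints a banner and some hint lines before the JSON itself, so when capturing the template into a file it helps to strip everything before the first `{`. A minimal sketch of that filter, shown here on canned stand-in input (in practice, pipe the output of `python datax.py -r mysqlreader -w mysqlwriter` into the same `sed`; `m2m.json` is an illustrative file name):

```shell
# Keep everything from the first line starting with "{" onward,
# dropping the banner/hint lines that precede the JSON template.
# The printf below stands in for the real generator output.
printf 'DataX banner line\nPlease save the following configuration ...\n{\n    "job": {}\n}\n' \
    | sed -n '/^{/,$p' > m2m.json
cat m2m.json
```

The `sed -n '/^{/,$p'` address range starts printing at the first line that begins with `{` and continues to end of input, so only the JSON body lands in the file.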
5. Launching a DataX sync job
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
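A quick way to exercise this command without any database is the built-in streamreader/streamwriter pair: the reader generates a few constant records and the writer prints them to stdout. A minimal sketch (the file name and record values are illustrative):

```shell
# Write a self-contained smoke-test job: streamreader emits 5 constant
# records, streamwriter prints them, so a run verifies the install.
cat > stream2stream.json <<'EOF'
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [{"type": "string", "value": "hello,DataX"}],
                        "sliceRecordCount": 5
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": true}
                }
            }
        ],
        "setting": {"speed": {"channel": 1}}
    }
}
EOF
# then run it with:
#   python {DATAX_HOME}/bin/datax.py stream2stream.json
```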
6. Adjusting the JVM settings for a DataX run
[root@node3 bin]# python datax.py --jvm="-Xms12G -Xmx15G" test.json
7. Running a job with parameters
The sync job configuration file:
[root@node3 bin]# cat m2m_parm.json
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": ["jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false"],
                                "table": ["$O_TABLE"]
                            }
                        ],
                        "password": "password",
                        "username": "root",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [
                            {
                                "jdbcUrl": "jdbc:mysql://192.168.122.28:3306/test?rewriteBatchedStatements=true&characterEncoding=utf8&useSSL=false",
                                "table": ["$T_TABLE"]
                            }
                        ],
                        "password": "password",
                        "preSql": [],
                        "session": [],
                        "username": "root",
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "3"
            }
        }
    }
}
Run a sync test job
# run the job, passing the table names as command-line parameters
[root@node3 bin]# python datax.py -p"-DO_TABLE=orders -DT_TABLE=orders_test" m2m_parm.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2024-04-08 12:20:08.705 [main] INFO MessageSource - JVM TimeZone: GMT+08:00, Locale: zh_CN
2024-04-08 12:20:08.708 [main] INFO MessageSource - use Locale: zh_CN timeZone: sun.util.calendar.ZoneInfo[id="GMT+08:00",offset=28800000,dstSavings=0,useDaylight=false,transitions=0,lastRule=null]
2024-04-08 12:20:08.724 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2024-04-08 12:20:08.731 [main] INFO Engine - the machine info =>
osInfo: Linux amd64 3.10.0-1160.el7.x86_64
jvmInfo: Oracle Corporation 1.8 25.144-b01
cpu num: 4
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 1,536.50MB | 512.50MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 85.00MB | 85.00MB
PS Old Gen | 3,413.50MB | 1,365.50MB
Metaspace | -0.00MB | 0.00MB
2024-04-08 12:20:08.750 [main] INFO Engine -
{
    "content":[
        {
            "reader":{
                "name":"mysqlreader",
                "parameter":{
                    "column":[
                        "*"
                    ],
                    "connection":[
                        {
                            "jdbcUrl":[
                                "jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false"
                            ],
                            "table":[
                                "orders"
                            ]
                        }
                    ],
                    "password":"********",
                    "username":"root",
                    "where":""
                }
            },
            "writer":{
                "name":"mysqlwriter",
                "parameter":{
                    "column":[
                        "*"
                    ],
                    "connection":[
                        {
                            "jdbcUrl":"jdbc:mysql://192.168.122.28:3306/test?rewriteBatchedStatements=true&characterEncoding=utf8&useSSL=false",
                            "table":[
                                "orders_test"
                            ]
                        }
                    ],
                    "password":"********",
                    "preSql":[
                    ],
                    "session":[
                    ],
                    "username":"root",
                    "writeMode":"insert"
                }
            }
        }
    ],
    "setting":{
        "speed":{
            "channel":"3"
        }
    }
}
2024-04-08 12:20:08.779 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false
2024-04-08 12:20:08.779 [main] INFO JobContainer - DataX jobContainer starts job.
2024-04-08 12:20:08.780 [main] INFO JobContainer - Set jobId = 0
2024-04-08 12:20:09.439 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2024-04-08 12:20:09.441 [job-0] WARN OriginalConfPretreatmentUtil - The column configuration in your job file carries some risk: since you did not list the columns to read, any change in the table's column count or types may affect job correctness or even cause it to fail. Please review and adjust your configuration.
2024-04-08 12:20:09.793 [job-0] INFO OriginalConfPretreatmentUtil - table:[orders_test] all columns:[
order_id,customer_id,order_date,total_amount
].
2024-04-08 12:20:09.793 [job-0] WARN OriginalConfPretreatmentUtil - The column configuration in your job file carries some risk: since the write columns are configured as *, any change in the table's column count or types may affect job correctness or even cause it to fail. Please review and adjust your configuration.
2024-04-08 12:20:09.795 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
insert INTO %s (order_id,customer_id,order_date,total_amount) VALUES(?,?,?,?)
], which jdbcUrl like:[jdbc:mysql://192.168.122.28:3306/test?rewriteBatchedStatements=true&characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&rewriteBatchedStatements=true&tinyInt1isBit=false]
2024-04-08 12:20:09.795 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2024-04-08 12:20:09.796 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2024-04-08 12:20:09.796 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2024-04-08 12:20:09.797 [job-0] INFO JobContainer - jobContainer starts to do split ...
2024-04-08 12:20:09.797 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels.
2024-04-08 12:20:09.802 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2024-04-08 12:20:09.802 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2024-04-08 12:20:09.835 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2024-04-08 12:20:09.838 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2024-04-08 12:20:09.841 [job-0] INFO JobContainer - Running by standalone Mode.
2024-04-08 12:20:09.856 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2024-04-08 12:20:09.861 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2024-04-08 12:20:09.862 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2024-04-08 12:20:09.873 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2024-04-08 12:20:09.880 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select * from orders
] jdbcUrl:[jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2024-04-08 12:20:09.907 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select * from orders
] jdbcUrl:[jdbc:mysql://192.168.122.28:3306/test?characterEncoding=utf8&useSSL=false&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2024-04-08 12:20:09.974 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[104]ms
2024-04-08 12:20:09.975 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2024-04-08 12:20:19.872 [job-0] INFO StandAloneJobContainerCommunicator - Total 7 records, 130 bytes | Speed 13B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-04-08 12:20:19.872 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2024-04-08 12:20:19.873 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2024-04-08 12:20:19.873 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do post work.
2024-04-08 12:20:19.873 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2024-04-08 12:20:19.874 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2024-04-08 12:20:19.876 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2024-04-08 12:20:19.876 [job-0] INFO JobContainer - PerfTrace not enable!
2024-04-08 12:20:19.877 [job-0] INFO StandAloneJobContainerCommunicator - Total 7 records, 130 bytes | Speed 13B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2024-04-08 12:20:19.878 [job-0] INFO JobContainer -
Job start time            : 2024-04-08 12:20:08
Job finish time           : 2024-04-08 12:20:19
Total elapsed time        : 11s
Average traffic           : 13B/s
Record write speed        : 0rec/s
Total records read        : 7
Total read/write failures : 0
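The `$O_TABLE`/`$T_TABLE` substitution above lends itself to a small wrapper when several tables need the same pipeline. A sketch, assuming the m2m_parm.json layout shown earlier; the table list and the `_test` target suffix are illustrative, and the commands are printed (dry run) rather than executed:

```shell
#!/bin/sh
# Dry-run wrapper: print one datax.py invocation per table, reusing the
# single parameterized job file. Pipe the printed lines to sh (or run
# them by hand) to actually execute the syncs.
DATAX_BIN=${DATAX_BIN:-/data/datax/bin/datax.py}
JOB=${JOB:-m2m_parm.json}
for t in orders customers products; do
    echo "python $DATAX_BIN -p\"-DO_TABLE=$t -DT_TABLE=${t}_test\" $JOB"
done | tee datax_cmds.txt
```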
8. How to add support for a new database
DataX ships a generic RDBMS layer, so most traditional relational databases, such as GBase 8a and GBase 8s, can be supported through the rdbmsreader/rdbmswriter plugins.
Adding read support
(1) Copy the database's JDBC driver jar into the ${DATAX_HOME}/plugin/reader/rdbmsreader/libs directory
(2) Edit the ${DATAX_HOME}/plugin/reader/rdbmsreader/plugin.json configuration file
[root@node3 rdbmsreader]# cat plugin.json
{
    "name": "rdbmsreader",
    "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader",
    "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
    "developer": "alibaba",
    "drivers": ["com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver", "com.ibm.db2.jcc.DB2Driver", "com.gbase.jdbc.Driver"]
}
Adding write support
(1) Copy the database's JDBC driver jar into the ${DATAX_HOME}/plugin/writer/rdbmswriter/libs directory
(2) Edit the ${DATAX_HOME}/plugin/writer/rdbmswriter/plugin.json configuration file
[root@node3 rdbmswriter]# cat plugin.json
{
    "name": "rdbmswriter",
    "class": "com.alibaba.datax.plugin.reader.rdbmswriter.RdbmsWriter",
    "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
    "developer": "alibaba",
    "drivers": ["com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver", "com.ibm.db2.jcc.DB2Driver", "com.gbase.jdbc.Driver"]
}
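In both cases, registering a driver boils down to dropping the jar into libs/ and appending its class name to the "drivers" array. The sketch below edits a scratch, cut-down copy of the descriptor so it can be tried safely; in a real install the target is ${DATAX_HOME}/plugin/reader/rdbmsreader/plugin.json (back it up first), and the jar name and `com.example.jdbc.Driver` class are purely illustrative:

```shell
# Work on a scratch copy; point SCRATCH at the real plugin directory
# to apply the same steps for real.
SCRATCH=./rdbmsreader-demo
mkdir -p "$SCRATCH/libs"
# 1) the vendor jar would be copied into libs/, e.g.:
#    cp example-jdbc-driver.jar "$SCRATCH/libs/"
# 2) append the new driver class to the "drivers" array
#    (cut-down plugin.json stands in for the shipped file here):
cat > "$SCRATCH/plugin.json" <<'EOF'
{
    "name": "rdbmsreader",
    "drivers": ["com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver"]
}
EOF
sed -i 's/"com.edb.Driver"/"com.edb.Driver", "com.example.jdbc.Driver"/' "$SCRATCH/plugin.json"
```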