DataX Installation and Usage
I. Environment Preparation
1. Create the user, group, and directory
groupadd -g 1400 datax
useradd -g datax -u 1400 datax
mkdir /datax
chown datax:datax /datax
2. System requirements
Linux
JDK (1.8 or later; 1.8 recommended)
Python (2.6.x recommended)
Apache Maven 3.x (needed only to compile DataX from source)
(1) Upgrade the JDK to 1.8.0_251 (download link)
Install:
rpm -ivh jdk-8u251-linux-x64.rpm
Check the version:
java -version
(2) Install apache-maven (download link)
Extract it into /datax:
unzip apache-maven-3.5.2-bin.zip -d /datax
Add an alias in the datax user's profile (run `source /home/datax/.bash_profile` afterwards to apply it):
vi /home/datax/.bash_profile
alias mvn='/datax/apache-maven-3.5.2/bin/mvn'
Check the version:
mvn -version
alias gives a command an alternative name (it acts on commands), while export sets an environment variable (it acts on variables).
(3) Check the Python version: python -V
(4) Install and configure DataX (download link)
Upload the package to the /datax directory.
Extract it:
tar -zxvf datax.tar.gz
Run the self-check script:
python {YOUR_DATAX_HOME}/bin/datax.py {YOUR_DATAX_HOME}/job/job.json
If the self-check script fails, see the troubleshooting thread: https://developer.aliyun.com/ask/376622?spm=a2c6h.13148508.0.0.1d734f0eKYBCG7
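The self-check and any real job use the same invocation. As a small sketch (the helper name is my own, not part of DataX), the `python {DATAX_HOME}/bin/datax.py {job}.json` command line can be assembled like this for use in scripts:

```python
import sys

# Sketch of a helper (my own naming, not part of DataX) that builds the
# "python {DATAX_HOME}/bin/datax.py {job}.json" command line shown above.
def datax_cmd(datax_home, job_json):
    return [sys.executable, datax_home + "/bin/datax.py", job_json]

# Example: the self-check job that ships with DataX.
cmd = datax_cmd("/datax/datax", "/datax/datax/job/job.json")
# Run with subprocess.run(cmd) once DataX is unpacked at that path.
```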
II. DataX O2O (Oracle to Oracle) Configuration
To migrate data between two different databases with DataX, you can print a configuration template with: python datax.py -r {Sourcedb_READER} -w {Targetdb_WRITER}
Example: python /datax/datax/bin/datax.py -r oraclereader -w oraclewriter
The Oracle-to-Oracle transfer JSON template is printed as follows:
[datax@ceshi1 ~]$ python /datax/datax/bin/datax.py -r oraclereader -w oraclewriter
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
Please refer to the oraclereader document:
https://github.com/alibaba/DataX/blob/master/oraclereader/doc/oraclereader.md
Please refer to the oraclewriter document:
https://github.com/alibaba/DataX/blob/master/oraclewriter/doc/oraclewriter.md
Please save the following configuration as a json file and use
python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json
to run the job.
{
"job": {
"content": [
{
"reader": {
"name": "oraclereader",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": [],
"table": []
}
],
"password": "",
"username": ""
}
},
"writer": {
"name": "oraclewriter",
"parameter": {
"column": [],
"connection": [
{
"jdbcUrl": "",
"table": []
}
],
"password": "",
"preSql": [],
"username": ""
}
}
}
],
"setting": {
"speed": {
"channel": ""
}
}
}
}
The full list of configuration items for oraclereader and oraclewriter in a DataX job file is covered in the reader and writer documents linked in the output above.
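As a sketch built only from the key names in the template above, the empty fields can be filled programmatically, failing fast if a required field is still blank (the helper name and the choice of which fields count as required are my own):

```python
import json

# Fill the oraclereader/oraclewriter template printed above and refuse to
# emit a job whose required fields are still empty. The key names come from
# the template; which fields count as "required" here is my own choice.
def build_job(reader_param, writer_param, channel="1"):
    job = {
        "job": {
            "content": [{
                "reader": {"name": "oraclereader", "parameter": reader_param},
                "writer": {"name": "oraclewriter", "parameter": writer_param},
            }],
            "setting": {"speed": {"channel": channel}},
        }
    }
    for side in ("reader", "writer"):
        param = job["job"]["content"][0][side]["parameter"]
        for key in ("column", "connection", "username"):
            if not param.get(key):
                raise ValueError("%s parameter %r must not be empty" % (side, key))
    return json.dumps(job, indent=4)
```

Writing the returned string to a `.json` file gives something directly runnable with `python {DATAX_HOME}/bin/datax.py`.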
III. Worked Example
Host | Source | Target |
---|---|---|
IP | 192.168.48.201 | 192.168.48.130 |
Port | 1521 | 1521 |
Instance (SID) | orcl | orcl1 |
User/password | nice/nice | hr/hr |
Table | STUDY | DATAX |
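The jdbcUrl values used in the job file follow the Oracle thin-driver format `jdbc:oracle:thin:@host:port:SID`. A quick sketch deriving them from the table above:

```python
# Derive the Oracle thin-driver JDBC URLs for the job file from the
# source/target values in the table above.
def oracle_thin_url(host, port, sid):
    return "jdbc:oracle:thin:@%s:%s:%s" % (host, port, sid)

source_url = oracle_thin_url("192.168.48.201", 1521, "orcl")
target_url = oracle_thin_url("192.168.48.130", 1521, "orcl1")
# source_url -> "jdbc:oracle:thin:@192.168.48.201:1521:orcl"
```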
Both tables have the same structure:
SQL> desc datax
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------
 STUID                                     NOT NULL NUMBER(10)
 STUNAME                                   NOT NULL VARCHAR2(20)
On the source host, create the job JSON in the /datax/datax/job directory from the oraclereader-to-oraclewriter template. Note that JSON does not allow comments: in this job, preSql empties the target table DATAX before each load, and setting.speed.channel sets the number of parallel channels (required; the job errors without it).
vi test.json
{
"job": {
"content": [
{
"reader": {
"name": "oraclereader",
"parameter": {
"column": [
"STUID",
"STUNAME"
],
"connection": [
{
"jdbcUrl": ["jdbc:oracle:thin:@192.168.48.201:1521:orcl"],
"table": ["STUDY"]
}
],
"password": "nice",
"username": "nice"
}
},
"writer": {
"name": "oraclewriter",
"parameter": {
"column": [
"STUID",
"STUNAME"
],
"connection": [
{
"jdbcUrl": "jdbc:oracle:thin:@192.168.48.130:1521:orcl1",
"table": ["DATAX"]
}
],
"password": "hr",
"preSql": ["delete from DATAX"],
"username": "hr"
}
}
}
],
"setting": {
"speed": {
"channel": "4"
}
}
}
}
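DataX job files must be strict JSON, so `##`-style comments must not appear in the real file. A quick pre-flight parse with the standard library catches such mistakes before DataX does:

```python
import json

# Pre-flight check: a DataX job file must be strict JSON, so stray
# '##' comments (or trailing commas) will make this return False.
def is_valid_job_text(text):
    try:
        json.loads(text)
        return True
    except ValueError:
        return False
```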
Run the JSON job with Python; the output looks like this:
[datax@shuaige job]$ python /datax/datax/bin/datax.py /datax/datax/job/test.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2021-03-11 11:11:15.356 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2021-03-11 11:11:15.364 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.251-b08
jvmInfo: Linux amd64 2.6.32-358.el6.x86_64
cpu num: 2
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2021-03-11 11:11:15.387 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"oraclereader",
"parameter":{
"column":[
"STUID",
"STUNAME"
],
"connection":[
{
"jdbcUrl":[
"jdbc:oracle:thin:@192.168.48.201:1521:orcl"
],
"table":[
"STUDY"
]
}
],
"password":"****",
"username":"nice"
}
},
"writer":{
"name":"oraclewriter",
"parameter":{
"column":[
"STUID",
"STUNAME"
],
"connection":[
{
"jdbcUrl":"jdbc:oracle:thin:@192.168.48.130:1521:orcl1",
"table":[
"DATAX"
]
}
],
"password":"**",
"preSql":[
"delete from DATAX"
],
"username":"hr"
}
}
}
],
"setting":{
"speed":{
"channel":"4"
}
}
}
2021-03-11 11:11:15.409 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2021-03-11 11:11:15.411 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-03-11 11:11:15.412 [main] INFO JobContainer - DataX jobContainer starts job.
2021-03-11 11:11:15.414 [main] INFO JobContainer - Set jobId = 0
2021-03-11 11:11:15.747 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:oracle:thin:@192.168.48.201:1521:orcl.
2021-03-11 11:11:15.938 [job-0] INFO OriginalConfPretreatmentUtil - table:[STUDY] has columns:[STUID,STUNAME].
2021-03-11 11:11:16.190 [job-0] INFO OriginalConfPretreatmentUtil - table:[DATAX] all columns:[
STUID,STUNAME
].
2021-03-11 11:11:16.226 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (STUID,STUNAME) VALUES(?,?)
], which jdbcUrl like:[jdbc:oracle:thin:@192.168.48.130:1521:orcl1]
2021-03-11 11:11:16.228 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2021-03-11 11:11:16.229 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] do prepare work .
2021-03-11 11:11:16.231 [job-0] INFO JobContainer - DataX Writer.Job [oraclewriter] do prepare work .
2021-03-11 11:11:16.263 [job-0] INFO CommonRdbmsWriter$Job - Begin to execute preSqls:[delete from DATAX]. context info:jdbc:oracle:thin:@192.168.48.130:1521:orcl1.
2021-03-11 11:11:16.270 [job-0] INFO JobContainer - jobContainer starts to do split ...
2021-03-11 11:11:16.271 [job-0] INFO JobContainer - Job set Channel-Number to 4 channels.
2021-03-11 11:11:16.283 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] splits to [1] tasks.
2021-03-11 11:11:16.287 [job-0] INFO JobContainer - DataX Writer.Job [oraclewriter] splits to [1] tasks.
2021-03-11 11:11:16.326 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2021-03-11 11:11:16.334 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2021-03-11 11:11:16.337 [job-0] INFO JobContainer - Running by standalone Mode.
2021-03-11 11:11:16.349 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2021-03-11 11:11:16.434 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-03-11 11:11:16.434 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2021-03-11 11:11:16.447 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2021-03-11 11:11:16.455 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select STUID,STUNAME from STUDY
] jdbcUrl:[jdbc:oracle:thin:@192.168.48.201:1521:orcl].
2021-03-11 11:11:16.499 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select STUID,STUNAME from STUDY
] jdbcUrl:[jdbc:oracle:thin:@192.168.48.201:1521:orcl].
2021-03-11 11:11:16.749 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[306]ms
2021-03-11 11:11:16.750 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2021-03-11 11:11:26.438 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 18 bytes | Speed 1B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2021-03-11 11:11:26.438 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2021-03-11 11:11:26.438 [job-0] INFO JobContainer - DataX Writer.Job [oraclewriter] do post work.
2021-03-11 11:11:26.438 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] do post work.
2021-03-11 11:11:26.438 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2021-03-11 11:11:26.439 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /datax/datax/hook
2021-03-11 11:11:26.440 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 1 | 1 | 1 | 0.048s | 0.048s | 0.048s
PS Scavenge | 1 | 1 | 1 | 0.025s | 0.025s | 0.025s
2021-03-11 11:11:26.441 [job-0] INFO JobContainer - PerfTrace not enable!
2021-03-11 11:11:26.442 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 18 bytes | Speed 1B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2021-03-11 11:11:26.443 [job-0] INFO JobContainer -
Job start time      : 2021-03-11 11:11:15
Job end time        : 2021-03-11 11:11:26
Total elapsed time  : 11s
Average throughput  : 1B/s
Record write speed  : 0rec/s
Records read        : 2
Read/write failures : 0
[datax@shuaige job]$
The job succeeded. Check the data on the target database:
SQL> select * from datax;
STUID STUNAME
---------- --------------------
1 xiaohong
2 xiaoming
SQL>
The data transferred successfully. If running the JSON job fails, check the job file first: the JSON formatting, the case of usernames and passwords, and the brackets and quotes. Always set the parallelism: the "channel" parameter must be at least "1". The preSql of ["delete from DATAX"] can be omitted, but then rows that hit a primary-key conflict will fail to load; you can also substitute other SQL there.
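For scripted monitoring, the record and error counts can be scraped from the final StandAloneJobContainerCommunicator line of the log. A regex sketch against the format shown in the run above:

```python
import re

# Scrape "Total N records ... Error M records" from DataX's final
# StandAloneJobContainerCommunicator log line (format as in the run above).
def job_counts(log_line):
    m = re.search(r"Total (\d+) records.*Error (\d+) records", log_line)
    return (int(m.group(1)), int(m.group(2))) if m else None

line = ("2021-03-11 11:11:26.442 [job-0] INFO StandAloneJobContainerCommunicator - "
        "Total 2 records, 18 bytes | Speed 1B/s, 0 records/s | "
        "Error 0 records, 0 bytes | Percentage 100.00%")
# job_counts(line) -> (2, 0)
```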
That's all. Thanks for reading.