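1. System Requirements
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended)
- Apache Maven 3.x (only needed when building DataX from source)
This guide installs DataX from the binary package, so Maven is not required. The software versions on the test machine are: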
[root@10-31-1-119 ~]# java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
[root@10-31-1-119 ~]#
[root@10-31-1-119 ~]# python -V
Python 2.7.5
[root@10-31-1-119 ~]# cat /etc/redhat-release
CentOS Linux release 7.8.2003 (Core)
[root@10-31-1-119 ~]#
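The JDK requirement above can also be checked from a script. The following is a minimal sketch in Python 3 and is not part of DataX; the function names `parse_java_version` and `meets_jdk_requirement` are hypothetical. It parses the version string printed by `java -version` and applies the "1.8 or later" rule. Note that `java -version` writes to stderr, so a script that captures it has to read stderr rather than stdout.

```python
import re


def parse_java_version(version_output):
    """Return (major, minor) from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    minor = int(match.group(2) or 0)
    return major, minor


def meets_jdk_requirement(version_output):
    """True if the reported Java release is 8 (written 1.8) or later."""
    major, minor = parse_java_version(version_output)
    # Legacy numbering: "1.8.0_242" is Java 8. Newer numbering: "11.0.2" is Java 11.
    feature = minor if major == 1 else major
    return feature >= 8


# First line of the output shown above.
sample = 'openjdk version "1.8.0_242"'
print(parse_java_version(sample), meets_jdk_requirement(sample))  # (1, 8) True
```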
2. Download and Installation
Here DataX is installed from the downloaded binary package.
2.1 Download
wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
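2.2 Installation
DataX is a portable package that needs no installer: extract the downloaded archive and it is ready to use. The extracted directory contains:
- bin: contains the datax.py launch script
- conf: the configuration directory; settings are generally kept in ***.json files
- job: stores the jobs to run
- lib: stores dependency packages
- plugin: stores the reader and writer JAR packages for the heterogeneous data sources
- script: contains the readme.md file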
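After extracting the archive, the directory layout listed above can be verified with a few lines of Python. This is only a sketch in Python 3: the helper name `missing_dirs` is hypothetical, and the expected directory names are taken from the list in this section rather than from any DataX API. The demo runs against a throwaway temporary directory, not a real installation.

```python
import os
import tempfile

# Directory names taken from the list above.
EXPECTED_DIRS = ("bin", "conf", "job", "lib", "plugin", "script")


def missing_dirs(datax_home):
    """Return the expected subdirectories that do not exist under datax_home."""
    return [name for name in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(datax_home, name))]


# Demo: a fake, incomplete "installation" with only three of the directories.
with tempfile.TemporaryDirectory() as demo_home:
    for name in ("bin", "conf", "lib"):
        os.mkdir(os.path.join(demo_home, name))
    print(missing_dirs(demo_home))  # ['job', 'plugin', 'script']
```

An empty list means every directory named in this section is present.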
3. Running DataX
3.1 Create the job configuration file
cd $datax_home
cd bin
vi stream2stream.json
Copy the following content into the JSON file:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,
                        "column": [
                            {
                                "type": "long",
                                "value": "10"
                            },
                            {
                                "type": "string",
                                "value": "hello,你好,世界-DataX"
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": 5
            }
        }
    }
}
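The same configuration can be generated from a script instead of typed by hand, which avoids JSON syntax slips. The sketch below is written for Python 3 and is not part of DataX; `build_stream_job`, `write_job`, and `expected_record_count` are hypothetical helper names. The record-count estimate (sliceRecordCount multiplied by channel) is an assumption that happens to agree with the test run in this document, where 5 channels produce 5 reader tasks and 50 records in total; it is not a documented guarantee.

```python
import json


def build_stream_job(slice_record_count=10, channel=5):
    """Build the stream-to-stream job from section 3.1 as a Python dict."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": slice_record_count,
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello,你好,世界-DataX"},
                        ],
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }],
            "setting": {"speed": {"channel": channel}},
        }
    }


def write_job(path, job):
    """Write the job as UTF-8 JSON, keeping the non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(job, handle, ensure_ascii=False, indent=4)


def expected_record_count(job):
    """Assumption: each channel's reader task emits sliceRecordCount records."""
    reader = job["job"]["content"][0]["reader"]["parameter"]
    channel = job["job"]["setting"]["speed"]["channel"]
    return reader["sliceRecordCount"] * channel


job = build_stream_job()
print(expected_record_count(job))  # 50
```

Calling `write_job("stream2stream.json", job)` would produce a file equivalent to the one above.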
3.2 Run DataX
cd $datax_home
cd bin
python datax.py ./stream2stream.json
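When the job has to be started from another program, the command above can be wrapped with `subprocess`. This is a minimal sketch, not a DataX API: `build_datax_command` and `run_datax` are hypothetical names, the interpreter name `python` is an assumption about the host, and treating a non-zero exit code as failure assumes the launcher propagates the job's exit status, which is worth verifying on your installation.

```python
import os
import subprocess


def build_datax_command(datax_home, job_file, python_exe="python"):
    """Build the argument list equivalent to `python datax.py <job file>`."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [python_exe, launcher, job_file]


def run_datax(datax_home, job_file):
    """Run the job and return the launcher's exit code (0 assumed to mean success)."""
    return subprocess.call(build_datax_command(datax_home, job_file))


# Only builds the command; nothing is executed here.
print(build_datax_command("/home/software/datax", "./stream2stream.json"))
```

Passing the arguments as a list avoids shell quoting problems when the job file path contains spaces.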
Test run output:
[root@10-31-1-119 bin]# python datax.py ./stream2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2021-11-22 17:27:56.774 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2021-11-22 17:27:56.783 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.242-b08
jvmInfo: Linux amd64 3.10.0-1127.el7.x86_64
cpu num: 8
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2021-11-22 17:27:56.797 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"streamreader",
"parameter":{
"column":[
{
"type":"long",
"value":"10"
},
{
"type":"string",
"value":"hello,你好,世界-DataX"
}
],
"sliceRecordCount":10
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"encoding":"UTF-8",
"print":true
}
}
}
],
"setting":{
"speed":{
"channel":5
}
}
}
2021-11-22 17:27:56.812 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2021-11-22 17:27:56.814 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-11-22 17:27:56.814 [main] INFO JobContainer - DataX jobContainer starts job.
2021-11-22 17:27:56.815 [main] INFO JobContainer - Set jobId = 0
2021-11-22 17:27:56.827 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2021-11-22 17:27:56.827 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2021-11-22 17:27:56.827 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2021-11-22 17:27:56.827 [job-0] INFO JobContainer - jobContainer starts to do split ...
2021-11-22 17:27:56.828 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels.
2021-11-22 17:27:56.828 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2021-11-22 17:27:56.829 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2021-11-22 17:27:56.850 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2021-11-22 17:27:56.859 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2021-11-22 17:27:56.860 [job-0] INFO JobContainer - Running by standalone Mode.
2021-11-22 17:27:56.873 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2021-11-22 17:27:56.878 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-11-22 17:27:56.878 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2021-11-22 17:27:56.889 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2021-11-22 17:27:56.891 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2021-11-22 17:27:56.894 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2021-11-22 17:27:56.896 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2021-11-22 17:27:56.898 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2021-11-22 17:27:56.999 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[101]ms
2021-11-22 17:27:57.000 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[113]ms
2021-11-22 17:27:57.000 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[107]ms
2021-11-22 17:27:57.000 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[105]ms
2021-11-22 17:27:57.000 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[109]ms
2021-11-22 17:27:57.001 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2021-11-22 17:28:06.882 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2021-11-22 17:28:06.883 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2021-11-22 17:28:06.885 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2021-11-22 17:28:06.886 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2021-11-22 17:28:06.886 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2021-11-22 17:28:06.888 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /home/software/datax/hook
2021-11-22 17:28:06.891 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2021-11-22 17:28:06.892 [job-0] INFO JobContainer - PerfTrace not enable!
2021-11-22 17:28:06.893 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | All Task WaitReaderTime 0.000s | Percentage 100.00%
2021-11-22 17:28:06.893 [job-0] INFO JobContainer -
任务启动时刻 : 2021-11-22 17:27:56
任务结束时刻 : 2021-11-22 17:28:06
任务总计耗时 : 10s
任务平均流量 : 95B/s
记录写入速度 : 5rec/s
读出记录总数 : 50
读写失败总数 : 0
[root@10-31-1-119 bin]#
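The closing summary of the log reports, in order: job start time, job end time, total elapsed time (10s), average throughput (95B/s), record write speed (5rec/s), total records read (50), and total read/write failures (0). To check these figures automatically, the `Total ... records` line can be parsed. The sketch below is written for Python 3 and assumes the line format seen in this run; `SUMMARY_RE` and `parse_summary` are hypothetical names, and larger jobs may format the numbers differently.

```python
import re

# Pattern modeled on the StandAloneJobContainerCommunicator line in the log above.
SUMMARY_RE = re.compile(
    r"Total (?P<records>\d+) records, (?P<bytes>\d+) bytes \| "
    r"Speed (?P<speed>\S+), (?P<rps>\d+) records/s \| "
    r"Error (?P<err_records>\d+) records, (?P<err_bytes>\d+) bytes"
)


def parse_summary(line):
    """Return the counters from a summary line, or None if it does not match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    return {
        "records": int(match.group("records")),
        "bytes": int(match.group("bytes")),
        "speed": match.group("speed"),
        "records_per_s": int(match.group("rps")),
        "error_records": int(match.group("err_records")),
        "error_bytes": int(match.group("err_bytes")),
    }


line = ("2021-11-22 17:28:06.893 [job-0] INFO StandAloneJobContainerCommunicator - "
        "Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | "
        "Error 0 records, 0 bytes | All Task WaitWriterTime 0.001s | "
        "All Task WaitReaderTime 0.000s | Percentage 100.00%")
stats = parse_summary(line)
print(stats["records"], stats["bytes"], stats["error_records"])  # 50 950 0
```

The figures are consistent with each other: 950 bytes over the 10-second run gives 95 B/s, and 50 records over 10 seconds gives 5 records/s.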
References:
- https://github.com/alibaba/DataX/blob/master/userGuid.md