DataX/Addax
1. Introduction
Addax (formerly known as DataX) is a general-purpose data collection tool that supports mainstream databases and data stores, including Cassandra, ClickHouse, DBF, Hive, InfluxDB, Kudu, MySQL, Oracle, Presto (Trino), PostgreSQL, and SQL Server.
2. Framework design
Addax is an offline data synchronization framework built on a Framework + plugin architecture. Reading from and writing to data sources is abstracted into Reader and Writer plugins:
- Reader: the data collection module. It reads data from the source and hands it to the Framework.
- Writer: the data writing module. It continuously pulls data from the Framework and writes it to the destination.
- Framework: connects Reader and Writer, acting as the data transfer channel between them and handling the core concerns of buffering, flow control, concurrency, and data conversion.
The Addax Framework exposes a simple interface and a simple plugin registration mechanism, so adding a plugin for one more data source immediately lets it exchange data with all the others, as the job skeleton below illustrates.
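A minimal job skeleton showing how a Reader and a Writer plugin are paired inside a job (streamreader/streamwriter are the built-in test plugins used by the bundled job.json; real jobs substitute e.g. mysqlreader/mysqlwriter and fill in parameter):

    {
        "job": {
            "content": [
                {
                    "reader": { "name": "streamreader", "parameter": {} },
                    "writer": { "name": "streamwriter", "parameter": {} }
                }
            ],
            "setting": { "speed": { "channel": 1 } }
        }
    }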
3. Runtime requirements
- JDK 1.8+
- Python 2.7+ or Python 3.7+
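A quick sanity check of the environment before installing (assuming the JDK and Python are already on the PATH):

    java -version        # expect 1.8 or later
    python3 --version    # expect 3.7+ (or python2.7 --version for the 2.x line)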
4. Supported data sources
database/filesystem | read | write | plugin (reader/writer) | memo |
---|---|---|---|---|
Cassandra | ✓ | ✓ | cassandrareader/cassandrawriter | |
ClickHouse | ✓ | ✓ | clickhousereader/clickhousewriter | |
DB2 | ✓ | ✓ | rdbmsreader/rdbmswriter | not fully tested |
DBF | ✓ | ✓ | dbfreader/dbfwriter | |
ElasticSearch | ✓ | ✓ | elasticsearchreader/elasticsearchwriter | originally from @Kestrong |
Excel | ✓ | ✓ | excelreader/excelwriter | |
FTP | ✓ | ✓ | ftpreader/ftpwriter | |
HBase 1.x (API) | ✓ | ✓ | hbase11xreader/hbase11xwriter | uses the HBase API |
HBase 1.x (SQL) | ✓ | ✓ | hbase11xsqlreader/hbase11xsqlwriter | via Phoenix |
HBase 2.x (API) | ✓ | x | hbase20xreader | uses the HBase API |
HBase 2.x (SQL) | ✓ | ✓ | hbase20xsqlreader/hbase20xsqlwriter | via Phoenix |
HDFS | ✓ | ✓ | hdfsreader/hdfswriter | supports HDFS 2.0 or later |
Hive | ✓ | x | hivereader | |
HTTP | ✓ | x | httpreader | supports RESTful APIs |
Greenplum | ✓ | ✓ | postgresqlreader/greenplumwriter | |
InfluxDB | ✓ | ✓ | influxdbreader/influxdbwriter | InfluxDB 1.x only |
InfluxDB2 | ✓ | ✓ | influxdb2reader/influxdb2writer | InfluxDB 2.0 or later only |
JSON | ✓ | x | jsonfilereader | |
Kudu | ✓ | ✓ | kudureader/kuduwriter | |
MongoDB | ✓ | ✓ | mongodbreader/mongodbwriter | |
MySQL/MariaDB | ✓ | ✓ | mysqlreader/mysqlwriter | |
Oracle | ✓ | ✓ | oraclereader/oraclewriter | |
PostgreSQL | ✓ | ✓ | postgresqlreader/postgresqlwriter | |
Trino | ✓ | ✓ | rdbmsreader/rdbmswriter | |
Redis | ✓ | ✓ | redisreader/rediswriter | |
SQLite | ✓ | ✓ | sqlitereader/sqlitewriter | |
SQL Server | ✓ | ✓ | sqlserverreader/sqlserverwriter | |
TDengine | ✓ | ✓ | tdenginereader/tdenginewriter | |
TDH Inceptor2 | ✓ | ✓ | rdbmsreader/rdbmswriter | Transwarp TDH 5.1 or later |
TEXT | ✓ | ✓ | textfilereader/textfilewriter | |
5. Installation
1. Install the JDK
2. Install Python
sudo apt install python2.7
sudo apt install python-pip
sudo apt install python3-pip
3. Extract addax-4.0.8.tar.gz
sudo tar -zxvf addax-4.0.8.tar.gz -C /opt/
4. Run the bundled test job
cd /opt/addax-4.0.8/
sudo ./bin/addax.sh ./job/job.json
If no errors are reported, you should see output similar to the following:
___ _ _
/ _ \ | | | |
/ /_\ \ __| | __| | __ ___ __
| _ |/ _` |/ _` |/ _` \ \/ /
| | | | (_| | (_| | (_| |> <
\_| |_/\__,_|\__,_|\__,_/_/\_\
:: Addax version :: (v4.0.3-SNAPSHOT)
2021-08-23 13:45:17.199 [ main] INFO VMInfo - VMInfo# operatingSystem class => com.sun.management.internal.OperatingSystemImpl
2021-08-23 13:45:17.223 [ main] INFO Engine -
{
"content":
{
"reader":{
"parameter":{
"column":[
{
"type":"string",
"value":"addax"
},
{
"type":"long",
"value":19890604
},
{
"type":"date",
"value":"1989-06-04 00:00:00"
},
{
"type":"bool",
"value":true
}
],
"sliceRecordCount":10
},
"name":"streamreader"
},
"writer":{
"parameter":{
"print":true
},
"name":"streamwriter"
}
}
}
2021-08-23 13:45:17.238 [ main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-08-23 13:45:17.239 [ main] INFO JobContainer - Addax jobContainer starts job.
2021-08-23 13:45:17.240 [ main] INFO JobContainer - Set jobId = 0
2021-08-23 13:45:17.250 [ job-0] INFO JobContainer - Addax Reader.Job [streamreader] do prepare work .
2021-08-23 13:45:17.250 [ job-0] INFO JobContainer - Addax Writer.Job [streamwriter] do prepare work .
2021-08-23 13:45:17.251 [ job-0] INFO JobContainer - Job set Channel-Number to 1 channels.
2021-08-23 13:45:17.251 [ job-0] INFO JobContainer - Addax Reader.Job [streamreader] splits to [1] tasks.
2021-08-23 13:45:17.252 [ job-0] INFO JobContainer - Addax Writer.Job [streamwriter] splits to [1] tasks.
2021-08-23 13:45:17.276 [ job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2021-08-23 13:45:17.282 [ taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2021-08-23 13:45:17.287 [ taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-08-23 13:45:17.288 [ taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
addax 19890604 1989-06-04 00:00:00 true
2021-08-23 13:45:20.295 [ job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2021-08-23 13:45:20.296 [ job-0] INFO JobContainer - Addax Writer.Job [streamwriter] do post work.
2021-08-23 13:45:20.297 [ job-0] INFO JobContainer - Addax Reader.Job [streamreader] do post work.
2021-08-23 13:45:20.302 [ job-0] INFO JobContainer - PerfTrace not enable!
2021-08-23 13:45:20.305 [ job-0] INFO StandAloneJobContainerCommunicator - Total 10 records, 220 bytes | Speed 73B/s, 3 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.011s | Percentage 100.00%
2021-08-23 13:45:20.307 [ job-0] INFO JobContainer -
任务启动时刻 : 2021-08-23 13:45:17
任务结束时刻 : 2021-08-23 13:45:20
任务总计耗时 : 3s
任务平均流量 : 73B/s
记录写入速度 : 3rec/s
读出记录总数 : 10
读写失败总数 : 0
6. Example
- Create two databases (the config below uses mysqlreader as the source database and mysqlwriter as the target)
- Create a user table in each database
- Insert a few rows into the source table (see the SQL sketch after the config)
- Write the JSON job config:
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123456",
"column": [
"id",
"name"
],
"connection": [
{
"table": [
"user"
],
"jdbcUrl": [
"jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8"
]
}
]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123456",
"column": [
"id",
"name"
],
"connection": [
{
"table": [
"user"
],
"jdbcUrl": "jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8"
}
]
}
}
}
],
"setting": {
"speed": {
"channel": 1,
"byte": 104857600
},
"errorLimit": {
"record": 10,
"percentage": 0.05
}
}
}
}
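The preparation steps above as a SQL sketch. The database, table, and column names come from the config; the column types and sample values are assumed (the run below moves 3 records):

    CREATE DATABASE mysqlreader;
    CREATE DATABASE mysqlwriter;
    -- the same schema on both sides; the types here are assumed
    CREATE TABLE mysqlreader.user (id INT PRIMARY KEY, name VARCHAR(64));
    CREATE TABLE mysqlwriter.user (id INT PRIMARY KEY, name VARCHAR(64));
    -- seed the source table with a few rows
    INSERT INTO mysqlreader.user (id, name) VALUES (1, 'tom'), (2, 'jerry'), (3, 'spike');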
- setting: defines control parameters for the job, such as the concurrency (channel), byte/record rate limits, the maximum number of error records, and the maximum error rate. Optional. The configurable options are:
{
"setting": {
"speed": {
"byte": -1,
"record": 100,
"channel": 1
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
}
}
- reader: configures everything needed to read the data; required
- writer: configures everything needed to write the data; required
- transformer: data transformation rules; configure this only if the data must be transformed between reading and writing, otherwise omit it
- name: the plugin name, i.e. the database type
- username: the database user
- password: the database password
- column: the columns to read or write
- connection: table holds the table name(s), jdbcUrl the JDBC connection URL
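Save the config as a job file and launch it. The file name below is hypothetical; note that the result that follows was actually produced by DataX (as its banner shows), so the DataX invocation is included for comparison:

    # with Addax
    /opt/addax-4.0.8/bin/addax.sh job/mysql2mysql.json
    # with DataX, whose install dir /opt/datax appears in the log below
    python /opt/datax/bin/datax.py job/mysql2mysql.json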
- Execution result:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2022-02-23 14:14:05.602 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2022-02-23 14:14:05.610 [main] INFO Engine - the machine info =>
osInfo: Private Build 1.8 25.312-b07
jvmInfo: Linux amd64 5.13.0-30-generic
cpu num: 1
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [Copy, MarkSweepCompact]
MEMORY_NAME | allocation_size | init_size
Eden Space | 273.06MB | 273.06MB
Code Cache | 240.00MB | 2.44MB
Survivor Space | 34.13MB | 34.13MB
Compressed Class Space | 1,024.00MB | 0.00MB
Metaspace | -0.00MB | 0.00MB
Tenured Gen | 682.69MB | 682.69MB
2022-02-23 14:14:05.629 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"mysqlreader",
"parameter":{
"column":[
"id",
"name"
],
"connection":[
{
"jdbcUrl":[
"jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8"
],
"table":[
"user"
]
}
],
"password":"******",
"username":"root"
}
},
"writer":{
"name":"mysqlwriter",
"parameter":{
"column":[
"id",
"name"
],
"connection":[
{
"jdbcUrl":"jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8",
"table":[
"user"
]
}
],
"password":"******",
"username":"root"
}
}
}
],
"setting":{
"errorLimit":{
"percentage":0.05,
"record":10
},
"speed":{
"byte":104857600,
"channel":1
}
}
}
2022-02-23 14:14:05.653 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2022-02-23 14:14:05.654 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2022-02-23 14:14:05.655 [main] INFO JobContainer - DataX jobContainer starts job.
2022-02-23 14:14:05.661 [main] INFO JobContainer - Set jobId = 0
2022-02-23 14:14:06.006 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true.
2022-02-23 14:14:06.027 [job-0] INFO OriginalConfPretreatmentUtil - table:[user] has columns:[id,name].
2022-02-23 14:14:06.241 [job-0] INFO OriginalConfPretreatmentUtil - table:[user] all columns:[
id,name
].
2022-02-23 14:14:06.251 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (id,name) VALUES(?,?)
], which jdbcUrl like:[jdbc:mysql://127.0.0.1:3306/mysqlwriter?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true]
2022-02-23 14:14:06.251 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2022-02-23 14:14:06.255 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do prepare work .
2022-02-23 14:14:06.257 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work .
2022-02-23 14:14:06.257 [job-0] INFO JobContainer - jobContainer starts to do split ...
2022-02-23 14:14:06.258 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 104857600 bytes.
2022-02-23 14:14:06.265 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] splits to [1] tasks.
2022-02-23 14:14:06.267 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks.
2022-02-23 14:14:06.288 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2022-02-23 14:14:06.298 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2022-02-23 14:14:06.299 [job-0] INFO JobContainer - Running by standalone Mode.
2022-02-23 14:14:06.318 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2022-02-23 14:14:06.324 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2022-02-23 14:14:06.326 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2022-02-23 14:14:06.346 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2022-02-23 14:14:06.353 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select id,name from user
] jdbcUrl:[jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2022-02-23 14:14:06.380 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select id,name from user
] jdbcUrl:[jdbc:mysql://127.0.0.1:3306/mysqlreader?characterEncoding=utf8&yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true].
2022-02-23 14:14:06.747 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[409]ms
2022-02-23 14:14:06.748 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2022-02-23 14:14:16.333 [job-0] INFO StandAloneJobContainerCommunicator - Total 3 records, 6 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-02-23 14:14:16.334 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2022-02-23 14:14:16.335 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work.
2022-02-23 14:14:16.335 [job-0] INFO JobContainer - DataX Reader.Job [mysqlreader] do post work.
2022-02-23 14:14:16.336 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2022-02-23 14:14:16.337 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2022-02-23 14:14:16.342 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
Copy | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
MarkSweepCompact | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2022-02-23 14:14:16.343 [job-0] INFO JobContainer - PerfTrace not enable!
2022-02-23 14:14:16.344 [job-0] INFO StandAloneJobContainerCommunicator - Total 3 records, 6 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.002s | Percentage 100.00%
2022-02-23 14:14:16.350 [job-0] INFO JobContainer -
任务启动时刻 : 2022-02-23 14:14:05
任务结束时刻 : 2022-02-23 14:14:16
任务总计耗时 : 10s
任务平均流量 : 0B/s
记录写入速度 : 0rec/s
读出记录总数 : 3
读写失败总数 : 0
7. Migration between different data sources
- Fields to change in the job config (see the sketch after this list):
- name
- jdbcUrl
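For instance, swapping the writer in the example above from MySQL to PostgreSQL touches only those fields plus the credentials. A sketch; the plugin name comes from the support table, while host, port, database name, and credentials are assumed:

    "writer": {
        "name": "postgresqlwriter",
        "parameter": {
            "username": "postgres",
            "password": "123456",
            "column": ["id", "name"],
            "connection": [
                {
                    "table": ["user"],
                    "jdbcUrl": "jdbc:postgresql://127.0.0.1:5432/target"
                }
            ]
        }
    }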
8. Multi-table sync and scheduled incremental multi-table sync
- The reference material implements batch multi-table sync with a shell script plus a template job file (see the sketch after this list)
- In practice the column attribute in the template must also be passed in as a variable, since the column list differs per table
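A minimal sketch of that approach. The template file job.tpl.json and its placeholders @TABLE@ and @COLUMNS@ are assumed names; scheduling this script with cron and adding a where filter on an update-time column is the usual way to make the runs incremental:

    #!/bin/bash
    # per-table column lists (assumed); keys are table names
    declare -A COLUMNS=(
        [user]='"id","name"'
        [orders]='"id","user_id"'
    )
    for table in "${!COLUMNS[@]}"; do
        # render the template into a concrete job file
        sed -e "s/@TABLE@/${table}/g" \
            -e "s/@COLUMNS@/${COLUMNS[$table]}/g" \
            job.tpl.json > "job_${table}.json"
        /opt/addax-4.0.8/bin/addax.sh "job_${table}.json"
    done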
9. Open issues
1. Fails to run on Windows
2. Addax run fails: with Addax the job cannot connect to the database, while the same job file runs fine under DataX