I won't say much more about Trafodion itself here. Because it supports standard JDBC, Trafodion can integrate over JDBC with mainstream ETL tools such as Kettle and Embulk to implement ETL workflows.
In earlier articles we covered how to use Kettle to extract data from Oracle and MySQL into Trafodion; see https://blog.csdn.net/Post_Yuan/article/details/52804105
Embulk can likewise extract data from relational databases into Trafodion; see https://blog.csdn.net/Post_Yuan/article/details/77856320
In this article we introduce the integration of another ETL tool with Trafodion: DataX, Alibaba's open-source offline data synchronization tool. The version used here is 3.0.
DataX is an offline synchronization tool for heterogeneous data sources, designed to provide stable and efficient data synchronization among relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and other heterogeneous sources. To solve the heterogeneous-source synchronization problem, DataX turns the complex mesh of point-to-point synchronization links into a star topology, with DataX itself acting as the central transport hub connecting all data sources. When a new data source needs to be integrated, it only has to be connected to DataX to synchronize seamlessly with every existing source.
For a detailed introduction to DataX, see the project's GitHub documentation:
https://github.com/alibaba/DataX/blob/master/introduction.md
Here we focus on using DataX to transfer data between two different Trafodion clusters; interested readers can also try transfers between Trafodion and any other supported data source.
1 First, download the DataX package via the download link provided at https://github.com/alibaba/DataX and extract it to a directory such as /opt.
2 Next, because of a code issue in DataX, download the DataX source and make the following small modifications.
Edit /DataX-master/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/SubCommonRdbmsReader.java and change
public static class Job extends CommonRdbmsReader.Job {
    public Job(DataBaseType dataBaseType) {
        super(dataBaseType);
    }
}
to:
public static class Job extends CommonRdbmsReader.Job {
    public Job(DataBaseType dataBaseType) {
        super(dataBaseType);
        DBUtil.loadDriverClass("reader", "rdbms");
    }
}
Edit /DataX-master/rdbmswriter/src/main/java/com/alibaba/datax/plugin/writer/rdbmswriter/SubCommonRdbmsWriter.java and change
public static class Job extends CommonRdbmsWriter.Job {
    public Job(DataBaseType dataBaseType) {
        super(dataBaseType);
    }
}
to:
public static class Job extends CommonRdbmsWriter.Job {
    public Job(DataBaseType dataBaseType) {
        super(dataBaseType);
        DBUtil.loadDriverClass("writer", "rdbms");
    }
}
After making these changes, rebuild the rdbmsreader and rdbmswriter modules, then copy the build outputs DataX-master/rdbmsreader/target/datax/plugin/reader/rdbmsreader and DataX-master/rdbmswriter/target/datax/plugin/writer/rdbmswriter over /opt/datax/plugin/reader/rdbmsreader and /opt/datax/plugin/writer/rdbmswriter respectively.
3 Add Trafodion driver support to rdbmsreader
(1) cd /opt/datax/plugin/reader/rdbmsreader/ and edit plugin.json to add the Trafodion driver entry
(2) cd /opt/datax/plugin/reader/rdbmsreader/libs/ and add the jdbcT4.jar file
4 Add Trafodion driver support to rdbmswriter
(1) cd /opt/datax/plugin/writer/rdbmswriter/ and edit plugin.json to add the Trafodion driver entry
(2) cd /opt/datax/plugin/writer/rdbmswriter/libs/ and add the jdbcT4.jar file
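Steps 3 and 4 can also be scripted. Below is a minimal sketch that assumes the stock plugin.json keeps its JDBC driver list under a top-level "drivers" key and that the Trafodion T4 driver class shipped in jdbcT4.jar is named org.trafodion.jdbc.t4.T4Driver; both assumptions are worth verifying against your own DataX build.

```python
import json

def add_driver(plugin_json_path: str, driver_class: str) -> None:
    """Append a JDBC driver class to a DataX plugin.json (idempotent)."""
    with open(plugin_json_path, "r", encoding="utf-8") as f:
        conf = json.load(f)
    # "drivers" is the assumed key for the driver list, as in the stock
    # rdbmsreader/rdbmswriter plugin.json files.
    drivers = conf.setdefault("drivers", [])
    if driver_class not in drivers:
        drivers.append(driver_class)
    with open(plugin_json_path, "w", encoding="utf-8") as f:
        json.dump(conf, f, indent=4, ensure_ascii=False)

# Usage (paths from this article; the driver class name is an assumption):
# add_driver("/opt/datax/plugin/reader/rdbmsreader/plugin.json",
#            "org.trafodion.jdbc.t4.T4Driver")
# add_driver("/opt/datax/plugin/writer/rdbmswriter/plugin.json",
#            "org.trafodion.jdbc.t4.T4Driver")
```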
5 Write the job configuration file traf2traf.json under the /opt/datax/job directory
The source and target configuration options are documented here:
https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md
https://github.com/alibaba/DataX/blob/master/rdbmswriter/doc/rdbmswriter.md
A sample job file:
{
    "job": {
        "setting": {
            "speed": {
                "byte": 1048576
            }
        },
        "content": [
            {
                "reader": {
                    "name": "rdbmsreader",
                    "parameter": {
                        "username": "trafodion",
                        "password": "traf123",
                        "column": ["*"],
                        "connection": [
                            {
                                "table": ["seabase.testdatax"],
                                "jdbcUrl": ["jdbc:t4jdbc://10.10.14.101:23400/:"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "rdbmswriter",
                    "parameter": {
                        "connection": [
                            {
                                "table": ["seabase.testdatax"],
                                "jdbcUrl": "jdbc:t4jdbc://10.10.14.102:23400/:"
                            }
                        ],
                        "column": ["*"],
                        "username": "trafodion",
                        "password": "traf123"
                    }
                }
            }
        ]
    }
}
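When many tables need to be moved, writing each job file by hand gets tedious. As a sketch, the sample above can be generated programmatically; the helper below is not part of DataX, and the URLs and credentials are the placeholders from this article. Note the asymmetry the sample exhibits: the reader's jdbcUrl is a list, while the writer's is a plain string.

```python
import json

def traf2traf_job(src_url, dst_url, table, user, password, byte_speed=1048576):
    """Build a minimal traf2traf-style DataX job dict (illustrative helper)."""
    reader = {"name": "rdbmsreader",
              "parameter": {"username": user, "password": password,
                            "column": ["*"],
                            # reader jdbcUrl is a list
                            "connection": [{"table": [table],
                                            "jdbcUrl": [src_url]}]}}
    writer = {"name": "rdbmswriter",
              "parameter": {"username": user, "password": password,
                            "column": ["*"],
                            # writer jdbcUrl is a plain string
                            "connection": [{"table": [table],
                                            "jdbcUrl": dst_url}]}}
    return {"job": {"setting": {"speed": {"byte": byte_speed}},
                    "content": [{"reader": reader, "writer": writer}]}}

job = traf2traf_job("jdbc:t4jdbc://10.10.14.101:23400/:",
                    "jdbc:t4jdbc://10.10.14.102:23400/:",
                    "seabase.testdatax", "trafodion", "traf123")
# json.dump(job, open("/opt/datax/job/traf2traf.json", "w"), indent=4)
```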
6 Create a test table on the source side and insert test data
SQL>create table testdatax(a int, b varchar(10) character set utf8);
SQL>insert into testdatax values(1,'张三'),(2,'李四'),(3,'王五');
--- 3 row(s) inserted.
SQL>select * from testdatax;
A B
----------- ----------------------------------------
1 张三
2 李四
3 王五
--- 3 row(s) selected.
7 Create an empty test table on the target side
SQL>create table testdatax(a int, b varchar(10) character set utf8);
--- SQL operation complete.
8 Launch DataX to run the data extraction
The command is:
python /opt/datax/bin/datax.py /opt/datax/job/traf2traf.json
The output log is as follows:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2018-09-25 22:53:17.821 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-09-25 22:53:17.830 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.11-b03
jvmInfo: Linux amd64 2.6.32-696.el6.x86_64
cpu num: 4
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2018-09-25 22:53:17.856 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"rdbmsreader",
"parameter":{
"column":[
"*"
],
"connection":[
{
"jdbcUrl":[
"jdbc:t4jdbc://10.10.14.101:23400/:"
],
"table":[
"seabase.testdatax"
]
}
],
"password":"*******",
"username":"trafodion"
}
},
"writer":{
"name":"rdbmswriter",
"parameter":{
"column":[
"*"
],
"connection":[
{
"jdbcUrl":"jdbc:t4jdbc://10.10.14.102:23400/:",
"table":[
"seabase.testdatax"
]
}
],
"password":"*******",
"username":"trafodion"
}
}
}
],
"setting":{
"speed":{
"byte":1048576
}
}
}
2018-09-25 22:53:17.882 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2018-09-25 22:53:17.885 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2018-09-25 22:53:17.885 [main] INFO JobContainer - DataX jobContainer starts job.
2018-09-25 22:53:17.888 [main] INFO JobContainer - Set jobId = 0
2018-09-25 22:53:20.133 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:t4jdbc://10.10.14.101:23400/:.
2018-09-25 22:53:20.135 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置存在一定的风险. 因为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
2018-09-25 22:53:21.092 [job-0] INFO OriginalConfPretreatmentUtil - table:[seabase.testdatax] all columns:[
A,B
].
2018-09-25 22:53:21.092 [job-0] WARN OriginalConfPretreatmentUtil - 您的配置文件中的列配置信息存在风险. 因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
2018-09-25 22:53:21.096 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (A,B) VALUES(?,?)
], which jdbcUrl like:[jdbc:t4jdbc://10.10.14.102:23400/:]
2018-09-25 22:53:21.096 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2018-09-25 22:53:21.098 [job-0] INFO JobContainer - DataX Reader.Job [rdbmsreader] do prepare work .
2018-09-25 22:53:21.098 [job-0] INFO JobContainer - DataX Writer.Job [rdbmswriter] do prepare work .
2018-09-25 22:53:21.100 [job-0] INFO JobContainer - jobContainer starts to do split ...
2018-09-25 22:53:21.102 [job-0] INFO JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
2018-09-25 22:53:21.112 [job-0] INFO JobContainer - DataX Reader.Job [rdbmsreader] splits to [1] tasks.
2018-09-25 22:53:21.114 [job-0] INFO JobContainer - DataX Writer.Job [rdbmswriter] splits to [1] tasks.
2018-09-25 22:53:21.145 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2018-09-25 22:53:21.160 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2018-09-25 22:53:21.163 [job-0] INFO JobContainer - Running by standalone Mode.
2018-09-25 22:53:21.179 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2018-09-25 22:53:21.187 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2018-09-25 22:53:21.188 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2018-09-25 22:53:21.235 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2018-09-25 22:53:21.244 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select * from seabase.testdatax
] jdbcUrl:[jdbc:t4jdbc://10.10.14.101:23400/:].
2018-09-25 22:53:22.508 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select * from seabase.testdatax
] jdbcUrl:[jdbc:t4jdbc://10.10.14.101:23400/:].
2018-09-25 22:53:23.040 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[1808]ms
2018-09-25 22:53:23.040 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2018-09-25 22:53:31.195 [job-0] INFO StandAloneJobContainerCommunicator - Total 3 records, 9 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.240s | Percentage 100.00%
2018-09-25 22:53:31.196 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2018-09-25 22:53:31.199 [job-0] INFO JobContainer - DataX Writer.Job [rdbmswriter] do post work.
2018-09-25 22:53:31.200 [job-0] INFO JobContainer - DataX Reader.Job [rdbmsreader] do post work.
2018-09-25 22:53:31.201 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2018-09-25 22:53:31.203 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2018-09-25 22:53:31.206 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2018-09-25 22:53:31.207 [job-0] INFO JobContainer - PerfTrace not enable!
2018-09-25 22:53:31.208 [job-0] INFO StandAloneJobContainerCommunicator - Total 3 records, 9 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.240s | Percentage 100.00%
2018-09-25 22:53:31.211 [job-0] INFO JobContainer -
任务启动时刻 : 2018-09-25 22:53:17
任务结束时刻 : 2018-09-25 22:53:31
任务总计耗时 : 13s
任务平均流量 : 0B/s
记录写入速度 : 0rec/s
读出记录总数 : 3
读写失败总数 : 0
9 Verify that the data has arrived on the target side
SQL>select * from testdatax;
A B
----------- ----------------------------------------
1 张三
2 李四
3 王五
--- 3 row(s) selected.
10 With the steps above, data is transferred between the two Trafodion databases via INSERT; this can be confirmed in the log, which shows the INSERT INTO statement being used:
2018-09-25 22:53:21.096 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (A,B) VALUES(?,?)
], which jdbcUrl like:[jdbc:t4jdbc://10.10.14.102:23400/:]
However, as we know, Trafodion's UPSERT / UPSERT USING LOAD statements insert data faster than INSERT. If the default INSERT path is too slow, we can modify the source code to replace INSERT with the UPSERT USING LOAD syntax for better performance.
The changes are as follows.
Edit /DataX-master/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java and change
String writeMode = originalConfig.getString(Key.WRITE_MODE, "INSERT");
to:
String writeMode = originalConfig.getString(Key.WRITE_MODE, "UPSERT USING LOAD");
Edit /DataX-master/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/WriterUtil.java and change
boolean isWriteModeLegal = writeMode.trim().toLowerCase().startsWith("insert")
|| writeMode.trim().toLowerCase().startsWith("replace")
|| writeMode.trim().toLowerCase().startsWith("update");
to:
boolean isWriteModeLegal = writeMode.trim().toLowerCase().startsWith("insert")
|| writeMode.trim().toLowerCase().startsWith("replace")
|| writeMode.trim().toLowerCase().startsWith("update")
|| writeMode.trim().toLowerCase().startsWith("upsert using load");
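The patched legality check is just a prefix test on the trimmed, lower-cased mode string; sketched in Python (an illustration of the condition, not DataX code):

```python
# Accepted statement prefixes after the patch above.
LEGAL_PREFIXES = ("insert", "replace", "update", "upsert using load")

def is_write_mode_legal(write_mode: str) -> bool:
    """Mirror of the patched WriterUtil check: trim, lower-case, prefix-match."""
    return write_mode.strip().lower().startswith(LEGAL_PREFIXES)
```

Also note that OriginalConfPretreatmentUtil reads writeMode from the job configuration with INSERT merely as the default, so with the WriterUtil change in place it should also be possible to set "writeMode": "UPSERT USING LOAD" per job in the writer's parameter block rather than changing the compiled-in default.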
After these changes, rebuild /DataX-master/plugin-rdbms-util to produce a new plugin-rdbms-util-0.0.1-SNAPSHOT.jar, and copy it into /opt/datax/plugin/writer/rdbmswriter/libs to replace the original file.
Rerun the job and inspect the log: the statement has changed from the original INSERT INTO to UPSERT USING LOAD INTO.
2018-09-26 00:15:22.041 [job-0] INFO OriginalConfPretreatmentUtil - Write data [
UPSERT USING LOAD INTO %s (A,B) VALUES(?,?)
], which jdbcUrl like:[jdbc:t4jdbc://10.10.14.102:23400/:]