Integrating Trafodion with Alibaba's DataX for Data Transfer Between Clusters

I won't say much more about Trafodion itself here. Because it supports standard JDBC, Trafodion can integrate over JDBC with mainstream ETL tools such as Kettle and Embulk.
In earlier posts we covered using Kettle to extract data from Oracle and MySQL into Trafodion, see https://blog.csdn.net/Post_Yuan/article/details/52804105
and using Embulk to extract data from relational databases into Trafodion, see https://blog.csdn.net/Post_Yuan/article/details/77856320

In this post we look at integrating Trafodion with another ETL tool: DataX, Alibaba's open-source offline synchronization tool. The version used here is 3.0.
DataX is an offline synchronization tool for heterogeneous data sources. It aims to provide stable and efficient data synchronization among relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and other heterogeneous sources. To solve the heterogeneous-source synchronization problem, DataX turns the complex mesh of point-to-point synchronization links into a star topology, with DataX itself acting as the hub that connects every data source. To support a new data source, you only need to connect that source to DataX, and it can then synchronize seamlessly with all of the existing sources.
(Figure: DataX as the hub of a star-shaped synchronization topology connecting heterogeneous data sources)
For a detailed introduction to DataX, see its GitHub documentation:
https://github.com/alibaba/DataX/blob/master/introduction.md
Here we focus on using DataX to transfer data between two different Trafodion clusters; interested readers can also try transferring data between Trafodion and any other supported data source.

1 First, download the DataX package from the download link at https://github.com/alibaba/DataX and extract it to a directory such as /opt.

2 Next, due to a problem in the DataX code, download the DataX source and make the following two small changes, so that the generic rdbms reader and writer actually load the JDBC driver classes declared in their plugin.json:

Edit /DataX-master/rdbmsreader/src/main/java/com/alibaba/datax/plugin/reader/rdbmsreader/SubCommonRdbmsReader.java and change
public static class Job extends CommonRdbmsReader.Job {
        public Job(DataBaseType dataBaseType) {
            super(dataBaseType);
        }
    }
to:
public static class Job extends CommonRdbmsReader.Job {
        public Job(DataBaseType dataBaseType) {
            super(dataBaseType);
            DBUtil.loadDriverClass("reader", "rdbms");
        }
    }

Edit /DataX-master/rdbmswriter/src/main/java/com/alibaba/datax/plugin/reader/rdbmswriter/SubCommonRdbmsWriter.java and change
public static class Job extends CommonRdbmsWriter.Job {
        public Job(DataBaseType dataBaseType) {
            super(dataBaseType);
        }
    }
to:
public static class Job extends CommonRdbmsWriter.Job {
        public Job(DataBaseType dataBaseType) {
            super(dataBaseType);
            DBUtil.loadDriverClass("writer", "rdbms");
        }
    }

After making these changes, rebuild the rdbmsreader and rdbmswriter modules, then copy the build output DataX-master/rdbmsreader/target/datax/plugin/reader/rdbmsreader and DataX-master/rdbmswriter/target/datax/plugin/writer/rdbmswriter over /opt/datax/plugin/reader/rdbmsreader and /opt/datax/plugin/writer/rdbmswriter, respectively.

3 Add Trafodion driver support to rdbmsreader
(1) cd /opt/datax/plugin/reader/rdbmsreader/ and edit plugin.json to add the Trafodion driver class.
(2) Copy the jdbcT4.jar driver into /opt/datax/plugin/reader/rdbmsreader/libs/.

4 Add Trafodion driver support to rdbmswriter
(1) cd /opt/datax/plugin/writer/rdbmswriter/ and edit plugin.json to add the Trafodion driver class.
(2) Copy the jdbcT4.jar driver into /opt/datax/plugin/writer/rdbmswriter/libs/.
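As a reference, a sketch of what the rdbmsreader plugin.json might look like after the edit. The entry to add to the drivers list is the Trafodion T4 driver class, org.trafodion.jdbc.t4.T4Driver (the class shipped in jdbcT4.jar); the other fields and the pre-existing driver entries are illustrative and may differ in your DataX version. The writer's plugin.json gets the same drivers entry.

```json
{
    "name": "rdbmsreader",
    "class": "com.alibaba.datax.plugin.reader.rdbmsreader.RdbmsReader",
    "description": "general RDBMS reader via JDBC",
    "developer": "alibaba",
    "drivers": [
        "dm.jdbc.driver.DmDriver",
        "com.sybase.jdbc3.jdbc.SybDriver",
        "org.trafodion.jdbc.t4.T4Driver"
    ]
}
```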

5 Write the job configuration file traf2traf.json and place it in /opt/datax/job.
The following docs describe how to configure the source and target databases:
https://github.com/alibaba/DataX/blob/master/rdbmsreader/doc/rdbmsreader.md
https://github.com/alibaba/DataX/blob/master/rdbmswriter/doc/rdbmswriter.md
A sample configuration:

{
    "job": {
        "setting": {
            "speed": {
                "byte": 1048576
            }
        },
        "content": [
            {
                "reader": {
                    "name": "rdbmsreader",
                    "parameter": {
                        "username": "trafodion",
                        "password": "traf123",
                        "column": [
                            "*"
                        ],
                        "connection": [
                            {
                                "table": ["seabase.testdatax"],
                                "jdbcUrl": ["jdbc:t4jdbc://10.10.14.101:23400/:"]
                            }
                        ]
                    }
                },
                "writer": {
                    "name": "rdbmswriter",
                    "parameter": {
                        "connection": [
                            {
                                "table": ["seabase.testdatax"],
                                "jdbcUrl": "jdbc:t4jdbc://10.10.14.102:23400/:"
                            }
                        ],
                        "column": [
                            "*"
                        ],
                        "username": "trafodion",
                        "password": "traf123"
                    }
                }
            }
        ]
    }
}

6 Create the test table on the source cluster and insert some test data:

SQL>create table testdatax(a int, b varchar(10) character set utf8);
SQL>insert into testdatax values(1,'张三'),(2,'李四'),(3,'王五');

--- 3 row(s) inserted.
SQL>select * from testdatax;

A           B                                       
----------- ----------------------------------------
          1 张三                                      
          2 李四                                      
          3 王五                                      

--- 3 row(s) selected.

7 Create an empty table of the same structure on the target cluster:

SQL>create table testdatax(a int, b varchar(10) character set utf8);

--- SQL operation complete.

8 Start DataX to run the data transfer job with the following command:

python /opt/datax/bin/datax.py /opt/datax/job/traf2traf.json 

The output log is as follows:

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2018-09-25 22:53:17.821 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2018-09-25 22:53:17.830 [main] INFO  Engine - the machine info  => 

        osInfo: Oracle Corporation 1.8 25.11-b03
        jvmInfo:        Linux amd64 2.6.32-696.el6.x86_64
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size                      
        PS Eden Space                  | 256.00MB                       | 256.00MB                       
        Code Cache                     | 240.00MB                       | 2.44MB                         
        Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
        PS Survivor Space              | 42.50MB                        | 42.50MB                        
        PS Old Gen                     | 683.00MB                       | 683.00MB                       
        Metaspace                      | -0.00MB                        | 0.00MB                         


2018-09-25 22:53:17.856 [main] INFO  Engine - 
{
        "content":[
                {
                        "reader":{
                                "name":"rdbmsreader",
                                "parameter":{
                                        "column":[
                                                "*"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":[
                                                                "jdbc:t4jdbc://10.10.14.101:23400/:"
                                                        ],
                                                        "table":[
                                                                "seabase.testdatax"
                                                        ]
                                                }
                                        ],
                                        "password":"*******",
                                        "username":"trafodion"
                                }
                        },
                        "writer":{
                                "name":"rdbmswriter",
                                "parameter":{
                                        "column":[
                                                "*"
                                        ],
                                        "connection":[
                                                {
                                                        "jdbcUrl":"jdbc:t4jdbc://10.10.14.102:23400/:",
                                                        "table":[
                                                                "seabase.testdatax"
                                                        ]
                                                }
                                        ],
                                        "password":"*******",
                                        "username":"trafodion"
                                }
                        }
                }
        ],
        "setting":{
                "speed":{
                        "byte":1048576
                }
        }
}

2018-09-25 22:53:17.882 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2018-09-25 22:53:17.885 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2018-09-25 22:53:17.885 [main] INFO  JobContainer - DataX jobContainer starts job.
2018-09-25 22:53:17.888 [main] INFO  JobContainer - Set jobId = 0
2018-09-25 22:53:20.133 [job-0] INFO  OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:t4jdbc://10.10.14.101:23400/:.
2018-09-25 22:53:20.135 [job-0] WARN  OriginalConfPretreatmentUtil - 您的配置文件中的列配置存在一定的风险. 因为您未配置读取数据库表的列,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
2018-09-25 22:53:21.092 [job-0] INFO  OriginalConfPretreatmentUtil - table:[seabase.testdatax] all columns:[
A,B
].
2018-09-25 22:53:21.092 [job-0] WARN  OriginalConfPretreatmentUtil - 您的配置文件中的列配置信息存在风险. 因为您配置的写入数据库表的列为*,当您的表字段个数、类型有变动时,可能影响任务正确性甚至会运行出错。请检查您的配置并作出修改.
2018-09-25 22:53:21.096 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (A,B) VALUES(?,?)
], which jdbcUrl like:[jdbc:t4jdbc://10.10.14.102:23400/:]
2018-09-25 22:53:21.096 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2018-09-25 22:53:21.098 [job-0] INFO  JobContainer - DataX Reader.Job [rdbmsreader] do prepare work .
2018-09-25 22:53:21.098 [job-0] INFO  JobContainer - DataX Writer.Job [rdbmswriter] do prepare work .
2018-09-25 22:53:21.100 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2018-09-25 22:53:21.102 [job-0] INFO  JobContainer - Job set Max-Byte-Speed to 1048576 bytes.
2018-09-25 22:53:21.112 [job-0] INFO  JobContainer - DataX Reader.Job [rdbmsreader] splits to [1] tasks.
2018-09-25 22:53:21.114 [job-0] INFO  JobContainer - DataX Writer.Job [rdbmswriter] splits to [1] tasks.
2018-09-25 22:53:21.145 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2018-09-25 22:53:21.160 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2018-09-25 22:53:21.163 [job-0] INFO  JobContainer - Running by standalone Mode.
2018-09-25 22:53:21.179 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2018-09-25 22:53:21.187 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2018-09-25 22:53:21.188 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2018-09-25 22:53:21.235 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2018-09-25 22:53:21.244 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Begin to read record by Sql: [select * from seabase.testdatax 
] jdbcUrl:[jdbc:t4jdbc://10.10.14.101:23400/:].
2018-09-25 22:53:22.508 [0-0-0-reader] INFO  CommonRdbmsReader$Task - Finished read record by Sql: [select * from seabase.testdatax 
] jdbcUrl:[jdbc:t4jdbc://10.10.14.101:23400/:].
2018-09-25 22:53:23.040 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[1808]ms
2018-09-25 22:53:23.040 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2018-09-25 22:53:31.195 [job-0] INFO  StandAloneJobContainerCommunicator - Total 3 records, 9 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.240s | Percentage 100.00%
2018-09-25 22:53:31.196 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2018-09-25 22:53:31.199 [job-0] INFO  JobContainer - DataX Writer.Job [rdbmswriter] do post work.
2018-09-25 22:53:31.200 [job-0] INFO  JobContainer - DataX Reader.Job [rdbmsreader] do post work.
2018-09-25 22:53:31.201 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2018-09-25 22:53:31.203 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/datax/hook
2018-09-25 22:53:31.206 [job-0] INFO  JobContainer - 
         [total cpu info] => 
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
                -1.00%                         | -1.00%                         | -1.00%
                        

         [total gc info] => 
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
                 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
                 PS Scavenge          | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             

2018-09-25 22:53:31.207 [job-0] INFO  JobContainer - PerfTrace not enable!
2018-09-25 22:53:31.208 [job-0] INFO  StandAloneJobContainerCommunicator - Total 3 records, 9 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.000s |  All Task WaitReaderTime 0.240s | Percentage 100.00%
2018-09-25 22:53:31.211 [job-0] INFO  JobContainer - 
任务启动时刻                    : 2018-09-25 22:53:17
任务结束时刻                    : 2018-09-25 22:53:31
任务总计耗时                    :                 13s
任务平均流量                    :                0B/s
记录写入速度                    :              0rec/s
读出记录总数                    :                   3
读写失败总数                    :                   0

9 Check that the data arrived on the target cluster:

SQL>select * from testdatax;

A           B                                       
----------- ----------------------------------------
          1 张三                                      
          2 李四                                      
          3 王五                                      

--- 3 row(s) selected.

10 With the steps above, data is transferred between the two Trafodion databases via INSERT statements, as the log confirms:

2018-09-25 22:53:21.096 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
INSERT INTO %s (A,B) VALUES(?,?)
], which jdbcUrl like:[jdbc:t4jdbc://10.10.14.102:23400/:]

However, we know that Trafodion's UPSERT / UPSERT USING LOAD statements insert data faster than plain INSERT. If the default INSERT path is too slow, we can modify the source code to replace INSERT with UPSERT USING LOAD and gain performance.
The changes are as follows:

Edit /DataX-master/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/OriginalConfPretreatmentUtil.java and change
String writeMode = originalConfig.getString(Key.WRITE_MODE, "INSERT");
to:
String writeMode = originalConfig.getString(Key.WRITE_MODE, "UPSERT USING LOAD");

Edit /DataX-master/plugin-rdbms-util/src/main/java/com/alibaba/datax/plugin/rdbms/writer/util/WriterUtil.java and change
boolean isWriteModeLegal = writeMode.trim().toLowerCase().startsWith("insert")
                || writeMode.trim().toLowerCase().startsWith("replace")
                || writeMode.trim().toLowerCase().startsWith("update");
to:
boolean isWriteModeLegal = writeMode.trim().toLowerCase().startsWith("insert")
                || writeMode.trim().toLowerCase().startsWith("replace")
                || writeMode.trim().toLowerCase().startsWith("update")
                || writeMode.trim().toLowerCase().startsWith("upsert using load");     
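To see how the writeMode value ends up in the generated statement, here is a minimal sketch (not the actual DataX code, just a simplified stand-in for the template construction in WriterUtil): the mode string becomes the statement prefix, followed by the column list and one placeholder per column, and %s is later substituted with the table name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WriteTemplateSketch {

    // Simplified stand-in for DataX's write-template construction:
    // the writeMode string becomes the statement prefix, followed by
    // the column list and one '?' placeholder per column; '%s' is
    // filled in with the table name at execution time.
    static String getWriteTemplate(List<String> columns, String writeMode) {
        String cols = String.join(",", columns);
        String marks = String.join(",", Collections.nCopies(columns.size(), "?"));
        return writeMode + " INTO %s (" + cols + ") VALUES(" + marks + ")";
    }

    public static void main(String[] args) {
        String template = getWriteTemplate(Arrays.asList("A", "B"), "UPSERT USING LOAD");
        // Matches the statement seen in the DataX log after the change.
        System.out.println(template);
        System.out.println(String.format(template, "seabase.testdatax"));
    }
}
```

With the default mode the same sketch yields the INSERT INTO %s (A,B) VALUES(?,?) statement seen in the earlier log, which is why changing only the default writeMode string is enough to switch the generated statement.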

After these changes, rebuild /DataX-master/plugin-rdbms-util to produce a new plugin-rdbms-util-0.0.1-SNAPSHOT.jar, and copy it into /opt/datax/plugin/writer/rdbmswriter/libs, replacing the old jar.
Re-run the job and check the log: the write statement has changed from INSERT INTO to UPSERT USING LOAD INTO.

2018-09-26 00:15:22.041 [job-0] INFO  OriginalConfPretreatmentUtil - Write data [
UPSERT USING LOAD INTO %s (A,B) VALUES(?,?)
], which jdbcUrl like:[jdbc:t4jdbc://10.10.14.102:23400/:]