Customizing the DataX rdbms plugin to support writing directly into Hive tables

Table of Contents

1. Background

Customization approach

2. Customization steps

2.1 Upload the Hive JDBC driver to the lib directories of the rdbmsreader and rdbmswriter plugins

2.2 Modify the plugin.json configuration file

2.3 Clone the source from git and comment out all other modules in the top-level pom.xml, keeping only the rdbmswriter module

2.4 Verify that the unmodified source compiles and packages

2.5 Modify the source

2.5.1 Add a CommonRdbmsWriterOverride.java class (replacing the CommonRdbmsWriter class)

2.5.2 Modify the RdbmsWriter class

2.5.3 Modify the SubCommonRdbmsWriter class

2.5.4 Recompile and package

2.5.5 Extract the jar and replace the corresponding jar on the Linux host

2.6 Test again

    Create-table statement

    DataX json file

2.7 Check the printed logs


1. Background

Testing shows that the official rdbms plugin cannot write directly into Hive tables.

Customization approach

Insert multiple rows with a single INSERT statement, along the lines of:

insertSql: insert into test_databases.test_partition_sink values(?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?)
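Packing a whole batch into one statement avoids the JDBC batch APIs, which the Hive JDBC driver has historically not supported, and amortizes Hive's high per-statement overhead. As a rough sketch of the idea (not the plugin's actual code; the class and method names below are hypothetical), the multi-row template can be generated like this:

import java.util.Collections;

public class MultiRowInsertSketch {

    // Expand a single-row INSERT into a multi-row one. For a 4-column table
    // and a batch of 4 records this returns the insertSql shown above
    // (modulo whitespace).
    static String buildMultiRowInsert(String table, int columnCount, int batchSize) {
        // One "(?,?,...,?)" group per record in the batch.
        String row = "(" + String.join(",", Collections.nCopies(columnCount, "?")) + ")";
        StringBuilder sql = new StringBuilder("insert into ").append(table).append(" values ");
        for (int i = 0; i < batchSize; i++) {
            if (i > 0) {
                sql.append(',');
            }
            sql.append(row);
        }
        return sql.toString();
    }
}

When binding parameters, the JDBC index of record r, column c (both 0-based) is r * columnCount + c + 1.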

2. Customization steps

2.1 Upload the Hive JDBC driver to the lib directories of the rdbmsreader and rdbmswriter plugins

Use either hive-jdbc-3.1.1-standalone.jar or hive-jdbc-uber-2.6.5.0-292.jar, whichever matches your Hive version.

[cdp2dev_deloitte@emr-header-1 libs]$ pwd
/home/cdp2dev_deloitte/tmp/jack/datax/plugin/reader/rdbmsreader/libs
[cdp2dev_deloitte@emr-header-1 libs]$ ls
commons-collections-3.0.jar      db2jcc4.jar                guava-r05.jar                   logback-core-1.0.13.jar
commons-io-2.4.jar               Dm7JdbcDriver16.jar        hamcrest-core-1.3.jar           plugin-rdbms-util-0.0.1-SNAPSHOT.jar
commons-lang3-3.3.2.jar          druid-1.0.15.jar           hive-jdbc-3.1.1-standalone.jar  slf4j-api-1.7.10.jar
commons-math3-3.1.1.jar          edb-jdbc16.jar             jconn3-1.0.0-SNAPSHOT.jar
datax-common-0.0.1-SNAPSHOT.jar  fastjson-1.1.46.sec10.jar  logback-classic-1.0.13.jar


[cdp2dev_deloitte@emr-header-1 libs]$ pwd
/home/cdp2dev_deloitte/tmp/jack/datax/plugin/writer/rdbmswriter/libs
[cdp2dev_deloitte@emr-header-1 libs]$ ls
commons-collections-3.0.jar      db2jcc4.jar                guava-r05.jar                   logback-core-1.0.13.jar
commons-io-2.4.jar               Dm7JdbcDriver16.jar        hamcrest-core-1.3.jar           plugin-rdbms-util-0.0.1-SNAPSHOT.jar
commons-lang3-3.3.2.jar          druid-1.0.15.jar           hive-jdbc-3.1.1-standalone.jar  slf4j-api-1.7.10.jar
commons-math3-3.1.1.jar          edb-jdbc16.jar             jconn3-1.0.0-SNAPSHOT.jar
datax-common-0.0.1-SNAPSHOT.jar  fastjson-1.1.46.sec10.jar  logback-classic-1.0.13.jar
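Before wiring the driver into plugin.json, it may be worth confirming that the uploaded jar can actually reach HiveServer2. A minimal standalone check follows; the host, port, database, and user are placeholders, not values from this article:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSmokeTest {
    public static void main(String[] args) throws Exception {
        // Same driver class that will be registered in plugin.json.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 endpoint; substitute your own host/port/db.
        String url = "jdbc:hive2://emr-header-1:10000/test_databases";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("show tables")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}

Run it with the uploaded hive-jdbc jar on the classpath; if this fails, fix connectivity before touching DataX.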

2.2 Modify the plugin.json configuration file

[cdp2dev_deloitte@emr-header-1 rdbmswriter]$ pwd
/home/cdp2dev_deloitte/tmp/jack/datax/plugin/writer/rdbmswriter
[cdp2dev_deloitte@emr-header-1 rdbmswriter]$ ls
libs  plugin_job_template.json  plugin.json  rdbmswriter-0.0.1-SNAPSHOT.jar

# Add org.apache.hive.jdbc.HiveDriver to the drivers list
[cdp2dev_deloitte@emr-header-1 rdbmswriter]$ vi plugin.json


{
    "name": "rdbmswriter",
    "class": "com.alibaba.datax.plugin.reader.rdbmswriter.RdbmsWriter",
    "description": "useScene: prod. mechanism: Jdbc connection using the database, execute select sql, retrieve data from the ResultSet. warn: The more you know about the database, the less problems you encounter.",
    "developer": "alibaba",
    "drivers":["dm.jdbc.driver.DmDriver", "com.sybase.jdbc3.jdbc.SybDriver", "com.edb.Driver","org.apache.hive.jdbc.HiveDriver"]
}
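If your jobs will also read from Hive through the rdbmsreader plugin, add the same driver class to that plugin's plugin.json as well.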

2.3 Clone the source from git and comment out all other modules in the top-level pom.xml, keeping only the rdbmswriter module
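For reference, the trimmed <modules> section of the top-level pom.xml would look roughly like the following; the exact module list varies across DataX versions, so treat this as a sketch:

<!-- top-level pom.xml: keep the shared modules plus rdbmswriter -->
<modules>
    <module>common</module>
    <module>core</module>
    <module>plugin-rdbms-util</module>
    <module>rdbmswriter</module>
    <!-- all other reader/writer modules commented out, for example: -->
    <!-- <module>mysqlreader</module> -->
    <!-- <module>mysqlwriter</module> -->
</modules>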

2.4 Verify that the unmodified source compiles and packages
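The stock DataX build command can be used for this check:

mvn -U clean package assembly:assembly -Dmaven.test.skip=true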

2.5 Modify the source

2.5.1 Add a CommonRdbmsWriterOverride.java class (replacing the CommonRdbmsWriter class)

package com.alibaba.datax.plugin.reader.rdbmswriter;

import com.alibaba.datax.common.element.Column;
import com.alibaba.datax.common.element.Record;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.plugin.TaskPluginCollector;
import com.alibaba.datax.common.util.Configuration;
import com.alibaba.datax.plugin.rdbms.util.DBUtil;
import com.alibaba.datax.plugin.rdbms.util.DBUtilErrorCode;
import com.alibaba.datax.plugin.rdbms.util.DataBaseType;
import com.alibaba.datax.plugin.rdbms.util.RdbmsException;
import com.alibaba.datax.plugin.rdbms.writer.CommonRdbmsWriter;
import com.alibaba.datax.plugin.rdbms.writer.Constant;
import com.alibaba.datax.plugin.rdbms.writer.util.OriginalConfPretreatmentUtil;
import com.alibaba.datax.plugin.rdbms.writer.util.WriterUtil;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.tuple.Triple;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.*;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class CommonRdbmsWriterOverride {

    public CommonRdbmsWriterOverride() {
    }

    public static class Task {
        protected static final Logger LOG = LoggerFactory.getLogger(CommonRdbmsWriter.Task.class);
        protected DataBaseType dataBaseType;
        private static final String VALUE_HOLDER = "?";
        protected String username;
        protected String password;
        protected String jdbcUrl;
        protected String table;
        protected List<String> columns;
        protected List<String> preSqls;
        protected List<String> postSqls;
        protected int batchSize;
        protected int batchByteSize;
        protected int columnNumber = 0;
        protected TaskPluginCollector taskPluginCollector;
        protected static String BASIC_MESSAGE;
        protected static String INSERT_OR_REPLACE_TEMPLATE;
        protected String writeRecordSql;
        protected String writeMode;
        protected boolean emptyAsNull;
        protected Triple<List<String>, List<Integer>, List<String>> resultSetMetaData;
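        // The two fields below are not present in the stock CommonRdbmsWriter;
        // presumably they feed the Hive partition handling in the insert logic.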
        protected Integer totalIndex;
        protected List<String> partitionColumns;

        public Task(DataBaseType dataBaseType) {
            this.dataBaseType = dataBaseType;
        }

        public void init(Configuration writerSliceConfig) {
            this.username = writerSliceConfig.getString("username");
            this.password = writerSliceConfig.getString("password");
            this.jdbcUrl = writerSliceConfig.getString("jdbcUrl");
            if (this.jdbcUrl.startsWith("||_dsc_ob10_dsc_||") && this.dataBaseType == DataBaseType.MySql) {
                String[] ss = this.jdbcUrl.split("\\|\\|_dsc_ob10_dsc_\\|\\|");
                if (ss.length != 3) {
                    throw DataXException.asDataXException(DBUtilErrorCode.JDBC_OB10_ADDRESS_ERROR, "Malformed OB10 jdbcUrl; please contact askdatax");
                }

                LOG.info("this is ob1_0 jdbc url.");
                this.username = ss[1].trim() + ":" + this.username;
                this.jdbcUrl = ss[2];
                LOG.info("this is ob1_0 jdbc url. user=" + this.username + " :url=" + this.jdbcUrl);
            }

            this.table = writerSliceConfig.getString("table");
            this.columns = writerSliceConfig.getList("column", String.class);
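            // "partitionColumns" is a job-config key introduced by this customization.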
            this.partitionColumns = writerSliceConfig.getList("partitionColumns", String.class);
            this.columnNumber = this.columns.size();
            this.preSqls = writerSliceConfig.getList("preSql", String.class);
            this.postSqls = writerSliceConfig.getList("postSql", String.class);
            this.batchSize = writerSliceConfig.getInt("batchSize", 2048);
            this.batchByteSize = writerSliceConfig.getInt("batchByteSize", 33554432);
            this.writeMode = writerSliceConfig.getString("writeMode", "INSERT");
            this.emptyAsNull = writerSliceConfig.getBool("emptyAsNull", true);
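            // The INSERT template is prepared on the Job side (see OriginalConfPretreatmentUtil);
            // its "%s" is substituted with the table name below.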
            INSERT_OR_REPLACE_TEMPLATE = writerSliceConfig.getString(Constant.INSERT_OR_REPLACE_TEMPLATE_MARK);
            this.writeRecordSql = String.format(INSERT_OR_REPLACE_TEMPLATE, this.table);
            BASIC_MESSAGE = String.format("jdbcUrl:[%s], table:[%s]", this.jdbcUrl, this.table);
        }

        public void prepare(Configuration writerSliceConfig) {
            Connection connection = DBUtil.getConnection(this.dataBaseType, this.jdbcUrl, this.username, this.password);
            DBUtil.dealWithSessionConfig(connection, writerSliceConfig, this.dataBaseType, BASIC_MESSAGE);
            int tableNumber = writerSliceConfig.getInt(Constant.TABLE_NUMBER_MARK);
            if (tableNumber != 1) {
                LOG.info("Begin to execute preSqls:[{}]. context info:{}.", StringUtils.join(this.preSqls, ";"), BASIC_MESSAGE);
                WriterUtil.executeSqls(connection, this.preSqls, BASIC_MESSAGE, this.dataBaseType);
            }

            DBUtil.closeDBResources((ResultSet) null, (Statement) null, connection);
        }

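        // Pull records from the reader, buffer them, and flush one batch whenever
        // batchSize records or batchByteSize bytes have accumulated.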
        public void startWriteWithConnection(RecordReceiver recordReceiver, TaskPluginCollector taskPluginCollector, Connection connection) {
            this.taskPluginCollector = taskPluginCollector;
            this.resultSetMetaData = DBUtil.getColumnMetaData(connection, this.table, StringUtils.join(this.columns, ","));
            this.calcWriteRecordSql();
            List<Record> writeBuffer = new ArrayList<>(this.batchSize);
            int bufferBytes = 0;

            try {
                Record record;
                while ((record = recordReceiver.getFromReader()) != null) {
                    if (record.getColumnNumber() != this.columnNumber) {
                        throw DataXException.asDataXException(DBUtilErrorCode.CONF_ERROR, String.format("Column configuration error: the job reads %s field(s) from the source but the destination table expects %s. Please check your configuration.", record.getColumnNumber(), this.columnNumber));
                    }

                    writeBuffer.add(record);
                    bufferBytes += record.getMemorySize();
                    if (writeBuffer.size() >= this.batchSize || bufferBytes >= this.batchByteSize) {
                        this.doBatchInsert(connection, writeBuffer);
                        writeBuffer.clear();
                        bufferBytes = 0;
                    }
                }

                if (!writeBuffer.isEmpty()) {
                    this.doBatchInsert(connection, writeBuffer);
                    writeBuffer.clear();
                    bufferBytes = 0;
                }
            } catch (Exception e) {
                throw DataXException.asDataXException(DBUtilErrorCode.WRITE_DATA_ERROR, e);
            } finally {
                writeBuffer.clear();
                DBUtil.closeDBResources((ResultSet) null, (Statement) null, connection);
            }
        }

        // NOTE: the original post is truncated here; calcWriteRecordSql() and the
        // multi-row doBatchInsert() that this class exists to provide are cut off.
    }
}