Multi-Stream Join Solution Based on a Data Lake – Hudi Hands-On

Contents

1. Recap

2. Code Demo

(1) The multi-writer problem

(2) How to handle two streams writing to the same table?

(3) Test results

3. Afterword


1. Recap

The goal is to join two real-time streams (e.g. front-end tracking events + server-side tracking events, or a log stream + an order stream) on top of a data lake.

For the underlying concepts, see the previous article: 基于数据湖的多流拼接方案-HUDI概念篇_Leonardo_KY的博客-CSDN博客

2. Code Demo

All demos below use the datagen connector to generate mock data for testing; for production, simply switch the source to Kafka or another connector, for example as sketched below.
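
A hedged sketch of what a Kafka-backed sourceA might look like; the topic, broker address, group id and format below are placeholders, not values from this project:

        // Hypothetical Kafka source with the same schema as the datagen sourceA below;
        // it reuses the tableEnv created in the job code and replaces the datagen DDL.
        // (PRIMARY KEY is omitted: the plain kafka connector does not declare keys.)
        tableEnv.executeSql("CREATE TABLE sourceA (\n" +
                " uuid bigint,\n" +
                " `name` VARCHAR(3),\n" +
                " _ts1 TIMESTAMP(3)\n" +
                ") WITH (\n" +
                " 'connector' = 'kafka',\n" +
                " 'topic' = 'streamA_topic',\n" +
                " 'properties.bootstrap.servers' = 'broker1:9092',\n" +
                " 'properties.group.id' = 'hudi-demo',\n" +
                " 'scan.startup.mode' = 'latest-offset',\n" +
                " 'format' = 'json'\n" +
                ")");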

First job: stream A, written to the Hudi table:

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpointing drives the Hudi commits; minPauseBetweenCP, checkpointTimeout and
        // maxConcurrentCheckpoints are parameters of this demo job.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(minPauseBetweenCP);  // 1s
        env.getCheckpointConfig().setCheckpointTimeout(checkpointTimeout);   // 2 min
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(maxConcurrentCheckpoints);

        // env.getCheckpointConfig().setCheckpointStorage("file:///D:/Users/yakang.lu/tmp/checkpoints/");

        TableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // datagen====================================================================
        tableEnv.executeSql("CREATE TABLE sourceA (\n" +
                " uuid bigint PRIMARY KEY NOT ENFORCED,\n" +
                " `name` VARCHAR(3)," +
                " _ts1 TIMESTAMP(3)\n" +
                ") WITH (\n" +
                " 'connector' = 'datagen', \n" +
                " 'fields.uuid.kind'='sequence',\n" +
                " 'fields.uuid.start'='0', \n" +
                " 'fields.uuid.end'='1000000', \n" +
                " 'rows-per-second' = '1' \n" +
                ")");

        // hudi====================================================================
        tableEnv.executeSql("create table hudi_tableA(\n"
                + " uuid bigint PRIMARY KEY NOT ENFORCED,\n"
                + " age int,\n"
                + " name VARCHAR(3),\n"
                + " _ts1 TIMESTAMP(3),\n"
                + " _ts2 TIMESTAMP(3),\n"
                + " d VARCHAR(10)\n"
                + ")\n"
                + " PARTITIONED BY (d)\n"
                + " with (\n"
                + " 'connector' = 'hudi',\n"
                + " 'path' = 'hdfs://ns/user/hive/warehouse/ctripdi_prodb.db/hudi_mor_mutil_source_test', \n"   // hdfs path
                + " 'table.type' = 'MERGE_ON_READ',\n"
                + " 'write.bucket_assign.tasks' = '10',\n"
                + " 'write.tasks' = '10',\n"
                + " 'write.partition.format' = 'yyyyMMddHH',\n"
                + " 'write.partition.timestamp.type' = 'EPOCHMILLISECONDS',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                + " 'changelog.enabled' = 'true',\n"
                + " 'index.type' = 'BUCKET',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                 + String.format(" '%s' = '%s',\n", FlinkOptions.PRECOMBINE_FIELD.key(), "_ts1")
                 + " 'write.payload.class' = '" + PartialUpdateAvroPayload.class.getName() + "',\n"
                + " 'hoodie.write.log.suffix' = 'job1',\n"
                + " 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',\n"
                + " 'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',\n"
                + " 'hoodie.cleaner.policy.failed.writes' = 'LAZY',\n"
                + " 'hoodie.cleaner.policy' = 'KEEP_LATEST_BY_HOURS',\n"
                + " 'hoodie.consistency.check.enabled' = 'false',\n"
                // + " 'hoodie.write.lock.early.conflict.detection.enable' = 'true',\n"   // todo
                // + " 'hoodie.write.lock.early.conflict.detection.strategy' = '"
                // + SimpleTransactionDirectMarkerBasedEarlyConflictDetectionStrategy.class.getName() + "',\n"  //
                + " 'hoodie.keep.min.commits' = '1440',\n"
                + " 'hoodie.keep.max.commits' = '2880',\n"
                + " 'compaction.schedule.enabled'='false',\n"
                + " 'compaction.async.enabled'='false',\n"
                + " 'compaction.trigger.strategy'='num_or_time',\n"
                + " 'compaction.delta_commits' ='3',\n"
                + " 'compaction.delta_seconds' ='60',\n"
                + " 'compaction.max_memory' = '3096',\n"
                + " 'clean.async.enabled' ='false',\n"
                + " 'hive_sync.enable' = 'false'\n"
                // + " 'hive_sync.mode' = 'hms',\n"
                // + " 'hive_sync.db' = '%s',\n"
                // + " 'hive_sync.table' = '%s',\n"
                // + " 'hive_sync.metastore.uris' = '%s'\n"
                + ")");

        // sql====================================================================
        StatementSet statementSet = tableEnv.createStatementSet();

        String sqlString = "insert into hudi_tableA(uuid, name, _ts1, d) select * from " +
                "(select *,date_format(CURRENT_TIMESTAMP,'yyyyMMdd') AS d from sourceA) view1";
        statementSet.addInsertSql(sqlString);
        statementSet.execute();

Second job: stream B, written to the Hudi table:

        StreamExecutionEnvironment env = manager.getEnv();  // env comes from this job's own wrapper class
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(minPauseBetweenCP);  // 1s
        env.getCheckpointConfig().setCheckpointTimeout(checkpointTimeout);   // 2 min
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(maxConcurrentCheckpoints);

        // env.getCheckpointConfig().setCheckpointStorage("file:///D:/Users/yakang.lu/tmp/checkpoints/");

        TableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // datagen====================================================================
        tableEnv.executeSql("CREATE TABLE sourceB (\n" +
                " uuid bigint PRIMARY KEY NOT ENFORCED,\n" +
                " `age` int," +
                " _ts2 TIMESTAMP(3)\n" +
                ") WITH (\n" +
                " 'connector' = 'datagen', \n" +
                " 'fields.uuid.kind'='sequence',\n" +
                " 'fields.uuid.start'='0', \n" +
                " 'fields.uuid.end'='1000000', \n" +
                " 'rows-per-second' = '1' \n" +
                ")");

        // hudi====================================================================
        tableEnv.executeSql("create table hudi_tableB(\n"
                + " uuid bigint PRIMARY KEY NOT ENFORCED,\n"
                + " age int,\n"
                + " name VARCHAR(3),\n"
                + " _ts1 TIMESTAMP(3),\n"
                + " _ts2 TIMESTAMP(3),\n"
                + " d VARCHAR(10)\n"
                + ")\n"
                + " PARTITIONED BY (d)\n"
                + " with (\n"
                + " 'connector' = 'hudi',\n"
                + " 'path' = 'hdfs://ns/user/hive/warehouse/ctripdi_prodb.db/hudi_mor_mutil_source_test', \n"   // hdfs path
                + " 'table.type' = 'MERGE_ON_READ',\n"
                + " 'write.bucket_assign.tasks' = '10',\n"
                + " 'write.tasks' = '10',\n"
                + " 'write.partition.format' = 'yyyyMMddHH',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                + " 'changelog.enabled' = 'true',\n"
                + " 'index.type' = 'BUCKET',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                + String.format(" '%s' = '%s',\n", FlinkOptions.PRECOMBINE_FIELD.key(), "_ts1")
                + " 'write.payload.class' = '" + PartialUpdateAvroPayload.class.getName() + "',\n"
                + " 'hoodie.write.log.suffix' = 'job2',\n"
                + " 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',\n"
                + " 'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',\n"
                + " 'hoodie.cleaner.policy.failed.writes' = 'LAZY',\n"
                + " 'hoodie.cleaner.policy' = 'KEEP_LATEST_BY_HOURS',\n"
                + " 'hoodie.consistency.check.enabled' = 'false',\n"
                // + " 'hoodie.write.lock.early.conflict.detection.enable' = 'true',\n"   // todo
                // + " 'hoodie.write.lock.early.conflict.detection.strategy' = '"
                // + SimpleTransactionDirectMarkerBasedEarlyConflictDetectionStrategy.class.getName() + "',\n"
                + " 'hoodie.keep.min.commits' = '1440',\n"
                + " 'hoodie.keep.max.commits' = '2880',\n"
                + " 'compaction.schedule.enabled'='true',\n"
                + " 'compaction.async.enabled'='true',\n"
                + " 'compaction.trigger.strategy'='num_or_time',\n"
                + " 'compaction.delta_commits' ='2',\n"
                + " 'compaction.delta_seconds' ='90',\n"
                + " 'compaction.max_memory' = '3096',\n"
                + " 'clean.async.enabled' ='false'\n"
                // + " 'hive_sync.mode' = 'hms',\n"
                // + " 'hive_sync.db' = '%s',\n"
                // + " 'hive_sync.table' = '%s',\n"
                // + " 'hive_sync.metastore.uris' = '%s'\n"
                + ")");

        // sql====================================================================
        StatementSet statementSet = tableEnv.createStatementSet();
        // Columns are mapped positionally, so _ts2 also fills _ts1; the precombine field must not be null.
        String sqlString = "insert into hudi_tableB(uuid, age, _ts1, _ts2, d) select * from " +
                "(select *, _ts2 as ts1, date_format(CURRENT_TIMESTAMP,'yyyyMMdd') AS d from sourceB) view2";
        // statementSet.addInsertSql("insert into hudi_tableB(uuid, age, _ts2) select * from sourceB");
        statementSet.addInsertSql(sqlString);
        statementSet.execute();
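
To spot-check that rows really end up merged (columns from both streams present on the same uuid), the table can be read back. A hedged sketch that registers the same path with a minimal set of read options:

        // Hedged sketch: register the same Hudi path for reading and verify that rows carry
        // both stream A's columns (name, _ts1) and stream B's columns (age, _ts2).
        tableEnv.executeSql("create table hudi_read(\n"
                + " uuid bigint,\n"
                + " age int,\n"
                + " name VARCHAR(3),\n"
                + " _ts1 TIMESTAMP(3),\n"
                + " _ts2 TIMESTAMP(3),\n"
                + " d VARCHAR(10)\n"
                + ") PARTITIONED BY (d)\n"
                + " with (\n"
                + " 'connector' = 'hudi',\n"
                + " 'path' = 'hdfs://ns/user/hive/warehouse/ctripdi_prodb.db/hudi_mor_mutil_source_test',\n"
                + " 'table.type' = 'MERGE_ON_READ'\n"
                + ")");
        tableEnv.executeSql("select * from hudi_read").print();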

Both writers can also be placed in a single application (using a StatementSet):

import java.time.ZoneOffset;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableConfig;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.api.config.TableConfigOptions;
import org.apache.hudi.common.model.PartialUpdateAvroPayload;
import org.apache.hudi.configuration.FlinkOptions;
// import org.apache.hudi.table.marker.SimpleTransactionDirectMarkerBasedEarlyConflictDetectionStrategy;

public class Test00 {
    public static void main(String[] args) {
        Configuration configuration = TableConfig.getDefault().getConfiguration();
        configuration.setString(TableConfigOptions.LOCAL_TIME_ZONE, ZoneOffset.ofHours(8).toString()); // set local time zone to UTC+8
        // configuration.setInteger("rest.port", 8086);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(
                configuration);
        env.setParallelism(1);
        env.enableCheckpointing(12000L);
        // env.getCheckpointConfig().setCheckpointStorage("file:///Users/laifei/tmp/checkpoints/");
        TableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // datagen====================================================================
        tableEnv.executeSql("CREATE TABLE sourceA (\n" +
                " uuid bigint PRIMARY KEY NOT ENFORCED,\n" +
                " `name` VARCHAR(3)," +
                " _ts1 TIMESTAMP(3)\n" +
                ") WITH (\n" +
                " 'connector' = 'datagen', \n" +
                " 'fields.uuid.kind'='sequence',\n" +
                " 'fields.uuid.start'='0', \n" +
                " 'fields.uuid.end'='1000000', \n" +
                " 'rows-per-second' = '1' \n" +
                ")");
        tableEnv.executeSql("CREATE TABLE sourceB (\n" +
                " uuid bigint PRIMARY KEY NOT ENFORCED,\n" +
                " `age` int," +
                " _ts2 TIMESTAMP(3)\n" +
                ") WITH (\n" +
                " 'connector' = 'datagen', \n" +
                " 'fields.uuid.kind'='sequence',\n" +
                " 'fields.uuid.start'='0', \n" +
                " 'fields.uuid.end'='1000000', \n" +
                " 'rows-per-second' = '1' \n" +
                ")");

        // hudi====================================================================
        tableEnv.executeSql("create table hudi_tableA(\n"
                + " uuid bigint PRIMARY KEY NOT ENFORCED,\n"
                + " name VARCHAR(3),\n"
                + " age int,\n"
                + " _ts1 TIMESTAMP(3),\n"
                + " _ts2 TIMESTAMP(3)\n"
                + ")\n"
                + " PARTITIONED BY (_ts1)\n"
                + " with (\n"
                + " 'connector' = 'hudi',\n"
                + " 'path' = 'file:\\D:\\Ctrip\\dataWork\\tmp', \n"   // hdfs path
                + " 'table.type' = 'MERGE_ON_READ',\n"
                + " 'write.bucket_assign.tasks' = '2',\n"
                + " 'write.tasks' = '2',\n"
                + " 'write.partition.format' = 'yyyyMMddHH',\n"
                + " 'write.partition.timestamp.type' = 'EPOCHMILLISECONDS',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                + " 'changelog.enabled' = 'true',\n"
                + " 'index.type' = 'BUCKET',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                // + String.format(" '%s' = '%s',\n", FlinkOptions.PRECOMBINE_FIELD.key(), "_ts1:name|_ts2:age")
                // + " 'write.payload.class' = '" + PartialUpdateAvroPayload.class.getName() + "',\n"
                + " 'hoodie.write.log.suffix' = 'job1',\n"
                + " 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',\n"
                + " 'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',\n"
                + " 'hoodie.cleaner.policy.failed.writes' = 'LAZY',\n"
                + " 'hoodie.cleaner.policy' = 'KEEP_LATEST_BY_HOURS',\n"
                + " 'hoodie.consistency.check.enabled' = 'false',\n"
                + " 'hoodie.write.lock.early.conflict.detection.enable' = 'true',\n"
                + " 'hoodie.write.lock.early.conflict.detection.strategy' = '"
                // + SimpleTransactionDirectMarkerBasedEarlyConflictDetectionStrategy.class.getName() + "',\n"  //
                + " 'hoodie.keep.min.commits' = '1440',\n"
                + " 'hoodie.keep.max.commits' = '2880',\n"
                + " 'compaction.schedule.enabled'='false',\n"
                + " 'compaction.async.enabled'='false',\n"
                + " 'compaction.trigger.strategy'='num_or_time',\n"
                + " 'compaction.delta_commits' ='3',\n"
                + " 'compaction.delta_seconds' ='60',\n"
                + " 'compaction.max_memory' = '3096',\n"
                + " 'clean.async.enabled' ='false',\n"
                + " 'hive_sync.enable' = 'false'\n"
                // + " 'hive_sync.mode' = 'hms',\n"
                // + " 'hive_sync.db' = '%s',\n"
                // + " 'hive_sync.table' = '%s',\n"
                // + " 'hive_sync.metastore.uris' = '%s'\n"
                + ")");

        tableEnv.executeSql("create table hudi_tableB(\n"
                + " uuid bigint PRIMARY KEY NOT ENFORCED,\n"
                + " name VARCHAR(3),\n"
                + " age int,\n"
                + " _ts1 TIMESTAMP(3),\n"
                + " _ts2 TIMESTAMP(3)\n"
                + ")\n"
                + " PARTITIONED BY (_ts2)\n"
                + " with (\n"
                + " 'connector' = 'hudi',\n"
                + " 'path' = '/Users/laifei/tmp/hudi/local.db/mutiwrite1', \n"   // hdfs path
                + " 'table.type' = 'MERGE_ON_READ',\n"
                + " 'write.bucket_assign.tasks' = '2',\n"
                + " 'write.tasks' = '2',\n"
                + " 'write.partition.format' = 'yyyyMMddHH',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                + " 'changelog.enabled' = 'true',\n"
                + " 'index.type' = 'BUCKET',\n"
                + " 'hoodie.bucket.index.num.buckets' = '2',\n"
                // + String.format(" '%s' = '%s',\n", FlinkOptions.PRECOMBINE_FIELD.key(), "_ts1:name|_ts2:age")
                // + " 'write.payload.class' = '" + PartialUpdateAvroPayload.class.getName() + "',\n"
                + " 'hoodie.write.log.suffix' = 'job2',\n"
                + " 'hoodie.write.concurrency.mode' = 'optimistic_concurrency_control',\n"
                + " 'hoodie.write.lock.provider' = 'org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider',\n"
                + " 'hoodie.cleaner.policy.failed.writes' = 'LAZY',\n"
                + " 'hoodie.cleaner.policy' = 'KEEP_LATEST_BY_HOURS',\n"
                + " 'hoodie.consistency.check.enabled' = 'false',\n"
                + " 'hoodie.write.lock.early.conflict.detection.enable' = 'true',\n"
                + " 'hoodie.write.lock.early.conflict.detection.strategy' = '"
                // + SimpleTransactionDirectMarkerBasedEarlyConflictDetectionStrategy.class.getName() + "',\n"
                + " 'hoodie.keep.min.commits' = '1440',\n"
                + " 'hoodie.keep.max.commits' = '2880',\n"
                + " 'compaction.schedule.enabled'='true',\n"
                + " 'compaction.async.enabled'='true',\n"
                + " 'compaction.trigger.strategy'='num_or_time',\n"
                + " 'compaction.delta_commits' ='2',\n"
                + " 'compaction.delta_seconds' ='90',\n"
                + " 'compaction.max_memory' = '3096',\n"
                + " 'clean.async.enabled' ='false'\n"
                // + " 'hive_sync.mode' = 'hms',\n"
                // + " 'hive_sync.db' = '%s',\n"
                // + " 'hive_sync.table' = '%s',\n"
                // + " 'hive_sync.metastore.uris' = '%s'\n"
                + ")");

        // sql====================================================================
        StatementSet statementSet = tableEnv.createStatementSet();
        statementSet.addInsertSql("insert into hudi_tableA(uuid, name, _ts1) select * from sourceA");
        statementSet.addInsertSql("insert into hudi_tableB(uuid, age, _ts2) select * from sourceB");
        statementSet.execute();

    }
}

(1) The multi-writer problem

Jars built from the official Hudi code do not support the kind of "multi-writer" setup used here, so the tests below were run with a jar built from Tencent's modified code.

With the official jar, multiple writers writing into the same Hudi table fail with a write-conflict exception.

Moreover:

Hudi has a preCombine field, and only one column can be designated as the preCombine field when the table is created. With the official version, writing the same Hudi table from two streams leads to one of two situations:

1. One stream writes the preCombine field and the other does not; the latter fails because the ordering value must not be null.

2. Both streams write that field, and a field-conflict exception is thrown.

(2) How to handle two streams writing to the same table?

Based on local testing:

The Hudi 0.12 multiWrite build (Tencent's modified version) supports multiple preCombine fields; with this build, multi-write works as long as the columns other than the primary key and the partition field do not conflict across the streams (see the configuration sketch after this list).

Hudi 0.13 does not support this and exhibits the problems described above.
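
For reference, a hedged sketch of the two table options involved, based on the commented-out lines in the combined demo above; the _ts1:name|_ts2:age mapping is specific to the multiWrite branch and is not available in official releases:

        // Hedged sketch -- valid only on the Tencent multiWrite branch (hudi 0.12-multiWrite).
        // Each ordering field guards the columns its stream owns: _ts1 orders name, _ts2 orders age.
        // These are the options the commented-out lines in the combined demo would produce;
        // splice them into each job's CREATE TABLE ... WITH (...) clause.
        String multiPrecombineOptions =
                " 'precombine.field' = '_ts1:name|_ts2:age',\n"
              + " 'write.payload.class' = 'org.apache.hudi.common.model.PartialUpdateAvroPayload',\n";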

(3) Test results


Tencent article: https://cloud.tencent.com/developer/article/2189914

GitHub: GitHub - XuQianJin-Stars/hudi at multiwrite-master-7

Building Hudi from source is quite cumbersome; if needed, I will upload the pre-built jar later.

3. Afterword

With the code above, there appears to be some data loss when traffic is relatively high (when the writer on one stream runs compaction, the other stream loses some data).

Alternatives worth trying:

(1) UNION the two streams into a single stream first and then sink it into the Hudi table, which also avoids write conflicts (see the sketch after this list);

(2) Use another data-lake tool such as Apache Paimon; see: 新一代数据湖存储技术Apache Paimon入门Demo_Leonardo_KY的博客-CSDN博客
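
A hedged sketch of option (1), assuming the sourceA/sourceB and hudi_tableA definitions from the demos above; each stream is padded to the full schema with NULLs so that a single writer covers both:

        // Hedged sketch for option (1): one Hudi writer, two streams unioned beforehand.
        // Assumes sourceA, sourceB and hudi_tableA are registered as in the demos above.
        // For rows from sourceB, _ts2 also fills _ts1 so the precombine field is never null.
        String unionInsert =
                "insert into hudi_tableA(uuid, name, age, _ts1, _ts2, d)\n"
              + "select uuid, name, cast(null as int) as age, _ts1, cast(null as timestamp(3)) as _ts2,\n"
              + "       date_format(CURRENT_TIMESTAMP, 'yyyyMMdd') as d from sourceA\n"
              + "union all\n"
              + "select uuid, cast(null as varchar(3)) as name, age, _ts2 as _ts1, _ts2,\n"
              + "       date_format(CURRENT_TIMESTAMP, 'yyyyMMdd') as d from sourceB";
        tableEnv.createStatementSet().addInsertSql(unionInsert).execute();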
