Ingesting Multiple Kafka Topics into One Hudi Table

Background

When ingesting data into the lake with Hudi's built-in HoodieDeltaStreamer tool, we found it only supports reading from a single Kafka topic. In our production scenario we need to consume multiple topics, so we extended Hudi's ingestion path accordingly.

Environment

Hudi version: 0.12.0

Modified classes: org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer and org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen

Main changes

Extend the topic-name configuration so that multiple topics can be listed, separated by commas; an example is shown below.
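For instance, assuming the stock hoodie.deltastreamer.source.kafka.topic key is reused to carry the comma-separated list (the key name comes from the unmodified KafkaOffsetGen; only the multi-value interpretation is new):

# kafka source properties passed to HoodieDeltaStreamer via --props
hoodie.deltastreamer.source.kafka.topic=topic1,topic2,topic3
bootstrap.servers=kafka-broker1:9092,kafka-broker2:9092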

KafkaOffsetGen initialization

Add a topicNameArr field, populated by splitting the configured topic string on commas:

topicNameArr = topicName.split(",");
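In context, a minimal sketch of where the split happens (everything except topicNameArr mirrors the stock KafkaOffsetGen; the topicName field and the config lookup shown here are assumptions):

import org.apache.hudi.common.config.TypedProperties;

public class KafkaOffsetGen {
  private final String topicName;      // raw comma-separated config value
  private final String[] topicNameArr; // one entry per topic

  public KafkaOffsetGen(TypedProperties props) {
    this.topicName = props.getString("hoodie.deltastreamer.source.kafka.topic");
    // Split on commas so partition discovery and offset resolution
    // can iterate over every configured topic.
    this.topicNameArr = topicName.split(",");
  }
}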

The strToOffsets and offsetsToStr methods

These two methods convert the checkpoint between its string form and a map of per-partition offsets. Multi-topic support is added by writing one segment per topic and separating the segments with a vertical bar (|).

/**
 * Reconstruct checkpoint from timeline.
 */
public static Map<TopicPartition, Long> strToOffsets(String checkpointStr) {
  Map<TopicPartition, Long> offsetMap = new HashMap<>();
  // One segment per topic, separated by "|":
  // topic1,0:offset0,1:offset1|topic2,0:offset0,...
  String[] topics = checkpointStr.split("\\|");
  for (String s : topics) {
    String[] splits = s.split(",");
    String topic = splits[0];
    // The remaining entries of a segment are "partition:offset" pairs.
    for (int i = 1; i < splits.length; i++) {
      String[] subSplits = splits[i].split(":");
      offsetMap.put(new TopicPartition(topic, Integer.parseInt(subSplits[0])), Long.parseLong(subSplits[1]));
    }
  }
  return offsetMap;
}

/**
 * String representation of checkpoint
 * <p>
 * Format: topic1,0:offset0,1:offset1,2:offset2|topic2,0:offset0,1:offset1,2:offset2, .....
 */
public static String offsetsToStr(OffsetRange[] ranges) {
  StringBuilder sb = new StringBuilder();
  // At least 1 partition will be present. Group the ranges by topic,
  // collecting "partition:untilOffset" entries per topic.
  Map<String, List<String>> result = new HashMap<>();
  for (OffsetRange r : ranges) {
    List<String> tmp = result.getOrDefault(r.topic(), new ArrayList<>());
    tmp.add(String.format("%s:%d", r.partition(), r.untilOffset()));
    result.put(r.topic(), tmp);
  }
  // Emit one "topic,partition:offset,..." segment per topic, joined by "|".
  result.forEach((k, v) -> sb.append(k).append(',').append(String.join(",", v)).append('|'));
  sb.deleteCharAt(sb.length() - 1);
  return sb.toString();
}
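A quick round trip through the two methods shows the checkpoint format (CheckpointUtils is the nested class of the modified KafkaOffsetGen; the topics and offsets here are made up):

import java.util.Map;

import org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen.CheckpointUtils;
import org.apache.kafka.common.TopicPartition;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class CheckpointFormatDemo {
  public static void main(String[] args) {
    OffsetRange[] ranges = new OffsetRange[] {
        OffsetRange.create("topic1", 0, 0L, 100L),
        OffsetRange.create("topic1", 1, 0L, 200L),
        OffsetRange.create("topic2", 0, 0L, 50L)
    };
    // Prints e.g. topic1,0:100,1:200|topic2,0:50
    // (segment order follows HashMap iteration and may vary).
    String checkpoint = CheckpointUtils.offsetsToStr(ranges);
    System.out.println(checkpoint);

    // Parsing it back recovers the per-partition offsets.
    Map<TopicPartition, Long> offsets = CheckpointUtils.strToOffsets(checkpoint);
    System.out.println(offsets.get(new TopicPartition("topic2", 0))); // 50
  }
}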

The computeOffsetRanges method

This method computes the offset ranges to read from Kafka, handling newly added partitions, offset bounds, and event limits. With multiple topics, tracking exhausted partitions by partition number alone no longer works, because two topics can each own a partition 0; the topic name has to be part of the key.

...
// Key exhausted partitions by "topic-partition" rather than the bare
// partition number, so same-numbered partitions of different topics
// do not collide.
if (!exhaustedPartitions.contains(range.topic() + "-" + range.partition())) {
  long toOffsetMax = toOffsetMap.get(range.topicPartition());
  long toOffset = Math.min(toOffsetMax, range.untilOffset() + eventsPerPartition);
  if (toOffset == toOffsetMax) {
    exhaustedPartitions.add(range.topic() + "-" + range.partition());
  }
...
}
...
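To see why the topic must be part of the key, consider two topics that each have a partition 0 (a contrived sketch, not part of the patch):

import java.util.HashSet;
import java.util.Set;

public class ExhaustedKeyDemo {
  public static void main(String[] args) {
    // Keyed by partition number only: topicA-0 and topicB-0 collapse
    // into the same entry, so exhausting one wrongly skips the other.
    Set<Integer> byPartition = new HashSet<>();
    byPartition.add(0);                          // topicA partition 0 exhausted
    System.out.println(byPartition.contains(0)); // true -> topicB-0 also skipped

    // Keyed by "topic-partition": the two partitions stay distinct.
    Set<String> byTopicPartition = new HashSet<>();
    byTopicPartition.add("topicA-0");
    System.out.println(byTopicPartition.contains("topicB-0")); // false -> still read
  }
}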

The getNextOffsetRanges method

When resolving offsets, the first run after a topic is added must detect that the topic is new, i.e. it has no entry in the checkpoint, and fetch its latest offsets as the starting point.

Map<TopicPartition, Long> checkpointOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
fromOffsets = fetchValidOffsets(consumer, checkpointOffsets, topicPartitions);
// Collect partitions that are subscribed now but missing from the
// checkpoint, i.e. partitions of topics added after it was written.
Set<TopicPartition> newTopicPartition = new HashSet<>();
for (TopicPartition tp : topicPartitions) {
  if (!checkpointOffsets.containsKey(tp)) {
    newTopicPartition.add(tp);
  }
}
if (!newTopicPartition.isEmpty()) {
  LOG.info("add new topic or partitions, old fromOffsets:" + fromOffsets);
  // Start newly added topics from their latest offsets.
  fromOffsets.putAll(consumer.endOffsets(newTopicPartition));
  LOG.info("add new topic or partitions, new fromOffsets:" + fromOffsets);
}
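A self-contained sketch of the merge above, with plain maps standing in for the live KafkaConsumer (topics and offsets are illustrative):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;

public class NewTopicMergeDemo {
  public static void main(String[] args) {
    // Offsets recovered from the last checkpoint: only topic1 existed then.
    Map<TopicPartition, Long> checkpointOffsets = new HashMap<>();
    checkpointOffsets.put(new TopicPartition("topic1", 0), 100L);

    // Partitions currently subscribed: topic2 was added after the checkpoint.
    Set<TopicPartition> topicPartitions = new HashSet<>();
    topicPartitions.add(new TopicPartition("topic1", 0));
    topicPartitions.add(new TopicPartition("topic2", 0));

    // Stand-in for consumer.endOffsets(newTopicPartition) in the real code.
    Map<TopicPartition, Long> endOffsets = new HashMap<>();
    endOffsets.put(new TopicPartition("topic2", 0), 0L);

    Set<TopicPartition> newTopicPartition = new HashSet<>();
    for (TopicPartition tp : topicPartitions) {
      if (!checkpointOffsets.containsKey(tp)) {
        newTopicPartition.add(tp);
      }
    }

    Map<TopicPartition, Long> fromOffsets = new HashMap<>(checkpointOffsets);
    if (!newTopicPartition.isEmpty()) {
      fromOffsets.putAll(endOffsets);
    }
    System.out.println(fromOffsets); // {topic1-0=100, topic2-0=0} (order may vary)
  }
}

Note the design choice: a new topic starts from its end offsets, so records already sitting in it are skipped; if a backfill were required, consumer.beginningOffsets would have to be used for the new partitions instead.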

Full code

/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.utilities.sources.helpers;

import org.apache.hudi.DataSourceUtils;
import org.apache.hudi.common.config.ConfigProperty;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.StringUtils;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieNotSupportedException;
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics;
import org.apache.hudi.utilities.exception.HoodieDeltaStreamerException;
import org.apache.hudi.utilities.sources.AvroKafkaSource;

import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
...