Background
When ingesting data into the lake with Hudi's native HoodieDeltaStreamer tool, we found that it only supports consuming a single topic. Our production scenarios require consuming data from multiple topics, so we extended Hudi's ingestion path accordingly.
Environment
Hudi version: 0.12.0
Modified classes: org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer and org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen (where the changes described below live)
Main changes
Extend the topic-name configuration so that multiple topics can be specified, separated by commas; a configuration sketch follows.
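A minimal sketch of such a configuration, assuming the stock hoodie.deltastreamer.source.kafka.topic key is simply reused to carry the comma-separated list (the topic names are invented):
TypedProperties props = new TypedProperties();
// hypothetical topics; the existing single-topic key now holds a comma-separated list
props.setProperty("hoodie.deltastreamer.source.kafka.topic", "topic_a,topic_b,topic_c");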
KafkaOffsetGen initialization
Add a topicNameArr field, populated by splitting the configured topic string on commas:
topicNameArr = topicName.split(",");
strToOffsets and offsetsToStr methods
These two methods serialize checkpoint data to a string and parse it back. We extended both to handle multiple topics, separating the topics with a vertical bar (|).
/**
 * Reconstruct checkpoint from timeline.
 */
public static Map<TopicPartition, Long> strToOffsets(String checkpointStr) {
  Map<TopicPartition, Long> offsetMap = new HashMap<>();
  // topics are separated by "|"; within a topic, fields are comma-separated
  String[] topics = checkpointStr.split("\\|");
  for (String s : topics) {
    String[] splits = s.split(",");
    // the first field is the topic name, the rest are "partition:offset" pairs
    String topic = splits[0];
    for (int i = 1; i < splits.length; i++) {
      String[] subSplits = splits[i].split(":");
      offsetMap.put(new TopicPartition(topic, Integer.parseInt(subSplits[0])), Long.parseLong(subSplits[1]));
    }
  }
  return offsetMap;
}
/**
 * String representation of checkpoint.
 * <p>
 * Format: topic1,0:offset0,1:offset1,2:offset2|topic2,0:offset0,1:offset1,2:offset2|...
 */
public static String offsetsToStr(OffsetRange[] ranges) {
  StringBuilder sb = new StringBuilder();
  // at least 1 partition will be present.
  // group "partition:untilOffset" pairs by topic
  Map<String, List<String>> result = new HashMap<>();
  List<String> tmp;
  for (OffsetRange r : ranges) {
    tmp = result.getOrDefault(r.topic(), new ArrayList<>());
    tmp.add(String.format("%s:%d", r.partition(), r.untilOffset()));
    result.put(r.topic(), tmp);
  }
  // join each topic's pairs with "," and the topics themselves with "|"
  result.forEach((k, v) -> sb.append(k).append(",").append(v.stream().collect(Collectors.joining(","))).append("|"));
  sb.deleteCharAt(sb.length() - 1);
  return sb.toString();
}
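A quick round trip through the two methods (topics and offsets here are invented; since the grouping map is a HashMap, the topic order in the string is unspecified):
OffsetRange[] ranges = new OffsetRange[] {
    OffsetRange.create("topic_a", 0, 0, 100),
    OffsetRange.create("topic_a", 1, 0, 200),
    OffsetRange.create("topic_b", 0, 0, 50)
};
String checkpoint = CheckpointUtils.offsetsToStr(ranges);
// e.g. "topic_a,0:100,1:200|topic_b,0:50"
Map<TopicPartition, Long> restored = CheckpointUtils.strToOffsets(checkpoint);
// restored maps topic_a-0 -> 100, topic_a-1 -> 200, topic_b-0 -> 50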
computeOffsetRanges method
Computes the offset ranges to read from Kafka, while handling newly added partitions, offsets, and event limits. With multiple topics, identifying a range by partition number alone no longer works; the key must include the topic:
...
if (!exhaustedPartitions.contains(range.topic() + "-" + range.partition())) {
  long toOffsetMax = toOffsetMap.get(range.topicPartition());
  long toOffset = Math.min(toOffsetMax, range.untilOffset() + eventsPerPartition);
  if (toOffset == toOffsetMax) {
    // this topic-partition has been fully allocated for this batch
    exhaustedPartitions.add(range.topic() + "-" + range.partition());
  }
  ...
}
...
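The essential change relative to the single-topic version is the key of the exhausted set: the stock code tracks plain partition numbers, which would collide across topics. A sketch of the new keying:
// single-topic version: Set<Integer> keyed by partition number alone
// multi-topic version: qualify each partition with its topic name
Set<String> exhaustedPartitions = new HashSet<>();
exhaustedPartitions.add(range.topic() + "-" + range.partition());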
getNextOffsetRanges method
When resolving starting offsets, we must first check whether any topic (or partition) is newly added since the last checkpoint; new ones have no checkpointed offset, so we fetch their latest offsets:
Map<TopicPartition, Long> checkpointOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
fromOffsets = fetchValidOffsets(consumer, checkpointOffsets, topicPartitions);
// collect topic-partitions that have no entry in the checkpoint,
// i.e. topics or partitions added after the last checkpoint
Set<TopicPartition> newTopicPartition = new HashSet<>();
topicPartitions.forEach(val -> {
  if (!checkpointOffsets.containsKey(val)) {
    newTopicPartition.add(val);
  }
});
if (!newTopicPartition.isEmpty()) {
  LOG.info("add new topic or partitions, old fromOffsets:" + fromOffsets);
  // start new topics/partitions from their latest offsets
  fromOffsets.putAll(consumer.endOffsets(newTopicPartition));
  LOG.info("add new topic or partitions, new fromOffsets:" + fromOffsets);
}
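Note a design consequence of endOffsets: records already sitting in a newly added topic are skipped, and ingestion begins with data arriving afterwards. If a backfill were wanted instead, the consumer's beginningOffsets could be substituted; a variation, not what the patch above does:
// variation: start a newly added topic from its earliest retained offsets
fromOffsets.putAll(consumer.beginningOffsets(newTopicPartition));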
Full code
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hudi.utilities.sources.helpers;
import org.apache.hudi.DataSourceUtils;
import org.apache.hudi.common.config.ConfigProperty;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.StringUtils;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieNotSupportedException;
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics;
import org.apache.hudi.utilities.exception.HoodieDeltaStreamerException;
import org.apache.hudi.utilities.sources.AvroKafkaSource;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;