Background
When ingesting data into the lake with Hudi's native HoodieDeltaStreamer tool, we found that it only supports consuming a single topic. Our production scenarios require consuming data from multiple topics, so we extended Hudi's ingestion path accordingly.
Environment
Hudi version: 0.12.0
Modified classes: org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer and org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen (where the changes described below live)
Main changes
Extend the topic-name configuration so that multiple topics can be specified, separated by commas; a configuration sketch follows.
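A minimal sketch of such a configuration, assuming the stock hoodie.deltastreamer.source.kafka.topic key is simply reused to carry the comma-separated list (the topic names are invented):
TypedProperties props = new TypedProperties();
// hypothetical topics; the existing single-topic key now holds a comma-separated list
props.setProperty("hoodie.deltastreamer.source.kafka.topic", "topic_a,topic_b,topic_c");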
KafkaOffsetGen initialization
Add a topicNameArr field, populated by splitting the configured topic string on commas:
topicNameArr = topicName.split(",");
strToOffsets and offsetsToStr methods
These two methods serialize checkpoint data to a string and parse it back. We extended both to handle multiple topics, separating the topics with a vertical bar (|).
/**
 * Reconstruct checkpoint from timeline.
 */
public static Map<TopicPartition, Long> strToOffsets(String checkpointStr) {
  Map<TopicPartition, Long> offsetMap = new HashMap<>();
  // topics are separated by "|"; within a topic, fields are comma-separated
  String[] topics = checkpointStr.split("\\|");
  for (String s : topics) {
    String[] splits = s.split(",");
    // the first field is the topic name, the rest are "partition:offset" pairs
    String topic = splits[0];
    for (int i = 1; i < splits.length; i++) {
      String[] subSplits = splits[i].split(":");
      offsetMap.put(new TopicPartition(topic, Integer.parseInt(subSplits[0])), Long.parseLong(subSplits[1]));
    }
  }
  return offsetMap;
}
/**
 * String representation of checkpoint.
 * <p>
 * Format: topic1,0:offset0,1:offset1,2:offset2|topic2,0:offset0,1:offset1,2:offset2|...
 */
public static String offsetsToStr(OffsetRange[] ranges) {
  StringBuilder sb = new StringBuilder();
  // at least 1 partition will be present.
  // group "partition:untilOffset" pairs by topic
  Map<String, List<String>> result = new HashMap<>();
  List<String> tmp;
  for (OffsetRange r : ranges) {
    tmp = result.getOrDefault(r.topic(), new ArrayList<>());
    tmp.add(String.format("%s:%d", r.partition(), r.untilOffset()));
    result.put(r.topic(), tmp);
  }
  // join each topic's pairs with "," and the topics themselves with "|"
  result.forEach((k, v) -> sb.append(k).append(",").append(v.stream().collect(Collectors.joining(","))).append("|"));
  sb.deleteCharAt(sb.length() - 1);
  return sb.toString();
}
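A quick round trip through the two methods (topics and offsets here are invented; since the grouping map is a HashMap, the topic order in the string is unspecified):
OffsetRange[] ranges = new OffsetRange[] {
    OffsetRange.create("topic_a", 0, 0, 100),
    OffsetRange.create("topic_a", 1, 0, 200),
    OffsetRange.create("topic_b", 0, 0, 50)
};
String checkpoint = CheckpointUtils.offsetsToStr(ranges);
// e.g. "topic_a,0:100,1:200|topic_b,0:50"
Map<TopicPartition, Long> restored = CheckpointUtils.strToOffsets(checkpoint);
// restored maps topic_a-0 -> 100, topic_a-1 -> 200, topic_b-0 -> 50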
computeOffsetRanges method
Computes the offset ranges to read from Kafka, while handling newly added partitions, offsets, and event limits. With multiple topics, identifying a range by partition number alone no longer works; the key must include the topic:
...
if (!exhaustedPartitions.contains(range.topic() + "-" + range.partition())) {
  long toOffsetMax = toOffsetMap.get(range.topicPartition());
  long toOffset = Math.min(toOffsetMax, range.untilOffset() + eventsPerPartition);
  if (toOffset == toOffsetMax) {
    // this topic-partition has been fully allocated for this batch
    exhaustedPartitions.add(range.topic() + "-" + range.partition());
  }
  ...
}
...
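The essential change relative to the single-topic version is the key of the exhausted set: the stock code tracks plain partition numbers, which would collide across topics. A sketch of the new keying:
// single-topic version: Set<Integer> keyed by partition number alone
// multi-topic version: qualify each partition with its topic name
Set<String> exhaustedPartitions = new HashSet<>();
exhaustedPartitions.add(range.topic() + "-" + range.partition());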
getNextOffsetRanges method
When resolving starting offsets, we must first check whether any topic (or partition) is newly added since the last checkpoint; new ones have no checkpointed offset, so we fetch their latest offsets:
Map<TopicPartition, Long> checkpointOffsets = CheckpointUtils.strToOffsets(lastCheckpointStr.get());
fromOffsets = fetchValidOffsets(consumer, checkpointOffsets, topicPartitions);
// collect topic-partitions that have no entry in the checkpoint,
// i.e. topics or partitions added after the last checkpoint
Set<TopicPartition> newTopicPartition = new HashSet<>();
topicPartitions.forEach(val -> {
  if (!checkpointOffsets.containsKey(val)) {
    newTopicPartition.add(val);
  }
});
if (!newTopicPartition.isEmpty()) {
  LOG.info("add new topic or partitions, old fromOffsets:" + fromOffsets);
  // start new topics/partitions from their latest offsets
  fromOffsets.putAll(consumer.endOffsets(newTopicPartition));
  LOG.info("add new topic or partitions, new fromOffsets:" + fromOffsets);
}
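Note a design consequence of endOffsets: records already sitting in a newly added topic are skipped, and ingestion begins with data arriving afterwards. If a backfill were wanted instead, the consumer's beginningOffsets could be substituted; a variation, not what the patch above does:
// variation: start a newly added topic from its earliest retained offsets
fromOffsets.putAll(consumer.beginningOffsets(newTopicPartition));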
Full code
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hudi.utilities.sources.helpers;
import org.apache.hudi.DataSourceUtils;
import org.apache.hudi.common.config.ConfigProperty;
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.common.util.StringUtils;
import org.apache.hudi.exception.HoodieException;
import org.apache.hudi.exception.HoodieNotSupportedException;
import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamerMetrics;
import org.apache.hudi.utilities.exception.HoodieDeltaStreamerException;
import org.apache.hudi.utilities.sources.AvroKafkaSource;
import org.apache.kafka.clients.consumer.CommitFailedException;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;