Implementing FlinkKafkaConsumer to Consume Data Within a Specified Time Range

Preface

Since version 0.10, Kafka has supported consuming from a given starting timestamp: KafkaConsumer.offsetsForTimes resolves the timestamp to an offset for each partition, so under the hood consumption is still positioned by offset.
Correspondingly, starting with FlinkKafkaConsumer010, the Flink source also supports specifying a start time when consuming from Kafka:

FlinkKafkaConsumerBase<T> setStartFromTimestamp(long startupOffsetsTimestamp)
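
To make the "it is still offset-based underneath" point concrete, here is a minimal, self-contained sketch (not taken from the article; the broker address, group.id, topic name and timestamp are placeholders) of how offsetsForTimes maps a timestamp to a per-partition offset with the plain Kafka client, after which seek() turns it into ordinary offset-based consumption:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class OffsetsForTimesSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka:9092");   // placeholder broker address
    props.put("group.id", "timestamp-lookup");      // placeholder group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

    long startTimestamp = 1603163640000L;           // desired start time (epoch millis), placeholder

    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      // ask for the offset of startTimestamp on every partition of the topic
      Map<TopicPartition, Long> timestampsToSearch = new HashMap<>();
      for (PartitionInfo p : consumer.partitionsFor("http_log")) {   // placeholder topic
        timestampsToSearch.put(new TopicPartition(p.topic(), p.partition()), startTimestamp);
      }

      // offsetsForTimes returns, per partition, the earliest offset whose record
      // timestamp is >= the given timestamp (or null if no such record exists)
      Map<TopicPartition, OffsetAndTimestamp> offsets = consumer.offsetsForTimes(timestampsToSearch);

      // from here on it is ordinary offset-based consumption
      consumer.assign(offsets.keySet());
      offsets.forEach((tp, oat) -> {
        if (oat != null) {
          consumer.seek(tp, oat.offset());
        }
      });
    }
  }
}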

Because of a business requirement, we need to consume Kafka data from a specific time range in a Flink job. The API Flink currently provides only lets you specify the start time; once started, the source keeps consuming until the end of the stream, so the range has a head but no tail. To satisfy the requirement, we need a way to stop consumption at a given point in time.
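
For contrast, a minimal sketch of what the stock API allows, assuming the connector version in use exposes setStartFromTimestamp publicly (broker address, topic, group.id and timestamp are placeholders): the starting point can be pinned to a timestamp, but nothing stops the source afterwards.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class StartFromTimestampOnly {
  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties props = new Properties();
    props.put("bootstrap.servers", "kafka:9092");   // placeholder broker address
    props.put("group.id", "flink-streaming-job");   // placeholder group id

    FlinkKafkaConsumer<String> consumer =
        new FlinkKafkaConsumer<>("http_log", new SimpleStringSchema(), props);

    // only the start of the range can be bounded; the source then keeps consuming
    // until the job is cancelled -- there is no "stop timestamp" counterpart
    consumer.setStartFromTimestamp(1603163640000L);

    env.addSource(consumer).print();
    env.execute();
  }
}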

1. Analysis

Reading the FlinkKafkaConsumer source shows that consumption is started by FlinkKafkaConsumerBase.run, which creates a KafkaFetcher; the KafkaFetcher starts the thread that actually consumes the Kafka topic and then pulls the data from it (a synchronizing Handover object shared by the two threads hands the records over). The core logic is in KafkaFetcher.runFetchLoop(); for the details, see the source of KafkaConsumerThread and Handover.

@Override
public void runFetchLoop() throws Exception {
  try {
    final Handover handover = this.handover;

    // kick off the actual Kafka consumer
    consumerThread.start();

    while (running) {
      // this blocks until we get the next records
      // it automatically re-throws exceptions encountered in the consumer thread
      final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

      // get the records for each topic partition
      for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {

        List<ConsumerRecord<byte[], byte[]>> partitionRecords =
            records.records(partition.getKafkaPartitionHandle());

        for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
          final T value = deserializer.deserialize(record);

          if (deserializer.isEndOfStream(value)) {
            // end of stream signaled
            running = false;
            break;
          }

          // emit the actual record. this also updates offset state atomically
          // and deals with timestamps and watermark generation
          emitRecord(value, partition, record.offset(), record);
        }
      }
    }
  }
  finally {
    // this signals the consumer thread that no more work is to be done
    consumerThread.shutdown();
  }

  // on a clean exit, wait for the runner thread
  try {
    consumerThread.join();
  } catch (InterruptedException e) {
    // may be the result of a wake-up interruption after an exception.
    // we ignore this here and only restore the interruption state
    Thread.currentThread().interrupt();
  }
}

As the source shows, KafkaFetcher pulls data in a loop inside runFetchLoop(), and the loop only terminates when the end of the data stream is reached. So, to stop the KafkaConsumer manually, it is enough to add one more loop-exit condition.

2. Implementation

a. To support stopping consumption, my idea is to start from KafkaFetcher: add a parameter for the stop timestamp and keep it as a member field.

  private long stopConsumingTimestamp;

b. Then rework the core of the fetch path, runFetchLoop(), by adding one more loop-exit condition: once a fetched record's timestamp passes the specified stop timestamp, stop fetching and emitting and shut the consumer thread down. Note that record.timestamp() is the Kafka record timestamp (CreateTime or LogAppendTime, depending on the topic configuration), and the stopConsumingTimestamp != 0 check keeps the default, unset value from stopping the job immediately.

//stop fetching and emitting
if (record.timestamp() > stopConsumingTimestamp && stopConsumingTimestamp != 0) {
  this.running = false;
  break;
}

c. In FlinkKafkaConsumer, likewise add the stop-timestamp parameter and pass it on when the fetcher is created. Also add a setIntervalFromTimestamp() method that specifies the time range: it stores stopConsumingTimestamp and reuses setStartFromTimestamp() to set the starting timestamp for the consumer.

public FlinkKafkaConsumerBase<T> setIntervalFromTimestamp(long startupOffsetsTimestamp, long stopConsumingTimestamp) {
  if (startupOffsetsTimestamp > stopConsumingTimestamp) {
    throw new IllegalArgumentException("The start consuming time " + startupOffsetsTimestamp + " exceeds the end time " + stopConsumingTimestamp);
  }
  setStopConsumingTimestamp(stopConsumingTimestamp);
  return super.setStartFromTimestamp(startupOffsetsTimestamp);
}

The implementation is not complicated; in short, two classes are reworked: FlinkKafkaConsumer and KafkaFetcher.
The full code is attached below:

SpecificFlinkKafkaConsumer.java

package org.apache.flink.streaming.connectors.kafka;

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
import org.apache.flink.streaming.connectors.kafka.config.OffsetCommitMode;
import org.apache.flink.streaming.connectors.kafka.internal.SpecificKafkaFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.util.SerializedValue;

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Pattern;

/**
 * @author lee
 */
public class SpecificFlinkKafkaConsumer<T> extends FlinkKafkaConsumer<T> {
  private long stopConsumingTimestamp;

  public SpecificFlinkKafkaConsumer(String topic, DeserializationSchema<T> valueDeserializer, Properties props) {
    super(topic, valueDeserializer, props);
  }

  public SpecificFlinkKafkaConsumer(String topic, KafkaDeserializationSchema<T> deserializer, Properties props) {
    super(topic, deserializer, props);
  }

  public SpecificFlinkKafkaConsumer(List<String> topics, DeserializationSchema<T> deserializer, Properties props) {
    super(topics, deserializer, props);
  }

  public SpecificFlinkKafkaConsumer(List<String> topics, KafkaDeserializationSchema<T> deserializer, Properties props) {
    super(topics, deserializer, props);
  }

  public SpecificFlinkKafkaConsumer(Pattern subscriptionPattern, DeserializationSchema<T> valueDeserializer, Properties props) {
    super(subscriptionPattern, valueDeserializer, props);
  }

  public SpecificFlinkKafkaConsumer(Pattern subscriptionPattern, KafkaDeserializationSchema<T> deserializer, Properties props) {
    super(subscriptionPattern, deserializer, props);
  }

  private void setStopConsumingTimestamp(long stopConsumingTimestamp) {
    this.stopConsumingTimestamp = stopConsumingTimestamp;
  }

  // specify the consuming time range
  public FlinkKafkaConsumerBase<T> setIntervalFromTimestamp(long startupOffsetsTimestamp, long stopConsumingTimestamp) {
    if (startupOffsetsTimestamp > stopConsumingTimestamp) {
      throw new IllegalArgumentException("The start consuming time " + startupOffsetsTimestamp + " exceeds the end time " + stopConsumingTimestamp);
    }
    setStopConsumingTimestamp(stopConsumingTimestamp);
    return super.setStartFromTimestamp(startupOffsetsTimestamp);
  }

  @Override
  protected AbstractFetcher<T, ?> createFetcher(
      SourceContext<T> sourceContext,
      Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets,
      SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic,
      SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated,
      StreamingRuntimeContext runtimeContext,
      OffsetCommitMode offsetCommitMode,
      MetricGroup consumerMetricGroup,
      boolean useMetrics) throws Exception {
    adjustAutoCommitConfig(this.properties, offsetCommitMode);
    return new SpecificKafkaFetcher<T>(
        sourceContext,
        assignedPartitionsWithInitialOffsets,
        watermarksPeriodic,
        watermarksPunctuated,
        runtimeContext.getProcessingTimeService(),
        runtimeContext.getExecutionConfig().getAutoWatermarkInterval(),
        runtimeContext.getUserCodeClassLoader(),
        runtimeContext.getTaskNameWithSubtasks(),
        this.deserializer,
        this.properties,
        this.pollTimeout,
        runtimeContext.getMetricGroup(),
        consumerMetricGroup,
        useMetrics,
        stopConsumingTimestamp);
  }
}


SpecificKafkaFetcher.java

package org.apache.flink.streaming.connectors.kafka.internal;

import org.apache.flink.annotation.Internal;
import org.apache.flink.metrics.MetricGroup;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaCommitCallback;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition;
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartitionState;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeService;
import org.apache.flink.util.Preconditions;
import org.apache.flink.util.SerializedValue;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import javax.annotation.Nonnull;

/**
 * @author lee
 */
@Internal
public class SpecificKafkaFetcher<T> extends AbstractFetcher<T, TopicPartition> {
  private static final Logger LOG = LoggerFactory.getLogger(SpecificKafkaFetcher.class);
  private final KafkaDeserializationSchema<T> deserializer;
  private final Handover handover;
  private final KafkaConsumerThread consumerThread;
  private volatile boolean running = true;

  private final long stopConsumingTimestamp;

  public SpecificKafkaFetcher(
      SourceFunction.SourceContext<T> sourceContext,
      Map<KafkaTopicPartition, Long> assignedPartitionsWithInitialOffsets,
      SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic,
      SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated,
      ProcessingTimeService processingTimeProvider,
      long autoWatermarkInterval,
      ClassLoader userCodeClassLoader,
      String taskNameWithSubtasks,
      KafkaDeserializationSchema<T> deserializer,
      Properties kafkaProperties,
      long pollTimeout,
      MetricGroup subtaskMetricGroup,
      MetricGroup consumerMetricGroup,
      boolean useMetrics,
      long stopConsumingTimestamp) throws Exception {
    super(sourceContext, assignedPartitionsWithInitialOffsets, watermarksPeriodic, watermarksPunctuated,
        processingTimeProvider, autoWatermarkInterval, userCodeClassLoader, consumerMetricGroup, useMetrics);
    this.deserializer = deserializer;
    this.handover = new Handover();
    this.consumerThread = new KafkaConsumerThread(
        LOG, this.handover, kafkaProperties, this.unassignedPartitionsQueue,
        this.getFetcherName() + " for " + taskNameWithSubtasks,
        pollTimeout, useMetrics, consumerMetricGroup, subtaskMetricGroup);
    this.stopConsumingTimestamp = stopConsumingTimestamp;
  }

  public void runFetchLoop() throws Exception {
    try {
      final Handover handover = this.handover;

      // kick off the actual Kafka consumer
      consumerThread.start();

      while (running) {
        // this blocks until we get the next records
        // it automatically re-throws exceptions encountered in the consumer thread
        final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

        // get the records for each topic partition
        for (KafkaTopicPartitionState<TopicPartition> partition : subscribedPartitionStates()) {

          List<ConsumerRecord<byte[], byte[]>> partitionRecords =
              records.records(partition.getKafkaPartitionHandle());

          for (ConsumerRecord<byte[], byte[]> record : partitionRecords) {
            final T value = deserializer.deserialize(record);

            if (deserializer.isEndOfStream(value)) {
              // end of stream signaled
              running = false;
              break;
            }

            //stop fetching and emitting
            if (record.timestamp() > stopConsumingTimestamp && stopConsumingTimestamp != 0) {
              this.running = false;
              break;
            }

            // emit the actual record. this also updates offset state atomically
            // and deals with timestamps and watermark generation
            emitRecord(value, partition, record.offset(), record);
          }
        }
      }
    } finally {
      // this signals the consumer thread that no more work is to be done
      consumerThread.shutdown();
    }

    // on a clean exit, wait for the runner thread
    try {
      consumerThread.join();
    } catch (InterruptedException e) {
      // may be the result of a wake-up interruption after an exception.
      // we ignore this here and only restore the interruption state
      Thread.currentThread().interrupt();
    }
  }

  protected void emitRecord(T record, KafkaTopicPartitionState<TopicPartition> partition, long offset, ConsumerRecord<?, ?> consumerRecord) throws Exception {
    this.emitRecordWithTimestamp(record, partition, offset, consumerRecord.timestamp());
  }

  public void cancel() {
    this.running = false;
    this.handover.close();
    this.consumerThread.shutdown();
  }


  protected String getFetcherName() {
    return "Kafka Fetcher";
  }

  public TopicPartition createKafkaPartitionHandle(KafkaTopicPartition partition) {
    return new TopicPartition(partition.getTopic(), partition.getPartition());
  }

  protected void doCommitInternalOffsetsToKafka(Map<KafkaTopicPartition, Long> offsets, @Nonnull KafkaCommitCallback commitCallback) throws Exception {
    List<KafkaTopicPartitionState<TopicPartition>> partitions = this.subscribedPartitionStates();
    Map<TopicPartition, OffsetAndMetadata> offsetsToCommit = new HashMap<>(partitions.size());

    for (KafkaTopicPartitionState<TopicPartition> partition : partitions) {
      Long lastProcessedOffset = offsets.get(partition.getKafkaTopicPartition());
      if (lastProcessedOffset != null) {
        Preconditions.checkState(lastProcessedOffset >= 0L, "Illegal offset value to commit");
        long offsetToCommit = lastProcessedOffset + 1L;
        offsetsToCommit.put(partition.getKafkaPartitionHandle(), new OffsetAndMetadata(offsetToCommit));
        partition.setCommittedOffset(offsetToCommit);
      }
    }

    this.consumerThread.setOffsetsToCommit(offsetsToCommit, commitCallback);
  }


}

Test class

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.SpecificFlinkKafkaConsumer;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Properties;

/**
 * @author lee
 */
public class FlinkKafkaConsumerWithTimestampTest {
  private static ThreadLocal<DateFormat> pattern = ThreadLocal.withInitial(() -> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss"));

  public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);

    Properties prop = new Properties();
    prop.put("bootstrap.servers", "kafka:23092");
    prop.put("group.id", "flink-streaming-job");

    long startTimestamp = pattern.get().parse("2020-10-20 11:14:00").getTime();
    long stopTimestamp = pattern.get().parse("2020-10-20 11:14:10").getTime();

    SpecificFlinkKafkaConsumer<String> consumer = new SpecificFlinkKafkaConsumer<String>("http_log", new SimpleStringSchema(), prop);
    consumer.setIntervalFromTimestamp(startTimestamp, stopTimestamp);

    DataStreamSource<String> dataStreamSource = env.addSource(consumer);
    dataStreamSource.print();
    env.execute();

  }
}

Pay special attention to the fact that, because the relevant methods are protected, the classes created in this rework must live in the same packages as Flink's own FlinkKafkaConsumer classes: org.apache.flink.streaming.connectors.kafka for SpecificFlinkKafkaConsumer and org.apache.flink.streaming.connectors.kafka.internal for SpecificKafkaFetcher, as the package declarations above show.
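
For reference, the resulting source layout then looks roughly like this (a sketch assuming a standard src/main/java structure; only the package paths matter):

src/main/java
└── org/apache/flink/streaming/connectors/kafka
    ├── SpecificFlinkKafkaConsumer.java        (same package as FlinkKafkaConsumer)
    └── internal
        └── SpecificKafkaFetcher.java          (same package as KafkaFetcher, Handover and KafkaConsumerThread)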
