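Kafka subscription modes
Before describing how the Flink Kafka connector subscribes to topics, let's first review the subscription modes of Kafka itself. Kafka offers the following two.
1. subscribe() supports automatic consumer rebalancing: when consumers join or leave the group, the partition assignment is adjusted automatically.
2. assign() lets the caller specify exactly which partitions to pull data from. Because the assignment is defined by the caller, assign mode does not get the rebalance that comes built in with Kafka's subscribe().
The Flink Kafka connector uses assign mode.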
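What assign-style consumption implies can be illustrated with a small, self-contained sketch in plain Java (no Kafka dependency). Each consumer computes its own partition set from fixed inputs, so no group coordinator rebalances anything behind its back. The class name StaticAssignmentSketch and the modulo rule are assumptions made for illustration only; they are not the Kafka or Flink implementation.

```java
import java.util.ArrayList;
import java.util.List;

class StaticAssignmentSketch {

    /** Returns the partitions that the subtask with the given index claims for itself. */
    static List<Integer> partitionsFor(int subtaskIndex, int numSubtasks, int numPartitions) {
        List<Integer> mine = new ArrayList<>();
        for (int partition = 0; partition < numPartitions; partition++) {
            // Hypothetical rule: partition p belongs to subtask (p mod numSubtasks).
            if (partition % numSubtasks == subtaskIndex) {
                mine.add(partition);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        int numSubtasks = 2;
        int numPartitions = 5;
        for (int i = 0; i < numSubtasks; i++) {
            System.out.println("subtask " + i + " -> " + partitionsFor(i, numSubtasks, numPartitions));
        }
    }
}
```

With the real client, the resulting partition list would be handed to the consumer's assign() call, whereas subscribe() takes topic names and leaves the split to the consumer group.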
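Flink Kafka consumer
In the Flink Kafka consumer, FlinkKafkaConsumerBase is the main class. Its open() method initializes a number of fields. One variable in open(), subscribedPartitionsToStartOffsets, is passed to createFetcher(), and that call eventually leads into AbstractFetcher. The fetcher starts the Kafka consumer thread and pulls data partition by partition, which is how dynamic partition assignment is handled. The following sections analyze the source code of AbstractFetcher and the KafkaConsumer thread in detail.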
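The call chain described above can be sketched roughly as follows. This is a deliberately simplified stand-in written in plain Java, not the Flink source: the names FetcherFlowSketch, SketchFetcher and positions are hypothetical, partitions are reduced to integer ids, and the consumer thread makes a single pass instead of looping.

```java
import java.util.HashMap;
import java.util.Map;

class FetcherFlowSketch {

    /** Stand-in for a fetcher: it receives the start offsets and owns the consumer thread. */
    static class SketchFetcher {
        private final Map<Integer, Long> startOffsets;
        // Filled by the consumer thread; read only after join(), so no extra locking is needed here.
        final Map<Integer, Long> positions = new HashMap<>();

        SketchFetcher(Map<Integer, Long> startOffsets) {
            this.startOffsets = new HashMap<>(startOffsets);
        }

        /** Starts the consumer thread, which records each partition's start offset as its position. */
        void runFetchLoop() {
            Thread consumerThread =
                    new Thread(() -> positions.putAll(startOffsets), "sketch-consumer-thread");
            consumerThread.start();
            try {
                // A real fetch loop keeps running; the sketch stops after one pass.
                consumerThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    /** Stand-in for open() handing subscribedPartitionsToStartOffsets to createFetcher(). */
    static SketchFetcher createFetcher(Map<Integer, Long> subscribedPartitionsToStartOffsets) {
        return new SketchFetcher(subscribedPartitionsToStartOffsets);
    }

    public static void main(String[] args) {
        Map<Integer, Long> offsets = new HashMap<>();
        offsets.put(0, 42L);
        offsets.put(1, 7L);
        SketchFetcher fetcher = createFetcher(offsets);
        fetcher.runFetchLoop();
        System.out.println(fetcher.positions);
    }
}
```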
AbstractFetcher in detail
Fields
- Watermark mode constants
private static final int NO_TIMESTAMPS_WATERMARKS = 0;
private static final int PERIODIC_WATERMARKS = 1;
private static final int PUNCTUATED_WATERMARKS = 2;
- Partition-related fields
/** The source context to emit records and watermarks to. */
protected final SourceContext<T> sourceContext;
/** The lock that guarantees that record emission and state updates are atomic,
* from the view of taking a checkpoint. */
private final Object checkpointLock;
/** All partitions (and their state) that this fetcher is subscribed to. */
private final List<KafkaTopicPartitionState<KPH>> subscribedPartitionStates;
/**
* Queue of partitions that are not yet assigned to any Kafka clients for consuming.
* Kafka version-specific implementations of {@link AbstractFetcher#runFetchLoop()}
* should continuously poll this queue for unassigned partitions, and start consuming
* them accordingly.
*
* <p>All partitions added to this queue are guaranteed to have been added
* to {@link #subscribedPartitionStates} already.
*/
protected final ClosableBlockingQueue<KafkaTopicPartitionState<KPH>> unassignedPartitionsQueue;
/** The mode describing whether the fetcher also generates timestamps and watermarks. */
private final int timestampWatermarkMode;
/**
* Optional timestamp extractor / watermark generator that will be run per Kafka partition,
* to exploit per-partition timestamp characteristics.
* The assigner is kept in serialized form, to deserialize it into multiple copies.
*/
private final SerializedValue<AssignerWithPeriodicWatermarks<T>> watermarksPeriodic;
/**
* Optional timestamp extractor / watermark generator that will be run per Kafka partition,
* to exploit per-partition timestamp characteristics.
* The assigner is kept in serialized form, to deserialize it into multiple copies.
*/
private final SerializedValue<AssignerWithPunctuatedWatermarks<T>> watermarksPunctuated;
/** User class loader used to deserialize watermark assigners. */
private final ClassLoader userCodeClassLoader;
/** Only relevant for punctuated watermarks: The current cross partition watermark. */
private volatile long maxWatermarkSoFar = Long.MIN_VALUE;
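Of these, subscribedPartitionStates and unassignedPartitionsQueue are the two fields that matter most for partition handling. The Javadoc comments in the source above give the official explanation.
1. subscribedPartitionStates: "All partitions (and their state) that this fetcher is subscribed to." It holds the state, along with some descriptive information, of every partition this fetcher is subscribed to.
2. unassignedPartitionsQueue: A Kafka consumer pulls data by partition, so at startup it must first be decided which partition belongs to which subtask. The strategy for assigning partitions to subtasks was covered in an earlier article. This queue holds the partitions that have not yet been assigned to a Kafka client for consumption.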
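How the two fields cooperate can be sketched with standard-library types. This is only an approximation under stated assumptions: LinkedBlockingQueue stands in for Flink's ClosableBlockingQueue, partitions are plain strings instead of KafkaTopicPartitionState objects, and the class and method names (UnassignedQueueSketch, addDiscoveredPartition, assignPending) are hypothetical. The one rule taken from the source is the Javadoc guarantee that a partition is added to subscribedPartitionStates before it enters the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class UnassignedQueueSketch {

    // Stand-in for subscribedPartitionStates: every partition this fetcher knows about.
    final List<String> subscribedPartitionStates = new CopyOnWriteArrayList<>();

    // Stand-in for unassignedPartitionsQueue: partitions no client is consuming yet.
    final BlockingQueue<String> unassignedPartitionsQueue = new LinkedBlockingQueue<>();

    // Partitions the consumer side has already handed to its client.
    final List<String> assignedToClient = new ArrayList<>();

    /** Discovery side: register the partition first, then enqueue it for assignment. */
    void addDiscoveredPartition(String partition) {
        subscribedPartitionStates.add(partition);
        unassignedPartitionsQueue.add(partition);
    }

    /** Consumer side: take everything that is pending and start consuming it. */
    int assignPending() {
        List<String> pending = new ArrayList<>();
        unassignedPartitionsQueue.drainTo(pending);
        assignedToClient.addAll(pending);
        return pending.size();
    }

    public static void main(String[] args) {
        UnassignedQueueSketch sketch = new UnassignedQueueSketch();
        sketch.addDiscoveredPartition("topic-0");
        sketch.addDiscoveredPartition("topic-1");
        System.out.println("newly assigned: " + sketch.assignPending());
        System.out.println("still queued: " + sketch.unassignedPartitionsQueue.size());
    }
}
```

In this sketch a partition stays in subscribedPartitionStates after it leaves the queue, which mirrors the split described above: one collection tracks everything the fetcher owns, the other only what is still waiting for a client.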
To be continued...