Entry Point
Key class: KafkaDynamicTableFactory
Source
The createDynamicTableSource method creates the Kafka source. It mainly does the following:
- Read the table DDL information from the context, such as the schema and the WITH options, and build a TableFactoryHelper utility.
- Discover the key/value formats according to the key/value format settings in the WITH options.
- Validate the various options.
- Construct the KafkaDynamicSource object.
- Inside KafkaDynamicSource, create the corresponding deserialization schemas from the key/value formats, split the schema into metadata columns and physical columns, and create a FlinkKafkaConsumer wrapped in a SourceFunctionProvider.
@Override
public ScanRuntimeProvider getScanRuntimeProvider(ScanContext context) {
    final DeserializationSchema<RowData> keyDeserialization =
            createDeserialization(context, keyDecodingFormat, keyProjection, keyPrefix);
    final DeserializationSchema<RowData> valueDeserialization =
            createDeserialization(context, valueDecodingFormat, valueProjection, null);
    final TypeInformation<RowData> producedTypeInfo =
            context.createTypeInformation(producedDataType);
    final FlinkKafkaConsumer<RowData> kafkaConsumer =
            createKafkaConsumer(keyDeserialization, valueDeserialization, producedTypeInfo);
    return SourceFunctionProvider.of(kafkaConsumer, false);
}
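The keyProjection and valueProjection arguments above describe which physical columns come from the Kafka record key and which from the value. The standalone sketch below is my own illustration of that idea (the class, field values, and helper are hypothetical, not Flink code):

```java
import java.util.Arrays;

public class ProjectionDemo {
    // Pick the fields at the given indices out of a full physical row,
    // mimicking how keyProjection/valueProjection select columns.
    static Object[] project(Object[] physicalRow, int[] projection) {
        return Arrays.stream(projection).mapToObj(i -> physicalRow[i]).toArray();
    }

    public static void main(String[] args) {
        Object[] row = {"user-1", "click", 42L};    // hypothetical physical row
        int[] keyProjection = {0};                  // key = first column
        int[] valueProjection = {1, 2};             // value = remaining columns
        System.out.println(Arrays.toString(project(row, keyProjection)));
        System.out.println(Arrays.toString(project(row, valueProjection)));
    }
}
```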
FlinkKafkaConsumer is what actually reads from Kafka; the core logic lives in its parent class FlinkKafkaConsumerBase. Its key methods:
open: initializes the consumer, including the offset commit mode, dynamic partition discovery, the startup mode, and the deserializer
run: pulls data from Kafka through the KafkaFetcher
runWithPartitionDiscovery: runs dynamic partition discovery in a separate thread
snapshotState: snapshots partition and offset information at checkpoint time, used for failover
initializeState: restores state when recovering from a checkpoint
notifyCheckpointComplete: commits offsets back to Kafka when a checkpoint completes
About dynamic partition discovery: open fetches all partitions of the topic once up front. As partition discovery then runs periodically, whenever new partitions have been added, all partitions are fetched again and the partition ids are compared to determine which ones are new since the last run; the following assignment algorithm then decides which subtask subscribes to and consumes each partition.
public static int assign(KafkaTopicPartition partition, int numParallelSubtasks) {
int startIndex =
((partition.getTopic().hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;
// here, the assumption is that the id of Kafka partitions are always ascending
// starting from 0, and therefore can be used directly as the offset clockwise from the
// start index
return (startIndex + partition.getPartition()) % numParallelSubtasks;
}
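To see how this spreads partitions across subtasks, here is a self-contained copy of the algorithm (the class and topic name are mine; the logic mirrors the method above). Consecutive partition ids land on consecutive subtasks, wrapping around the start index:

```java
public class PartitionAssignmentDemo {
    // Same logic as the assign method above: a topic-dependent start index,
    // then partitions distributed round-robin from there.
    static int assign(String topic, int partition, int numParallelSubtasks) {
        int startIndex = ((topic.hashCode() * 31) & 0x7FFFFFFF) % numParallelSubtasks;
        return (startIndex + partition) % numParallelSubtasks;
    }

    public static void main(String[] args) {
        int parallelism = 3;
        for (int p = 0; p < 6; p++) {
            System.out.println("partition " + p + " -> subtask "
                    + assign("orders", p, parallelism));
        }
    }
}
```

Because the mapping is a pure function of the topic, partition id, and parallelism, every subtask can independently compute the owner of a newly discovered partition without any coordination.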
The KafkaFetcher consumes Kafka data through a consumer thread, KafkaConsumerThread, which internally uses Kafka's own KafkaConsumer. The KafkaFetcher repeatedly calls pollNext on the Handover, while the KafkaConsumerThread produces the records it fetches into the Handover; the Handover plays the role of the blocking queue in a producer-consumer model.
public void runFetchLoop() throws Exception {
    try {
        // kick off the actual Kafka consumer
        consumerThread.start();

        while (running) {
            // this blocks until we get the next records
            // it automatically re-throws exceptions encountered in the consumer thread
            final ConsumerRecords<byte[], byte[]> records = handover.pollNext();

            // get the records for each topic partition
            for (KafkaTopicPartitionState<T, TopicPartition> partition :
                    subscribedPartitionStates()) {
                List<ConsumerRecord<byte[], byte[]>> partitionRecords =
                        records.records(partition.getKafkaPartitionHandle());
                partitionConsumerRecordsHandler(partitionRecords, partition);
            }
        }
    } finally {
        // this signals the consumer thread that no more work is to be done
        consumerThread.shutdown();
    }

    // on a clean exit, wait for the runner thread
    try {
        consumerThread.join();
    } catch (InterruptedException e) {
        // may be the result of a wake-up interruption after an exception.
        // we ignore this here and only restore the interruption state
        Thread.currentThread().interrupt();
    }
}
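The Handover used above can be approximated by a single-element blocking exchange point. The simplified sketch below is my own class, not Flink's (the real Handover also forwards exceptions from the consumer thread and supports wakeup); it shows the blocking handoff between the producing KafkaConsumerThread and the polling fetcher:

```java
public class HandoverDemo<T> {
    private T next;

    // Called by the producing (consumer) thread; blocks while the
    // previous element has not been taken yet.
    public synchronized void produce(T element) throws InterruptedException {
        while (next != null) {
            wait();
        }
        next = element;
        notifyAll();
    }

    // Called by the fetcher thread; blocks until an element is available.
    public synchronized T pollNext() throws InterruptedException {
        while (next == null) {
            wait();
        }
        T result = next;
        next = null;
        notifyAll();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        HandoverDemo<String> handover = new HandoverDemo<>();
        Thread producer = new Thread(() -> {
            try {
                handover.produce("records-batch-1");
            } catch (InterruptedException ignored) {
            }
        });
        producer.start();
        System.out.println(handover.pollNext());
        producer.join();
    }
}
```

A capacity of one means the consumer thread never runs far ahead of the fetcher, which keeps backpressure simple.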
Sink
The sink side is similar: the createDynamicTableSink method creates a KafkaDynamicSink and is mainly responsible for:
Same as the source, with one special case: if the format is avro-confluent or debezium-avro-confluent and schema-registry.subject is not set, it is filled in automatically.
Discover the key/value encoding formats according to the WITH options.
Validate the options.
Construct the KafkaDynamicSink object.
In getSinkRuntimeProvider, construct a FlinkKafkaProducer and wrap it in a SinkFunctionProvider.
public SinkRuntimeProvider getSinkRuntimeProvider(Context context) {
    final SerializationSchema<RowData> keySerialization =
            createSerialization(context, keyEncodingFormat, keyProjection, keyPrefix);
    final SerializationSchema<RowData> valueSerialization =
            createSerialization(context, valueEncodingFormat, valueProjection, null);
    final FlinkKafkaProducer<RowData> kafkaProducer =
            createKafkaProducer(keySerialization, valueSerialization);
    return SinkFunctionProvider.of(kafkaProducer, parallelism);
}
FlinkKafkaProducer writes the data to Kafka. To guarantee exactly-once semantics it extends TwoPhaseCommitSinkFunction, the two-phase commit base class, and relies on Kafka's transaction mechanism to achieve exactly-once delivery.
Key methods of FlinkKafkaProducer:
open: initializes the Kafka-related properties
invoke: the per-record processing logic; serializes the key and value into a ProducerRecord and sends it via Kafka's KafkaProducer API according to the partitioning strategy
beginTransaction: opens a transaction
preCommit: the pre-commit (first phase)
commit: the final commit (second phase)
snapshotState: snapshots state at checkpoint time, mainly transaction-related state
notifyCheckpointComplete: a parent-class callback invoked when a checkpoint completes; commits the transaction
initializeState: state initialization, used to restore state when the job recovers from a checkpoint
The overall flow of sending data and committing transactions:
initializeState (on job start or recovery from a checkpoint, opens the first transaction via beginTransaction) → invoke (processes records and sends them to Kafka) → snapshotState (stores the current transaction in state, opens the next transaction, and performs the pre-commit via preCommit) → notifyCheckpointComplete (commits the previously pending transactions via commit)
If an error occurs along the way, close is eventually called to abort the transaction.
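The lifecycle above can be sketched with a simplified stand-in for TwoPhaseCommitSinkFunction. This is my own illustration, not the Flink class: it drops state persistence, recovery, and abort handling, and keeps only the transaction-rotation logic so the checkpoint-to-visibility timing is easy to follow:

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPhaseCommitDemo {
    private List<String> currentTransaction = new ArrayList<>();
    private final List<List<String>> pendingTransactions = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    // invoke: buffer records inside the currently open transaction
    public void invoke(String record) {
        currentTransaction.add(record);
    }

    // snapshotState: pre-commit the open transaction, remember it as
    // pending, and immediately begin the next transaction
    public void snapshotState() {
        pendingTransactions.add(currentTransaction); // preCommit
        currentTransaction = new ArrayList<>();      // beginTransaction
    }

    // notifyCheckpointComplete: finally commit everything pre-committed
    public void notifyCheckpointComplete() {
        for (List<String> txn : pendingTransactions) {
            committed.addAll(txn);                   // commit
        }
        pendingTransactions.clear();
    }

    public List<String> getCommitted() {
        return committed;
    }

    public static void main(String[] args) {
        TwoPhaseCommitDemo sink = new TwoPhaseCommitDemo();
        sink.invoke("a");
        sink.invoke("b");
        sink.snapshotState();            // checkpoint triggered
        sink.invoke("c");                // goes into the next transaction
        sink.notifyCheckpointComplete(); // only a and b become visible
        System.out.println(sink.getCommitted());
    }
}
```

Note how records written after snapshotState ("c" above) stay invisible until a later checkpoint completes; this is exactly why exactly-once Kafka sinks add end-to-end latency of one checkpoint interval.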