Continuing from the previous post~
This one covers the worker side. First, a summary of what the worker has to do:
1. RPC to the SourceCoordinator to announce "I've started working": register its own id and obtain the splits (partitions) assigned to it.
2. Wait for the assigned splits; once they arrive, do the prep work: resolve each topic-partition's offset, then build a thread that pulls the tp data, and maintain the tp offset state.
3. Exception handling.
Let's look at registration first; it happens in SourceOperator's open() method. The snippet below is the part where the RPC comes into play:
// Get the splits: on a restart after failure, they are restored from the checkpointed readerState.
final List<SplitT> splits = CollectionUtil.iterableToList(readerState.get());
// If there are restored splits, the reader can start consuming from them right away.
if (!splits.isEmpty()) {
    sourceReader.addSplits(splits);
}
// As the English comment already says: this is the registration, telling the coordinator
// to notify us whenever there is new split (partition) information.
// Register the reader to the coordinator.
registerReader();
// Start the reader after registration, sending messages in start is allowed.
sourceReader.start();
Registration itself is the RPC: the reader sends a registration event to the SourceCoordinator, which then replies with split assignments.
private void registerReader() {
    operatorEventGateway.sendEventToCoordinator(
            new ReaderRegistrationEvent(
                    getRuntimeContext().getIndexOfThisSubtask(), localHostname));
}
From here begins the journey of adding splits.
On the SourceOperator side, handling the split event sent back by the SourceCoordinator simply means adding those splits:
public void handleOperatorEvent(OperatorEvent event) {
    if (event instanceof AddSplitEvent) {
        try {
            sourceReader.addSplits(((AddSplitEvent<SplitT>) event).splits(splitSerializer));
        } catch (IOException e) {
            throw new FlinkRuntimeException("Failed to deserialize the splits.", e);
        }
    }
}
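Why can deserialization fail here at all? Because the coordinator runs on the JobManager while the operator runs on a TaskManager, the splits cross an RPC boundary as bytes. Below is a condensed, illustrative mini-version of the event; the real one is org.apache.flink.runtime.source.event.AddSplitEvent, and names here are simplified:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.core.io.SimpleVersionedSerializer;

// Splits are stored serialized; splits(serializer) turns them back into objects,
// which is the step that can throw the IOException caught above.
class MiniAddSplitEvent<SplitT> {
    private final int serializerVersion;
    private final List<byte[]> serializedSplits = new ArrayList<>();

    MiniAddSplitEvent(List<SplitT> splits, SimpleVersionedSerializer<SplitT> serializer)
            throws IOException {
        this.serializerVersion = serializer.getVersion();
        for (SplitT split : splits) {
            serializedSplits.add(serializer.serialize(split));
        }
    }

    List<SplitT> splits(SimpleVersionedSerializer<SplitT> serializer) throws IOException {
        List<SplitT> result = new ArrayList<>(serializedSplits.size());
        for (byte[] bytes : serializedSplits) {
            result.add(serializer.deserialize(serializerVersion, bytes));
        }
        return result;
    }
}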
The sourceReader in the code above is exactly what KafkaSource's createReader() produces.
KafkaSource creates the KafkaPartitionSplitReader (used later to create the consumer that reads Kafka data), the KafkaRecordEmitter (which emits records and keeps each split's offset up to date), and so on.
The following code is the creation of the KafkaSourceReader:
public SourceReader<OUT, KafkaPartitionSplit> createReader(SourceReaderContext readerContext)
        throws Exception {
    FutureCompletingBlockingQueue<RecordsWithSplitIds<Tuple3<OUT, Long, Long>>> elementsQueue =
            new FutureCompletingBlockingQueue<>();
    deserializationSchema.open(
            new DeserializationSchema.InitializationContext() {
                @Override
                public MetricGroup getMetricGroup() {
                    return readerContext.metricGroup().addGroup("deserializer");
                }

                @Override
                public UserCodeClassLoader getUserCodeClassLoader() {
                    return readerContext.getUserCodeClassLoader();
                }
            });
    Supplier<KafkaPartitionSplitReader<OUT>> splitReaderSupplier =
            () ->
                    new KafkaPartitionSplitReader<>(
                            props, deserializationSchema, readerContext.getIndexOfSubtask());
    KafkaRecordEmitter<OUT> recordEmitter = new KafkaRecordEmitter<>();
    return new KafkaSourceReader<>(
            elementsQueue,
            splitReaderSupplier,
            recordEmitter,
            toConfiguration(props),
            readerContext);
}
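For context, this is the user-facing path that eventually triggers createReader() on every parallel subtask: a minimal job built on the public KafkaSource builder. The bootstrap server, topic, and group id below are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source =
                KafkaSource.<String>builder()
                        .setBootstrapServers("localhost:9092")
                        .setTopics("input-topic")
                        .setGroupId("my-group")
                        .setStartingOffsets(OffsetsInitializer.earliest())
                        .setValueOnlyDeserializer(new SimpleStringSchema())
                        .build();

        // Each parallel subtask of this source calls createReader() at open time.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source").print();
        env.execute("kafka-source-demo");
    }
}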
// Digging into the constructor: the KafkaSourceFetcherManager starts preparing to fetch data,
// while offsetsToCommit keeps track of the split offsets to be committed back to Kafka.
public KafkaSourceReader(
        FutureCompletingBlockingQueue<RecordsWithSplitIds<Tuple3<T, Long, Long>>> elementsQueue,
        Supplier<KafkaPartitionSplitReader<T>> splitReaderSupplier,
        RecordEmitter<Tuple3<T, Long, Long>, T, KafkaPartitionSplitState> recordEmitter,
        Configuration config,
        SourceReaderContext context) {
    super(
            elementsQueue,
            new KafkaSourceFetcherManager<>(elementsQueue, splitReaderSupplier::get),
            recordEmitter,
            config,
            context);
    this.offsetsToCommit = Collections.synchronizedSortedMap(new TreeMap<>());
    this.offsetsOfFinishedSplits = new ConcurrentHashMap<>();
}
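To make offsetsToCommit concrete: on each checkpoint the reader records the current split offsets under that checkpoint's id, and commits them to Kafka once the checkpoint completes. Below is a condensed sketch of that lifecycle; the method names are simplified stand-ins (the real entry points are snapshotState and notifyCheckpointComplete), so treat the bodies as approximate:

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import org.apache.flink.connector.kafka.source.split.KafkaPartitionSplit;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Condensed sketch of how offsetsToCommit is used across the checkpoint lifecycle.
class OffsetsToCommitSketch {
    // Keyed by checkpoint id, so completing checkpoint N knows exactly what to commit.
    private final SortedMap<Long, Map<TopicPartition, OffsetAndMetadata>> offsetsToCommit =
            Collections.synchronizedSortedMap(new TreeMap<>());

    void onSnapshot(long checkpointId, List<KafkaPartitionSplit> currentSplits) {
        Map<TopicPartition, OffsetAndMetadata> offsets =
                offsetsToCommit.computeIfAbsent(checkpointId, id -> new HashMap<>());
        for (KafkaPartitionSplit split : currentSplits) {
            // For a split in flight, the snapshotted starting offset is its current offset.
            offsets.put(split.getTopicPartition(),
                    new OffsetAndMetadata(split.getStartingOffset()));
        }
    }

    void onCheckpointComplete(long checkpointId) {
        Map<TopicPartition, OffsetAndMetadata> committable = offsetsToCommit.get(checkpointId);
        // ... hand `committable` to the fetcher manager, which commits via the consumer ...
        // Then drop this checkpoint's entry (and anything older).
        offsetsToCommit.headMap(checkpointId + 1).clear();
    }
}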
Back to the addSplits(...) from before.
In the reader base class it initializes per-split state and then hands the splits to the SplitFetcherManager:
public void addSplits(List<SplitT> splits) {
    LOG.info("Adding split(s) to reader: {}", splits);
    // Initialize the state for each split.
    // Per-split state is created here, mainly tracking the split's offset.
    splits.forEach(
            s ->
                    splitStates.put(
                            s.splitId(), new SplitContext<>(s.splitId(), initializedState(s))));
    // Hand over the splits to the split fetcher to start fetch.
    // From here, each fetcher does its prep work before polling data.
    splitFetcherManager.addSplits(splits);
}
Now look at the initializedState(...) method it calls. Its standout roles:
1. initialize and track the current split's offset;
2. at checkpoint time, save the split and offset information as state;
3. when KafkaRecordEmitter emits a record, update the offset on the fly, keeping consumption at the source consistent.
KafkaPartitionSplitState
public KafkaPartitionSplitState(KafkaPartitionSplit partitionSplit) {
    super(
            partitionSplit.getTopicPartition(),
            partitionSplit.getStartingOffset(),
            partitionSplit.getStoppingOffset().orElse(NO_STOPPING_OFFSET));
    this.currentOffset = partitionSplit.getStartingOffset();
}
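Point 3 deserves a concrete look. The emitter is tiny: in this connector generation the Tuple3 coming out of the split reader is (record, offset, timestamp), and every emit bumps the split state's offset. A sketch of KafkaRecordEmitter from memory of that code; treat the details as approximate:

import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.connector.base.source.reader.RecordEmitter;
import org.apache.flink.connector.kafka.source.split.KafkaPartitionSplitState;

// Sketch: f0 = deserialized record, f1 = offset, f2 = timestamp.
class KafkaRecordEmitterSketch<T>
        implements RecordEmitter<Tuple3<T, Long, Long>, T, KafkaPartitionSplitState> {

    @Override
    public void emitRecord(
            Tuple3<T, Long, Long> element,
            SourceOutput<T> output,
            KafkaPartitionSplitState splitState) {
        // Emit the record downstream with its Kafka timestamp...
        output.collect(element.f0, element.f2);
        // ...and advance the split's current offset, so the next checkpoint
        // snapshots consumption progress exactly up to here.
        splitState.setCurrentOffset(element.f1 + 1);
    }
}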
In the SplitFetcherManager, a fetcher is created if none is running, and the splits are added to it:
@Override
public void addSplits(List<SplitT> splitsToAdd) {
    SplitFetcher<E, SplitT> fetcher = getRunningFetcher();
    if (fetcher == null) {
        fetcher = createSplitFetcher();
        // Add the splits to the fetchers.
        fetcher.addSplits(splitsToAdd);
        startFetcher(fetcher);
    } else {
        fetcher.addSplits(splitsToAdd);
    }
}
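A note on getRunningFetcher(): this addSplits override comes from the single-threaded fetcher manager variant (the Kafka one builds on it), so there is at most one live fetcher to find. An illustrative reduction of the idea, with simplified names rather than the real class:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Self-contained illustration of the "reuse the single running fetcher" idea.
class OneFetcherManager<F> {
    private final Map<Integer, F> fetchers = new ConcurrentHashMap<>();

    // With a single-threaded manager the map holds at most one entry,
    // so "any entry" is "the entry".
    F getRunningFetcher() {
        return fetchers.isEmpty() ? null : fetchers.values().iterator().next();
    }
}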
Time for some background. First, SplitFetcherManager: reading the English Javadoc in the source makes it obvious that it maintains a thread pool and keeps track of all the SplitFetchers; a SplitFetcher both has a run() method and accepts submitted tasks. That is the threading model.
public SplitFetcherManager(
        FutureCompletingBlockingQueue<RecordsWithSplitIds<E>> elementsQueue,
        Supplier<SplitReader<E, SplitT>> splitReaderFactory) {
    this.elementsQueue = elementsQueue;
    this.errorHandler =
            new Consumer<Throwable>() {
                @Override
                public void accept(Throwable t) {
                    LOG.error("Received uncaught exception.", t);
                    if (!uncaughtFetcherException.compareAndSet(null, t)) {
                        // Add the exception to the exception list.
                        uncaughtFetcherException.get().addSuppressed(t);
                    }
                    // Wake up the main thread to let it know the exception.
                    elementsQueue.notifyAvailable();
                }
            };
    this.splitReaderFactory = splitReaderFactory;
    this.uncaughtFetcherException = new AtomicReference<>(null);
    this.fetcherIdGenerator = new AtomicInteger(0);
    // This map manages all the fetchers.
    this.fetchers = new ConcurrentHashMap<>();
    // Create the executor with a thread factory that fails the source reader if one of
    // the fetcher thread exits abnormally.
    final String taskThreadName = Thread.currentThread().getName();
    // And here is the thread pool.
    this.executors =
            Executors.newCachedThreadPool(
                    r -> new Thread(r, "Source Data Fetcher for " + taskThreadName));
    this.closed = false;
}
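The errorHandler above uses a pattern worth keeping: compareAndSet makes the first fetcher exception the primary one, and every later exception is attached via addSuppressed instead of being lost. A standalone illustration of the same pattern, plain Java with nothing Flink-specific:

import java.util.concurrent.atomic.AtomicReference;

public class FirstErrorWins {
    private final AtomicReference<Throwable> firstError = new AtomicReference<>(null);

    public void record(Throwable t) {
        if (!firstError.compareAndSet(null, t)) {
            // Someone already set the first error; keep this one as a suppressed detail.
            firstError.get().addSuppressed(t);
        }
    }

    public void rethrowIfAny() {
        Throwable t = firstError.get();
        if (t != null) {
            throw new RuntimeException("Fetcher failed.", t);
        }
    }

    public static void main(String[] args) {
        FirstErrorWins errors = new FirstErrorWins();
        errors.record(new IllegalStateException("fetcher 0 died"));
        errors.record(new IllegalStateException("fetcher 1 died"));
        errors.rethrowIfAny(); // carries the first error, with the second suppressed
    }
}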
And the next method is the actual task submission to that pool:
protected void startFetcher(SplitFetcher<E, SplitT> fetcher) {
    executors.submit(fetcher);
}
Next up is SplitFetcher. It defines the flow in which tasks run, and at its core sits one main FetchTask; the addSplits call from before is likewise turned into an AddSplitsTask. Both tasks point straight at KafkaPartitionSplitReader and call its methods: the add path gets the Kafka partitions and their offsets ready, while the fetch path has the KafkaConsumer poll the data.
These two are the most concrete, hardest-working objects here.
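Before diving into the reader, here is a simplified sketch of SplitFetcher's run loop, just to pin down the threading model. It is illustrative only: the real class also idles while it has no splits, supports wake-ups, and reports errors through the errorHandler shown earlier:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingDeque;

// Illustrative mini split fetcher: queued management tasks (AddSplitsTask) run first,
// otherwise the loop keeps running the main fetch task (which calls splitReader.fetch()).
class MiniSplitFetcher implements Runnable {
    interface FetcherTask {
        void run() throws Exception;
    }

    private final BlockingQueue<FetcherTask> taskQueue = new LinkedBlockingDeque<>();
    private final FetcherTask fetchTask; // stands in for the real FetchTask
    private volatile boolean closed;

    MiniSplitFetcher(FetcherTask fetchTask) {
        this.fetchTask = fetchTask;
    }

    // Stands in for SplitFetcher#addSplits, which enqueues an AddSplitsTask.
    void enqueueTask(FetcherTask task) {
        taskQueue.offer(task);
    }

    @Override
    public void run() {
        try {
            while (!closed) {
                FetcherTask next = taskQueue.poll();
                (next != null ? next : fetchTask).run();
            }
        } catch (Exception e) {
            // The real fetcher routes this into SplitFetcherManager's errorHandler.
            throw new RuntimeException("Fetcher failed.", e);
        }
    }
}

With that model in mind, here is the KafkaPartitionSplitReader constructor, where the KafkaConsumer is created: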
public KafkaPartitionSplitReader(
        Properties props,
        KafkaRecordDeserializationSchema<T> deserializationSchema,
        int subtaskId) {
    this.subtaskId = subtaskId;
    Properties consumerProps = new Properties();
    consumerProps.putAll(props);
    consumerProps.setProperty(ConsumerConfig.CLIENT_ID_CONFIG, createConsumerClientId(props));
    // Create the KafkaConsumer.
    this.consumer = new KafkaConsumer<>(consumerProps);
    this.stoppingOffsets = new HashMap<>();
    this.deserializationSchema = deserializationSchema;
    this.collector = new SimpleCollector<>();
    this.groupId = consumerProps.getProperty(ConsumerConfig.GROUP_ID_CONFIG);
}
@Override
public RecordsWithSplitIds<Tuple3<T, Long, Long>> fetch() throws IOException {
    KafkaPartitionSplitRecords<Tuple3<T, Long, Long>> recordsBySplits =
            new KafkaPartitionSplitRecords<>();
    ConsumerRecords<byte[], byte[]> consumerRecords;
    try {
        // Start consuming data.
        consumerRecords = consumer.poll(Duration.ofMillis(POLL_TIMEOUT));
    } catch (WakeupException we) {
        recordsBySplits.prepareForRead();
        return recordsBySplits;
    }
    // ... more details omitted
}
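The WakeupException branch exists because wakeup() is the one KafkaConsumer method that is safe to call from another thread; the fetcher uses it to break out of a blocking poll, for example when new splits arrive and need to be assigned. A standalone illustration of the mechanism, using the plain Kafka client API with a placeholder broker address:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class WakeupDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "demo");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("demo-topic"));

        // wakeup() may be called from any thread; the blocked poll() below then
        // throws WakeupException instead of waiting out its timeout.
        new Thread(consumer::wakeup).start();
        try {
            consumer.poll(Duration.ofSeconds(30));
        } catch (WakeupException we) {
            System.out.println("poll() interrupted by wakeup()");
        } finally {
            consumer.close();
        }
    }
}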
This is the method the AddSplitsTask ends up calling; it does the prep work before any data is fetched:
public void handleSplitsChanges(SplitsChange<KafkaPartitionSplit> splitsChange) {
    // ... details omitted
    // Parse the starting and stopping offsets.
    splitsChange
            .splits()
            .forEach(
                    s -> {
                        newPartitionAssignments.add(s.getTopicPartition());
                        parseStartingOffsets(
                                s,
                                partitionsStartingFromEarliest,
                                partitionsStartingFromLatest,
                                partitionsStartingFromSpecifiedOffsets);
                        parseStoppingOffsets(
                                s, partitionsStoppingAtLatest, partitionsStoppingAtCommitted);
                    });
    // Assign new partitions.
    newPartitionAssignments.addAll(consumer.assignment());
    // With the split partitions collected, prepare the offsets; assignment uses assign(), not subscribe().
    consumer.assign(newPartitionAssignments);
    // Set the corresponding offsets.
    // Seek on the newly assigned partitions to their starting offsets.
    seekToStartingOffsets(
            partitionsStartingFromEarliest,
            partitionsStartingFromLatest,
            partitionsStartingFromSpecifiedOffsets);
    // Setup the stopping offsets.
    acquireAndSetStoppingOffsets(partitionsStoppingAtLatest, partitionsStoppingAtCommitted);
    // After acquiring the starting and stopping offsets, remove the empty splits if necessary.
    removeEmptySplits();
    maybeLogSplitChangesHandlingResult(splitsChange);
}
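The assign-then-seek combination above is plain KafkaConsumer API: assign() takes an explicit partition list with no consumer-group rebalancing, which is exactly why Flink can control the assignment itself, and seek()/seekToBeginning()/seekToEnd() position each partition, while endOffsets() can pre-fetch stopping points. A minimal standalone version of the same moves, with placeholder topic, offsets, and broker:

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignSeekDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp0 = new TopicPartition("demo-topic", 0);
            TopicPartition tp1 = new TopicPartition("demo-topic", 1);

            // assign() instead of subscribe(): the caller owns the partition list.
            consumer.assign(List.of(tp0, tp1));

            // Position each partition: one from an explicit offset, one from the beginning.
            consumer.seek(tp0, 42L);
            consumer.seekToBeginning(List.of(tp1));

            // Stopping offsets can be fetched up front, as acquireAndSetStoppingOffsets does.
            Map<TopicPartition, Long> end = consumer.endOffsets(List.of(tp0, tp1));
            System.out.println("stop at: " + end);
        }
    }
}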
The SplitFetcher details will be filled in later.
That's all for today. Next time: watermarks.