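Continuing from the previous post
This post mainly fills a hole left by the previous one: the threading model used to fetch data.
The points to consider are: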
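- 1. Where is the fetched data stored, and if there are multiple threads, how is it stored? As the source of the data, should it follow a producer-consumer model so that unchecked production does not bloat memory?
- 2. Besides fetching data, the thread also has to add new splits when the partition information changes. How is that scheduled?
- 3. Exception handling in the thread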
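Question 1
- If you have read the earlier posts in this series, you know that different threads hold data from different partitions, so ordering between threads does not matter. Within a single thread, however, elements must be stored and retrieved strictly in order, and because this is the source of the data, a proper producer-consumer model is required.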
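The code mainly lives in FutureCompletingBlockingQueue
It is a blocking queue, which means a thread can end up parked on it, so it also provides a wake-up facility that releases a thread from that blocked state.
I recommend reading the source directly; it is thoroughly commented. - The one thing worth pointing out here is that, unlike a typical producer-consumer implementation, the consumer side waits on a CompletableFuture. CompletableFuture.get() blocks the calling thread, and the queue takes advantage of that: when there is no data it creates a new, uncompleted CompletableFuture for the consumer to wait on, and when data arrives it completes that future with a null result.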
public T take() throws InterruptedException {
    T next;
    while ((next = poll()) == null) {
        // use the future to wait for availability to avoid busy waiting
        try {
            getAvailabilityFuture().get();
        } catch (ExecutionException | CompletionException e) {
            // this should never happen, but we propagate just in case
            throw new FlinkRuntimeException("exception in queue future completion", e);
        }
    }
    return next;
}
private void moveToAvailable() {
    final CompletableFuture<Void> current = currentFuture;
    // AVAILABLE is a static constant: a future that is already completed, with a null result.
    // public static final CompletableFuture<Void> AVAILABLE = getAvailableFuture();
    if (current != AVAILABLE) {
        currentFuture = AVAILABLE;
        current.complete(null);
    }
}
private void moveToUnAvailable() {
    if (currentFuture == AVAILABLE) {
        currentFuture = new CompletableFuture<>();
    }
}
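To make the availability-future idea concrete, here is a minimal sketch of my own, not the Flink implementation. The class name AvailabilityQueueSketch and its fields are hypothetical, capacity limits and wake-ups are left out, and a plain completedFuture stands in for AVAILABLE. It only shows how the future flips between "completed" and "pending" as data comes and goes.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: an unbounded queue whose consumers wait on a future instead of spinning.
final class AvailabilityQueueSketch<T> {
    // Shared, already-completed future that means "data is available".
    private static final CompletableFuture<Void> AVAILABLE =
            CompletableFuture.completedFuture(null);

    private final ReentrantLock lock = new ReentrantLock();
    private final ArrayDeque<T> queue = new ArrayDeque<>();
    // Starts out pending because the queue starts out empty.
    private CompletableFuture<Void> currentFuture = new CompletableFuture<>();

    CompletableFuture<Void> getAvailabilityFuture() {
        lock.lock();
        try {
            return currentFuture;
        } finally {
            lock.unlock();
        }
    }

    void put(T element) {
        lock.lock();
        try {
            queue.add(element);
            // Move to "available": swap in AVAILABLE and complete the pending future.
            if (currentFuture != AVAILABLE) {
                CompletableFuture<Void> pending = currentFuture;
                currentFuture = AVAILABLE;
                pending.complete(null);
            }
        } finally {
            lock.unlock();
        }
    }

    T poll() {
        lock.lock();
        try {
            T next = queue.poll();
            // Move to "unavailable" once the queue is drained.
            if (queue.isEmpty() && currentFuture == AVAILABLE) {
                currentFuture = new CompletableFuture<>();
            }
            return next;
        } finally {
            lock.unlock();
        }
    }

    T take() throws InterruptedException, ExecutionException {
        T next;
        while ((next = poll()) == null) {
            // Blocks until some producer completes the pending future.
            getAvailabilityFuture().get();
        }
        return next;
    }
}
```

A consumer that does not want to block can also attach a callback to getAvailabilityFuture() instead of calling take(), which is the main attraction of using a future over a plain Condition.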
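Question 2
The fetchTask and the AddSplitsTask run in the same thread, so the thread has to check: if the task queue holds another task (for example an AddSplitsTask), run that; otherwise run the fetchTask. The fetchTask can block, though, because once FutureCompletingBlockingQueue reaches its capacity, putting an element blocks the thread. A wakeUp is needed to break that block so that the AddSplitsTask can run.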
// The blocking code; the wait is on a Condition of a ReentrantLock
private void waitOnPut(int fetcherIndex) throws InterruptedException {
    maybeCreateCondition(fetcherIndex);
    Condition cond = putConditionAndFlags[fetcherIndex].condition();
    notFull.add(cond);
    cond.await();
}
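The interplay between a blocked put and a wake-up can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not Flink's code: the class WakeablePutQueueSketch is hypothetical, there is a single producer with one Condition and one flag rather than one per fetcher index, and awaitUninterruptibly() replaces the interruptible await() to keep the signature short.

```java
import java.util.ArrayDeque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a bounded queue whose blocked producer can be woken up without space freeing.
final class WakeablePutQueueSketch<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();
    private final ArrayDeque<T> queue = new ArrayDeque<>();
    private final int capacity;
    private boolean wakeUpFlag;

    WakeablePutQueueSketch(int capacity) {
        this.capacity = capacity;
    }

    // Returns true if the element was enqueued, false if the producer was woken up first.
    boolean put(T element) {
        lock.lock();
        try {
            while (queue.size() >= capacity) {
                if (wakeUpFlag) {
                    // Consume the flag and give up on this put so the caller can do other work.
                    wakeUpFlag = false;
                    return false;
                }
                notFull.awaitUninterruptibly();
            }
            queue.add(element);
            return true;
        } finally {
            lock.unlock();
        }
    }

    T poll() {
        lock.lock();
        try {
            T next = queue.poll();
            if (next != null) {
                // Space was freed, so let a waiting producer retry.
                notFull.signal();
            }
            return next;
        } finally {
            lock.unlock();
        }
    }

    void wakeUpPuttingThread() {
        lock.lock();
        try {
            wakeUpFlag = true;
            notFull.signal();
        } finally {
            lock.unlock();
        }
    }
}
```

The point of the boolean return value is that the producer learns its element was not enqueued, so it can keep the records and retry on a later run instead of losing them.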
The block is broken right after the AddSplitsTask is enqueued, by calling wakeUp() on the SplitFetcher
public void addSplits(List<SplitT> splitsToAdd) {
    // enqueue the task first
    enqueueTask(new AddSplitsTask<>(splitReader, splitsToAdd, assignedSplits));
    // then wake up
    wakeUp(true);
}
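The scheduling rule itself (queued tasks first, fetching otherwise) can be illustrated with a small single-threaded sketch. Everything here is hypothetical: FetcherLoopSketch, its log field, and the string splits are made up for illustration, and the wake-up is only marked by a comment because there is no second thread to wake.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: one loop iteration runs a queued task if there is one, else fetches.
final class FetcherLoopSketch {
    private final Queue<Runnable> taskQueue = new ArrayDeque<>();
    private final List<String> assignedSplits = new ArrayList<>();
    // Records what each iteration did, for demonstration only.
    final List<String> log = new ArrayList<>();

    void addSplits(List<String> splitsToAdd) {
        // Enqueue the task first ...
        taskQueue.add(() -> {
            assignedSplits.addAll(splitsToAdd);
            log.add("addSplits:" + splitsToAdd.size());
        });
        // ... and only then wake up. In a real multi-threaded fetcher this is where a
        // blocked fetch would be interrupted so the next iteration can see the new task.
    }

    void runOnce() {
        Runnable task = taskQueue.poll();
        if (task != null) {
            // Queued tasks such as adding splits take priority over fetching.
            task.run();
        } else {
            // Stand-in for the fetch task: "fetch" from the currently assigned splits.
            log.add("fetch:" + assignedSplits.size());
        }
    }
}
```

Calling runOnce(), then addSplits with two splits, then runOnce() twice leaves the log as fetch:0, addSplits:2, fetch:2, which mirrors the order the fetcher thread is meant to follow. Enqueuing before waking up matters: if the order were reversed, the woken thread could loop around, find an empty task queue, and start fetching again.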
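This wakeUp deserves a closer look. Besides waking a fetchTask that may be blocked while enqueuing data, it covers a second case: if the Kafka consumer is about to read or is in the middle of reading, the wakeUp interrupts that read, and when the AddSplitsTask then runs, the partition assignment gets updated.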
void wakeUp(boolean taskOnly) {
    // Synchronize to make sure the wake up only works for the current invocation of runOnce().
    synchronized (wakeUp) {
        // Do not wake up repeatedly.
        wakeUp.set(true);
        // Now the wakeUp flag is set.
        SplitFetcherTask currentTask = runningTask;
        if (isRunningTask(currentTask)) {
            // The running task may have missed our wakeUp flag and running, wake it up.
            LOG.debug("Waking up running task {}", currentTask);
            currentTask.wakeUp();
        } else if (!taskOnly) {
            // The task has not started running yet, and it will not run for this
            // runOnce() invocation due to the wakeUp flag. But we might have to
            // wake up the fetcher thread in case it is blocking on the task queue.
            // Only wake up when the thread has started and there is no running task.
            LOG.debug("Waking up fetcher thread.");
            taskQueue.add(WAKEUP_TASK);
        }
    }
}
The wakeUp in the fetchTask
public void wakeUp() {
    // Set the wakeup flag first.
    wakeup = true;
    if (lastRecords == null) {
        // Two possible cases:
        // 1. The splitReader is reading or is about to read the records.
        // 2. The records has been enqueued and set to null.
        // In case 1, we just wakeup the split reader. In case 2, the next run might be skipped.
        // In any case, the records won't be enqueued in the ongoing run().
        splitReader.wakeUp();
    } else {
        // The task might be blocking on enqueuing the records, just interrupt.
        elementsQueue.wakeUpPuttingThread(fetcherIndex);
    }
}
done~