Flink 1.13: A Source-Code Walkthrough of the New Backpressure Monitoring Metrics

zxfBdd

Original article: https://blog.csdn.net/zhaochengxuyuan1/article/details/120089098

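Backpressure is one of the more important of Flink's many monitoring metrics. It shows at a glance whether downstream tasks can process the data they receive in time. For details, see the "Monitoring Back Pressure" page in the official documentation.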

Note: the backpressure page in the 1.13 official documentation still describes the calculation used in 1.12.

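Up to 1.12, Flink decided whether a task was backpressured by sampling output stack traces. In 1.13 this changed to timing based on the task Mailbox, and the job graph in the UI was reworked: Flink now uses colors and numbers to show how busy and how backpressured each task is. Fully busy tasks are drawn in red and fully backpressured tasks in black, with shades in between for intermediate values.

In addition, the BackPressure tab in the task details gained two new metrics (see the figure below), Idle and Busy, which show how idle and how busy the task is. The rest of this article walks through how these two metrics are collected.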

[Figure: the BackPressure tab in the task details, showing the Idle and Busy metrics]
First, here is how the official documentation describes these two metrics:

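According to the official description, these two metrics report how many milliseconds of each second the task spends busy or idle. A natural guess is that the two values sum to 1000, and that the two percentages in the figure above sum to 100%. Let's go to the source code to see how the metrics are actually collected and whether they really are complementary.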

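The quickest way to find the metric in the source is to search the org.apache.flink.runtime.metrics package for idleTimeMsPerSecond, which leads straight to the class that holds it, TaskIOMetricGroup. The class name shows that the metric is treated as an IO metric. It is defined as follows (unrelated code removed for clarity):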

private final TimerGauge idleTimePerSecond;
private final Gauge busyTimePerSecond;
private final TimerGauge backPressuredTimePerSecond;

public TaskIOMetricGroup(TaskMetricGroup parent) {
    super(parent);
    this.idleTimePerSecond = gauge(MetricNames.TASK_IDLE_TIME, new TimerGauge());
    this.backPressuredTimePerSecond =
            gauge(MetricNames.TASK_BACK_PRESSURED_TIME, new TimerGauge());
    this.busyTimePerSecond = gauge(MetricNames.TASK_BUSY_TIME, this::getBusyTimePerSecond);
}

idleTimePerSecond and backPressuredTimePerSecond are both of type TimerGauge, and each is created directly with new in the constructor. busyTimePerSecond instead gets its value by calling getBusyTimePerSecond.

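Note: the gauge method does not affect how the metric is defined; it only binds the metric name to the metric value, so it is not shown here.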

getBusyTimePerSecond is implemented as follows:

private double getBusyTimePerSecond() {
    double busyTime = idleTimePerSecond.getValue() + backPressuredTimePerSecond.getValue();
    return busyTimeEnabled ? 1000.0 - Math.min(busyTime, 1000.0) : Double.NaN;
}

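The value of busyTimePerSecond depends on busyTimeEnabled, which is passed in from outside through a setter. Two classes call that setter: SourceStreamTask passes false and StreamTask passes true. So for a source task busyTimeMsPerSecond is NaN, while for any other stream task it is computed from idleTimePerSecond and backPressuredTimePerSecond as 1000 - min(idle + backPressured, 1000). Busy and idle therefore sum to 1000 only when backPressuredTimePerSecond is 0; in general it is busy, idle, and backpressured time together that add up to 1000. The guess above is thus only partly confirmed.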
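To make the arithmetic concrete, here is a minimal standalone sketch of the same formula. The class name BusyTimeSketch and the sample readings are hypothetical and only for illustration; nothing here is Flink API.

```java
final class BusyTimeSketch {
    /**
     * Same formula as the getBusyTimePerSecond excerpt; inputs are milliseconds per second.
     * The parameter names are illustrative, not Flink's.
     */
    static double busyTimePerSecond(boolean busyTimeEnabled, long idleMs, long backPressuredMs) {
        double notBusy = idleMs + backPressuredMs;
        return busyTimeEnabled ? 1000.0 - Math.min(notBusy, 1000.0) : Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println(busyTimePerSecond(true, 300, 0));    // 700.0: busy + idle = 1000
        System.out.println(busyTimePerSecond(true, 300, 200));  // 500.0: busy + idle = 800
        System.out.println(busyTimePerSecond(true, 800, 400));  // 0.0: the sum is capped at 1000
        System.out.println(busyTimePerSecond(false, 300, 0));   // NaN: busy time disabled
    }
}
```

The second case is the one the "sum to 1000" guess misses: with 200 ms of backpressure, busy and idle add up to only 800.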

Next, focus on idleTimePerSecond and how its value is computed. It is handed to other objects through a getter, and that getter is called in only one place, in StreamTask:

if (!recordWriter.isAvailable()) {
    timer = ioMetrics.getBackPressuredTimePerSecond();
    resumeFuture = recordWriter.getAvailableFuture();
} else {
    timer = ioMetrics.getIdleTimeMsPerSecond();
    resumeFuture = inputProcessor.getAvailableFuture();
}
assertNoException(
        resumeFuture.thenRun(
                new ResumeWrapper(controller.suspendDefaultAction(timer), timer)));

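StreamTask picks the backpressure timer when the output is unavailable and idleTimePerSecond otherwise. It then uses the timer in two places: as the argument to suspendDefaultAction and as an argument to the ResumeWrapper constructor. First, here is what ResumeWrapper does with it: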

private static class ResumeWrapper implements Runnable {
    private final Suspension suspendedDefaultAction;
    private final TimerGauge timer;

    public ResumeWrapper(Suspension suspendedDefaultAction, TimerGauge timer) {
        this.suspendedDefaultAction = suspendedDefaultAction;
        timer.markStart();
        this.timer = timer;
    }

    @Override
    public void run() {
        timer.markEnd();
        suspendedDefaultAction.resume();
    }
}

The constructor calls markStart on the timer, and run calls markEnd. In other words, timing starts when the object is created and stops when run begins.
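The "start in the constructor, stop in run" pattern can be tried in isolation. The sketch below is an assumption-laden stand-in, not Flink code: ManualTimer is a hypothetical replacement for TimerGauge driven by a hand-advanced clock, and a plain CompletableFuture plays the role of the availability future.

```java
import java.util.concurrent.CompletableFuture;

final class ResumeTimingSketch {
    /** Hypothetical stand-in for TimerGauge, driven by a manually advanced clock. */
    static final class ManualTimer {
        long nowMs = 1;    // starts at 1 because 0 means "not measuring"
        long startMs = 0;
        long totalMs = 0;

        void markStart() { if (startMs == 0) startMs = nowMs; }
        void markEnd() { if (startMs != 0) { totalMs += nowMs - startMs; startMs = 0; } }
    }

    /** Starts timing when constructed and stops when run, like ResumeWrapper. */
    static final class Wrapper implements Runnable {
        private final ManualTimer timer;

        Wrapper(ManualTimer timer) {
            timer.markStart();
            this.timer = timer;
        }

        @Override
        public void run() {
            timer.markEnd();
        }
    }

    /** Returns the time measured between registering the callback and completing the future. */
    static long measureWait(long waitMs) {
        ManualTimer timer = new ManualTimer();
        CompletableFuture<Void> available = new CompletableFuture<>();
        available.thenRun(new Wrapper(timer)); // wrapper is constructed here, so timing starts
        timer.nowMs += waitMs;                 // simulated wait for input or output
        available.complete(null);              // runs the wrapper, which stops the timer
        return timer.totalMs;
    }

    public static void main(String[] args) {
        System.out.println(measureWait(40)); // 40
    }
}
```

Because the wrapper is built as an argument to thenRun, the measured span covers exactly the period during which the future is still incomplete.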

Now the other use, the implementation of suspendDefaultAction:

@Override
public MailboxDefaultAction.Suspension suspendDefaultAction(
        TimerGauge suspensionIdleTimer) {
    return mailboxProcessor.suspendDefaultAction(suspensionIdleTimer);
}

This method simply delegates to mailboxProcessor.suspendDefaultAction. Its implementation, together with two helper methods that use the same field, is:

private void maybePauseIdleTimer() {
    if (suspendedDefaultAction != null && suspendedDefaultAction.suspensionTimer != null) {
        suspendedDefaultAction.suspensionTimer.markEnd();
    }
}

private void maybeRestartIdleTimer() {
    if (suspendedDefaultAction != null && suspendedDefaultAction.suspensionTimer != null) {
        suspendedDefaultAction.suspensionTimer.markStart();
    }
}

/**
 * Calling this method signals that the mailbox-thread should (temporarily) stop invoking the
 * default action, e.g. because there is currently no input available.
 */
private MailboxDefaultAction.Suspension suspendDefaultAction(
        @Nullable TimerGauge suspensionTimer) {

    checkState(
            mailbox.isMailboxThread(),
            "Suspending must only be called from the mailbox thread!");

    checkState(suspendedDefaultAction == null, "Default action has already been suspended");
    if (suspendedDefaultAction == null) {
        suspendedDefaultAction = new DefaultActionSuspension(suspensionTimer);
        ensureControlFlowSignalCheck();
    }

    return suspendedDefaultAction;
}

So mailboxProcessor.suspendDefaultAction wraps the timer it receives in a DefaultActionSuspension and stores it in the field suspendedDefaultAction. That field is used by maybePauseIdleTimer, which pauses the timer, and maybeRestartIdleTimer, which restarts it. These two methods are called as follows:

private boolean processMailsWhenDefaultActionUnavailable() throws Exception {
    boolean processedSomething = false;
    Optional<Mail> maybeMail;
    while (isDefaultActionUnavailable() && isNextLoopPossible()) {
        maybeMail = mailbox.tryTake(MIN_PRIORITY);
        if (!maybeMail.isPresent()) {
            maybeMail = Optional.of(mailbox.take(MIN_PRIORITY));
        }
        maybePauseIdleTimer();
        maybeMail.get().run();
        maybeRestartIdleTimer();
        processedSomething = true;
    }
    return processedSomething;
}

private boolean processMailsNonBlocking(boolean singleStep) throws Exception {
    long processedMails = 0;
    Optional<Mail> maybeMail;

    while (isNextLoopPossible() && (maybeMail = mailbox.tryTakeFromBatch()).isPresent()) {
        if (processedMails++ == 0) {
            maybePauseIdleTimer();
        }
        maybeMail.get().run();
        if (singleStep) {
            break;
        }
    }
    if (processedMails > 0) {
        maybeRestartIdleTimer();
        return true;
    } else {
        return false;
    }
}

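Both methods pause the timer when they start processing mail and restart it once the mail has been processed, so time spent running mail is not counted as idle time.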
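The effect of pausing around mail can be seen in a small simulation. This is a minimal sketch under stated assumptions: the clock is advanced by hand, the timer logic is a simplified copy of the markStart/markEnd idea, and the class and method names (MailboxIdleSketch, runMail, simulate) are hypothetical.

```java
final class MailboxIdleSketch {
    private long nowMs = 1;   // simulated clock; 0 is reserved for "timer not running"
    private long startMs = 0;
    private long idleMs = 0;

    private void markStart() { if (startMs == 0) startMs = nowMs; }
    private void markEnd() { if (startMs != 0) { idleMs += nowMs - startMs; startMs = 0; } }
    private void advance(long ms) { nowMs += ms; }

    /** Runs one mail with the idle timer paused, as the two loops above do. */
    private void runMail(Runnable mail) {
        markEnd();    // plays the role of maybePauseIdleTimer
        mail.run();
        markStart();  // plays the role of maybeRestartIdleTimer
    }

    /** Simulates one suspension: 30 ms idle, a 20 ms mail, 10 ms idle, then resume. */
    static long simulate() {
        MailboxIdleSketch s = new MailboxIdleSketch();
        s.markStart();                   // default action suspended: idle timing begins
        s.advance(30);                   // nothing to do for 30 ms
        s.runMail(() -> s.advance(20));  // a mail that does 20 ms of real work
        s.advance(10);                   // idle again for 10 ms
        s.markEnd();                     // input available again: default action resumes
        return s.idleMs;
    }

    public static void main(String[] args) {
        System.out.println(simulate()); // 40
    }
}
```

Of the 60 ms that elapse while the default action is suspended, only the 40 ms in which no mail is running are counted.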

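Note: for the design and implementation of the Mailbox itself, see the article "Flink's Mailbox-Based StreamTask Threading Model" (Flink 基于 MailBox 实现的 StreamTask 线程模型).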

Having seen how idleTimePerSecond is used, let's look at how it is implemented. It is a TimerGauge, and the relevant parts of that class are:

public TimerGauge() {
    this(SystemClock.getInstance());
}

public TimerGauge(Clock clock) {
    this.clock = clock;
}

public synchronized void markStart() {
    if (currentMeasurementStart == 0) {
        currentMeasurementStart = clock.absoluteTimeMillis();
    }
}

public synchronized void markEnd() {
    if (currentMeasurementStart != 0) {
        currentCount += clock.absoluteTimeMillis() - currentMeasurementStart;
        currentMeasurementStart = 0;
    }
}

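idleTimePerSecond is created with the no-argument TimerGauge() constructor, so clock is a SystemClock, whose absoluteTimeMillis() returns the current timestamp. When timing starts, the current timestamp is stored in currentMeasurementStart. When timing stops, the timestamp is read again, currentMeasurementStart is subtracted from it, the difference is added to currentCount, and currentMeasurementStart is reset to 0. Repeated start and stop cycles thus accumulate the time the task has spent idle (or, for backPressuredTimePerSecond, backpressured), and that accumulated time is what indicates how heavily loaded the task is.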
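The excerpt does not show how the accumulated currentCount turns into a per-second value. The sketch below shows one plausible way, stated as an assumption rather than as Flink's implementation: a periodic update divides the accumulated milliseconds by the interval length in seconds and clamps the result to the range 0 to 1000. The update method, its intervalSeconds parameter, the ManualClock helper, and the 5-second interval in the demo are all hypothetical.

```java
final class TimerGaugeSketch {
    interface Clock { long absoluteTimeMillis(); }

    /** Hand-advanced clock so the example is deterministic. */
    static final class ManualClock implements Clock {
        long nowMs = 1; // starts at 1 because 0 means "not measuring"
        @Override public long absoluteTimeMillis() { return nowMs; }
    }

    private final Clock clock;
    private long currentMeasurementStart;
    private long currentCount;
    private long valuePerSecond;

    TimerGaugeSketch(Clock clock) { this.clock = clock; }

    synchronized void markStart() {
        if (currentMeasurementStart == 0) {
            currentMeasurementStart = clock.absoluteTimeMillis();
        }
    }

    synchronized void markEnd() {
        if (currentMeasurementStart != 0) {
            currentCount += clock.absoluteTimeMillis() - currentMeasurementStart;
            currentMeasurementStart = 0;
        }
    }

    /** Assumed periodic step: fold in a running measurement, then average over the interval. */
    synchronized void update(int intervalSeconds) {
        if (currentMeasurementStart != 0) {
            long now = clock.absoluteTimeMillis();
            currentCount += now - currentMeasurementStart;
            currentMeasurementStart = now; // keep measuring into the next interval
        }
        valuePerSecond = Math.max(0, Math.min(currentCount / intervalSeconds, 1000));
        currentCount = 0;
    }

    synchronized long getValue() { return valuePerSecond; }

    /** 1500 ms timed, 2500 ms untimed, 1000 ms timed and still running at the update. */
    static long demo() {
        ManualClock clock = new ManualClock();
        TimerGaugeSketch gauge = new TimerGaugeSketch(clock);
        gauge.markStart();
        clock.nowMs += 1500;
        gauge.markEnd();
        clock.nowMs += 2500;
        gauge.markStart();
        clock.nowMs += 1000;
        gauge.update(5); // (1500 + 1000) / 5 = 500 ms per second
        return gauge.getValue();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 500
    }
}
```

Under these assumptions, a task that was idle for 2500 ms of a 5-second window would report 500 ms of idle time per second, which the UI would show as 50%.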