问题
在测试环境部署的flink,无法成功的存储checkpoints。或者使用flink命令执行savepoint也无法成功保存。hdsf中创建了对应的目录,却没有写任何文件。
通过flink控制台可以看到,job的checkpoint状态处于IN_PROGRESS状态。
执行flink savepoint也可以看到输出(log4j-cli.properties中开启DEBUG级别),不断获得状态是IN_PROGRESS直到超时。
|
通过flink控制台查看Job Manager的日志(log4j.properties中开启DEBUG级别,并且增加了%t关于线程名称的输出)可以看到"Triggering checkpoint 1 (type=SAVEPOINT) @ 1637282069853 for job 9e97360ca975514b4a91369b05431267."触发了checkpoint后,再无其他相关日志或异常
|
同样的Job在10.0.11.66部署的单机flink可以checkpoint,在开发本机上可以checkpoint,在10.0.11.21-24部署的yarn集群上不能checkpoint。所以怀疑yarn或其他环境问题导致。
解决过程
查看Flink源码,找到日志中最后一条有用输出“Triggering checkpoint”的源码位置。org.apache.flink.runtime.checkpoint.CheckpointCoordinator#createPendingCheckpoint
org.apache.flink.runtime.checkpoint.CheckpointCoordinator#createPendingCheckpoint
private PendingCheckpoint createPendingCheckpoint(
long timestamp,
CheckpointProperties props,
CheckpointPlan checkpointPlan,
boolean isPeriodic,
long checkpointID,
CheckpointStorageLocation checkpointStorageLocation,
CompletableFuture<CompletedCheckpoint> onCompletionPromise) {
synchronized (lock) {
try {
// since we haven't created the PendingCheckpoint yet, we need to check the
// global state here.
preCheckGlobalState(isPeriodic);
} catch (Throwable t) {
throw new CompletionException(t);
}
}
final PendingCheckpoint checkpoint =
new PendingCheckpoint(
job,
checkpointID,
timestamp,
checkpointPlan,
OperatorInfo.getIds(coordinatorsToCheckpoint),
masterHooks.keySet(),
props,
checkpointStorageLocation,
onCompletionPromise);
trackPendingCheckpointStats(checkpoint);
synchronized (lock) {
pendingCheckpoints.put(checkpointID, checkpoint);
ScheduledFuture<?> cancellerHandle =
timer.schedule(
new CheckpointCanceller(checkpoint),
checkpointTimeout,
TimeUnit.MILLISECONDS);
if (!checkpoint.setCancellerHandle(cancellerHandle)) {
// checkpoint is already disposed!
cancellerHandle.cancel(false);
}
}
LOG.info(
"Triggering checkpoint {} (type={}) @ {} for job {}.",
checkpointID,
checkpoint.getProps().getCheckpointType(),
timestamp,
job);
return checkpoint;
}
“Triggering checkpoint”的成功输出,说明这里没有问题,查询调用者代码org.apache.flink.runtime.checkpoint.CheckpointCoordinator#startTriggeringCheckpoint
org.apache.flink.runtime.checkpoint.CheckpointCoordinator#startTriggeringCheckpoint
private void startTriggeringCheckpoint(CheckpointTriggerRequest request) {
try {
synchronized (lock) {
preCheckGlobalState(request.isPeriodic);
}
// we will actually trigger this checkpoint!
Preconditions.checkState(!isTriggering);
isTriggering = true;
final long timestamp = System.currentTimeMillis();
CompletableFuture<CheckpointPlan> checkpointPlanFuture =
checkpointPlanCalculator.calculateCheckpointPlan();
final CompletableFuture<PendingCheckpoint> pendingCheckpointCompletableFuture =
checkpointPlanFuture
.thenApplyAsync(
plan -> {
try {
CheckpointIdAndStorageLocation
checkpointIdAndStorageLocation =
initializeCheckpoint(
request.props,
request.externalSavepointLocation);
return new Tuple2<>(
plan, checkpointIdAndStorageLocation);
} catch (Throwable e) {
throw new CompletionException(e);
}
},
executor)
.thenApplyAsync(
(checkpointInfo) ->
createPendingCheckpoint(
timestamp,
request.props,
checkpointInfo.f0,
request.isPeriodic,
checkpointInfo.f1.checkpointId,
checkpointInfo.f1.checkpointStorageLocation,
request.getOnCompletionFuture()),
timer);
final CompletableFuture<?> coordinatorCheckpointsComplete =
pendingCheckpointCompletableFuture.thenComposeAsync(
(pendingCheckpoint) ->
OperatorCoordinatorCheckpoints
.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
coordinatorsToCheckpoint,
pendingCheckpoint,
timer),
timer);
// We have to take the snapshot of the master hooks after the coordinator checkpoints
// has completed.
// This is to ensure the tasks are checkpointed after the OperatorCoordinators in case
// ExternallyInducedSource is used.
final CompletableFuture<?> masterStatesComplete =
coordinatorCheckpointsComplete.thenComposeAsync(
ignored -> {
// If the code reaches here, the pending checkpoint is guaranteed to
// be not null.
// We use FutureUtils.getWithoutException() to make compiler happy
// with checked
// exceptions in the signature.
PendingCheckpoint checkpoint =
FutureUtils.getWithoutException(
pendingCheckpointCompletableFuture);
return snapshotMasterState(checkpoint);
},
timer);
FutureUtils.assertNoException(
CompletableFuture.allOf(masterStatesComplete, coordinatorCheckpointsComplete)
.handleAsync(
(ignored, throwable) -> {
final PendingCheckpoint checkpoint =
FutureUtils.getWithoutException(
pendingCheckpointCompletableFuture);
Preconditions.checkState(
checkpoint != null || throwable != null,
"Either the pending checkpoint needs to be created or an error must have been occurred.");
if (throwable != null) {
// the initialization might not be finished yet
if (checkpoint == null) {
onTriggerFailure(request, throwable);
} else {
onTriggerFailure(checkpoint, throwable);
}
} else {
if (checkpoint.isDisposed()) {
onTriggerFailure(
checkpoint,
new CheckpointException(
CheckpointFailureReason
.TRIGGER_CHECKPOINT_FAILURE,
checkpoint.getFailureCause()));
} else {
// no exception, no discarding, everything is OK
final long checkpointId =
checkpoint.getCheckpointId();
snapshotTaskState(
timestamp,
checkpointId,
checkpoint.getCheckpointStorageLocation(),
request.props,
checkpoint
.getCheckpointPlan()
.getTasksToTrigger());
coordinatorsToCheckpoint.forEach(
(ctx) ->
ctx.afterSourceBarrierInjection(
checkpointId));
// It is possible that the tasks has finished
// checkpointing at this point.
// So we need to complete this pending checkpoint.
if (!maybeCompleteCheckpoint(checkpoint)) {
return null;
}
onTriggerSuccess();
}
}
return null;
},
timer)
.exceptionally(
error -> {
if (!isShutdown()) {
throw new CompletionException(error);
} else if (findThrowable(
error, RejectedExecutionException.class)
.isPresent()) {
LOG.debug("Execution rejected during shutdown");
} else {
LOG.warn("Error encountered during shutdown", error);
}
return null;
}));
} catch (Throwable throwable) {
onTriggerFailure(request, throwable);
}
}
这里是重点,然后疯狂的给源码加LOG输出,几乎每行后面,每一个子方法里面都疯狂的加LOG输出。然后编译flink部署测试环境10.0.11.24(只留下这一台实验),编译一次20分钟,折腾了N回。
pendingCheckpointCompletableFuture
.thenComposeAsync(
(pendingCheckpoint) -> {
LOG.warn(
"pendingCheckpointCompletableFuture.thenComposeAsync >>>>>>>> pendingCheckpoint:{},Thread.currentThread():{}",
pendingCheckpoint,
Thread.currentThread().getId());
return OperatorCoordinatorCheckpoints
.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
coordinatorsToCheckpoint,
pendingCheckpoint,
timer);
},
timer);
最终发现org.apache.flink.runtime.checkpoint.OperatorCoordinatorCheckpoints#triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion中走了之后,其中返回的CompletableFuture没有被thenComposeAsync执行。
org.apache.flink.runtime.checkpoint.OperatorCoordinatorCheckpoints#triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion
public static CompletableFuture<Void> triggerAndAcknowledgeAllCoordinatorCheckpoints(
final Collection<OperatorCoordinatorCheckpointContext> coordinators,
final PendingCheckpoint checkpoint,
final Executor acknowledgeExecutor)
throws Exception {
LOG.warn(
"triggerAndAcknowledgeAllCoordinatorCheckpoints >>>>> checkpoint:{}",
checkpoint);
try {
final CompletableFuture<AllCoordinatorSnapshots> snapshots =
triggerAllCoordinatorCheckpoints(coordinators, checkpoint.getCheckpointId());
LOG.warn(
"triggerAndAcknowledgeAllCoordinatorCheckpoints >>>>> snapshots:{},acknowledgeExecutor:{}",
snapshots,
acknowledgeExecutor);
// TODO: 2021/11/18 下面没有走 ,为了看有没有错误导致不执行,源码的thenAcceptAsync替换成handleAsync
return snapshots.handleAsync(
(allSnapshots, err) -> {
if (err != null) {
LOG.error(
"snapshots.thenAcceptAsync >>>>> err",
err);
} else {
try {
LOG.warn(
"snapshots.thenAcceptAsync >>>>> checkpoint:{}, allSnapshots.snapshots:{}",
checkpoint,
allSnapshots.snapshots);
acknowledgeAllCoordinators(checkpoint, allSnapshots.snapshots);
} catch (Exception e) {
LOG.error("snapshots.thenAcceptAsync >>>>> Exception", e);
throw new CompletionException(e);
}
}
return null;
}, acknowledgeExecutor);
} catch (Exception ex) {
LOG.error(
"triggerAndAcknowledgeAllCoordinatorCheckpoints >>>>> Exception",
ex);
throw ex;
}
}
public static CompletableFuture<Void>
triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
final Collection<OperatorCoordinatorCheckpointContext> coordinators,
final PendingCheckpoint checkpoint,
final Executor acknowledgeExecutor)
throws CompletionException {
try {
LOG.warn(
"OperatorCoordinatorCheckpoints.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion >>>>> checkpoint:{}",
checkpoint);
return triggerAndAcknowledgeAllCoordinatorCheckpoints(
coordinators, checkpoint, acknowledgeExecutor);
} catch (Exception e) {
LOG.error(
"OperatorCoordinatorCheckpoints.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion >>>>> Exception",
e);
throw new CompletionException(e);
}
}
返回snapshots.thenAcceptAsync产生的CompletableFuture没有执行,不知道卡住在哪里了。
return snapshots.thenAcceptAsync(
(allSnapshots) -> {
try {
acknowledgeAllCoordinators(checkpoint, allSnapshots.snapshots);
} catch (Exception e) {
throw new CompletionException(e);
}
},
acknowledgeExecutor);
注意到日志输出时,输出的日志中线程名称都是“Checkpoint Timer”
|
在测试环境上用arthas查查这个“Checkpoint Timer”干啥了
|
果然这个线程在奇怪的傻等着“WAITING on java.util.concurrent.CompletableFuture”,为什么会这么等着呢。继续观察源码。
里面:
return snapshots.thenAcceptAsync(
(allSnapshots) -> {
try {
acknowledgeAllCoordinators(checkpoint, allSnapshots.snapshots);
} catch (Exception e) {
throw new CompletionException(e);
}
},
acknowledgeExecutor);
外面:
final CompletableFuture<?> coordinatorCheckpointsComplete =
pendingCheckpointCompletableFuture.thenComposeAsync(
(pendingCheckpoint) ->
OperatorCoordinatorCheckpoints
.triggerAndAcknowledgeAllCoordinatorCheckpointsWithCompletion(
coordinatorsToCheckpoint,
pendingCheckpoint,
timer),
timer);
观察thenComposeAsync内外都使用了同一个Executor参数执行,追查一下这个timer怎么来的,查到org.apache.flink.runtime.executiongraph.DefaultExecutionGraph#enableCheckpointing方法创建CheckpointCoordinator实例的时候。
org.apache.flink.runtime.executiongraph.DefaultExecutionGraph#enableCheckpointing
checkpointCoordinatorTimer =
Executors.newSingleThreadScheduledExecutor(
new DispatcherThreadFactory(
Thread.currentThread().getThreadGroup(), "Checkpoint Timer"));
// create the coordinator that triggers and commits checkpoints and holds the state
checkpointCoordinator =
new CheckpointCoordinator(
jobInformation.getJobId(),
chkConfig,
operatorCoordinators,
checkpointIDCounter,
checkpointStore,
checkpointStorage,
ioExecutor,
checkpointsCleaner,
new ScheduledExecutorServiceAdapter(checkpointCoordinatorTimer),
SharedStateRegistry.DEFAULT_FACTORY,
failureManager,
createCheckpointPlanCalculator(),
new ExecutionAttemptMappingProvider(getAllExecutionVertices()));
发现这个叫timer的Executor是Executors.newSingleThreadScheduledExecutor创建的单线程执行器。那问题会不会出在单线程上,这个执行器只要一个线程,外面的异步执行已经用了,而里面的异步也需要这个执行器提供一个线程,然后没线程用了。
我自己写了一套例子,然而本地执行证明这个猜想并不成立呀,没有卡住。不死心网上搜索相关问题,果然有人同样这么问,并且是有卡住的可能
How to use CompletableFuture.thenComposeAsync()?
我在将他写的例子在本地执行,但是仍然不会卡住。突然想起来问题是测试环境才有的
public class CompletableFutureTest {
public static void main(String[] args) {
ScheduledExecutorService singleThreadExecutor = Executors.newSingleThreadScheduledExecutor();
CompletableFuture.runAsync(() ->
{
System.out.println("Task 1. Thread: " + Thread.currentThread().getId());
}, singleThreadExecutor).thenComposeAsync((Void unused) ->
{
return CompletableFuture.runAsync(() ->
{
System.out.println("Task 2. Thread: " + Thread.currentThread().getId());
}, singleThreadExecutor);
}, singleThreadExecutor).join();
System.out.println("finished");
// 我写的例子更模仿flink的checkpoint代码,其实问题只在thenComposeAsync,都一样
// CompletableFuture<?> cf = CompletableFuture
// .supplyAsync(() -> "aaaaa", scheduledExecutorService)
// .thenComposeAsync((str) -> {
// return CompletableFuture.runAsync(() -> {
// try {
// Thread.sleep(2000);
// } catch (InterruptedException e) {
// e.printStackTrace();
// }
// System.out.println(str.concat(" bbbbbbbbbbbb"));
// }, scheduledExecutorService);
// }, scheduledExecutorService);
// CompletableFuture
// .allOf(cf.thenApplyAsync((str) -> {
// try {
// Thread.sleep(5000);
// } catch (InterruptedException e) {
// e.printStackTrace();
// }
// System.out.println("ddddddd");
// return "ccccc";
// }, scheduledExecutorService), cf)
// .handleAsync((str, err) -> {
// System.out.println(str);
// return null;
// }, scheduledExecutorService).join();
// scheduledExecutorService.shutdown();
}
}
然后拿到10.0.11.24出问题的测试环境执行,果然卡住了
然后拿到没有问题的10.0.11.66上执行
对比JDK版本 问题机器24:
没问题机器66:
也就是说某些低版本的jdk存在CompletableFuture会被Executors线程数不够卡死的问题。较高版本的JDK解决了这个问题。
在问题机器24上更换JDK版本
然而不行,查看Job Manager的日志JAVA_HOME仍然指向老的jdk
|
试了好几种方式yarn环境里的JAVA_HOME: /usr/lib/jvm/j2sdk1.8-oracle仍然指向老版本的JDK,为了着急看效果,直接将新版jdk文件覆盖/usr/lib/jvm/j2sdk1.8-oracle,然后重试。
savepoint成功!!!!
|