背景
最近看了一个flink相关的issue,主要是在heartbeat来带上TaskManager的status作为payload,主要是为了解决TaskExecutor#updateTaskExecutionState会因为暂时的网络异常原因导致将TM的状态通知到JM失败的问题,如果是terminal state的通知失败会导致JM无法感知TM的结束。具体讨论细节请看issue
issue地址: https://issues.apache.org/jira/browse/FLINK-17075
flink heartbeat
借着这个机制顺便先研究下TM与JM之间的心跳机制。先看一下JM处理心跳的方法JobMasterGateway#heartbeatFromTaskManager
。这是一个RPC方法,看一下在TaskManager的那里会调用他,唯一的调用地方是TaskExecutor#establishJobManagerConnection
private void establishJobManagerConnection(JobTable.Job job, final JobMasterGateway jobMasterGateway, JMTMRegistrationSuccess registrationSuccess) {
...
// 在注册到jobManagerHeartbeatManager里的HeartbeatTarget的receiveHeartbeat方法里调用了向JM发送心跳的方法
// monitor the job manager as heartbeat target
jobManagerHeartbeatManager.monitorTarget(jobManagerResourceID, new HeartbeatTarget<TaskExecutorToJobManagerHeartbeatPayload>() {
@Override
public void receiveHeartbeat(ResourceID resourceID, TaskExecutorToJobManagerHeartbeatPayload payload) {
jobMasterGateway.heartbeatFromTaskManager(resourceID, payload);
}
@Override
public void requestHeartbeat(ResourceID resourceID, TaskExecutorToJobManagerHeartbeatPayload payload) {
// request heartbeat will never be called on the task manager side
}
});
internalOfferSlotsToJobManager(establishedConnection);
}
接下来我们看下被注册到HeartbeatManager里的HeartbeatTarget的receiveHeartbeat方法什么时候被调用。直接看一下TaskExecutor
中的jobManagerHeartbeatManager实际上是HeartbeatManager的哪个子类,根据初始化部分的代码我们可以看到是HeartbeatManagerImpl
的实例。我们看下其中调用HeartbeatTarget#receiveHeartbeat
方法的地方
@Override
public void requestHeartbeat(final ResourceID requestOrigin, I heartbeatPayload) {
if (!stopped) {
log.debug("Received heartbeat request from {}.", requestOrigin);
// 获取被注册的HeartbeatTarget