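Series index: Flink 1.13 Source Code Analysis
Related chapter: Flink 1.13 Source Code Analysis: TaskManager Startup, Initializing the TaskExecutor
Related chapter: Flink 1.13 Source Code Analysis: TaskManager Startup Overview
Preface
The previous chapter outlined the TaskExecutor startup flow, which consists of the following steps:
- Start the service that monitors the ResourceManager
- Start the taskSlotTable service
- Start the service that monitors the JobMaster
- Start the file cache service
This chapter begins with the first of these: starting the service that monitors the ResourceManager.
I. Starting resourceManagerLeaderRetriever
1.1 Starting the listener service
Let's start with what this service does. As the variable name suggests, it retrieves the address of the ResourceManager leader and registers a listener for changes to it. Once the address is known, the current TaskExecutor starts registering with the ResourceManager. If registration fails, an error is reported and the JVM is shut down. If it succeeds, the TaskExecutor starts exchanging heartbeats with the ResourceManager and reports its slot resources to it.
With that overview in place, let's look at the details in the code, starting with resourceManagerLeaderRetriever.start and choosing the DefaultLeaderRetrievalService implementation: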
@Override
public void start(LeaderRetrievalListener listener) throws Exception {
checkNotNull(listener, "Listener must not be null.");
Preconditions.checkState(
leaderListener == null,
"DefaultLeaderRetrievalService can " + "only be started once.");
synchronized (lock) {
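// Initialize the leader listener
leaderListener = listener;
// TODO New in 1.12: older versions had no wrapper here and called NodeCache.start() directly to start watching
// TODO Anything that has to register with ZooKeeper and read information from it is wrapped in a LeaderRetrievalDriver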
leaderRetrievalDriver =
leaderRetrievalDriverFactory.createLeaderRetrievalDriver(
this, new LeaderRetrievalFatalErrorHandler());
LOG.info("Starting DefaultLeaderRetrievalService with {}.", leaderRetrievalDriver);
running = true;
}
}
This method stores the listener and wraps everything that has to be read from ZooKeeper in a leaderRetrievalDriver object. Step into createLeaderRetrievalDriver and choose the ZooKeeper implementation:
@Override
public ZooKeeperLeaderRetrievalDriver createLeaderRetrievalDriver(
LeaderRetrievalEventHandler leaderEventHandler, FatalErrorHandler fatalErrorHandler)
throws Exception {
// TODO
return new ZooKeeperLeaderRetrievalDriver(
client, retrievalPath, leaderEventHandler, fatalErrorHandler);
}
Next, step into the ZooKeeperLeaderRetrievalDriver constructor:
public ZooKeeperLeaderRetrievalDriver(
CuratorFramework client,
String retrievalPath,
LeaderRetrievalEventHandler leaderRetrievalEventHandler,
FatalErrorHandler fatalErrorHandler)
throws Exception {
/*
TODO
1. CuratorFramework is the client API of the Curator framework for ZooKeeper; it wraps a ZooKeeper instance internally
*/
this.client = checkNotNull(client, "CuratorFramework client");
// TODO Curator's NodeCache plays the role of a ZooKeeper Watcher (it watches for data changes on a znode)
this.cache = new NodeCache(client, retrievalPath);
this.retrievalPath = checkNotNull(retrievalPath);
this.leaderRetrievalEventHandler = checkNotNull(leaderRetrievalEventHandler);
this.fatalErrorHandler = checkNotNull(fatalErrorHandler);
client.getUnhandledErrorListenable().addListener(this);
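// TODO Start watching
// TODO cache is a NodeCache that keeps a cached copy of the node data; when the cached data differs from the data in ZooKeeper, the listener's nodeChanged method is called back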
cache.getListenable().addListener(this);
cache.start();
client.getConnectionStateListenable().addListener(connectionStateListener);
running = true;
}
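Two main things happen here:
- A NodeCache object is created. It comes from the Curator framework and plays the role of a ZooKeeper watcher, watching for data changes on a znode.
- Watching is started. The NodeCache keeps a cached copy of the znode's data; when the cached data and the data in ZooKeeper differ, the nodeChanged method of the registered listener (this driver) is called.
Here is nodeChanged in the current class: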
@Override
public void nodeChanged() {
// TODO Read the leader information from ZooKeeper
retrieveLeaderInformationFromZooKeeper();
}
Next, step into retrieveLeaderInformationFromZooKeeper:
private void retrieveLeaderInformationFromZooKeeper() {
try {
LOG.debug("Leader node has changed.");
// TODO Read the znode data
final ChildData childData = cache.getCurrentData();
// TODO If data is present
if (childData != null) {
final byte[] data = childData.getData();
if (data != null && data.length > 0) {
ByteArrayInputStream bais = new ByteArrayInputStream(data);
ObjectInputStream ois = new ObjectInputStream(bais);
final String leaderAddress = ois.readUTF();
final UUID leaderSessionID = (UUID) ois.readObject();
// TODO Notify the handler that a leader address was obtained
leaderRetrievalEventHandler.notifyLeaderAddress(
LeaderInformation.known(leaderSessionID, leaderAddress));
return;
}
}
// TODO If there is no data, notify with empty
leaderRetrievalEventHandler.notifyLeaderAddress(LeaderInformation.empty());
} catch (Exception e) {
fatalErrorHandler.onFatalError(
new LeaderRetrievalException("Could not handle node changed event.", e));
ExceptionUtils.checkInterrupted(e);
}
}
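Three things happen here:
- The znode data is read from the NodeCache.
- If data is present, notifyLeaderAddress is called with the decoded leader address and session ID.
- If there is no data, notifyLeaderAddress is still called, but with the empty value LeaderInformation.empty().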
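The read order in retrieveLeaderInformationFromZooKeeper (readUTF for the address, then readObject for the UUID) implies a matching write order on the publishing side. The following is only a minimal sketch of that round trip using plain JDK streams. LeaderInfoCodec, encode, and decode are hypothetical names introduced for illustration and are not part of Flink; returning null stands in for LeaderInformation.empty().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.UUID;

/** Hypothetical helper (not a Flink class) mirroring the readUTF/readObject order shown above. */
final class LeaderInfoCodec {

    /** Writes the address first and the session ID second, so decode() can read them back in order. */
    static byte[] encode(String leaderAddress, UUID leaderSessionId) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeUTF(leaderAddress);
            oos.writeObject(leaderSessionId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    /** Returns {address, sessionId}, or null when the payload is missing or empty. */
    static Object[] decode(byte[] data) {
        if (data == null || data.length == 0) {
            return null; // the "no leader" case
        }
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            String address = ois.readUTF();
            UUID sessionId = (UUID) ois.readObject();
            return new Object[] {address, sessionId};
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not decode leader information", e);
        }
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        // The address string is a made-up example value.
        Object[] decoded = decode(encode("rm-host:6123", id));
        System.out.println(decoded[0] + " / " + decoded[1]);
    }
}
```

The empty-payload check comes before the stream is opened because ObjectInputStream expects a stream header and would fail on a zero-length array.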
Now let's look at notifyLeaderAddress:
@Override
@GuardedBy("lock")
public void notifyLeaderAddress(LeaderInformation leaderInformation) {
final UUID newLeaderSessionID = leaderInformation.getLeaderSessionID();
final String newLeaderAddress = leaderInformation.getLeaderAddress();
synchronized (lock) {
if (running) {
if (!Objects.equals(newLeaderAddress, lastLeaderAddress)
|| !Objects.equals(newLeaderSessionID, lastLeaderSessionID)) {
if (LOG.isDebugEnabled()) {
if (newLeaderAddress == null && newLeaderSessionID == null) {
LOG.debug(
"Leader information was lost: The listener will be notified accordingly.");
} else {
LOG.debug(
"New leader information: Leader={}, session ID={}.",
newLeaderAddress,
newLeaderSessionID);
}
}
lastLeaderAddress = newLeaderAddress;
lastLeaderSessionID = newLeaderSessionID;
// Notify the listener only when the leader is truly changed.
// TODO When the leader being retrieved is the ResourceManager's, this resolves to the ResourceManagerLeaderListener implementation in TaskExecutor
leaderListener.notifyLeaderAddress(newLeaderAddress, newLeaderSessionID);
}
} else {
if (LOG.isDebugEnabled()) {
LOG.debug(
"Ignoring notification since the {} has already been closed.",
leaderRetrievalDriver);
}
}
}
}
The core line is:
// Notify the listener only when the leader is truly changed.
// TODO When the leader being retrieved is the ResourceManager's, this resolves to the ResourceManagerLeaderListener implementation in TaskExecutor
leaderListener.notifyLeaderAddress(newLeaderAddress, newLeaderSessionID);
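The guard around this call, which forwards a notification only when the address or session ID actually changed, can be sketched in isolation. This is a simplified illustration under stated assumptions, not Flink's class: LeaderChangeFilter and onLeaderInformation are hypothetical names, and the listener is reduced to a BiConsumer.

```java
import java.util.Objects;
import java.util.UUID;
import java.util.function.BiConsumer;

/** Hypothetical sketch of the change-detection step; the names are not Flink's. */
final class LeaderChangeFilter {
    private final Object lock = new Object();
    private final BiConsumer<String, UUID> listener;
    private String lastAddress;
    private UUID lastSessionId;

    LeaderChangeFilter(BiConsumer<String, UUID> listener) {
        this.listener = Objects.requireNonNull(listener);
    }

    /** Forwards the update only if address or session ID differs; returns whether it was forwarded. */
    boolean onLeaderInformation(String address, UUID sessionId) {
        synchronized (lock) {
            if (Objects.equals(address, lastAddress) && Objects.equals(sessionId, lastSessionId)) {
                return false; // duplicate notification, nothing to do
            }
            lastAddress = address;
            lastSessionId = sessionId;
            listener.accept(address, sessionId); // (null, null) signals that the leader was lost
            return true;
        }
    }
}
```

Objects.equals is used instead of equals so that the "leader lost" case, where both values are null, is handled without a NullPointerException.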
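This call hands the ResourceManager leader information to the listener. Step into the method; because we are inside the TaskExecutor and retrieving ResourceManager information, choose the ResourceManagerLeaderListener implementation inside TaskExecutor: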
@Override
public void notifyLeaderAddress(final String leaderAddress, final UUID leaderSessionID) {
// TODO Listener callback: receives the leader address
runAsync(
() ->
// TODO
notifyOfNewResourceManagerLeader(
leaderAddress,
ResourceManagerId.fromUuidOrNull(leaderSessionID)));
}
As you can see, this is an asynchronous listener callback. Step into notifyOfNewResourceManagerLeader:
private void notifyOfNewResourceManagerLeader(
String newLeaderAddress, ResourceManagerId newResourceManagerId) {
// TODO Wrap the data read from ZooKeeper into an actual ResourceManager address
resourceManagerAddress =
createResourceManagerAddress(newLeaderAddress, newResourceManagerId);
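// TODO Connect to this address
// TODO The method is named "reconnect" because it is called every time the ResourceManager address changes
// TODO It first closes the connection to the old ResourceManager, then opens one to the new ResourceManager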
reconnectToResourceManager(
new FlinkException(
String.format(
"ResourceManager leader changed to new address %s",
resourceManagerAddress)));
}
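Here the ResourceManager information read from ZooKeeper is wrapped into a ResourceManager address object, and the TaskExecutor starts connecting to that address. The method name contains "reconnect" because it runs every time the ResourceManager address changes. Step into reconnectToResourceManager: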
private void reconnectToResourceManager(Exception cause) {
// TODO Close the connection to the old ResourceManager
closeResourceManagerConnection(cause);
// TODO Start the registration timeout task
startRegistrationTimeout();
// TODO Connect to the new ResourceManager
tryConnectToResourceManager();
}
It does three things:
- Close the connection to the old ResourceManager
- Start the registration timeout task mentioned earlier
- Connect to the new ResourceManager
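The three steps can be sketched as a small state machine. Everything here is hypothetical and only illustrates the ordering: ReconnectSketch is not a Flink class, and identifying the currently valid timeout by a fresh UUID is an assumption about one common way to make stale timeouts harmless, not something the listing above shows.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical sketch of "close old, arm timeout, connect new"; not Flink code. */
final class ReconnectSketch {
    final List<String> steps = new ArrayList<>(); // records the order of operations
    private String connectedTo;                   // address of the current connection, if any
    private UUID currentTimeoutId;                // only the timeout carrying this ID may fire

    /** Performs the three steps and returns the ID of the newly armed timeout. */
    UUID reconnect(String newAddress) {
        if (connectedTo != null) {
            steps.add("close " + connectedTo);
            connectedTo = null;
        }
        UUID timeoutId = UUID.randomUUID();
        currentTimeoutId = timeoutId; // implicitly invalidates any earlier timeout
        steps.add("start-timeout");
        connectedTo = newAddress;
        steps.add("connect " + newAddress);
        return timeoutId;
    }

    /** Called once registration succeeds: no timeout may fire afterwards. */
    void registrationSucceeded() {
        currentTimeoutId = null;
    }

    /** Returns true only if the given timeout is still the current one. */
    boolean timeoutFired(UUID timeoutId) {
        return timeoutId.equals(currentTimeoutId);
    }
}
```

With this shape, a timeout armed for an earlier ResourceManager address is simply ignored when it eventually fires, because its ID no longer matches.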
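At this point the listener service is running and the address of the ResourceManager leader has been obtained from ZooKeeper. Connecting to that node and registering with it come next.
1.2 Registering the TaskExecutor with the ResourceManager
Step into tryConnectToResourceManager, and from there into connectToResourceManager: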
private void connectToResourceManager() {
assert (resourceManagerAddress != null);
assert (establishedResourceManagerConnection == null);
assert (resourceManagerConnection == null);
log.info("Connecting to ResourceManager {}.", resourceManagerAddress);
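// TODO Collect this worker node's information to send to the master for registration; the object that drives registration (RetryingRegistration) is created later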
final TaskExecutorRegistration taskExecutorRegistration =
new TaskExecutorRegistration(
getAddress(),
getResourceID(),
unresolvedTaskManagerLocation.getDataPort(),
JMXService.getPort().orElse(-1),
hardwareDescription,
memoryConfiguration,
taskManagerConfiguration.getDefaultSlotResourceProfile(),
taskManagerConfiguration.getTotalResourceProfile());
// TODO Create the connection to the ResourceManager
resourceManagerConnection =
new TaskExecutorToResourceManagerConnection(
log,
getRpcService(),
taskManagerConfiguration.getRetryingRegistrationConfiguration(),
resourceManagerAddress.getAddress(),
resourceManagerAddress.getResourceManagerId(),
getMainThreadExecutor(),
new ResourceManagerRegistrationListener(),
taskExecutorRegistration);
// TODO Start registration
resourceManagerConnection.start();
}
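Three things happen here:
- Information about this worker node is packaged for registration with the master
- The connection object to the ResourceManager is created
- The connection is started, which kicks off registration
Step into resourceManagerConnection.start():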
public void start() {
checkState(!closed, "The RPC connection is already closed");
checkState(
!isConnected() && pendingRegistration == null,
"The RPC connection is already started");
// TODO Create the registration object
final RetryingRegistration<F, G, S, R> newRegistration = createNewRegistration();
if (REGISTRATION_UPDATER.compareAndSet(this, null, newRegistration)) {
// TODO Start registration; the completion callback is set up inside createNewRegistration()
newRegistration.startRegistration();
} else {
// concurrent start operation
newRegistration.cancel();
}
}
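This method creates the object that actually drives registration and then uses it to register. First, let's see how the registration object is built:
1.2.1 Initializing the registration object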
private RetryingRegistration<F, G, S, R> createNewRegistration() {
// TODO Build the registration object
RetryingRegistration<F, G, S, R> newRegistration = checkNotNull(generateRegistration());
CompletableFuture<RetryingRegistration.RetryingRegistrationResult<G, S, R>> future =
newRegistration.getFuture();
// TODO Callback that runs once registration completes, whether it succeeded or failed
future.whenCompleteAsync(
(RetryingRegistration.RetryingRegistrationResult<G, S, R> result,
Throwable failure) -> {
// TODO If registration failed
if (failure != null) {
// TODO If the failure was caused by cancelling the registration
if (failure instanceof CancellationException) {
// TODO Do not raise an error; only log at debug level
// we ignore cancellation exceptions because they originate from
// cancelling
// the RetryingRegistration
log.debug(
"Retrying registration towards {} was cancelled.",
targetAddress);
} else {
// TODO If it failed for any other reason, invoke this callback
// this future should only ever fail if there is a bug, not if the
// registration is declined
onRegistrationFailure(failure);
}
} else {
// TODO Registration succeeded
if (result.isSuccess()) {
targetGateway = result.getGateway();
// TODO Invoke this callback
onRegistrationSuccess(result.getSuccess());
} else if (result.isRejection()) {
onRegistrationRejection(result.getRejection());
} else {
throw new IllegalArgumentException(
String.format(
"Unknown retrying registration response: %s.", result));
}
}
},
executor);
return newRegistration;
}
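This method does the following:
- Builds the registration object
- Attaches a callback that runs once registration completes
- In the callback, if registration failed because it was cancelled, no error is raised
- In the callback, if registration failed for any other reason, onRegistrationFailure is invoked
- In the callback, if registration succeeded, onRegistrationSuccess is invoked
- In the callback, if registration was rejected, onRegistrationRejection is invoked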
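The four outcomes listed above can be sketched with a plain CompletableFuture. This is a simplified illustration, not Flink's code: RegistrationOutcomeDemo and its nested Result type are hypothetical stand-ins for RetryingRegistrationResult, and handle is used in place of whenCompleteAsync so that the chosen branch can be observed as a return value.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch of the completion callback's branching; not Flink code. */
final class RegistrationOutcomeDemo {

    /** Stand-in for a registration result: either a success or a rejection. */
    static final class Result {
        final boolean success;
        final String detail;

        Result(boolean success, String detail) {
            this.success = success;
            this.detail = detail;
        }
    }

    /** Mirrors the branch order: failure (cancelled or not) first, then success or rejection. */
    static String classify(Result result, Throwable failure) {
        if (failure != null) {
            return failure instanceof CancellationException
                    ? "cancelled (ignored)"
                    : "failure: " + failure.getMessage();
        }
        return result.success ? "success: " + result.detail : "rejection: " + result.detail;
    }

    /** Attaches the callback; the returned future completes with the branch that was taken. */
    static CompletableFuture<String> attach(CompletableFuture<Result> registration) {
        return registration.handle(RegistrationOutcomeDemo::classify);
    }
}
```

Note that a rejection arrives as a normally completed result, not as an exception, which is why it is handled in the branch where failure is null.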
First, the construction of the registration object. Step into generateRegistration:
@Override
protected RetryingRegistration<
ResourceManagerId,
ResourceManagerGateway,
TaskExecutorRegistrationSuccess,
TaskExecutorRegistrationRejection>
generateRegistration() {
// TODO Create the actual registration object
return new TaskExecutorToResourceManagerConnection.ResourceManagerRegistration(
log,
rpcService,
getTargetAddress(),
getTargetLeaderId(),
retryingRegistrationConfiguration,
taskExecutorRegistration);
}
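We will skip the completion callback for now and come back to it when the flow reaches it. Let's continue with how the initialized registration object is used to register with the ResourceManager.
1.2.2 Starting registration with the ResourceManager
First, step into newRegistration.startRegistration():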
public void startRegistration() {
if (canceled) {
// we already got canceled
return;
}
try {
// trigger resolution of the target address to a callable gateway
final CompletableFuture<G> rpcGatewayFuture;
// TODO The RpcGateway here is effectively a reference to the master node (like an ActorRef); the registration below goes through this reference
if (FencedRpcGateway.class.isAssignableFrom(targetType)) {
rpcGatewayFuture =
(CompletableFuture<G>)
rpcService.connect(
targetAddress,
fencingToken,
targetType.asSubclass(FencedRpcGateway.class));
} else {
// TODO The TaskExecutor connects to the ResourceManager
rpcGatewayFuture = rpcService.connect(targetAddress, targetType);
}
// upon success, start the registration attempts
// TODO Once the connection is established and the RpcGateway obtained
CompletableFuture<Void> rpcGatewayAcceptFuture =
// TODO Register asynchronously
rpcGatewayFuture.thenAcceptAsync(
(G rpcGateway) -> {
// TODO Register through this reference object
log.info("Resolved {} address, beginning registration", targetName);
register(
rpcGateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis());
},
rpcService.getExecutor());
// upon failure, retry, unless this is cancelled
// TODO Callback for the asynchronous registration step
rpcGatewayAcceptFuture.whenCompleteAsync(
(Void v, Throwable failure) -> {
// TODO If it failed and was not cancelled manually
if (failure != null && !canceled) {
final Throwable strippedFailure =
ExceptionUtils.stripCompletionException(failure);
if (log.isDebugEnabled()) {
log.debug(
"Could not resolve {} address {}, retrying in {} ms.",
targetName,
targetAddress,
retryingRegistrationConfiguration.getErrorDelayMillis(),
strippedFailure);
} else {
log.info(
"Could not resolve {} address {}, retrying in {} ms: {}",
targetName,
targetAddress,
retryingRegistrationConfiguration.getErrorDelayMillis(),
strippedFailure.getMessage());
}
// TODO If this step failed, try again after a delay; the delay is configured via cluster.registration.error-delay (default 10 s)
startRegistrationLater(
retryingRegistrationConfiguration.getErrorDelayMillis());
}
},
rpcService.getExecutor());
} catch (Throwable t) {
completionFuture.completeExceptionally(t);
cancel();
}
}
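The code is long and three rounds of callbacks follow, so the logic is hard to trace. Let's take it step by step.
1.2.2.1 Connecting to the ResourceManager, obtaining a reference to it, and registering
// TODO The RpcGateway here is effectively a reference to the master node (like an ActorRef); the registration below goes through this reference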
if (FencedRpcGateway.class.isAssignableFrom(targetType)) {
rpcGatewayFuture =
(CompletableFuture<G>)
rpcService.connect(
targetAddress,
fencingToken,
targetType.asSubclass(FencedRpcGateway.class));
} else {
// TODO The TaskExecutor connects to the ResourceManager
rpcGatewayFuture = rpcService.connect(targetAddress, targetType);
}
// upon success, start the registration attempts
// TODO Once the connection is established and the RpcGateway obtained
CompletableFuture<Void> rpcGatewayAcceptFuture =
// TODO Register asynchronously
rpcGatewayFuture.thenAcceptAsync(
(G rpcGateway) -> {
// TODO Register through this reference object
log.info("Resolved {} address, beginning registration", targetName);
register(
rpcGateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis());
},
rpcService.getExecutor());
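The method first connects to the ResourceManager and tries to obtain a proxy for it. The RpcGateway here corresponds to an ActorRef in the Akka model. Once the connection is established and the proxy obtained, the TaskExecutor registers itself through this ResourceManager reference. Here is the registration method, register: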
private void register(final G gateway, final int attempt, final long timeoutMillis) {
// eager check for canceling to avoid some unnecessary work
if (canceled) {
return;
}
try {
log.debug(
"Registration at {} attempt {} (timeout={}ms)",
targetName,
attempt,
timeoutMillis);
// TODO The actual registration starts here
// TODO On success, the RegistrationResponse is a TaskExecutorRegistrationSuccess
CompletableFuture<RegistrationResponse> registrationFuture =
invokeRegistration(gateway, fencingToken, timeoutMillis);
// if the registration was successful, let the TaskExecutor know
CompletableFuture<Void> registrationAcceptFuture =
registrationFuture.thenAcceptAsync(
(RegistrationResponse result) -> {
if (!isCanceled()) {
// TODO Registration succeeded
if (result instanceof RegistrationResponse.Success) {
log.debug(
"Registration with {} at {} was successful.",
targetName,
targetAddress);
S success = (S) result;
// TODO Completing this future triggers the onRegistrationSuccess callback
completionFuture.complete(
RetryingRegistrationResult.success(
gateway, success));
// TODO Registration was rejected
} else if (result instanceof RegistrationResponse.Rejection) {
log.debug(
"Registration with {} at {} was rejected.",
targetName,
targetAddress);
R rejection = (R) result;
// TODO Completing this future triggers the onRegistrationRejection callback
completionFuture.complete(
RetryingRegistrationResult.rejection(rejection));
} else {
// registration failure
// TODO Registration failed for some other reason
if (result instanceof RegistrationResponse.Failure) {
RegistrationResponse.Failure failure =
(RegistrationResponse.Failure) result;
log.info(
"Registration failure at {} occurred.",
targetName,
failure.getReason());
} else {
log.error(
"Received unknown response to registration attempt: {}",
result);
}
log.info(
"Pausing and re-attempting registration in {} ms",
retryingRegistrationConfiguration
.getRefusedDelayMillis());
// TODO Retry
registerLater(
gateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis(),
retryingRegistrationConfiguration
.getRefusedDelayMillis());
}
}
},
rpcService.getExecutor());
// upon failure, retry
// TODO If registration failed
registrationAcceptFuture.whenCompleteAsync(
(Void v, Throwable failure) -> {
if (failure != null && !isCanceled()) {
// TODO If the cause is a timeout
if (ExceptionUtils.stripCompletionException(failure)
instanceof TimeoutException) {
// we simply have not received a response in time. maybe the timeout
// was
// very low (initial fast registration attempts), maybe the target
// endpoint is
// currently down.
if (log.isDebugEnabled()) {
log.debug(
"Registration at {} ({}) attempt {} timed out after {} ms",
targetName,
targetAddress,
attempt,
timeoutMillis);
}
// TODO Each timeout doubles the timeout for the next attempt,
// but never beyond retryingRegistrationConfiguration.getMaxRegistrationTimeoutMillis()
long newTimeoutMillis =
Math.min(
2 * timeoutMillis,
retryingRegistrationConfiguration
.getMaxRegistrationTimeoutMillis());
// TODO Retry with the attempt count incremented by 1
register(gateway, attempt + 1, newTimeoutMillis);
} else {
// a serious failure occurred. we still should not give up, but keep
// trying
log.error(
"Registration at {} failed due to an error",
targetName,
failure);
log.info(
"Pausing and re-attempting registration in {} ms",
retryingRegistrationConfiguration.getErrorDelayMillis());
registerLater(
gateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis(),
retryingRegistrationConfiguration.getErrorDelayMillis());
}
}
},
rpcService.getExecutor());
} catch (Throwable t) {
completionFuture.completeExceptionally(t);
cancel();
}
}
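This method is also very long, so again we take it step by step.
1.2.2.1.1 Starting the actual registration
At the start of the method, the TaskExecutor actually registers with the ResourceManager:
// TODO The actual registration starts here
// TODO On success, the RegistrationResponse is a TaskExecutorRegistrationSuccess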
CompletableFuture<RegistrationResponse> registrationFuture =
invokeRegistration(gateway, fencingToken, timeoutMillis);
Step into invokeRegistration and choose the implementation inside TaskExecutorToResourceManagerConnection:
@Override
protected CompletableFuture<RegistrationResponse> invokeRegistration(
ResourceManagerGateway resourceManager,
ResourceManagerId fencingToken,
long timeoutMillis)
throws Exception {
Time timeout = Time.milliseconds(timeoutMillis);
// TODO Send the RPC request: register
return resourceManager.registerTaskExecutor(taskExecutorRegistration, timeout);
}
At this point we are already working with the ResourceManager proxy object, and the method being called is the proxy's registration method. Next, step into registerTaskExecutor:
@Override
public CompletableFuture<RegistrationResponse> registerTaskExecutor(
final TaskExecutorRegistration taskExecutorRegistration, final Time timeout) {
// TODO Obtain a proxy for the TaskExecutor in order to reply to the registration
CompletableFuture<TaskExecutorGateway> taskExecutorGatewayFuture =
getRpcService()
.connect(
taskExecutorRegistration.getTaskExecutorAddress(),
TaskExecutorGateway.class);
taskExecutorGatewayFutures.put(
taskExecutorRegistration.getResourceId(), taskExecutorGatewayFuture);
return taskExecutorGatewayFuture.handleAsync(
(TaskExecutorGateway taskExecutorGateway, Throwable throwable) -> {
final ResourceID resourceId = taskExecutorRegistration.getResourceId();
if (taskExecutorGatewayFuture == taskExecutorGatewayFutures.get(resourceId)) {
taskExecutorGatewayFutures.remove(resourceId);
if (throwable != null) {
return new RegistrationResponse.Failure(throwable);
} else {
// TODO The actual registration logic
return registerTaskExecutorInternal(
taskExecutorGateway, taskExecutorRegistration);
}
} else {
log.debug(
"Ignoring outdated TaskExecutorGateway connection for {}.",
resourceId.getStringWithMetadata());
return new RegistrationResponse.Failure(
new FlinkException("Decline outdated task executor registration."));
}
},
getMainThreadExecutor());
}
We are now inside the ResourceManager class. The method first obtains a proxy for the TaskExecutor so that it can reply to the registration. For the actual registration logic, step into registerTaskExecutorInternal:
private RegistrationResponse registerTaskExecutorInternal(
TaskExecutorGateway taskExecutorGateway,
TaskExecutorRegistration taskExecutorRegistration) {
// TODO The TaskExecutor's ResourceID
ResourceID taskExecutorResourceId = taskExecutorRegistration.getResourceId();
// TODO Fetch (and remove) the TaskExecutor's existing registration; if one exists, it has registered before and must be updated
WorkerRegistration<WorkerType> oldRegistration =
taskExecutors.remove(taskExecutorResourceId);
// TODO If there is an old registration
if (oldRegistration != null) {
// TODO :: suggest old taskExecutor to stop itself
log.debug(
"Replacing old registration of TaskExecutor {}.",
taskExecutorResourceId.getStringWithMetadata());
// TODO Unregister the old TaskManager first, then register the new TaskManager
// remove old task manager registration from slot manager
slotManager.unregisterTaskManager(
oldRegistration.getInstanceID(),
new ResourceManagerException(
String.format(
"TaskExecutor %s re-connected to the ResourceManager.",
taskExecutorResourceId.getStringWithMetadata())));
}
final WorkerType newWorker = workerStarted(taskExecutorResourceId);
String taskExecutorAddress = taskExecutorRegistration.getTaskExecutorAddress();
if (newWorker == null) {
log.warn(
"Discard registration from TaskExecutor {} at ({}) because the framework did "
+ "not recognize it",
taskExecutorResourceId.getStringWithMetadata(),
taskExecutorAddress);
return new TaskExecutorRegistrationRejection(
"The ResourceManager does not recognize this TaskExecutor.");
} else {
// Create the registration object
WorkerRegistration<WorkerType> registration =
new WorkerRegistration<>(
taskExecutorGateway,
newWorker,
taskExecutorRegistration.getDataPort(),
taskExecutorRegistration.getJmxPort(),
taskExecutorRegistration.getHardwareDescription(),
taskExecutorRegistration.getMemoryConfiguration(),
taskExecutorRegistration.getTotalResourceProfile(),
taskExecutorRegistration.getDefaultSlotResourceProfile());
log.info(
"Registering TaskManager with ResourceID {} ({}) at ResourceManager",
taskExecutorResourceId.getStringWithMetadata(),
taskExecutorAddress);
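// TODO Complete the registration; taskExecutors is a map from ResourceID to registration object
taskExecutors.put(taskExecutorResourceId, registration);
// TODO The registration logic is complete at this point
// TODO Worker heartbeat manager: stores the ResourceID of each registered TaskExecutor together with a heartbeat target wrapping that TaskExecutor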
taskManagerHeartbeatManager.monitorTarget(
taskExecutorResourceId,
new HeartbeatTarget<Void>() {
@Override
public void receiveHeartbeat(ResourceID resourceID, Void payload) {
// the ResourceManager will always send heartbeat requests to the
// TaskManager
}
@Override
public void requestHeartbeat(ResourceID resourceID, Void payload) {
// TODO The ResourceManager sends a heartbeat RPC request to the TaskExecutor
taskExecutorGateway.heartbeatFromResourceManager(resourceID);
}
});
// TODO Return the registration success message to the TaskExecutor
return new TaskExecutorRegistrationSuccess(
registration.getInstanceID(), resourceId, clusterInformation);
}
}
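This method does the following:
- It first looks up (and removes) any existing registration for the ResourceID of the registering TaskExecutor in the taskExecutors map, which maps each TaskExecutor's ResourceID to its registration object.
- If a registration is found for that ResourceID, the node has registered before and its information must be updated.
- If there is an old registration, the old TaskManager is unregistered first, and then the new TaskManager is registered.
- If the worker is recognized, a registration object is created for its TaskExecutor and registration proceeds.
- After registration, the node's ResourceID and registration object are put into the taskExecutors map.
- The worker heartbeat manager creates and stores a heartbeat target for the newly registered worker.
- Finally, the registration response is returned to the TaskExecutor.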
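The replace-on-re-registration bookkeeping can be sketched with an ordinary map. This is a deliberately reduced illustration under stated assumptions: WorkerRegistry is a hypothetical class, ResourceIDs are plain strings, and a random UUID stands in for the InstanceID that each new WorkerRegistration carries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of the taskExecutors bookkeeping; not Flink code. */
final class WorkerRegistry {
    private final Map<String, UUID> registrations = new HashMap<>();
    int replaced; // how many times an old registration was superseded

    /** Removes any old registration for this worker, then stores and returns a fresh instance ID. */
    UUID register(String resourceId) {
        UUID old = registrations.remove(resourceId);
        if (old != null) {
            replaced++; // in the real flow, the old registration is also removed from the slot manager
        }
        UUID instanceId = UUID.randomUUID();
        registrations.put(resourceId, instanceId);
        return instanceId;
    }

    /** True only if the given instance ID belongs to the worker's latest registration. */
    boolean isCurrent(String resourceId, UUID instanceId) {
        return instanceId.equals(registrations.get(resourceId));
    }
}
```

Because every registration gets a new instance ID, messages that still carry the ID of an earlier registration can be recognized as stale.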
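Once registration completes, the first callback fires: the one set up in the register method. Let's go back to it.
1.2.2.1.2 The first callback after registration completes
When the registration call completes, execution enters this code: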
CompletableFuture<Void> registrationAcceptFuture =
registrationFuture.thenAcceptAsync(
(RegistrationResponse result) -> {
if (!isCanceled()) {
// TODO Registration succeeded
if (result instanceof RegistrationResponse.Success) {
log.debug(
"Registration with {} at {} was successful.",
targetName,
targetAddress);
S success = (S) result;
// TODO Completing this future triggers the onRegistrationSuccess callback
completionFuture.complete(
RetryingRegistrationResult.success(
gateway, success));
// TODO Registration was rejected
} else if (result instanceof RegistrationResponse.Rejection) {
log.debug(
"Registration with {} at {} was rejected.",
targetName,
targetAddress);
R rejection = (R) result;
// TODO Completing this future triggers the onRegistrationRejection callback
completionFuture.complete(
RetryingRegistrationResult.rejection(rejection));
} else {
// registration failure
// TODO Registration failed for some other reason
if (result instanceof RegistrationResponse.Failure) {
RegistrationResponse.Failure failure =
(RegistrationResponse.Failure) result;
log.info(
"Registration failure at {} occurred.",
targetName,
failure.getReason());
} else {
log.error(
"Received unknown response to registration attempt: {}",
result);
}
log.info(
"Pausing and re-attempting registration in {} ms",
retryingRegistrationConfiguration
.getRefusedDelayMillis());
// TODO Retry
registerLater(
gateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis(),
retryingRegistrationConfiguration
.getRefusedDelayMillis());
}
}
},
rpcService.getExecutor());
The result of the registration is checked first. If registration succeeded, this code runs:
// TODO Completing this future triggers the onRegistrationSuccess callback
completionFuture.complete(
RetryingRegistrationResult.success(
gateway, success));
onRegistrationSuccess is then called back; we will look at that method shortly.
If registration was rejected, this code runs:
// TODO Completing this future triggers the onRegistrationRejection callback
completionFuture.complete(
RetryingRegistrationResult.rejection(rejection));
onRegistrationRejection is then called back. Reading on, if registration failed for some other reason, the retry path is taken:
// TODO Retry
registerLater(
gateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis(),
retryingRegistrationConfiguration
.getRefusedDelayMillis());
Step into registerLater:
private void registerLater(
final G gateway, final int attempt, final long timeoutMillis, long delay) {
rpcService.scheduleRunnable(
new Runnable() {
@Override
public void run() {
register(gateway, attempt, timeoutMillis);
}
},
delay,
TimeUnit.MILLISECONDS);
}
As you can see, register is called again here, after the given delay, to retry.
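The "schedule the same attempt again after a delay" idea can be sketched with the JDK's ScheduledExecutorService in place of Flink's RpcService. This is a minimal sketch under stated assumptions: DelayedRetry and runUntilAccepted are hypothetical names, and the IntPredicate stands in for "did attempt n get accepted".

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntPredicate;

/** Hypothetical sketch of delayed re-registration; not Flink code. */
final class DelayedRetry {

    /** Runs attempts until one is accepted and returns the number of attempts that were needed. */
    static int runUntilAccepted(IntPredicate attemptSucceeds, long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<Integer> done = new CompletableFuture<>();
        AtomicInteger attempts = new AtomicInteger();
        Runnable attempt = new Runnable() {
            @Override
            public void run() {
                int n = attempts.incrementAndGet();
                if (attemptSucceeds.test(n)) {
                    done.complete(n);
                } else {
                    // Same shape as registerLater: re-run this attempt after a fixed delay.
                    scheduler.schedule(this, delayMillis, TimeUnit.MILLISECONDS);
                }
            }
        };
        scheduler.execute(attempt);
        try {
            return done.join(); // blocks the caller; a sketch-only simplification
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // The third attempt is the first one that is accepted.
        System.out.println(runUntilAccepted(n -> n >= 3, 10));
    }
}
```

Unlike this sketch, the code above never blocks a thread while waiting; it only chains futures, and cancellation is checked at the start of each register call.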
If an exception was thrown during the registration flow above, execution enters this code:
registrationAcceptFuture.whenCompleteAsync(
(Void v, Throwable failure) -> {
if (failure != null && !isCanceled()) {
// TODO If the cause is a timeout
if (ExceptionUtils.stripCompletionException(failure)
instanceof TimeoutException) {
// we simply have not received a response in time. maybe the timeout
// was
// very low (initial fast registration attempts), maybe the target
// endpoint is
// currently down.
if (log.isDebugEnabled()) {
log.debug(
"Registration at {} ({}) attempt {} timed out after {} ms",
targetName,
targetAddress,
attempt,
timeoutMillis);
}
// TODO Each timeout doubles the timeout for the next attempt,
// but never beyond retryingRegistrationConfiguration.getMaxRegistrationTimeoutMillis()
long newTimeoutMillis =
Math.min(
2 * timeoutMillis,
retryingRegistrationConfiguration
.getMaxRegistrationTimeoutMillis());
// TODO Retry with the attempt count incremented by 1
register(gateway, attempt + 1, newTimeoutMillis);
} else {
// a serious failure occurred. we still should not give up, but keep
// trying
log.error(
"Registration at {} failed due to an error",
targetName,
failure);
log.info(
"Pausing and re-attempting registration in {} ms",
retryingRegistrationConfiguration.getErrorDelayMillis());
registerLater(
gateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis(),
retryingRegistrationConfiguration.getErrorDelayMillis());
}
}
},
rpcService.getExecutor());
The cause of the exception is checked first. If it is a timeout:
// TODO If the cause is a timeout
if (ExceptionUtils.stripCompletionException(failure)
instanceof TimeoutException) {
// we simply have not received a response in time. maybe the timeout
// was
// very low (initial fast registration attempts), maybe the target
// endpoint is
// currently down.
if (log.isDebugEnabled()) {
log.debug(
"Registration at {} ({}) attempt {} timed out after {} ms",
targetName,
targetAddress,
attempt,
timeoutMillis);
}
// TODO Each timeout doubles the timeout for the next attempt,
// but never beyond retryingRegistrationConfiguration.getMaxRegistrationTimeoutMillis()
long newTimeoutMillis =
Math.min(
2 * timeoutMillis,
retryingRegistrationConfiguration
.getMaxRegistrationTimeoutMillis());
// TODO Retry with the attempt count incremented by 1
register(gateway, attempt + 1, newTimeoutMillis);
The retry happens here. Each time an attempt times out, the timeout is doubled for the next attempt, but it never exceeds the limit returned by retryingRegistrationConfiguration.getMaxRegistrationTimeoutMillis().
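To make the doubling-with-a-cap rule concrete, here is a small worked example. TimeoutBackoff is a hypothetical helper, and the 100 ms initial timeout and 30 000 ms cap are illustrative values chosen for this sketch, not values read from the code above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that applies the same Math.min(2 * timeout, max) rule. */
final class TimeoutBackoff {

    /** The timeout to use after an attempt with currentMillis has timed out. */
    static long next(long currentMillis, long maxMillis) {
        return Math.min(2 * currentMillis, maxMillis);
    }

    /** The timeouts of the first `attempts` attempts, assuming every one of them times out. */
    static List<Long> schedule(long initialMillis, long maxMillis, int attempts) {
        List<Long> timeouts = new ArrayList<>();
        long timeout = initialMillis;
        for (int i = 0; i < attempts; i++) {
            timeouts.add(timeout);
            timeout = next(timeout, maxMillis);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // prints [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000, 30000]
        System.out.println(schedule(100, 30_000, 11));
    }
}
```

With these example values the cap is reached on the tenth attempt, after which every further attempt waits the full maximum.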
If the error has any other cause, registration is retried after a short delay:
registerLater(
gateway,
1,
retryingRegistrationConfiguration
.getInitialRegistrationTimeoutMillis(),
retryingRegistrationConfiguration.getErrorDelayMillis());
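Next, let's look at the callback methods mentioned earlier.
1.2.2.1.3 Callbacks for registration success or failure
The callbacks are triggered from createNewRegistration(), the method that built the TaskExecutor registration object. Let's return to it:
// TODO Callback that runs once registration completes, whether it succeeded or failed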
future.whenCompleteAsync(
(RetryingRegistration.RetryingRegistrationResult<G, S, R> result,
Throwable failure) -> {
// TODO If registration failed
if (failure != null) {
// TODO If the failure was caused by cancelling the registration
if (failure instanceof CancellationException) {
// TODO Do not raise an error; only log at debug level
// we ignore cancellation exceptions because they originate from
// cancelling
// the RetryingRegistration
log.debug(
"Retrying registration towards {} was cancelled.",
targetAddress);
} else {
// TODO If it failed for any other reason, invoke this callback
// this future should only ever fail if there is a bug, not if the
// registration is declined
onRegistrationFailure(failure);
}
} else {
// TODO Registration succeeded
if (result.isSuccess()) {
targetGateway = result.getGateway();
// TODO Invoke this callback
onRegistrationSuccess(result.getSuccess());
} else if (result.isRejection()) {
onRegistrationRejection(result.getRejection());
} else {
throw new IllegalArgumentException(
String.format(
"Unknown retrying registration response: %s.", result));
}
}
},
executor);
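Different methods are invoked depending on whether registration succeeded or failed. Let's look at their implementations.
Registration success callback (maintain heartbeats, report slots, register slots)
First, the success callback onRegistrationSuccess. Step in and choose the TaskExecutorToResourceManagerConnection implementation: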
@Override
protected void onRegistrationSuccess(TaskExecutorRegistrationSuccess success) {
log.info(
"Successful registration at resource manager {} under registration id {}.",
getTargetAddress(),
success.getRegistrationId());
// TODO
registrationListener.onRegistrationSuccess(this, success);
}
Next, step into registrationListener.onRegistrationSuccess:
@Override
public void onRegistrationSuccess(
TaskExecutorToResourceManagerConnection connection,
TaskExecutorRegistrationSuccess success) {
final ResourceID resourceManagerId = success.getResourceManagerId();
final InstanceID taskExecutorRegistrationId = success.getRegistrationId();
final ClusterInformation clusterInformation = success.getClusterInformation();
final ResourceManagerGateway resourceManagerGateway = connection.getTargetGateway();
runAsync(
() -> {
// filter out outdated connections
//noinspection ObjectEquality
if (resourceManagerConnection == connection) {
try {
// TODO Finish establishing the connection to the ResourceManager
establishResourceManagerConnection(
resourceManagerGateway,
resourceManagerId,
taskExecutorRegistrationId,
clusterInformation);
} catch (Throwable t) {
log.error(
"Establishing Resource Manager connection in Task Executor failed",
t);
}
}
});
}
Then step into establishResourceManagerConnection:
private void establishResourceManagerConnection(
ResourceManagerGateway resourceManagerGateway,
ResourceID resourceManagerResourceId,
InstanceID taskExecutorRegistrationId,
ClusterInformation clusterInformation) {
// TODO Send the slot report to the ResourceManager
final CompletableFuture<Acknowledge> slotReportResponseFuture =
resourceManagerGateway.sendSlotReport(
getResourceID(),
taskExecutorRegistrationId,
// TODO Create the slot report; this is done by the TaskSlotTable
taskSlotTable.createSlotReport(getResourceID()),
taskManagerConfiguration.getRpcTimeout());
slotReportResponseFuture.whenCompleteAsync(
(acknowledge, throwable) -> {
if (throwable != null) {
reconnectToResourceManager(
new TaskManagerException(
"Failed to send initial slot report to ResourceManager.",
throwable));
}
},
getMainThreadExecutor());
// TODO Maintain the heartbeat
// monitor the resource manager as heartbeat target
resourceManagerHeartbeatManager.monitorTarget(
resourceManagerResourceId,
new HeartbeatTarget<TaskExecutorHeartbeatPayload>() {
/**
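* TODO When the TaskExecutor receives a heartbeat from the ResourceManager, it replies
* This differs from the HDFS heartbeat mechanism, where the master node only receives heartbeats and never sends them
* In Flink, the master node also sends heartbeats to the worker nodes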
*/
@Override
public void receiveHeartbeat(
ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
resourceManagerGateway.heartbeatFromTaskManager(
resourceID, heartbeatPayload);
}
@Override
public void requestHeartbeat(
ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
// the TaskManager won't send heartbeat requests to the ResourceManager
}
});
// set the propagated blob server address
final InetSocketAddress blobServerAddress =
new InetSocketAddress(
clusterInformation.getBlobServerHostname(),
clusterInformation.getBlobServerPort());
blobCacheService.setBlobServerAddress(blobServerAddress);
// TODO Create the connection object between the TaskExecutor and the ResourceManager
establishedResourceManagerConnection =
new EstablishedResourceManagerConnection(
resourceManagerGateway,
resourceManagerResourceId,
taskExecutorRegistrationId);
// TODO Stop the registration timeout task, because registration succeeded
stopRegistrationTimeout();
}
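This method does the following:
- Sends the slot report to the ResourceManager
- Maintains the heartbeat with the ResourceManager
- Sets the BlobServer address carried in ClusterInformation on the blobCacheService
- Builds the connection object between the TaskExecutor and the ResourceManager
- Stops the registration timeout task, since registration has already succeeded at this point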
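The first step above attaches a failure handler to the future returned by sendSlotReport. A minimal sketch of that pattern follows, using only java.util.concurrent. The class name SlotReportCallbackSketch, the reconnect helper, and the String acknowledgement are hypothetical stand-ins for illustration, not Flink APIs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicInteger;

public class SlotReportCallbackSketch {
    // Counts how often the hypothetical reconnect path was triggered.
    public static final AtomicInteger reconnects = new AtomicInteger();

    // Hypothetical stand-in for reconnectToResourceManager(...).
    static void reconnect(Throwable cause) {
        reconnects.incrementAndGet();
    }

    // Attaches a handler that reacts only to failures, then swallows the
    // failure so that callers can join() on the result safely.
    public static CompletableFuture<Void> watch(
            CompletableFuture<String> report, Executor mainThread) {
        return report.whenCompleteAsync(
                        (ack, throwable) -> {
                            if (throwable != null) {
                                reconnect(throwable);
                            }
                        },
                        mainThread)
                .<Void>handle((ack, throwable) -> null);
    }

    public static void main(String[] args) {
        Executor direct = Runnable::run; // runs callbacks on the calling thread

        watch(CompletableFuture.completedFuture("ACK"), direct).join();
        System.out.println("reconnects after success: " + reconnects.get()); // 0

        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("slot report lost"));
        watch(failed, direct).join();
        System.out.println("reconnects after failure: " + reconnects.get()); // 1
    }
}
```

Passing the executor explicitly plays the role that getMainThreadExecutor() plays in the quoted code: the callback runs on a thread chosen by the caller rather than on whichever thread completes the future.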
Let's first look at the implementation of the method that sends the slot report to the ResourceManager:
@Override
public CompletableFuture<Acknowledge> sendSlotReport(
ResourceID taskManagerResourceId,
InstanceID taskManagerRegistrationId,
SlotReport slotReport,
Time timeout) {
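// TODO During registration, the mapping from ResourceID to the registration object was stored in the taskExecutors map,
// TODO so the node's registration object can be fetched directly here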
final WorkerRegistration<WorkerType> workerTypeWorkerRegistration =
taskExecutors.get(taskManagerResourceId);
// TODO Check whether the registration object's InstanceID matches the registration ID that was passed in
if (workerTypeWorkerRegistration.getInstanceID().equals(taskManagerRegistrationId)) {
// TODO Complete the resource report and register all slots on this TaskManager
if (slotManager.registerTaskManager(
workerTypeWorkerRegistration,
slotReport,
workerTypeWorkerRegistration.getTotalResourceProfile(),
workerTypeWorkerRegistration.getDefaultSlotResourceProfile())) {
onWorkerRegistered(workerTypeWorkerRegistration.getWorker());
}
return CompletableFuture.completedFuture(Acknowledge.get());
} else {
// TODO If they differ, return an error
return FutureUtils.completedExceptionally(
new ResourceManagerException(
String.format(
"Unknown TaskManager registration id %s.",
taskManagerRegistrationId)));
}
}
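This method returns its result asynchronously, as a CompletableFuture:
- When a TaskExecutor registers, its ResourceID and its registration wrapper object, WorkerRegistration, are stored in the taskExecutors map, so here the ResourceID is enough to look up the registration object of the current worker node
- Once the registration object is found, its InstanceID is compared with the registration ID that was passed in; if they differ, an error is returned immediately
- If the IDs match, the resource report proceeds and all slots on that TaskManager node are registered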
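The lookup-and-compare logic in the list above can be sketched in isolation. This is a simplified illustration, not Flink code: the class RegistrationCheckSketch, the String IDs, and the "ACK" result are hypothetical, and the map only stands in for taskExecutors.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class RegistrationCheckSketch {
    // Hypothetical stand-in for the taskExecutors map:
    // resource ID -> registration (instance) ID handed out at registration time.
    public static final Map<String, String> registrations = new HashMap<>();

    public static CompletableFuture<String> sendSlotReport(
            String resourceId, String registrationId) {
        String known = registrations.get(resourceId);
        if (registrationId.equals(known)) {
            // IDs match: accept the report and acknowledge.
            return CompletableFuture.completedFuture("ACK");
        }
        // IDs differ (or the resource ID is unknown): fail the future.
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(
                new IllegalStateException(
                        "Unknown TaskManager registration id " + registrationId));
        return failed;
    }

    public static void main(String[] args) {
        registrations.put("tm-1", "instance-a");
        System.out.println(sendSlotReport("tm-1", "instance-a").join()); // ACK
        System.out.println(
                sendSlotReport("tm-1", "instance-b").isCompletedExceptionally()); // true
    }
}
```

One deliberate difference: the sketch also fails the future when the resource ID was never registered, whereas the quoted method calls getInstanceID() on the map lookup result without a null check.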
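Let's start with the resource registration method registerTaskManager, choosing the SlotManagerImpl implementation. Here is the most important part of that code, the branch taken when the TaskManager's InstanceID is already present in taskManagerRegistrations: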
if (taskManagerRegistrations.containsKey(taskExecutorConnection.getInstanceID())) {
// TODO Report the slot status
reportSlotStatus(taskExecutorConnection.getInstanceID(), initialSlotReport);
return false;
}
Let's step into reportSlotStatus, the method that reports the slot status:
@Override
public boolean reportSlotStatus(InstanceID instanceId, SlotReport slotReport) {
checkInit();
TaskManagerRegistration taskManagerRegistration = taskManagerRegistrations.get(instanceId);
if (null != taskManagerRegistration) {
LOG.debug("Received slot report from instance {}: {}.", instanceId, slotReport);
// TODO Report the status of every slot on the TaskExecutor
for (SlotStatus slotStatus : slotReport) {
// TODO Update the slot status
updateSlot(
slotStatus.getSlotID(),
slotStatus.getAllocationID(),
slotStatus.getJobID());
}
return true;
} else {
LOG.debug(
"Received slot report for unknown task manager with instance id {}. Ignoring this report.",
instanceId);
return false;
}
}
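As you can see, this iterates over all slots of the current node, reports each slot's status, and updates the slot status records on the master node. Let's step into the updateSlot method: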
private boolean updateSlot(SlotID slotId, AllocationID allocationId, JobID jobId) {
// TODO Look up the TaskManagerSlot object by SlotID
final TaskManagerSlot slot = slots.get(slotId);
if (slot != null) {
// TODO Get the registration object of the TaskManager that owns this slot
final TaskManagerRegistration taskManagerRegistration =
taskManagerRegistrations.get(slot.getInstanceId());
// TODO Check the registration again: slot status can only be updated for a TaskExecutor that is registered
if (taskManagerRegistration != null) {
// TODO Change the state
updateSlotState(slot, taskManagerRegistration, allocationId, jobId);
return true;
} else {
throw new IllegalStateException(
"Trying to update a slot from a TaskManager "
+ slot.getInstanceId()
+ " which has not been registered.");
}
} else {
LOG.debug("Trying to update unknown slot with slot id {}.", slotId);
return false;
}
}
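We first look up the slot object by SlotID, then use the slot object to find the registration of the TaskManager that owns it. Provided that TaskManager is confirmed to be registered, the slot state in the master node's records is changed. Stepping into updateSlotState, we can see a switch that matches on the slot's current state. I plan to cover this in detail in the later chapter on slot allocation; for now, here is some background on slots:
- Every request for a slot is wrapped in a PendingSlotRequest
- Every slot has three basic states: Allocated, Free, and Pending
- Slots are requested by the JobMaster from the ResourceManager
- After doing its calculation, the ResourceManager assigns a slot in the Free state on some TaskExecutor to that JobMaster
- The ResourceManager tells the JobMaster that a slot on some TaskExecutor has been assigned to it (a logical assignment). At this point the slot's state in the ResourceManager's bookkeeping changes from Free to Pending. The ResourceManager then sends an RPC request telling the TaskExecutor that one of its slots has been assigned to a particular JobMaster. The worker node first completes the allocation and then contacts the JobMaster to register the slot. Only then does the PendingSlotRequest become a CompletedSlotRequest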
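The three slot states and the Free, Pending, Allocated progression described above can be modelled as a small state machine. The transition table below is an assumption made for illustration; it is not taken from updateSlotState, which may allow other transitions. The class and method names are hypothetical.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class SlotStateSketch {
    public enum SlotState { FREE, PENDING, ALLOCATED }

    // Allowed transitions in this simplified model (an assumption, not Flink's rules).
    static final Map<SlotState, Set<SlotState>> ALLOWED = new EnumMap<>(SlotState.class);

    static {
        // A free slot is logically assigned to a JobMaster.
        ALLOWED.put(SlotState.FREE, EnumSet.of(SlotState.PENDING));
        // The TaskExecutor confirms the allocation, or the request is abandoned.
        ALLOWED.put(SlotState.PENDING, EnumSet.of(SlotState.ALLOCATED, SlotState.FREE));
        // An allocated slot is eventually released.
        ALLOWED.put(SlotState.ALLOCATED, EnumSet.of(SlotState.FREE));
    }

    // Returns the new state, or throws if the model does not allow the transition.
    public static SlotState transition(SlotState from, SlotState to) {
        if (!ALLOWED.get(from).contains(to)) {
            throw new IllegalStateException("Illegal transition " + from + " -> " + to);
        }
        return to;
    }

    public static void main(String[] args) {
        SlotState state = SlotState.FREE;
        state = transition(state, SlotState.PENDING);   // ResourceManager assigns the slot
        state = transition(state, SlotState.ALLOCATED); // TaskExecutor completes the allocation
        System.out.println(state); // ALLOCATED
    }
}
```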
Now let's return to establishResourceManagerConnection and continue with the code that maintains the heartbeat:
// TODO Maintain the heartbeat
// monitor the resource manager as heartbeat target
resourceManagerHeartbeatManager.monitorTarget(
resourceManagerResourceId,
new HeartbeatTarget<TaskExecutorHeartbeatPayload>() {
/**
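* TODO When the TaskExecutor receives a heartbeat from the ResourceManager, it replies.
* This differs from the HDFS heartbeat mechanism: the HDFS master only receives heartbeats and never sends them,
* whereas the Flink master also sends heartbeats to the worker nodes.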
*/
@Override
public void receiveHeartbeat(
ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
resourceManagerGateway.heartbeatFromTaskManager(
resourceID, heartbeatPayload);
}
@Override
public void requestHeartbeat(
ResourceID resourceID, TaskExecutorHeartbeatPayload heartbeatPayload) {
// the TaskManager won't send heartbeat requests to the ResourceManager
}
});
In this code the TaskManager receives heartbeats from the ResourceManager and replies to them. Let's look at monitorTarget, choosing the HeartbeatManagerImpl implementation:
@Override
public void monitorTarget(ResourceID resourceID, HeartbeatTarget<O> heartbeatTarget) {
if (!stopped) {
if (heartbeatTargets.containsKey(resourceID)) {
log.debug(
"The target with resource ID {} is already been monitored.",
resourceID.getStringWithMetadata());
} else {
// TODO Create a HeartbeatMonitor from the HeartbeatTarget and register it in the heartbeatTargets map
HeartbeatMonitor<O> heartbeatMonitor =
heartbeatMonitorFactory.createHeartbeatMonitor(
resourceID,
heartbeatTarget,
mainThreadExecutor,
heartbeatListener,
heartbeatTimeoutIntervalMs);
// TODO Add it to the collection of heartbeat targets
heartbeatTargets.put(resourceID, heartbeatMonitor);
// check if we have stopped in the meantime (concurrent stop operation)
// TODO If this HeartbeatManagerImpl has been stopped, cancel the heartbeat timeout task
if (stopped) {
heartbeatMonitor.cancel();
heartbeatTargets.remove(resourceID);
}
}
}
}
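Heartbeats in a Flink cluster are designed differently from those in HDFS: in Flink the master node, the ResourceManager, actively sends heartbeats to the worker nodes, whereas in HDFS the master only receives heartbeats and does not send them.
In this method, the worker node wraps the information about the master being registered into a heartbeatMonitor object and adds it to the map. It then checks whether the heartbeat mechanism has been stopped in the meantime and, if so, cancels the heartbeat timeout task. We will analyze the heartbeat mechanism in more detail in a later chapter.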
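The idea of registering a monitor per target and detecting a missed heartbeat can be sketched as follows. This is a deliberately simplified model and not how HeartbeatManagerImpl works internally: it takes explicit timestamps instead of scheduling a timeout task, and the class HeartbeatSketch with all of its methods is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatSketch {
    // Tracks the last heartbeat seen from one target.
    static final class Monitor {
        private final long timeoutMs;
        private volatile long lastHeartbeatMs;

        Monitor(long timeoutMs, long nowMs) {
            this.timeoutMs = timeoutMs;
            this.lastHeartbeatMs = nowMs;
        }

        void reportHeartbeat(long nowMs) {
            lastHeartbeatMs = nowMs;
        }

        boolean isTimedOut(long nowMs) {
            return nowMs - lastHeartbeatMs > timeoutMs;
        }
    }

    private final Map<String, Monitor> targets = new ConcurrentHashMap<>();
    private final long timeoutMs;

    public HeartbeatSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Registers a target once; returns false if it is already monitored.
    public boolean monitorTarget(String resourceId, long nowMs) {
        return targets.putIfAbsent(resourceId, new Monitor(timeoutMs, nowMs)) == null;
    }

    // Records a heartbeat from a known target; unknown targets are ignored.
    public void receiveHeartbeat(String resourceId, long nowMs) {
        Monitor monitor = targets.get(resourceId);
        if (monitor != null) {
            monitor.reportHeartbeat(nowMs);
        }
    }

    public boolean isTimedOut(String resourceId, long nowMs) {
        Monitor monitor = targets.get(resourceId);
        return monitor != null && monitor.isTimedOut(nowMs);
    }

    public static void main(String[] args) {
        HeartbeatSketch heartbeats = new HeartbeatSketch(50);
        heartbeats.monitorTarget("resource-manager", 0);
        heartbeats.receiveHeartbeat("resource-manager", 40);
        System.out.println(heartbeats.isTimedOut("resource-manager", 80));  // false
        System.out.println(heartbeats.isTimedOut("resource-manager", 100)); // true
    }
}
```

Using caller-supplied timestamps keeps the sketch deterministic; a real implementation would more likely schedule a timeout task and reset it whenever a heartbeat arrives.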
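Once the heartbeat is set up, the connection object between the TaskExecutor and the ResourceManager is built:
// TODO Build the connection object between the TaskExecutor and the ResourceManager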
establishedResourceManagerConnection =
new EstablishedResourceManagerConnection(
resourceManagerGateway,
resourceManagerResourceId,
taskExecutorRegistrationId);
// TODO Stop the registration timeout task, because registration succeeded
stopRegistrationTimeout();
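Once it is built, the method that stops the registration timeout task is called.
That completes the analysis of the callback for successful registration. Now let's look at the callback for failed registration.
Registration failure callback
We step into onRegistrationFailure, choosing the TaskExecutorToResourceManagerConnection implementation, then into registrationListener.onRegistrationFailure, then into onFatalError, then into fatalErrorHandler.onFatalError, choosing the TaskManagerRunner implementation: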
@Override
public void onFatalError(Throwable exception) {
TaskManagerExceptionUtils.tryEnrichTaskManagerError(exception);
LOG.error(
"Fatal error occurred while executing the TaskManager. Shutting it down...",
exception);
if (ExceptionUtils.isJvmFatalOrOutOfMemoryError(exception)) {
terminateJVM();
} else {
closeAsync(Result.FAILURE);
FutureUtils.orTimeout(
terminationFuture, FATAL_ERROR_SHUTDOWN_TIMEOUT_MS, TimeUnit.MILLISECONDS);
}
}
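As you can see, a failed registration shuts down the current TaskManager worker node. If the exception is a fatal JVM error or an OutOfMemoryError, the JVM is terminated immediately via terminateJVM(). Otherwise closeAsync(Result.FAILURE) is called and the termination future is given a timeout of FATAL_ERROR_SHUTDOWN_TIMEOUT_MS.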
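The two-way decision in onFatalError can be sketched as a pure function. The set of error types checked below is an assumption chosen for illustration and may not match what ExceptionUtils.isJvmFatalOrOutOfMemoryError actually checks. The class FatalErrorSketch and the Action enum are hypothetical.

```java
public class FatalErrorSketch {
    public enum Action { TERMINATE_JVM, GRACEFUL_SHUTDOWN }

    // Decides how to shut down, based on the kind of error.
    // Assumption: these three error types are treated as unrecoverable.
    public static Action decide(Throwable error) {
        boolean jvmFatalOrOom =
                error instanceof InternalError
                        || error instanceof UnknownError
                        || error instanceof OutOfMemoryError;
        return jvmFatalOrOom ? Action.TERMINATE_JVM : Action.GRACEFUL_SHUTDOWN;
    }

    public static void main(String[] args) {
        System.out.println(decide(new OutOfMemoryError("heap")));
        System.out.println(decide(new IllegalStateException("registration failed")));
    }
}
```

Separating the decision from the side effect makes the policy easy to test without actually terminating a JVM.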
That completes the startup flow of resourceManagerLeaderRetriever.
2. Starting taskSlotTable
This part is simple, so we only touch on it briefly:
@Override
public void start(
SlotActions initialSlotActions, ComponentMainThreadExecutor mainThreadExecutor) {
Preconditions.checkState(
state == State.CREATED,
"The %s has to be just created before starting",
TaskSlotTableImpl.class.getSimpleName());
this.slotActions = Preconditions.checkNotNull(initialSlotActions);
this.mainThreadExecutor = Preconditions.checkNotNull(mainThreadExecutor);
timerService.start(this);
// TODO Update the state flag
state = State.RUNNING;
}
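As you can see, it only checks the state, stores the references it was given, starts the timerService, and updates the state flag.
3. Starting the JobMaster monitoring service
This one is also simple, so let's take a quick look: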
@Override
public void start(
final String initialOwnerAddress,
final RpcService initialRpcService,
final HighAvailabilityServices initialHighAvailabilityServices,
final JobLeaderListener initialJobLeaderListener) {
if (DefaultJobLeaderService.State.CREATED != state) {
throw new IllegalStateException("The service has already been started.");
} else {
LOG.info("Start job leader service.");
this.ownerAddress = Preconditions.checkNotNull(initialOwnerAddress);
this.rpcService = Preconditions.checkNotNull(initialRpcService);
this.highAvailabilityServices =
Preconditions.checkNotNull(initialHighAvailabilityServices);
this.jobLeaderListener = Preconditions.checkNotNull(initialJobLeaderListener);
state = DefaultJobLeaderService.State.STARTED;
}
}
Likewise, it only checks the state, stores the references it was given, and updates the state.
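Both start methods follow the same guard: verify the CREATED state, store the collaborators, then flip the state. A minimal sketch of that pattern follows. The class LifecycleSketch and its single ownerAddress field are hypothetical simplifications.

```java
import java.util.Objects;

public class LifecycleSketch {
    public enum State { CREATED, STARTED }

    private State state = State.CREATED;
    private String ownerAddress;

    // Starts the service exactly once; a second call is rejected.
    public void start(String initialOwnerAddress) {
        if (state != State.CREATED) {
            throw new IllegalStateException("The service has already been started.");
        }
        // Validate and store the collaborator before changing the state.
        this.ownerAddress = Objects.requireNonNull(initialOwnerAddress);
        state = State.STARTED;
    }

    public State getState() {
        return state;
    }

    public static void main(String[] args) {
        LifecycleSketch service = new LifecycleSketch();
        service.start("hypothetical-address");
        System.out.println(service.getState()); // STARTED
        try {
            service.start("hypothetical-address");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // The service has already been started.
        }
    }
}
```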
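At this point the TaskExecutor has finished starting up.
Summary
To summarize, the startup flow of resourceManagerLeaderRetriever accomplishes the following:
- The TaskExecutor registers with the ResourceManager
- The TaskExecutor maintains a heartbeat with the ResourceManager
- The TaskExecutor reports its own slot status to the ResourceManager
In the following chapters, we will analyze how the heartbeat interaction between the ResourceManager, the JobMaster, and the TaskExecutor is implemented.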