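While starting its internal components and services, JobMaster establishes an RPC connection to ResourceManager.
// Retrieves the ResourceManager leader (leader changes are observed through the LeaderRetrievalListener interface)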
private final LeaderRetrievalService resourceManagerLeaderRetriever;
/**
* Starts JobMaster's internal services, including the heartbeat services and the SlotPool
*/
private void startJobMasterServices() throws Exception {
// Start the heartbeat services: heartbeats are exchanged with the TaskManagers and the ResourceManager
startHeartbeatServices();
// Start the SlotPool: it requests slots from the ResourceManager and manages the slots allocated to this JobManager
slotPool.start(getFencingToken(), getAddress(), getMainThreadExecutor());
// Start the Scheduler inside JobMaster: it schedules and executes the Tasks
scheduler.start(getMainThreadExecutor());
/**
* Establish the RPC connection between JobMaster and ResourceManager
*/
reconnectToResourceManager(new FlinkException("Starting JobMaster component."));
/**
* Use the LeaderRetrievalService and register a LeaderRetrievalListener to watch for ResourceManager leader changes
*/
resourceManagerLeaderRetriever.start(new ResourceManagerLeaderListener());
}
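1. Aside: listening for ResourceManager leader changes
LeaderRetrievalService can report the current leader of a high-availability component because it relies on a LeaderRetrievalListener to observe leader changes. Whenever the leader changes, the listener is notified, so whichever component registers the listener is the one that receives the "leader changed" notification.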
/**
* Listens for ResourceManager leader changes and notifies JobMaster promptly
*/
private class ResourceManagerLeaderListener implements LeaderRetrievalListener {
/**
* Reacts to a ResourceManager leader change, e.g. by updating the leader address and reconnecting to the new leader
*/
@Override
public void notifyLeaderAddress(final String leaderAddress, final UUID leaderSessionID) {
runAsync(
// Pass the new leader address to JobMaster (runs asynchronously on JobMaster's main thread)
() -> notifyOfNewResourceManagerLeader(
leaderAddress,
ResourceManagerId.fromUuidOrNull(leaderSessionID)));
}
@Override
public void handleError(final Exception exception) {
handleJobMasterError(new Exception("Fatal error in the ResourceManager leader service", exception));
}
}
Suppose the ResourceManager leader changes. The LeaderRetrievalListener is notified and passes the new leader address to JobMaster right away, so JobMaster keeps talking to the current ResourceManager leader.
/**
* After a ResourceManager leader change, the LeaderRetrievalListener callback records the new leader's address and tries to connect to it
*/
private void notifyOfNewResourceManagerLeader(final String newResourceManagerAddress, final ResourceManagerId resourceManagerId) {
// Update the ResourceManager leader address
resourceManagerAddress = createResourceManagerAddress(newResourceManagerAddress, resourceManagerId);
// Try to connect to the new ResourceManager leader
reconnectToResourceManager(new FlinkException(String.format("ResourceManager leader changed to new address %s", resourceManagerAddress)));
}
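The listener mechanism above can be sketched without any Flink dependency. This is a minimal illustration, not Flink's implementation: `LeaderListener`, `ToyLeaderRetriever` and `ToyJobMaster` are hypothetical stand-ins for LeaderRetrievalListener, LeaderRetrievalService and JobMaster, and the "connection" is only a log entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

public class LeaderListenerSketch {
    /** Hypothetical stand-in for LeaderRetrievalListener. */
    interface LeaderListener {
        void notifyLeaderAddress(String leaderAddress, UUID leaderSessionId);
    }

    /** Hypothetical stand-in for LeaderRetrievalService: pushes leader changes to one listener. */
    static class ToyLeaderRetriever {
        private LeaderListener listener;

        void start(LeaderListener listener) {
            this.listener = listener;
        }

        void leaderChanged(String address, UUID sessionId) {
            if (listener != null) {
                listener.notifyLeaderAddress(address, sessionId);
            }
        }
    }

    /** Hypothetical stand-in for JobMaster: closes the old connection, then connects to the new leader. */
    static class ToyJobMaster implements LeaderListener {
        private String currentLeader;
        final List<String> log = new ArrayList<>();

        @Override
        public void notifyLeaderAddress(String leaderAddress, UUID leaderSessionId) {
            if (currentLeader != null) {
                log.add("close:" + currentLeader);   // mirrors closeResourceManagerConnection
            }
            currentLeader = leaderAddress;           // mirrors updating resourceManagerAddress
            log.add("connect:" + leaderAddress);     // mirrors tryConnectToResourceManager
        }
    }

    static List<String> demo() {
        ToyLeaderRetriever retriever = new ToyLeaderRetriever();
        ToyJobMaster jobMaster = new ToyJobMaster();
        retriever.start(jobMaster);
        retriever.leaderChanged("rm-1", UUID.randomUUID());
        retriever.leaderChanged("rm-2", UUID.randomUUID());
        return jobMaster.log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Whoever is passed to `start()` receives the notifications, which is the point made in the text: the component that owns the listener is the one that learns about the new leader.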
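2. Establishing the connection
When JobMaster starts, it establishes an RPC connection to the ResourceManager leader. When the LeaderRetrievalListener later observes a leader change, the same method is called, once the new leader's address has been recorded, to try to connect to the new leader.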
/**
* Establish the RPC connection between JobMaster and ResourceManager
*/
private void reconnectToResourceManager(Exception cause) {
// Close the previous connection
closeResourceManagerConnection(cause);
// Create the new RPC connection: ResourceManagerConnection
tryConnectToResourceManager();
}
/**
* Create the new RPC connection: ResourceManagerConnection
*/
private void tryConnectToResourceManager() {
// The LeaderRetrievalListener callback keeps resourceManagerAddress up to date; if no address is known yet, nothing is done here
if (resourceManagerAddress != null) {
// Create the new RPC connection to the ResourceManager leader: ResourceManagerConnection
connectToResourceManager();
}
}
Once the ResourceManager leader's address is known, the RPC connection can be established.
/**
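* Create the new RPC connection:
* connecting essentially means registering. JobMaster registers itself with ResourceManager through the Gateway resolved from the ResourceManager leader's address.
* Follow-up handling depends on the registration result, RegistrationResponse.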
*/
private void connectToResourceManager() {
assert(resourceManagerAddress != null);
assert(resourceManagerConnection == null);
assert(establishedResourceManagerConnection == null);
log.info("Connecting to ResourceManager {}", resourceManagerAddress);
/**
* Create the ResourceManagerConnection, which implements the registration and the handling of its result
*/
resourceManagerConnection = new ResourceManagerConnection(
log,
jobGraph.getJobID(),
resourceId,
getAddress(),
getFencingToken(),
// ResourceManager leader address observed by the LeaderRetrievalListener
resourceManagerAddress.getAddress(),
resourceManagerAddress.getResourceManagerId(),
scheduledExecutorService);
/**
* Establishing the connection essentially means resolving the target RPC endpoint's Gateway and then registering with that endpoint
*/
resourceManagerConnection.start();
}
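For two RPC endpoints to establish an RPC connection, a RegisteredRpcConnection is needed; it is the object that represents the connection. RegisteredRpcConnection works together with RetryingRegistration to resolve the Gateway and complete the registration.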
/**
* Start establishing the RPC connection between two RPC components, using RegisteredRpcConnection together with RetryingRegistration
*/
public void start() {
checkState(!closed, "The RPC connection is already closed");
checkState(!isConnected() && pendingRegistration == null, "The RPC connection is already started");
// Create the RetryingRegistration used for this connection: it registers one RPC endpoint with the target endpoint through the target's Gateway
final RetryingRegistration<F, G, S> newRegistration = createNewRegistration();
// Atomically set pendingRegistration to the new registration, but only if it is still null
if (REGISTRATION_UPDATER.compareAndSet(this, null, newRegistration)) {
/**
* Start the registration: resolve the Gateway from the target endpoint's address, register through it, and obtain the result, RegistrationResponse
*/
newRegistration.startRegistration();
} else {
// concurrent start operation
newRegistration.cancel();
}
}
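The compare-and-set guard in start() can be tried out in isolation. The sketch below is only an illustration under assumptions: `Connection`, `Registration` and the `pending` field are hypothetical names, and the checkState calls of the original are left out on purpose so that a second start() reaches the CAS and loses it.

```java
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

public class CasStartSketch {
    /** Hypothetical stand-in for RetryingRegistration. */
    static class Registration {
        boolean started;
        boolean cancelled;
    }

    /** Hypothetical stand-in for RegisteredRpcConnection. */
    static class Connection {
        // The updated field must be volatile and is looked up by name.
        private static final AtomicReferenceFieldUpdater<Connection, Registration> UPDATER =
            AtomicReferenceFieldUpdater.newUpdater(Connection.class, Registration.class, "pending");

        private volatile Registration pending;

        Registration start() {
            Registration candidate = new Registration();
            // Succeeds only if no registration has been installed yet.
            if (UPDATER.compareAndSet(this, null, candidate)) {
                candidate.started = true;
            } else {
                // Another start() won the race: discard this attempt.
                candidate.cancelled = true;
            }
            return candidate;
        }
    }

    static String demo() {
        Connection connection = new Connection();
        Registration first = connection.start();
        Registration second = connection.start();
        return (first.started ? "started" : "cancelled") + "," + (second.started ? "started" : "cancelled");
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Only one registration can ever be installed in the field, so concurrent callers cannot start two registrations for the same connection.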
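The first step is to build the RetryingRegistration. RetryingRegistration declares an abstract registration method that is invoked only after the Gateway has been resolved; its concrete implementation is supplied when the RetryingRegistration is built, by ResourceManagerConnection, a concrete subclass of the abstract RegisteredRpcConnection. As soon as the ResourceManager's Gateway is resolved, this registration call is executed.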
/**
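* Two RPC endpoints establish a connection by obtaining the target's Gateway and registering with the target. A successful registration means the connection is established.
* Note: the registration itself and the handling of its result are both implemented by subclasses of RegisteredRpcConnection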
*/
private RetryingRegistration<F, G, S> createNewRegistration() {
/**
* Core step: create the RetryingRegistration; the concrete logic is supplied by the RegisteredRpcConnection subclass
*/
RetryingRegistration<F, G, S> newRegistration = checkNotNull(generateRegistration());
// Get the CompletableFuture created inside the RetryingRegistration
CompletableFuture<Tuple2<G, S>> future = newRegistration.getFuture();
/**
* Attach the follow-up handling of the registration result; the concrete handling is supplied by the RegisteredRpcConnection subclass.
* This callback runs once the registration future completes and receives its result or its failure
*/
future.whenCompleteAsync(
(Tuple2<G, S> result, Throwable failure) -> {
// This future should only fail because of a bug, not because the registration was declined
if (failure != null) {
if (failure instanceof CancellationException) {
log.debug("Retrying registration towards {} was cancelled.", targetAddress);
} else {
onRegistrationFailure(failure);
}
} else {
// Registration succeeded: store the target endpoint's Gateway in the RegisteredRpcConnection
targetGateway = result.f0;
/**
* Invoke the success callback, implemented by the RegisteredRpcConnection subclass
*/
onRegistrationSuccess(result.f1);
}
}, executor);
return newRegistration;
}
/**
* Create the RetryingRegistration
*/
@Override
protected RetryingRegistration<ResourceManagerId, ResourceManagerGateway, JobMasterRegistrationSuccess> generateRegistration() {
// Build the RetryingRegistration used to register JobMaster with ResourceManager
return new RetryingRegistration<ResourceManagerId, ResourceManagerGateway, JobMasterRegistrationSuccess>(
log,
getRpcService(),
"ResourceManager",
ResourceManagerGateway.class,
getTargetAddress(),
getTargetLeaderId(),
jobMasterConfiguration.getRetryingRegistrationConfiguration()) {
/**
* Invoked once the Gateway has been resolved from the ResourceManager leader's address: registers JobMaster with ResourceManager
*/
@Override
protected CompletableFuture<RegistrationResponse> invokeRegistration(
ResourceManagerGateway gateway, ResourceManagerId fencingToken, long timeoutMillis) {
Time timeout = Time.milliseconds(timeoutMillis);
// JobMaster registers itself through the ResourceManager's Gateway; the call returns a RegistrationResponse
return gateway.registerJobManager(
jobMasterId,
jobManagerResourceID,
jobManagerRpcAddress,
jobID,
timeout);
}
};
}
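The second step is to supply the handling for the different outcomes once the asynchronous registration completes. This is also implemented by ResourceManagerConnection, the concrete subclass of RegisteredRpcConnection.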
/**
* Once the ResourceManagerConnection has registered successfully, the success callback finishes the setup, e.g. by sending slot requests to the ResourceManager
*/
@Override
protected void onRegistrationSuccess(final JobMasterRegistrationSuccess success) {
runAsync(() -> {
if (this == resourceManagerConnection) {
// After JobMaster has registered with ResourceManager, finish establishing the RPC connection
establishResourceManagerConnection(success);
}
});
}
/**
* Handling of a registration failure
*/
@Override
protected void onRegistrationFailure(final Throwable failure) {
handleJobMasterError(failure);
}
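The division of labour described above is a template method: the base connection class owns the flow, and the subclass supplies the registration and both callbacks. The following is a reduced sketch under assumptions, not Flink code: `ToyConnection` and `ToyResourceManagerConnection` are hypothetical, the registration is a plain CompletableFuture, and the callbacks only record what happened.

```java
import java.util.concurrent.CompletableFuture;

public class ConnectionTemplateSketch {
    /** Hypothetical stand-in for RegisteredRpcConnection: fixes the flow, defers the details. */
    abstract static class ToyConnection<S> {
        protected abstract CompletableFuture<S> generateRegistration();

        protected abstract void onRegistrationSuccess(S success);

        protected abstract void onRegistrationFailure(Throwable failure);

        final void start() {
            generateRegistration().whenComplete((success, failure) -> {
                if (failure != null) {
                    onRegistrationFailure(failure);
                } else {
                    onRegistrationSuccess(success);
                }
            });
        }
    }

    /** Hypothetical stand-in for ResourceManagerConnection. */
    static class ToyResourceManagerConnection extends ToyConnection<String> {
        private final CompletableFuture<String> registration;
        String outcome = "pending";

        ToyResourceManagerConnection(CompletableFuture<String> registration) {
            this.registration = registration;
        }

        @Override
        protected CompletableFuture<String> generateRegistration() {
            return registration;
        }

        @Override
        protected void onRegistrationSuccess(String success) {
            outcome = "established:" + success;
        }

        @Override
        protected void onRegistrationFailure(Throwable failure) {
            outcome = "fatal:" + failure.getMessage();
        }
    }

    static String run(CompletableFuture<String> registration) {
        ToyResourceManagerConnection connection = new ToyResourceManagerConnection(registration);
        connection.start();
        return connection.outcome;
    }

    public static void main(String[] args) {
        System.out.println(run(CompletableFuture.completedFuture("registered")));
        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new IllegalStateException("bug"));
        System.out.println(run(failed));
    }
}
```

Because the futures passed in are already completed, the callbacks run inside `start()`; in the original code they are dispatched to an executor instead.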
With the RetryingRegistration in place, the next step is to resolve the ResourceManager leader's address into a Gateway and have JobMaster register with ResourceManager.
/**
* Resolves the target address into a callable Gateway and then starts the registration, e.g. obtains the ResourceManager's Gateway and registers JobMaster with ResourceManager
*/
@SuppressWarnings("unchecked")
public void startRegistration() {
if (canceled) {
return;
}
try {
// Resolve the target RPC endpoint's Gateway
final CompletableFuture<G> rpcGatewayFuture;
// Call AkkaRpcService#connect() to create and obtain the RpcGateway for the target endpoint.
// For a FencedRpcGateway, connect() is passed the fencingToken and the FencedRpcGateway type and creates a FencedRpcGateway proxy;
// otherwise it creates a plain RpcGateway proxy
if (FencedRpcGateway.class.isAssignableFrom(targetType)) {
rpcGatewayFuture = (CompletableFuture<G>) rpcService.connect(
targetAddress,
fencingToken,
targetType.asSubclass(FencedRpcGateway.class));
} else {
rpcGatewayFuture = rpcService.connect(targetAddress, targetType);
}
// Next asynchronous stage after the Gateway is resolved: attempt the registration (thenAcceptAsync receives the previous stage's result as its argument and returns no value)
CompletableFuture<Void> rpcGatewayAcceptFuture = rpcGatewayFuture.thenAcceptAsync(
(G rpcGateway) -> {
log.info("Resolved {} address, beginning registration", targetName);
/**
* Start registering with the resolved ResourceManager Gateway
*/
register(rpcGateway, 1, retryingRegistrationConfiguration.getInitialRegistrationTimeoutMillis());
},
rpcService.getExecutor());
// On failure, retry after a delay unless the registration was cancelled
rpcGatewayAcceptFuture.whenCompleteAsync(
(Void v, Throwable failure) -> {
if (failure != null && !canceled) {
final Throwable strippedFailure = ExceptionUtils.stripCompletionException(failure);
if (log.isDebugEnabled()) {
log.debug(
"Could not resolve {} address {}, retrying in {} ms.",
targetName,
targetAddress,
retryingRegistrationConfiguration.getErrorDelayMillis(),
strippedFailure);
} else {
log.info(
"Could not resolve {} address {}, retrying in {} ms: {}.",
targetName,
targetAddress,
retryingRegistrationConfiguration.getErrorDelayMillis(),
strippedFailure.getMessage());
}
startRegistrationLater(retryingRegistrationConfiguration.getErrorDelayMillis());
}
},
rpcService.getExecutor());
}
catch (Throwable t) {
completionFuture.completeExceptionally(t);
cancel();
}
}
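The two-stage future chain in startRegistration() (resolve the Gateway, then register, with a separate stage that reacts to failures) can be reproduced with plain CompletableFuture. This is a sketch under assumptions: the gateway is just a String, "registering" and "retrying" are recorded as events rather than performed, and a direct executor replaces rpcService.getExecutor() so that the example is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

public class ResolveThenRegisterSketch {
    static List<String> run(CompletableFuture<String> gatewayFuture) {
        List<String> events = new ArrayList<>();
        Executor direct = Runnable::run; // runs each stage on the calling thread

        // Stage 1: consume the resolved gateway and begin the registration.
        CompletableFuture<Void> accepted = gatewayFuture.thenAcceptAsync(
            gateway -> events.add("register via " + gateway),
            direct);

        // Stage 2: if resolving (or stage 1) failed, schedule a retry.
        accepted.whenCompleteAsync(
            (ignored, failure) -> {
                if (failure != null) {
                    // Dependent stages see the original error wrapped in a CompletionException.
                    Throwable cause = (failure instanceof CompletionException && failure.getCause() != null)
                        ? failure.getCause()
                        : failure;
                    events.add("retry later: " + cause.getMessage());
                }
            },
            direct);

        return events;
    }

    public static void main(String[] args) {
        System.out.println(run(CompletableFuture.completedFuture("rm-gateway")));

        CompletableFuture<String> unresolved = new CompletableFuture<>();
        unresolved.completeExceptionally(new IllegalStateException("address unreachable"));
        System.out.println(run(unresolved));
    }
}
```

The unwrapping step plays the role of ExceptionUtils.stripCompletionException in the original: the failure handler is attached to the dependent stage, so it receives the wrapped exception rather than the original one.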
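As shown, once the ResourceManager leader's address has been resolved asynchronously, the resulting Gateway is used to perform the registration. That is the abstract invokeRegistration() method declared in RetryingRegistration, whose implementation was supplied when the RetryingRegistration was created: it calls ResourceManagerGateway#registerJobManager().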
/**
* The RPC endpoint registers with the target endpoint and triggers the success or failure handling according to the result
*/
@SuppressWarnings("unchecked")
private void register(final G gateway, final int attempt, final long timeoutMillis) {
if (canceled) {
return;
}
try {
log.info("Registration at {} attempt {} (timeout={}ms)", targetName, attempt, timeoutMillis);
// The abstract method is implemented by the subclass, e.g. JobMaster registers through the ResourceManagerGateway and obtains a RegistrationResponse
CompletableFuture<RegistrationResponse> registrationFuture = invokeRegistration(gateway, fencingToken, timeoutMillis);
/**
* Once the asynchronous registration call completes, run the next asynchronous stage
*/
CompletableFuture<Void> registrationAcceptFuture = registrationFuture.thenAcceptAsync(
// Handle the registration result
(RegistrationResponse result) -> {
if (!isCanceled()) {
if (result instanceof RegistrationResponse.Success) {
// registration successful!
S success = (S) result;
// Registration succeeded: complete the future with the target Gateway and the response wrapped in a Tuple2
completionFuture.complete(Tuple2.of(gateway, success));
}
else {
// Registration was declined, or the response type is unknown
if (result instanceof RegistrationResponse.Decline) {
RegistrationResponse.Decline decline = (RegistrationResponse.Decline) result;
log.info("Registration at {} was declined: {}", targetName, decline.getReason());
} else {
log.error("Received unknown response to registration attempt: {}", result);
}
log.info("Pausing and re-attempting registration in {} ms", retryingRegistrationConfiguration.getRefusedDelayMillis());
registerLater(gateway, 1, retryingRegistrationConfiguration.getInitialRegistrationTimeoutMillis(), retryingRegistrationConfiguration.getRefusedDelayMillis());
}
}
},
rpcService.getExecutor());
/**
* Retry on failure
*/
registrationAcceptFuture.whenCompleteAsync(
(Void v, Throwable failure) -> {
if (failure != null && !isCanceled()) {
if (ExceptionUtils.stripCompletionException(failure) instanceof TimeoutException) {
if (log.isDebugEnabled()) {
log.debug("Registration at {} ({}) attempt {} timed out after {} ms",
targetName, targetAddress, attempt, timeoutMillis);
}
long newTimeoutMillis = Math.min(2 * timeoutMillis, retryingRegistrationConfiguration.getMaxRegistrationTimeoutMillis());
register(gateway, attempt + 1, newTimeoutMillis);
}
else {
log.error("Registration at {} failed due to an error", targetName, failure);
log.info("Pausing and re-attempting registration in {} ms", retryingRegistrationConfiguration.getErrorDelayMillis());
registerLater(gateway, 1, retryingRegistrationConfiguration.getInitialRegistrationTimeoutMillis(), retryingRegistrationConfiguration.getErrorDelayMillis());
}
}
},
rpcService.getExecutor());
}
catch (Throwable t) {
completionFuture.completeExceptionally(t);
cancel();
}
}
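On a timeout, register() retries immediately with a doubled timeout, capped at the configured maximum; a declined registration or any other error restarts at attempt 1 with the initial timeout after a delay. The doubling rule is easy to check in isolation. In this sketch the 100 ms initial timeout and 30000 ms cap are values chosen for illustration only, and the method names are hypothetical.

```java
import java.util.Arrays;

public class RegistrationBackoffSketch {
    /** Same rule as in register(): double the timeout, but never exceed the maximum. */
    static long nextTimeoutMillis(long currentMillis, long maxMillis) {
        return Math.min(2 * currentMillis, maxMillis);
    }

    /** Timeouts used by consecutive attempts that each time out. */
    static long[] timeoutSchedule(long initialMillis, long maxMillis, int attempts) {
        long[] schedule = new long[attempts];
        long timeout = initialMillis;
        for (int i = 0; i < attempts; i++) {
            schedule[i] = timeout;
            timeout = nextTimeoutMillis(timeout, maxMillis);
        }
        return schedule;
    }

    public static void main(String[] args) {
        // Illustrative values: 100 ms initial timeout, 30000 ms cap, 10 attempts.
        System.out.println(Arrays.toString(timeoutSchedule(100, 30000, 10)));
        // prints [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000]
    }
}
```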
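When the registration finishes, the result (RegistrationResponse) is inspected. If it is a RegistrationResponse.Success, the Gateway and the response are wrapped in a Tuple2, which is used to complete the registration's future.
The success handling then runs (as mentioned earlier when the RetryingRegistration was built):
// Registration succeeded: store the target endpoint's Gateway in the RegisteredRpcConnection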
targetGateway = result.f0;
/**
* Invoke the success callback, implemented by the RegisteredRpcConnection subclass
*/
onRegistrationSuccess(result.f1);
/**
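* Once the ResourceManagerConnection has registered successfully, only the final steps of establishing the RPC connection remain, e.g. the success callback sends slot requests to the ResourceManager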
*/
@Override
protected void onRegistrationSuccess(final JobMasterRegistrationSuccess success) {
runAsync(() -> {
if (this == resourceManagerConnection) {
// After JobMaster has registered with ResourceManager, only the final steps of establishing the RPC connection remain
establishResourceManagerConnection(success);
}
});
}
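After a successful registration, only the final steps of establishing the RPC connection remain, such as initializing the connection's details and sending slot requests to ResourceManager.
3. Summary
To establish an RPC connection to ResourceManager, JobMaster first needs a RegisteredRpcConnection, which represents the RPC connection between two RPC endpoints. Establishing the connection means registering: the caller registers itself with the target endpoint through the target's Gateway and receives the result, a RegistrationResponse. RegisteredRpcConnection works together with RetryingRegistration (which resolves the Gateway from the target endpoint's address) to resolve the Gateway, perform the registration, handle the RegistrationResponse, and run the follow-up steps once the registration has completed.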