聊聊flink的slot.idle.timeout配置

本文主要研究一下flink的slot.idle.timeout配置

JobManagerOptions

flink-release-1.7.2/flink-core/src/main/java/org/apache/flink/configuration/JobManagerOptions.java

@PublicEvolving
public class JobManagerOptions {
	//......

	/**
	 * The timeout in milliseconds for a idle slot in Slot Pool.
	 */
	public static final ConfigOption<Long> SLOT_IDLE_TIMEOUT =
		key("slot.idle.timeout")
			// default matches heartbeat.timeout so that sticky allocation is not lost on timeouts for local recovery
			.defaultValue(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT.defaultValue())
			.withDescription("The timeout in milliseconds for a idle slot in Slot Pool.");

	//......
}
复制代码
  • slot.idle.timeout默认为HeartbeatManagerOptions.HEARTBEAT_TIMEOUT.defaultValue(),即50000L毫秒

SlotPool

flink-release-1.7.2/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SlotPool.java

public class SlotPool extends RpcEndpoint implements SlotPoolGateway, AllocatedSlotActions {

	/** The interval (in milliseconds) in which the SlotPool writes its slot distribution on debug level. */
	private static final int STATUS_LOG_INTERVAL_MS = 60_000;

	private final JobID jobId;

	private final SchedulingStrategy schedulingStrategy;

	private final ProviderAndOwner providerAndOwner;

	/** All registered TaskManagers, slots will be accepted and used only if the resource is registered. */
	private final HashSet<ResourceID> registeredTaskManagers;

	/** The book-keeping of all allocated slots. */
	private final AllocatedSlots allocatedSlots;

	/** The book-keeping of all available slots. */
	private final AvailableSlots availableSlots;

	/** All pending requests waiting for slots. */
	private final DualKeyMap<SlotRequestId, AllocationID, PendingRequest> pendingRequests;

	/** The requests that are waiting for the resource manager to be connected. */
	private final HashMap<SlotRequestId, PendingRequest> waitingForResourceManager;

	/** Timeout for external request calls (e.g. to the ResourceManager or the TaskExecutor). */
	private final Time rpcTimeout;

	/** Timeout for releasing idle slots. */
	private final Time idleSlotTimeout;

	private final Clock clock;

	/** Managers for the different slot sharing groups. */
	protected final Map<SlotSharingGroupId, SlotSharingManager> slotSharingManagers;

	/** the fencing token of the job manager. */
	private JobMasterId jobMasterId;

	/** The gateway to communicate with resource manager. */
	private ResourceManagerGateway resourceManagerGateway;

	private String jobManagerAddress;

	//......

	/**
	 * Start the slot pool to accept RPC calls.
	 *
	 * @param jobMasterId The necessary leader id for running the job.
	 * @param newJobManagerAddress for the slot requests which are sent to the resource manager
	 */
	public void start(JobMasterId jobMasterId, String newJobManagerAddress) throws Exception {
		this.jobMasterId = checkNotNull(jobMasterId);
		this.jobManagerAddress = checkNotNull(newJobManagerAddress);

		// TODO - start should not throw an exception
		try {
			super.start();
		} catch (Exception e) {
			throw new RuntimeException("This should never happen", e);
		}

		scheduleRunAsync(this::checkIdleSlot, idleSlotTimeout);

		if (log.isDebugEnabled()) {
			scheduleRunAsync(this::scheduledLogStatus, STATUS_LOG_INTERVAL_MS, TimeUnit.MILLISECONDS);
		}
	}

	/**
	 * Check the available slots, release the slot that is idle for a long time.
	 */
	private void checkIdleSlot() {

		// The timestamp in SlotAndTimestamp is relative
		final long currentRelativeTimeMillis = clock.relativeTimeMillis();

		final List<AllocatedSlot> expiredSlots = new ArrayList<>(availableSlots.size());

		for (SlotAndTimestamp slotAndTimestamp : availableSlots.availableSlots.values()) {
			if (currentRelativeTimeMillis - slotAndTimestamp.timestamp > idleSlotTimeout.toMilliseconds()) {
				expiredSlots.add(slotAndTimestamp.slot);
			}
		}

		final FlinkException cause = new FlinkException("Releasing idle slot.");

		for (AllocatedSlot expiredSlot : expiredSlots) {
			final AllocationID allocationID = expiredSlot.getAllocationId();
			if (availableSlots.tryRemove(allocationID) != null) {

				log.info("Releasing idle slot [{}].", allocationID);
				final CompletableFuture<Acknowledge> freeSlotFuture = expiredSlot.getTaskManagerGateway().freeSlot(
					allocationID,
					cause,
					rpcTimeout);

				freeSlotFuture.whenCompleteAsync(
					(Acknowledge ignored, Throwable throwable) -> {
						if (throwable != null) {
							if (registeredTaskManagers.contains(expiredSlot.getTaskManagerId())) {
								log.debug("Releasing slot [{}] of registered TaskExecutor {} failed. " +
									"Trying to fulfill a different slot request.", allocationID, expiredSlot.getTaskManagerId(),
									throwable);
								tryFulfillSlotRequestOrMakeAvailable(expiredSlot);
							} else {
								log.debug("Releasing slot [{}] failed and owning TaskExecutor {} is no " +
									"longer registered. Discarding slot.", allocationID, expiredSlot.getTaskManagerId());
							}
						}
					},
					getMainThreadExecutor());
			}
		}

		scheduleRunAsync(this::checkIdleSlot, idleSlotTimeout);
	}

	//......
}
复制代码
  • SlotPool在start方法里头,调用scheduleRunAsync方法,延时idleSlotTimeout调度执行checkIdleSlot;checkIdleSlot方法会挨个检查availableSlots的SlotAndTimestamp,判断当前时间与slotAndTimestamp.timestamp的时间差是否超过idleSlotTimeout,超过的话,则放入expiredSlots,之后对expiredSlots挨个进行availableSlots.tryRemove,然后调用TaskManagerGateway.freeSlot进行释放,之后再次调用scheduleRunAsync(this::checkIdleSlot, idleSlotTimeout)进行下一次的延时调度检测

RpcEndpoint

flink-release-1.7.2/flink-runtime/src/main/java/org/apache/flink/runtime/rpc/RpcEndpoint.java

public abstract class RpcEndpoint implements RpcGateway {
	//......

	/**
	 * Execute the runnable in the main thread of the underlying RPC endpoint, with
	 * a delay of the given number of milliseconds.
	 *
	 * @param runnable Runnable to be executed
	 * @param delay    The delay after which the runnable will be executed
	 */
	protected void scheduleRunAsync(Runnable runnable, Time delay) {
		scheduleRunAsync(runnable, delay.getSize(), delay.getUnit());
	}

	/**
	 * Execute the runnable in the main thread of the underlying RPC endpoint, with
	 * a delay of the given number of milliseconds.
	 *
	 * @param runnable Runnable to be executed
	 * @param delay    The delay after which the runnable will be executed
	 */
	protected void scheduleRunAsync(Runnable runnable, long delay, TimeUnit unit) {
		rpcServer.scheduleRunAsync(runnable, unit.toMillis(delay));
	}

	//......
}
复制代码
  • RpcEndpoint提供了scheduleRunAsync,其最后调用的是rpcServer.scheduleRunAsync

小结

  • slot.idle.timeout默认为HeartbeatManagerOptions.HEARTBEAT_TIMEOUT.defaultValue(),即50000L毫秒
  • SlotPool在start方法里头,调用scheduleRunAsync方法,延时idleSlotTimeout调度执行checkIdleSlot;checkIdleSlot方法会挨个检查availableSlots的SlotAndTimestamp,判断当前时间与slotAndTimestamp.timestamp的时间差是否超过idleSlotTimeout,超过的话,则放入expiredSlots,之后对expiredSlots挨个进行availableSlots.tryRemove,然后调用TaskManagerGateway.freeSlot进行释放,之后再次调用scheduleRunAsync(this::checkIdleSlot, idleSlotTimeout)进行下一次的延时调度检测
  • RpcEndpoint提供了scheduleRunAsync,其最后调用的是rpcServer.scheduleRunAsync

doc

`flink.checkpoint.timeout` 和 `flink.checkpoint.interval` 是 Flink 中与检查点相关的两个参数,它们之间存在一定的关系。 - `flink.checkpoint.timeout` 参数定义了执行检查点的超时时间,即当执行检查点操作时,如果超过了指定的超时时间仍未完成,则会被视为失败。 - `flink.checkpoint.interval` 参数定义了两次检查点之间的时间间隔,即多久执行一次检查点。 这两个参数的关系可以通过以下几点来说明: 1. `flink.checkpoint.timeout` 应该大于等于 `flink.checkpoint.interval`。确保超时时间足够长以容纳一个完整的检查点操作,否则可能会导致检查点失败。 2. 如果 `flink.checkpoint.timeout` 被设置得过小,可能会导致检查点操作在超时之前无法完成。在这种情况下,可以适当增加 `flink.checkpoint.timeout` 的值,以便给检查点操作足够的时间来完成。 3. 如果 `flink.checkpoint.interval` 被设置得过小,系统将更频繁地进行检查点操作,从而导致更高的系统开销和资源消耗。因此,在设置 `flink.checkpoint.interval` 时需要综合考虑系统的性能要求和资源限制。 需要根据应用程序的实际情况和需求来评估和调整 `flink.checkpoint.timeout` 和 `flink.checkpoint.interval` 的值。同时,还应该考虑 Flink 集群的配置和硬件资源是否能够支持所选的超时时间和间隔。在设置之后,建议进行性能测试和实际生产环境的实验来验证和优化这两个参数的值。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值