Notes on a Flink job whose slot was removed

Record of a Flink job crash

Exception:

org.apache.flink.util.FlinkException: The assigned slot container_e01_1589821551483_0145_01_000012_2 was removed.
Now a look at the JobMaster log:
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1139)

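The exception message is clear enough: a slot was removed. My first guess was that the slot did not have enough resources to run several tasks, so let's follow the log into the source code.
(PS: I took a lot of detours along the way. If you are short on patience, skip straight to the bottom.)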

/**
 * Removes the given slot from the slot manager.
 *
 * @param slotId identifying the slot to remove
 */
private void removeSlot(SlotID slotId) {
	TaskManagerSlot slot = slots.remove(slotId);

	if (null != slot) {
		freeSlots.remove(slotId);

		if (slot.getState() == TaskManagerSlot.State.PENDING) {
			// reject the pending slot request --> triggering a new allocation attempt
			rejectPendingSlotRequest(
				slot.getAssignedSlotRequest(),
				new Exception("The assigned slot " + slot.getSlotId() + " was removed."));
		}

		AllocationID oldAllocationId = slot.getAllocationId();

		if (oldAllocationId != null) {
			fulfilledSlotRequests.remove(oldAllocationId);

			resourceActions.notifyAllocationFailure(
				slot.getJobId(),
				oldAllocationId,
				new FlinkException("The assigned slot " + slot.getSlotId() + " was removed."));
		}
	} else {
		LOG.debug("There was no slot registered with slot id {}.", slotId);
	}
}

First, figure out what slots, freeSlots and fulfilledSlotRequests are:

	private final HashMap<SlotID, TaskManagerSlot> slots;
	private final LinkedHashMap<SlotID, TaskManagerSlot> freeSlots;
	private final HashMap<AllocationID, SlotID> fulfilledSlotRequests;

slots is the collection of all registered slots (if I understand it correctly), and freeSlots is the collection of idle slots.
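To make the relationship between the two maps concrete, here is a minimal sketch. It is not Flink code: `SlotBookkeepingSketch`, `register`, `allocate`, and the string slot IDs are hypothetical stand-ins for `SlotID` and `TaskManagerSlot`. It only illustrates the idea that every slot lives in `slots`, while `freeSlots` holds the idle subset in insertion order (which is what a `LinkedHashMap` gives you).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlotBookkeepingSketch {
    // Hypothetical stand-ins: a String replaces SlotID, a state name replaces TaskManagerSlot.
    private final Map<String, String> slots = new HashMap<>();
    private final Map<String, String> freeSlots = new LinkedHashMap<>();

    // A newly registered slot is known to the manager and idle.
    void register(String slotId) {
        slots.put(slotId, "FREE");
        freeSlots.put(slotId, "FREE");
    }

    // Allocating a slot keeps it in `slots` but takes it out of `freeSlots`.
    void allocate(String slotId) {
        if (freeSlots.remove(slotId) != null) {
            slots.put(slotId, "ALLOCATED");
        }
    }

    // Free slot IDs in the order they were registered.
    List<String> freeSlotIds() {
        return new ArrayList<>(freeSlots.keySet());
    }

    int totalSlots() {
        return slots.size();
    }

    public static void main(String[] args) {
        SlotBookkeepingSketch s = new SlotBookkeepingSketch();
        s.register("tm1_0");
        s.register("tm1_1");
        s.register("tm2_0");
        s.allocate("tm1_1");
        System.out.println(s.totalSlots() + " slots, free: " + s.freeSlotIds());
    }
}
```

In this toy run, three slots stay registered while only the two unallocated ones remain in the free set.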
Now let's see what the state is:

	/**
	 * State of the {@link TaskManagerSlot}.
	 */
	public enum State {
		FREE,
		PENDING,
		ALLOCATED
	}

It is an enum with three members: free, pending (waiting to be allocated), and allocated.
And what is AllocationID?

/**
 * Unique identifier for a physical slot allocated by a JobManager via the ResourceManager
 * from a TaskManager. The ID is assigned once the JobManager (or its SlotPool) first
 * requests the slot and is constant across retries.
 *
 * <p>This ID is used by the TaskManager and ResourceManager to track and synchronize which
 * slots are allocated to which JobManager and which are free.
 *
 * <p>In contrast to this AllocationID, the {@link org.apache.flink.runtime.jobmaster.SlotRequestId}
 * is used when a task requests a logical slot from the SlotPool. Multiple logical slot requests
 * can map to one physical slot request (due to slot sharing).
 */
public class AllocationID extends AbstractID {

	private static final long serialVersionUID = 1L;

	/**
	 * Constructs a new random AllocationID.
	 */
	public AllocationID() {
		super();
	}

	/**
	 * Constructs a new AllocationID with the given parts.
	 *
	 * @param lowerPart the lower bytes of the ID
	 * @param upperPart the higher bytes of the ID
	 */
	public AllocationID(long lowerPart, long upperPart) {
		super(lowerPart, upperPart);
	}

	@Override
	public String toString() {
		return "AllocationID{" + super.toString() + '}';
	}
}

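Roughly: it is the unique identifier of a physical slot that a JobManager allocates from a TaskManager via the ResourceManager. The ID is assigned when the JobManager (or its SlotPool) first requests the slot, and it stays the same across retries. The TaskManager and ResourceManager use this ID to track and synchronize which slots are allocated to which JobManager and which are free.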
Now two more methods:

	/**
	 * Rejects the pending slot request by failing the request future with a
	 * {@link SlotAllocationException}.
	 *
	 * @param pendingSlotRequest to reject
	 * @param cause of the rejection
	 */
	private void rejectPendingSlotRequest(PendingSlotRequest pendingSlotRequest, Exception cause) {
		CompletableFuture<Acknowledge> request = pendingSlotRequest.getRequestFuture();

		if (null != request) {
			request.completeExceptionally(new SlotAllocationException(cause));
		} else {
			LOG.debug("Cannot reject pending slot request {}, since no request has been sent.", pendingSlotRequest.getAllocationId());
		}
	}
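As a side note on what "failing the request future" means in practice, the following standalone sketch uses only `java.util.concurrent.CompletableFuture`. The nested `SlotAllocationException` here is a hypothetical stand-in for Flink's class of the same name, and `rejectAndDescribe` is an invented helper. Nothing is thrown at the point of rejection: the exception only surfaces when someone waits on the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class RejectFutureSketch {
    // Hypothetical stand-in for Flink's SlotAllocationException (assumption, not the real class).
    static class SlotAllocationException extends Exception {
        SlotAllocationException(Throwable cause) {
            super(cause);
        }
    }

    // Fails a pending future the way rejectPendingSlotRequest does, then reads the root message back.
    static String rejectAndDescribe(String slotId) {
        CompletableFuture<String> request = new CompletableFuture<>();
        Exception cause = new Exception("The assigned slot " + slotId + " was removed.");
        request.completeExceptionally(new SlotAllocationException(cause));
        try {
            request.get();
            return "completed normally";
        } catch (ExecutionException e) {
            // get() wraps the failure: e.getCause() is the SlotAllocationException,
            // and its cause is the original exception carrying the message.
            return e.getCause().getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        System.out.println(rejectAndDescribe("tm1_0"));
    }
}
```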

	/**
	 * Notifies that an allocation failure has occurred.
	 *
	 * @param jobId to which the allocation belonged
	 * @param allocationId identifying the failed allocation
	 * @param cause of the allocation failure
	 */
	@Override
	public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
		validateRunsInMainThread();

		JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
		if (jobManagerRegistration != null) {
			jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
		}
	}

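Now the logic behind this exception is clear. Given a slot ID, if the corresponding slot exists, it is removed from slots and from freeSlots. The first check: if the slot is in the PENDING state, the pending slot request is rejected with an exception carrying this message, which triggers a new allocation attempt. The second check: if the slot already has an allocation ID, the JobManager is notified of the allocation failure with the same message. Nothing is thrown directly; in both cases the exception is handed on as the cause. Only the second branch creates a FlinkException, and that is the type in our log, so the removed slot had already been allocated.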
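The two branches of removeSlot described above can be modelled in a few lines. This is a simplified sketch, not Flink's implementation: `RemoveSlotSketch`, its `Slot` class, and the `events` list are hypothetical, and the real method works on `TaskManagerSlot`, `PendingSlotRequest` and `ResourceActions` instead of strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemoveSlotSketch {
    enum State { FREE, PENDING, ALLOCATED }

    // Hypothetical, simplified slot: a state plus an optional allocation id.
    static final class Slot {
        final State state;
        final String allocationId; // null unless a JobManager already holds the slot

        Slot(State state, String allocationId) {
            this.state = state;
            this.allocationId = allocationId;
        }
    }

    final Map<String, Slot> slots = new HashMap<>();
    final List<String> events = new ArrayList<>(); // records what the real code would trigger

    void removeSlot(String slotId) {
        Slot slot = slots.remove(slotId);
        if (slot == null) {
            events.add("no slot registered with id " + slotId);
            return;
        }
        if (slot.state == State.PENDING) {
            // Branch 1: the request was still in flight, so it is rejected and retried.
            events.add("reject pending request for " + slotId);
        }
        if (slot.allocationId != null) {
            // Branch 2: a job already owned the slot, so its JobManager is told about the failure.
            events.add("notify allocation failure " + slot.allocationId);
        }
    }

    public static void main(String[] args) {
        RemoveSlotSketch sm = new RemoveSlotSketch();
        sm.slots.put("tm1_0", new Slot(State.ALLOCATED, "alloc-1"));
        sm.slots.put("tm1_1", new Slot(State.PENDING, null));
        sm.slots.put("tm1_2", new Slot(State.FREE, null));
        sm.removeSlot("tm1_0");
        sm.removeSlot("tm1_1");
        sm.removeSlot("tm1_2");
        sm.events.forEach(System.out::println);
    }
}
```

In this model a free slot disappears silently, a pending slot only causes a rejection, and only an allocated slot leads to a failure notification for a running job.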
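Next, two more methods on the unregister path (of the two, only unregisterTaskManager actually appears in the stack trace):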

	public void unregisterTaskManagersAndReleaseResources() {
		Iterator<Map.Entry<InstanceID, TaskManagerRegistration>> taskManagerRegistrationIterator =
				taskManagerRegistrations.entrySet().iterator();

		while (taskManagerRegistrationIterator.hasNext()) {
			TaskManagerRegistration taskManagerRegistration =
					taskManagerRegistrationIterator.next().getValue();

			taskManagerRegistrationIterator.remove();

			internalUnregisterTaskManager(taskManagerRegistration);

			resourceActions.releaseResource(taskManagerRegistration.getInstanceId(), new FlinkException("Triggering of SlotManager#unregisterTaskManagersAndReleaseResources."));
		}
	}



	/**
	 * Unregisters the task manager identified by the given instance id and its associated slots
	 * from the slot manager.
	 *
	 * @param instanceId identifying the task manager to unregister
	 * @return True if there existed a registered task manager with the given instance id
	 */
	public boolean unregisterTaskManager(InstanceID instanceId) {
		checkInit();

		LOG.info("Unregister TaskManager {} from the SlotManager.", instanceId);

		TaskManagerRegistration taskManagerRegistration = taskManagerRegistrations.remove(instanceId);

		if (null != taskManagerRegistration) {
			internalUnregisterTaskManager(taskManagerRegistration);

			return true;
		} else {
			LOG.debug("There is no task manager registered with instance ID {}. Ignoring this message.", instanceId);

			return false;
		}
	}

These two don't reveal anything, so back to the error log:

at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
	at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1139)

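Reading on, it all turns into heartbeat code. Could the slot have been removed because of a heartbeat timeout?
The source code alone doesn't settle it, so let's keep digging through the logs: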

2020-08-14 08:31:15,949 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_1597019960092_0002_01_000612 because: Container [pid=27359,containerID=container_1597019960092_0002_01_000612] is running beyond physical memory limits. Current usage: 45.1 GB of 45 GB physical memory used; 72.3 GB of 94.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1597019960092_0002_01_000612 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 27373 27359 27359 27359 (java) 23081 15315 77535395840 11820542 /usr/java/jdk1.8.0_141-cloudera/bin/java -Xms31104m -Xmx31104m -XX:MaxDirectMemorySize=14976m -Dlog.file=/sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
|- 27359 27357 27359 27359 (bash) 0 0 118087680 377 /bin/bash -c /usr/java/jdk1.8.0_141-cloudera/bin/java -Xms31104m -Xmx31104m -XX:MaxDirectMemorySize=14976m -Dlog.file=/sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> /sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.out 2> /sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.err

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

2020-08-14 08:31:15,950 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Flat Map (2/6) (c8e8a3a1fe26571fdcdeeaf4e2a85973) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1597019960092_0002_01_000612_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:350)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.Ak...
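The YARN message in the log above says the container was killed for running beyond its physical memory limit (45.1 GB used of 45 GB). As a rough sanity check on those numbers, the sketch below adds up the two JVM flags from the process dump (-Xmx31104m and -XX:MaxDirectMemorySize=14976m) and converts the reported RSS page count to GiB. The class and method names are hypothetical. It assumes that YARN's "45 GB" means 45 × 1024 MiB and that the page size is 4 KiB; neither is stated in the log.

```java
import java.util.Locale;

public class ContainerMemoryCheck {
    // Sum of the heap limit and the direct-memory limit, both in MiB.
    static long jvmFlagsTotalMiB(long xmxMiB, long maxDirectMiB) {
        return xmxMiB + maxDirectMiB;
    }

    // Converts a resident-set size given in pages to GiB.
    static double rssGiB(long rssPages, long pageSizeBytes) {
        return rssPages * pageSizeBytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long flagsMiB = jvmFlagsTotalMiB(31104L, 14976L); // values taken from the process dump
        long containerMiB = 45L * 1024L;                  // assumption: "45 GB" = 45 * 1024 MiB
        System.out.println("heap + direct = " + flagsMiB + " MiB");
        System.out.println("container     = " + containerMiB + " MiB");
        System.out.println("headroom      = " + (containerMiB - flagsMiB) + " MiB");
        // Assumption: 4 KiB pages; 11820542 is RSSMEM_USAGE(PAGES) of the java process.
        System.out.printf(Locale.ROOT, "RSS           = %.1f GiB%n", rssGiB(11820542L, 4096L));
    }
}
```

Under these assumptions the heap and direct-memory limits alone add up to the full container size, which would leave no headroom for anything else the JVM process needs (metaspace, thread stacks, native allocations). That fits the YARN kill message, although the log by itself does not show which memory region actually grew.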
