Notes on a crashed Flink job
The exception:
org.apache.flink.util.FlinkException: The assigned slot container_e01_1589821551483_0145_01_000012_2 was removed.
Now let's look at the JobMaster log:
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1139)
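The exception message is clear enough: a slot was removed. My first guess was that the slot did not have enough resources to run several tasks, so let's follow the log into the source code.
(PS: I took a lot of detours along the way. If you're short on patience, skip straight to the bottom.)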
/**
 * Removes the given slot from the slot manager.
 *
 * @param slotId identifying the slot to remove
 */
private void removeSlot(SlotID slotId) {
    TaskManagerSlot slot = slots.remove(slotId);
    if (null != slot) {
        freeSlots.remove(slotId);
        if (slot.getState() == TaskManagerSlot.State.PENDING) {
            // reject the pending slot request --> triggering a new allocation attempt
            rejectPendingSlotRequest(
                slot.getAssignedSlotRequest(),
                new Exception("The assigned slot " + slot.getSlotId() + " was removed."));
        }
        AllocationID oldAllocationId = slot.getAllocationId();
        if (oldAllocationId != null) {
            fulfilledSlotRequests.remove(oldAllocationId);
            resourceActions.notifyAllocationFailure(
                slot.getJobId(),
                oldAllocationId,
                new FlinkException("The assigned slot " + slot.getSlotId() + " was removed."));
        }
    } else {
        LOG.debug("There was no slot registered with slot id {}.", slotId);
    }
}
First, let's figure out what slots, freeSlots and fulfilledSlotRequests are:
private final HashMap<SlotID, TaskManagerSlot> slots;
private final LinkedHashMap<SlotID, TaskManagerSlot> freeSlots;
private final HashMap<AllocationID, SlotID> fulfilledSlotRequests;
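slots holds every registered slot (if I'm reading this right), and freeSlots holds the idle ones. Judging by its type, fulfilledSlotRequests maps an AllocationID to the SlotID it was given.
Next, what is State?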
/**
 * State of the {@link TaskManagerSlot}.
 */
public enum State {
    FREE,
    PENDING,
    ALLOCATED
}
It's an enum with three values: free, pending allocation, and allocated.
And what is an AllocationID?
/**
 * Unique identifier for a physical slot allocated by a JobManager via the ResourceManager
 * from a TaskManager. The ID is assigned once the JobManager (or its SlotPool) first
 * requests the slot and is constant across retries.
 *
 * <p>This ID is used by the TaskManager and ResourceManager to track and synchronize which
 * slots are allocated to which JobManager and which are free.
 *
 * <p>In contrast to this AllocationID, the {@link org.apache.flink.runtime.jobmaster.SlotRequestId}
 * is used when a task requests a logical slot from the SlotPool. Multiple logical slot requests
 * can map to one physical slot request (due to slot sharing).
 */
public class AllocationID extends AbstractID {

    private static final long serialVersionUID = 1L;

    /**
     * Constructs a new random AllocationID.
     */
    public AllocationID() {
        super();
    }

    /**
     * Constructs a new AllocationID with the given parts.
     *
     * @param lowerPart the lower bytes of the ID
     * @param upperPart the higher bytes of the ID
     */
    public AllocationID(long lowerPart, long upperPart) {
        super(lowerPart, upperPart);
    }

    @Override
    public String toString() {
        return "AllocationID{" + super.toString() + '}';
    }
}
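In short, it is the unique identifier of a physical slot that a JobManager obtains from a TaskManager through the ResourceManager. The ID is assigned the first time the JobManager (or its SlotPool) requests the slot and stays the same across retries. The TaskManager and ResourceManager use it to track and synchronize which slots are allocated to which JobManager and which slots are free.
Now two more methods: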
/**
 * Rejects the pending slot request by failing the request future with a
 * {@link SlotAllocationException}.
 *
 * @param pendingSlotRequest to reject
 * @param cause of the rejection
 */
private void rejectPendingSlotRequest(PendingSlotRequest pendingSlotRequest, Exception cause) {
    CompletableFuture<Acknowledge> request = pendingSlotRequest.getRequestFuture();
    if (null != request) {
        request.completeExceptionally(new SlotAllocationException(cause));
    } else {
        LOG.debug("Cannot reject pending slot request {}, since no request has been sent.", pendingSlotRequest.getAllocationId());
    }
}
/**
 * Notifies that an allocation failure has occurred.
 *
 * @param jobId to which the allocation belonged
 * @param allocationId identifying the failed allocation
 * @param cause of the allocation failure
 */
@Override
public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
    validateRunsInMainThread();
    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
    if (jobManagerRegistration != null) {
        jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
    }
}
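Now the logic behind this exception is clear. Given a slot ID, if the slot exists it is removed from slots and from freeSlots. Then comes the first check: if the slot is in the PENDING state, the pending slot request is rejected by failing its future with this message. The second check: if the slot already has an allocation ID, the owning JobManager is notified of the allocation failure with a FlinkException carrying the same message. So the exception is not thrown at the point of failure; it is a notification that the slot has gone away.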
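To make the two checks in removeSlot easier to see, here is a minimal standalone sketch. It is not Flink code: `SlotRemovalSketch`, `Slot` and `notificationsFor` are hypothetical names, the slot is reduced to the three fields removeSlot reads, and the returned strings only stand in for the calls to rejectPendingSlotRequest and notifyAllocationFailure.

```java
import java.util.ArrayList;
import java.util.List;

public class SlotRemovalSketch {

    enum State { FREE, PENDING, ALLOCATED }

    /** Hypothetical stand-in for TaskManagerSlot: only what removeSlot inspects. */
    static final class Slot {
        final String slotId;
        final State state;
        final String allocationId; // null when no allocation is recorded on the slot

        Slot(String slotId, State state, String allocationId) {
            this.slotId = slotId;
            this.state = state;
            this.allocationId = allocationId;
        }
    }

    /** Lists who would be told about the removal, following the two if-blocks of removeSlot. */
    static List<String> notificationsFor(Slot slot) {
        List<String> out = new ArrayList<>();
        String message = "The assigned slot " + slot.slotId + " was removed.";
        if (slot.state == State.PENDING) {
            // stands in for rejectPendingSlotRequest(...)
            out.add("fail pending request future: " + message);
        }
        if (slot.allocationId != null) {
            // stands in for resourceActions.notifyAllocationFailure(...)
            out.add("notify JobManager of allocation failure: " + message);
        }
        return out;
    }

    public static void main(String[] args) {
        Slot[] samples = {
            new Slot("slot_0", State.FREE, null),
            new Slot("slot_1", State.PENDING, null),
            new Slot("slot_2", State.ALLOCATED, "alloc-1"),
        };
        for (Slot s : samples) {
            System.out.println(s.slotId + " (" + s.state + ") -> " + notificationsFor(s));
        }
    }
}
```

A free slot produces no notification at all, which is why the message only shows up for jobs that were waiting for the slot or already running on it.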
Next, two more methods on the path shown in the stack trace:
public void unregisterTaskManagersAndReleaseResources() {
    Iterator<Map.Entry<InstanceID, TaskManagerRegistration>> taskManagerRegistrationIterator =
        taskManagerRegistrations.entrySet().iterator();
    while (taskManagerRegistrationIterator.hasNext()) {
        TaskManagerRegistration taskManagerRegistration =
            taskManagerRegistrationIterator.next().getValue();
        taskManagerRegistrationIterator.remove();
        internalUnregisterTaskManager(taskManagerRegistration);
        resourceActions.releaseResource(
            taskManagerRegistration.getInstanceId(),
            new FlinkException("Triggering of SlotManager#unregisterTaskManagersAndReleaseResources."));
    }
}

/**
 * Unregisters the task manager identified by the given instance id and its associated slots
 * from the slot manager.
 *
 * @param instanceId identifying the task manager to unregister
 * @return True if there existed a registered task manager with the given instance id
 */
public boolean unregisterTaskManager(InstanceID instanceId) {
    checkInit();
    LOG.info("Unregister TaskManager {} from the SlotManager.", instanceId);
    TaskManagerRegistration taskManagerRegistration = taskManagerRegistrations.remove(instanceId);
    if (null != taskManagerRegistration) {
        internalUnregisterTaskManager(taskManagerRegistration);
        return true;
    } else {
        LOG.debug("There is no task manager registered with instance ID {}. Ignoring this message.", instanceId);
        return false;
    }
}
These two don't reveal much either, so back to the stack trace:
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1139)
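The further I read, the more of the code turns out to be about heartbeats. Could the slot have been removed because of a heartbeat timeout?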
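The heartbeat suspicion comes from the frame printed directly below closeTaskManagerConnection: in a Java stack trace each frame is called by the frame on the next line, so that line names whatever asked for the connection to be closed. A small helper can make the lookup explicit. This is only a sketch; `TraceTrigger`, `callerOf` and `stripAt` are hypothetical names, and the frames are copied from the trace above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class TraceTrigger {

    /** Returns the frame printed right after the first frame that calls {@code method}, i.e. its caller. */
    static Optional<String> callerOf(List<String> frames, String method) {
        for (int i = 0; i < frames.size() - 1; i++) {
            if (frames.get(i).contains(method + "(")) {
                return Optional.of(stripAt(frames.get(i + 1)));
            }
        }
        return Optional.empty();
    }

    /** Drops the leading "at " that stack-trace lines carry. */
    static String stripAt(String frame) {
        String trimmed = frame.trim();
        return trimmed.startsWith("at ") ? trimmed.substring(3) : trimmed;
    }

    public static void main(String[] args) {
        List<String> frames = Arrays.asList(
            "at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)",
            "at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)",
            "at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1139)");
        System.out.println(
            callerOf(frames, "closeTaskManagerConnection").orElse("<no caller frame>"));
    }
}
```

For this trace the helper prints the TaskManagerHeartbeatListener frame. Running the same lookup on any other occurrence of the exception shows whether the same trigger was involved.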
The source code alone doesn't settle it, so let's keep digging through the logs:
2020-08-14 08:31:15,949 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_1597019960092_0002_01_000612 because: Container [pid=27359,containerID=container_1597019960092_0002_01_000612] is running beyond physical memory limits. Current usage: 45.1 GB of 45 GB physical memory used; 72.3 GB of 94.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1597019960092_0002_01_000612 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 27373 27359 27359 27359 (java) 23081 15315 77535395840 11820542 /usr/java/jdk1.8.0_141-cloudera/bin/java -Xms31104m -Xmx31104m -XX:MaxDirectMemorySize=14976m -Dlog.file=/sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
|- 27359 27357 27359 27359 (bash) 0 0 118087680 377 /bin/bash -c /usr/java/jdk1.8.0_141-cloudera/bin/java -Xms31104m -Xmx31104m -XX:MaxDirectMemorySize=14976m -Dlog.file=/sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> /sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.out 2> /sga/yarn/container-logs/application_1597019960092_0002/container_1597019960092_0002_01_000612/taskmanager.err
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
2020-08-14 08:31:15,950 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source -> Flat Map (2/6) (c8e8a3a1fe26571fdcdeeaf4e2a85973) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot container_1597019960092_0002_01_000612_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:893)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:863)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:1058)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:385)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:825)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:350)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.Ak