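Flink on YARN startup problem
It has been two months since my last post. This one covers a problem most of us run into when starting Flink day to day. It still cost me about two days, on and off, and once the fog cleared the underlying cause turned out to be simple. When we solve problems we sometimes stay on the surface and miss the part hidden below the waterline, so there is always more to learn. Let's cut into the problem, starting from the symptoms.
1. Flink startup symptoms
We still run Flink on YARN. The problem is that after a Flink job starts on YARN, the vcores, containers, TaskManager count and slot count in use differ greatly from what we configured, and after the job has run for a while they return to normal. What causes this? Let's first look at a real case.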
![](https://i-blog.csdnimg.cn/blog_migrate/b0937ac3eba6560293463b2c1be111e7.png)
![](https://i-blog.csdnimg.cn/blog_migrate/57580ea397bb0c6f477edb3bfc50b441.png)
![](https://i-blog.csdnimg.cn/blog_migrate/7c2eff7bfc0dcebb31d965663076f53c.png)
![](https://i-blog.csdnimg.cn/blog_migrate/8686a686f2148f3be5084f2793ae41bb.png)
![](https://i-blog.csdnimg.cn/blog_migrate/8d66bda16d2407c6476a70b30b8520ab.png)
![](https://i-blog.csdnimg.cn/blog_migrate/dc46ca3a342e584543ea8d74fbcfc329.png)
![](https://i-blog.csdnimg.cn/blog_migrate/26cab64e29a4b4cba206c554f5d8c50d.png)
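1.1 Problem description
In the actual job we configured 2 containers with 2 slots each, 4 slots in total. On the YARN side, our sub-queue allows at most 5 containers and 9 vcores (we have not seen any memory problem so far, so memory is not covered here). We tested two cases: other jobs running in the sub-queue, and only our job running. Each is described below.
1.1.1 With other jobs running
Each of our 4 jobs takes 3 containers and close to 5 vcores, for a total of 12 containers and 19 vcores (I admit I do not fully follow this estimate). For the queue details we need to look at the YARN queues: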
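- total resources: 96 vcores;
- the non-root queue is allocated 36 vcores;
- the real-time queue is allocated 10 vcores;
- our queue is currently allocated 9 vcores;
- so the small queue under the real-time queue is effectively full.
At this point the YARN UI shows our job using exactly 3 containers and 5 vcores, no more and no less. Nothing extra there. Let's set the doubt aside and look at the case where only this job is running.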
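1.1.2 With only one job
With only the current job running, things go smoothly. Refreshing with F5 repeatedly, the (containers, vcores) pair goes through:
(1,1) -> (2,3) -> (3,5) -> (4,7) -> (5,9) -> (4,7) -> (3,5). The change is linear. At its peak the job holds 5 containers and 9 vcores. Excluding the 1 container and 1 vcore taken by the JobManager and ApplicationMaster, that leaves 4 containers and 8 vcores. Normally we only need 2 containers and 4 vcores of these; the extra 2 containers and 4 vcores are released afterwards. That is the symptom. As for the root cause, a few possibilities come to mind: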
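- the remaining resources of our YARN queue are fully taken, or even over-taken, by this job;
- Flink's JobManager, when allocating TaskManagers, requests extra resources according to some mechanism.
Which is it? We will find out by walking through the source code in section 2. First, let's review our launch command.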
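The progression observed above fits a simple pattern: one container with one vcore for the JobManager/ApplicationMaster, plus two vcores for every additional container. A minimal sketch of that arithmetic follows. The class and method names are hypothetical, and the figure of 2 vcores per TaskManager container is an assumption taken from `-ys 2` and the observed numbers.

```java
// Sketch only: reproduces the (containers, vcores) pairs seen in the YARN UI.
// Assumption: the ApplicationMaster/JobManager container uses 1 vcore and
// every TaskManager container uses `slotsPerTaskManager` vcores.
final class QueueUsageModel {

    /** Vcores held by the application when `containers` containers are allocated. */
    static int vcoresFor(int containers, int slotsPerTaskManager) {
        if (containers < 1) {
            throw new IllegalArgumentException("at least the AM container is required");
        }
        int taskManagers = containers - 1; // everything except the AM container
        return 1 + taskManagers * slotsPerTaskManager;
    }

    public static void main(String[] args) {
        int slotsPerTaskManager = 2; // -ys 2
        // Prints (1,1) (2,3) (3,5) (4,7) (5,9), one pair per line
        for (int containers = 1; containers <= 5; containers++) {
            System.out.println("(" + containers + "," + vcoresFor(containers, slotsPerTaskManager) + ")");
        }
    }
}
```

With 2 TaskManagers the model gives (3,5), which is the steady state the job settles into; (5,9) corresponds to 4 TaskManager containers being held at the peak.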
1.2 Flink launch command
The launch command is as follows:
/streaming/flink-1.6.0/bin/flink run -s hdfs://hdfs/user/96490559/savepoint-fa82 -C
file:/mnt/streaming/live/jars/mysql-connector-java-5.1.40.jar -n -m yarn-cluster -c com.scheduler
-yn 2 -ys 2 -p 4 -ytm 1024 -yqu KSC_27CF -ynm test_qa --yarnship /streaming/live/jars
-d /streaming/live/streaming-dist-flink.jar
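As a rough cross-check of these flags, the number of TaskManagers a job needs can be estimated from the parallelism (`-p`) and the slots per TaskManager (`-ys`). The sketch below is not Flink code: `taskManagersNeeded` is a hypothetical helper, and the ceiling-division rule assumes the job needs exactly as many slots as its parallelism.

```java
// Sketch only: estimates how many TaskManagers a job needs.
// Assumption: required slots == job parallelism (all operators share slots).
final class LaunchFlagsMath {

    /** Ceiling of parallelism / slotsPerTaskManager. */
    static int taskManagersNeeded(int parallelism, int slotsPerTaskManager) {
        if (parallelism < 1 || slotsPerTaskManager < 1) {
            throw new IllegalArgumentException("parallelism and slots must be positive");
        }
        return (parallelism + slotsPerTaskManager - 1) / slotsPerTaskManager;
    }

    public static void main(String[] args) {
        // -p 4 -ys 2 from the launch command above
        System.out.println(taskManagersNeeded(4, 2)); // 2, which matches -yn 2
    }
}
```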
2. Source code analysis
First, let's see how the Flink on YARN client sets the TaskManager and slot counts:
2.1 Setting the relevant parameters
FlinkYarnSessionCli
private ClusterSpecification createClusterSpecification(Configuration configuration, CommandLine cmd) {
if (cmd.hasOption(container.getOpt())) { // number of containers is required option!
LOG.info("The argument {} is deprecated in will be ignored.", container.getOpt());
}
// TODO: The number of task manager should be deprecated soon
final int numberTaskManagers;
if (cmd.hasOption(container.getOpt())) {
// the TaskManager count is taken from the container count given on the command line
numberTaskManagers = Integer.valueOf(cmd.getOptionValue(container.getOpt()));
} else {
numberTaskManagers = 1;
}
// JobManager Memory
final int jobManagerMemoryMB = ConfigurationUtils.getJobManagerHeapMemory(configuration).getMebiBytes();
// Task Managers memory
final int taskManagerMemoryMB = ConfigurationUtils.getTaskManagerHeapMemory(configuration).getMebiBytes();
// slots per TaskManager, read from the configuration (NUM_TASK_SLOTS)
int slotsPerTaskManager = configuration.getInteger(TaskManagerOptions.NUM_TASK_SLOTS);
return new ClusterSpecification.ClusterSpecificationBuilder()
.setMasterMemoryMB(jobManagerMemoryMB)
.setTaskManagerMemoryMB(taskManagerMemoryMB)
.setNumberTaskManagers(numberTaskManagers)
.setSlotsPerTaskManager(slotsPerTaskManager)
.createClusterSpecification();
}
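The defaulting logic in this method can be restated without Flink's classes. The following is a simplified model, not the real CLI: the two maps and the class name are hypothetical stand-ins for `CommandLine` and `Configuration`, and only the defaults visible in the excerpts (1 TaskManager, 1 slot) are reproduced.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: mirrors the fallbacks used when building the cluster specification.
final class ClusterSpecSketch {

    /** Container count from the command line, or 1 when the option is absent. */
    static int numberTaskManagers(Map<String, String> cliOptions) {
        String yn = cliOptions.get("yn");
        return yn != null ? Integer.parseInt(yn) : 1;
    }

    /** Slots per TaskManager from the configuration, default 1. */
    static int slotsPerTaskManager(Map<String, String> configuration) {
        return Integer.parseInt(
                configuration.getOrDefault("taskmanager.numberOfTaskSlots", "1"));
    }

    public static void main(String[] args) {
        Map<String, String> cli = new HashMap<>();
        cli.put("yn", "2");
        Map<String, String> conf = new HashMap<>();
        conf.put("taskmanager.numberOfTaskSlots", "2");
        // 2 TaskManagers x 2 slots = 4 slots in total
        System.out.println(numberTaskManagers(cli) * slotsPerTaskManager(conf));
    }
}
```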
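Next, we look at the YARN cluster descriptor class for the details of resource acquisition, covering vcores and memory. The points to focus on are how vcores and containers are obtained and set.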
AbstractYarnClusterDescriptor
public ApplicationReport startAppMaster(
Configuration configuration,
String applicationName,
String yarnClusterEntrypoint,
JobGraph jobGraph,
YarnClient yarnClient,
YarnClientApplication yarnApplication,
ClusterSpecification clusterSpecification) throws Exception {
// largest vcore capacity offered by any single RUNNING node
final int numYarnMaxVcores;
try {
numYarnMaxVcores = yarnClient.getNodeReports(NodeState.RUNNING)
.stream()
.mapToInt(report -> report.getCapability().getVirtualCores())
.max()
.orElse(0);
} catch (Exception e) {
throw new YarnDeploymentException("Couldn't get cluster description, please check on the YarnConfiguration", e);
}
int configuredVcores = flinkConfiguration.getInteger(YarnConfigOptions.VCORES, clusterSpecification.getSlotsPerTaskManager());
// the configured vcores must not exceed that maximum
if (configuredVcores > numYarnMaxVcores) {
throw new IllegalConfigurationException(
String.format("The number of requested virtual cores per node %d" +
" exceeds the maximum number of virtual cores %d available in the Yarn Cluster." +
" Please note that the number of virtual cores is set to the number of task slots by default" +
" unless configured in the Flink config with '%s.'",
configuredVcores, numYarnMaxVcores, YarnConfigOptions.VCORES.key()));
}
}
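The vcore check above can be tried out in isolation. This is a minimal sketch under stated assumptions: the node capacities are made-up sample values, the class and method names are hypothetical, and a plain `IllegalArgumentException` stands in for Flink's `IllegalConfigurationException`.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: the per-container vcore request defaults to the slot count
// and is rejected when it exceeds the largest node's capacity.
final class VcoreCheckSketch {

    /** Largest vcore capacity among the running nodes, 0 if there are none. */
    static int maxNodeVcores(List<Integer> vcoresPerRunningNode) {
        return vcoresPerRunningNode.stream().mapToInt(Integer::intValue).max().orElse(0);
    }

    /** Explicitly configured vcores, or the slot count when nothing is configured. */
    static int requestedVcores(Integer configuredVcores, int slotsPerTaskManager) {
        return configuredVcores != null ? configuredVcores : slotsPerTaskManager;
    }

    static void check(int requested, List<Integer> vcoresPerRunningNode) {
        int max = maxNodeVcores(vcoresPerRunningNode);
        if (requested > max) {
            throw new IllegalArgumentException("requested " + requested
                    + " vcores per container, but the largest node offers " + max);
        }
    }

    public static void main(String[] args) {
        List<Integer> nodes = Arrays.asList(8, 16, 32); // hypothetical cluster
        check(requestedVcores(null, 2), nodes);         // 2 vcores: accepted
        System.out.println("vcore request accepted");
    }
}
```

Note that the check compares against a single node's capacity, not against the queue limit, so it says nothing about whether the queue's 9 vcores are enough.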
ApplicationReport report = startAppMaster(
flinkConfiguration,
applicationName,
yarnClusterEntrypoint,
jobGraph,
yarnClient,
yarnApplication,
validClusterSpecification);
// Create application via yarnClient
final YarnClientApplication yarnApplication = yarnClient.createApplication();
final GetNewApplicationResponse appResponse = yarnApplication.getNewApplicationResponse();
Resource maxRes = appResponse.getMaximumResourceCapability();
final int yarnMinAllocationMB = yarnConfiguration.getInt(YarnConfiguration.RM_SCHEDULER_MINIMUM_ALLOCATION_MB, 0);
freeClusterMem = getCurrentFreeClusterResources(yarnClient);
final ClusterSpecification validClusterSpecification;
try {
validClusterSpecification = validateClusterResources(
clusterSpecification,
yarnMinAllocationMB,
maxRes,
freeClusterMem);
}
// Flink on YARN settings passed to the ApplicationMaster through its environment
appMasterEnv.put(YarnConfigKeys.ENV_TM_COUNT, String.valueOf(clusterSpecification.getNumberTaskManagers())); // number of TaskManagers
appMasterEnv.put(YarnConfigKeys.ENV_TM_MEMORY, String.valueOf(clusterSpecification.getTaskManagerMemoryMB()));
appMasterEnv.put(YarnConfigKeys.FLINK_JAR_PATH, remotePathJar.toString());
appMasterEnv.put(YarnConfigKeys.ENV_APP_ID, appId.toString());
appMasterEnv.put(YarnConfigKeys.ENV_CLIENT_HOME_DIR, homeDir.toString());
appMasterEnv.put(YarnConfigKeys.ENV_CLIENT_SHIP_FILES, envShipFileList.toString());
appMasterEnv.put(YarnConfigKeys.ENV_SLOTS, String.valueOf(clusterSpecification.getSlotsPerTaskManager())); // slots per TaskManager
appMasterEnv.put(YarnConfigKeys.ENV_DETACHED, String.valueOf(detached));
appMasterEnv.put(YarnConfigKeys.ENV_ZOOKEEPER_NAMESPACE, getZookeeperNamespace());
appMasterEnv.put(YarnConfigKeys.FLINK_YARN_FILES, yarnFilesDir.toUri().toString());
public ApplicationReport startAppMaster(
Configuration configuration,
String applicationName,
String yarnClusterEntrypoint,
JobGraph jobGraph,
YarnClient yarnClient,
YarnClientApplication yarnApplication,
ClusterSpecification clusterSpecification) throws Exception {
// ------------------ Initialize the file systems -------------------------
// initialize file system
// Copy the application master jar to the filesystem
// Create a local resource to point to the destination jar path
// hard coded check for the GoogleHDFS client because its not overriding the getScheme() method.
......
// Setup jar for ApplicationMaster
Path remotePathJar = setupSingleLocalResource(
"flink.jar",
fs,
appId,
flinkJarPath,
localResources,
homeDir,
"");
// TaskManager parameters: slots per TaskManager and TaskManager heap memory
configuration.setInteger(
TaskManagerOptions.NUM_TASK_SLOTS,
clusterSpecification.getSlotsPerTaskManager()
);
configuration.setString(
TaskManagerOptions.TASK_MANAGER_HEAP_MEMORY,
clusterSpecification.getTaskManagerMemoryMB() + "m");
......
}
TaskManagerOptions
@Documentation.CommonOption(position = Documentation.CommonOption.POSITION_PARALLELISM_SLOTS)
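// Number of slots each TaskManager offers, i.e. how many parallel operator instances it can run; default 1. We set 2 slots per TaskManager, and since vcores default to the slot count, each TaskManager also gets 2 vcores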
public static final ConfigOption<Integer> NUM_TASK_SLOTS =
key("taskmanager.numberOfTaskSlots")
.defaultValue(1)
.withDescription("The number of parallel operator or user function instances that a single TaskManager can" +
" run. If this value is larger than 1, a single TaskManager takes multiple instances of a function or" +
" operator. That way, the TaskManager can utilize multiple CPU cores, but at the same time, the" +
" available memory is divided between the different operator or function instances. This value" +
" is typically proportional to the number of physical CPU cores that the TaskManager's machine has" +
" (e.g., equal to the number of cores, or half the number of cores).");
// TaskManager heap size, default 1024m
/**
* JVM heap size for the TaskManagers with memory size.
*/
@Documentation.CommonOption(position = Documentation.CommonOption.POSITION_MEMORY)
public static final ConfigOption<String> TASK_MANAGER_HEAP_MEMORY =
key("taskmanager.heap.size")
.defaultValue("1024m")
.withDescription("JVM heap size for the TaskManagers, which are the parallel workers of" +
" the system. On YARN setups, this value is automatically configured to the size of the TaskManager's" +
" YARN container, minus a certain tolerance value.");
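While a Flink job runs, the JobManager (the JobMaster class) is responsible for executing a single job graph. It has methods for requesting and releasing resources for that graph; let's look at them in detail.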
JobMaster
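// the registered-TaskManager map is created with an initial capacity of 4 (a HashMap sizing hint, not a TaskManager count)
this.registeredTaskManagers = new HashMap<>(4);
// a TaskManager offers its slots to the JobMaster, which forwards them to the SlotPool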
@Override
public CompletableFuture<Collection<SlotOffer>> offerSlots(
final ResourceID taskManagerId,
final Collection<SlotOffer> slots,
final Time timeout) {
Tuple2<TaskManagerLocation, TaskExecutorGateway> taskManager = registeredTaskManagers.get(taskManagerId);
if (taskManager == null) {
return FutureUtils.completedExceptionally(new Exception("Unknown TaskManager " + taskManagerId));
}
final TaskManagerLocation taskManagerLocation = taskManager.f0;
final TaskExecutorGateway taskExecutorGateway = taskManager.f1;
final RpcTaskManagerGateway rpcTaskManagerGateway = new RpcTaskManagerGateway(taskExecutorGateway, getFencingToken());
return slotPoolGateway.offerSlots(
taskManagerLocation,
rpcTaskManagerGateway,
slots);
}
@Override
public CompletableFuture<SerializedInputSplit> requestNextInputSplit(
final JobVertexID vertexID,
final ExecutionAttemptID executionAttempt) {
// look up the Execution for this attempt in the execution graph
final Execution execution = executionGraph.getRegisteredExecutions().get(executionAttempt);
if (execution == null) {
......
}
final ExecutionJobVertex vertex = executionGraph.getJobVertex(vertexID);
if (vertex == null) {
......
}
final InputSplitAssigner splitAssigner = vertex.getSplitAssigner();
if (splitAssigner == null) {
......
}
// get the assigned slot and the subtask index from the Execution
final LogicalSlot slot = execution.getAssignedResource();
final int taskId = execution.getVertex().getParallelSubtaskIndex();
final String host = slot != null ? slot.getTaskManagerLocation().getHostname() : null;
final InputSplit nextInputSplit = splitAssigner.getNextInputSplit(host, taskId);
try {
// serialize the next input split
final byte[] serializedInputSplit = InstantiationUtil.serializeObject(nextInputSplit);
return CompletableFuture.completedFuture(new SerializedInputSplit(serializedInputSplit));
} catch (Exception ex) {
......
return FutureUtils.completedExceptionally(reason);
}
}
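FlinkResourceManager
// This method synchronizes with the resource framework master: it re-checks the set of available and pending worker containers and allocates containers where needed.
// It does not release workers automatically, because this resource manager cannot see which workers can be released.
// Instead, the JobManager must explicitly release individual workers.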
private void checkWorkersPool() {
int numWorkersPending = getNumWorkerRequestsPending();
int numWorkersPendingRegistration = getNumWorkersPendingRegistration();
// sanity checks
Preconditions.checkState(numWorkersPending >= 0,
"Number of pending workers should never be below 0.");
Preconditions.checkState(numWorkersPendingRegistration >= 0,
"Number of pending workers pending registration should never be below 0.");
// see how many workers we want, and whether we have enough
int allAvailableAndPending = startedWorkers.size() +
numWorkersPending + numWorkersPendingRegistration;
// TaskManagers still to request = designated pool size - (started + pending requests + pending registration)
int missing = designatedPoolSize - allAvailableAndPending;
if (missing > 0) {
requestNewWorkers(missing);
}
}
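The bookkeeping in `checkWorkersPool` boils down to one subtraction. Here it is as a standalone sketch. The class name is hypothetical, and unlike the original, which only acts when `missing > 0`, this helper clamps the result at 0 so it can be read directly as "how many to request".

```java
// Sketch only: how many additional workers the resource manager would request.
final class WorkerPoolSketch {

    static int missingWorkers(int designatedPoolSize,
                              int started,
                              int pendingRequests,
                              int pendingRegistration) {
        if (pendingRequests < 0 || pendingRegistration < 0) {
            throw new IllegalStateException("pending counts should never be below 0");
        }
        int allAvailableAndPending = started + pendingRequests + pendingRegistration;
        // Nothing is requested when the pool is already covered or over-covered.
        return Math.max(0, designatedPoolSize - allAvailableAndPending);
    }

    public static void main(String[] args) {
        System.out.println(missingWorkers(2, 0, 0, 0)); // 2: nothing started yet
        System.out.println(missingWorkers(2, 1, 1, 0)); // 0: one running, one requested
    }
}
```

Because pending requests count toward the total, a second call made before YARN answers the first requests nothing more in this model.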
YarnFlinkResourceManager
@Override
protected void requestNewWorkers(int numWorkers) {
final long mem = taskManagerParameters.taskManagerTotalMemoryMB();
final int containerMemorySizeMB;
if (mem <= Integer.MAX_VALUE) {
containerMemorySizeMB = (int) mem;
} else {
containerMemorySizeMB = Integer.MAX_VALUE;
LOG.error("Decreasing container size from {} MB to {} MB (integer value overflow)",
mem, containerMemorySizeMB);
}
// for each missing TaskManager, add a container request through the ResourceManager client
for (int i = 0; i < numWorkers; i++) {
numPendingContainerRequests++;
LOG.info("Requesting new TaskManager container with {} megabytes memory. Pending requests: {}",
containerMemorySizeMB, numPendingContainerRequests);
// Priority for worker containers - priorities are intra-application
Priority priority = Priority.newInstance(0);
// Resource requirements for worker containers
int taskManagerSlots = taskManagerParameters.numSlots();
int vcores = config.getInteger(YarnConfigOptions.VCORES, Math.max(taskManagerSlots, 1));
Resource capability = Resource.newInstance(containerMemorySizeMB, vcores);
resourceManagerClient.addContainerRequest(
new AMRMClient.ContainerRequest(capability, null, null, priority));
}
// make sure we transmit the request fast and receive fast news of granted allocations
resourceManagerClient.setHeartbeatInterval(FAST_YARN_HEARTBEAT_INTERVAL_MS);
}
// ------------------------------------------------------------------------
//
// ------------------------------------------------------------------------
private void containersAllocated(List<Container> containers) {
// the number of TaskManagers we need and the number already registered
final int numRequired = getDesignatedWorkerPoolSize();
final int numRegistered = getNumberOfStartedTaskManagers();
for (Container container : containers) {
numPendingContainerRequests = Math.max(0, numPendingContainerRequests - 1);
// decide whether to launch a TaskManager in this container or to return it
if (numRegistered + containersInLaunch.size() < numRequired) {
// launch the TaskManager
final YarnContainerInLaunch containerInLaunch = new YarnContainerInLaunch(container);
final ResourceID resourceID = containerInLaunch.getResourceID();
containersInLaunch.put(resourceID, containerInLaunch);
......
try {
// set a special environment variable to uniquely identify the container
taskManagerLaunchContext.getEnvironment()
.put(ENV_FLINK_CONTAINER_ID, resourceID.getResourceIdString());
nodeManagerClient.startContainer(container, taskManagerLaunchContext);
}
catch (Throwable t) {
// failed to launch the container
containersInLaunch.remove(resourceID);
// return container, a new one will be requested eventually
......
containersBeingReturned.put(container.getId(), container);
resourceManagerClient.releaseAssignedContainer(container.getId());
}
} else {
// return the surplus container and release its resources
containersBeingReturned.put(container.getId(), container);
resourceManagerClient.releaseAssignedContainer(container.getId());
}
}
updateProgress();
// if we are not waiting for any more containers, we can fall back to the regular heartbeat
if (numPendingContainerRequests <= 0) {
resourceManagerClient.setHeartbeatInterval(yarnHeartbeatIntervalMillis);
}
// make sure the worker/container status is checked at least once more, in case some containers did not come up properly
triggerCheckWorkers();
}
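The branch in `containersAllocated` is what turns extra containers into the temporary peak seen in the YARN UI: a container that arrives after the required count is covered is handed straight back. The sketch below models only that decision, with container IDs as plain strings and a hypothetical class name. Why YARN grants more containers than needed is not shown by this excerpt and is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: launch containers until the required TaskManager count is
// covered, and return every container that arrives after that.
final class ContainerAllocationSketch {
    private final int numRequired;
    private final int numRegistered;
    private final List<String> inLaunch = new ArrayList<>();
    private final List<String> returned = new ArrayList<>();

    ContainerAllocationSketch(int numRequired, int numRegistered) {
        this.numRequired = numRequired;
        this.numRegistered = numRegistered;
    }

    void containersAllocated(List<String> containerIds) {
        for (String id : containerIds) {
            if (numRegistered + inLaunch.size() < numRequired) {
                inLaunch.add(id);   // a TaskManager would be started here
            } else {
                returned.add(id);   // surplus: released back to YARN
            }
        }
    }

    List<String> inLaunch() { return inLaunch; }
    List<String> returned() { return returned; }

    public static void main(String[] args) {
        // 2 TaskManagers required (-yn 2), but 4 containers are granted
        ContainerAllocationSketch rm = new ContainerAllocationSketch(2, 0);
        rm.containersAllocated(Arrays.asList("c1", "c2", "c3", "c4"));
        System.out.println("launched: " + rm.inLaunch()); // launched: [c1, c2]
        System.out.println("returned: " + rm.returned()); // returned: [c3, c4]
    }
}
```

In this model the application briefly holds all four containers before two are released, which is consistent with the (5,9) -> (3,5) drop described in section 1.1.2.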
2.2 Releasing surplus resources
JobMaster
// release a TaskManager that has no slots left at this JobMaster
private CompletableFuture<Acknowledge> releaseEmptyTaskManager(ResourceID resourceId) {
return disconnectTaskManager(resourceId, new FlinkException(String.format("No more slots registered at JobMaster %s.", resourceId)));
}
@Override
public CompletableFuture<Acknowledge> disconnectTaskManager(final ResourceID resourceID, final Exception cause) {
log.debug("Disconnect TaskExecutor {} because: {}", resourceID, cause.getMessage());
taskManagerHeartbeatManager.unmonitorTarget(resourceID);
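// 1. the SlotPool first releases this TaskManager's slots, looked up by resource ID
CompletableFuture<Acknowledge> releaseFuture = slotPoolGateway.releaseTaskManager(resourceID, cause);
// 2. then the TaskManager is removed from the registered TaskManagers and disconnected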
Tuple2<TaskManagerLocation, TaskExecutorGateway> taskManagerConnection = registeredTaskManagers.remove(resourceID);
if (taskManagerConnection != null) {
taskManagerConnection.f1.disconnectJobManager(jobGraph.getJobID(), cause);
}
return releaseFuture;
}
SlotPool
@Override
public CompletableFuture<Acknowledge> releaseTaskManager(final ResourceID resourceId, final Exception cause) {
if (registeredTaskManagers.remove(resourceId)) {
releaseTaskManagerInternal(resourceId, cause);
}
return CompletableFuture.completedFuture(Acknowledge.get());
}
private void releaseTaskManagerInternal(final ResourceID resourceId, final Exception cause) {
final Set<AllocatedSlot> removedSlots = new HashSet<>(allocatedSlots.removeSlotsForTaskManager(resourceId));
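// how the release works:
// 1. release the payload of every allocated slot (a payload that can be released is detached from its slot)
for (AllocatedSlot allocatedSlot : removedSlots) {
allocatedSlot.releasePayload(cause);
}
// 2. add this TaskManager's available slots to the set of slots to remove
removedSlots.addAll(availableSlots.removeAllForTaskManager(resourceId));
// 3. free the slots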
for (AllocatedSlot removedSlot : removedSlots) {
TaskManagerGateway taskManagerGateway = removedSlot.getTaskManagerGateway();
taskManagerGateway.freeSlot(removedSlot.getAllocationId(), cause, rpcTimeout);
}
}
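The three steps in `releaseTaskManagerInternal` amount to collecting every slot of one TaskManager, whether allocated or idle, and freeing each of them. A simplified model under stated assumptions: slots and TaskManagers are plain strings, payload release and the RPC to the TaskExecutor are left out, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a slot pool that tracks allocated and available slots per
// TaskManager and removes both kinds when the TaskManager is released.
final class SlotReleaseSketch {
    private final Map<String, Set<String>> allocatedSlots = new HashMap<>();
    private final Map<String, Set<String>> availableSlots = new HashMap<>();

    void addAllocated(String taskManager, String slot) {
        allocatedSlots.computeIfAbsent(taskManager, k -> new HashSet<>()).add(slot);
    }

    void addAvailable(String taskManager, String slot) {
        availableSlots.computeIfAbsent(taskManager, k -> new HashSet<>()).add(slot);
    }

    /** Removes and returns every slot of the TaskManager; each would then be freed. */
    Set<String> releaseTaskManager(String taskManager) {
        Set<String> removed = new HashSet<>();
        Set<String> allocated = allocatedSlots.remove(taskManager);
        if (allocated != null) {
            removed.addAll(allocated);
        }
        Set<String> available = availableSlots.remove(taskManager);
        if (available != null) {
            removed.addAll(available);
        }
        return removed;
    }

    int remainingSlots() {
        return allocatedSlots.values().stream().mapToInt(Set::size).sum()
                + availableSlots.values().stream().mapToInt(Set::size).sum();
    }

    public static void main(String[] args) {
        SlotReleaseSketch pool = new SlotReleaseSketch();
        pool.addAllocated("tm-1", "slot-a");
        pool.addAvailable("tm-1", "slot-b");
        pool.addAllocated("tm-2", "slot-c");
        System.out.println(pool.releaseTaskManager("tm-1").size()); // 2
        System.out.println(pool.remainingSlots());                  // 1
    }
}
```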
RpcTaskManagerGateway
@Override
public CompletableFuture<Acknowledge> freeSlot(AllocationID allocationId, Throwable cause, Time timeout) {
// free the slot on the TaskExecutor
return taskExecutorGateway.freeSlot(
allocationId,
cause,
timeout);
}
TaskExecutor
@Override
public CompletableFuture<Acknowledge> freeSlot(AllocationID allocationId, Throwable cause, Time timeout) {
// free the slot held under this allocation ID
freeSlotInternal(allocationId, cause);
return CompletableFuture.completedFuture(Acknowledge.get());
}
private void freeSlotInternal(AllocationID allocationId, Throwable cause) {
checkNotNull(allocationId);
log.debug("Free slot with allocation id {} because: {}", allocationId, cause.getMessage());
try {
final JobID jobId = taskSlotTable.getOwningJob(allocationId);
final int slotIndex = taskSlotTable.freeSlot(allocationId, cause);
if (slotIndex != -1) {
if (isConnectedToResourceManager()) {
// the slot was freed. Tell the RM about it
ResourceManagerGateway resourceManagerGateway = establishedResourceManagerConnection.getResourceManagerGateway();
resourceManagerGateway.notifySlotAvailable(
establishedResourceManagerConnection.getTaskExecutorRegistrationId(),
new SlotID(getResourceID(), slotIndex),
allocationId);
}
if (jobId != null) {
// check whether we still have allocated slots for the same job
if (taskSlotTable.getAllocationIdsPerJob(jobId).isEmpty()) {
try {
// remove the job through the JobLeaderService
jobLeaderService.removeJob(jobId);
} catch (Exception e) {
log.info("Could not remove job {} from JobLeaderService.", jobId, e);
}
// close the connection to the JobManager
closeJobManagerConnection(
jobId,
new FlinkException("TaskExecutor " + getAddress() +
" has no more allocated slots for job " + jobId + '.'));
}
}
}
} catch (SlotNotFoundException e) {
log.debug("Could not free slot for allocation id {}.", allocationId, e);
}
localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
}
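At startup, slots are requested from the ResourceManager according to the parallelism of the ExecutionGraph that the submitted job finally produces (that is, the maximum parallelism the job can run at, i.e. the number of resources it can use). The ResourceManager then starts containers (-yn) with the configured slots (-ys) and memory (-ytm). Finally, the subtasks the ExecutionGraph is split into (map1, map2, map3...) are deployed and started in the slots, and any surplus containers that end up unused are shut down.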