Scaling Kubernetes Jobs for Unity Simulation

Unity Simulation enables product developers, researchers, and engineers to smoothly and efficiently run thousands of instances of parameterized Unity builds in batch in the cloud. Unity Simulation allows you to parameterize a Unity project in ways that will change from run to run. You can also specify simulation output data necessary for your end application, whether that be the generation of training data for machine learning, the testing and validation of AI algorithms, or the evaluation and optimization of modeled systems. With Unity Simulation, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems. This blog post showcases how our engineers are continually innovating to ensure that our customers’ jobs run as fast and as cost-effectively as possible on Unity Simulation.


Unity Simulation and Kubernetes

Unity Simulation leverages Kubernetes, an open source system, to containerize, schedule, and execute simulation jobs across the right number and type of compute instances. Kubernetes allows for easy download of simulation output data to a cloud storage location to connect to design, training, and testing workflows. By leveraging Kubernetes, you can run multiple simulations at a time without having to worry about compute resource allocation or capacity planning.


The following concepts are fundamental to understanding Unity Simulation and Kubernetes:


  • Run Definition: This specifies the name and description of the simulation, a set of application parameters for the simulation execution, system parameters specifying the compute resources to use, and the Unity build id, which references an uploaded Unity executable.


  • Kubernetes Job: When you deploy Kubernetes, you get a cluster. A Kubernetes cluster consists of a set of worker machines, called Nodes, that run containerized applications. The worker Node(s) host the Pods that are the components of the application workload. A Pod is the basic execution unit for a Kubernetes system and represents the processes being run on your cluster. A Kubernetes Job is managed by the system level Job Controller that supervises Pods participating in a batch process that runs for a certain amount of time and then completes. Where the term “Job” appears in this blog, it always refers to a Kubernetes Job.


  • Kubernetes Controller and Operator: A Kubernetes Controller is responsible for incrementally moving the current state of a resource toward the desired state. The Kubernetes Job Controller creates one or more Pods and ensures a specified number of them complete successfully. A Kubernetes Operator is a controller that follows this pattern but is extended to embody specific operational knowledge required to run a workload. Our Simulation Job Operator has an understanding of the Kubernetes Autoscaler and how its behavior can affect the current state versus the desired state, as we discuss in this article.


A Kubernetes Job creates one or more Pods and ensures that the correct number of Pods successfully complete. A work queue is typically used to distribute tasks to the Pods assigned to the Job. The application process running in the container can pick tasks from the queue in parallel or separately as needed. The Job’s parallelism parameter is used to determine the number of parallel Pods that the Job runs simultaneously (or in other words, the number of concurrent simulation instances for the run execution). The Job’s completions parameter determines the number of Pods that must successfully finish.

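For illustration only, the sketch below shows how such a work-queue style Job might be submitted with client-go. This is not Unity Simulation's production scheduler code; the namespace, worker image, and queue address are hypothetical placeholders.

```go
// A sketch only: submit a work-queue style simulation Job with client-go.
// The namespace, image, queue address, and env var name are hypothetical.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Out-of-cluster config for the sketch; a real scheduler component would
	// typically use in-cluster config instead.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "sim-run-example"},
		Spec: batchv1.JobSpec{
			Parallelism: int32Ptr(4),  // four simulation instances run at a time
			Completions: int32Ptr(15), // fifteen Pods must finish successfully
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "sim-worker",
						Image: "example.com/sim-worker:latest", // hypothetical worker image
						Env: []corev1.EnvVar{{
							// Hypothetical: where the worker pulls its tasks from.
							Name:  "QUEUE_ADDR",
							Value: "redis:6379",
						}},
					}},
				},
			},
		},
	}

	if _, err := clientset.BatchV1().Jobs("default").Create(
		context.Background(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

The parallelism and completions values here match the example discussed below: a Job with a parallelism of four and 15 required completions.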

The Unity Simulation scheduler orchestrates run executions using the Kubernetes Job and the queue design pattern. The scheduler enqueues a message for each simulation instance to an independent run execution queue before submitting the Job to the Kubernetes cluster.


The following diagram shows how queues are used to distribute the messages to the Pods for executing Jobs. This diagram shows a Job with a parallelism of four. In other words, there are four instances of a Unity project running in the simulation.

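As a deliberately minimal sketch of such a worker process, the program below pops messages until the queue is drained and then exits cleanly. This post does not name the queue technology Unity Simulation uses, so a Redis list is assumed purely as a stand-in; exiting with status 0 is what lets the Pod count toward the Job's completions.

```go
// A minimal worker sketch for the queue pattern shown in the diagram. The
// queue technology is not specified in this post; a Redis list is assumed
// here purely as a stand-in.
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("QUEUE_ADDR")}) // e.g. "redis:6379"

	for {
		// Pop the next run-execution message enqueued by the scheduler.
		task, err := rdb.LPop(ctx, "sim-run-queue").Result()
		if err == redis.Nil {
			// Queue drained: exit 0 so this Pod counts toward the Job's completions.
			return
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, "queue error:", err)
			os.Exit(1)
		}
		runSimulation(task)
	}
}

// runSimulation stands in for launching the parameterized Unity build with
// the instance parameters carried by the message.
func runSimulation(task string) {
	fmt.Println("executing simulation instance:", task)
}
```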

The problem

Our application of Kubernetes differs from most in that we use a combination of batch processing and autoscaling. We discovered that batch processing Jobs, combined with the Kubernetes Autoscaler behavior, leads to an unexpected interaction that results in a significant waste of compute resources and Job inefficiencies. The Kubernetes Autoscaler alternates between scaling up and scaling down the cluster, and the Job Controller reports incorrect state. This leads to overblown estimates of Job time, inaccurate reports after Job completion, and overall CPU inefficiency.


During the Job lifecycle, the completion count should either stay the same or increase, but our metrics showed Jobs whose count decreased. The incorrect completion counts caused the Job Controller to create more Pods to satisfy the completion count requirement, which determines the number of Pods that must successfully finish. The Job requested more Pods, causing the Kubernetes Autoscaler to add nodes to the cluster. The newly created Pods completed immediately after they were created because there were no remaining tasks in the queue. The added nodes quickly became idle after completing the Pods because the completion count was reached. This caused the Autoscaler to remove the idle nodes from the cluster, causing the Pod completion count to decrease.


This behavior creates a vicious cycle: the lost completions prompt the Job Controller to create new Pods, the new Pods prompt the Autoscaler to add nodes, the new Pods complete immediately because the queue is empty, the added nodes go idle and are removed, and the removal loses completions all over again.

The scale-up also left our cluster unable to execute other work because the problematic Job was consuming all of the cluster's resources. At best this wastes resources; at worst it makes the service unavailable.

The problem is described in detail in the following steps.

Step 1:

Because Unity Simulation is a product running on Google Cloud Platform, we use GKE, a managed Kubernetes solution. Let us assume the GKE cluster has one running node, Node1, capable of hosting five Pods. The new Job requires 15 Pods and causes the GKE cluster to add two nodes, Node2 and Node3, to increase capacity to run the 15 Pods.

Step 2:

All Pods are ‘active’ on the GKE cluster.

Step 3:

Five Pods on Node1 go to the ‘complete’ state (green) for the Job.

Step 4:

The Pods on Node1 have completed, so the node becomes idle and is scaled down. When Node1 is removed from the cluster, its completion count of five is lost, and the completion count for the Job drops to zero when it should be five. This triggers the error condition, because only ten Pods are accounted for while the Job expects 15 Pods to exist. The Job Controller requests five new Pods, which causes the Autoscaler to add a node to the cluster again.

To understand this problem better, we needed to dive deep and carefully study the Job Controller source code in Kubernetes. The SyncJob function synchronizes the state of a Job based on the current state of the Pods it manages. SyncJob calls getStatus to get the number of successful and failed Pods for the Job, and those Pods are retrieved by querying for the Pods currently existing in the cluster using a selector in the getPodsForJob function.
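To make that concrete, here is a simplified, standalone illustration of the counting logic rather than the controller's actual implementation: the totals are recomputed from whatever Pods the API server still returns for the Job's label selector.

```go
// A simplified, standalone illustration of the counting logic (not the
// controller's actual code): success and failure totals are recomputed from
// whatever Pods the API server still returns for the Job's label selector.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// "job-name" is the label the Job Controller stamps on the Pods it creates;
	// the Job name is the hypothetical one from the earlier sketch.
	pods, err := clientset.CoreV1().Pods("default").List(context.Background(),
		metav1.ListOptions{LabelSelector: "job-name=sim-run-example"})
	if err != nil {
		panic(err)
	}

	var succeeded, failed int32
	for _, pod := range pods.Items {
		switch pod.Status.Phase {
		case corev1.PodSucceeded:
			succeeded++
		case corev1.PodFailed:
			failed++
		}
	}
	// Only Pods that still exist in the cluster contribute to these totals.
	fmt.Printf("succeeded=%d failed=%d\n", succeeded, failed)
}
```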

Unfortunately, removing a node from the Kubernetes cluster also deletes the metadata for the Pods that ran on that node. When the Job Controller queries Kubernetes for a Job's Pods after the Autoscaler has taken a node down, it therefore receives incorrect completion counts. We easily reproduced this behavior by creating a simple Job that executes a single long-running sleep command and many shorter sleep commands in separate tasks.
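A sketch of that reproduction setup, again assuming the hypothetical Redis-backed queue from the earlier worker example: one message that sleeps for an hour keeps a single Pod busy while the many short messages finish quickly.

```go
// A sketch of the reproduction, reusing the hypothetical Redis-backed queue
// from the earlier worker example: one long task plus many short ones.
package main

import (
	"context"
	"os"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("QUEUE_ADDR")})

	// One task keeps a single Pod (and its node) busy for an hour...
	if err := rdb.RPush(ctx, "sim-run-queue", "sleep 3600").Err(); err != nil {
		panic(err)
	}
	// ...while the short tasks finish in seconds, so their Pods complete early,
	// their nodes go idle and are scaled down, and the completion count drops.
	for i := 0; i < 14; i++ {
		if err := rdb.RPush(ctx, "sim-run-queue", "sleep 30").Err(); err != nil {
			panic(err)
		}
	}
}
```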

The solution

After getting more familiar with the Job Controller source code, we realized we could fix the scaling problem by persisting the status of the Pods. This ensures that the Pod metadata is captured even when it is not available in the Kubernetes cluster. We found that developing an Operator for executing simulations is beneficial for other reasons too.

The custom resource definition and Operator we implemented are very similar to the current Kubernetes Job Controller, with a fix for the autoscaling issue. Our Simulation Job (SimJob) Operator updates a list of the unique successful and failed Pods each time its control loop runs. The current state of the Pods determines both the current state of the SimJob in the Kubernetes cluster and the contents of the data store that holds the unique set of successful and failed Pods.
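The sketch below outlines the idea only; it is not Unity's published Operator code, and the SimJobRecord type and its persistence layer are hypothetical stand-ins for the data store that holds the unique set of successful and failed Pods across control-loop passes.

```go
// An outline of the idea only (not Unity's published Operator code): merge the
// Pods observed on each control-loop pass into a persisted record of unique
// successes and failures, so completions already counted survive scale-down.
package simjob

import (
	corev1 "k8s.io/api/core/v1"
)

// SimJobRecord is the durable state kept in an external data store
// (hypothetical; the real store is not described in this post).
type SimJobRecord struct {
	Succeeded map[string]bool // Pod name -> observed as succeeded
	Failed    map[string]bool // Pod name -> observed as failed
}

func NewSimJobRecord() *SimJobRecord {
	return &SimJobRecord{
		Succeeded: map[string]bool{},
		Failed:    map[string]bool{},
	}
}

// Merge folds the currently observable Pods into the record. Pods that have
// disappeared along with a scaled-down node stay counted, because they were
// recorded on an earlier pass of the control loop.
func (r *SimJobRecord) Merge(pods []corev1.Pod) {
	for _, pod := range pods {
		switch pod.Status.Phase {
		case corev1.PodSucceeded:
			r.Succeeded[pod.Name] = true
		case corev1.PodFailed:
			r.Failed[pod.Name] = true
		}
	}
}

// Counts returns the completion numbers written back to the SimJob status;
// they can only stay the same or increase between passes.
func (r *SimJobRecord) Counts() (succeeded, failed int) {
	return len(r.Succeeded), len(r.Failed)
}
```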

The following diagram shows how our SimJob Operator maintains the correct ‘completion’ count even when the cluster is scaled down:

We have run the SimJob Operator in production for the past two months. It has successfully executed over 1,000 simulations with nearly 50,000 total execution instances (measured in Pods; there can be one or more simulation instances per Pod). We can now safely autoscale our simulation run executions without risking the availability of the cluster and, in turn, of the Unity Simulation service. We are very happy with this trend and are excited to continue to improve and add new features to the SimJob Operator.

Conclusion

Unity Simulation is at the forefront of data-driven artificial intelligence, whether that be the generation of training data for machine learning, the testing and validation of AI algorithms, or the evaluation and optimization of modeled systems. Our teams continue to innovate daily to provide the best-managed simulation service ecosystem.

If you’d like to join us to work on exciting Unity Simulation and AI challenges, we are hiring for several positions; please apply!

Learn more about Unity Simulation.

Translated from: https://blogs.unity3d.com/2020/06/17/scaling-kubernetes-jobs-for-unity-simulation/
