CPU Throttling in Kubernetes: A Postmortem


Kubernetes is a crucial part of our infrastructure. We don't just deploy applications on Kubernetes in production; we also use it heavily for our CI/CD and developer infrastructure. While developing our CI/CD infrastructure, we ran into a performance issue where our dev and CI environments took a long time to spin up.


In this article, we will go deep into the issue that degraded the performance of our applications, and how we finally solved it.


Background

At Grofers, we follow a microservice architecture where all critical components, like payments, carts, and inventory, are organised as microservices. As a result, developers cannot work on multiple services at the same time in the same namespace. This led us to adopt a design where each developer gets their own namespace, with all services deployed in an isolated environment for testing and debugging. This is what we internally call ‘Grofers-in-a-namespace’. To achieve this, we developed an in-house tool called mft, which is used to create new namespaces in Kubernetes and inject the necessary dependencies from Vault and Consul.


All the services necessary to run a copy of Grofers are then deployed via a common Jenkins setup. These services are exposed via ingress so that developers can run the test suite against them or use them as endpoints for debug apps during application testing.


The Issue

We adopted USE-RED dashboards for our services to help us track critical metrics. This allowed us to optimise the infrastructure further and get maximum performance out of it.


When we started to analyse such metrics we observed that certain applications took a large amount of time to start and get ready, something that developers did not observe in the local setup. Also, some of our Jenkins jobs took a lot of time to create the environments and get them ready, something that we previously did not account for. This led us to revisit all metrics to identify what could have caused this slow startup in our Kubernetes infrastructure.


Furthermore, these issues were not being observed in production environments. The production deployments have slightly different manifests, with much more liberal CPU and memory allocation. Our production cluster also has a lot more headroom for scaling, which led us to believe that this was an infrastructure-specific issue.


Root Cause

After debugging for almost a month, we decided to revisit our testing and Kubernetes setup to help isolate the problem. During the optimisation process of our RAV (Regression And Verification) testing, we started plotting all the Kubernetes metrics that could affect our containers' performance. One interesting metric that we identified was CPU throttling (container_cpu_cfs_throttled_seconds_total). Once we plotted that metric, we found shocking and interesting results: some of our most critical services were getting CPU throttled and we had no idea. Furthermore, we observed that in our CI and dev environments this was happening a lot with some specific containers at startup. These were containers that ran some kind of CPU-intensive operation during startup.

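As a rough sketch of the kind of query we plotted (the Prometheus address here is hypothetical, and the pod/container label names can differ between kubelet and cAdvisor versions), the per-container throttling rate can be pulled straight from the Prometheus HTTP API:

    # Hypothetical Prometheus endpoint; seconds of CPU throttling per container over 5m windows
    curl -s 'http://prometheus:9090/api/v1/query' \
        --data-urlencode 'query=sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (namespace, pod, container)'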


We immediately started a cause-and-effect analysis and came up with the following possible causes:


  1. An incorrect CPU limit on a container, which causes the application to hit the limit quickly and Kubernetes to throttle it.
  2. Background activity such as GC that triggers after some time, causing CPU usage to spike. This can also be caused by incorrect heap sizes for JVM-based applications.
  3. Some periodic CPU-intensive activity on the node that steals CPU cycles available to the cgroups, which also hints at CPU limits that were not set with periodic spikes in the application logic in mind.

What is CPU Throttling

Almost all container orchestrators rely on the kernel control group (cgroup) mechanisms to manage resource constraints. When hard CPU limits are set in a container orchestrator, the kernel uses Completely Fair Scheduler (CFS) Cgroup bandwidth control to enforce those limits. The CFS-Cgroup bandwidth control mechanism manages CPU allocation using two settings: quota and period. When an application has used its allotted CPU quota for a given period, it gets throttled until the next period.


All CPU metrics for a cgroup are located in /sys/fs/cgroup/cpu,cpuacct/<container>. The quota and period settings are in cpu.cfs_quota_us and cpu.cfs_period_us.

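As an illustration (the cgroup path here is hypothetical, since the actual slice and container IDs depend on the node and container runtime), a Kubernetes CPU limit of 500m translates into a quota of 50000µs per 100000µs period, which you can verify directly on the node:

    # Hypothetical path; actual pod and container IDs vary by node and runtime
    CGROUP=/sys/fs/cgroup/cpu,cpuacct/kubepods/pod<pod-uid>/<container-id>

    cat $CGROUP/cpu.cfs_period_us   # enforcement period in microseconds, typically 100000 (100ms)
    cat $CGROUP/cpu.cfs_quota_us    # CPU time allowed per period; 50000 for a 500m limit, -1 if unlimited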

[Figure: A breakdown of CPU resources in cgroup]

You can also view throttling metrics in cpu.stat. Inside cpu.stat you’ll find:


  1. nr_periods — number of periods that any thread in the cgroup was runnable
  2. nr_throttled — number of runnable periods in which the application used its entire quota and was throttled
  3. throttled_time — sum total amount of time individual threads within the cgroup were throttled
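
For instance, dumping cpu.stat for a container that has been throttled might look like this (the numbers are illustrative only; throttled_time is reported in nanoseconds):

    $ cat $CGROUP/cpu.stat
    nr_periods 345489
    nr_throttled 502
    throttled_time 109456473902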

Monitoring Memory Limits for OOM Kills

Another interesting metric to consider is the number of container restarts due to OOM kills. This highlights the containers that are frequently hitting the memory limits specified in their Kubernetes manifests.


kube_pod_container_status_terminated_reason{reason="OOMKilled"}
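
A sketch of how this can be aggregated per namespace, again against a hypothetical Prometheus endpoint:

    # Hypothetical Prometheus endpoint; counts containers whose termination reason was OOMKilled, per namespace
    curl -s 'http://prometheus:9090/api/v1/query' \
        --data-urlencode 'query=sum(kube_pod_container_status_terminated_reason{reason="OOMKilled"}) by (namespace)'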

Resolution


After identification of the root cause, we came up with some possible fixes. We took into account the following considerations:


CPU throttling is primarily caused by low CPU limits, and it is the limits (not the requests) that actually drive the cgroup bandwidth-control behaviour. So a quick solution to the problem was increasing the limits by 10–25% to ensure that the peaks are hit less often, or avoided altogether. This also does not affect the scheduling requirements of the pods, as the requests remain untouched.

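As a sketch of what this looked like in practice (the namespace, deployment name and values here are hypothetical), only the limits are bumped while the requests are left as they were:

    # Hypothetical names and values; requests stay untouched, so scheduling is unaffected
    kubectl -n <developer-namespace> set resources deployment cart-service \
        --limits=cpu=750m,memory=1Gi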

Meanwhile, for certain intensive applications, especially those built on JVM-based systems, we decided to profile the applications again to determine the correct CPU and memory requirements, as the JVM is notorious for high resource consumption. Tweaking JVM parameters will be the right long-term fix for such applications.

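For example (a sketch only; the exact flags and values depend on the JDK version and the workload), the kind of container-aware JVM settings we have in mind look like this:

    # Assumes Java 10+ (or 8u191+): size the heap and thread pools from the container's
    # cgroup limits rather than the node's physical resources
    JAVA_OPTS="-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:ActiveProcessorCount=2"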

What We Learned and Next Steps

It was an insightful experience for us. We realised that some of the less-watched (or, in this case, overlooked) metrics can have a deep impact on application performance. We also gained some valuable insights into CFS and cgroups, and into how the kernel handles resource virtualisation.


Based on this exercise, we came up with an application profiling plan for our major applications and added CPU throttling to our list of core suspects for poor application performance.


Translated from: https://lambda.grofers.com/cpu-throttling-in-kubernetes-a-postmortem-b9b433d24b03
