Ultimate Kubernetes Resource Planning Guide

Understanding allocatable CPU/memory on Kubernetes nodes and optimizing resource usage.

Capacity planning for Kubernetes is a critical step in running production workloads on clusters optimized for performance and cost. Given too few resources, Kubernetes may start to throttle CPU or kill pods with an out-of-memory (OOM) error. On the other hand, if pods request too much, Kubernetes will struggle to schedule new workloads and resources sit idle and go to waste.

Unfortunately, capacity planning for Kubernetes is not simple. Allocatable resources depend on the underlying node type as well as on reserved system and Kubernetes components (e.g. OS, kubelet, monitoring agents). Pods also require some fine-tuning of resource requests and limits for optimal performance. In this guide, we will review some Kubernetes resource allocation concepts and optimization strategies to help estimate capacity usage and modify the cluster accordingly.

Allocatable CPU & Memory

One of the first things to understand is that not all the CPU and memory on a Kubernetes node can be used by your application. The available resources on each node are divided in the following way:

  1. Resources reserved for the underlying VM (e.g. the operating system and system daemons such as sshd and udev)
  2. Resources needed to run Kubernetes (e.g. kubelet, container runtime, kube-proxy)
  3. Resources for other Kubernetes-related add-ons (e.g. monitoring agents, node problem detector, CNI plugins)
  4. Resources available for your applications
  5. Capacity held back by the eviction threshold to prevent system OOMs
[Image: node allocation diagram]

For self-administered clusters (e.g. kubeadm), each of these resources can be configured via the kubelet's system-reserved, kube-reserved, and eviction-threshold flags. For managed Kubernetes clusters, cloud providers document node resource allocation per VM type (GKE and AKS explicitly state usage, whereas EKS values have to be estimated from the EKS AMI or EKS bootstrap comments).
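
For a kubeadm-style cluster, a minimal sketch of such a kubelet configuration might look like the following (the reservation sizes are illustrative assumptions, not recommendations; the equivalent command-line flags are --system-reserved, --kube-reserved, and --eviction-hard):

```yaml
# Sketch of a KubeletConfiguration that carves out capacity for the OS and
# for Kubernetes system components; all values here are illustrative only.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:              # held back for the OS and system daemons (sshd, udev, ...)
  cpu: "100m"
  memory: "256Mi"
kubeReserved:                # held back for kubelet, container runtime, kube-proxy
  cpu: "100m"
  memory: "512Mi"
evictionHard:                # eviction threshold that protects the node from system OOMs
  memory.available: "100Mi"
  nodefs.available: "10%"
```

On managed platforms these values are set by the provider and only show up indirectly in each node's allocatable figures.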

Let's take GKE as an example. First, GKE reserves 100 MiB of memory on each node for the eviction threshold.

For CPU, GKE reserves:

  • 6% of the first core
  • 1% of the next core (up to 2 cores)
  • 0.5% of the next 2 cores (up to 4 cores)
  • 0.25% of any cores above 4 cores

For memory, GKE reserves:

  • 255 MiB of memory for machines with less than 1 GB of memory
  • 25% of the first 4GB of memory
  • 20% of the next 4GB of memory (up to 8GB)
  • 10% of the next 8GB of memory (up to 16GB)
  • 6% of the next 112GB of memory (up to 128GB)
  • 2% of any memory above 128GB

Using the general-purpose n1-standard-1 VM type (1 vCPU, 3.75GB memory), we are then left with:

  • Allocatable CPU = 1 vCPU - (0.06 * 1 vCPU) = 0.94 vCPU
  • Allocatable memory = 3.75GB - (100MiB + 0.25 * 3.75GB) ≈ 2.71GB

Before we run any applications, we can see that we only have ~75% of the underlying node's memory and ~95% of the CPU. On the other hand, bigger nodes are less impacted by the system and Kubernetes overhead. As shown below, an n1-standard-96 node leaves 99% of the CPU and 96% of the memory for your applications.

(The full list of allocatable memory and CPU resources for each machine type can be found here, along with caveats for Windows Server nodes.)

[Image: GKE documentation]

Resource Asymmetry

Now that we understand allocatable resources, the next challenge is dealing with resource asymmetry. Some applications may be CPU-intensive (e.g. machine learning workloads, video streaming), whereas others may be memory-intensive (e.g. Redis). Given such resource asymmetry, kube-scheduler will try its best to schedule each workload onto the most suitable node within the resource constraints. Kube-scheduler's placement of pods across nodes is guided by a scoring algorithm influenced by resource requirements, affinity/anti-affinity rules, data locality, and inter-workload interference. Although it is possible to tune the scheduler's performance in terms of latency (i.e. time to schedule a new pod) and the node scoring threshold for scheduling decisions, selecting the correct node type is critical to avoid unnecessary scaling and unused resources on each node.
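
As a rough illustration of that scoring threshold, it can be lowered through the scheduler's component configuration. This is a minimal sketch assuming a kube-scheduler version that accepts the v1beta1 config API; 50 is an arbitrary example value:

```yaml
# Sketch: score only a percentage of the feasible nodes, trading a bit of
# placement optimality for lower scheduling latency on large clusters.
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
percentageOfNodesToScore: 50   # when unset, the default adapts to cluster size
```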

One method to deal with resource asymmetry is to create multiple node pools for different application types. For example, stateless applications may run on general-purpose, preemptible nodes, whereas databases may be scheduled onto CPU- or memory-optimized nodes. This can be controlled with node taints and affinity rules so that specific workloads are scheduled only onto the tainted nodes, as in the sketch below.
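
For instance, if a memory-optimized pool is tainted with something like workload=database:NoSchedule (the taint key, node label, and image below are made-up examples), a database pod can opt into that pool with a matching toleration and a node affinity rule:

```yaml
# Hypothetical sketch: pin a database pod to a tainted, memory-optimized node pool.
# Assumes the nodes were tainted with:  kubectl taint nodes <node> workload=database:NoSchedule
# and labeled with:                     pool=memory-optimized
apiVersion: v1
kind: Pod
metadata:
  name: redis-example
spec:
  tolerations:                     # allows scheduling onto the tainted nodes
    - key: workload
      operator: Equal
      value: database
      effect: NoSchedule
  affinity:
    nodeAffinity:                  # requires the memory-optimized pool
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: pool
                operator: In
                values: ["memory-optimized"]
  containers:
    - name: redis
      image: redis:6               # example image
```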

[Image: GCP blog]

Even with multiple node pools and affinity rules set, Kubernetes resource usage may become suboptimal over time given its dynamic nature:

  • New nodes may be added to the cluster to deal with higher load.
  • Nodes may fail or be recreated during cluster upgrades.
  • Taints or pod/node affinity rules may change to deal with new application requirements.
  • Some nodes may become under- or over-utilized after applications are deleted or added.

To rebalance the pods across the nodes, run descheduler as a Job or CronJob inside the Kubernetes cluster. Descheduler is a kubernetes-sig project that includes seven strategies (RemoveDuplicates, LowNodeUtilization, RemovePodsViolatingInterPodAntiAffinity, RemovePodsViolatingNodeAffinity, RemovePodsViolatingNodeTaints, RemovePodsHavingTooManyRestarts, and PodLifeTime) to optimize node resource usage automatically.
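
As a minimal sketch using the project's v1alpha1 policy format (the thresholds are arbitrary examples), a policy that enables just the LowNodeUtilization strategy could look like this, typically mounted into the descheduler Job or CronJob via a ConfigMap:

```yaml
# Sketch of a descheduler policy that evicts pods from over-utilized nodes
# so they can be rescheduled onto under-utilized ones; thresholds are examples.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # nodes below all of these count as under-utilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:    # nodes above any of these become eviction candidates
          "cpu": 50
          "memory": 50
          "pods": 50
```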

Resource Ranges & Quotas

Finally, we must understand and define resource ranges for our application. Kubernetes provides two basic configurable parameters for resource management:

  • Requests: lower bound on resource usage per workload
  • Limits: upper bound on resource usage per workload

The Kubernetes scheduler takes the request parameter of each workload and allocates the prescribed CPU and memory. This is the minimum amount reserved for the workload, but the application may actually use less or more than this threshold. The limit, on the other hand, sets the maximum resource usage. When the workload exceeds its limit, the kubelet will throttle the CPU or issue an OOM kill. In an ideal world resource utilization would be 100%, but in reality resource usage is often irregular and hard to predict.
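
In practice, both parameters are set per container, for example (a generic sketch; the pod name, image, and numbers are placeholders):

```yaml
# Sketch of per-container requests and limits; all values are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: api-example
spec:
  containers:
    - name: api
      image: nginx:1.19            # stand-in image for illustration
      resources:
        requests:                  # what the scheduler reserves on the node
          cpu: "250m"
          memory: "256Mi"
        limits:                    # exceeding these triggers CPU throttling / an OOM kill
          cpu: "500m"
          memory: "512Mi"
```

Because the requests and limits differ, this particular pod would land in the Burstable QoS class described below.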

Another thing to consider is the application's quality of service (QoS). Combining requests and limits, we can provide the following QoS classes:

  • Guaranteed: For relatively predictable workloads (e.g. CPU-bound web servers, scheduled jobs), we can explicitly provision a fixed amount of resources for guaranteed QoS. To enable this, specify only the limit (the request then defaults to the limit) or set the request equal to the limit.

  • Burstable: For workloads that need to consume more resources based on traffic or data (e.g. Elasticsearch, a data analytics engine), we can set the requests based on steady-state usage but give a higher limit to allow the workload to scale vertically.

  • Best Effort: Finally, for workloads with unknown resource usage, leave both the requests and limits unspecified to allow the workload to use all available resources on the node. From the scheduler's perspective, these workloads are treated as the lowest priority and will be evicted before Guaranteed or Burstable workloads.

So how do we set reasonable requests and limits to avoid performance degradation and optimize for cost? A good rule of thumb is to benchmark typical usage and give a 25% margin for the limits. We don't want to set the requests or the limits too high, since that blocks other applications from utilizing resources and makes it harder for the scheduler to allocate resources for our workload. Starting from the initial resource configuration, run some load tests and record any performance degradation or failures. If the workload starts to slow down or gets killed, double the limit and continue the tests. On the other hand, if there is no significant change in performance, try decreasing the requests and limits to free up resources for the cluster.
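
As a hypothetical worked example of that rule of thumb, if load testing shows a steady state of roughly 400m CPU and 800Mi of memory, the container's resources block might start out as:

```yaml
# Requests pinned to the benchmarked steady state, limits ~25% above it.
resources:
  requests:
    cpu: "400m"
    memory: "800Mi"
  limits:
    cpu: "500m"
    memory: "1000Mi"
```

From there, the load tests above determine whether those limits get doubled or trimmed back.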

[Image: Sysdig]

Finally, we can use Vertical Pod Autoscaler (VPA) to automatically set requests and maintain ratios between limits and requests based on historical usage. VPA can automatically scale down pods that are over-requesting resources and scale up pods that are under-requesting. However, VPA comes with some limitations:

  • Updating running pods is an experimental feature of VPA. Updates from VPA cause all running containers to be restarted.
  • VPA currently cannot be used with the Horizontal Pod Autoscaler (HPA) on CPU or memory. A workaround is to drive HPA with custom or external metrics so that VPA and HPA can be combined.
  • VPA is not ready for use with JVM-based workloads due to limited visibility into actual memory usage.

If you do not wish to rely on VPA to update running workloads, you can use VPA in recommendation mode instead, or use a tool like Goldilocks from FairwindsOps to see recommended values on a dashboard. Note that VPA only knows about resource usage on the cluster, so some load tests are still needed to better characterize the application.
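
A minimal sketch of a VPA object running in recommendation-only mode might look like this (the VPA name and the target Deployment name are made-up examples):

```yaml
# Sketch of a VPA that only publishes recommendations and never restarts pods.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa                # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # hypothetical workload to observe
  updatePolicy:
    updateMode: "Off"          # recommendation mode: report only, do not update running pods
```

The recommendations can then be read back from the object's status and compared against the benchmarks gathered above.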

[Image: Goldilocks]

Conclusion

Careful resource planning and capacity analysis add predictability and improve system resiliency. Don't wait until things break in production to find out what resource requests and limits should be set for your workloads. Profile and benchmark typical resource utilization, choose the appropriate node types and node pools, and define QoS with resource ranges. Take advantage of VPA or validation tools like Goldilocks to craft an optimized cluster for your applications.

Translated from: https://medium.com/dev-genius/ultimate-kubernetes-resource-planning-guide-449a4fddd1d6
