Tips for Rightsizing Your Kubernetes Cluster

Managing a Kubernetes cluster is not a one-size-fits-all problem. There are many ways to rightsize your cluster, and it is vital to design your application for reliability and robustness.

As a site reliability or DevOps engineer, you often need to understand the requirements of the applications you will run on your cluster and then weigh various factors while designing it.

Choosing the correct node size is critical when building applications for scale. Many smaller nodes and a few large nodes form the two ends of the spectrum. Given a cluster requirement of 24GB of memory and 12 CPUs in total, should you choose twelve 1-CPU/2GB machines or two 6-CPU/12GB machines?

Common sense tells us to choose somewhere in the middle, but let's understand what to factor into that decision. Before settling on an approach, let's look at what each end of the spectrum offers.

High Availability

Kubernetes provides high availability for your applications by default if you use at least two worker nodes. However, when choosing between a few large worker nodes and many small ones, we need to understand what each option gives us.

If we have a cluster with two large worker nodes and we lose one, we lose half the cluster capacity. Unless you overprovision every node by 100%, that is a recipe for disaster, as the surviving node cannot cope with double the load. Spreading the same capacity across n worker nodes limits the impact of a single failure to 1/n. For example, if you have a ten-node cluster and you lose one node, you lose just 10% of your cluster capacity.
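
To make this concrete, here is a minimal Python sketch (illustrative arithmetic only, not a Kubernetes API) of what a single node failure costs at each end of the spectrum:

```python
def failure_impact(num_nodes: int) -> tuple[float, float]:
    """For a cluster of identical nodes, return the fraction of total
    capacity lost when one node fails, and the extra headroom each
    surviving node needs to absorb the displaced workload."""
    lost = 1 / num_nodes
    extra = 1 / (num_nodes - 1)
    return lost, extra

for n in (2, 10):
    lost, extra = failure_impact(n)
    print(f"{n} nodes: lose {lost:.0%}, survivors need {extra:.0%} headroom")
# 2 nodes: lose 50%, survivors need 100% headroom
# 10 nodes: lose 10%, survivors need 11% headroom
```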

Winner: Many small nodes

Management Overhead

With more nodes in the picture, you have more servers to patch, update, and maintain. With fewer nodes, management becomes easier. In a DevOps or SRE setup, tools such as Ansible or Puppet can simplify this by automating those tasks.

If you are using a managed Kubernetes cluster, your cloud provider takes care of patching and upgrades. Therefore, with modern tooling, neither option yields a significant management advantage.

Winner: None

Ease of Scheduling Containers

Kubernetes weighs multiple aspects when scheduling containers on worker nodes, among them resource availability and the containers' resource requests and limits. With a large amount of resources available on each node, it is much easier for Kubernetes to place containers on a cluster of a few large worker nodes than on many smaller nodes, where a big container may not fit anywhere.
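
To see why, consider a toy first-fit placement, a drastic simplification of the real scheduler (which also weighs affinity, taints, and much more), assuming CPU requests only:

```python
def first_fit(requests: list[float], node_cpu: float):
    """Greedily place CPU requests onto nodes of size node_cpu.
    Returns the number of nodes used, or None if a request is too
    large to fit on any node at all."""
    free = []  # remaining CPU on each node opened so far
    for req in sorted(requests, reverse=True):
        if req > node_cpu:
            return None  # this container can never be scheduled
        for i, f in enumerate(free):
            if f >= req:
                free[i] -= req
                break
        else:
            free.append(node_cpu - req)  # open a new node
    return len(free)

requests = [3, 3, 2, 2, 1, 1]           # CPU cores requested per container
print(first_fit(requests, node_cpu=6))  # 2 -- two large nodes suffice
print(first_fit(requests, node_cpu=2))  # None -- the 3-core requests never fit
```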

Winner: A few large nodes

Node Auto-Scaling

Many cloud providers allow automatic horizontal scaling of your Kubernetes worker nodes. With large worker nodes, scaling is a bit clunky: if you add a third node to a two-node cluster, the new node accounts for a whopping 33% of the resulting capacity, while in a ten-node cluster an added node contributes only 9%. Smaller nodes therefore give you smoother scaling and less resource wastage.
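
A quick check of those scaling steps, assuming identical nodes:

```python
def scaling_step(current_nodes: int) -> float:
    """Share of the *new* total capacity contributed by one added node."""
    return 1 / (current_nodes + 1)

print(f"{scaling_step(2):.0%}")   # 33% -- third node in a two-node cluster
print(f"{scaling_step(10):.0%}")  # 9%  -- eleventh node in a ten-node cluster
```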

Winner: Many small nodes

Ease of Maintenance

Nodes in a Kubernetes cluster do not stay up indefinitely: you will have to bring them down for OS patching, Kubernetes version upgrades, and other maintenance activities. If you have only a few worker nodes, it is difficult to do this without disrupting your applications, and bringing one node down severely overloads the others, so in most cases you would need to take downtime. None of that is required with many small nodes: you can do maintenance one node at a time, and Kubernetes easily reschedules the affected containers onto other free nodes.

Winner: Many small nodes

Kubelet Overhead

The kubelet is responsible for interacting with the underlying container runtime to run containers on each worker node. If a single worker node runs many containers, a typical situation in clusters of large nodes, the kubelet can choke, as it has to take care of all of them.

The kubelet is not optimised to manage more than a certain number of containers, and many cloud providers limit the number of pods you can run on a single worker node. Therefore, a few large worker nodes might not be the right solution when you have a multitude of smaller containers.

Winner: Many small nodes

System Overhead

Conversely, many nodes choke your master servers, which have to interact with and manage every worker node. Kubernetes is not optimised to handle more than about 500 nodes per cluster, even though the project claims support for up to 5,000.

Therefore, be careful about how many nodes you add to your cluster, and balance node size against node count. Most use cases will not require that many nodes, but such setups are not uncommon in high-end applications.

Winner: A few large nodes

Rightsizing Your Nodes

Well, it depends, as always! The right answer lies somewhere in between, and you should not choose either of the two extremes.

These are some of the rules that I follow, and they work pretty well for me. Answer the questions below to arrive at the right size.

What type of applications are you running?

Are you running a microservices application made up of many containers, each with a small footprint, or several monoliths requiring gigabytes of memory and multiple CPU cores? Databases? Or a mixed load?

Can you club similar workloads together?

For mixed workloads, can you separate the workloads by category? For example, if you plan to run a microservices application that interacts with several MySQL databases, you can have one group for the microservices and another for the databases. Similarly, if you run monoliths alongside them, for example an ELK stack for monitoring the cluster, you can put those into a third group. For simplicity, I have mentioned just three categories, but you can create as many as you like.

Maximum resource utilisation of a particular application category

If you can group your applications into multiple brackets, you then need to understand the maximum resource requirement of one application in each category. For example, if you are running microservices, monoliths, and databases on the cluster, you need the maximum utilisation of a single database instance, a single microservice, and a single monolith.

Get the number of applications per category

The next step is to get the planned number of applications of each type. You may be using, for instance, eight MySQL databases, 200 microservices, and nine monoliths. Make a note of them, and don't overestimate: Kubernetes is designed for scalability, and you can always use the cluster autoscaling feature some providers offer, or add nodes manually later.
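
Captured as data, the answers to the last few questions might look like this sketch (the structure is mine, not a standard format; the figures match the worked example later in this article):

```python
# One entry per workload category: planned count and per-container maximums.
pools = {
    "microservices": {"count": 200, "max_cpu": 0.1, "max_mem_mb": 100},
    "databases":     {"count": 20,  "max_cpu": 2.0, "max_mem_mb": 4000},
}
```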

Place each category on separate node pools

The best way to rightsize your nodes is to create one node pool per category. Most managed Kubernetes offerings support multiple node pools, and on-premises setups allow heterogeneous nodes.

Plan for high availability

We need at least two nodes per node pool to ensure we are designing for high availability. Also bear in mind that Kubernetes recommends a maximum of 110 pods per node, and that some system pods run on every node.

You may have to adjust this value for your cloud provider, as most impose their own hard limits on the number of pods you can run per node. The idea is to plan to schedule fewer pods than the limit.
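
As a sketch, that planning rule might look like this (the 110 figure is the upstream Kubernetes recommendation; the system-pod count and safety margin are assumptions you should tune for your environment):

```python
KUBERNETES_RECOMMENDED_MAX_PODS = 110  # upstream per-node recommendation

def plan_pods_per_node(provider_limit: int,
                       system_pods: int = 5,
                       safety_margin: float = 0.1) -> int:
    """Plan to schedule fewer pods than the hard limit, leaving room
    for system pods (CNI, kube-proxy, log agents, ...) and a margin."""
    hard_limit = min(provider_limit, KUBERNETES_RECOMMENDED_MAX_PODS)
    return int(hard_limit * (1 - safety_margin)) - system_pods

print(plan_pods_per_node(provider_limit=110))  # 94 schedulable app pods
```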

Square your containers to nodes for capacity management

If we plan for the failure of one node at a time, each node should be overprovisioned so that the surviving nodes can absorb the workloads that were running on the failed node.

Square your containers to your nodes for the best possible capacity management. For example, if we have 20 containers, the best way to schedule them is four pods per node across five nodes: in that case, we have to overprovision each node by only 20%, instead of the 100% required if we chose two nodes.

How did I arrive at this figure? Take the nearest lower perfect square of your container count and use its square root. That gives you the maximum number of containers you should run per node.
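
In code, that heuristic is just the integer square root, which by definition is the root of the nearest lower perfect square:

```python
import math

def containers_per_node(total_containers: int) -> int:
    """Root of the nearest lower perfect square, e.g. 20 -> sqrt(16) = 4."""
    return math.isqrt(total_containers)

print(containers_per_node(20))   # 4
print(containers_per_node(200))  # 14 (nearest lower perfect square: 196)
```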

Consider Kubernetes system overhead

Remember that Kubernetes system components consume some CPU and memory on every node, so the resources available to workloads are less than what is provisioned. Check with your cloud provider for that value and add it to the node capacity.
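
For instance, with illustrative figures (real reservations vary by provider and node size):

```python
# Capacity to provision = what the workloads need + system reservation.
workload_cpu, workload_mem_mb = 1.5, 1500  # needed by your containers
system_cpu, system_mem_mb = 0.5, 500       # consumed by Kubernetes components
print(workload_cpu + system_cpu, workload_mem_mb + system_mem_mb)  # 2.0 2000
```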

Summing It Up

Considering everything, you can use the following to arrive at an optimal figure:

Number of containers per node = square root of the nearest lower perfect square to the total number of containers, provided it does not exceed the recommended per-node pod limit

Number of nodes = total number of containers / number of containers per node, rounded up

Overprovision factor = number of containers per node * max resource per container / (number of nodes - max planned unavailable nodes)

Node capacity = max resource required per container * number of containers per node + overprovision factor + Kubernetes system resource requirements
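
Here is the whole calculation as a hedged Python sketch (function and parameter names are mine; resources are tracked as cores and MB, with 1GB taken as 1000MB to match the arithmetic below):

```python
import math

def size_node_pool(total_containers: int, max_cpu: float, max_mem_mb: float,
                   sys_cpu: float = 0.5, sys_mem_mb: float = 500,
                   planned_unavailable: int = 1, pod_limit: int = 110):
    """Apply the sizing formulas above to one workload category."""
    per_node = min(math.isqrt(total_containers), pod_limit)
    nodes = math.ceil(total_containers / per_node)
    # Overprovision factor: one failed node's containers spread over survivors.
    factor = per_node / (nodes - planned_unavailable)
    node_cpu = max_cpu * per_node + max_cpu * factor + sys_cpu
    node_mem = max_mem_mb * per_node + max_mem_mb * factor + sys_mem_mb
    return nodes, per_node, node_cpu, node_mem
```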

Example

To understand how it works, let's look at an example.

Let's say we need two node pools:

  1. Microservices: 200 microservices with a maximum resource requirement of 0.1 core and 100MB RAM per container
  2. Databases: 20 PostgreSQL databases with a maximum resource requirement of 2 cores and 4GB RAM per container

Assume that the Kubernetes system components use 0.5 cores and 0.5GB of RAM per node, and that we plan for one node to fail at a time.

For the microservices node pool:

Nearest lower perfect square number = 196

Number of containers per node = sqrt(196) = 14

Number of nodes = 200 / 14 = 14.28, rounded up to 15

Max planned unavailable nodes = 1

Overprovision factor = 14 * (0.1 core + 100MB RAM) / (15 - 1) = 0.1 core + 100MB RAM

Node capacity = (0.1 core + 100MB RAM) * 14 + 0.1 core + 100MB RAM + 0.5 cores + 500MB RAM = 2 cores + 2GB RAM

For the database pool:

Nearest lower perfect square number = 16

Number of containers per node = sqrt(16) = 4

Number of nodes = 20 / 4 = 5

Max planned unavailable nodes = 1

Overprovision factor = 4 * (2 cores + 4GB RAM) / (5 - 1) = 2 cores + 4GB RAM

Node capacity = (2 cores + 4GB RAM) * 4 + 2 cores + 4GB RAM + 0.5 cores + 0.5GB RAM = 10.5 cores + 20.5GB RAM

Therefore, the microservices pool needs 15 worker nodes with 2 cores and 2GB RAM each, and the database pool needs five worker nodes with 10.5 cores and 20.5GB RAM each. For simplicity, you can round up to the next larger machine type your provider offers.
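
Running the size_node_pool sketch from earlier on these two pools reproduces the hand-computed figures:

```python
print(size_node_pool(200, max_cpu=0.1, max_mem_mb=100))
# (15, 14, 2.0, 2000.0)  -> 15 nodes of 2 cores / 2GB RAM each

print(size_node_pool(20, max_cpu=2, max_mem_mb=4000))
# (5, 4, 10.5, 20500.0)  -> 5 nodes of 10.5 cores / 20.5GB RAM each
```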

Conclusion

Designing a Kubernetes cluster is more of an art than an exact science. The above are helpful guidelines, but they are still only guidelines, and the best way to learn is to provision your cluster and optimise it as you learn more.

Thanks for reading! I hope you enjoyed the article!

Translated from: https://medium.com/better-programming/tips-for-rightsizing-your-kubernetes-cluster-e0a8f1093d8d
