How we upgrade Kubernetes on GKE

Gojek Long Reads

If you’re running Kubernetes on GKE, chances are there is already some form of upgrade process in place for your clusters. Given that the Kubernetes release cycle is quarterly, there is a minor version bump upstream every quarter. This is certainly a high velocity for version releases.

The focus of this blog is how one can attempt to keep up with this release cycle.

Although quite a few things are GKE specific, there are a lot of elements which apply to any Kubernetes cluster in general, irrespective of whether it is self-hosted or managed.

Let’s quickly set context on what exactly a Kubernetes cluster is.

Components of a Kubernetes cluster

Any Kubernetes cluster consists of master and worker nodes. These two sets of nodes run different kinds of workloads.

The master nodes in GKE are managed by Google Cloud itself. So, what does that entail?

Components like api-server, controller-manager, Etcd, scheduler, etc., needn’t be managed by you in this case. The operational burden just got smaller!

[Image: Kubernetes cluster components (source: kubernetes.io)]

Here’s a quick summary of the above image:

Scheduler: Schedules your pods to nodes

Controller manager: Runs a set of controllers which watch the current state of the cluster and reconcile it with the desired state stored in Etcd

Api-server: The entry point to the cluster; this is where every component comes to interact with the others.

How we create a cluster

We use Terraform along with GitOps to manage the state of everything related to GCP. I’ve also heard good things about Pulumi, which could be a feasible choice. But, always remember:

The power of being able to declaratively configure the state of your infrastructure cannot be overstated.

We have a bunch of cluster creation modules inside our private Terraform repository. This makes the creation of our GKE clusters so simple that it’s literally just a call to the module, along with some defaults and custom arguments. The custom arguments vary per cluster. After a git commit and push, the next thing one sees is the terraform plan, right in the comfort of the CI. If everything looks good, the following step is a terraform apply in the same pipeline stage.
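
As a rough illustration, the CI stage boils down to the standard Terraform workflow. The directory layout and flags below are hypothetical; they just sketch the plan-then-apply flow described above.

```bash
#!/usr/bin/env bash
# Minimal sketch of the CI stage: plan first, apply only after review.
set -euo pipefail

# Hypothetical path: a directory that simply calls our private cluster module
cd clusters/my-gke-cluster

terraform init -input=false
terraform plan -input=false -out=tfplan   # the plan output is reviewed in CI

# Run in the same pipeline stage once the plan looks good
terraform apply -input=false tfplan
```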

With that context on how we manage the Terraform state of the cluster, let’s move on to a few defaults which we’ve set.

By default, one should always choose regional clusters. The advantage of this is that GKE will maintain replicas of the control plane across zones, which makes the control plane resilient to zonal failures. Since the api-server is the entry point for all communication and interaction, when it goes down you effectively lose control of (and access to) the cluster. That said, the workloads will continue to run unaffected (as long as they don’t depend on the api-server or the k8s control plane.)
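
For reference, a regional cluster is just a matter of passing a region instead of a zone at creation time. The cluster name, region, and node count below are placeholders, not our actual setup.

```bash
# A regional cluster keeps control-plane replicas in each zone of the region.
gcloud container clusters create my-cluster \
  --region asia-southeast1 \
  --num-nodes 1   # per zone, so 3 nodes in a 3-zone region
```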

If your cluster is not a regional cluster, components which depend on the api-server, like Istio, the Prometheus operator, or good old kubectl, may momentarily stop functioning while the control plane is being upgraded.

In the case of regional clusters, though, I haven’t personally seen any service degradation, downtime, or latency increase while the master upgrades itself.

Master upgrades come before upgrading anything

This is because the control plane needs to be upgraded first and then the rest of the worker nodes.

When the master nodes are being upgraded (you will not see these nodes in GKE, but they are running somewhere as VMs/Borg pods or whatever abstraction Google is using), the workloads running on them, i.e. the controller-manager, scheduler, Etcd, and the api-server, are the components which get upgraded to the version of k8s you set.

Once the master upgrades are done, we move to the worker node upgrades. The process of upgrading the master nodes is quite opaque in nature, as GKE manages the upgrade for you rather than the cluster operator, which doesn’t give you a lot of visibility into what exactly is happening. Nevertheless, if you want to learn what happens inside, you can try typhoon and upgrade the control plane of a cluster brought up with it. I used to live-upgrade the control plane of a self-hosted k8s cluster. You can check out more about this here: DevOps Days 2018 India Talk.

GKE cluster master is upgraded. What next?

The next obvious thing after the GKE master node upgrade is to upgrade the worker nodes. In the case of GKE, you have node pools, which in turn manage the nodes belonging to them.

Why different node pools, you ask?

One can use separate node pools to run different kinds of nodes, which can then be used to segregate the workloads running on them. For example, one node pool can be tainted to run only Prometheus pods, and the Prometheus deployment object can then tolerate that taint to get scheduled on those nodes.
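
A minimal sketch of that pattern follows. The pool name, taint key/value, labels, and image are illustrative, not what we actually run.

```bash
# A dedicated node pool, tainted so only workloads that tolerate it can land there.
gcloud container node-pools create prometheus-pool \
  --cluster my-cluster --region asia-southeast1 \
  --node-taints dedicated=prometheus:NoSchedule \
  --node-labels dedicated=prometheus

# The Prometheus deployment tolerates the taint and pins itself to the pool.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels: {app: prometheus}
  template:
    metadata:
      labels: {app: prometheus}
    spec:
      nodeSelector:
        dedicated: prometheus
      tolerations:
      - key: dedicated
        operator: Equal
        value: prometheus
        effect: NoSchedule
      containers:
      - name: prometheus
        image: prom/prometheus:v2.19.0
EOF
```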

What do the worker nodes consist of?

This is the part of the compute infra which you interact with directly if you are on GKE. These are the node pools where your workloads run.

The components which make up the worker nodes (excluding your workloads) are:

  • Kube-proxy
  • Container-runtime (docker, for example)
  • Kubelet

At a high level, kube-proxy is responsible for translating your service’s clusterIP to podIPs, and for handling nodePorts.

Kubelet is the process which actually listens to the api-server for instructions to schedule/delete pods on the node it is running on. These instructions are in turn translated to the API calls which the container runtime (e.g. docker, podman) understands.

These 3 components are managed by GKE, and whenever the nodes are being upgraded, kube-proxy and kubelet get upgraded.

The container runtime need not receive an update while you upgrade. GKE has its own mechanism of changing the image versions of the control plane pods to do that.

We haven’t seen downtime or service degradation happen due to these components getting upgraded on the cluster.

One interesting thing to note here is that the worker nodes can run a few versions behind the master nodes. The exact versions can be tested out on staging clusters, just to have more confidence while doing the production upgrade. I’ve observed that if the master is on 1.13.x, the nodes run just fine even on 1.11.x. At most a two minor version skew is recommended.
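
A quick way to eyeball that skew, before and after an upgrade, is to compare the server version with the kubelet version reported per node:

```bash
# Control-plane (server) version vs node kubelet versions.
kubectl version --short   # client and server versions
kubectl get nodes         # the VERSION column shows each node's kubelet version
```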

What to check while upgrading to a certain version?

Since the major release cycle for Kubernetes is quarterly, one thing operators absolutely must check is the release notes and the changelog for each version bump, as these usually entail quite a few API removals and major changes.

What happens if the cluster is regional while upgrading it?

If the cluster is regional, the node upgrade happens zone by zone. You can control the number of nodes which can be upgraded at once using the surge configuration for the node pool. Turning off autoscaling for the node pool is also recommended during the node upgrade.
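
The surge and autoscaling settings are both per node pool. The commands below are a sketch with placeholder names; the exact flags may differ slightly across gcloud versions.

```bash
# Control how many nodes are replaced at once during the upgrade.
gcloud container node-pools update default-pool \
  --cluster my-cluster --region asia-southeast1 \
  --max-surge-upgrade 1 --max-unavailable-upgrade 0

# Turn autoscaling off for the duration of the upgrade (re-enable afterwards).
gcloud container clusters update my-cluster \
  --region asia-southeast1 \
  --no-enable-autoscaling --node-pool default-pool
```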

If surge upgrades are enabled, a surge node with the upgraded version is created, and GKE waits until the kubelet registers itself with the api-server and reports the node as healthy, at which point the node is marked ready. From then on, the api-server can direct the kubelet running on the surge node to schedule workload pods.

In the case of a regional cluster, a node from the same zone is then picked, cordoned, and drained, and its workloads are rescheduled. At the end of this, the node gets deleted and removed from the node pool.

Release channels

Setting a release channel is highly recommended. We set it to stable for our production clusters, and the same for our integration clusters. With that set, the nodes will always run the same version of Kubernetes as the master nodes (excluding the small window of time when the master is getting upgraded.)

There are 3 release channels, depending on how fast you want to keep up with the kubernetes versions released upstream:

  • Rapid
  • Regular (default)
  • Stable

Setting maintenance windows will allow one to control when these upgrade operations are supposed to kick in. Once set, the cluster cannot be upgraded/downgraded manually. Hence, if one really needs that kind of granular control, this option cannot be chosen.
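
Both settings are cluster-level updates. The commands below are placeholders to show the shape of it; in particular, availability of the release-channel flag and the exact maintenance-window syntax depend on your gcloud version.

```bash
# Enrol the cluster in the stable release channel (what we use for production).
gcloud container clusters update my-cluster \
  --region asia-southeast1 \
  --release-channel stable

# A recurring maintenance window, so upgrades only kick in during quiet hours.
gcloud container clusters update my-cluster \
  --region asia-southeast1 \
  --maintenance-window-start 2020-01-01T17:00:00Z \
  --maintenance-window-end 2020-01-01T21:00:00Z \
  --maintenance-window-recurrence 'FREQ=WEEKLY;BYDAY=SA,SU'
```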

Since I haven’t personally downgraded a master version, I suggest you try this out on a staging cluster if you really need to. That said, if you look at the docs, downgrading the master is not really recommended.

Downgrading a node pool version is not possible, but a new node pool can always be created with the said version of kubernetes and the older node pool can be deleted.

Networking gotchas while upgrading to version 1.14.x or above

If you are running a version lower than 1.14.x and don’t have the ip-masq-agent running, and your destination address falls under the CIDRs 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16, the packets in the egress traffic get masqueraded, which means the destination sees the node IP in this case.

The default behaviour from 1.14.x onwards (and on COS) is that packets flowing from the pods stop getting NAT’d. This can cause disruption, as you might not have whitelisted the pod address range at the destination.

One way to handle this is to deploy the ip-masq-agent with a nonMasqueradeCIDRs config listing the destination CIDRs, like 10.0.0.0/8 (for example, if that is where a destination component like Postgres lives). In this case, when the destination (Postgres) receives the traffic, the packets will carry the podIP as the source address and not the nodeIP, and you can whitelist the pod range explicitly.
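
The agent reads its non-masquerade list from a ConfigMap in kube-system. The sketch below assumes the ip-masq-agent DaemonSet is running on the cluster; the CIDRs are illustrative and should match wherever your destinations actually live.

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.0.0.0/8
      - 172.16.0.0/12
      - 192.168.0.0/16
    resyncInterval: 60s
EOF
```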

Can multiple node pools be upgraded at once?

No, you can’t. GKE doesn’t allow this.

Even when you’re upgrading one node pool, which node gets picked and upgraded next is not something you have control over.

Would there be downtime for the services when we do an upgrade?

Let’s start with the master component. If you have a regional cluster, since the upgrade happens zone by zone, your service will not get affected even if it is making use of the k8s api-server to do something. You could try replicating the same on the staging setup first, assuming both have a similar config.

How to prevent/minimise downtime for the services deployed?

For stateless applications, the simplest thing to do is to increase the replicas to reflect the number of zones in which your nodes are present. However, spreading the pods across zones is not guaranteed: Kubernetes doesn’t handle this by default, but it gives you the primitives to handle this case.

If you want to distribute pods across zones, you can apply podAntiAffinity in the deployment spec for your service, with the topologyKey set to failure-domain.beta.kubernetes.io/zone, for the scheduler to try scheduling the pods across zones. Here’s a more detailed blog on the scheduling rules you can specify.
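
A sketch of what that looks like in a deployment spec is below. The names and image are placeholders; preferredDuringScheduling keeps the spreading best-effort rather than a hard requirement.

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3
  selector:
    matchLabels: {app: my-service}
  template:
    metadata:
      labels: {app: my-service}
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels: {app: my-service}
              topologyKey: failure-domain.beta.kubernetes.io/zone
      containers:
      - name: app
        image: gcr.io/my-project/my-service:latest
EOF
```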

Distribution across zones will make the service resilient to zonal failures.

The reason we increase the replicas to more than 1 is that when a node gets upgraded, it is cordoned and drained, and the pods get bumped off that node.

If a service has only 1 replica and it is scheduled on the node GKE has picked for the upgrade, then while the scheduler finds the pod a new node, there is no other pod serving requests. In this case, there would be a temporary downtime.

One thing to note here is that if you’re using a PodDisruptionBudget (PDB), and the number of running replicas is the same as the minAvailable specified in the PDB rule, the upgrade will just not proceed. This is because the drain respects the PDB and therefore cannot evict the pod(s). Hence the solutions are (a sketch of the situation follows the list):

  • To increase the pods such that the running pods are > minAvailable specified in the PDB
  • To remove the PDB rule specified
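
To make the blocking situation concrete: with a hypothetical deployment of 2 replicas and the PDB below set to minAvailable: 2, the drain cannot evict either pod and the node upgrade stalls until the replicas are raised (or the PDB is relaxed).

```bash
kubectl apply -f - <<'EOF'
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: my-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: my-service}
EOF

# Bumping the deployment above minAvailable unblocks the drain.
kubectl scale deployment my-service --replicas=3
```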

For statefulSets, a small downtime might have to be taken while upgrading. This is because when the pods of the stateful set get bumped off a node, the PVC claim is made again by the replacement pod once it gets scheduled on another node.

These upgrade steps may seem mundane.

Agreed, they are mundane. But there’s nothing stopping anyone from having a tool do these things. Compared to eks-rolling-update, GKE is way easier, with fewer touch points and fewer cases where things can go wrong:

  • The PDB budget is a hurdle for the upgrade if you don’t pay attention
  • Replicas are set to 1 or so for some services
  • Quite a few replicas could be in pending or crashloopbackoff
  • Statefulsets are an exception and need hand-holding

For most of the above, the initial step is to follow a fixed process (playbook) and run through it for each cluster during an upgrade. Even though the task is mundane, one would know which checks to follow and what to do to check the sanity of the cluster after the upgrade is done.

Setting replicas to 1 is just plain naive. Let the deployment tool default to a minimum of 3 replicas (one per zone in a 3-zone region, assuming you have podAntiAffinity set and the scheduler makes a best-effort attempt to spread them).

For the pods in pending state:

  • You are either trying to request CPU/memory which is not available on any node in the node pools, which means you’re not sizing your pods correctly
  • Or there are a few deployments which are hogging resources

Either way, it’s a smell that you do not have enough visibility into your cluster. For statefulsets, you may not be able to prevent downtime.

After all the upgrades are done, one can backfill the upgraded version numbers and other things back to the terraform config in the git repo.

Once you have repeated the steps above, you can start automating a few things.

What we have automated

We have automated the whole analysis of what pods are running in the cluster. We extract this information into an Excel sheet (a sketch of such a script follows the list):

  • Replicas of the pods
  • Age of the pods
  • Status of the pods
  • Which pods are in pending/crashloopbackoff
  • Node cpu/memory utilisation
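
Our actual script also talks to our service registry and writes out the sheet, which is specific to our setup; the kubectl side of it is roughly the following (the context name is a placeholder, and kubectl top needs metrics to be available, which GKE provides).

```bash
# Point kubectl at the cluster being analysed.
kubectl config use-context my-production-cluster

# Name, status, and start time of every pod (replica counts come from the deployments).
kubectl get pods --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,STARTED:.status.startTime'

# Pods stuck in Pending or CrashLoopBackOff.
kubectl get pods --all-namespaces | grep -E 'Pending|CrashLoopBackOff'

# Node CPU/memory utilisation.
kubectl top nodes
```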

The same script handles inserting the team ownership details of the service, by querying our service registry and storing that info.

So, all of the above details are at your fingertips, just by switching the kubectl context to your cluster and running the script from your command line.

As of now, the below operations are being done via the CLI (a sketch of the corresponding commands follows the list):

  • Upgrading the master nodes to a certain version
  • Disabling surge upgrades/autoscaling for the nodes
  • Upgrading the node pool(s)
  • Re-enabling surge upgrades/autoscaling
  • Setting the maintenance window and release channel, if not already set
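
The core upgrade commands look roughly like this (cluster name, region, and version are placeholders; the surge, autoscaling, release-channel, and maintenance-window commands are sketched earlier in the post).

```bash
# 1. Upgrade the control plane (master) to the target version.
gcloud container clusters upgrade my-cluster \
  --region asia-southeast1 \
  --master --cluster-version 1.16.13-gke.401

# 2. Then upgrade each node pool, one at a time.
gcloud container clusters upgrade my-cluster \
  --region asia-southeast1 \
  --node-pool default-pool --cluster-version 1.16.13-gke.401
```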

The next step would be to automate the sequence in which these operations are done, and codify the learnings and edge cases into the tool.

Although this is a bit tedious, the laundry has to be done. There’s no running away from it. 🤷‍♂️

Until we reach a point where the whole (or a major chunk) of this process is automated, our team will rotate people through cluster upgrades. One person gets added to the roster, while another who has been on the roster since the previous week drives the upgrade for that week, giving context to the person who has just joined.

This helps in quick context sharing. The person who has just joined gets to upgrade the clusters by following the playbooks, thereby filling the gaps as we go forward.

The interesting part is that we always emerge out of the week with something improved, some automation implemented, or some docs added, all while also explicitly allocating dev time for automation in the sprint.

Ending notes

GKE has a stable base which allows us to focus more on building the platform on top of it, rather than managing the underlying system. This improves developer productivity, as we can build out tooling on top of the primitives k8s gives you.

If you compare this to something like running your own k8s cluster on top of VMs, there is a massive overhead in managing/upgrading/replacing the components and nodes of your self-managed cluster, which in itself requires dedicated folks to hand-hold the cluster at times.

So if you really have the liberty, a managed solution is the way to go. As someone who has managed production self-hosted k8s clusters, I’d say it’s definitely not easy, and if possible, it should be delegated so the focus can be on other problems.

Originally published at: https://blog.gojekengineering.com/how-we-upgrade-kubernetes-on-gke-91812978a055
