Give your VMs a steady pulse with autohealing and autoupdates

Latest recommended article published on 2024-06-10 22:18:41

郝ren


Original article: https://medium.com/google-cloud/give-your-vms-a-steady-pulse-with-autohealing-and-autoupdates-ae2c0828ecc9


Season of Scale

“Season of Scale” is a blog and video series that helps enterprises and developers build scale and resilience into their design patterns. In this series, we walk you through patterns and practices for creating apps that are resilient and scalable, two essential goals of many modern architecture exercises.

In Season 1, we’re covering Infrastructure Automation and High Availability.

In this article, I’ll walk you through how to use health checks, autohealing, and autoupdates to maintain high availability (HA) for Compute Engine instances on GCP.


Review

So far, we have looked at how Critter Junction launched and globally scaled their gaming app on Compute Engine. As their daily active users grew, we helped them set up autoscaling and global load balancing to handle globally distributed and constantly rising traffic. Today, let’s learn how they can make this social critter app more resilient by gracefully replacing failed instances.

A gaming nightmare

To keep their users from risking their daily game streaks, Critter Junction needs to make sure their app is available all the time, without interruptions.

One way to do that is to set up high availability (HA) at all layers of the stack. Though that can mean distributed databases, networks, and application servers, we’re focusing on their game servers running on Compute Engine.

We know that managed instance groups provide features such as autoscaling, regional (multi-zone) deployments, autohealing, and autoupdating. Two of these, autohealing and autoupdates, can be added to an existing Compute Engine configuration.

Autohealing proactively identifies unhealthy instances (those that are not responding) and replaces them with healthy ones.

Autoupdates update instances without disrupting the service.

Autohealing

Let’s focus on autohealing for a bit.

The first step is to create a health check, which detects not only whether the machine is running but also application-specific issues such as freezing, crashing, or overloading. If an instance is deemed unhealthy, the managed instance group recreates it.

We’re building on the instance configuration we created in the previous article.

First, create a health check in Compute Engine and give it a name.
Set the protocol to HTTP.
You can set the health check on any path; here, we use /health.

In our demo app, we added code so that /health returns a 200 OK response when the app is healthy and an HTTP 500 Internal Server Error when it is unhealthy.
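The demo app's code is not shown in this article, so the following is only a minimal sketch of what such an endpoint could look like, using Python's standard library. The names app_healthy, health_status, HealthHandler, and serve, as well as port 8080, are assumptions made for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical application state; a real app would update this from its own checks.
app_healthy = threading.Event()
app_healthy.set()


def health_status():
    """Return the (status code, body) pair that /health should answer with."""
    if app_healthy.is_set():
        return 200, b"OK"
    return 500, b"Internal Server Error"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /health is handled in this sketch; everything else is a 404.
        if self.path == "/health":
            status, body = health_status()
        else:
            status, body = 404, b"Not Found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep periodic probe requests out of the console


def serve(port=8080):
    """Serve until interrupted; the port is an arbitrary choice for this sketch."""
    ThreadingHTTPServer(("", port), HealthHandler).serve_forever()
```

Calling serve() starts the server. Clearing app_healthy makes /health answer with 500, which is one way to simulate an unhealthy app.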

Set up the health criteria

Set Check interval to 10, which means the service is probed for health every 10 seconds.
Set Timeout to 5, which means we wait at most 5 seconds for a response to a probe.
Set Healthy threshold to 2, which is the number of consecutive probes that must succeed for the instance to be considered healthy.
Finally, set Unhealthy threshold to 3, which is the number of consecutive probes that must fail for the instance to be considered unhealthy.
Then click Create.

As a best practice, keep the health check conservative so that you don’t preemptively delete and recreate instances.
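To see how the two thresholds interact, here is a simplified model of the health state, not Compute Engine's actual implementation. It replays a sequence of probe results and flips the state only after enough consecutive contradicting probes. HealthCheckConfig, evaluate, and approx_detection_seconds are hypothetical names, and the detection estimate ignores probe scheduling and timeout details.

```python
from dataclasses import dataclass


@dataclass
class HealthCheckConfig:
    check_interval_s: int = 10    # seconds between probes
    timeout_s: int = 5            # seconds to wait for each probe response
    healthy_threshold: int = 2    # consecutive successes needed to become healthy
    unhealthy_threshold: int = 3  # consecutive failures needed to become unhealthy


def evaluate(probes, config, initially_healthy=True):
    """Replay probe results (True = success) and return the state after each probe."""
    healthy = initially_healthy
    streak = 0  # consecutive probes that contradict the current state
    states = []
    for ok in probes:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            needed = config.healthy_threshold if ok else config.unhealthy_threshold
            if streak >= needed:
                healthy = ok
                streak = 0
        states.append(healthy)
    return states


def approx_detection_seconds(config):
    """Rough time for a failing instance to be marked unhealthy in this model."""
    return config.unhealthy_threshold * config.check_interval_s
```

With the values above, a single failed probe does not change the state; it takes three failures in a row, or roughly 30 seconds in this model, before the instance is treated as unhealthy.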

Add a health check to an existing instance group

Now, let’s go to the instance group we created in the previous article and add a health check to it.

Select the health check and set the initial delay to 90 seconds.

Ideally, this initial delay should be long enough for the instance to be fully running and ready to respond as healthy.
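The role of the initial delay can be sketched as a small decision function. This is an illustrative model under the assumption that failed checks are ignored while a new instance is still within its initial delay; autohealing_action and its return values are hypothetical names, not part of any Google Cloud API.

```python
def autohealing_action(seconds_since_boot, is_healthy, initial_delay_s=90):
    """Decide what a simplified autohealer would do with one instance."""
    if is_healthy:
        return "none"      # healthy instances are left alone
    if seconds_since_boot < initial_delay_s:
        return "wait"      # still booting: ignore the failed health check
    return "recreate"      # unhealthy after the initial delay has passed
```

If the delay is shorter than the app's real startup time, a model like this would recreate instances that were simply still booting, which is why the delay should err on the long side.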

Simulate failures

Let’s have some fun with this and simulate failures now.

To do that, go to the VM instance, click its external IP, and make the app unhealthy.
Wait for the autohealer to take action. The green checkmark next to the instance turns into a spinner, indicating that the autohealer has started rebooting that instance.

What about when you update an instance?

Another concern when it comes to HA is applying updates to instances without impacting the service. Managed instance groups let you control the speed and scope of the update rollout to minimize disruptions to your application. You can also perform partial rollouts, which allows for canary testing.

Let’s see that in action now!

On the instance group, click the Rolling update button. “Rolling” means the update is applied gradually.
Add a second template for canary testing and set its target size to 20%.

This means 20% of the instances in the group are created from the new template for canary testing, so roughly 20% of evenly balanced traffic reaches the new version.
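As a quick arithmetic check, the sketch below works out how many instances a percentage target size corresponds to. canary_split is a hypothetical helper, and it deliberately rejects fractional results because how Compute Engine rounds them is not covered here.

```python
from fractions import Fraction


def canary_split(group_size, canary_percent):
    """Return (canary instances, instances on the current template)."""
    canary = Fraction(group_size * canary_percent, 100)
    if canary.denominator != 1:
        # Rounding behavior is not modeled in this sketch.
        raise ValueError("choose values that give a whole number of instances")
    return int(canary), group_size - int(canary)
```

For example, a group of 10 instances with a 20% canary target runs 2 instances on the new template and 8 on the current one.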

The update mode is proactive by default, which means Compute Engine actively schedules actions to apply the requested updates to instances as necessary. In many cases, this means deleting and recreating instances proactively.

You can choose an opportunistic update if a proactive update is potentially too disruptive. An opportunistic update is applied only when you manually initiate the update on selected instances or when the managed instance group creates new instances.
Max surge is the number of extra instances you are willing to spin up as part of the update. A higher value speeds up the update but costs more for the additional instances, so you face a tradeoff between cost and speed.
Max unavailable and min wait time: keep both at zero for this demo. These parameters control how disruptive the update is to your service and the rate at which the update is deployed.
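The cost and speed tradeoff can be made concrete with a rough model. The sketch below assumes each round of the rollout replaces up to max surge plus max unavailable instances, which is a simplification rather than Compute Engine's actual scheduling; plan_rolling_update and the keys it returns are hypothetical names.

```python
import math


def plan_rolling_update(group_size, max_surge, max_unavailable):
    """Estimate rollout rounds and capacity bounds under a simplified batch model."""
    if max_surge < 0 or max_unavailable < 0:
        raise ValueError("max_surge and max_unavailable must be non-negative")
    batch = max_surge + max_unavailable  # instances replaced per round in this model
    if batch == 0:
        raise ValueError("max_surge and max_unavailable cannot both be zero")
    return {
        "rounds": math.ceil(group_size / batch),
        "peak_instances": group_size + max_surge,                # what you pay for
        "min_serving_instances": group_size - max_unavailable,   # capacity floor
    }
```

With 10 instances and max unavailable kept at zero, a max surge of 1 takes 10 rounds and peaks at 11 instances, while a max surge of 5 takes 2 rounds but peaks at 15.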

And that’s it!

With our help setting up two high availability features within managed instance groups, Critter Junction now has a much more resilient architecture. Autohealing proactively identifies unhealthy instances and heals them, while autoupdates update instances without disrupting the service. Stay tuned to find out what’s in store next for Critter Junction.

And remember, always be architecting.

Next steps and references

Follow this blog series on Google Cloud Platform Medium.
Reference: Setting up health checking and autohealing.
Follow the Season of Scale video series and subscribe to the Google Cloud Platform YouTube channel.
Want more stories? Give me a shout on Medium and Twitter.
Enjoy the ride with us through this miniseries and learn more about Google Cloud solutions!

Translated from: https://medium.com/google-cloud/give-your-vms-a-steady-pulse-with-autohealing-and-autoupdates-ae2c0828ecc9
