RDS Scaling at Beat

At Beat, we run all of our Relational Database Workloads on AWS RDS Aurora. We currently operate around 60 Aurora Clusters spanning 3 AWS Regions that sum up to around 160 Database (DB) Instances. Managing them through Terraform and Terragrunt makes all our DB operations easier.

In this post, we will present:

  1. How we enabled Autoscaling for our Aurora Clusters to automatically add and remove DB instances in an Aurora cluster based on the average CPU Utilization

  2. How we made it trivial to scale up or down our Aurora clusters by using Blue-Green deployment.

COVID-19: The push-factor

The aforementioned DB count was much bigger before COVID-19 hit the globe: including all the markets we operate in Latin America, we had more than 200 DB instances across those 60 Aurora clusters. Fortunately, AWS Aurora can scale to big numbers, ranging from a mere writer instance with 1 vCPU and 1 GB RAM up to 1 writer and 15 reader replicas of 96 vCPUs and 768 GB RAM each, with 64 TiB of fast SSD cluster storage. At peak time, some markets ran 10+ DB instances in a cluster with the biggest instance type of 96 vCPUs and 768 GB RAM.

In response to the pandemic crisis, we undertook various tasks under the scope of cost optimization. One of those tasks was to enable Autoscaling on our Aurora clusters and scale them down, since there were curfews in our operating markets. Due to the nature of our Aurora cluster setup, which includes RDS Custom Endpoints, we can't enable the native Autoscaling mechanism. We use custom endpoints because we have DB instances that are utilized for different purposes, like Hadoop HDFS migration, and we don't want application traffic reaching those instances. You can read more about Cluster Endpoints in the AWS documentation.

A look into our Aurora Cluster Setup

A typical DB cluster at Beat looks like the following:

  • The main DB Cluster Writer instance

  • Various DB Reader instances specifically for our application

  • A dedicated DB Reader instance for HDFS Migration workloads

  • A couple of other dedicated DB Reader instances for tasks like Business Intelligence queries, which we don't want affecting our application.

[Image] A typical Aurora RDS Cluster at Beat

Two out of four DB reader instances are the application readers (mx-aurora-green-1 & mx-aurora-green-2), and the other two are dedicated to other tasks (mx-aurora-hdfs-migration & mx-aurora-slave-bi-1), regularly reaching over 90% CPU utilization during the day.

Using the default AWS Autoscaling mechanism would not work for our setup. AWS uses the Aurora Cluster Readers CPUUtilization CloudWatch metric to calculate the cluster's average CPU utilization across all the DB reader instances, which in our case would include the application readers, the dedicated HDFS migration reader, and the BI reader instances. This does not work for us because we need our Aurora clusters to scale out to meet our application's needs without being affected by the other instances in the cluster, which, in the above setup, account for 50% of the Aurora Cluster Readers CPUUtilization CloudWatch metric.

For example, if the two dedicated DB reader instances sit at 90% CPU utilization and the two application DB reader instances sit at 20%, the total average is 55% CPU utilization, which in our case would violate the 40% CPU utilization threshold that we have and start adding replicas to the cluster when they are not needed. So much for trying to optimize our RDS costs.
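A quick sketch of that arithmetic (plain Python, using the hypothetical utilization figures from the example above):

```python
def avg(values):
    return sum(values) / len(values)

# Hypothetical snapshot of the four reader instances from the example.
dedicated_readers = [90, 90]     # hdfs-migration and bi readers
application_readers = [20, 20]   # the readers we actually care about

# What the default mechanism sees vs. what actually matters to us.
cluster_avg = avg(dedicated_readers + application_readers)
app_avg = avg(application_readers)

print(cluster_avg)  # 55.0: breaches the 40% threshold, triggers scale-out
print(app_avg)      # 20.0: the application readers are actually idle
```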

We are looking for a Senior DevOps Engineer. Apply here.

In the pursuit of an Autoscaling mechanism

The reduced traffic we experienced in our systems due to the global pandemic crisis gave us the time and space to find an efficient solution for performing autoscaling tasks in our Aurora DB clusters. Since we can't use the AWS Autoscaling method out of the box, we had to look for a different, tailor-made way.

Another way, but not good enough

So, we started experimenting with the Log Groups produced by the Enhanced Monitoring enabled on our instances, together with CloudWatch Custom Metrics. We created a pattern that filters out the irrelevant RDS replicas and matches only our application reader replicas for a given Aurora cluster. Every minute, we match the pattern against our application reader replicas and add each replica's actual CPU utilization to our custom metric.

[Image] Matches every instance with a name like mx-aurora-*, but not mx-aurora-*hdfs* or mx-aurora-*slave*, and adds the instance CPUUtilization to the custom metric MXRestCPUUtilization
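For illustration, that filter logic can be approximated in Python (the real thing is a CloudWatch Logs metric filter pattern, not application code; the function name is ours):

```python
import fnmatch

def is_application_reader(instance_id: str) -> bool:
    """Illustrative version of our metric filter: match mx-aurora-*
    instances, but exclude the dedicated hdfs/slave readers so only
    application readers feed the MXRestCPUUtilization custom metric."""
    if not fnmatch.fnmatch(instance_id, "mx-aurora-*"):
        return False
    excluded = ("mx-aurora-*hdfs*", "mx-aurora-*slave*")
    return not any(fnmatch.fnmatch(instance_id, p) for p in excluded)
```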

Then, we use Terraform to create a custom scaling policy using the above custom metric, and voilà: our RDS clusters now target a specific value using the custom metric, always keeping our application reader replicas below 40% average CPU utilization by scaling out and in to meet this target value.
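For intuition, the proportional arithmetic behind target tracking can be sketched as follows (a simplified illustration, not the actual AWS implementation; the function, bounds, and defaults are ours):

```python
import math

def desired_reader_count(avg_cpu, current_readers, target_cpu=40.0,
                         min_readers=2, max_readers=15):
    """Scale the reader count roughly proportionally so the average
    CPU lands back around the target, clamped to sane bounds."""
    desired = math.ceil(current_readers * avg_cpu / target_cpu)
    return max(min_readers, min(max_readers, desired))
```

For instance, two readers averaging 80% CPU against a 40% target would suggest doubling to four readers.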

Similar to the default AWS Autoscaling mechanism, this is not a good option for us either. From the information in a log stream coming from an RDS instance, we can only get the DBIdentifier and the OS metrics, which means we won't be able to tell reader and writer instances apart. Also, the instances created by Autoscaling don't give us a way to tell which cluster they belong to, since they have a naming convention like “application-autoscaling-<UUID>”.

[Image] Two autoscaling activities, showing the addition and deletion of an RDS instance. Notice the UUID in the name of the RDS instance.

The right way

Although we wanted to use the Target Tracking Autoscaling Policy, which would allow us to always keep the cluster DB readers around a predefined average CPU utilization, that is not possible without building our own mechanism.

So, we went with the step scaling autoscaling policy using custom CloudWatch alarms. We use CloudWatch expressions to produce the CloudWatch alarm, which looks like this:

(sum_reader_replicas_cpu - sum_excluded_replicas_cpu) / (total_number_reader_replicas - total_number_excluded_replicas)

  • sum_reader_replicas_cpu is the SUM of the CPUUtilization metric with dimensions Role = READER and DBClusterIdentifier = <Aurora Cluster>

  • sum_excluded_replicas_cpu is the SUM of the CPUUtilization metric with dimension DBInstanceIdentifier = <Aurora Cluster Reader DB instance, ie. HDFS migration instance>, for the excluded replicas

  • total_number_reader_replicas is the SampleCount of the CPUUtilization metric with dimensions Role = READER and DBClusterIdentifier = <Aurora Cluster>

  • total_number_excluded_replicas is the number of excluded reader replicas, coming from our Terraform code

The above CloudWatch expressions allow us to create the CloudWatch alarm that makes our autoscaling possible using the Step Scaling Policy of AWS.
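The arithmetic behind the expression can be sketched in plain Python (the function name and sample values are illustrative; in production the math runs as a CloudWatch metric math expression inside the alarm):

```python
def application_readers_avg_cpu(reader_cpus, excluded_cpus):
    """Average CPU of the application readers only, computed the way
    our alarm expression does: subtract the excluded replicas' CPU sum
    and count from the cluster-wide READER-role totals."""
    sum_reader = sum(reader_cpus)        # SUM over Role = READER
    sum_excluded = sum(excluded_cpus)    # SUM over excluded DBInstanceIdentifiers
    total_readers = len(reader_cpus)     # SampleCount over Role = READER
    total_excluded = len(excluded_cpus)  # count from Terraform
    return (sum_reader - sum_excluded) / (total_readers - total_excluded)

# Four readers at [90, 90, 20, 20], two of them excluded at [90, 90]:
# (220 - 180) / (4 - 2) = 20.0, the true application-reader average.
print(application_readers_avg_cpu([90, 90, 20, 20], [90, 90]))
```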

[Image] Brown is the CPUUtilization from the Cluster Reader role, pink is the custom metric we create with patterns on CloudWatch Log Groups, and orange is the CloudWatch alarm expression.

In the image above you can see the difference between the metric that AWS uses with Autoscaling by default (1), and our CloudWatch Alarm Expression one (3). This happens because the HDFS migration reader instance is running the nightly migrations, while the application has almost no traffic.

The custom metric from Log Groups (2) and the alarm expression (3) are fairly close. However, one would see a bigger difference when the writer DB instance has high CPU utilization and when autoscaling instances are being added to an Aurora cluster.

Blue-Green for scale-up/down

Scaling our clusters up and down was a manual and error-prone task for the DevOps team. Moving our DB instances from one instance type to another required us to change the instances one by one: starting with the readers, then failing over the writer, and then scaling that instance as well. Now, it's a trivial task with Terragrunt and Blue-Green deployments.

Making scaling operations trivial

Blue-Green is a way to deploy a new version of your application, or in this case new instance types in our Aurora clusters, without downtime. While one color (blue) is active, one deploys the new version to the other color (green) and slowly drains the traffic from the old color (blue) by increasing the traffic to the new color (green).

This is the approach we took with our Aurora clusters. We need the same instance type for our writer and our application reader replicas, so we added this logic to our Terraform Aurora module, which allows us to scale our clusters up and down without downtime.

To achieve this, we added a new RDS custom cluster endpoint (blue-green) pointing to the application readers, whether blue or green, plus a custom endpoint for the Hadoop HDFS instance. At any given time there is one active group color (blue or green), and traffic is routed to that group via the blue-green cluster endpoint, except when we need to scale. When that time comes, we add instances in the inactive group color; these are automatically added to the blue-green cluster endpoint and start serving some of the traffic. Then we swap the groups by deactivating the old color and activating the new one, which triggers the removal of the old-color DB instances and a writer failover to a new-color DB instance (incurring around 30 seconds of downtime for the writer).
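The endpoint-membership logic described above can be sketched as follows (a hypothetical helper, not our actual Terraform code; instance names echo the earlier examples):

```python
def blue_green_endpoint_members(readers_by_color, active_color, scaling=False):
    """Static members of the blue-green custom cluster endpoint.
    Normally only the active color's readers are members; during a
    scaling operation the inactive color's new readers join as well,
    so they absorb traffic before the colors are swapped."""
    inactive = "green" if active_color == "blue" else "blue"
    members = list(readers_by_color[active_color])
    if scaling:
        members += readers_by_color[inactive]
    return sorted(members)

readers = {
    "blue": ["mx-aurora-blue-1", "mx-aurora-blue-2"],
    "green": ["mx-aurora-green-1", "mx-aurora-green-2"],
}
# Steady state: only the active (green) readers serve traffic.
print(blue_green_endpoint_members(readers, "green"))
# Mid-scale: both colors serve traffic until the swap completes.
print(blue_green_endpoint_members(readers, "green", scaling=True))
```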

Where next?

Let’s recap our achievements:

  • We made it possible to use AWS Autoscaling Mechanism by tweaking the custom expressions that CloudWatch alarms use to trigger the Autoscaling, leaving room for our on-call Engineers to enjoy a good night’s sleep

  • Blue-Green deployments on Aurora Clusters made it effortless to change our clusters’ instance types.

Of course, there is still room for improvement on our Aurora clusters when it comes to Autoscaling. As a next step, we want to try swapping the StepScalingPolicy for the TargetTrackingPolicy, which is more dynamic. We also look forward to PredictiveScaling becoming available for RDS. This would allow RDS to use historical data and add instances to the cluster moments before the traffic kicks in.

Read more articles by the Beat DevOps team.

To join us on the ride, check out all our open positions and apply.

About Alexis

Alexis Polyzos is a DevOps Engineer and a member of the DevOps Group at Beat. He is passionate about new technologies and how they can be used to build reliable, scalable, and highly available systems.

Translated from: https://build.thebeat.co/rds-scaling-at-beat-e0d58cd7f1df
