Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region (Translated)

Original link -> https://aws.amazon.com/message/11201/

Summary

The number of threads on the Kinesis front-end servers exceeded an operating-system limit, which prevented a critical cache (the shard-map) from being built. Because of how the front end is implemented, the fleet could not simply be restarted quickly, and the post lays out short- and medium-term remediations. It then explains how the Kinesis outage affected Cognito and CloudWatch, and how CloudWatch in turn indirectly affected reactive auto scaling, Lambda, ECS, EKS, and the Service Health Dashboard.

Full text

We wanted to provide you with some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on November 25th, 2020.

Amazon Kinesis enables real-time processing of streaming data. In addition to its direct use by customers, Kinesis is used by several other AWS services. These services also saw impact during the event. The trigger, though not root cause, for the event was a relatively small addition of capacity that began to be added to the service at 2:44 AM PST, finishing at 3:47 AM PST. Kinesis has a large number of “back-end” cell-clusters that process streams. These are the workhorses in Kinesis, providing distribution, access, and scalability for stream processing. Streams are spread across the back-end through a sharding mechanism owned by a “front-end” fleet of servers. A back-end cluster owns many shards and provides a consistent scaling unit and fault-isolation. The front-end’s job is small but important. It handles authentication, throttling, and request-routing to the correct stream-shards on the back-end clusters.
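
To make the front-end's routing job concrete, here is a minimal sketch in Python of how a partition key can be mapped to a shard. The three-entry shard table and the shard IDs are invented for illustration; only the hashing scheme follows the documented Kinesis behavior, where the partition key is MD5-hashed to a 128-bit integer and matched against each shard's hash-key range.

```python
import hashlib

# Hypothetical shard table: each shard owns a contiguous slice of the 128-bit
# MD5 hash-key space. In the real service this mapping lives in the shard-map
# cache maintained by the front-end fleet.
SHARD_RANGES = [
    # (starting_hash_key, ending_hash_key, shard_id)
    (0,       2**126 - 1, "shardId-000000000000"),
    (2**126,  2**127 - 1, "shardId-000000000001"),
    (2**127,  2**128 - 1, "shardId-000000000002"),
]

def shard_for_partition_key(partition_key: str) -> str:
    """Map a record's partition key to the shard whose hash-key range contains it."""
    hash_key = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    for start, end, shard_id in SHARD_RANGES:
        if start <= hash_key <= end:
            return shard_id
    raise LookupError("no shard owns this hash key")

print(shard_for_partition_key("user-1234"))
```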

The capacity addition was being made to the front-end fleet. Each server in the front-end fleet maintains a cache of information, including membership details and shard ownership for the back-end clusters, called a shard-map. This information is obtained through calls to a microservice vending the membership information, retrieval of configuration information from DynamoDB, and continuous processing of messages from other Kinesis front-end servers. For the latter communication, each front-end server creates operating system threads for each of the other servers in the front-end fleet. Upon any addition of capacity, the servers that are already operating members of the fleet will learn of new servers joining and establish the appropriate threads. It takes up to an hour for any existing front-end fleet member to learn of new participants.
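
The thread-per-peer pattern described above is the detail that matters later in the post, so here is a toy Python model of it (the class, server names, and message loop are all invented, not Kinesis code). The takeaway is that each front-end server's thread count grows with fleet size, so any capacity addition raises the thread count on every server that is already running.

```python
import threading

class FrontEndServer:
    """Toy model of the thread-per-peer pattern; all names are invented."""

    def __init__(self, name: str):
        self.name = name
        self.peer_threads = {}  # peer name -> threading.Thread

    def on_peer_discovered(self, peer: str) -> None:
        # Invoked when this server learns a new fleet member has joined
        # (in the real system this knowledge propagates over up to an hour).
        if peer == self.name or peer in self.peer_threads:
            return
        t = threading.Thread(target=self._process_peer_messages, args=(peer,), daemon=True)
        self.peer_threads[peer] = t
        t.start()

    def _process_peer_messages(self, peer: str) -> None:
        # Placeholder for continuously processing shard-map messages from one peer.
        pass

server = FrontEndServer("fe-0001")
for peer in (f"fe-{i:04d}" for i in range(1, 501)):   # a 500-server fleet
    server.on_peer_discovered(peer)
print(len(server.peer_threads), "peer threads on one server")   # fleet size - 1
```

With this shape, fleet-wide thread usage grows roughly quadratically (every server holds a thread for every other server), which is why the remediation section below focuses on shrinking the fleet and cellularizing it.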

At 5:15 AM PST, the first alarms began firing for errors on putting and getting Kinesis records. Teams engaged and began reviewing logs. While the new capacity was a suspect, there were a number of errors that were unrelated to the new capacity and would likely persist even if the capacity were to be removed. Still, as a precaution, we began removing the new capacity while researching the other errors. The diagnosis work was slowed by the variety of errors observed. We were seeing errors in all aspects of the various calls being made by existing and new members of the front-end fleet, exacerbating our ability to separate side-effects from the root cause. At 7:51 AM PST, we had narrowed the root cause to a couple of candidates and determined that any of the most likely sources of the problem would require a full restart of the front-end fleet, which the Kinesis team knew would be a long and careful process. The resources within a front-end server that are used to populate the shard-map compete with the resources that are used to process incoming requests. So, bringing front-end servers back online too quickly would create contention between these two needs and result in very few resources being available to handle incoming requests, leading to increased errors and request latencies. As a result, these slow front-end servers could be deemed unhealthy and removed from the fleet, which in turn, would set back the recovery process. All of the candidate solutions involved changing every front-end server’s configuration and restarting it. While the leading candidate (an issue that seemed to be creating memory pressure) looked promising, if we were wrong, we would double the recovery time as we would need to apply a second fix and restart again. To speed restart, in parallel with our investigation, we began adding a configuration to the front-end servers to obtain data directly from the authoritative metadata store rather than from front-end server neighbors during the bootstrap process.
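
The configuration change mentioned at the end of the paragraph could look roughly like the sketch below, assuming a simple environment-variable flag; the flag name, helper functions, and cell identifier are invented, and this is not Kinesis's actual code. The idea is to let a restarting front-end server fill its shard-map from the authoritative metadata store in one pass rather than by exchanging messages with every peer while it is also trying to serve requests.

```python
import os

def bootstrap_shard_map(server_id: str) -> dict:
    """Build the shard-map cache at startup.

    With the emergency flag set (a stand-in for the configuration change the
    post describes), read shard ownership directly from the authoritative
    metadata store instead of from front-end peers.
    """
    if os.environ.get("BOOTSTRAP_FROM_METADATA_STORE") == "1":
        return load_shard_map_from_metadata_store()
    return learn_shard_map_from_peers(server_id)

def load_shard_map_from_metadata_store() -> dict:
    # Placeholder for a bulk read of shard-ownership records (DynamoDB in the post).
    return {"shardId-000000000000": "backend-cell-17"}

def learn_shard_map_from_peers(server_id: str) -> dict:
    # Placeholder for the normal peer-to-peer path, which competes with
    # request handling for resources while the server warms up.
    return {}
```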

At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters. We didn’t want to increase the operating system limit without further testing, and as we had just completed the removal of the additional capacity that triggered the event, we determined that the thread count would no longer exceed the operating system limit and proceeded with the restart. We began bringing back the front-end servers with the first group of servers taking Kinesis traffic at 10:07 AM PST. The front-end fleet is composed of many thousands of servers, and for the reasons described earlier, we could only add servers at the rate of a few hundred per hour. We continued to slowly add traffic to the front-end fleet with the Kinesis error rate steadily dropping from noon onward. Kinesis fully returned to normal at 10:23 PM PST.
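
The post does not say which operating-system setting was exceeded. Assuming a Linux host, the usual per-user and system-wide thread ceilings are RLIMIT_NPROC and kernel.threads-max, and an operator could check headroom against them with something like this sketch (the 6000-thread figure is invented):

```python
import resource

def report_thread_headroom(expected_threads_per_process: int) -> None:
    """Print OS thread ceilings and flag insufficient headroom (Linux only)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)   # per-user task limit
    with open("/proc/sys/kernel/threads-max") as f:
        threads_max = int(f.read())                           # system-wide limit

    print(f"RLIMIT_NPROC: soft={soft} hard={hard}; kernel.threads-max={threads_max}")
    if soft != resource.RLIM_INFINITY and expected_threads_per_process >= soft:
        print("WARNING: expected thread count exceeds the soft RLIMIT_NPROC")

# Illustrative call: a fleet addition pushes per-server peer threads past the limit.
report_thread_headroom(expected_threads_per_process=6000)
```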

For Kinesis, we have a number of learnings that we will be implementing immediately. In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet. This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet. Having fewer servers means that each server maintains fewer threads. We are adding fine-grained alarming for thread consumption in the service. We will also finish testing an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well. In addition, we are making a number of changes to radically improve the cold-start time for the front-end fleet. We are moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet. In the medium term, we will greatly accelerate the cellularization of the front-end fleet to match what we’ve done with the back-end. Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range. This had been under way for the front-end fleet in Kinesis, but unfortunately the work is significant and had not yet been completed. In addition to allowing us to operate the front-end in a consistent and well-tested range of total threads consumed, cellularization will provide better protection against any future unknown scaling limit.
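
As a back-of-the-envelope illustration of the headroom argument (all numbers invented; the post only says the fleet has "many thousands" of servers), per-server peer threads track fleet size in the un-cellularized design, while cellularization caps them at the cell size:

```python
def peer_threads_per_server(fleet_size: int, cell_size: int = 0) -> int:
    """Threads one front-end server needs for peer communication.

    cell_size == 0 models the current un-cellularized front end (one thread
    per other server); a positive cell_size caps the peer set at the cell.
    """
    peers = fleet_size if cell_size == 0 else min(fleet_size, cell_size)
    return peers - 1

# Invented numbers, purely to show the scaling behaviour the post describes.
for fleet in (2000, 4000, 8000):
    print(fleet, "servers:",
          peer_threads_per_server(fleet), "threads each un-cellularized,",
          peer_threads_per_server(fleet, cell_size=512), "with 512-server cells")
```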

There were a number of services that use Kinesis that were impacted as well. Amazon Cognito uses Kinesis Data Streams to collect and analyze API access patterns. While this information is extremely useful for operating the Cognito service, this information streaming is designed to be best effort. Data is buffered locally, allowing the service to cope with latency or short periods of unavailability of the Kinesis Data Stream service. Unfortunately, the prolonged issue with Kinesis Data Streams triggered a latent bug in this buffering code that caused the Cognito webservers to begin to block on the backlogged Kinesis Data Stream buffers. As a result, Cognito customers experienced elevated API failures and increased latencies for Cognito User Pools and Identity Pools, which prevented external users from authenticating or obtaining temporary AWS credentials. In the early stages of the event, the Cognito team worked to mitigate the impact of the Kinesis errors by adding additional capacity and thereby increasing their capacity to buffer calls to Kinesis. While this initially reduced impact, by 7:01 AM PST error rates increased significantly. The team was working in parallel on a change to Cognito to reduce the dependency on Kinesis. At 10:15 AM PST, deployment of this change began and error rates began falling. By 12:15 PM PST, error rates were significantly reduced, and by 2:18 PM PST Cognito was operating normally. To prevent a recurrence of this issue, we have modified the Cognito webservers so that they can sustain Kinesis API errors without exhausting their buffers that resulted in these user errors.
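
A common way to get the "best effort" behavior the paragraph describes is a bounded buffer whose writers never block, as in the Python sketch below. This is a generic pattern, not Cognito's code; the queue size, function names, and drop counter are invented.

```python
import queue

# Bounded, best-effort buffer for analytics records destined for Kinesis.
api_event_buffer = queue.Queue(maxsize=10_000)
dropped_events = 0

def record_api_access(event: dict) -> None:
    """Called on the request path; must never block the webserver."""
    global dropped_events
    try:
        api_event_buffer.put_nowait(event)   # non-blocking enqueue
    except queue.Full:
        dropped_events += 1                  # best effort: shed load, keep serving

def kinesis_flusher(kinesis_put_record) -> None:
    """Background worker: drain the buffer and forward records to Kinesis."""
    while True:
        event = api_event_buffer.get()
        try:
            kinesis_put_record(event)
        except Exception:
            pass  # tolerate Kinesis unavailability without back-pressuring callers
```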

CloudWatch uses Kinesis Data Streams for the processing of metric and log data. Starting at 5:15 AM PST, CloudWatch experienced increased error rates and latencies for the PutMetricData and PutLogEvents APIs, and alarms transitioned to the INSUFFICIENT_DATA state. While some CloudWatch metrics continued to be processed throughout the event, the increased error rates and latencies prevented the vast majority of metrics from being successfully processed. At 5:47 PM PST, CloudWatch began to see early signs of recovery as Kinesis Data Stream’s availability improved, and by 10:31 PM PST, CloudWatch metrics and alarms fully recovered. Delayed metrics and log data backfilling completed over the subsequent hours. While CloudWatch was experiencing these increased errors, both internal and external clients were unable to persist all metric data to the CloudWatch service. These errors will manifest as gaps in data in CloudWatch metrics. While CloudWatch currently relies on Kinesis for its complete metrics and logging capabilities, the CloudWatch team is making a change to persist 3-hours of metric data in the CloudWatch local metrics data store. This change will allow CloudWatch users, and services requiring CloudWatch metrics (including AutoScaling), to access these recent metrics directly from the CloudWatch local metrics data store. This change has been completed in the US-EAST-1 Region and will be deployed globally in the coming weeks.
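
The 3-hour local metric store implies a read path that prefers local data for recent windows and only falls back to the Kinesis-backed pipeline for older ones. The sketch below shows that idea in Python; the store objects and their query method are placeholders, not CloudWatch's API.

```python
from datetime import datetime, timedelta, timezone

LOCAL_RETENTION = timedelta(hours=3)   # recent metrics kept in the local store

def get_metric_data(metric, start: datetime, end: datetime,
                    local_store, kinesis_backed_store):
    """Serve recent windows from the local store, older ones from the full pipeline."""
    now = datetime.now(timezone.utc)
    if start >= now - LOCAL_RETENTION:
        # The whole window is within the last 3 hours: the local store suffices,
        # even if the Kinesis-backed pipeline is impaired.
        return local_store.query(metric, start, end)
    return kinesis_backed_store.query(metric, start, end)
```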

Two services were also impacted as a result of the issues with CloudWatch metrics. First, reactive AutoScaling policies that rely on CloudWatch metrics experienced delays until CloudWatch metrics began to recover at 5:47 PM PST. And second, Lambda saw impact. Lambda function invocations currently require publishing metric data to CloudWatch as part of invocation. Lambda metric agents are designed to buffer metric data locally for a period of time if CloudWatch is unavailable. Starting at 6:15 AM PST, this buffering of metric data grew to the point that it caused memory contention on the underlying service hosts used for Lambda function invocations, resulting in increased error rates. At 10:36 AM PST, engineers took action to mitigate the memory contention, which resolved the increased error rates for function invocations.
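
One standard way to keep such local buffering from causing memory contention is to cap the buffer by bytes and evict the oldest data first, as sketched below. This is an illustrative pattern, not Lambda's metric agent; the 64 MB cap is an invented number.

```python
from collections import deque

MAX_BUFFER_BYTES = 64 * 1024 * 1024   # illustrative per-host cap

class BoundedMetricBuffer:
    """Keep buffered metric payloads under a fixed memory budget."""

    def __init__(self, max_bytes: int = MAX_BUFFER_BYTES):
        self.max_bytes = max_bytes
        self.bytes_used = 0
        self.items = deque()  # oldest payload at the left

    def add(self, payload: bytes) -> None:
        self.items.append(payload)
        self.bytes_used += len(payload)
        # Evict oldest metric data first so buffering can never starve the
        # invocation path of memory during a long CloudWatch outage.
        while self.bytes_used > self.max_bytes and self.items:
            self.bytes_used -= len(self.items.popleft())
```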

CloudWatch Events and EventBridge experienced increased API errors and delays in event processing starting at 5:15 AM PST. As Kinesis availability improved, EventBridge began to deliver new events and slowly process the backlog of older events. Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS) both make use of EventBridge to drive internal workflows used to manage customer clusters and tasks. This impacted provisioning of new clusters, delayed scaling of existing clusters, and impacted task de-provisioning. By 4:15 PM PST, the majority of these issues had been resolved.

Outside of the service issues, we experienced some delays in communicating service status to customers during the early part of this event. We have two ways of communicating during operational events – the Service Health Dashboard, which is our public dashboard to alert all customers of broad operational issues, and the Personal Health Dashboard, which we use to communicate directly with impacted customers. With an event such as this one, we typically post to the Service Health Dashboard. During the early part of this event, we were unable to update the Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event. We have a back-up means of updating the Service Health Dashboard that has minimal service dependencies. While this worked as expected, we encountered several delays during the earlier part of the event in posting to the Service Health Dashboard with this tool, as it is a more manual and less familiar tool for our support operators. To ensure customers were getting timely updates, the support team used the Personal Health Dashboard to notify impacted customers if they were impacted by the service issues. We also posted a global banner summary on the Service Health Dashboard to ensure customers had broad visibility into the event. During the remainder of event, we continued using a combination of the Service Health Dashboard, both with global banner summaries and service specific details, while also continuing to update impacted customers via Personal Health Dashboard. Going forward, we have changed our support training to ensure that our support engineers are regularly trained on the backup tool for posting to the Service Health Dashboard.

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.
