Importance of Error Budgets in a Distributed System

If you have ever worked on a large distributed system or on a platform team, you are acutely aware of how difficult it is to communicate reliability requirements to sister or client teams. In distributed systems, upstreams inherit the failures of their downstreams. If a service has many downstreams, the laws of probability mean that the service will experience roughly the sum of its downstreams' failures.
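To make the compounding concrete, here is a minimal sketch. It assumes downstream failures are independent and that the service needs every downstream to succeed; the function name and the example numbers are mine, not taken from a real system.

```python
from math import prod


def compound_availability(downstream_availabilities):
    """Availability of a service that needs every downstream to succeed.

    Assumption: downstream failures are independent of each other.
    """
    return prod(downstream_availabilities)


# Hypothetical example: ten downstreams, each available 99.9% of the time.
downstreams = [0.999] * 10
availability = compound_availability(downstreams)
failure_rate = 1 - availability
sum_of_failures = sum(1 - a for a in downstreams)

print(f"availability:            {availability:.4%}")
print(f"failure rate:            {failure_rate:.4%}")
print(f"sum of downstream rates: {sum_of_failures:.4%}")
```

For small failure rates the compound failure rate lands just under the plain sum of the downstream failure rates, which is why a service with many downstreams ends up far less reliable than any single one of them.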

On the other hand, platform teams deal with a much different conundrum — the risk of progress. Changes to any system inject uncertainty, and therefore risk. If a client team is moving quickly in developing new features, this velocity frequently comes at the cost of stability in the system. How do platform teams communicate risk vs reliability tradeoffs?

The answer is error budgets!

This post is not meant to be a comprehensive description of error budgets — for that I recommend reading Google’s book(s) on Site Reliability Engineering.

In this post I will go over:

  • Why error budgets are important
  • Recommendations for choosing the right SLI
  • Suggestions for communicating with client teams for successful adoption

Why are error budgets important?

Error budgets are exactly that — a budget of errors. This budget is assigned to each team (or platform, depending on your architecture), and then monitored by that team. When a team runs out of budget, they should halt feature work and focus on reliability work. The definition of an “error” can be custom built, and we will talk about that more in the next section.

Error budgets create a contract between teams focused on reliability and teams focused on feature releases. This alleviates tension among teams whose goals may seem to be at odds with one another, and creates a more collaborative environment between client and platform teams, or service owners and SREs. Furthermore, error budgets can become a common language that is understood across the company (even beyond the engineering org) as the canonical way of measuring both short and longer term availability of the product.

How to choose the right SLI/SLO/SLA?

Error budget frameworks are built on an agreed-upon Service Level Indicator (SLI), Service Level Agreement (SLA), and Service Level Objective (SLO).

  • SLI: an indicator of the service's or platform's health (a real-time measure)
  • SLA: a promise from the service to its clients about availability
  • SLO: the agreed-upon goal for the SLI

In some cases, the SLA and SLO might be the same value. In others, a service might promise its clients an availability of three nines (99.9%) but set a more ambitious SLO of four nines (99.99%), which leaves room to miss the objective without breaking the promise.
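To get a feel for the gap between those two targets, here is a rough sketch that reads them as time-based availability. The 30-day period and the function name are assumptions for illustration; a request-based SLI, as used later in this post, would count failed requests instead of minutes.

```python
def allowed_downtime_minutes(target, period_days=30):
    """Minutes of downtime a time-based availability target tolerates.

    Assumption: availability is measured as uptime over `period_days` days.
    """
    return (1 - target) * period_days * 24 * 60


sla_minutes = allowed_downtime_minutes(0.999)   # promise to clients
slo_minutes = allowed_downtime_minutes(0.9999)  # stricter internal goal

print(f"SLA 99.9%  allows about {sla_minutes:.1f} minutes per 30 days")
print(f"SLO 99.99% allows about {slo_minutes:.2f} minutes per 30 days")
```

Under these assumptions the stricter SLO tolerates one tenth of the downtime the SLA does, and the difference between the two is the safety margin.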

What the SLI/SLO/SLA actually measures is an important question. Before deciding upon a formula, ask yourself the following questions:

  • What do I expect my service to do?
  • What response latency is unacceptable?
  • Is the response status code adequate to label a response as a success or a failure?

For example:

A service that shows images to a user may count responses that return 0 images as a failure (even if the response was “successful”), because that is a bad user experience.

A service that allows users to collaborate on a document may count responses that take longer than 100ms as a failure since that is too slow for their product requirements.

A service that simply allows users to convert images from JPG to PNG may use a basic scheme where an HTTP 500 response code is considered a failure and everything else is considered a success.
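The three examples above can be sketched as small success predicates feeding one SLI function. The `Response` record, its field names, and the function names are hypothetical, and treating a 500 as a failure for the first two services is my assumption; only the thresholds (0 images, 100ms, status 500) come from the examples.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical record of one served request."""
    status_code: int
    latency_ms: float
    image_count: int = 0


def image_feed_success(r):
    # A response with 0 images is a failure even if the status is fine.
    return r.status_code != 500 and r.image_count > 0


def collaboration_success(r):
    # Anything slower than 100ms is too slow for the product.
    return r.status_code != 500 and r.latency_ms <= 100


def converter_success(r):
    # Basic scheme: only a 500 counts as a failure.
    return r.status_code != 500


def sli(responses, is_success):
    """SLI = successes / total for a non-empty batch of responses."""
    if not responses:
        raise ValueError("cannot compute an SLI without any responses")
    return sum(1 for r in responses if is_success(r)) / len(responses)


responses = [
    Response(200, 50, image_count=3),
    Response(200, 150, image_count=0),
    Response(500, 20),
    Response(200, 80, image_count=1),
]
print(sli(responses, image_feed_success))     # 0.5
print(sli(responses, collaboration_success))  # 0.5
print(sli(responses, converter_success))      # 0.75
```

The same traffic yields a different SLI for each service, which is the point: the definition of an "error" has to follow what the service is expected to do.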

Calculations

SLI = successes / total
SLO = desired successes / total
Error budget = 1 - SLO

If the SLO is 99.99%:

Error budget = 1 - 0.9999 = 0.0001 (0.01%)

This means that the service can return errors for no more than 0.01% of its requests and still remain within budget.

Absolute budget left = 0.0001 - (1 - SLI)
Percent budget left = ((0.0001 - (1 - SLI)) / 0.0001) * 100
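Here is a small sketch of the same arithmetic in code, generalized from the 0.0001 budget to any SLO. The function names and the request counts are made up for illustration.

```python
def error_budget(slo):
    """Error budget = 1 - SLO."""
    return 1 - slo


def budget_left(sli, slo):
    """Return (absolute budget left, percent budget left)."""
    budget = error_budget(slo)
    spent = 1 - sli  # fraction of requests that failed
    absolute = budget - spent
    percent = absolute / budget * 100
    return absolute, percent


# Hypothetical traffic: 1,000,000 requests, 40 of them failed.
slo = 0.9999
sli = 999_960 / 1_000_000
absolute, percent = budget_left(sli, slo)

# A negative or zero value means the budget is exhausted.
halt_feature_work = percent <= 0

print(f"absolute budget left: {absolute:.6f}")
print(f"percent budget left:  {percent:.1f}%")
print(f"halt feature work:    {halt_feature_work}")
```

With 40 failures against a budget of 100 per million requests, about 60% of the budget remains. A value at or below zero is the signal to halt feature work and focus on reliability.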

Budgets can either roll over time (using a sliding window) or accumulate over a cycle. With a sliding window, incidents that happened before the start of the window no longer count against the team. With a cumulative budget, the relative percentage of errors goes down as the service continues to behave in a healthy way and serves more successful requests. Either way, if a platform experiences an outage today and runs out of budget, spending a week on reliability work (and ensuring no more incidents) is likely to restore enough budget to resume feature work.

Rolling error budgets generally sit better with clients. To implement one, aggregate the SLI over a sliding window of time. Without a rolling window, budgets recover more slowly, and an outage early in a cycle weighs on the client's budget for the rest of that cycle.
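The difference in recovery speed is easy to see with a small sketch. It assumes the SLI is aggregated from daily (successes, total) counts with at least one request per day; the function names, the 7-day window, and the traffic numbers are illustrative only.

```python
from collections import deque


def rolling_sli(daily_counts, window_days):
    """SLI over a sliding window of the most recent `window_days` days."""
    window = deque(maxlen=window_days)  # old days fall out automatically
    result = []
    for successes, total in daily_counts:
        window.append((successes, total))
        window_successes = sum(s for s, _ in window)
        window_total = sum(t for _, t in window)
        result.append(window_successes / window_total)
    return result


def cumulative_sli(daily_counts):
    """SLI accumulated since the start of the cycle."""
    successes_so_far = total_so_far = 0
    result = []
    for successes, total in daily_counts:
        successes_so_far += successes
        total_so_far += total
        result.append(successes_so_far / total_so_far)
    return result


# Hypothetical cycle: an outage on day 1, then nine clean days.
daily_counts = [(900, 1000)] + [(1000, 1000)] * 9
rolling = rolling_sli(daily_counts, window_days=7)
cumulative = cumulative_sli(daily_counts)

for day, (r, c) in enumerate(zip(rolling, cumulative), start=1):
    print(f"day {day:2d}: rolling={r:.4f} cumulative={c:.4f}")
```

On day 8 the outage drops out of the 7-day window and the rolling SLI returns to 1.0, while the cumulative SLI is still carrying the day-1 errors and only improves as more successful requests dilute them.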

It is best to test various window periods and find what works best. However, it is recommended to keep these parameters consistent across teams, as one of the most beneficial aspects of error budgets is their simplicity.

Suggested Communications

Communication of error budgets with client teams must be done carefully and intentionally. Error budgets serve the purpose of allowing feature developers to iterate with both speed and reliability. In addition, the metrics implemented for error budgets give these engineers real time insight into the health of their system. This is a system that is advantageous for their team, platform teams, SREs, and the company as a whole.

If not communicated properly, error budgets can come across as a poorly understood chore that serves no greater purpose than adding yet another metric to monitor.

When proposing error budgets to client teams, be sure to focus on the following:

  • Increased understanding of how the system/feature is performing
  • Improved experience for users
  • Self-service: the team will be able to independently trade off novelty and stability in their product with minimal oversight
  • If the team is well within their budget, they can deprioritize reliability work and plan more feature work
  • Available support: comprehensive wikis/how-to guides and a support channel

I hope this blog post gave you some insight into why error budgets are important for complex systems, how to go about choosing a metric to track, and the importance of intentional communication with clients. Happy budgeting!

Translated from: https://medium.com/swlh/importance-of-error-budgets-in-a-distributed-system-557a0e037957
