Measuring Site Reliability

Latest recommended article published on 2021-03-05 09:17:59

weixin_26737625

Original link: https://medium.com/better-programming/measuring-site-reliability-9745617d206c

We have all been in the Dev vs. Ops world where the Dev and Ops teams had different objectives, rules, and priorities. Most of the time, they opposed each other because one’s interest was the other’s problem.

Now we have DevOps. In the words of Andrew Shafer and Patrick Debois, it is “a software engineering culture and practice, that aims at unifying software development and software operation.”

Site reliability engineering implements DevOps by fostering shared ownership, applying the same tooling and techniques across teams, and accepting failures while aiming never to fail the same way twice. The primary focus is to build and run a reliable application without compromising on the speed of delivery, two goals that used to be diametrically opposed (i.e. “Make better software, faster”).

Site reliability engineers, or SREs, measure everything, and they define and agree upon measurable metrics to ensure they work towards a measurable goal. For example, saying that the site is running slow is a vague statement because it does not mean anything in engineering terms. But saying that the 95th percentile of the response time has exceeded the SLO by 10% makes complete sense. They also measure repetitive tasks over time (called toil) and seek to automate them to avoid burnout.

There are three major reliability parameters that SREs deal with, and we will unpack them one by one. They are the Definition of Availability (SLO), Indicators of Availability (SLI), and Consequences of Unavailability (SLA).

Service Level Indicators (SLI)

Service Level Indicators, or SLIs, are quantifiable measures of reliability. According to Google, they are “a carefully defined quantitative measure of some aspect of the level of service that is provided.” Some common examples can be request latency, failure rate, data throughput, etc. SLIs are specific to user journeys, and they vary between applications. A user journey is a sequence of activities that are performed by a user to achieve a particular end. For example, a user journey for doing a bank transfer can be adding a payee and making the fund transfer.

Google, which is the original proponent of SRE, has indicated four Golden Signals that you can monitor for most user journeys:

Latency
Errors
Traffic
Saturation

Latency is the amount of time it takes for your service to respond to a user request, errors are the percentage of failed requests, traffic is the demand directed to your service, and saturation measures how utilised your infrastructure components are.
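
As a rough illustration of these four definitions, the sketch below summarises one observation window. The `Request` record, the `golden_signals` function, and the choice of the median for latency and CPU for saturation are all assumptions made for this example; they are not something the article or Google prescribes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time the service took to respond
    ok: bool           # False when the request failed


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarise the four Golden Signals for one observation window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    total = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    failed = sum(1 for r in requests if not r.ok)
    return {
        # Latency: upper-median response time in milliseconds.
        "latency_ms": latencies[total // 2],
        # Errors: percentage of failed requests.
        "error_pct": 100 * failed / total,
        # Traffic: demand, expressed here as requests per second.
        "traffic_rps": total / window_seconds,
        # Saturation: utilisation of one component (CPU here), in percent.
        "saturation_pct": 100 * cpu_used / cpu_capacity,
    }


if __name__ == "__main__":
    sample = [Request(100, True), Request(200, True),
              Request(300, False), Request(400, True)]
    print(golden_signals(sample, window_seconds=2, cpu_used=3, cpu_capacity=4))
```

In practice a monitoring system would supply these numbers; the point is only that each signal reduces to a simple, measurable quantity.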

There are various ways of obtaining Service Level Indicators, but one way recommended by Google is to get the ratio of Good Events over Valid Events: SLI = Good Events * 100 / Valid Events.

So an SLI of 100 means that everything works, and a zero means that everything is broken.
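
The ratio is simple enough to write down directly. This is a minimal sketch: the function name `sli` and the decision to reject a window with no valid events are assumptions for illustration.

```python
def sli(good_events: int, valid_events: int) -> float:
    """Return the SLI as a percentage: good events * 100 / valid events."""
    if valid_events <= 0:
        # Assumption: with no valid events the ratio is undefined, so we refuse.
        raise ValueError("an SLI needs at least one valid event")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good events must be between 0 and valid events")
    return good_events * 100 / valid_events


if __name__ == "__main__":
    # 995 good responses out of 1000 valid requests.
    print(sli(995, 1000))
```

What counts as a "good" and a "valid" event depends on the user journey being measured, for example responses served without error out of all well-formed requests.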

A good SLI ties directly to user experience. For example, a lower SLI value should correspond to lower customer satisfaction. If that is not the case, then the SLI is not a good one and is not even worth measuring.

It is best not to have more than a handful of SLIs to measure. Too many SLIs will confuse the team and trigger too many false positives. Stick to four or five that directly correlate to customer satisfaction: while you may want to measure the CPU and memory usage of your application, request latency and error rate make better SLIs.

It is crucial to prioritise user journeys, giving more weight to the journeys that affect the customer more and less to those that affect the customer less. For example, the funds transfer journey in your banking application would be more critical than a profile update.

Service Level Objectives (SLO)

Google writes that Service Level Objectives, or SLOs, “specify a target level for the reliability of your service.” They define what percentage of the time an SLI must be met for your site to be considered reliable. SLOs are created by combining one or more SLIs.

For example, suppose your SLI requires the 95th percentile of request latency over the last 15 minutes to be below 500ms. A 99% SLO then requires that SLI to be met 99% of the time.
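
One way to read this example in code is to treat each 15-minute window as a list of latencies, check the SLI per window, and then check what share of windows passed. This is a sketch under assumptions: the function names are hypothetical, the percentile uses the nearest-rank method, and "99% of the time" is interpreted as "99% of windows".

```python
def p95(latencies_ms):
    """95th percentile of a non-empty list, using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, giving a 1-based rank without float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def window_meets_sli(latencies_ms, threshold_ms=500):
    """The SLI is met when the window's p95 latency is below the threshold."""
    return p95(latencies_ms) < threshold_ms


def slo_met(windows, target_pct=99.0):
    """The SLO is met when at least target_pct of the windows meet the SLI."""
    if not windows:
        raise ValueError("need at least one window")
    good = sum(1 for w in windows if window_meets_sli(w))
    return good * 100 / len(windows) >= target_pct


if __name__ == "__main__":
    fast = [100] * 19 + [900]   # one slow request in 20: p95 is still 100ms
    slow = [100] * 18 + [900] * 2  # two slow requests in 20: p95 is 900ms
    print(slo_met([fast] * 99 + [slow]))      # 99% of windows are good
    print(slo_met([fast] * 98 + [slow] * 2))  # only 98% of windows are good
```

The split into two functions mirrors the text: the SLI is the per-window measurement, and the SLO is the target applied to it over time.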

While all organisations strive for 100% reliability, a 100% SLO is not a good objective. A system with a 100% SLO is costly and technically more complicated, and most applications don’t need one to be acceptable to their users.

Also, a 100% reliable application does not leave room for new features, as every new feature has the potential to disrupt the existing service. You always need to have some room for error defined in your SLO.

SLOs are internal objectives that the team agrees upon with its internal stakeholders, such as developers, product managers, SREs, and the CTO. They require buy-in from the entire organisation. Missing an SLO has no explicit or implicit consequences with customers.

For example, a customer cannot claim damages if you don’t meet an SLO, but your organisation’s leadership may not be happy. That does not mean that missing the SLO should be free of consequences internally. Missing the SLO means less frequent changes and fewer features developed. It may also indicate a drop in quality, which calls for more focus on development and testing.

SLOs need to be realistic, and the team should strive to meet them. SLOs should tie into the customer experience: define them so that, as long as the service is within the SLO, customers do not perceive any issues with the quality of service. If the service performs worse than the defined SLOs, the customer experience might suffer, but not to the point where customers start raising support tickets.

Some organisations have two SLOs: achievable and aspirational. While the achievable SLO is the one the entire team should meet, the aspirational one is what the team should strive for and is a part of the continuous improvement process.

Service Level Agreements (SLA)

As noted by Google, Service Level Agreements (SLAs) are “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”

SLAs are more formal: they are business-level agreements with customers that state what happens if the organisation does not meet the agreed reliability. They can be explicit or implicit. An explicit SLA defines consequences (mostly in terms of service credits) for not meeting the set reliability. An implicit SLA is measured in terms of loss of reputation for the business and customers jumping ship.

SLAs are set at a level that is just enough to keep customers from jumping ship, so an SLA tends to be less strict than the corresponding SLO. For example, for the request latency SLI, we can define the SLO at 300ms and the SLA at 500ms. That is because SLOs are internal reliability targets, while SLAs are external. If the team achieves the SLO, you meet the SLA automatically, but the looser SLA covers your organisation in case the team falls short.
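
The gap between the two thresholds can be pictured as a buffer zone. The sketch below uses the 300ms and 500ms values from the example; the function name `classify_latency`, the status labels, and the choice to treat the thresholds as inclusive are assumptions for illustration.

```python
SLO_LATENCY_MS = 300  # internal target from the example
SLA_LATENCY_MS = 500  # external commitment from the example


def classify_latency(latency_ms: float) -> str:
    """Place a measured latency SLI value relative to the SLO and the SLA."""
    if latency_ms <= SLO_LATENCY_MS:
        return "within SLO"
    if latency_ms <= SLA_LATENCY_MS:
        # The internal target is missed, but customers cannot claim anything yet.
        return "SLO missed, SLA met"
    return "SLA breached"


if __name__ == "__main__":
    for value in (250, 420, 650):
        print(value, classify_latency(value))
```

The middle band is where the team should react: the internal target has slipped, but there is still room before the customer-facing agreement is broken.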

Error Budgets

According to Liz Fong-Jones and Seth Vargo, Error Budgets are “a quantitative measurement shared between the product and SRE teams to balance innovation and stability.”

In simple terms, it is the measure of risk you can take to ship new features, stop services for maintenance or routine improvements, and absorb network and infrastructure outages and other unforeseen circumstances. Typically, the monitoring service measures your service uptime and the SLOs define the target you need to achieve. The difference between the two is the error budget you have left, that is, the time you can still spend on pushing new releases.

That is the reason why we did not set a 100% SLO in the first place. Error Budgets help a team balance innovation with reliability, and we need them because SRE considers failures inevitable and expected. Whenever you push a new change to production, you take some risk of disrupting your service. Therefore, a higher Error Budget allows you to push more features (Error Budget = 100% - SLO).

For example, if your SLO is 99%, the Error Budget is 1%. Multiply that by 30 days/month * 24 hours/day and you get 7.2 hours of Error Budget per month. That is the time you can spend on maintenance. For 99.9%, the value is 43.2 minutes. For 99.99%, it is 4.32 minutes per month.
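
The arithmetic can be reproduced with a few lines. This is a minimal sketch assuming a 30-day month, as in the example; the function names are made up for illustration.

```python
def error_budget_pct(slo_pct: float) -> float:
    """Error Budget = 100% - SLO, in percent."""
    return 100.0 - slo_pct


def monthly_budget_minutes(slo_pct: float, days_per_month: int = 30) -> float:
    """Downtime allowed per month, in minutes, for a given SLO."""
    minutes_per_month = days_per_month * 24 * 60
    return error_budget_pct(slo_pct) / 100 * minutes_per_month


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = monthly_budget_minutes(slo)
        # Rounded for display, since floating-point subtraction is not exact.
        print(f"SLO {slo}%: {minutes:.2f} minutes ({minutes / 60:.2f} hours) per month")
```

Each extra nine in the SLO divides the budget by ten, which is why very high targets leave almost no room for releases or maintenance.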

These are actual downtimes, but if you have redundant services and you plan for high availability and disaster recovery (DR), you can stretch this budget further because the service is still live while you are patching one server.

Conclusion

Now that you understand what these terms mean and how they can help you in your SRE journey, feel free to apply these principles in your organisation. Look for how you can use them to provide a better experience to your customers and your organisation’s stakeholders.

Thanks for reading. I hope you enjoyed the article.

Translated from: https://medium.com/better-programming/measuring-site-reliability-9745617d206c
