Engineering for Failure

Not so long ago, our systems were simple: we had one machine, with one process, probably no more than one external datastore, and the entire request lifecycle was handled within this simple world.

Our users were also accustomed to a certain SLA standard — a 2-second page load time could have been acceptable a few years ago, but waiting more than a second for an Instagram post is unthinkable nowadays.

(Warning: buzzwords ahead)

When systems get more complex, with strict latency requirements and a distributed infrastructure, an uninvited guest creeps into our systems — request failure.

With each additional request to an external service within the lifecycle of a user request, we’re adding another chance for failure. With every additional datastore, we’re exposed to an increased risk of failure. With every feature we add, we risk lengthening our latency long tail, resulting in a degraded user experience for some portion of requests.

In this article, I’ll cover some of the basic ways we at Riskified handle failures in order to provide maximal uptime and optimal service to our customers.

Failure by example

Every external service, no matter how good and reliable, will fail at some point. We at Riskified learned this the hard way when we experienced short failures with a managed, highly available service that almost resulted in data loss. That incident taught us the hard lesson that request failures should be handled gracefully.

In Google’s superbly written Site Reliability Engineering book, they describe The Global Chubby Planned Outage, in which a service was so reliable that its customers used it without taking the possibility of failure into account, and even used it without any real, essential need, just because it was so reliable.

As a result, Chubby, Google’s distributed locking system, was given a Service Level Objective (SLO) for service uptime, and in any quarter in which this SLO is significantly exceeded, the team responsible for the service intentionally takes it down. Their goal is to educate users that the service is not fail-safe and that they need to account for external service failures in their products.

So how should engineers handle request failures? Let’s cover some common patterns:

Retrying

Retrying a failed request can, in many cases, solve the problem. This is the obvious solution, assuming network failures are sporadic and unpredictable. Just set a reasonable timeout for each request you send out to an external resource, and the number of retries you want, and you’re done! Your system is now more reliable.

Something to consider, however, is that additional retries can cause additional load on the system you’re calling, and make an already failing system fail harder.

Implementing and configuring a short-circuiting (circuit breaker) mechanism is worth considering. You can read more about it in this interesting Shopify engineering blog post.

Prefetching — Fail outside of the main flow

One of the best ways to avoid failure while calling an external service is to avoid calling this service at all.

Let’s say we’re implementing an online store — we have a user service and an order service, and the order service needs the current user’s email address in order to send them an invoice for their last purchase.

The fact that we need the email address doesn’t mean we have to query the user service while the user is logged in and waiting for order confirmation. It just means that an email address should be available.

In cases of fairly static data, we can easily pre-fetch all (or some) user details from the user service in a background process. This way, the email is already available during order processing, and we don’t need to call the external service. In the event the service fails to fetch user details, that failure remains outside of the main processing flow and is “hidden” from the user.

In his talk, Jimmy Bogard explains it better than I do (the link starts from his explanation about prefetching, although the whole talk is great!)

Best efforting

In some cases, we should just embrace failure, and continue processing without the data we were trying to get. You’re probably wondering — if we don’t need the data, why are we querying it at all?

The best example we have for this at Riskified is a Redis-based distributed locking mechanism that we use to block concurrent transactions in some cases. Since we’re a low-latency oriented service, we didn’t want a latency surge in lock acquisition to cause us to exceed the SLA requirements of our customers. We set a very strict timeout on lock acquisition so that when the timeout is reached, we continue unlocked — i.e., we prefer a rare race condition over increased latency for our customers. In other words, the locking feature is a “nice to have” in our process.

Falling back to previous or estimated results

In some cases, you may be able to use previous results or sub-optimal estimations to handle a request while other services are unavailable.

Let’s say we’re implementing a navigation system, and one of the features we want is traffic jam predictions.

We’d probably have a JammingService (not to be confused with the Bob Marley song), that we’d call with our route to estimate the probability of traffic jams. When this service is failing, we might choose a sub-optimal course of action, while still serving the request:

  1. Using previous results: we might cache some “common” jam predictions and serve them, we might even pre-fetch the jam estimation for the most commonly used routes of some of our users.

  2. Estimate a result: Our service can hold a mapping of mean jam estimation per region and serve that estimation for all requests for routes in the region.

In both examples, the solution is obviously not optimal, but it’s probably better than failing the request. The general idea here is to make a simple estimation of the result we’re trying to get from the external resource.

Delaying a response

If the product’s business requirements allow it, it’s possible to delay processing the request until the problem with the external resource is solved.

As an example, let’s take the JammingService from the previous solution. When it fails, we can queue all incoming requests in some internal queue and return a response telling the user that the request cannot be processed at the moment, but that a result will be delivered as soon as possible, via push notification to the user’s phone or via a webhook, for example.

This is possible mostly in asynchronous services, where we can separate between the request and the response. (If you can design the service to be asynchronous to begin with, that’s even better!)

Implementing simplified fallback logic

For some mission-critical features, a more complex solution is needed. In some cases, the external service is so critical to our service that we’d have to fail the request if the external service fails.

One of the solutions we devised for such critical external resources, is to use “simplified” in-process versions of them. In other words, we’re re-implementing a simplified version of the external service as a fallback within our service, so that in the event the external service fails, we still have some data to work with, and can successfully process the request.

As an example, let’s go back to our navigation system. Traffic jam estimation might be such an important feature of our system that we want each request to have a fairly good estimate, even if our JammingService is down.

Our JammingService probably uses various complex machine learning algorithms and external data sources. In our simplified fallback version of it, we might choose, for example, to implement it using a simple greedy best-first algorithm, with simple optimizations.

In this case, even if there’s a failure of the JammingService, some fairly good traffic jam estimation is available within our navigation system.

This isn’t optimal since now we need to maintain two versions of the same feature, but when the feature is critical enough, and may be unstable enough — it could be worth it.

Closing thoughts — Failing as a way of life

At school, I was quite a bad student, so failing is not new to me. It taught me that as an engineer, anything I lay my hands on might fail, and simply catching the exception is not enough — we need to do something when we catch it; we still need to provide some level of service.

I encourage you to dedicate a big part of your time to failure handling, and to make it a habit to announce your systems are production-ready only when you handle your failures in a safe and business-oriented way.

As always, you’re welcome to find me at my Twitter handle: @BorisCherkasky

Translated from: https://medium.com/riskified-technology/engineering-for-failure-f73bc8bc2e87
