Thundering Herds & Promises

Story of a Service

Suppose you’re writing a simple service: it handles inbound requests from your users, but doesn’t hold any data itself. This means that in order to handle any inbound request it needs to refer to a backend to actually fetch whatever is required. This is a great feature, as it means your service is stateless: it’s easier to test, simpler to scale, and typically easier to understand. In our story, you’re also lucky: you only handle a single request at a time, with a single worker doing the work.

Your service diagram might look like this:

All is well but latency is an issue. The backend you use is slow. To address this, you notice that the vast majority — perhaps 90% — of the requests are the same. One common solution is to introduce a cache: before hitting the backend, you check your cache and use that value if present.

[Diagram: incoming requests check the cache first and only fall through to the backend on a miss]

With the above design, 90% of incoming requests do not hit the backend. You’ve just reduced your resource requirements by 10x!

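For concreteness, here is a minimal C++ sketch of that read-through pattern; the names fetch_from_backend and handle_request are illustrative stand-ins, not the original service’s code:

    #include <string>
    #include <unordered_map>

    // Stand-in for the slow backend call.
    std::string fetch_from_backend(const std::string& key) {
        return "value-for-" + key;  // imagine a slow RPC here
    }

    // Single-threaded for now, matching the one-worker story above.
    std::unordered_map<std::string, std::string> cache;

    std::string handle_request(const std::string& key) {
        // Check the cache first; only a miss goes to the backend.
        if (auto it = cache.find(key); it != cache.end()) {
            return it->second;
        }
        std::string value = fetch_from_backend(key);
        cache.emplace(key, value);
        return value;
    }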

Of course, there are some additional details you now need to consider. Namely, what happens if the cache is erased, or if a brand new request appears? In either scenario, your assumption of 90% similarity is broken. There are many ways you may choose to work around this, for example:

  • If the cache is erased because you’ve had to restart the service, you could consider making the cache external:

    a) Leveraging memcache, or some other external caching service.

    b) Using local disk as something “external” to your service that you can load on startup (see the sketch after this list).

    c) Using something clever with shared memory segments to restart without discarding the in-memory state.

  • If the request is a brand new feature, you could handle it incrementally:

    a) Roll out the new client that makes the request slowly, or adjust your distribution plans so the new traffic doesn’t arrive all at once.

    b) Have your service perform some sort of warmup, perhaps with fake requests emulating what you expect.

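As a sketch of option (b) of the cache-externalization list above (persisting to local disk), and assuming a simple tab-separated file format chosen purely for illustration, surviving a restart could look something like this:

    #include <fstream>
    #include <string>
    #include <unordered_map>

    // Write the in-memory cache to local disk, one "key<TAB>value" per line,
    // so a restarted service does not have to start cold.
    void save_cache(const std::unordered_map<std::string, std::string>& cache,
                    const std::string& path) {
        std::ofstream out(path);
        for (const auto& [key, value] : cache) {
            out << key << '\t' << value << '\n';
        }
    }

    // Load the persisted cache on startup; a missing file yields an empty cache.
    std::unordered_map<std::string, std::string> load_cache(const std::string& path) {
        std::unordered_map<std::string, std::string> cache;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            auto tab = line.find('\t');
            if (tab != std::string::npos) {
                cache.emplace(line.substr(0, tab), line.substr(tab + 1));
            }
        }
        return cache;
    }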

Furthermore, your service might not actually live alone. You might have lots of other instances of it, and you could communicate with an existing instance to accelerate your cache warmup routine.

All of these add complexity; some of them add intricate failure modes or just make your service more troublesome to deal with (what do you mean I can’t dial everything to 100% immediately?). Chances are your cache warms pretty quickly, and the fewer boxes on your service diagram, the fewer things go wrong. Ultimately, you’re happy with doing nothing.

The Reality

The above diagram isn’t really what your service looks like. It’s a runaway success and each service is doing quite a lot of work. Your service actually looks like this:

[Diagram: many concurrent workers inside the service, all sharing one cache in front of the backend]

However, the cache-empty case is now much more problematic. To illustrate: if your cache receives 100 concurrent requests while it is empty, all of them will miss at the same moment, resulting in 100 individual requests to the backend. If the backend is unable to handle this surge of concurrent requests (e.g. capacity constraints), additional problems arise. This is what’s sometimes called a thundering herd.

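To make the failure mode concrete, here is a hedged sketch of it; the key name and thread count are illustrative:

    #include <atomic>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    std::atomic<int> backend_calls{0};

    std::string fetch_from_backend(const std::string& key) {
        ++backend_calls;  // every call here is load the backend must absorb
        return "value-for-" + key;
    }

    std::mutex cache_mu;
    std::unordered_map<std::string, std::string> cache;

    std::string handle_request(const std::string& key) {
        {
            std::lock_guard<std::mutex> lock(cache_mu);
            if (auto it = cache.find(key); it != cache.end()) return it->second;
        }
        // Every request that missed ends up here concurrently: the herd.
        std::string value = fetch_from_backend(key);
        std::lock_guard<std::mutex> lock(cache_mu);
        cache.emplace(key, value);
        return value;
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 100; ++i) {
            workers.emplace_back([] { handle_request("hot-key"); });
        }
        for (auto& t : workers) t.join();
        // With a cold cache, backend_calls can be as high as 100:
        // one backend request per concurrent miss.
    }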

What to do?

The mitigations above may help improve the resilience of the cache: if your cache survives a service restart, you can safely keep pushing releases without worrying that each one will cause a mess. But this still doesn’t help you with the genuinely new request.

When you see a new request, by definition it is not in the cache, and it may not even be predictable: you have many clients, and they don’t always tell you about new requests in advance. Your only option is to slow the release of new features in order to pre-warm the cache, and hope everyone does this… or is it?

Promises

A Promise is an object that represents a unit of work producing a result that will be fulfilled at some point in the future. You can “wait” on this promise to be complete, and when it is you can fetch the resulting value. In many languages there is also the idea of “chaining” future work once the promise is fulfilled.

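For illustration, here is the basic wait-and-fetch interaction using the C++ standard library’s std::promise and std::future (folly’s Future additionally supports chaining follow-up work, e.g. via thenValue):

    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>

    int main() {
        std::promise<std::string> promise;
        std::future<std::string> result = promise.get_future();

        // A worker fulfills the promise at some point in the future.
        std::thread worker([&promise] {
            promise.set_value("hello from the future");
        });

        // The consumer waits for the promise to complete, then fetches the value.
        std::cout << result.get() << "\n";
        worker.join();
    }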

At Instagram, when turning up a new cluster we would run into a thundering herd problem, since the cluster’s cache was empty. We used promises to help solve this: instead of caching the actual value, we cache a Promise that will eventually provide the value. On a cache miss, instead of going immediately to the backend, we atomically create a Promise and insert it into the cache; that new Promise then starts the work against the backend. The benefit is that other concurrent requests will not miss, because they find the existing Promise, and all of these simultaneous workers wait on the single backend request.

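Here is a minimal sketch of this idea, using std::shared_future in place of folly’s SharedPromise and ignoring error handling and cache eviction:

    #include <future>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    std::string fetch_from_backend(const std::string& key) {
        return "value-for-" + key;  // imagine a slow RPC here
    }

    std::mutex cache_mu;
    // The cache holds a future for each key, not the raw value.
    std::unordered_map<std::string, std::shared_future<std::string>> cache;

    std::string handle_request(const std::string& key) {
        std::promise<std::string> promise;  // only used if we are first
        std::shared_future<std::string> future;
        bool is_leader = false;
        {
            std::lock_guard<std::mutex> lock(cache_mu);
            auto it = cache.find(key);
            if (it != cache.end()) {
                future = it->second;             // a Promise is already cached
            } else {
                future = promise.get_future().share();
                cache.emplace(key, future);      // publish before fetching
                is_leader = true;
            }
        }
        if (is_leader) {
            // Exactly one request reaches the backend; the rest wait below.
            promise.set_value(fetch_from_backend(key));
        }
        return future.get();  // immediate on a hit; blocks until the leader fills it
    }

A production version would also need to deal with a failing fetch (for example by setting an exception on the promise and evicting the entry, so waiters neither hang nor cache a permanent failure) and with how long a fulfilled Promise should stay cached.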

[Diagram: concurrent requests all find the cached Promise and wait on it, producing a single backend request]

The net effect is that you maintain your assumptions around request caching. If your request distribution has the property that 90% of requests are cacheable, then you maintain that ratio of traffic to your backend even when something new happens or your service is restarted.

At Instagram, most of our C++ services are implemented with folly::Future. It provides a SharedPromise abstraction that simplifies the implementation of the behavior above. By caching this Promise instead of just a raw value, you can improve the cold-cache behavior of your service.

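With folly itself the mechanics are close to a one-liner. A hedged sketch follows; the method names getSemiFuture and setValue are as I understand recent folly releases, so check them against your version:

    #include <folly/futures/SharedPromise.h>

    #include <string>
    #include <utility>

    int main() {
        folly::SharedPromise<std::string> sp;

        // Many concurrent cache readers can each take their own future...
        auto f1 = sp.getSemiFuture();
        auto f2 = sp.getSemiFuture();

        // ...and the single backend fetch fulfills all of them at once.
        sp.setValue(std::string("value-from-backend"));

        auto v1 = std::move(f1).get();
        auto v2 = std::move(f2).get();
        (void)v1;
        (void)v2;
    }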

We manage so many different services at Instagram that reducing the amount of traffic to our servers via Promise-based caches has real stability benefits.

Translated from: https://instagram-engineering.com/thundering-herds-promises-82191c8af57d
