Better Retries with Exponential Backoff and Jitter

dfz54668

Original link: http://www.cnblogs.com/liululee/p/11569565.html

1. Overview

In this tutorial, we'll explore how to improve client retries with two strategies: exponential backoff and jitter.

2. Retry

In a distributed system, network communication between components can fail at any time.

Client applications deal with these failures by implementing retries.

Suppose we have a client application that calls a remote service, PingPongService.

interface PingPongService {
    String call(String ping) throws PingPongServiceException;
}

If PingPongService throws a PingPongServiceException, the client application must retry. In the following sections, we'll look at ways to implement client retries.
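To make the idea of a retry concrete before bringing in a library, here is a minimal hand-written sketch. The class name SimpleRetry and the method callWithRetry are hypothetical and not part of the original example, and the sketch assumes failures surface as unchecked exceptions. It retries immediately, with no wait between attempts, which is the behavior the later sections improve on.

```java
import java.util.function.Supplier;

// Hypothetical helper, for illustration only.
final class SimpleRetry {

    // Runs the task up to maxAttempts times and returns the first successful result.
    // There is no wait between attempts; if every attempt fails, the last failure is rethrown.
    static <T> T callWithRetry(Supplier<T> task, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

A call such as SimpleRetry.callWithRetry(() -> service.call("Hello"), 3) would then try the remote call at most three times, assuming PingPongServiceException is unchecked.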

3. Resilience4j Retry

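For our example, we'll use the Resilience4j library, specifically its retry module. We need to add the resilience4j-retry module to our pom.xml. The snippet omits the version element, so it assumes the version is supplied elsewhere, for example through dependency management: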

<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-retry</artifactId>
</dependency>

For a refresher on retries, don't forget to check out our guide to Resilience4j.

4. Exponential Backoff

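Client applications must implement retries responsibly. When clients retry failed calls without waiting, they may overwhelm the system and further degrade a service that is already in trouble.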

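Exponential backoff is a common strategy for retrying failed network calls. Put simply, the client waits progressively longer between consecutive retries: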

wait_interval = base * multiplier^n

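where:

base is the initial interval, that is, the wait before the first retry
n is the number of failures that occurred before the current one, so n = 0 for the first retry
multiplier is an arbitrary multiplier that can be replaced with any suitable value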

With this approach, we give the system breathing room to recover from intermittent failures or from more serious problems.
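As a worked example of the formula, the sketch below computes the first few wait intervals. The class and method names are hypothetical, the values base = 1000 ms and multiplier = 2 are chosen purely for illustration, and n is assumed to start at 0 so that the first wait equals base.

```java
// Hypothetical helper, for illustration only.
final class BackoffIntervals {

    // wait_interval = base * multiplier^n, rounded to whole milliseconds.
    static long waitIntervalMillis(long baseMillis, double multiplier, int n) {
        return Math.round(baseMillis * Math.pow(multiplier, n));
    }

    public static void main(String[] args) {
        // Assumed values: base = 1000 ms, multiplier = 2.
        for (int n = 0; n < 4; n++) {
            System.out.println("n = " + n + ": wait "
                + waitIntervalMillis(1000, 2.0, n) + " ms");
        }
    }
}
```

With these assumed values the waits are 1000, 2000, 4000 and 8000 ms, so each wait is twice as long as the one before it.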

We can use the exponential backoff algorithm in Resilience4j Retry by configuring its IntervalFunction, which accepts an initialInterval and a multiplier.

The retry mechanism uses the IntervalFunction as its sleep function:

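// Each wait is MULTIPLIER times the previous one, starting at INITIAL_INTERVAL
IntervalFunction intervalFn =
  IntervalFunction.ofExponentialBackoff(INITIAL_INTERVAL, MULTIPLIER);

// maxAttempts counts the initial call as the first attempt
RetryConfig retryConfig = RetryConfig.custom()
  .maxAttempts(MAX_RETRIES)
  .intervalFunction(intervalFn)
  .build();
Retry retry = Retry.of("pingpong", retryConfig);

// Wrap the remote call so that failures are retried according to retryConfig.
// A Function lambda cannot throw checked exceptions, so this assumes
// PingPongServiceException is unchecked.
Function<String, String> pingPongFn = Retry
    .decorateFunction(retry, ping -> service.call(ping));
pingPongFn.apply("Hello");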
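To show what such a configuration amounts to conceptually, here is a hand-written retry loop with exponential backoff. It is a sketch under stated assumptions and not Resilience4j's implementation: the names BackoffRetry and callWithBackoff are hypothetical, failures are assumed to be unchecked exceptions, and the sleep is passed in as a parameter so that the waits can be observed without actually blocking.

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical helper, for illustration only; not how Resilience4j is implemented.
final class BackoffRetry {

    // Calls the task up to maxAttempts times. After the n-th failure (counting from 0)
    // it waits base * multiplier^n milliseconds before the next attempt.
    static <T> T callWithBackoff(Supplier<T> task, int maxAttempts,
                                 long baseMillis, double multiplier, LongConsumer sleeper) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int failures = 0; ; failures++) {
            try {
                return task.get();
            } catch (RuntimeException e) {
                if (failures + 1 >= maxAttempts) {
                    throw e; // out of attempts, give up
                }
                sleeper.accept(Math.round(baseMillis * Math.pow(multiplier, failures)));
            }
        }
    }

    // A sleeper that really blocks the calling thread.
    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while backing off", e);
        }
    }
}
```

In real use the sleeper would be BackoffRetry::sleep. With four attempts, a base of 1000 ms and a multiplier of 2, a call that keeps failing waits 1000, 2000 and 4000 ms between attempts.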

Let's simulate a real-world scenario and assume that several clients call PingPongService at the same time:

// Assumes static imports of Executors.newFixedThreadPool and Collections.nCopies
ExecutorService executors = newFixedThreadPool(NUM_CONCURRENT_CLIENTS);
List<Callable<String>> tasks = nCopies(NUM_CONCURRENT_CLIENTS, () -> pingPongFn.apply("Hello"));
executors.invokeAll(tasks);

Let's look at the remote call logs for NUM_CONCURRENT_CLIENTS = 4:

[thread-1] At 00:37:42.756
[thread-2] At 00:37:42.756
[thread-3] At 00:37:42.756
[thread-4] At 00:37:42.756

[thread-2] At 00:37:43.802
[thread-4] At 00:37:43.802
[thread-1] At 00:37:43.802
[thread-3] At 00:37:43.802

[thread-2] At 00:37:45.803
[thread-1] At 00:37:45.803
[thread-4] At 00:37:45.803
[thread-3] At 00:37:45.803

[thread-2] At 00:37:49.808
[thread-3] At 00:37:49.808
[thread-4] At 00:37:49.808
[thread-1] At 00:37:49.808

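We can see a clear pattern here: the clients wait for exponentially growing intervals, but on every retry they all call the remote service at exactly the same moment (a collision).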

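We have addressed only part of the problem: we no longer hammer the remote service with immediate retries, but instead of spreading the workload out over time, we have interleaved bursts of work with longer idle periods. This behavior is akin to the thundering herd problem.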
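The synchronization follows directly from the formula: clients that start together and use the same deterministic backoff compute identical schedules. The sketch below illustrates this with hypothetical names and assumed values (base = 1000 ms, multiplier = 2, four attempts), which roughly match the gaps of about 1, 2 and 4 seconds in the log above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, for illustration only.
final class SynchronizedBackoffDemo {

    // Call times in milliseconds since startMillis for one client that fails every time:
    // the first call happens at startMillis, and retry n follows after base * multiplier^n.
    static List<Long> callTimes(long startMillis, long baseMillis, double multiplier, int attempts) {
        List<Long> times = new ArrayList<>();
        long t = startMillis;
        for (int n = 0; n < attempts; n++) {
            times.add(t);
            t += Math.round(baseMillis * Math.pow(multiplier, n));
        }
        return times;
    }

    public static void main(String[] args) {
        // Every client starts at time 0 and uses the same assumed parameters.
        for (int client = 1; client <= 4; client++) {
            System.out.println("client-" + client + " calls at "
                + callTimes(0, 1000, 2.0, 4));
        }
    }
}
```

Every client gets the schedule [0, 1000, 3000, 7000], so all four calls land on the service at the same instant in each round.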

5. Introducing Jitter

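In our previous approach, the clients' waits grew progressively longer but were still synchronized. Adding jitter provides a way to break the synchronization across clients and thereby avoid collisions. In this approach, we add randomness to the wait intervals.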

wait_interval = (base * multiplier^n) +/- (random_interval)

where random_interval is added (or subtracted) to break the synchronization across clients.

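We won't go into the mechanics of computing the random interval, but the randomization must space out the spikes into a much smoother distribution of client calls.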
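For readers who want something concrete anyway, one simple possibility is to draw the wait uniformly from a window around the exponential interval. The sketch below does that; it is an assumption made for illustration, not necessarily what any particular library does, and the names JitteredBackoff, jitteredWaitMillis and factor are hypothetical.

```java
import java.util.Random;

// Hypothetical helper, for illustration only.
final class JitteredBackoff {

    // Returns a wait drawn uniformly from
    // [interval * (1 - factor), interval * (1 + factor)],
    // where interval = base * multiplier^n and factor lies between 0 and 1.
    static long jitteredWaitMillis(long baseMillis, double multiplier, int n,
                                   double factor, Random random) {
        if (factor < 0.0 || factor > 1.0) {
            throw new IllegalArgumentException("factor must be between 0 and 1");
        }
        double interval = baseMillis * Math.pow(multiplier, n);
        double min = interval * (1 - factor);
        double max = interval * (1 + factor);
        return Math.round(min + random.nextDouble() * (max - min));
    }
}
```

With base = 1000 ms, multiplier = 2 and factor = 0.5, the first retry waits somewhere between 500 and 1500 ms and the second between 1000 and 3000 ms. A factor of 0 reduces this to plain exponential backoff.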

We can use exponential backoff with jitter in Resilience4j Retry by configuring an exponential random backoff IntervalFunction, which also accepts a randomizationFactor:

IntervalFunction intervalFn = 
  IntervalFunction.ofExponentialRandomBackoff(INITIAL_INTERVAL, MULTIPLIER, RANDOMIZATION_FACTOR);

Let's return to our real-world scenario and look at the remote call logs with jitter:

[thread-2] At 39:21.297
[thread-4] At 39:21.297
[thread-3] At 39:21.297
[thread-1] At 39:21.297

[thread-2] At 39:21.918
[thread-3] At 39:21.868
[thread-4] At 39:22.011
[thread-1] At 39:22.184

[thread-1] At 39:23.086
[thread-5] At 39:23.939
[thread-3] At 39:24.152
[thread-4] At 39:24.977

[thread-3] At 39:26.861
[thread-1] At 39:28.617
[thread-4] At 39:28.942
[thread-2] At 39:31.039

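Now we have a much better spread. We have eliminated both the collisions and the idle time, and we end up with an almost constant rate of client calls, apart from the initial surge.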
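The same effect can be reproduced with a small simulation. This is a sketch with hypothetical names and assumed values (base = 1000 ms, multiplier = 2, a randomization window of plus or minus 50%), and a separate seeded Random per client stands in for independent randomness. It does not reproduce the log above; it only shows how the schedules drift apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical simulation, for illustration only.
final class JitteredScheduleDemo {

    // Call times in milliseconds for one client that fails every time. Each wait is drawn
    // uniformly from [interval * (1 - factor), interval * (1 + factor)].
    static List<Long> callTimes(long baseMillis, double multiplier, double factor,
                                int attempts, Random random) {
        List<Long> times = new ArrayList<>();
        long t = 0;
        for (int n = 0; n < attempts; n++) {
            times.add(t);
            double interval = baseMillis * Math.pow(multiplier, n);
            double min = interval * (1 - factor);
            double max = interval * (1 + factor);
            t += Math.round(min + random.nextDouble() * (max - min));
        }
        return times;
    }

    public static void main(String[] args) {
        for (int client = 1; client <= 4; client++) {
            // A different seed per client stands in for independent randomness.
            System.out.println("client-" + client + " calls at "
                + callTimes(1000, 2.0, 0.5, 4, new Random(client)));
        }
    }
}
```

All clients still make their first call at time 0, which corresponds to the initial surge, but each later call falls at a different point inside its window instead of on a shared instant.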