“Without accumulating small steps, one cannot journey a thousand miles.”
If a Eureka client goes offline deliberately, it can call the shutdown() method to deregister its instance gracefully. In real production, however, developers rarely send an explicit request to take an instance offline; far more often, a service simply crashes. For that scenario, Eureka uses a failure-detection mechanism to sense the dead instance and then evict it from the registry.
How does Eureka detect failures? Through heartbeats.
Every Eureka client periodically sends a heartbeat to the server. The server records each service instance and the time of its last heartbeat, and a background scheduled task sweeps the registry: if an instance has gone longer than a threshold without sending a heartbeat, the server considers it dead and evicts it.
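The mechanism described above can be sketched in a few lines of plain Java. This is my own illustrative code, not Eureka's; the class name and the expiry threshold are made up, though renew() deliberately mirrors Eureka's terminology:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of heartbeat-based failure detection: clients stamp a
// timestamp on each heartbeat, and a periodic sweep evicts any instance
// whose last heartbeat is older than the configured expiry window.
public class HeartbeatSketch {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long expiryMs;

    public HeartbeatSketch(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    // Called when a client sends a heartbeat (Eureka calls this "renew").
    public void renew(String instanceId, long nowMs) {
        lastHeartbeat.put(instanceId, nowMs);
    }

    // Called by the background sweep: drop instances that went silent.
    // Returns how many instances were evicted in this pass.
    public int evictExpired(long nowMs) {
        int evicted = 0;
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() > expiryMs) {
                lastHeartbeat.remove(e.getKey());
                evicted++;
            }
        }
        return evicted;
    }

    public int size() {
        return lastHeartbeat.size();
    }
}
```

The real implementation is more involved (leases, self-preservation, randomized batches, as we will see below), but this is the core idea: a map of last-heartbeat timestamps plus a scheduled sweep.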
The relevant code lives in the Eureka server initialization logic we analyzed earlier: EurekaBootStrap#initEurekaServerContext().
Since checking whether instances have crashed is most likely registry-related, we search the registry-related methods and eventually find it inside a method called openForTraffic(). What a name...
PeerAwareInstanceRegistry registry;
... ...
registry.openForTraffic(applicationInfoManager, registryCount);
Looking at this so-called openForTraffic() method, the instance-failure check turns out to hide in its very last line, super.postInit(). More of this naming; I have run out of complaints...
@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    this.expectedNumberOfClientsSendingRenews = count;
    updateRenewsPerMinThreshold();
    logger.info("Got {} instances from neighboring DS node", count);
    logger.info("Renew threshold is: {}", numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    super.postInit();
}
postInit() submits an EvictionTask to a Timer scheduler, to be executed periodically; the default interval is 60s.
protected void postInit() {
    renewsLastMin.start();
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    evictionTaskRef.set(new EvictionTask());
    evictionTimer.schedule(evictionTaskRef.get(),
            // 60s by default
            serverConfig.getEvictionIntervalTimerInMs(),
            serverConfig.getEvictionIntervalTimerInMs());
}
@Override
public long getEvictionIntervalTimerInMs() {
    return configInstance.getLongProperty(
            namespace + "evictionIntervalTimerInMs", (60 * 1000)).get();
}
This so-called EvictionTask is the failure-detection task: it cleans out of the registry any service instances that have gone too long without sending a heartbeat, i.e. the instances that have crashed.
/* visible for testing */ class EvictionTask extends TimerTask {

    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0L);

    @Override
    public void run() {
        try {
            // obtain the compensation time
            long compensationTimeMs = getCompensationTimeMs();
            logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
            evict(compensationTimeMs);
        } catch (Throwable e) {
            logger.error("Could not run the evict task", e);
        }
    }
There is a getCompensationTimeMs() call here that fetches a "compensation time". What does that mean? Look at the code first.
It takes the timestamp of the current eviction run and the timestamp of the previous run, computes the elapsed milliseconds between them via TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos), then subtracts the configured interval, serverConfig.getEvictionIntervalTimerInMs() (60s by default). If the result is greater than 0, that surplus is returned as the compensation time. Think about it for a moment.
What is this combination for? Honestly, it is a nice bit of design. It guards against the scheduled task running later than planned (the Javadoc cites clock skew or GC pauses as examples). In effect, it compensates the lease-expiry check for any delay between two consecutive runs, so instances are not evicted just because the sweep itself was late.
/**
 * compute a compensation time defined as the actual time this task was executed since the prev iteration,
 * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
 * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
 * according to the configured cycle.
 */
long getCompensationTimeMs() {
    long currNanos = getCurrentTimeNano();
    long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
    if (lastNanos == 0L) {
        return 0L;
    }
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
    long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
    return compensationTime <= 0L ? 0L : compensationTime;
}
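To make the compensation arithmetic concrete, here is a standalone sketch of the same calculation. The class is my own; only the arithmetic mirrors getCompensationTimeMs():

```java
import java.util.concurrent.TimeUnit;

public class CompensationDemo {
    // Eviction interval, 60s by default in Eureka.
    static final long INTERVAL_MS = 60_000L;

    // How many milliseconds the current run is late, relative to the
    // configured interval between two consecutive eviction runs.
    static long compensationMs(long lastNanos, long currNanos) {
        if (lastNanos == 0L) {
            return 0L; // first run: nothing to compensate
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        long compensation = elapsedMs - INTERVAL_MS;
        return compensation <= 0L ? 0L : compensation;
    }
}
```

If the previous sweep ran 75s ago instead of the configured 60s (say, because of a 15s GC pause), this returns 15,000ms; that surplus is then passed into evict() as additionalLeaseMs, extending every lease's expiry check by the same amount.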
Now for the core of it all, the evict() method: it decides whether each instance is down and, if so, cleans it up.
public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");

    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }

    // We collect first all expired items, to evict them in random order. For large eviction sets,
    // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
    // the impact should be evenly distributed across all applications.
    List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
    for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
        Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
        if (leaseMap != null) {
            for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                Lease<InstanceInfo> lease = leaseEntry.getValue();
                // check whether the lease has expired; if so, collect it for eviction
                if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                    expiredLeases.add(lease);
                }
            }
        }
    }

    // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
    // triggering self-preservation. Without that we would wipe out full registry.
    int registrySize = (int) getLocalRegistrySize();
    int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
    int evictionLimit = registrySize - registrySizeThreshold;

    int toEvict = Math.min(expiredLeases.size(), evictionLimit);
    if (toEvict > 0) {
        logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

        Random random = new Random(System.currentTimeMillis());
        for (int i = 0; i < toEvict; i++) {
            // Pick a random item (Knuth shuffle algorithm)
            int next = i + random.nextInt(expiredLeases.size() - i);
            Collections.swap(expiredLeases, i, next);
            Lease<InstanceInfo> lease = expiredLeases.get(i);

            String appName = lease.getHolder().getAppName();
            String id = lease.getHolder().getId();
            EXPIRED.increment();
            logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
            internalCancel(appName, id, false);
        }
    }
}
First it iterates over every service instance in the registry and calls lease.isExpired(additionalLeaseMs) to decide whether that instance's lease has expired. Each failed instance is collected into a list: expiredLeases.add(lease);
/**
 * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
 *
 * Note that due to renew() doing the 'wrong' thing and setting lastUpdateTimestamp to +duration more than
 * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
 * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
 * not be fixed.
 *
 * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
 */
public boolean isExpired(long additionalLeaseMs) {
    return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}
evictionTimestamp > 0: on the first check this will not be greater than 0, because the instance has not been marked expired yet.
System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs): this checks whether the current time is past the expiry time (last heartbeat time + duration + compensation time); if so, it returns true, meaning the instance's lease has expired.
Did you notice something interesting in the Javadoc above? It admits that renew() has a bug: it adds an extra duration to the renewal timestamp, so the computed expiry time also contains that extra duration. In other words, an instance is evicted only after going 2 * duration without a heartbeat (the default duration is 90s, as we saw when analyzing renewals). Come on: if you added an extra 90s over there, you ought to subtract 90s over here. Yet this bug remains unfixed to this day...
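The effect of that bug can be demonstrated with a toy model. This is illustrative code, not the real Lease class; DURATION_MS stands in for the lease's duration field:

```java
public class LeaseExpiryDemo {
    // Default lease duration in Eureka: 90s.
    static final long DURATION_MS = 90_000L;

    // Mirrors Lease.renew()'s "wrong" behavior: lastUpdateTimestamp is set
    // to now + duration instead of now.
    static long renewTimestamp(long nowMs) {
        return nowMs + DURATION_MS;
    }

    // Mirrors Lease.isExpired() with no compensation time: expired when the
    // current time passes lastUpdateTimestamp + duration.
    static boolean isExpired(long nowMs, long lastUpdateTimestamp) {
        return nowMs > lastUpdateTimestamp + DURATION_MS;
    }
}
```

A heartbeat at t = 0 records lastUpdateTimestamp = 90,000, so isExpired() only turns true once now exceeds 180,000, i.e. 2 * duration after the last heartbeat, exactly as the Javadoc warns.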
Let's move on to the second half of the evict() method:
// To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
// triggering self-preservation. Without that we would wipe out full registry.
int registrySize = (int) getLocalRegistrySize();
int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
int evictionLimit = registrySize - registrySizeThreshold;

int toEvict = Math.min(expiredLeases.size(), evictionLimit);
if (toEvict > 0) {
    logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

    Random random = new Random(System.currentTimeMillis());
    for (int i = 0; i < toEvict; i++) {
        // Pick a random item (Knuth shuffle algorithm)
        int next = i + random.nextInt(expiredLeases.size() - i);
        Collections.swap(expiredLeases, i, next);
        Lease<InstanceInfo> lease = expiredLeases.get(i);

        String appName = lease.getHolder().getAppName();
        String id = lease.getHolder().getId();
        EXPIRED.increment();
        logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
        internalCancel(appName, id, false);
    }
}
The default value of serverConfig.getRenewalPercentThreshold() is 0.85.
That means that, by default, Eureka does not evict every failed instance in one go: each pass removes at most 15% of the instances in the registry. Whatever is left over gets evicted the next time EvictionTask runs. In other words, eviction happens in batches.
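The batch math is easy to verify with a small sketch. This is my own helper; it mirrors only the three lines of arithmetic in evict():

```java
public class EvictionLimitDemo {
    // Same arithmetic as evict(): at most (1 - renewalPercentThreshold) of
    // the registry may be evicted per pass; the default threshold is 0.85.
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    // The number actually evicted: capped by both the number of expired
    // leases and the per-pass limit.
    static int toEvict(int expiredCount, int registrySize, double threshold) {
        return Math.min(expiredCount, evictionLimit(registrySize, threshold));
    }
}
```

With 20 registered instances and 10 expired leases, the limit is 20 - (int)(20 * 0.85) = 3, so only 3 instances are evicted in this pass; the remaining 7 wait for the next run of EvictionTask.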
You can also see from the code that evicting an instance simply calls the deregistration method, internalCancel(), which we already analyzed in the earlier article on service deregistration. And that is the whole failure-eviction mechanism.