Eureka Source Code Deep Dive (9): How Eureka Server Detects and Evicts Failed Service Instances

"Without accumulating small steps, one cannot travel a thousand miles."

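When a eureka client is shut down deliberately, it can call shutdown() to deregister its service instance gracefully. In real production, though, we developers rarely send a request to take an instance offline ourselves; far more often a service simply fails and crashes.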

To handle a crash like that, eureka relies on a failure detection mechanism to notice the failure and then evict the failed instance.

How does eureka detect a failure? With heartbeats.

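Every eureka client periodically sends a heartbeat to the server, and the server records the time of the latest heartbeat for each service instance. A background scheduled task polls those records; if an instance has not sent a heartbeat for longer than a threshold, the server considers it failed and evicts it.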
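To make the idea concrete before reading the real code, here is a minimal sketch of heartbeat-based failure detection. It is a toy model, not Eureka's implementation: the class `HeartbeatTable` and its methods `heartbeat`, `sweep` and `contains` are hypothetical names made up for illustration, and the current time is passed in explicitly so the behavior is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of heartbeat-based failure detection (hypothetical names, not Eureka's classes). */
final class HeartbeatTable {
    private final Map<String, Long> lastHeartbeatMs = new ConcurrentHashMap<>();
    private final long timeoutMs;

    HeartbeatTable(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called whenever a client heartbeat arrives: remember when we last heard from it. */
    void heartbeat(String instanceId, long nowMs) {
        lastHeartbeatMs.put(instanceId, nowMs);
    }

    /** Called periodically by a background task: remove and return every instance silent for too long. */
    List<String> sweep(long nowMs) {
        List<String> evicted = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeatMs.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                evicted.add(entry.getKey());
            }
        }
        evicted.forEach(lastHeartbeatMs::remove);
        return evicted;
    }

    boolean contains(String instanceId) {
        return lastHeartbeatMs.containsKey(instanceId);
    }

    public static void main(String[] args) {
        HeartbeatTable table = new HeartbeatTable(90_000L);
        table.heartbeat("order-service-1", 0L);
        table.heartbeat("user-service-1", 0L);
        table.heartbeat("user-service-1", 60_000L); // only this instance keeps renewing
        System.out.println(table.sweep(100_000L));  // order-service-1 has been silent for 100s
    }
}
```

In this run only order-service-1 is swept, because user-service-1 renewed 40s before the sweep. The real server follows the same shape, with a lease per instance and a timer-driven eviction task.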

The code for this lives in the eureka server initialization logic we analyzed earlier, EurekaBootStrap#initEurekaServerContext().

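Checking whether a service instance is down most likely has something to do with the instance registry, so I looked through the registry-related calls and finally found it inside a method called openForTraffic(). That name... ...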

PeerAwareInstanceRegistry registry;
... ...
registry.openForTraffic(applicationInfoManager, registryCount);

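Looking at this openForTraffic method, the failure detection mechanism turns out to be started by the very last line, super.postInit(). Another name like that; I have given up complaining... ...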

@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    this.expectedNumberOfClientsSendingRenews = count;
    updateRenewsPerMinThreshold();
    logger.info("Got {} instances from neighboring DS node", count);
    logger.info("Renew threshold is: {}", numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    super.postInit();
}

postInit() puts an EvictionTask into a Timer scheduler that runs it periodically; the default interval is 60s.

protected void postInit() {
    renewsLastMin.start();
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    evictionTaskRef.set(new EvictionTask());
    evictionTimer.schedule(evictionTaskRef.get(),
                           // 60s by default
                           serverConfig.getEvictionIntervalTimerInMs(),
                           serverConfig.getEvictionIntervalTimerInMs());
}
@Override
public long getEvictionIntervalTimerInMs() {
    return configInstance.getLongProperty(
        namespace + "evictionIntervalTimerInMs", (60 * 1000)).get();
}
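The cancel-then-reschedule pattern in postInit() can be tried in isolation. The sketch below is an assumption-laden miniature, not Eureka code: `ReschedulableTimer` and its methods are hypothetical names, and only `java.util.Timer`, `TimerTask` and `AtomicReference` are real JDK classes. One detail worth knowing is that `Timer.schedule(task, delay, period)` is fixed-delay scheduling, so a run that starts late pushes later runs back as well.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Miniature of the cancel-then-reschedule pattern (hypothetical class, not part of Eureka). */
final class ReschedulableTimer {
    private final Timer timer = new Timer("sketch-eviction-timer", true); // daemon thread
    private final AtomicReference<TimerTask> taskRef = new AtomicReference<>();

    /** Cancels the previously scheduled task, if any, then schedules a fresh fixed-delay task. */
    void schedule(Runnable body, long intervalMs) {
        TimerTask previous = taskRef.get();
        if (previous != null) {
            previous.cancel(); // a cancelled TimerTask will not start any further runs
        }
        TimerTask task = new TimerTask() {
            @Override
            public void run() {
                body.run();
            }
        };
        taskRef.set(task);
        // First run after intervalMs, then again intervalMs after each run starts.
        timer.schedule(task, intervalMs, intervalMs);
    }

    void shutdown() {
        timer.cancel();
    }

    public static void main(String[] args) throws InterruptedException {
        ReschedulableTimer timer = new ReschedulableTimer();
        AtomicInteger first = new AtomicInteger();
        AtomicInteger second = new AtomicInteger();
        timer.schedule(first::incrementAndGet, 20);
        Thread.sleep(200);
        timer.schedule(second::incrementAndGet, 20); // replaces the first task
        Thread.sleep(200);
        System.out.println("first=" + first.get() + ", second=" + second.get());
        timer.shutdown();
    }
}
```

The exact counts printed depend on thread scheduling, which is precisely the kind of jitter the compensation time discussed next is meant to absorb.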

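This EvictionTask is the failure detection task: it cleans out of the registry the service instances that have not sent a heartbeat for a long time, that is, the instances that have failed and crashed.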

/* visible for testing */ class EvictionTask extends TimerTask {

    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);

    @Override
    public void run() {
        try {
            //get the compensation time
            long compensationTimeMs = getCompensationTimeMs();
            logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
            evict(compensationTimeMs);
        } catch (Throwable e) {
            logger.error("Could not run the evict task", e);
        }
    }

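There is a getCompensationTimeMs() call here that fetches a compensation time. What does that mean? Let's look at the code first.

It first takes the timestamp of the current run of the eviction task and the timestamp of the previous run.

It then computes the interval between the two in milliseconds with TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos).

Finally it subtracts the configured serverConfig.getEvictionIntervalTimerInMs(), 60s by default, from that number of milliseconds.

If the result is greater than 0, that compensation time is returned; otherwise 0 is returned.

What is all this for? To be fair, this part is nicely designed. The concern is that the scheduled task may run later than planned, for example because of clock skew or a GC pause, as the Javadoc says. When the gap between two runs stretches beyond the configured 60s, the overshoot is passed to evict() as extra lease time, so instances are not judged expired merely because the eviction task itself ran late.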

/**
 * compute a compensation time defined as the actual time this task was executed since the prev iteration,
 * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
 * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
 * according to the configured cycle.
 */
long getCompensationTimeMs() {
    long currNanos = getCurrentTimeNano();
    long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
    if (lastNanos == 0l) {
        return 0l;
    }

    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
    long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
    return compensationTime <= 0l ? 0l : compensationTime;
}
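A few concrete numbers make the compensation time easier to feel. The sketch below restates the same arithmetic as a pure function so it can be run with made-up timestamps; `CompensationDemo` and `compensationMs` are hypothetical names, and the 60s interval is the default quoted above.

```java
import java.util.concurrent.TimeUnit;

/** Stand-alone restatement of the compensation arithmetic (hypothetical helper, not Eureka code). */
final class CompensationDemo {

    /** Returns how many milliseconds later than intervalMs the current run started, never negative. */
    static long compensationMs(long lastNanos, long currNanos, long intervalMs) {
        if (lastNanos == 0L) {
            return 0L; // first run: there is no previous run to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    public static void main(String[] args) {
        long intervalMs = 60_000L;
        long t1 = TimeUnit.SECONDS.toNanos(1);

        // First run ever.
        System.out.println(compensationMs(0L, t1, intervalMs));
        // Next run exactly 60s later: on time, nothing to compensate.
        System.out.println(compensationMs(t1, TimeUnit.SECONDS.toNanos(61), intervalMs));
        // Next run 65s later, e.g. after a 5s GC pause: 5000 ms of compensation.
        System.out.println(compensationMs(t1, TimeUnit.SECONDS.toNanos(66), intervalMs));
    }
}
```

The three calls yield 0, 0 and 5000. In the last case every lease effectively gets 5 extra seconds in that round, which cancels out the delay of the eviction task itself.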

Now let's look at the core evict() method, which decides whether an instance is down and evicts it if so.

public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");

    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }

    // We collect first all expired items, to evict them in random order. For large eviction sets,
    // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
    // the impact should be evenly distributed across all applications.
    List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
    for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
        Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
        if (leaseMap != null) {
            for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                Lease<InstanceInfo> lease = leaseEntry.getValue();
                //check whether the lease has expired; expired leases are collected for eviction
                if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                    expiredLeases.add(lease);
                }
            }
        }
    }

    // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
    // triggering self-preservation. Without that we would wipe out full registry.
    int registrySize = (int) getLocalRegistrySize();
    int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
    int evictionLimit = registrySize - registrySizeThreshold;

    int toEvict = Math.min(expiredLeases.size(), evictionLimit);
    if (toEvict > 0) {
        logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

        Random random = new Random(System.currentTimeMillis());
        for (int i = 0; i < toEvict; i++) {
            // Pick a random item (Knuth shuffle algorithm)
            int next = i + random.nextInt(expiredLeases.size() - i);
            Collections.swap(expiredLeases, i, next);
            Lease<InstanceInfo> lease = expiredLeases.get(i);

            String appName = lease.getHolder().getAppName();
            String id = lease.getHolder().getId();
            EXPIRED.increment();
            logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
            internalCancel(appName, id, false);
        }
    }
}

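It first iterates over every service instance in the registry and calls lease.isExpired(additionalLeaseMs)

to decide whether that instance's lease has expired. An expired lease means a failed instance, which is added to a list: expiredLeases.add(lease);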

 /**
     * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
     *
     * Note that due to renew() doing the 'wrong" thing and setting lastUpdateTimestamp to +duration more than
     * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
     * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
     * not be fixed.
     *
     * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
     */
    public boolean isExpired(long additionalLeaseMs) {
        return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
    }

evictionTimestamp > 0: this is not true the first time the check runs, because the lease has not been evicted yet.

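System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs): this checks whether the current time is later than the expiration time (last heartbeat time + duration + compensation time). If it is, the method returns true, meaning the instance's lease has expired.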

Did you notice something interesting in the Javadoc of the code above?

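It says that renew() has a bug: it sets lastUpdateTimestamp one duration later than it should. isExpired() then adds duration again when it computes the expiration time, so an instance is evicted only after it has sent no heartbeat for duration * 2 (the default duration is 90s, as we saw when analyzing lease renewal), that is, 180s by default!!! Come on, if you added an extra 90s over there, you could have left it out here. And by the Javadoc's own admission, the bug will not be fixed because of the possible impact on existing usage... ...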
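The doubling is easy to verify with a stripped-down model of the lease. This is a sketch built only from what the Javadoc and isExpired() above tell us: `LeaseModel` is a hypothetical class, it ignores evictionTimestamp, it takes the current time as a parameter instead of calling System.currentTimeMillis(), and it assumes a freshly registered lease stores the plain registration time.

```java
/** Stripped-down lease model for checking the 2 * duration effect (hypothetical, not Eureka's Lease). */
final class LeaseModel {
    static final long DURATION_MS = 90_000L; // default lease duration quoted in the article

    private long lastUpdateTimestamp;

    /** Assumption of this model: registration stores the plain registration time. */
    LeaseModel(long registeredAtMs) {
        this.lastUpdateTimestamp = registeredAtMs;
    }

    /** As the Javadoc describes: renew() stores "now + duration" rather than "now". */
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + DURATION_MS;
    }

    /** Same comparison as isExpired(), minus the evictionTimestamp check. */
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return nowMs > lastUpdateTimestamp + DURATION_MS + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseModel lease = new LeaseModel(0L);
        lease.renew(0L); // last heartbeat at t = 0, then the instance crashes

        System.out.println(lease.isExpired(90_001L, 0L));  // false: 90s have passed, still not expired
        System.out.println(lease.isExpired(180_000L, 0L)); // false: exactly 2 * duration is not "greater than"
        System.out.println(lease.isExpired(180_001L, 0L)); // true: expired only after 2 * duration
    }
}
```

So with the default settings, a crashed instance whose last heartbeat was a renewal stays in the registry for at least 180s, plus however long it takes for the next eviction run to come around.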

Let's move on to the second half of the evict method.

// To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
// triggering self-preservation. Without that we would wipe out full registry.
int registrySize = (int) getLocalRegistrySize();
int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
int evictionLimit = registrySize - registrySizeThreshold;

int toEvict = Math.min(expiredLeases.size(), evictionLimit);
if (toEvict > 0) {
    logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

    Random random = new Random(System.currentTimeMillis());
    for (int i = 0; i < toEvict; i++) {
        // Pick a random item (Knuth shuffle algorithm)
        int next = i + random.nextInt(expiredLeases.size() - i);
        Collections.swap(expiredLeases, i, next);
        Lease<InstanceInfo> lease = expiredLeases.get(i);

        String appName = lease.getHolder().getAppName();
        String id = lease.getHolder().getId();
        EXPIRED.increment();
        logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
        internalCancel(appName, id, false);
    }
}

The default value of serverConfig.getRenewalPercentThreshold() is 0.85.

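In other words, by default the failed instances are not all evicted at once: each run evicts at most 15% of the instances in the registry. Whatever is left over is evicted the next time EvictionTask runs, so eviction happens in batches.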
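The batching is worth working through with numbers. The sketch below replays only the evictionLimit arithmetic from evict() over several rounds; `BatchEvictionDemo` and `simulate` are hypothetical names, and the model assumes no instance registers or renews between rounds and that lease expiration stays enabled throughout.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays the evictionLimit arithmetic round by round (hypothetical helper, not Eureka code). */
final class BatchEvictionDemo {

    /** Returns how many instances each eviction round removes until no expired lease remains. */
    static List<Integer> simulate(int registrySize, int expired, double renewalPercentThreshold) {
        List<Integer> evictedPerRound = new ArrayList<>();
        while (expired > 0) {
            // Same three lines as in evict(): keep at least 85% of the current registry per round.
            int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
            int evictionLimit = registrySize - registrySizeThreshold;
            int toEvict = Math.min(expired, evictionLimit);
            if (toEvict <= 0) {
                break; // nothing may be evicted in this round
            }
            evictedPerRound.add(toEvict);
            registrySize -= toEvict; // evicted instances leave the registry
            expired -= toEvict;
        }
        return evictedPerRound;
    }

    public static void main(String[] args) {
        // 100 registered instances, 40 of them crashed, default threshold 0.85.
        System.out.println(simulate(100, 40, 0.85));
    }
}
```

With 100 instances and 40 expired leases, the rounds evict 15, 13, 11 and 1 instances. The limit shrinks each round because it is recomputed from the current registry size, so under these assumptions clearing all 40 takes four runs of the task, roughly four minutes at the default 60s interval.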

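As the code also shows, evicting an instance simply calls the deregistration method, internalCancel(), which was covered in the earlier article on service deregistration. And that is the whole failure eviction mechanism.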
