“Without accumulating small steps, one cannot journey a thousand miles.”
If a Eureka client goes offline deliberately, it can call the shutdown() method to deregister its instance gracefully. In real production, however, developers rarely send an explicit request to take an instance offline; far more often, a service simply crashes. For that scenario, Eureka uses a failure-detection mechanism to sense the dead instance and then evict it from the registry.
How does Eureka detect failures? Through heartbeats.
Every Eureka client periodically sends a heartbeat to the server. The server records each service instance and the time of its last heartbeat, and a background scheduled task sweeps the registry: if an instance has gone longer than a threshold without sending a heartbeat, the server considers it dead and evicts it.
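The mechanism described above can be sketched in a few lines of plain Java. This is my own illustrative code, not Eureka's; the class name and the expiry threshold are made up, though renew() deliberately mirrors Eureka's terminology:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of heartbeat-based failure detection: clients stamp a
// timestamp on each heartbeat, and a periodic sweep evicts any instance
// whose last heartbeat is older than the configured expiry window.
public class HeartbeatSketch {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long expiryMs;

    public HeartbeatSketch(long expiryMs) {
        this.expiryMs = expiryMs;
    }

    // Called when a client sends a heartbeat (Eureka calls this "renew").
    public void renew(String instanceId, long nowMs) {
        lastHeartbeat.put(instanceId, nowMs);
    }

    // Called by the background sweep: drop instances that went silent.
    // Returns how many instances were evicted in this pass.
    public int evictExpired(long nowMs) {
        int evicted = 0;
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (nowMs - e.getValue() > expiryMs) {
                lastHeartbeat.remove(e.getKey());
                evicted++;
            }
        }
        return evicted;
    }

    public int size() {
        return lastHeartbeat.size();
    }
}
```

The real implementation is more involved (leases, self-preservation, randomized batches, as we will see below), but this is the core idea: a map of last-heartbeat timestamps plus a scheduled sweep.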
The relevant code lives in the Eureka server initialization logic we analyzed earlier: EurekaBootStrap#initEurekaServerContext().
Since checking whether instances have crashed is most likely registry-related, we search the registry-related methods and eventually find it inside a method called openForTraffic(). What a name...
PeerAwareInstanceRegistry registry;
... ...
registry.openForTraffic(applicationInfoManager, registryCount);
Looking at this so-called openForTraffic() method, the instance-failure check turns out to hide in its very last line, super.postInit(). More of this naming; I have run out of complaints...
@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    this.expectedNumberOfClientsSendingRenews = count;
    updateRenewsPerMinThreshold();
    logger.info("Got {} instances from neighboring DS node", count);
    logger.info("Renew threshold is: {}", numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    super.postInit();
}
postInit() submits an EvictionTask to a Timer scheduler, to be executed periodically; the default interval is 60s.
protected void postInit() {
    renewsLastMin.start();
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    evictionTaskRef.set(new EvictionTask());
    evictionTimer.schedule(evictionTaskRef.get(),
            // 60s by default
            serverConfig.getEvictionIntervalTimerInMs(),
            serverConfig.getEvictionIntervalTimerInMs());
}
@Override
public long getEvictionIntervalTimerInMs() {
    return configInstance.getLongProperty(
            namespace + "evictionIntervalTimerInMs", (60 * 1000)).get();
}
This so-called EvictionTask is the failure-detection task: it cleans out of the registry any service instances that have gone too long without sending a heartbeat, i.e. the instances that have crashed.
/* visible for testing */ class EvictionTask extends TimerTask {

    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0L);

    @Override
    public void run() {
        try {
            // obtain the compensation time
            long compensationTimeMs = getCompensationTimeMs();
            logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
            evict(compensationTimeMs);
        } catch (Throwable e) {
            logger.error("Could not run the evict task", e);
        }
    }
There is a getCompensationTimeMs() call here that fetches a "compensation time". What does that mean? Look at the code first.
It takes the timestamp of the current eviction run and the timestamp of the previous run, computes the elapsed milliseconds between them via TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos), then subtracts the configured interval, serverConfig.getEvictionIntervalTimerInMs() (60s by default). If the result is greater than 0, that surplus is returned as the compensation time. Think about it for a moment.
What is this combination for? Honestly, it is a nice bit of design. It guards against the scheduled task running later than planned (the Javadoc cites clock skew or GC pauses as examples). In effect, it compensates the lease-expiry check for any delay between two consecutive runs, so instances are not evicted just because the sweep itself was late.
/**
 * compute a compensation time defined as the actual time this task was executed since the prev iteration,
 * vs the configured amount of time for execution. This is useful for cases where changes in time (due to
 * clock skew or gc for example) causes the actual eviction task to execute later than the desired time
 * according to the configured cycle.
 */
long getCompensationTimeMs() {
    long currNanos = getCurrentTimeNano();
    long lastNanos = lastExecutionNanosRef.getAndSet(currNanos);
    if (lastNanos == 0L) {
        return 0L;
    }
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
    long compensationTime = elapsedMs - serverConfig.getEvictionIntervalTimerInMs();
    return compensationTime <= 0L ? 0L : compensationTime;
}
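To make the compensation arithmetic concrete, here is a standalone sketch of the same calculation. The class is my own; only the arithmetic mirrors getCompensationTimeMs():

```java
import java.util.concurrent.TimeUnit;

public class CompensationDemo {
    // Eviction interval, 60s by default in Eureka.
    static final long INTERVAL_MS = 60_000L;

    // How many milliseconds the current run is late, relative to the
    // configured interval between two consecutive eviction runs.
    static long compensationMs(long lastNanos, long currNanos) {
        if (lastNanos == 0L) {
            return 0L; // first run: nothing to compensate
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
        long compensation = elapsedMs - INTERVAL_MS;
        return compensation <= 0L ? 0L : compensation;
    }
}
```

If the previous sweep ran 75s ago instead of the configured 60s (say, because of a 15s GC pause), this returns 15,000ms; that surplus is then passed into evict() as additionalLeaseMs, extending every lease's expiry check by the same amount.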
Now for the core of it all, the evict() method: it decides whether each instance is down and, if so, cleans it up.
public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");

    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }

    // We collect first all expired items, to evict them in random order. For large eviction sets,
    // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
    // the impact should be evenly distributed across all applications.
    List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
    for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
        Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
        if (leaseMap != null) {
            for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                Lease<InstanceInfo> lease = leaseEntry.getValue();
                // check whether the lease has expired; if so, collect it for eviction
                if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                    expiredLeases.add(lease);
                }
            }
        }
    }

    // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
    // triggering self-preservation. Without that we would wipe out full registry.
    int registrySize = (int) getLocalRegistrySize();
    int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
    int evictionLimit = registrySize - registrySizeThreshold;

    int toEvict = Math.min(expiredLeases.size(), evictionLimit);
    if (toEvict > 0) {
        logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

        Random random = new Random(System.currentTimeMillis());
        for (int i = 0; i < toEvict; i++) {
            // Pick a random item (Knuth shuffle algorithm)
            int next = i + random.nextInt(expiredLeases.size() - i);
            Collections.swap(expiredLeases, i, next);
            Lease<InstanceInfo> lease = expiredLeases.get(i);

            String appName = lease.getHolder().getAppName();
            String id = lease.getHolder().getId();
            EXPIRED.increment();
            logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
            internalCancel(appName, id, false);
        }
    }
}
First it iterates over every service instance in the registry and calls lease.isExpired(additionalLeaseMs) to decide whether that instance's lease has expired. Each failed instance is collected into a list: expiredLeases.add(lease);
/**
 * Checks if the lease of a given {@link com.netflix.appinfo.InstanceInfo} has expired or not.
 *
 * Note that due to renew() doing the 'wrong' thing and setting lastUpdateTimestamp to +duration more than
 * what it should be, the expiry will actually be 2 * duration. This is a minor bug and should only affect
 * instances that ungracefully shutdown. Due to possible wide ranging impact to existing usage, this will
 * not be fixed.
 *
 * @param additionalLeaseMs any additional lease time to add to the lease evaluation in ms.
 */
public boolean isExpired(long additionalLeaseMs) {
    return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
}
evictionTimestamp > 0: on the first check this will not be greater than 0, because the instance has not been marked expired yet.
System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs): this checks whether the current time is past the expiry time (last heartbeat time + duration + compensation time); if so, it returns true, meaning the instance's lease has expired.
Did you notice something interesting in the Javadoc above? It admits that renew() has a bug: it adds an extra duration to the renewal timestamp, so the computed expiry time also contains that extra duration. In other words, an instance is evicted only after going 2 * duration without a heartbeat (the default duration is 90s, as we saw when analyzing renewals). Come on: if you added an extra 90s over there, you ought to subtract 90s over here. Yet this bug remains unfixed to this day...
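The effect of that bug can be demonstrated with a toy model. This is illustrative code, not the real Lease class; DURATION_MS stands in for the lease's duration field:

```java
public class LeaseExpiryDemo {
    // Default lease duration in Eureka: 90s.
    static final long DURATION_MS = 90_000L;

    // Mirrors Lease.renew()'s "wrong" behavior: lastUpdateTimestamp is set
    // to now + duration instead of now.
    static long renewTimestamp(long nowMs) {
        return nowMs + DURATION_MS;
    }

    // Mirrors Lease.isExpired() with no compensation time: expired when the
    // current time passes lastUpdateTimestamp + duration.
    static boolean isExpired(long nowMs, long lastUpdateTimestamp) {
        return nowMs > lastUpdateTimestamp + DURATION_MS;
    }
}
```

A heartbeat at t = 0 records lastUpdateTimestamp = 90,000, so isExpired() only turns true once now exceeds 180,000, i.e. 2 * duration after the last heartbeat, exactly as the Javadoc warns.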
Let's move on to the second half of the evict() method:
// To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
// triggering self-preservation. Without that we would wipe out full registry.
int registrySize = (int) getLocalRegistrySize();
int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
int evictionLimit = registrySize - registrySizeThreshold;

int toEvict = Math.min(expiredLeases.size(), evictionLimit);
if (toEvict > 0) {
    logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);

    Random random = new Random(System.currentTimeMillis());
    for (int i = 0; i < toEvict; i++) {
        // Pick a random item (Knuth shuffle algorithm)
        int next = i + random.nextInt(expiredLeases.size() - i);
        Collections.swap(expiredLeases, i, next);
        Lease<InstanceInfo> lease = expiredLeases.get(i);

        String appName = lease.getHolder().getAppName();
        String id = lease.getHolder().getId();
        EXPIRED.increment();
        logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
        internalCancel(appName, id, false);
    }
}
The default value of serverConfig.getRenewalPercentThreshold() is 0.85.
That means that, by default, Eureka does not evict every failed instance in one go: each pass removes at most 15% of the instances in the registry. Whatever is left over gets evicted the next time EvictionTask runs. In other words, eviction happens in batches.
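The batch math is easy to verify with a small sketch. This is my own helper; it mirrors only the three lines of arithmetic in evict():

```java
public class EvictionLimitDemo {
    // Same arithmetic as evict(): at most (1 - renewalPercentThreshold) of
    // the registry may be evicted per pass; the default threshold is 0.85.
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    // The number actually evicted: capped by both the number of expired
    // leases and the per-pass limit.
    static int toEvict(int expiredCount, int registrySize, double threshold) {
        return Math.min(expiredCount, evictionLimit(registrySize, threshold));
    }
}
```

With 20 registered instances and 10 expired leases, the limit is 20 - (int)(20 * 0.85) = 3, so only 3 instances are evicted in this pass; the remaining 7 wait for the next run of EvictionTask.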
You can also see from the code that evicting an instance simply calls the deregistration method, internalCancel(), which we already analyzed in the earlier article on service deregistration. And that is the whole failure-eviction mechanism.