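This article walks through the core source code behind Eureka Server's self-preservation mechanism, expired-instance eviction, and Eureka Server cluster replication, based on version 1.9.8
IV. Eureka Server Self-Preservation Mechanism
1. What is self-preservation
When a Eureka Server node loses too many clients within a short period (for example, because a network partition prevents service instances from communicating with Eureka Server), the node enters self-preservation mode (eureka.server.enable-self-preservation=true; enabled by default). Once in this mode, Eureka Server protects the information in the service registry and stops evicting entries whose leases have expired (that is, no microservice is removed merely because its heartbeats stopped arriving). When the network fault is resolved, the node leaves self-preservation mode automatically
2. How self-preservation is implemented
1) Activation condition
Renews threshold: the minimum number of client instance renewals Eureka Server expects to receive per minute
Renews (last min): the total number of client instance renewals Eureka Server received in the last minute
Self-preservation takes effect when, over the last full minute, Renews (last min) is not greater than Renews threshold
2) Formula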
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    protected void updateRenewsPerMinThreshold() {
        this.numberOfRenewsPerMinThreshold = (int) (this.expectedNumberOfClientsSendingRenews
                * (60.0 / serverConfig.getExpectedClientRenewalIntervalSeconds())
                * serverConfig.getRenewalPercentThreshold());
    }
    // ... other code omitted
}
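- numberOfRenewsPerMinThreshold: the Renews threshold value shown on the Dashboard
- expectedNumberOfClientsSendingRenews: the number of clients expected to send renewals (in practice, the total number of service instances)
- getExpectedClientRenewalIntervalSeconds(): returns the client renewal interval in seconds (default 30s)
- getRenewalPercentThreshold(): returns the self-preservation renewal percentage threshold factor (default 0.85, i.e. 85%)
Therefore:
- Renews threshold = total service instances * (60 / renewal interval) * renewal percentage threshold factor, truncated to an integer
- Renews (last min) = total service instances * (60 / renewal interval), provided every instance renews on schedule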
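To make the two formulas concrete, here is a minimal worked example. It only reproduces the arithmetic of updateRenewsPerMinThreshold(); the class name RenewsThresholdSketch and its method and parameter names are illustrative and not part of Eureka.

```java
final class RenewsThresholdSketch {
    // Same arithmetic as updateRenewsPerMinThreshold(): the double result is truncated to int.
    static int renewsPerMinThreshold(int expectedClients, int renewalIntervalSeconds, double percentThreshold) {
        return (int) (expectedClients * (60.0 / renewalIntervalSeconds) * percentThreshold);
    }

    // Renewals per minute if every instance sends its heartbeat on schedule.
    static int expectedRenewsPerMin(int expectedClients, int renewalIntervalSeconds) {
        return (int) (expectedClients * (60.0 / renewalIntervalSeconds));
    }

    public static void main(String[] args) {
        // 10 instances, 30 s renewal interval, 0.85 threshold factor
        System.out.println(renewsPerMinThreshold(10, 30, 0.85)); // prints 17
        System.out.println(expectedRenewsPerMin(10, 30));        // prints 20
    }
}
```

With the default settings, 10 instances normally produce 20 renewals per minute, and expired leases stop being evicted once fewer than 18 renewals arrive in a minute.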
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public boolean isLeaseExpirationEnabled() {
        if (!isSelfPreservationModeEnabled()) {
            // The self preservation mode is disabled, hence allowing the instances to expire.
            return true;
        }
        return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
    }
    // ... other code omitted
}
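isLeaseExpirationEnabled() is called when Eureka Server runs eviction, to decide whether expired leases may be cleaned up. If self-preservation mode is disabled, eviction is always allowed. If self-preservation mode is enabled, eviction is allowed only when the renewal threshold is greater than 0 and the number of renewals in the last minute is greater than the threshold; when the number of renewals in the last minute is less than or equal to the threshold, nothing is evicted
3) When Renews threshold is updated
(1) Application instance registration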
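The decision made by isLeaseExpirationEnabled() can be isolated as a pure function, which makes the boundary cases easy to check. This is a sketch only: LeaseExpirationSketch and its parameters are illustrative names, with the configuration flag and counters passed in as plain values.

```java
final class LeaseExpirationSketch {
    // Returns true when expired leases may be evicted.
    static boolean isLeaseExpirationEnabled(boolean selfPreservationEnabled,
                                            int renewsPerMinThreshold,
                                            long renewsInLastMin) {
        if (!selfPreservationEnabled) {
            // Self-preservation is switched off: eviction is always allowed.
            return true;
        }
        // Eviction is allowed only while renewals stay strictly above a positive threshold.
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }
}
```

Note the two boundary cases: a renewal count exactly equal to the threshold blocks eviction, and so does a threshold of 0 while self-preservation is enabled.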
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
        // ... other code omitted
        synchronized (lock) {
            if (this.expectedNumberOfClientsSendingRenews > 0) {
                // Since the client wants to register it, increase the number of clients sending renews
                this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews + 1;
                updateRenewsPerMinThreshold();
            }
        }
        logger.debug("No previous lease information found; it is new registration");
    }
    // ... other code omitted
}
When an application instance registers, expectedNumberOfClientsSendingRenews is incremented, and updateRenewsPerMinThreshold() is then called to update Renews threshold
(2) Application instance cancellation
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    public boolean cancel(final String appName, final String id,
                          final boolean isReplication) {
        // ... other code omitted
        synchronized (lock) {
            if (this.expectedNumberOfClientsSendingRenews > 0) {
                // Since the client wants to cancel it, reduce the number of clients to send renews
                this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews - 1;
                updateRenewsPerMinThreshold();
            }
        }
        // ... other code omitted
    }
    // ... other code omitted
}
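When an application instance is cancelled, expectedNumberOfClientsSendingRenews is decremented, and updateRenewsPerMinThreshold() is then called to update Renews threshold
(3) Periodic reset (every 15 minutes by default)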
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    private void scheduleRenewalThresholdUpdateTask() {
        timer.schedule(new TimerTask() {
                           @Override
                           public void run() {
                               updateRenewalThreshold();
                           }
                       }, serverConfig.getRenewalThresholdUpdateIntervalMs(),
                serverConfig.getRenewalThresholdUpdateIntervalMs());
    }

    private void updateRenewalThreshold() {
        try {
            // Count the registerable application instances
            Applications apps = eurekaClient.getApplications();
            int count = 0;
            for (Application app : apps.getRegisteredApplications()) {
                for (InstanceInfo instance : app.getInstances()) {
                    if (this.isRegisterable(instance)) {
                        ++count;
                    }
                }
            }
            synchronized (lock) {
                // Update threshold only if the threshold is greater than the
                // current expected threshold or if self preservation is disabled.
                // Recompute expectedNumberOfClientsSendingRenews and numberOfRenewsPerMinThreshold
                if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
                        || (!this.isSelfPreservationModeEnabled())) {
                    this.expectedNumberOfClientsSendingRenews = count;
                    updateRenewsPerMinThreshold();
                }
            }
            logger.info("Current renewal threshold is : {}", numberOfRenewsPerMinThreshold);
        } catch (Throwable e) {
            logger.error("Cannot update renewal threshold", e);
        }
    }
    // ... other code omitted
}
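The guard inside updateRenewalThreshold() is easy to misread, so here is a small sketch that isolates it. The class RenewalThresholdUpdateSketch and the method nextExpectedClients are hypothetical names introduced for illustration; only the condition itself is taken from the excerpt above.

```java
final class RenewalThresholdUpdateSketch {
    // Returns the value expectedNumberOfClientsSendingRenews would hold after one periodic update.
    static int nextExpectedClients(int currentExpected, int observedCount,
                                   double percentThreshold, boolean selfPreservationEnabled) {
        // Accept the observed count only if it is still above percentThreshold * currentExpected,
        // or if self-preservation is disabled altogether.
        if (observedCount > percentThreshold * currentExpected || !selfPreservationEnabled) {
            return observedCount;
        }
        // A sharp drop (possibly a network partition) leaves the expected count unchanged.
        return currentExpected;
    }
}
```

With 10 expected clients and a factor of 0.85, an observed count of 9 is accepted (9 > 8.5), whereas an observed count of 5 is ignored, so the threshold stays high and self-preservation keeps protecting the registry.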
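V. Eviction of Expired Application Instances
The core flow of expired-instance eviction is shown in the figure below:
1. Why eviction is needed
Normally, an application instance sends a cancel request to Eureka Server when it shuts down. In practice, however, the instance may crash or the network may fail, so the cancel request is never delivered
To handle this, Eureka Client extends its lease through heartbeats, and Eureka Server cleans up leases that have timed out
2. EvictionTask
com.netflix.eureka.registry.AbstractInstanceRegistry.EvictionTask is the task that cleans up expired leases. When Eureka Server starts, it creates an EvictionTask and schedules it to run periodically, as follows: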
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    protected void postInit() {
        renewsLastMin.start();
        if (evictionTaskRef.get() != null) {
            evictionTaskRef.get().cancel();
        }
        // Create and schedule the expired-lease cleanup task
        evictionTaskRef.set(new EvictionTask());
        evictionTimer.schedule(evictionTaskRef.get(),
                serverConfig.getEvictionIntervalTimerInMs(),
                serverConfig.getEvictionIntervalTimerInMs());
    }
    // ... other code omitted
}
eureka.evictionIntervalTimerInMs: how often the expired-lease cleanup task runs; the default is 1 minute
The EvictionTask implementation is as follows:
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    class EvictionTask extends TimerTask {
        private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);

        @Override
        public void run() {
            try {
                // Compensation time in milliseconds (current time - last execution time - eviction interval)
                long compensationTimeMs = getCompensationTimeMs();
                logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
                // Clean up expired leases
                evict(compensationTimeMs);
            } catch (Throwable e) {
                logger.error("Could not run the evict task", e);
            }
        }
        // ... other code omitted
    }
}
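The compensation time passed to evict() follows the formula in the comment above: current time minus last execution time minus the eviction interval. The sketch below shows one way to compute it. The class CompensationTimeSketch is hypothetical, the clock value is passed in so the logic can be tested, and two details are assumptions of this sketch rather than something the excerpt shows: the first run returns 0, and negative results are clamped to 0.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class CompensationTimeSketch {
    private final AtomicLong lastExecutionNanos = new AtomicLong(0L);
    private final long evictionIntervalMs;

    CompensationTimeSketch(long evictionIntervalMs) {
        this.evictionIntervalMs = evictionIntervalMs;
    }

    // nowNanos is a monotonic clock reading such as System.nanoTime(); it must be non-zero.
    long compensationTimeMs(long nowNanos) {
        long lastNanos = lastExecutionNanos.getAndSet(nowNanos);
        if (lastNanos == 0L) {
            return 0L; // assumption: no previous run, so there is nothing to compensate
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - lastNanos);
        // Only the delay beyond the configured interval counts (assumption: never negative).
        return Math.max(0L, elapsedMs - evictionIntervalMs);
    }
}
```

If the task is configured to run every 60 s but a GC pause delays it so that 65 s elapse between two runs, the compensation time is 5000 ms, and each lease gets that extra time before it is considered expired.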
3. Eviction logic
EvictionTask calls AbstractInstanceRegistry's evict(long additionalLeaseMs) method to clean up expired leases. The implementation is as follows:
public abstract class AbstractInstanceRegistry implements InstanceRegistry {
    public void evict(long additionalLeaseMs) {
        logger.debug("Running the evict task");
        if (!isLeaseExpirationEnabled()) {
            logger.debug("DS: lease expiration is currently disabled.");
            return;
        }
        // Collect all expired leases
        // We collect first all expired items, to evict them in random order. For large eviction sets,
        // if we do not that, we might wipe out whole apps before self preservation kicks in. By randomizing it,
        // the impact should be evenly distributed across all applications.
        List<Lease<InstanceInfo>> expiredLeases = new ArrayList<>();
        for (Entry<String, Map<String, Lease<InstanceInfo>>> groupEntry : registry.entrySet()) {
            Map<String, Lease<InstanceInfo>> leaseMap = groupEntry.getValue();
            if (leaseMap != null) {
                for (Entry<String, Lease<InstanceInfo>> leaseEntry : leaseMap.entrySet()) {
                    Lease<InstanceInfo> lease = leaseEntry.getValue();
                    // 1)
                    if (lease.isExpired(additionalLeaseMs) && lease.getHolder() != null) {
                        expiredLeases.add(lease);
                    }
                }
            }
        }
        // Compute the maximum number of leases that may be evicted
        // To compensate for GC pauses or drifting local time, we need to use current registry size as a base for
        // triggering self-preservation. Without that we would wipe out full registry.
        int registrySize = (int) getLocalRegistrySize();
        int registrySizeThreshold = (int) (registrySize * serverConfig.getRenewalPercentThreshold());
        int evictionLimit = registrySize - registrySizeThreshold;
        // Compute the number of leases to evict in this run
        int toEvict = Math.min(expiredLeases.size(), evictionLimit);
        if (toEvict > 0) {
            logger.info("Evicting {} items (expired={}, evictionLimit={})", toEvict, expiredLeases.size(), evictionLimit);
            // Evict the leases one by one
            Random random = new Random(System.currentTimeMillis());
            for (int i = 0; i < toEvict; i++) {
                // Pick a random item (Knuth shuffle algorithm)
                int next = i + random.nextInt(expiredLeases.size() - i);
                Collections.swap(expiredLeases, i, next);
                Lease<InstanceInfo> lease = expiredLeases.get(i);
                String appName = lease.getHolder().getAppName();
                String id = lease.getHolder().getId();
                EXPIRED.increment();
                logger.warn("DS: Registry: expired lease for {}/{}", appName, id);
                // Cancel the expired lease
                internalCancel(appName, id, false);
            }
        }
    }
    // ... other code omitted
}
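Two parts of evict() are worth trying out in isolation: the eviction limit arithmetic and the partial Knuth shuffle that picks which expired leases to remove. The sketch below reproduces both on plain values. EvictionLimitSketch, evictionLimit and pickRandom are illustrative names, and the copy of the input list is a choice made here so the caller's list stays untouched.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class EvictionLimitSketch {
    // Maximum number of leases one eviction run may remove.
    static int evictionLimit(int registrySize, double renewalPercentThreshold) {
        int registrySizeThreshold = (int) (registrySize * renewalPercentThreshold);
        return registrySize - registrySizeThreshold;
    }

    // Picks up to 'limit' items uniformly at random using a partial Knuth (Fisher-Yates) shuffle.
    static <T> List<T> pickRandom(List<T> expired, int limit, Random random) {
        List<T> pool = new ArrayList<>(expired);
        int toEvict = Math.max(0, Math.min(pool.size(), limit));
        for (int i = 0; i < toEvict; i++) {
            // Swap a random element of the not-yet-chosen tail into position i.
            int next = i + random.nextInt(pool.size() - i);
            Collections.swap(pool, i, next);
        }
        return new ArrayList<>(pool.subList(0, toEvict));
    }
}
```

For a registry of 100 instances and a factor of 0.85, at most 15 leases are evicted per run even if more have expired; the rest wait for a later run, by which time self-preservation may have taken effect.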
At marker 1) in evict(), Lease's isExpired(long additionalLeaseMs) method is called to determine whether the lease has expired
public class Lease<T> {
    public boolean isExpired(long additionalLeaseMs) {
        return (evictionTimestamp > 0 || System.currentTimeMillis() > (lastUpdateTimestamp + duration + additionalLeaseMs));
    }

    public void renew() {
        lastUpdateTimestamp = System.currentTimeMillis() + duration;
    }
    // ... other code omitted
}
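Leaving the additionalLeaseMs parameter aside, a lease expires one duration later than intended. The cause is that renew() incorrectly sets lastUpdateTimestamp = System.currentTimeMillis() + duration; the correct assignment would be lastUpdateTimestamp = System.currentTimeMillis(). As a result, a renewed lease is treated as expired only after 2 * duration has passed without a heartbeat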
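The effect of the extra duration can be checked with a stripped-down lease. This is a sketch, not the Eureka class: LeaseExpirySketch is a hypothetical name, timestamps are passed in instead of read from System.currentTimeMillis(), and the evictionTimestamp check is left out.

```java
final class LeaseExpirySketch {
    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseExpirySketch(long durationMs) {
        this.durationMs = durationMs;
    }

    // Same assignment as Lease.renew(): the duration is already added here.
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    // Same comparison as Lease.isExpired(): the duration is added a second time.
    boolean isExpired(long nowMs, long additionalLeaseMs) {
        return nowMs > lastUpdateTimestamp + durationMs + additionalLeaseMs;
    }

    public static void main(String[] args) {
        LeaseExpirySketch lease = new LeaseExpirySketch(90_000L);
        lease.renew(0L);
        System.out.println(lease.isExpired(100_000L, 0L)); // prints false
        System.out.println(lease.isExpired(180_001L, 0L)); // prints true
    }
}
```

With a 90 s lease renewed at time 0, the lease is still valid at 100 s and only expires after 180 s, which is twice the configured duration.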
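VI. Eureka Server Cluster Replication
1. Overview
The architecture of the Eureka Server cluster, service providers, and service consumers is shown in the figure below:
- All nodes in a Eureka Server cluster play the same role and are fully equal peers
- A Eureka Client can send register, renew, cancel, and other operations to any Eureka Server node; that node replicates the operation to the other Eureka Server nodes to reach eventual consistency
- When a service consumer starts, its Eureka Client sends a REST request to any Eureka Server node to fetch the list of registered services and caches it; the Eureka Client then refreshes the cached service list periodically
- After fetching the service list, a service consumer can look up by service name the instances that provide the service, together with their metadata. Ribbon calls these instances in round-robin order by default, which provides client-side load balancing
2. Fetching the initial registry information
When Eureka Server starts, it calls PeerAwareInstanceRegistryImpl's syncUp() method to fetch the initial registry information from one Eureka Server node in the cluster. The code is as follows: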
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public int syncUp() {
        // Copy entire entry from neighboring DS node
        int count = 0;
        for (int i = 0; ((i < serverConfig.getRegistrySyncRetries()) && (count == 0)); i++) {
            // On a retry, sleep for a while before trying again
            if (i > 0) {
                try {
                    Thread.sleep(serverConfig.getRegistrySyncRetryWaitMs());
                } catch (InterruptedException e) {
                    logger.warn("Interrupted during registry transfer..");
                    break;
                }
            }
            // Fetch the initial registry information
            Applications apps = eurekaClient.getApplications();
            for (Application app : apps.getRegisteredApplications()) {
                for (InstanceInfo instance : app.getInstances()) {
                    try {
                        if (isRegisterable(instance)) {
                            register(instance, instance.getLeaseInfo().getDurationInSecs(), true);
                            count++;
                        }
                    } catch (Throwable t) {
                        logger.error("During DS init copy", t);
                    }
                }
            }
        }
        return count;
    }
    // ... other code omitted
}
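3. Synchronizing registry information
When Eureka Server receives register, renew, cancel, and similar operations from a Eureka Client, it synchronizes them in batches to the other nodes in the Eureka Server cluster at a fixed interval (500 ms by default)
1) Initiating a Eureka Server synchronization operation
Taking the register operation as an example, the code is as follows: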
public class PeerAwareInstanceRegistryImpl extends AbstractInstanceRegistry implements PeerAwareInstanceRegistry {
    @Override
    public void register(final InstanceInfo info, final boolean isReplication) {
        // Lease duration
        int leaseDuration = Lease.DEFAULT_DURATION_IN_SECS;
        if (info.getLeaseInfo() != null && info.getLeaseInfo().getDurationInSecs() > 0) {
            leaseDuration = info.getLeaseInfo().getDurationInSecs();
        }
        // Register the application instance locally
        super.register(info, leaseDuration, isReplication);
        // Replicate to the other Eureka Server nodes
        replicateToPeers(Action.Register, info.getAppName(), info.getId(), info, null, isReplication);
    }

    private void replicateToPeers(Action action, String appName, String id,
                                  InstanceInfo info /* optional */,
                                  InstanceStatus newStatus /* optional */, boolean isReplication) {
        Stopwatch tracer = action.getTimer().start();
        try {
            if (isReplication) {
                numberOfReplicationsLastMin.increment();
            }
            // 1) The request came from another Eureka Server, or the cluster is empty
            // If it is a replication already, do not replicate again as this will create a poison replication
            if (peerEurekaNodes == Collections.EMPTY_LIST || isReplication) {
                return;
            }
            // Loop over every node in the cluster and call replicateInstanceActionsToPeers
            for (final PeerEurekaNode node : peerEurekaNodes.getPeerEurekaNodes()) {
                // If the url represents this host, do not replicate to yourself.
                if (peerEurekaNodes.isThisMyUrl(node.getServiceUrl())) {
                    continue;
                }
                replicateInstanceActionsToPeers(action, appName, id, info, newStatus, node);
            }
        } finally {
            tracer.stop();
        }
    }

    private void replicateInstanceActionsToPeers(Action action, String appName,
                                                 String id, InstanceInfo info, InstanceStatus newStatus,
                                                 PeerEurekaNode node) {
        try {
            InstanceInfo infoFromRegistry = null;
            CurrentRequestVersion.set(Version.V2);
            // Call the PeerEurekaNode method that matches the action type
            switch (action) {
                case Cancel:
                    node.cancel(appName, id);
                    break;
                case Heartbeat:
                    InstanceStatus overriddenStatus = overriddenInstanceStatusMap.get(id);
                    infoFromRegistry = getInstanceByAppAndId(appName, id, false);
                    node.heartbeat(appName, id, infoFromRegistry, overriddenStatus, false);
                    break;
                case Register:
                    node.register(info);
                    break;
                case StatusUpdate:
                    infoFromRegistry = getInstanceByAppAndId(appName, id, false);
                    node.statusUpdate(appName, id, newStatus, infoFromRegistry);
                    break;
                case DeleteStatusOverride:
                    infoFromRegistry = getInstanceByAppAndId(appName, id, false);
                    node.deleteStatusOverride(appName, id, infoFromRegistry);
                    break;
            }
        } catch (Throwable t) {
            logger.error("Cannot replicate information to {} for action {}", node.getServiceUrl(), action.name(), t);
        }
    }
    // ... other code omitted
}
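Marker 1) in replicateToPeers() checks the value of isReplication, which comes from the x-netflix-discovery-replication request header. A registration request from a Eureka Client has isReplication set to false, so the Eureka Server node that receives it replicates the registration to the other Eureka Server nodes. Those replication requests carry isReplication set to true, indicating that the registration was copied from another Eureka Server node; the receiving node does not forward it any further, which avoids an endless replication loop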
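The role of the isReplication flag can be demonstrated with a toy in-memory cluster. Everything in this sketch is hypothetical (ReplicationSketch, Node, cluster); it makes direct method calls instead of HTTP requests and only models the forwarding rule described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ReplicationSketch {
    static final class Node {
        final String name;
        final Set<String> registry = new HashSet<>();
        final List<Node> peers = new ArrayList<>();
        int replicationsReceived = 0;

        Node(String name) {
            this.name = name;
        }

        void register(String instanceId, boolean isReplication) {
            registry.add(instanceId);
            if (isReplication) {
                // The request came from a peer: record it and do not forward it again.
                replicationsReceived++;
                return;
            }
            for (Node peer : peers) {
                if (peer != this) { // never replicate to yourself
                    peer.register(instanceId, true);
                }
            }
        }
    }

    // Builds a fully connected cluster in which every node knows all nodes, itself included.
    static List<Node> cluster(int size) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            nodes.add(new Node("node-" + i));
        }
        for (Node node : nodes) {
            node.peers.addAll(nodes);
        }
        return nodes;
    }
}
```

In a three-node cluster, a client registration sent to one node reaches each of the other two nodes exactly once. If register() always forwarded to peers regardless of the flag, the nodes would keep sending the same registration back and forth.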
public class PeerEurekaNode {
    public void register(final InstanceInfo info) throws Exception {
        long expiryTime = System.currentTimeMillis() + getLeaseRenewalOf(info);
        batchingDispatcher.process(
                // Generate the task ID; the same replication action on the same instance uses the same task ID
                taskId("register", info),
                // Send the instance registration to the peer node
                new InstanceReplicationTask(targetHost, Action.Register, info, null, true) {
                    public EurekaHttpResponse<Void> execute() {
                        return replicationClient.register(info);
                    }
                },
                expiryTime
        );
    }
    // ... other code omitted
}
2) Receiving a Eureka Server synchronization operation
@Path("/{version}/peerreplication")
@Produces({"application/xml", "application/json"})
public class PeerReplicationResource {
    @Path("batch")
    @POST
    public Response batchReplication(ReplicationList replicationList) {
        try {
            ReplicationListResponse batchResponse = new ReplicationListResponse();
            // Process the replication tasks one by one and merge each result into the ReplicationListResponse
            for (ReplicationInstance instanceInfo : replicationList.getReplicationList()) {
                try {
                    batchResponse.addResponse(dispatch(instanceInfo));
                } catch (Exception e) {
                    batchResponse.addResponse(new ReplicationInstanceResponse(Status.INTERNAL_SERVER_ERROR.getStatusCode(), null));
                    logger.error("{} request processing failed for batch item {}/{}",
                            instanceInfo.getAction(), instanceInfo.getAppName(), instanceInfo.getId(), e);
                }
            }
            return Response.ok(batchResponse).build();
        } catch (Throwable e) {
            logger.error("Cannot execute batch Request", e);
            return Response.status(Status.INTERNAL_SERVER_ERROR).build();
        }
    }

    private ReplicationInstanceResponse dispatch(ReplicationInstance instanceInfo) {
        ApplicationResource applicationResource = createApplicationResource(instanceInfo);
        InstanceResource resource = createInstanceResource(instanceInfo, applicationResource);
        String lastDirtyTimestamp = toString(instanceInfo.getLastDirtyTimestamp());
        String overriddenStatus = toString(instanceInfo.getOverriddenStatus());
        String instanceStatus = toString(instanceInfo.getStatus());
        Builder singleResponseBuilder = new Builder();
        switch (instanceInfo.getAction()) {
            case Register:
                singleResponseBuilder = handleRegister(instanceInfo, applicationResource);
                break;
            case Heartbeat:
                singleResponseBuilder = handleHeartbeat(serverConfig, resource, lastDirtyTimestamp, overriddenStatus, instanceStatus);
                break;
            case Cancel:
                singleResponseBuilder = handleCancel(resource);
                break;
            case StatusUpdate:
                singleResponseBuilder = handleStatusUpdate(instanceInfo, resource);
                break;
            case DeleteStatusOverride:
                singleResponseBuilder = handleDeleteStatusOverride(instanceInfo, resource);
                break;
        }
        return singleResponseBuilder.build();
    }
    // ... other code omitted
}
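The dispatch() method hands each individual synchronization task to the corresponding Resource for processing. This is the same Resource logic that handles requests sent to Eureka Server by a Eureka Client, except that isReplication is always true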