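Problem description
When using Spring Cloud Gateway with Nacos for service discovery, I noticed that after a downstream service instance recovers, it still takes a while before the gateway successfully forwards requests to it. So I dug into the source code to see whether the gateway's recovery time for service discovery could be shortened by changing configuration.
Dependency versions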
<!-- SpringCloud Gateway -->
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-gateway</artifactId>
<version>2.2.6.RELEASE</version>
</dependency>
<!-- SpringCloud Alibaba Nacos -->
<dependency>
<groupId>com.alibaba.cloud</groupId>
<artifactId>spring-cloud-starter-alibaba-nacos-discovery</artifactId>
<version>2.2.6.RELEASE</version>
</dependency>
<!-- SpringCloud Alibaba Nacos Config -->
<dependency>
<groupId>com.alibaba.cloud</groupId>
<artifactId>spring-cloud-starter-alibaba-nacos-config</artifactId>
<version>2.2.6.RELEASE</version>
</dependency>
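Source code
After a lot of debugging and searching on Baidu, I finally tracked the entry point down to LoadBalancerClientFilter. It is marked as deprecated, but the gateway still uses it in this version.
LoadBalancerClientFilter is registered by the GatewayLoadBalancerClientAutoConfiguration auto-configuration class.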
GatewayLoadBalancerClientAutoConfiguration
/**
* AutoConfiguration for {@link LoadBalancerClientFilter}.
*
* @author Spencer Gibb
* @author Olga Maciaszek-Sharma
*/
@Configuration(proxyBeanMethods = false)
@ConditionalOnClass({ LoadBalancerClient.class, RibbonAutoConfiguration.class,
DispatcherHandler.class })
@AutoConfigureAfter(RibbonAutoConfiguration.class)
@EnableConfigurationProperties(LoadBalancerProperties.class)
public class GatewayLoadBalancerClientAutoConfiguration {
@Bean
@ConditionalOnBean(LoadBalancerClient.class)
@ConditionalOnMissingBean({ LoadBalancerClientFilter.class,
ReactiveLoadBalancerClientFilter.class })
@ConditionalOnEnabledGlobalFilter
public LoadBalancerClientFilter loadBalancerClientFilter(LoadBalancerClient client,
LoadBalancerProperties properties) {
return new LoadBalancerClientFilter(client, properties);
}
}
LoadBalancerClientFilter
This is the filter the gateway uses to load-balance and forward requests, even though it is marked as deprecated.
package org.springframework.cloud.gateway.filter;
@Deprecated
public class LoadBalancerClientFilter implements GlobalFilter, Ordered {
protected final LoadBalancerClient loadBalancer;
private LoadBalancerProperties properties;
public LoadBalancerClientFilter(LoadBalancerClient loadBalancer,
LoadBalancerProperties properties) {
this.loadBalancer = loadBalancer;
this.properties = properties;
}
...
@Override
@SuppressWarnings("Duplicates")
public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
URI url = exchange.getAttribute(GATEWAY_REQUEST_URL_ATTR);
String schemePrefix = exchange.getAttribute(GATEWAY_SCHEME_PREFIX_ATTR);
if (url == null
|| (!"lb".equals(url.getScheme()) && !"lb".equals(schemePrefix))) {
return chain.filter(exchange);
}
// preserve the original url
addOriginalRequestUrl(exchange, url);
if (log.isTraceEnabled()) {
log.trace("LoadBalancerClientFilter url before: " + url);
}
final ServiceInstance instance = choose(exchange);
if (instance == null) {
throw NotFoundException.create(properties.isUse404(),
"Unable to find instance for " + url.getHost());
}
URI uri = exchange.getRequest().getURI();
// if the `lb:<scheme>` mechanism was used, use `<scheme>` as the default,
// if the loadbalancer doesn't provide one.
String overrideScheme = instance.isSecure() ? "https" : "http";
if (schemePrefix != null) {
overrideScheme = url.getScheme();
}
URI requestUrl = loadBalancer.reconstructURI(
new DelegatingServiceInstance(instance, overrideScheme), uri);
if (log.isTraceEnabled()) {
log.trace("LoadBalancerClientFilter url chosen: " + requestUrl);
}
exchange.getAttributes().put(GATEWAY_REQUEST_URL_ATTR, requestUrl);
return chain.filter(exchange);
}
protected ServiceInstance choose(ServerWebExchange exchange) {
return loadBalancer.choose(
((URI) exchange.getAttribute(GATEWAY_REQUEST_URL_ATTR)).getHost());
}
}
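As the source shows, LoadBalancerClientFilter#choose(ServerWebExchange exchange) picks a suitable downstream instance to forward to. The loadBalancer used in that method ends up delegating to a DynamicServerListLoadBalancer instance (with Ribbon, via RibbonLoadBalancerClient).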
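To make the URL rewriting step in filter() more concrete, here is a minimal sketch that uses only java.net.URI. It is not the gateway's code: the class LbUriSketch and its methods are hypothetical names, the host and port are made-up values, and the real filter delegates the rewrite to LoadBalancerClient#reconstructURI using the incoming request URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of Spring Cloud Gateway.
final class LbUriSketch {

    // True when the route URI uses the lb scheme, e.g. lb://order-service.
    static boolean isLoadBalanced(URI routeUri) {
        return routeUri != null && "lb".equals(routeUri.getScheme());
    }

    // Swaps scheme, host and port for those of the chosen instance;
    // path, query and fragment are kept unchanged.
    static URI reconstruct(URI original, String host, int port, boolean secure) {
        String scheme = secure ? "https" : "http";
        try {
            return new URI(scheme, original.getUserInfo(), host, port,
                    original.getPath(), original.getQuery(), original.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rebuild URI from " + original, e);
        }
    }

    public static void main(String[] args) {
        URI routeUri = URI.create("lb://order-service/orders?id=1");
        if (isLoadBalanced(routeUri)) {
            // 10.0.0.5:8080 stands in for the instance returned by choose().
            System.out.println(reconstruct(routeUri, "10.0.0.5", 8080, false));
        }
    }
}
```

The point of the sketch is that only the authority part of the URI depends on the load balancer, so how fresh the server list is decides where the request lands.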
DynamicServerListLoadBalancer
package com.netflix.loadbalancer;
public class DynamicServerListLoadBalancer<T extends Server> extends BaseLoadBalancer {
private static final Logger LOGGER = LoggerFactory.getLogger(DynamicServerListLoadBalancer.class);
boolean isSecure = false;
boolean useTunnel = false;
// to keep track of modification of server lists
protected AtomicBoolean serverListUpdateInProgress = new AtomicBoolean(false);
volatile ServerList<T> serverListImpl;
volatile ServerListFilter<T> filter;
protected final ServerListUpdater.UpdateAction updateAction = new ServerListUpdater.UpdateAction() {
@Override
public void doUpdate() {
updateListOfServers();
}
};
protected volatile ServerListUpdater serverListUpdater;
public DynamicServerListLoadBalancer() {
super();
}
@Deprecated
public DynamicServerListLoadBalancer(IClientConfig clientConfig, IRule rule, IPing ping,
ServerList<T> serverList, ServerListFilter<T> filter) {
this(
clientConfig,
rule,
ping,
serverList,
filter,
new PollingServerListUpdater()
);
}
public DynamicServerListLoadBalancer(IClientConfig clientConfig, IRule rule, IPing ping,
ServerList<T> serverList, ServerListFilter<T> filter,
ServerListUpdater serverListUpdater) {
super(clientConfig, rule, ping);
this.serverListImpl = serverList;
this.filter = filter;
this.serverListUpdater = serverListUpdater;
if (filter instanceof AbstractServerListFilter) {
((AbstractServerListFilter) filter).setLoadBalancerStats(getLoadBalancerStats());
}
restOfInit(clientConfig);
}
public DynamicServerListLoadBalancer(IClientConfig clientConfig) {
initWithNiwsConfig(clientConfig);
}
@Override
public void initWithNiwsConfig(IClientConfig clientConfig) {
try {
super.initWithNiwsConfig(clientConfig);
String niwsServerListClassName = clientConfig.getPropertyAsString(
CommonClientConfigKey.NIWSServerListClassName,
DefaultClientConfigImpl.DEFAULT_SEVER_LIST_CLASS);
ServerList<T> niwsServerListImpl = (ServerList<T>) ClientFactory
.instantiateInstanceWithClientConfig(niwsServerListClassName, clientConfig);
this.serverListImpl = niwsServerListImpl;
if (niwsServerListImpl instanceof AbstractServerList) {
AbstractServerListFilter<T> niwsFilter = ((AbstractServerList) niwsServerListImpl)
.getFilterImpl(clientConfig);
niwsFilter.setLoadBalancerStats(getLoadBalancerStats());
this.filter = niwsFilter;
}
String serverListUpdaterClassName = clientConfig.getPropertyAsString(
CommonClientConfigKey.ServerListUpdaterClassName,
DefaultClientConfigImpl.DEFAULT_SERVER_LIST_UPDATER_CLASS
);
this.serverListUpdater = (ServerListUpdater) ClientFactory
.instantiateInstanceWithClientConfig(serverListUpdaterClassName, clientConfig);
restOfInit(clientConfig);
} catch (Exception e) {
throw new RuntimeException(
"Exception while initializing NIWSDiscoveryLoadBalancer:"
+ clientConfig.getClientName()
+ ", niwsClientConfig:" + clientConfig, e);
}
}
void restOfInit(IClientConfig clientConfig) {
boolean primeConnection = this.isEnablePrimingConnections();
// turn this off to avoid duplicated asynchronous priming done in BaseLoadBalancer.setServerList()
this.setEnablePrimingConnections(false);
enableAndInitLearnNewServersFeature();
updateListOfServers();
if (primeConnection && this.getPrimeConnections() != null) {
this.getPrimeConnections()
.primeConnections(getReachableServers());
}
this.setEnablePrimingConnections(primeConnection);
LOGGER.info("DynamicServerListLoadBalancer for client {} initialized: {}", clientConfig.getClientName(), this.toString());
}
....
/**
* Feature that lets us add new instances (from AMIs) to the list of
* existing servers that the LB will use Call this method if you want this
* feature enabled
*/
public void enableAndInitLearnNewServersFeature() {
LOGGER.info("Using serverListUpdater {}", serverListUpdater.getClass().getSimpleName());
serverListUpdater.start(updateAction);
}
@VisibleForTesting
public void updateListOfServers() {
List<T> servers = new ArrayList<T>();
if (serverListImpl != null) {
servers = serverListImpl.getUpdatedListOfServers();
LOGGER.debug("List of Servers for {} obtained from Discovery client: {}",
getIdentifier(), servers);
if (filter != null) {
servers = filter.getFilteredListOfServers(servers);
LOGGER.debug("Filtered List of Servers for {} obtained from Discovery client: {}",
getIdentifier(), servers);
}
}
updateAllServerList(servers);
}
/**
* Update the AllServer list in the LoadBalancer if necessary and enabled
*
* @param ls
*/
protected void updateAllServerList(List<T> ls) {
// other threads might be doing this - in which case, we pass
if (serverListUpdateInProgress.compareAndSet(false, true)) {
try {
for (T s : ls) {
s.setAlive(true); // set so that clients can start using these
// servers right away instead
// of having to wait out the ping cycle.
}
setServersList(ls);
super.forceQuickPing();
} finally {
serverListUpdateInProgress.set(false);
}
}
}
...
}
The source shows that during construction, DynamicServerListLoadBalancer#restOfInit calls DynamicServerListLoadBalancer#updateListOfServers to load the downstream server list. It also starts DynamicServerListLoadBalancer#serverListUpdater, which refreshes the downstream server list periodically. Here serverListUpdater is a PollingServerListUpdater instance.
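The compareAndSet guard in updateAllServerList is easy to miss, so here is a minimal sketch of the same idea with plain JDK types. UpdateGuardSketch is a hypothetical class, and server entries are simplified to strings; it only illustrates that a concurrent update is skipped rather than queued.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of the "skip if another update is running" guard.
final class UpdateGuardSketch {

    private final AtomicBoolean updateInProgress = new AtomicBoolean(false);
    private volatile List<String> servers = Collections.emptyList();

    // Returns true if this call replaced the list, false if another
    // thread was already updating and this call was skipped.
    boolean updateAllServerList(List<String> latest) {
        if (!updateInProgress.compareAndSet(false, true)) {
            return false;
        }
        try {
            servers = Collections.unmodifiableList(new ArrayList<>(latest));
            return true;
        } finally {
            // Always release the flag, even if the update throws.
            updateInProgress.set(false);
        }
    }

    List<String> getServers() {
        return servers;
    }
}
```

A skipped update is not retried here; in the real class the next scheduled run of the updater picks up the latest list.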
PollingServerListUpdater
package com.netflix.loadbalancer;
public class PollingServerListUpdater implements ServerListUpdater {
...
private static long LISTOFSERVERS_CACHE_UPDATE_DELAY = 1000; // msecs;
private static int LISTOFSERVERS_CACHE_REPEAT_INTERVAL = 30 * 1000; // msecs;
...
private final AtomicBoolean isActive = new AtomicBoolean(false);
private volatile long lastUpdated = System.currentTimeMillis();
private final long initialDelayMs;
private final long refreshIntervalMs;
private volatile ScheduledFuture<?> scheduledFuture;
public PollingServerListUpdater() {
this(LISTOFSERVERS_CACHE_UPDATE_DELAY, LISTOFSERVERS_CACHE_REPEAT_INTERVAL);
}
public PollingServerListUpdater(IClientConfig clientConfig) {
this(LISTOFSERVERS_CACHE_UPDATE_DELAY, getRefreshIntervalMs(clientConfig));
}
public PollingServerListUpdater(final long initialDelayMs, final long refreshIntervalMs) {
this.initialDelayMs = initialDelayMs;
this.refreshIntervalMs = refreshIntervalMs;
}
@Override
public synchronized void start(final UpdateAction updateAction) {
if (isActive.compareAndSet(false, true)) {
final Runnable wrapperRunnable = new Runnable() {
@Override
public void run() {
if (!isActive.get()) {
if (scheduledFuture != null) {
scheduledFuture.cancel(true);
}
return;
}
try {
updateAction.doUpdate();
lastUpdated = System.currentTimeMillis();
} catch (Exception e) {
logger.warn("Failed one update cycle", e);
}
}
};
scheduledFuture = getRefreshExecutor().scheduleWithFixedDelay(
wrapperRunnable,
initialDelayMs,
refreshIntervalMs,
TimeUnit.MILLISECONDS
);
} else {
logger.info("Already active, no-op");
}
}
...
private static long getRefreshIntervalMs(IClientConfig clientConfig) {
return clientConfig.get(CommonClientConfigKey.ServerListRefreshInterval, LISTOFSERVERS_CACHE_REPEAT_INTERVAL);
}
}
PollingServerListUpdater#start shows that PollingServerListUpdater simply invokes DynamicServerListLoadBalancer#updateAction on a schedule, and updateAction refreshes the server list. The refresh interval can be changed with ribbon.ServerListRefreshInterval=<milliseconds>; the default is PollingServerListUpdater#LISTOFSERVERS_CACHE_REPEAT_INTERVAL, i.e. 30 seconds. To set the interval for one specific downstream service, use <serviceId>.ribbon.ServerListRefreshInterval=<milliseconds>.
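The polling behaviour described above can be reproduced with the JDK scheduler alone. The following is a minimal sketch, not Ribbon's implementation: PollingSketch, start and awaitRefreshes are hypothetical names, and the 10 ms / 20 ms timings are chosen only so the demo finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of fixed-delay polling; not Ribbon's PollingServerListUpdater.
final class PollingSketch {

    // Runs updateAction after initialDelayMs, then again refreshIntervalMs
    // after each run has finished (fixed delay, not fixed rate).
    static ScheduledFuture<?> start(ScheduledExecutorService executor, Runnable updateAction,
                                    long initialDelayMs, long refreshIntervalMs) {
        Runnable wrapper = () -> {
            try {
                updateAction.run();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all later runs, so log and continue.
                System.err.println("Failed one update cycle: " + e);
            }
        };
        return executor.scheduleWithFixedDelay(wrapper, initialDelayMs, refreshIntervalMs,
                TimeUnit.MILLISECONDS);
    }

    // Waits until the given number of refreshes has happened, with a 5 second timeout.
    static boolean awaitRefreshes(int rounds, long initialDelayMs, long refreshIntervalMs) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(rounds);
        ScheduledFuture<?> future = start(executor, remaining::countDown, initialDelayMs, refreshIntervalMs);
        try {
            return remaining.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            future.cancel(false);
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("3 refreshes seen: " + awaitRefreshes(3, 10, 20));
    }
}
```

Because the delay is measured from the end of one run to the start of the next, a slow lookup pushes every later refresh back, which is one more reason the effective interval can exceed the configured value.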
The refresh itself happens in DynamicServerListLoadBalancer#updateListOfServers, and fetching the latest server list is delegated to DynamicServerListLoadBalancer#serverListImpl. Because our gateway is integrated with Nacos, serverListImpl here is a NacosServerList.
NacosServerList
package com.alibaba.cloud.nacos.ribbon;
public class NacosServerList extends AbstractServerList<NacosServer> {
private NacosDiscoveryProperties discoveryProperties;
private String serviceId;
public NacosServerList(NacosDiscoveryProperties discoveryProperties) {
this.discoveryProperties = discoveryProperties;
}
@Override
public List<NacosServer> getUpdatedListOfServers() {
return getServers();
}
private List<NacosServer> getServers() {
try {
String group = discoveryProperties.getGroup();
List<Instance> instances = discoveryProperties.namingServiceInstance()
.selectInstances(serviceId, group, true);
return instancesToServerList(instances);
}
catch (Exception e) {
throw new IllegalStateException(
"Can not get service instances from nacos, serviceId=" + serviceId,
e);
}
}
private List<NacosServer> instancesToServerList(List<Instance> instances) {
List<NacosServer> result = new ArrayList<>();
if (CollectionUtils.isEmpty(instances)) {
return result;
}
for (Instance instance : instances) {
result.add(new NacosServer(instance));
}
return result;
}
}
The key part is the call discoveryProperties.namingServiceInstance().selectInstances(serviceId, group, true) in NacosServerList#getServers. Following it down leads into NacosNamingService.
NacosNamingService
package com.alibaba.nacos.client.naming;
public class NacosNamingService implements NamingService {
@Override
public List<Instance> selectInstances(String serviceName, boolean healthy) throws NacosException {
return selectInstances(serviceName, new ArrayList<String>(), healthy);
}
@Override
public List<Instance> selectInstances(String serviceName, String groupName, boolean healthy) throws NacosException {
return selectInstances(serviceName, groupName, healthy, true);
}
@Override
public List<Instance> selectInstances(String serviceName, boolean healthy, boolean subscribe)
throws NacosException {
return selectInstances(serviceName, new ArrayList<String>(), healthy, subscribe);
}
@Override
public List<Instance> selectInstances(String serviceName, String groupName, boolean healthy, boolean subscribe)
throws NacosException {
return selectInstances(serviceName, groupName, new ArrayList<String>(), healthy, subscribe);
}
@Override
public List<Instance> selectInstances(String serviceName, List<String> clusters, boolean healthy)
throws NacosException {
return selectInstances(serviceName, clusters, healthy, true);
}
@Override
public List<Instance> selectInstances(String serviceName, String groupName, List<String> clusters, boolean healthy)
throws NacosException {
return selectInstances(serviceName, groupName, clusters, healthy, true);
}
@Override
public List<Instance> selectInstances(String serviceName, List<String> clusters, boolean healthy, boolean subscribe)
throws NacosException {
return selectInstances(serviceName, Constants.DEFAULT_GROUP, clusters, healthy, subscribe);
}
@Override
public List<Instance> selectInstances(String serviceName, String groupName, List<String> clusters, boolean healthy,
boolean subscribe) throws NacosException {
ServiceInfo serviceInfo;
if (subscribe) {
serviceInfo = hostReactor.getServiceInfo(NamingUtils.getGroupedName(serviceName, groupName),
StringUtils.join(clusters, ","));
} else {
serviceInfo = hostReactor
.getServiceInfoDirectlyFromServer(NamingUtils.getGroupedName(serviceName, groupName),
StringUtils.join(clusters, ","));
}
return selectInstances(serviceInfo, healthy);
}
private List<Instance> selectInstances(ServiceInfo serviceInfo, boolean healthy) {
List<Instance> list;
if (serviceInfo == null || CollectionUtils.isEmpty(list = serviceInfo.getHosts())) {
return new ArrayList<Instance>();
}
Iterator<Instance> iterator = list.iterator();
while (iterator.hasNext()) {
Instance instance = iterator.next();
if (healthy != instance.isHealthy() || !instance.isEnabled() || instance.getWeight() <= 0) {
iterator.remove();
}
}
return list;
}
}
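The overload that matters is NacosNamingService#selectInstances(String serviceName, String groupName, List<String> clusters, boolean healthy, boolean subscribe). NacosServerList enters through the three-argument overload, so subscribe is true and the service list is obtained from NacosNamingService#hostReactor. Let's keep digging.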
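The private selectInstances(ServiceInfo, boolean) method boils down to a three-part filter. Below is a minimal sketch of that filter with a stand-in Instance class; InstanceFilterSketch and its fields are hypothetical, the IP addresses are made up, and unlike the Nacos code it returns a new list instead of removing entries from the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the health/enabled/weight filter; not the Nacos client code.
final class InstanceFilterSketch {

    // Stand-in for com.alibaba.nacos.api.naming.pojo.Instance with only the fields used here.
    static final class Instance {
        final String ip;
        final boolean healthy;
        final boolean enabled;
        final double weight;

        Instance(String ip, boolean healthy, boolean enabled, double weight) {
            this.ip = ip;
            this.healthy = healthy;
            this.enabled = enabled;
            this.weight = weight;
        }
    }

    // Keeps instances whose health flag matches the request and that are enabled with a positive weight.
    static List<Instance> select(List<Instance> hosts, boolean healthy) {
        List<Instance> result = new ArrayList<>();
        if (hosts == null) {
            return result;
        }
        for (Instance instance : hosts) {
            if (healthy == instance.healthy && instance.enabled && instance.weight > 0) {
                result.add(instance);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Instance> hosts = Arrays.asList(
                new Instance("10.0.0.1", true, true, 1.0),
                new Instance("10.0.0.2", false, true, 1.0),
                new Instance("10.0.0.3", true, false, 1.0),
                new Instance("10.0.0.4", true, true, 0.0));
        for (Instance instance : select(hosts, true)) {
            System.out.println(instance.ip);
        }
    }
}
```

So a recovered instance only reaches Ribbon once the cached ServiceInfo marks it healthy again; an instance that is up but still flagged unhealthy in the cache is filtered out.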
HostReactor
package com.alibaba.nacos.client.naming.core;
public class HostReactor implements Closeable {
private static final long DEFAULT_DELAY = 1000L;
private static final long UPDATE_HOLD_INTERVAL = 5000L;
private final Map<String, ScheduledFuture<?>> futureMap = new HashMap<String, ScheduledFuture<?>>();
private final Map<String, ServiceInfo> serviceInfoMap;
private final Map<String, Object> updatingMap;
private final PushReceiver pushReceiver;
private final BeatReactor beatReactor;
private final NamingProxy serverProxy;
private final FailoverReactor failoverReactor;
private final String cacheDir;
private final boolean pushEmptyProtection;
private final ScheduledExecutorService executor;
private final InstancesChangeNotifier notifier;
public ServiceInfo getServiceInfo(final String serviceName, final String clusters) {
NAMING_LOGGER.debug("failover-mode: " + failoverReactor.isFailoverSwitch());
String key = ServiceInfo.getKey(serviceName, clusters);
if (failoverReactor.isFailoverSwitch()) {
return failoverReactor.getService(key);
}
ServiceInfo serviceObj = getServiceInfo0(serviceName, clusters);
if (null == serviceObj) {
serviceObj = new ServiceInfo(serviceName, clusters);
serviceInfoMap.put(serviceObj.getKey(), serviceObj);
updatingMap.put(serviceName, new Object());
updateServiceNow(serviceName, clusters);
updatingMap.remove(serviceName);
} else if (updatingMap.containsKey(serviceName)) {
if (UPDATE_HOLD_INTERVAL > 0) {
// hold a moment waiting for update finish
synchronized (serviceObj) {
try {
serviceObj.wait(UPDATE_HOLD_INTERVAL);
} catch (InterruptedException e) {
NAMING_LOGGER
.error("[getServiceInfo] serviceName:" + serviceName + ", clusters:" + clusters, e);
}
}
}
}
scheduleUpdateIfAbsent(serviceName, clusters);
return serviceInfoMap.get(serviceObj.getKey());
}
private ServiceInfo getServiceInfo0(String serviceName, String clusters) {
String key = ServiceInfo.getKey(serviceName, clusters);
return serviceInfoMap.get(key);
}
/**
* Schedule update if absent.
*
* @param serviceName service name
* @param clusters clusters
*/
public void scheduleUpdateIfAbsent(String serviceName, String clusters) {
if (futureMap.get(ServiceInfo.getKey(serviceName, clusters)) != null) {
return;
}
synchronized (futureMap) {
if (futureMap.get(ServiceInfo.getKey(serviceName, clusters)) != null) {
return;
}
ScheduledFuture<?> future = addTask(new UpdateTask(serviceName, clusters));
futureMap.put(ServiceInfo.getKey(serviceName, clusters), future);
}
}
}
HostReactor#getServiceInfo(final String serviceName, final String clusters) ultimately returns the service list cached in HostReactor#serviceInfoMap. As for keeping that cache up to date, getServiceInfo calls scheduleUpdateIfAbsent(String serviceName, String clusters), and that method shows that the cache is refreshed by UpdateTask, an inner class of HostReactor.
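The consequence of this design is that a read never goes to the server once the cache is populated. A minimal sketch of the pattern, under simplifying assumptions: ServiceCacheSketch is a hypothetical class, the remote lookup is a plain function, and runUpdateTask stands in for the scheduled executor firing the task.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of "read from cache, refresh in the background"; not HostReactor itself.
final class ServiceCacheSketch {

    private final Map<String, List<String>> serviceInfoMap = new ConcurrentHashMap<>();
    private final Map<String, Runnable> updateTasks = new ConcurrentHashMap<>();
    private final Function<String, List<String>> remoteLookup;

    ServiceCacheSketch(Function<String, List<String>> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Only a cache miss triggers a synchronous remote lookup; later reads return the cached list.
    List<String> getServiceInfo(String key) {
        if (!serviceInfoMap.containsKey(key)) {
            serviceInfoMap.put(key, remoteLookup.apply(key));
        }
        scheduleUpdateIfAbsent(key);
        return serviceInfoMap.get(key);
    }

    // Registers at most one refresh task per key; returns true if a new task was registered.
    boolean scheduleUpdateIfAbsent(String key) {
        Runnable task = () -> serviceInfoMap.put(key, remoteLookup.apply(key));
        return updateTasks.putIfAbsent(key, task) == null;
    }

    // Stand-in for the scheduled executor running the refresh task once.
    void runUpdateTask(String key) {
        Runnable task = updateTasks.get(key);
        if (task != null) {
            task.run();
        }
    }
}
```

Until the refresh task runs, callers keep seeing the old list, which is exactly the window during which a recovered instance stays invisible to Ribbon.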
UpdateTask
package com.alibaba.nacos.client.naming.core;
public class HostReactor implements Closeable {
private static final long DEFAULT_DELAY = 1000L;
private static final long UPDATE_HOLD_INTERVAL = 5000L;
public class UpdateTask implements Runnable {
long lastRefTime = Long.MAX_VALUE;
private final String clusters;
private final String serviceName;
/**
* the fail situation. 1:can't connect to server 2:serviceInfo's hosts is empty
*/
private int failCount = 0;
public UpdateTask(String serviceName, String clusters) {
this.serviceName = serviceName;
this.clusters = clusters;
}
private void incFailCount() {
int limit = 6;
if (failCount == limit) {
return;
}
failCount++;
}
private void resetFailCount() {
failCount = 0;
}
@Override
public void run() {
long delayTime = DEFAULT_DELAY;
try {
ServiceInfo serviceObj = serviceInfoMap.get(ServiceInfo.getKey(serviceName, clusters));
if (serviceObj == null) {
updateService(serviceName, clusters);
return;
}
if (serviceObj.getLastRefTime() <= lastRefTime) {
updateService(serviceName, clusters);
serviceObj = serviceInfoMap.get(ServiceInfo.getKey(serviceName, clusters));
} else {
// if serviceName already updated by push, we should not override it
// since the push data may be different from pull through force push
refreshOnly(serviceName, clusters);
}
lastRefTime = serviceObj.getLastRefTime();
if (!notifier.isSubscribed(serviceName, clusters) && !futureMap
.containsKey(ServiceInfo.getKey(serviceName, clusters))) {
// abort the update task
NAMING_LOGGER.info("update task is stopped, service:" + serviceName + ", clusters:" + clusters);
return;
}
if (CollectionUtils.isEmpty(serviceObj.getHosts())) {
incFailCount();
return;
}
delayTime = serviceObj.getCacheMillis();
resetFailCount();
} catch (Throwable e) {
incFailCount();
NAMING_LOGGER.warn("[NA] failed to update serviceName: " + serviceName, e);
} finally {
executor.schedule(this, Math.min(delayTime << failCount, DEFAULT_DELAY * 60), TimeUnit.MILLISECONDS);
}
}
}
}
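As you can see, HostReactor#UpdateTask adjusts its refresh interval according to the number of failures.
Two conditions count as a failure:
1. The service list cannot be fetched from Nacos because of an I/O problem (for example, the network).
2. Nacos returns an empty service list.
After a successful refresh the base delay is serviceObj.getCacheMillis() and failCount is reset to 0; after a failure the base delay stays at DEFAULT_DELAY (1000 ms), and incFailCount stops counting at 6. From the formula Math.min(delayTime << failCount, DEFAULT_DELAY * 60) it follows that the maximum refresh interval is 60 seconds. With a base delay of 1000 ms: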
Failure count | Refresh interval (ms) |
---|---|
0 | 1000 |
1 | 2000 |
2 | 4000 |
3 | 8000 |
4 | 16000 |
5 | 32000 |
6 | 60000 |
7 | 60000 |
8 | 60000 |
9 | 60000 |
10 | 60000 |
The backoff in HostReactor#UpdateTask offers no configuration option; it is hard-coded.
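The table can be reproduced with a few lines of code. This is a minimal sketch of the delay formula only: BackoffSketch and FAIL_LIMIT are hypothetical names (FAIL_LIMIT mirrors the local variable limit in incFailCount), and the base delay is passed in because it depends on whether the last run succeeded.

```java
// Hypothetical sketch of the UpdateTask delay formula; not the Nacos client code.
final class BackoffSketch {

    static final long DEFAULT_DELAY = 1000L;
    // Mirrors the local "limit" in UpdateTask#incFailCount.
    static final int FAIL_LIMIT = 6;

    // Doubles the base delay per failure and caps the result at 60 seconds.
    static long nextDelayMs(long delayTime, int failCount) {
        int capped = Math.min(failCount, FAIL_LIMIT);
        return Math.min(delayTime << capped, DEFAULT_DELAY * 60);
    }

    public static void main(String[] args) {
        for (int failCount = 0; failCount <= FAIL_LIMIT; failCount++) {
            System.out.println(failCount + " -> " + nextDelayMs(DEFAULT_DELAY, failCount) + " ms");
        }
    }
}
```

Note that 1000 << 6 is 64000, so the sixth failure is the first one where the 60 second cap actually applies.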
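TL;DR
With gateway + Nacos, the refresh interval of the service list is managed first by Ribbon. It can be changed with ribbon.ServerListRefreshInterval=<milliseconds>; the default is PollingServerListUpdater#LISTOFSERVERS_CACHE_REPEAT_INTERVAL, i.e. 30 seconds. To set the interval for one specific downstream service, use <serviceId>.ribbon.ServerListRefreshInterval=<milliseconds>.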
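However, Ribbon's refresh task itself only asks Nacos for the service instances, and what Nacos hands to Ribbon is also a cached list. The refresh interval of the Nacos cache depends on the failure count: 1 second when there are no failures, growing with each failure up to a maximum of 60 seconds. Neither the refresh algorithm nor the refresh interval of Nacos can be configured.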
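So for gateway + Nacos, the service list refresh interval is at best ribbon.ServerListRefreshInterval + 1s (Nacos refreshes every second when there are no failures) and at worst ribbon.ServerListRefreshInterval + 60s (Nacos refreshes every 60 seconds after repeated failures).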
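As a quick sanity check of these bounds, the sum can be written out directly. This is a minimal sketch with hypothetical names (StalenessWindowSketch, windowMs); it simply adds the two polling intervals and ignores network latency and any server-side delay.

```java
// Hypothetical sketch of the combined refresh window; plain arithmetic only.
final class StalenessWindowSketch {

    static final long NACOS_BEST_CASE_MS = 1_000L;   // no failures
    static final long NACOS_WORST_CASE_MS = 60_000L; // capped backoff

    // Upper bound on how long a change can take to reach Ribbon: one Ribbon poll plus one Nacos refresh.
    static long windowMs(long ribbonRefreshMs, long nacosRefreshMs) {
        return Math.addExact(ribbonRefreshMs, nacosRefreshMs);
    }

    public static void main(String[] args) {
        long ribbonDefaultMs = 30_000L; // default ServerListRefreshInterval
        System.out.println("best case:  " + windowMs(ribbonDefaultMs, NACOS_BEST_CASE_MS) + " ms");
        System.out.println("worst case: " + windowMs(ribbonDefaultMs, NACOS_WORST_CASE_MS) + " ms");
    }
}
```

With the Ribbon default this gives a window between 31 and 90 seconds; lowering ribbon.ServerListRefreshInterval shrinks only the first term.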