Symptoms
Three days after the release went live, we received an alert that system memory had exceeded its threshold, and the inbox was full of frequent full GC alert emails. The memory monitoring chart looked like this:
Monitoring showed that the young GC frequency had clearly increased: now 7~10 times per minute at 100~200 ms each, versus roughly 6 per minute at under 100 ms before. The full GC count had also gone up. The heap usage trend looked like this:
The heap usage curve was saw-toothed; whenever this shape appears, there is a "gremlin" somewhere in the system. The hunt for it begins below.
Troubleshooting
1. Inspecting system resources from the command line
a. Check the process
Using top, we saw the business system's process holding CPU at around 60% continuously. We then inspected the threads inside that process:
top -H -p <pid>
This located a few thread IDs whose CPU usage spiked from time to time. To find out what those threads were, the IDs had to be converted: top prints thread IDs in decimal, while the thread dump shows them in hexadecimal.
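The conversion itself is one line of Java (or `printf '%x'` in the shell); the thread ID below is made up for illustration:

```java
public class TidToHex {
    // Convert a decimal thread ID (as printed by top) to the hex "nid" used in jstack output
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        // e.g. top shows TID 28279; jstack will show the same thread as nid=0x6e77
        System.out.println(toNid(28279));
    }
}
```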
b. Check the thread stacks
jstack -l <pid> | grep <tid>
c. Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>
Looking through the dump for likely leak sites and large objects, we found a CloudEurekaClient object occupying a huge amount of memory. Eureka is our registry, and the system had used it for a long time without any issues, so why did it balloon three days after this release?
2. Checking the Eureka dashboard
Opening Eureka's admin console, we found it had entered self-preservation mode, and one service had 10,000+ registered instances, all stuck in STARTING status.
(The screenshot above was produced manually to reproduce the STARTING status; when the problem actually occurred there really were 10,000+ instances.)
With the problem located, we manually deleted all the STARTING instances. Checking the business system again, memory and CPU had returned to normal.
Root Cause Analysis
1. The DiscoveryClient constructor
```java
@Inject
DiscoveryClient(ApplicationInfoManager applicationInfoManager, EurekaClientConfig config,
                AbstractDiscoveryClientOptionalArgs args, Provider<BackupRegistry> backupRegistryProvider) {
    ......
    this.applicationInfoManager = applicationInfoManager;
    InstanceInfo myInfo = applicationInfoManager.getInfo();

    clientConfig = config;
    staticClientConfig = clientConfig;
    transportConfig = config.getTransportConfig();
    instanceInfo = myInfo;
    if (myInfo != null) {
        appPathIdentifier = instanceInfo.getAppName() + "/" + instanceInfo.getId();
    } else {
        logger.warn("Setting instanceInfo to a passed in null value");
    }
    ......
    // In the thread dump, this cache-refresh thread pool was the one burning CPU
    cacheRefreshExecutor = new ThreadPoolExecutor(
            1, clientConfig.getCacheRefreshExecutorThreadPoolSize(), 0, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>(),
            new ThreadFactoryBuilder()
                    .setNameFormat("DiscoveryClient-CacheRefreshExecutor-%d")
                    .setDaemon(true)
                    .build()
    );  // use direct handoff
    ......
    // initialize the scheduled tasks
    initScheduledTasks();
    ......
}
```
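A side note on the SynchronousQueue above: it makes the pool use direct handoff, so a submitted task must immediately be taken by an idle thread (or a new one, up to the maximum), otherwise it is rejected rather than queued. A minimal sketch of that behavior, with pool sizes made up for illustration:

```java
import java.util.concurrent.*;

public class DirectHandoffDemo {
    // Same queueing shape as cacheRefreshExecutor: SynchronousQueue means no task ever waits in a queue
    static boolean thirdTaskRejected() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 0, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        Runnable busy = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };
        pool.execute(busy); // worker 1 created (core thread)
        pool.execute(busy); // handoff fails, worker 2 created (pool grows to max)
        boolean rejected = false;
        try {
            pool.execute(busy); // no idle worker and no queue capacity -> rejected
        } catch (RejectedExecutionException e) {
            rejected = true;
        }
        release.countDown();
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("third task rejected: " + thirdTaskRejected());
    }
}
```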
2. The initScheduledTasks() method
```java
/**
 * Initializes all scheduled tasks.
 */
private void initScheduledTasks() {
    if (clientConfig.shouldFetchRegistry()) {
        // registry cache refresh timer
        int registryFetchIntervalSeconds = clientConfig.getRegistryFetchIntervalSeconds();
        int expBackOffBound = clientConfig.getCacheRefreshExecutorExponentialBackOffBound();
        scheduler.schedule(
                new TimedSupervisorTask(
                        "cacheRefresh",
                        scheduler,
                        cacheRefreshExecutor,
                        registryFetchIntervalSeconds,
                        TimeUnit.SECONDS,
                        expBackOffBound,
                        // the thread that refreshes the cache
                        new CacheRefreshThread()
                ),
                // refreshes every 30s by default
                registryFetchIntervalSeconds, TimeUnit.SECONDS);
    }
    ......
}

/**
 * The task that fetches the registry information at specified intervals.
 */
class CacheRefreshThread implements Runnable {
    public void run() {
        // refresh the registry information
        refreshRegistry();
    }
}

public class DefaultEurekaClientConfig implements EurekaClientConfig {
    @Override
    public int getRegistryFetchIntervalSeconds() {
        return configInstance.getIntProperty(
                namespace + REGISTRY_REFRESH_INTERVAL_KEY, 30).get();
    }
}
```
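The expBackOffBound argument controls TimedSupervisorTask's retry policy: when a refresh run times out, the next delay is doubled, capped at expBackOffBound times the base interval, and it resets on success. A sketch of that delay policy (the method name here is illustrative, not Eureka's actual code):

```java
public class BackoffPolicy {
    // Next delay after a timeout: double it, but never beyond bound * baseInterval
    static long nextDelayOnTimeout(long currentDelay, long baseInterval, int expBackOffBound) {
        long maxDelay = baseInterval * expBackOffBound;
        return Math.min(maxDelay, currentDelay * 2);
    }

    public static void main(String[] args) {
        long delay = 30; // seconds: the default fetch interval
        for (int i = 0; i < 5; i++) {
            delay = nextDelayOnTimeout(delay, 30, 10);
            System.out.println(delay); // doubles, then saturates at 300s
        }
    }
}
```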
3. The refreshRegistry method
```java
@VisibleForTesting
void refreshRegistry() {
    try {
        boolean isFetchingRemoteRegionRegistries = isFetchingRemoteRegionRegistries();
        boolean remoteRegionsModified = false;
        // This makes sure that a dynamic change to remote regions to fetch is honored.
        String latestRemoteRegions = clientConfig.fetchRegistryForRemoteRegions();
        if (null != latestRemoteRegions) {
            String currentRemoteRegions = remoteRegionsToFetch.get();
            if (!latestRemoteRegions.equals(currentRemoteRegions)) {
                // Both remoteRegionsToFetch and AzToRegionMapper.regionsToFetch need to be in sync
                synchronized (instanceRegionChecker.getAzToRegionMapper()) {
                    if (remoteRegionsToFetch.compareAndSet(currentRemoteRegions, latestRemoteRegions)) {
                        String[] remoteRegions = latestRemoteRegions.split(",");
                        remoteRegionsRef.set(remoteRegions);
                        instanceRegionChecker.getAzToRegionMapper().setRegionsToFetch(remoteRegions);
                        remoteRegionsModified = true;
                    } else {
                        logger.info("Remote regions to fetch modified concurrently,"
                                + " ignoring change from {} to {}", currentRemoteRegions, latestRemoteRegions);
                    }
                }
            } else {
                // Just refresh mapping to reflect any DNS/Property change
                instanceRegionChecker.getAzToRegionMapper().refreshMapping();
            }
        }

        // Fetch the registry information and update the local cache
        boolean success = fetchRegistry(remoteRegionsModified);
        if (success) {
            registrySize = localRegionApps.get().size();
            lastSuccessfulRegistryFetchTimestamp = System.currentTimeMillis();
        }
        .......
    } catch (Throwable e) {
        logger.error("Cannot fetch registry from server", e);
    }
}
```
4. The fetchRegistry method
```java
/**
 * Fetches the registry information.
 *
 * This method tries to get only deltas after the first fetch unless there
 * is an issue in reconciling eureka server and client registry information.
 *
 * @param forceFullRegistryFetch Forces a full registry fetch.
 * @return true if the registry was fetched
 */
private boolean fetchRegistry(boolean forceFullRegistryFetch) {
    Stopwatch tracer = FETCH_REGISTRY_TIMER.start();
    try {
        // If the delta is disabled or if it is the first time, get all applications
        Applications applications = getApplications();
        if (clientConfig.shouldDisableDelta()
                || (!Strings.isNullOrEmpty(clientConfig.getRegistryRefreshSingleVipAddress()))
                || forceFullRegistryFetch
                || (applications == null)
                || (applications.getRegisteredApplications().size() == 0)
                || (applications.getVersion() == -1)) // Client application does not have latest library supporting delta
        {
            logger.info("Disable delta property : {}", clientConfig.shouldDisableDelta());
            logger.info("Single vip registry refresh property : {}", clientConfig.getRegistryRefreshSingleVipAddress());
            logger.info("Force full registry fetch : {}", forceFullRegistryFetch);
            logger.info("Application is null : {}", (applications == null));
            logger.info("Registered Applications size is zero : {}",
                    (applications.getRegisteredApplications().size() == 0));
            logger.info("Application version is -1: {}", (applications.getVersion() == -1));
            // On the first fetch, pull the full registry and store it in the local cache
            getAndStoreFullRegistry();
        } else {
            // Otherwise fetch only the delta and merge it into the local registry
            getAndUpdateDelta(applications);
        }
        applications.setAppsHashCode(applications.getReconcileHashCode());
        logTotalInstances();
    } catch (Throwable e) {
        logger.error(PREFIX + appPathIdentifier + " - was unable to refresh its cache! status = "
                + e.getMessage(), e);
        return false;
    } finally {
        if (tracer != null) {
            tracer.stop();
        }
    }
    // Notify about cache refresh before updating the instance remote status
    onCacheRefreshed();
    // Update remote status based on refreshed data held in the cache
    updateInstanceRemoteStatus();
    // registry was fetched successfully, so return true
    return true;
}
```
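The long branch condition above boils down to a small predicate; the sketch below distills when the client falls back to a full fetch instead of a delta fetch (the method and parameter names are illustrative, not Eureka's):

```java
public class FetchDecision {
    // Mirrors the branch in fetchRegistry: any of these conditions forces a full registry fetch
    static boolean shouldFullFetch(boolean deltaDisabled, boolean singleVipConfigured,
                                   boolean forceFull, Integer localAppCount, long localVersion) {
        return deltaDisabled
                || singleVipConfigured
                || forceFull
                || localAppCount == null      // local cache not initialized yet
                || localAppCount == 0         // nothing fetched so far
                || localVersion == -1;        // local registry has no valid version
    }

    public static void main(String[] args) {
        // First fetch: empty local cache -> full fetch
        System.out.println(shouldFullFetch(false, false, false, 0, 1L));
        // Steady state: a delta fetch is enough
        System.out.println(shouldFullFetch(false, false, false, 120, 1L));
    }
}
```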
5. The getAndUpdateDelta method
```java
/**
 * Get the delta registry information from the eureka server and update it locally.
 * When applying the delta, the following flow is observed:
 *
 * if (update generation have not advanced (due to another thread))
 *     atomically try to: update application with the delta and get reconcileHashCode
 *     abort entire processing otherwise
 *     do reconciliation if reconcileHashCode clash
 * fi
 *
 * @return the client response
 * @throws Throwable on error
 */
private void getAndUpdateDelta(Applications applications) throws Throwable {
    long currentUpdateGeneration = fetchRegistryGeneration.get();

    Applications delta = null;
    // Call the delta endpoint (serviceUrl/apps/delta) to query the changed services on Eureka
    EurekaHttpResponse<Applications> httpResponse = eurekaTransport.queryClient.getDelta(remoteRegionsRef.get());
    if (httpResponse.getStatusCode() == Status.OK.getStatusCode()) {
        delta = httpResponse.getEntity();
    }

    if (delta == null) {
        logger.warn("The server does not allow the delta revision to be applied because it is not safe. "
                + "Hence got the full registry.");
        // If the delta endpoint returned nothing, call serviceUrl/apps to fetch the full registry
        getAndStoreFullRegistry();
    } else if (fetchRegistryGeneration.compareAndSet(currentUpdateGeneration, currentUpdateGeneration + 1)) {
        logger.debug("Got delta update with apps hashcode {}", delta.getAppsHashCode());
        String reconcileHashCode = "";
        if (fetchRegistryUpdateLock.tryLock()) {
            try {
                // Merge the fetched delta into the local cache
                updateDelta(delta);
                reconcileHashCode = getReconcileHashCode(applications);
            } finally {
                fetchRegistryUpdateLock.unlock();
            }
        } else {
            logger.warn("Cannot acquire update lock, aborting getAndUpdateDelta");
        }
        // There is a diff in number of instances for some reason
        if (!reconcileHashCode.equals(delta.getAppsHashCode()) || clientConfig.shouldLogDeltaDiff()) {
            reconcileAndLogDifference(delta, reconcileHashCode); // this makes a remoteCall
        }
    } else {
        logger.warn("Not updating application delta as another thread is updating it already");
        logger.debug("Ignoring delta update with apps hashcode {}, as another thread is updating it already",
                delta.getAppsHashCode());
    }
}
```
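The fetchRegistryGeneration guard above is a common concurrency pattern: an AtomicLong generation counter ensures that when two fetchers race, only the one holding the current generation gets to apply its result. A stripped-down sketch of the pattern (class and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class GenerationGuard {
    private final AtomicLong generation = new AtomicLong(0);

    long current() {
        return generation.get();
    }

    // Returns true if this caller won the right to apply its fetched data;
    // fails if another thread already bumped the generation since we read it
    boolean tryAdvance(long observedGeneration) {
        return generation.compareAndSet(observedGeneration, observedGeneration + 1);
    }

    public static void main(String[] args) {
        GenerationGuard guard = new GenerationGuard();
        long g = guard.current();
        System.out.println(guard.tryAdvance(g)); // first updater wins
        System.out.println(guard.tryAdvance(g)); // stale generation: this update is dropped
    }
}
```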
We have walked through a lot of code by now; this is the core flow by which the client refreshes its service list. The method below is the core of the core: it applies the fetched service list to the local cache.
6. The updateDelta method
```java
/**
 * Updates the delta information fetches from the eureka server into the
 * local cache.
 *
 * @param delta
 *            the delta information received from eureka server in the last
 *            poll cycle.
 */
private void updateDelta(Applications delta) {
    int deltaCount = 0;
    for (Application app : delta.getRegisteredApplications()) {
        for (InstanceInfo instance : app.getInstances()) {
            Applications applications = getApplications();
            String instanceRegion = instanceRegionChecker.getInstanceRegion(instance);
            if (!instanceRegionChecker.isLocalRegion(instanceRegion)) {
                Applications remoteApps = remoteRegionVsApps.get(instanceRegion);
                if (null == remoteApps) {
                    remoteApps = new Applications();
                    remoteRegionVsApps.put(instanceRegion, remoteApps);
                }
                applications = remoteApps;
            }
            ......
        }
    }
}
```
Reading this far, the root cause is finally clear: one service in Eureka had 10,000+ registered instances, so every scheduled refresh had to traverse all of them, taking a long time and allocating a large amount of memory, which is what caused the frequent young GC and full GC.
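The pressure is easy to picture in miniature: every 30-second refresh deserializes and walks a registry snapshot of 10,000+ InstanceInfo entries, all of which become garbage once the cycle ends. A toy sketch of that allocation pattern (this is not Eureka code; entry contents are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class RefreshPressure {
    // Simulate one refresh cycle: build and walk a registry snapshot of `instances` entries.
    // Every entry stands in for a deserialized InstanceInfo; all of it is garbage after the cycle.
    static long simulateRefresh(int instances) {
        List<String> snapshot = new ArrayList<>(instances);
        for (int i = 0; i < instances; i++) {
            snapshot.add("instance-" + i + "@STARTING");
        }
        long touched = 0;
        for (String s : snapshot) {
            touched += s.length(); // walking every instance, as updateDelta's nested loops do
        }
        return touched;
    }

    public static void main(String[] args) {
        // 10,000+ entries per cycle, every 30 seconds, quickly churns the young generation
        System.out.println(simulateRefresh(10_000) > 0);
    }
}
```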
Summary
When locating a problem, don't fixate only on the newly released code; sometimes a middleware component is what drives the business system's resource usage up. The general approach:
Observe the symptoms carefully
Use top to inspect process and thread CPU usage
Use jstack to inspect the thread stacks
Use jmap to dump the heap, then analyze large objects with MAT
Trace the source code to find the root cause
After the incident, put measures in place to prevent the same class of problem