Symptoms
Three days after the release went live, we received an alert that system memory had exceeded its threshold, and the inbox was full of frequent full GC alert emails. The memory monitoring chart looked like this:
Monitoring showed that the young GC frequency had clearly increased: now 7~10 times per minute at 100~200 ms each, versus roughly 6 per minute at under 100 ms before. The full GC count had also gone up. The heap usage trend looked like this:
The heap usage curve was saw-toothed; whenever this shape appears, there is a "gremlin" somewhere in the system. The hunt for it begins below.
Troubleshooting
1. Inspecting system resources from the command line
a. Check the process
Using top, we saw the business system's process holding CPU at around 60% continuously. We then inspected the threads inside that process:
top -H -p <pid>
This located a few thread IDs whose CPU usage spiked from time to time. To find out what those threads were, the IDs had to be converted: top prints thread IDs in decimal, while the thread dump shows them in hexadecimal.
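The conversion itself is one line of Java (or `printf '%x'` in the shell); the thread ID below is made up for illustration:

```java
public class TidToHex {
    // Convert a decimal thread ID (as printed by top) to the hex "nid" used in jstack output
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    public static void main(String[] args) {
        // e.g. top shows TID 28279; jstack will show the same thread as nid=0x6e77
        System.out.println(toNid(28279));
    }
}
```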
b. Check the thread stacks
jstack -l <pid> | grep <tid>
c. Dump the heap
jmap -dump:format=b,file=heap.hprof <pid>
Looking through the dump for likely leak sites and large objects, we found a CloudEurekaClient object occupying a huge amount of memory. Eureka is our registry, and the system had used it for a long time without any issues, so why did it balloon three days after this release?
2. Checking the Eureka dashboard
Opening Eureka's admin console, we found it had entered self-preservation mode, and one service had 10,000+ registered instances, all stuck in STARTING status.
(The screenshot above was produced manually to reproduce the STARTING status; when the problem actually occurred there really were 10,000+ instances.)
With the problem located, we manually deleted all the STARTING instances. Checking the business system again, memory and CPU had returned to normal.
Root Cause Analysis
1. The DiscoveryClient constructor
```java
@Inject
DiscoveryClient(ApplicationInfoManager applicationInfoManager, EurekaClientConfig config,
                AbstractDiscoveryClientOptionalArgs args, Provider<BackupRegistry> backupRegistryProvider) {
    ......
    this.applicationInfoManager = applicationInfoManager;
    InstanceInfo myInfo = applicationInfoManager.getInfo();

    clientConfig = config;
    staticClientConfig = clientConfig;
    transportConfig = config.getTransportConfig();
    instanceInfo = myInfo;
    if (myInfo != null) {
        appPathIdentifier = instanceInfo.getAppName() + "/" + instanceInfo.getId();
    } else {
        logger.warn("Setting instanceInfo to a passed in null value");
    }
    ......
    // In the thread dump, this cache-refresh thread pool was the one burning CPU
    cacheRefreshExecutor = new ThreadPoolExecutor(
            1, clientConfig.getCacheRefreshExecutorThreadPoolSize(), 0, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>(),
            new ThreadFactoryBuilder()
                    .setNameFormat("DiscoveryClient-CacheRefreshExecutor-%d")
                    .setDaemon(true)
                    .build()
    );  // use direct handoff
    ......
    // initialize the scheduled tasks
    initScheduledTasks();
    ......
}
```
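A side note on the SynchronousQueue above: it makes the pool use direct handoff, so a submitted task must immediately be taken by an idle thread (or a new one, up to the maximum), otherwise it is rejected rather than queued. A minimal sketch of that behavior, with pool sizes made up for illustration:

```java
import java.util.concurrent.*;

public class DirectHandoffDemo {
    // Same queueing shape as cacheRefreshExecutor: SynchronousQueue means no task ever waits in a queue
    static boolean thirdTaskRejected() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 2, 0, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        Runnable busy = () -> {
            try { release.await(); } catch (InterruptedException ignored) { }
        };
        pool.execute(busy); // worker 1 created (core thread)
        pool.execute(busy); // handoff fails, worker 2 created (pool grows to max)
        boolean rejected = false;
        try {
            pool.execute(busy); // no idle worker and no queue capacity -> rejected
        } catch (RejectedExecutionException e) {
            rejected = true;
        }
        release.countDown();
        pool.shutdown();
        return rejected;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("third task rejected: " + thirdTaskRejected());
    }
}
```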
2. The initScheduledTasks() method
```java
/**
 * Initializes all scheduled tasks.
 */
private void initScheduledTasks() {
    if (clientConfig.shouldFetchRegistry()) {
        // registry cache refresh timer
        int registryFetchIntervalSeconds = clientConfig.getRegistryFetchIntervalSeconds();
        int expBackOffBound = clientConfig.getCacheRefreshExecutorExponentialBackOffBound();
        scheduler.schedule(
                new TimedSupervisorTask(
                        "cacheRefresh",
                        scheduler,
                        cacheRefreshExecutor,
                        registryFetchIntervalSeconds,
                        TimeUnit.SECONDS,
                        expBackOffBound,
                        // the thread that refreshes the cache
                        new CacheRefreshThread()
                ),
                // refreshes every 30s by default
                registryFetchIntervalSeconds, TimeUnit.SECONDS);
    }
    ......
}

/**
 * The task that fetches the registry information at specified intervals.
 */
class CacheRefreshThread implements Runnable {
    public void run() {
        // refresh the registry information
        refreshRegistry();
    }
}

public class DefaultEurekaClientConfig implements EurekaClientConfig {
    @Override
    public int getRegistryFetchIntervalSeconds() {
        return configInstance.getIntProperty(
                namespace + REGISTRY_REFRESH_INTERVAL_KEY, 30).get();
    }
}
```
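The expBackOffBound argument controls TimedSupervisorTask's retry policy: when a refresh run times out, the next delay is doubled, capped at expBackOffBound times the base interval, and it resets on success. A sketch of that delay policy (the method name here is illustrative, not Eureka's actual code):

```java
public class BackoffPolicy {
    // Next delay after a timeout: double it, but never beyond bound * baseInterval
    static long nextDelayOnTimeout(long currentDelay, long baseInterval, int expBackOffBound) {
        long maxDelay = baseInterval * expBackOffBound;
        return Math.min(maxDelay, currentDelay * 2);
    }

    public static void main(String[] args) {
        long delay = 30; // seconds: the default fetch interval
        for (int i = 0; i < 5; i++) {
            delay = nextDelayOnTimeout(delay, 30, 10);
            System.out.println(delay); // doubles, then saturates at 300s
        }
    }
}
```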
3. The refreshRegistry method
```java
@VisibleForTesting
void refreshRegistry() {
    try {
        boolean isFetchingRemoteRegionRegistries = isFetchingRemoteRegionRegistries();
        boolean remoteRegionsModified = false;
        // This makes sure that a dynamic change to remote regions to fetch is honored.
        String latestRemoteRegions = clientConfig.fetchRegistryForRemoteRegions();
        if (null != latestRemoteRegions) {
            String currentRemoteRegions = remoteRegionsToFetch.get();
            if (!latestRemoteRegions.equals(currentRemoteRegions)) {
                // Both remoteRegionsToFetch and AzToRegionMapper.regionsToFetch need to be in sync
                synchronized (instanceRegionChecker.getAzToRegionMapper()) {
                    if (remoteRegionsToFetch.compareAndSet(currentRemoteRegions, latestRemoteRegions)) {
                        String[] remoteRegions = latestRemoteRegions.split(",");
                        remoteRegionsRef.set(remoteRegions);
                        instanceRegionChecker.getAzToRegionMapper().setRegionsToFetch(remoteRegions);
                        remoteRegionsModified = true;
                    } else {
                        logger.info("Remote regions to fetch modified concurrently,"
                                + " ignoring change from {} to {}", currentRemoteRegions, latestRemoteRegions);
                    }
                }
            } else {
                // Just refresh mapping to reflect any DNS/Property change
                instanceRegionChecker.getAzToRegionMapper().refreshMapping();
            }
        }

        // Fetch the registry information and update the local cache
        boolean success = fetchRegistry(remoteRegionsModified);
        if (success) {
            registrySize = localRegionApps.get().size();
            lastSuccessfulRegistryFetchTimestamp = System.currentTimeMillis();
        }
        .......
    } catch (Throwable e) {
        logger.error("Cannot fetch registry from server", e);
    }
}
```
4. The fetchRegistry method
```java
/**
 * Fetches the registry information.
 *
 * This method tries to get only deltas after the first fetch unless there
 * is an issue in reconciling eureka server and client registry information.
 *
 * @param forceFullRegistryFetch Forces a full registry fetch.
 * @return true if the registry was fetched
 */
private boolean fetchRegistry(boolean forceFullRegistryFetch) {
    Stopwatch tracer = FETCH_REGISTRY_TIMER.start();
    try {
        // If the delta is disabled or if it is the first time, get all applications
        Applications applications = getApplications();
        if (clientConfig.shouldDisableDelta()
                || (!Strings.isNullOrEmpty(clientConfig.getRegistryRefreshSingleVipAddress()))
                || forceFullRegistryFetch
                || (applications == null)
                || (applications.getRegisteredApplications().size() == 0)
                || (applications.getVersion() == -1)) // Client application does not have latest library supporting delta
        {
            logger.info("Disable delta property : {}", clientConfig.shouldDisableDelta());
            logger.info("Single vip registry refresh property : {}", clientConfig.getRegistryRefreshSingleVipAddress());
            logger.info("Force full registry fetch : {}", forceFullRegistryFetch);
            logger.info("Application is null : {}", (applications == null));
            logger.info("Registered Applications size is zero : {}",
                    (applications.getRegisteredApplications().size() == 0));
            logger.info("Application version is -1: {}", (applications.getVersion() == -1));
            // On the first fetch, pull the full registry and store it in the local cache
            getAndStoreFullRegistry();
        } else {
            // Otherwise fetch only the delta and merge it into the local registry
            getAndUpdateDelta(applications);
        }
        applications.setAppsHashCode(applications.getReconcileHashCode());
        logTotalInstances();
    } catch (Throwable e) {
        logger.error(PREFIX + appPathIdentifier + " - was unable to refresh its cache! status = "
                + e.getMessage(), e);
        return false;
    } finally {
        if (tracer != null) {
            tracer.stop();
        }
    }
    // Notify about cache refresh before updating the instance remote status
    onCacheRefreshed();
    // Update remote status based on refreshed data held in the cache
    updateInstanceRemoteStatus();
    // registry was fetched successfully, so return true
    return true;
}
```
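The long branch condition above boils down to a small predicate; the sketch below distills when the client falls back to a full fetch instead of a delta fetch (the method and parameter names are illustrative, not Eureka's):

```java
public class FetchDecision {
    // Mirrors the branch in fetchRegistry: any of these conditions forces a full registry fetch
    static boolean shouldFullFetch(boolean deltaDisabled, boolean singleVipConfigured,
                                   boolean forceFull, Integer localAppCount, long localVersion) {
        return deltaDisabled
                || singleVipConfigured
                || forceFull
                || localAppCount == null      // local cache not initialized yet
                || localAppCount == 0         // nothing fetched so far
                || localVersion == -1;        // local registry has no valid version
    }

    public static void main(String[] args) {
        // First fetch: empty local cache -> full fetch
        System.out.println(shouldFullFetch(false, false, false, 0, 1L));
        // Steady state: a delta fetch is enough
        System.out.println(shouldFullFetch(false, false, false, 120, 1L));
    }
}
```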
5. The getAndUpdateDelta method
```java
/**
 * Get the delta registry information from the eureka server and update it locally.
 * When applying the delta, the following flow is observed:
 *
 * if (update generation have not advanced (due to another thread))
 *     atomically try to: update application with the delta and get reconcileHashCode
 *     abort entire processing otherwise
 *     do reconciliation if reconcileHashCode clash
 * fi
 *
 * @return the client response
 * @throws Throwable on error
 */
private void getAndUpdateDelta(Applications applications) throws Throwable {
    long currentUpdateGeneration = fetchRegistryGeneration.get();

    Applications delta = null;
    // Call the delta endpoint (serviceUrl/apps/delta) to query the changed services on Eureka
    EurekaHttpResponse<Applications> httpResponse = eurekaTransport.queryClient.getDelta(remoteRegionsRef.get());
    if (httpResponse.getStatusCode() == Status.OK.getStatusCode()) {
        delta = httpResponse.getEntity();
    }

    if (delta == null) {
        logger.warn("The server does not allow the delta revision to be applied because it is not safe. "
                + "Hence got the full registry.");
        // If the delta endpoint returned nothing, call serviceUrl/apps to fetch the full registry
        getAndStoreFullRegistry();
    } else if (fetchRegistryGeneration.compareAndSet(currentUpdateGeneration, currentUpdateGeneration + 1)) {
        logger.debug("Got delta update with apps hashcode {}", delta.getAppsHashCode());
        String reconcileHashCode = "";
        if (fetchRegistryUpdateLock.tryLock()) {
            try {
                // Merge the fetched delta into the local cache
                updateDelta(delta);
                reconcileHashCode = getReconcileHashCode(applications);
            } finally {
                fetchRegistryUpdateLock.unlock();
            }
        } else {
            logger.warn("Cannot acquire update lock, aborting getAndUpdateDelta");
        }
        // There is a diff in number of instances for some reason
        if (!reconcileHashCode.equals(delta.getAppsHashCode()) || clientConfig.shouldLogDeltaDiff()) {
            reconcileAndLogDifference(delta, reconcileHashCode); // this makes a remoteCall
        }
    } else {
        logger.warn("Not updating application delta as another thread is updating it already");
        logger.debug("Ignoring delta update with apps hashcode {}, as another thread is updating it already",
                delta.getAppsHashCode());
    }
}
```
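The fetchRegistryGeneration guard above is a common concurrency pattern: an AtomicLong generation counter ensures that when two fetchers race, only the one holding the current generation gets to apply its result. A stripped-down sketch of the pattern (class and method names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

public class GenerationGuard {
    private final AtomicLong generation = new AtomicLong(0);

    long current() {
        return generation.get();
    }

    // Returns true if this caller won the right to apply its fetched data;
    // fails if another thread already bumped the generation since we read it
    boolean tryAdvance(long observedGeneration) {
        return generation.compareAndSet(observedGeneration, observedGeneration + 1);
    }

    public static void main(String[] args) {
        GenerationGuard guard = new GenerationGuard();
        long g = guard.current();
        System.out.println(guard.tryAdvance(g)); // first updater wins
        System.out.println(guard.tryAdvance(g)); // stale generation: this update is dropped
    }
}
```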
We have walked through a lot of code by now; this is the core flow by which the client refreshes its service list. The method below is the core of the core: it applies the fetched service list to the local cache.
6. The updateDelta method
```java
/**
 * Updates the delta information fetches from the eureka server into the
 * local cache.
 *
 * @param delta
 *            the delta information received from eureka server in the last
 *            poll cycle.
 */
private void updateDelta(Applications delta) {
    int deltaCount = 0;
    for (Application app : delta.getRegisteredApplications()) {
        for (InstanceInfo instance : app.getInstances()) {
            Applications applications = getApplications();
            String instanceRegion = instanceRegionChecker.getInstanceRegion(instance);
            if (!instanceRegionChecker.isLocalRegion(instanceRegion)) {
                Applications remoteApps = remoteRegionVsApps.get(instanceRegion);
                if (null == remoteApps) {
                    remoteApps = new Applications();
                    remoteRegionVsApps.put(instanceRegion, remoteApps);
                }
                applications = remoteApps;
            }
            ......
        }
    }
}
```
Reading this far, the root cause is finally clear: one service in Eureka had 10,000+ registered instances, so every scheduled refresh had to traverse all of them, taking a long time and allocating a large amount of memory, which is what caused the frequent young GC and full GC.
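The pressure is easy to picture in miniature: every 30-second refresh deserializes and walks a registry snapshot of 10,000+ InstanceInfo entries, all of which become garbage once the cycle ends. A toy sketch of that allocation pattern (this is not Eureka code; entry contents are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class RefreshPressure {
    // Simulate one refresh cycle: build and walk a registry snapshot of `instances` entries.
    // Every entry stands in for a deserialized InstanceInfo; all of it is garbage after the cycle.
    static long simulateRefresh(int instances) {
        List<String> snapshot = new ArrayList<>(instances);
        for (int i = 0; i < instances; i++) {
            snapshot.add("instance-" + i + "@STARTING");
        }
        long touched = 0;
        for (String s : snapshot) {
            touched += s.length(); // walking every instance, as updateDelta's nested loops do
        }
        return touched;
    }

    public static void main(String[] args) {
        // 10,000+ entries per cycle, every 30 seconds, quickly churns the young generation
        System.out.println(simulateRefresh(10_000) > 0);
    }
}
```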
Summary
When locating a problem, don't fixate only on the newly released code; sometimes a middleware component is what drives the business system's resource usage up. The general approach:
Observe the symptoms carefully
Use top to inspect process and thread CPU usage
Use jstack to inspect the thread stacks
Use jmap to dump the heap, then analyze large objects with MAT
Trace the source code to find the root cause
After the incident, put measures in place to prevent the same class of problem