fullgc频繁的原因_一次Eureka引起的业务系统频繁fullgc的排查

现象描述


上完线3天后,收到报警信息说系统内存超过阈值了,打开邮件看到频繁的fullgc邮件。内存监控如下:

f5e6a5c4150c49cc1ce48a4475abc69e.png

监控显示yanggc的频率明显比之前变大了。现在7~10/min,每次耗时100~200ms,而之前6/min左右,每次耗时<100ms。监控中的fullgc次数也增多了。查看堆的使用量走势,如下图:

44072588d82cebe6b31435a203bae4d1.png

堆的形状呈锯齿状,此形状一出,系统必有“妖”。下面开始捉“妖”的过程。

排查过程


  1. 通过命令查看系统资源

a.查看进程信息

通过top查看系统占用资源情况,发现业务系统进程的cpu持续占有60%左右。查看进程中线程情况:

top -H -p <pid>

定位到有几个线程ID,时不时的会飙高cpu使用率。那就看看这个几个线程是谁,因为top看到的线程ID是十进制而栈中线程ID是十六进制,ID需要转换。

b.查看栈信息

jstack -l <pid> | grep <tid>

f0b99985380a5609ce526f15b81644f3.png

c.查看堆信息

jmap -dump:format=b,file=heap.hprof 

94004e7d6b1bb3037eca06fd016cbee0.png

a2aaa5309862c1a55cd7562427a1f78c.png

观察内存泄露的可能地方和内存的大对象,发现CloudEurekaClient对象占用了大量内存。Eureka作为注册中心,我们的系统使用了很长时间一直都没问题,为啥这次上线3天后内存占用这么大?

    2. 后台查看Eureka信息

打开Eureka的管理后台,发现进入了自我保护模式。而且有一个服务的实例有1W+,都处于Starting状态。

2692f8d7e00a11ae2c9944555a88ced4.png

bd600512944fab296c28e003f364b779.png

此图为了模拟产生STARTING状态 手动产生的,Eureka出问题时的实际实例1W+。

问题定位到了,一顿操作,手动删除了这些Starting状态的实例。再去业务系统查看内存和CPU,发现已经正常了。

分析根本原因


    1. 查看DiscoveryClient源码

@InjectDiscoveryClient(ApplicationInfoManager applicationInfoManager, EurekaClientConfig config, AbstractDiscoveryClientOptionalArgs args,                Provider backupRegistryProvider) {    ......    this.applicationInfoManager = applicationInfoManager;    InstanceInfo myInfo = applicationInfoManager.getInfo();    clientConfig = config;    staticClientConfig = clientConfig;    transportConfig = config.getTransportConfig();    instanceInfo = myInfo;    if (myInfo != null) {        appPathIdentifier = instanceInfo.getAppName() + "/" + instanceInfo.getId();    } else {        logger.warn("Setting instanceInfo to a passed in null value");    }    ......    //在栈信息中,定位到缓存刷新线程池占用了大量CPU资源    cacheRefreshExecutor = new ThreadPoolExecutor(            1, clientConfig.getCacheRefreshExecutorThreadPoolSize(), 0, TimeUnit.SECONDS,            new SynchronousQueue(),            new ThreadFactoryBuilder()                    .setNameFormat("DiscoveryClient-CacheRefreshExecutor-%d")                    .setDaemon(true)                    .build()    );  // use direct handoff    ......    //初始化定时任务    initScheduledTasks();    ......}

    2. 查看方法initScheduledTasks()源码

/** * Initializes all scheduled tasks. */private void initScheduledTasks() {    if (clientConfig.shouldFetchRegistry()) {        // registry cache refresh timer        int registryFetchIntervalSeconds = clientConfig.getRegistryFetchIntervalSeconds();        int expBackOffBound = clientConfig.getCacheRefreshExecutorExponentialBackOffBound();        scheduler.schedule(                new TimedSupervisorTask(                        "cacheRefresh",                        scheduler,                        cacheRefreshExecutor,                        registryFetchIntervalSeconds,                        TimeUnit.SECONDS,                        expBackOffBound,                        //刷新缓存的线程                        new CacheRefreshThread()                ),                //默认30s刷新一次                registryFetchIntervalSeconds, TimeUnit.SECONDS);    }    ......}/** * The task that fetches the registry information at specified intervals. * */class CacheRefreshThread implements Runnable {    public void run() {        //刷新注册信息        refreshRegistry();    }}public class DefaultEurekaClientConfig implements EurekaClientConfig {  @Override    public int getRegistryFetchIntervalSeconds() {        return configInstance.getIntProperty(                namespace + REGISTRY_REFRESH_INTERVAL_KEY, 30).get();    }}

    3. 查看方法refreshRegistry源码

@VisibleForTestingvoid refreshRegistry() {    try {        boolean isFetchingRemoteRegionRegistries = isFetchingRemoteRegionRegistries();        boolean remoteRegionsModified = false;        // This makes sure that a dynamic change to remote regions to fetch is honored.        String latestRemoteRegions = clientConfig.fetchRegistryForRemoteRegions();        if (null != latestRemoteRegions) {            String currentRemoteRegions = remoteRegionsToFetch.get();            if (!latestRemoteRegions.equals(currentRemoteRegions)) {                // Both remoteRegionsToFetch and AzToRegionMapper.regionsToFetch need to be in sync                synchronized (instanceRegionChecker.getAzToRegionMapper()) {                    if (remoteRegionsToFetch.compareAndSet(currentRemoteRegions, latestRemoteRegions)) {                        String[] remoteRegions = latestRemoteRegions.split(",");                        remoteRegionsRef.set(remoteRegions);                        instanceRegionChecker.getAzToRegionMapper().setRegionsToFetch(remoteRegions);                        remoteRegionsModified = true;                    } else {                        logger.info("Remote regions to fetch modified concurrently," +                                " ignoring change from {} to {}", currentRemoteRegions, latestRemoteRegions);                    }                }            } else {                // Just refresh mapping to reflect any DNS/Property change                instanceRegionChecker.getAzToRegionMapper().refreshMapping();            }        }        //获取注册信息,并更新本地缓存数据        boolean success = fetchRegistry(remoteRegionsModified);        if (success) {            registrySize = localRegionApps.get().size();            lastSuccessfulRegistryFetchTimestamp = System.currentTimeMillis();        }        .......    } catch (Throwable e) {        logger.error("Cannot fetch registry from server", e);    }        }

    4. 查看方法fetchRegistry源码

/** * Fetches the registry information. * *  * This method tries to get only deltas after the first fetch unless there * is an issue in reconciling eureka server and client registry information. *  * * @param forceFullRegistryFetch Forces a full registry fetch. * * @return true if the registry was fetched */private boolean fetchRegistry(boolean forceFullRegistryFetch) {    Stopwatch tracer = FETCH_REGISTRY_TIMER.start();    try {        // If the delta is disabled or if it is the first time, get all        // applications        Applications applications = getApplications();        if (clientConfig.shouldDisableDelta()                || (!Strings.isNullOrEmpty(clientConfig.getRegistryRefreshSingleVipAddress()))                || forceFullRegistryFetch                || (applications == null)                || (applications.getRegisteredApplications().size() == 0)                || (applications.getVersion() == -1)) //Client application does not have latest library supporting delta        {            logger.info("Disable delta property : {}", clientConfig.shouldDisableDelta());            logger.info("Single vip registry refresh property : {}", clientConfig.getRegistryRefreshSingleVipAddress());            logger.info("Force full registry fetch : {}", forceFullRegistryFetch);            logger.info("Application is null : {}", (applications == null));            logger.info("Registered Applications size is zero : {}",                    (applications.getRegisteredApplications().size() == 0));            logger.info("Application version is -1: {}", (applications.getVersion() == -1));            //第一次注册时,获取全部注册信息并保存到本地缓存中            getAndStoreFullRegistry();        } else {            //获取并更新所有服务注册的信息            getAndUpdateDelta(applications);        }        applications.setAppsHashCode(applications.getReconcileHashCode());        logTotalInstances();    } catch (Throwable e) {        logger.error(PREFIX + appPathIdentifier + " - was unable to refresh its cache! status = " + e.getMessage(), e);        return false;    } finally {        if (tracer != null) {            tracer.stop();        }    }    // Notify about cache refresh before updating the instance remote status    onCacheRefreshed();    // Update remote status based on refreshed data held in the cache    updateInstanceRemoteStatus();    // registry was fetched successfully, so return true    return true;}

      5. 查看方法getAndUpdateDelta源码

/** * Get the delta registry information from the eureka server and update it locally. * When applying the delta, the following flow is observed: * * if (update generation have not advanced (due to another thread)) *   atomically try to: update application with the delta and get reconcileHashCode *   abort entire processing otherwise *   do reconciliation if reconcileHashCode clash * fi * * @return the client response * @throws Throwable on error */private void getAndUpdateDelta(Applications applications) throws Throwable {    long currentUpdateGeneration = fetchRegistryGeneration.get();    Applications delta = null;    //调接口(serviceUrl/apps/delta),查询Eureka上delta上的服务列表    EurekaHttpResponse httpResponse = eurekaTransport.queryClient.getDelta(remoteRegionsRef.get());    if (httpResponse.getStatusCode() == Status.OK.getStatusCode()) {        delta = httpResponse.getEntity();    }    if (delta == null) {        logger.warn("The server does not allow the delta revision to be applied because it is not safe. "                + "Hence got the full registry.");        //如果调delta接口没获取服务列表,则调接口(serviceUrl/apps)获取所有注册的服务列表信息        getAndStoreFullRegistry();    } else if (fetchRegistryGeneration.compareAndSet(currentUpdateGeneration, currentUpdateGeneration + 1)) {        logger.debug("Got delta update with apps hashcode {}", delta.getAppsHashCode());        String reconcileHashCode = "";        if (fetchRegistryUpdateLock.tryLock()) {            try {                //根据获取d饿delta,跟新本地缓存信息                updateDelta(delta);                reconcileHashCode = getReconcileHashCode(applications);            } finally {                fetchRegistryUpdateLock.unlock();            }        } else {            logger.warn("Cannot acquire update lock, aborting getAndUpdateDelta");        }        // There is a diff in number of instances for some reason        if (!reconcileHashCode.equals(delta.getAppsHashCode()) || clientConfig.shouldLogDeltaDiff()) {            reconcileAndLogDifference(delta, reconcileHashCode);  // this makes a remoteCall        }    } else {        logger.warn("Not updating application delta as another thread is updating it already");        logger.debug("Ignoring delta update with apps hashcode {}, as another thread is updating it already", delta.getAppsHashCode());    }}

分析到这里已经看到了很多代码,这是客户端更新服务列表的核心流程,下面的方法是核心中的核心。查询服务列表更新本地缓存。

    6. 查看方法updateDelta源码

/** * Updates the delta information fetches from the eureka server into the * local cache. * * @param delta *            the delta information received from eureka server in the last *            poll cycle. */private void updateDelta(Applications delta) {    int deltaCount = 0;    for (Application app : delta.getRegisteredApplications()) {        for (InstanceInfo instance : app.getInstances()) {            Applications applications = getApplications();            String instanceRegion = instanceRegionChecker.getInstanceRegion(instance);            if (!instanceRegionChecker.isLocalRegion(instanceRegion)) {                Applications remoteApps = remoteRegionVsApps.get(instanceRegion);                if (null == remoteApps) {                    remoteApps = new Applications();                    remoteRegionVsApps.put(instanceRegion, remoteApps);                }                applications = remoteApps;            }            ......        }    }}

源码分析到这里,终于找到了根本原因。因为eureka中,有个服务注册了1万+的实例,导致定时遍历所有服务的实例时,耗时过长且占用内存过大,才导致了频繁的yonggc和fullgc。

总结


定位问题时,不能一直盯着刚上线的代码去排查问题,有时候因为某个中间件也会导致业务系统资源占用过高。

  • 仔细观察问题现象

  • top查看进程信息

  • jstack查看栈信息

  • jmap dump堆信息,通过MAT分析堆内大对象

  • 跟踪源码,定位根本原因

  • 出现问题后的措施,避免出现同类问题

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值