线上服务OOM问题排查

 

背景简介

风控服务有两个node,通过 SLB(负载均衡)对外提供服务。

其中 node2 比 node1 多了一个 scheduled task:定时统计数据,由于很多数据库只有在运行时才知道,所以有个动态连接数据库模块

 

问题描述

node2 服务异常,日志报错:

java.lang.OutOfMemoryError: Java heap space

(对外风控服务正常,因为node1服务正常,负载均衡服务会自动剔除异常节点)

 

排查过程

由于 JVM的启动参数没有显示指定,所以查看 JVM的默认启动参数,如下可见 heap的最大值为 4206886912(4G)

# jcmd 1 VM.flags 
1: 
-XX:CICompilerCount=3 
-XX:InitialHeapSize=264241152 
-XX:MaxHeapSize=4206886912 
-XX:MaxNewSize=1401946112 
-XX:MinHeapDeltaBytes=524288 
-XX:NewSize=88080384 
-XX:OldSize=176160768 
-XX:+UseCompressedClassPointers 
-XX:+UseCompressedOops 
-XX:+UseParallelGC

猜测是因为 MaxHeapSize 设置太大了,没有及时触发 gc,导致服务器内存耗尽。

所以,加上了启动参数,指定堆最大为2048m:

-Xms512m -Xmx2048m -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

继续观察,发现 node2的风控进程内存消耗依然在不断增大,几天后占用到 1.961GiB(node1的风控进程内存长期在 700~800M)

CONTAINER ID   NAME                              CPU %   MEM USAGE / LIMIT   MEM % 
  NET I/O   BLOCK I/O   PIDS 
7396a4269089   riskcontrolmsprod2_risk_control_1 2.88%   1.961GiB / 15.66GiB  12.52%   164MB / 68.9MB   0B / 0B   58

此时,查看 host的物理内存,还剩余 3.1G。

               total        used        free      shared  buff/cache   available
Mem:            15G        3.9G        3.1G        7.2M        8.7G         11G
Swap:          947M          0B        947M

由于 node1没有异常,所以推测是 node2的 scheduled task导致的,持续产生无法被 gc 清理的对象。猜测问题区域为 动态连接数据库模块。

 

第一步,验证问题所在区域。

在本地,启动两个程序,第一个和线上一样的jar(将调用动态连接的频率增大,便于较快复现问题),第二个去掉调用 动态连接数据库的代码。作对比。

同时将 -Xmx设为较小的128m,比便于观察gc情况

 

下图,使用了动态连接的jar 的运行情况分析:

1.运行几分钟后,java.lang.OutOfMemoryError: GC overhead limit exceeded

2.频繁触发 gc,但是堆的占用却没有释放,说明在持续地生成无法回收的对象

下图,使用了不使用动态连接的jar 的运行情况分析:

1.运行良好

2.每次触发 gc后堆都会大幅释放,且总体不会持续变大

 

结论:动态连接模块发生了内存泄漏。

 

第二步,查明是哪些对象持续占用内存,导致内存泄漏。

对比两个程序的各个对象的内存占用情况

 

下图,使用了动态连接的jar 的对象内存占用:

下图,不使用动态连接的jar 的对象内存占用:

对比,发现异常对象为 com.mysql.cj.util.LazyString

然后去代码中找与com.mysql.cj.util.LazyString相关的代码,找不到……,依然不知道是哪行代码的问题,也就依然不知道如何解决

然后,找 com.mysql.cj.util.LazyString被代码中的哪个对象引用了。

执行 堆 Dump,继续查看

发现 com.mysql.cj.util.LazyString 被 stmt 引用了。

然后到代码中找到

Statement stmt = conn.createStatement();

每次查询时,这行代码都会被执行,也就是数据库连接conn都会创建一个 Statement对象,随着次数的增加,Statement对象会越来越多。

为什么 Statement对象不会在 gc 时释放呢?

因为 Statement对象被数据库连接 conn 引用,而 conn被静态变量 map引用(map缓存每个数据库连接),所以,gc在可达性分析时发现 Statement对象是可达的,所以不会清理。

可达性分析路径:

    静态变量 map(GC Root)-> 数据库连接 conn -> 大量的Statement对象

 

第三步,解决。

每个数据库连接只创建一个 Statement

 

修复后的运行情况

1.运行良好

2.每次触发 gc后堆都会大幅释放,且总体不会持续变大

 

 

使用的工具:jvisualvm

 

查看 gc日志

# cat gc-prod.log
OpenJDK 64-Bit Server VM (25.265-b01) for linux-amd64 JRE (1.8.0_265-b01), built on Jul 27 2020 07:33:48 by "openjdk" with gcc 4.4.7 20120313 (Red Hat 4.4.7-23)
Memory: 4k page, physical 16425400k(4930808k free), swap 969964k(969964k free)
CommandLine flags: -XX:InitialHeapSize=536870912 -XX:MaxHeapSize=2116026368 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
2020-10-27T16:27:55.102+0800: 1.044: [GC (Allocation Failure) [PSYoungGen: 131584K->4090K(153088K)] 131584K->4098K(502784K), 0.0065510 secs] [Times: user=0.02 sys=0.00, real=0.00 secs]
2020-10-27T16:27:55.455+0800: 1.397: [GC (Allocation Failure) [PSYoungGen: 135674K->5309K(153088K)] 135682K->5389K(502784K), 0.0105793 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
2020-10-27T16:27:55.528+0800: 1.470: [GC (Metadata GC Threshold) [PSYoungGen: 31279K->5398K(153088K)] 31359K->5486K(502784K), 0.0070279 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2020-10-27T16:27:55.535+0800: 1.477: [Full GC (Metadata GC Threshold) [PSYoungGen: 5398K->0K(153088K)] [ParOldGen: 88K->5410K(349696K)] 5486K->5410K(502784K), [Metaspace: 20597K->20597K(1067008K)], 0.0274344 secs] [Times: user=0.06 sys=0.00, real=0.03 secs]
2020-10-27T16:27:56.205+0800: 2.147: [GC (Allocation Failure) [PSYoungGen: 131584K->3252K(222720K)] 136994K->8671K(572416K), 0.0065033 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-10-27T16:27:57.583+0800: 3.525: [GC (Allocation Failure) [PSYoungGen: 222388K->7668K(266752K)] 227807K->13095K(616448K), 0.0113554 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2020-10-27T16:27:57.827+0800: 3.769: [GC (Metadata GC Threshold) [PSYoungGen: 51598K->7732K(360960K)] 57025K->13167K(710656K), 0.0101993 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2020-10-27T16:27:57.837+0800: 3.779: [Full GC (Metadata GC Threshold) [PSYoungGen: 7732K->0K(360960K)] [ParOldGen: 5434K->12615K(349696K)] 13167K->12615K(710656K), [Metaspace: 34130K->34130K(1079296K)], 0.0720798 secs] [Times: user=0.22 sys=0.01, real=0.07 secs]
2020-10-27T16:28:00.093+0800: 6.035: [GC (Allocation Failure) [PSYoungGen: 352768K->9215K(361984K)] 365383K->23142K(711680K), 0.0177717 secs] [Times: user=0.06 sys=0.01, real=0.02 secs]
2020-10-27T16:28:01.315+0800: 7.257: [GC (Allocation Failure) [PSYoungGen: 361983K->11233K(541696K)] 375910K->30842K(891392K), 0.0272988 secs] [Times: user=0.10 sys=0.01, real=0.03 secs]
2020-10-27T16:28:01.491+0800: 7.433: [GC (Metadata GC Threshold) [PSYoungGen: 69202K->12289K(545792K)] 88812K->31906K(895488K), 0.0193421 secs] [Times: user=0.05 sys=0.01, real=0.01 secs]
2020-10-27T16:28:01.510+0800: 7.452: [Full GC (Metadata GC Threshold) [PSYoungGen: 12289K->0K(545792K)] [ParOldGen: 19617K->30235K(349696K)] 31906K->30235K(895488K), [Metaspace: 56322K->56322K(1101824K)], 0.2012301 secs] [Times: user=0.67 sys=0.02, real=0.21 secs]
2020-10-27T16:30:07.578+0800: 133.520: [GC (Allocation Failure) [PSYoungGen: 530432K->11277K(671232K)] 560667K->41520K(1020928K), 0.0149197 secs] [Times: user=0.04 sys=0.00, real=0.02 secs]
2020-10-27T17:13:10.257+0800: 2716.198: [GC (Heap Inspection Initiated GC) [PSYoungGen: 460748K->14291K(669184K)] 490991K->44542K(1018880K), 0.0171887 secs] [Times: user=0.04 sys=0.01, real=0.02 secs]
2020-10-27T17:13:10.274+0800: 2716.216: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 14291K->0K(669184K)] [ParOldGen: 30251K->34941K(433152K)] 44542K->34941K(1102336K), [Metaspace: 72016K->71616K(1116160K)], 0.2015170 secs] [Times: user=0.47 sys=0.02, real=0.20 secs]
2020-10-27T17:13:20.135+0800: 2726.077: [GC (Heap Inspection Initiated GC) [PSYoungGen: 29913K->160K(669184K)] 64855K->35101K(1102336K), 0.0046381 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2020-10-27T17:13:20.140+0800: 2726.082: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 160K->0K(669184K)] [ParOldGen: 34941K->32022K(513536K)] 35101K->32022K(1182720K), [Metaspace: 71618K->71618K(1116160K)], 0.1277683 secs] [Times: user=0.29 sys=0.00, real=0.12 secs]
2020-10-27T17:13:22.548+0800: 2728.490: [GC (Heap Inspection Initiated GC) [PSYoungGen: 13762K->64K(669696K)] 45785K->32094K(1183232K), 0.0044085 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2020-10-27T17:13:22.552+0800: 2728.494: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 64K->0K(669696K)] [ParOldGen: 32030K->31981K(649728K)] 32094K->31981K(1319424K), [Metaspace: 71618K->71618K(1116160K)], 0.2138676 secs] [Times: user=0.59 sys=0.01, real=0.21 secs]
2020-10-27T17:13:26.588+0800: 2732.530: [GC (Heap Inspection Initiated GC) [PSYoungGen: 15054K->96K(669696K)] 47035K->32085K(1319424K), 0.0035065 secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
2020-10-27T17:13:26.592+0800: 2732.534: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 96K->0K(669696K)] [ParOldGen: 31989K->31982K(787968K)] 32085K->31982K(1457664K), [Metaspace: 71618K->71618K(1116160K)], 0.0974356 secs] [Times: user=0.20 sys=0.00, real=0.09 secs]
2020-10-27T17:13:53.825+0800: 2759.767: [GC (Heap Inspection Initiated GC) [PSYoungGen: 28338K->128K(669696K)] 60320K->32110K(1457664K), 0.0033413 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2020-10-27T17:13:53.828+0800: 2759.770: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 128K->0K(669696K)] [ParOldGen: 31982K->31982K(958464K)] 32110K->31982K(1628160K), [Metaspace: 71619K->71619K(1116160K)], 0.0902424 secs] [Times: user=0.18 sys=0.00, real=0.10 secs]
2020-10-27T17:13:57.332+0800: 2763.274: [GC (Heap Inspection Initiated GC) [PSYoungGen: 12758K->64K(669696K)] 44741K->32046K(1628160K), 0.0039405 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
2020-10-27T17:13:57.336+0800: 2763.278: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 64K->0K(669696K)] [ParOldGen: 31982K->31982K(1142784K)] 32046K->31982K(1812480K), [Metaspace: 71620K->71620K(1116160K)], 0.0984356 secs] [Times: user=0.21 sys=0.01, real=0.10 secs]
2020-10-27T17:14:06.016+0800: 2771.958: [GC (Heap Inspection Initiated GC) [PSYoungGen: 25254K->64K(651264K)] 57237K->32054K(1794048K), 0.0036910 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
2020-10-27T17:14:06.019+0800: 2771.961: [Full GC (Heap Inspection Initiated GC) [PSYoungGen: 64K->0K(651264K)] [ParOldGen: 31990K->31982K(1368064K)] 32054K->31982K(2019328K), [Metaspace: 71624K->71624K(1116160K)], 0.1025445 secs] [Times: user=0.21 sys=0.00, real=0.10 secs]
2020-10-27T18:20:04.207+0800: 6730.149: [GC (Allocation Failure) [PSYoungGen: 650752K->6924K(669696K)] 682734K->38915K(2037760K), 0.0098047 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-10-27T19:25:03.402+0800: 10629.344: [GC (Allocation Failure) [PSYoungGen: 658188K->9645K(650752K)] 690179K->41643K(2018816K), 0.0123282 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-10-27T20:33:23.836+0800: 14729.778: [GC (Allocation Failure) [PSYoungGen: 650669K->10920K(642048K)] 682667K->42927K(2010112K), 0.0133321 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2020-10-27T21:45:02.518+0800: 19028.460: [GC (Allocation Failure) [PSYoungGen: 641704K->9307K(630784K)] 673711K->41322K(1998848K), 0.0120141 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-10-27T22:52:32.965+0800: 23078.907: [GC (Allocation Failure) [PSYoungGen: 630363K->9820K(621568K)] 662378K->41842K(1989632K), 0.0130400 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2020-10-28T00:00:02.493+0800: 27128.435: [GC (Allocation Failure) [PSYoungGen: 621148K->10973K(613376K)] 653170K->42995K(1981440K), 0.0145196 secs] [Times: user=0.04 sys=0.00, real=0.01 secs]
2020-10-28T01:05:08.470+0800: 31034.412: [GC (Allocation Failure) [PSYoungGen: 613085K->9341K(602624K)] 645107K->41372K(1970688K), 0.0119603 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2020-10-28T02:15:03.262+0800: 35229.203: [GC (Allocation Failure) [PSYoungGen: 602237K->9597K(593920K)] 634268K->41636K(1961984K), 0.0122601 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-10-28T03:20:03.667+0800: 39129.609: [GC (Allocation Failure) [PSYoungGen: 593789K->10242K(586240K)] 625828K->42288K(1954304K), 0.0131573 secs] [Times: user=0.03 sys=0.00, real=0.02 secs]
2020-10-28T04:25:03.441+0800: 43029.383: [GC (Allocation Failure) [PSYoungGen: 585730K->10045K(577536K)] 617776K->42100K(1945600K), 0.0128218 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2020-10-28T05:30:03.655+0800: 46929.596: [GC (Allocation Failure) [PSYoungGen: 577341K->9693K(568832K)] 609396K->41755K(1936896K), 0.0124312 secs] [Times: user=0.02 sys=0.01, real=0.01 secs]
2020-10-28T06:35:02.915+0800: 50828.857: [GC (Allocation Failure) [PSYoungGen: 568797K->9715K(560640K)] 600859K->41778K(1928704K), 0.0125140 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
2020-10-28T07:36:15.246+0800: 54501.188: [GC (Allocation Failure) [PSYoungGen: 560627K->9389K(552960K)] 592690K->41459K(1921024K), 0.0123026 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]

 

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值