项目场景:
记录实际JVM调优过程及遇到的问题
问题描述
提示:这里描述项目中遇到的问题:
项目在线上运行发生崩溃,重启后发现Java进程的资源占用很高,其中RES和MEM在不断地增长,到一定程度Linux杀掉了进程。
查看被Linux杀掉了的进程,使用 egrep -i -r ‘killed process’ /var/log 或者 grep “Out of memory” /var/log/messages 都行
[root@ecs-c6-xlarge-2-linux-20210731223134 ~]# egrep -i -r 'killed process' /var/log /var/log/messages:Mar 21 10:43:25 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 11439 (java), UID 0, total-vm:18567668kB, anon-rss:5257924kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 21 21:56:19 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 18330 (java), UID 0, total-vm:16332012kB, anon-rss:5315660kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 22 09:17:34 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 3539 (java), UID 0, total-vm:17391912kB, anon-rss:5355368kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 22 20:50:12 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 28530 (java), UID 0, total-vm:15443512kB, anon-rss:5347832kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 22 22:30:10 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 27439 (java), UID 0, total-vm:13650416kB, anon-rss:5012252kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 23 03:07:26 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 5462 (java), UID 0, total-vm:16265528kB, anon-rss:4895452kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 23 09:32:20 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 15696 (java), UID 0, total-vm:16963300kB, anon-rss:5275868kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 24 05:00:24 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 3635 (java), UID 0, total-vm:18286524kB, anon-rss:4959324kB, file-rss:0kB, shmem-rss:0kB
/var/log/messages:Mar 24 09:16:54 ecs-c6-xlarge-2-linux-20210731223134 kernel: Killed process 15280 (java), UID 0, total-vm:17875512kB, anon-rss:5161304kB, file-rss:0kB, shmem-rss:0kB
通过top -i命令看到内存占用情况
原因分析:
根据常用的分析手段对Java进程的资源进行分析
1.使用jstat -gc PID 5000,对Java进程的垃圾回收情况每隔5秒进行输出一次
jstat -gc 17729 5000
输出结果如下:
[root@ecs-c6-xlarge-2-linux-20210731223134 ~]# jstat -gc 17729 5000
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
12800.0 18432.0 12800.0 0.0 1011712.0 172034.5 3145728.0 534211.5 99968.0 96495.5 11392.0 10831.6 258 2.382 0 0.000 2.382
12800.0 18432.0 12800.0 0.0 1011712.0 244474.2 3145728.0 534211.5 99968.0 96495.5 11392.0 10831.6 258 2.382 0 0.000 2.382
12800.0 18432.0 12800.0 0.0 1011712.0 325510.9 3145728.0 534211.5 99968.0 96495.5 11392.0 10831.6 258 2.382 0 0.000 2.382
以上输出内容可以看到,YGC、FGC的次数和回收时间。
2.观察具体的内存回收情况,使用jmap -heap PID,输出内容如下:
在这里,直观地看到新生代(Eden,From,To)和老年代(OG)区的内存大小及使用情况。
3.再通过dump当时进程的日志记录,使用MAT工具对其进行内存分析jmap -dump:format=b,file=dumpFileName pid
jmap -dump:format=b,file=/root/dump.hprof 17729
打开MAT工具,对dump下来的文件进行内存分析,找到其具体占用内存情况。
一套流程整下来,基本上能确定可能出现OOM的点。
解决方案:
1.通过MAT去分析具体内存占用情况,针对可能出现内存泄漏的地方手动进行null值,让其可以被下一次垃圾回收;
2.观察可以看到,新生代占用3G,老年代占用了1G。在项目中,有几个地方会一次性获取到很大的对象,而垃圾收集器会把大的对象从Eden区直接放进老年代中,在年轻代Eden区的大小有2G多,通产情况下,对象的创建和回收在新生代的操作是比较快速的,YGC的回收效率也会比较高。当FGC频繁时,STW时间边长,CPU占用过高。
此时的JVM启动参数配置为:
JAVA_OPTS="-Xms4096M -Xmx4096M -Xmn3072M -Xss1M -XX:MetaspaceSize=256M -XX:MaxMetaspaceSize=256M -XX:MaxDirectMemorySize=512M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/javaproject/oom"
设置-Xmn的值为1024M,整个堆的大小为4096M,当设置了-Xmn的值为1024M时,即年轻代为1024M,老年代就为3072M左右,调优后,观察堆栈信息以及内存占用情况如下。
[root@ecs-c6-xlarge-2-linux-20210731223134 ~]# jmap -heap 17729
Attaching to process ID 17729, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.312-b07
using thread-local object allocation.
Parallel GC with 4 thread(s)
Heap Configuration:
MinHeapFreeRatio = 0
MaxHeapFreeRatio = 100
MaxHeapSize = 4294967296 (4096.0MB)
NewSize = 1073741824 (1024.0MB)
MaxNewSize = 1073741824 (1024.0MB)
OldSize = 3221225472 (3072.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 268435456 (256.0MB)
CompressedClassSpaceSize = 260046848 (248.0MB)
MaxMetaspaceSize = 268435456 (256.0MB)
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
PS Young Generation
Eden Space:
capacity = 1038614528 (990.5MB)
used = 49375128 (47.087791442871094MB)
free = 989239400 (943.4122085571289MB)
4.753941589386279% used
From Space:
capacity = 17825792 (17.0MB)
used = 3768320 (3.59375MB)
free = 14057472 (13.40625MB)
21.139705882352942% used
To Space:
capacity = 17301504 (16.5MB)
used = 0 (0.0MB)
free = 17301504 (16.5MB)
0.0% used
PS Old Generation
capacity = 3221225472 (3072.0MB)
used = 548238792 (522.8412551879883MB)
free = 2672986680 (2549.1587448120117MB)
17.019572108983994% used
50277 interned Strings occupying 5419328 bytes.
后续,调优后,一开始表现较好,后来发现VIRTUAL升高,一查看发现老年代中的占用资源较高,如下:
针对这个问题,调大堆内存为7G,增大年轻代为2G,这样子老年代为5G,引入CMS收集器,修改对象在新生代经历的回收次数为15次(Parallel Scavenge中默认值为15,CMS中默认值为6,G1中默认值为15)。
JAVA_OPTS="-Xms7168M -Xmx7168M -Xmn2048M -Xss1M -XX:MetaspaceSize=512M -XX:MaxMetaspaceSize=512M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=15 -XX:HeapDumpPath=/usr/local/javaproject/oom"
此时,项目进程中的CPU占用,堆使用情况,垃圾回收情况如下:
堆栈使用情况
[root@ecs-c6-xlarge-2-linux-20210731223134 bin]# jmap -heap 25404
Attaching to process ID 25404, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.312-b07
using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC
Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize = 7516192768 (7168.0MB)
NewSize = 2147483648 (2048.0MB)
MaxNewSize = 2147483648 (2048.0MB)
OldSize = 5368709120 (5120.0MB)
NewRatio = 2
SurvivorRatio = 8
MetaspaceSize = 536870912 (512.0MB)
CompressedClassSpaceSize = 528482304 (504.0MB)
MaxMetaspaceSize = 536870912 (512.0MB)
G1HeapRegionSize = 0 (0.0MB)
Heap Usage:
New Generation (Eden + 1 Survivor Space):
capacity = 1932787712 (1843.25MB)
used = 1614538248 (1539.7436599731445MB)
free = 318249464 (303.50634002685547MB)
83.53417387620478% used
Eden Space:
capacity = 1718091776 (1638.5MB)
used = 1601768232 (1527.5652236938477MB)
free = 116323544 (110.93477630615234MB)
93.22949183361902% used
From Space:
capacity = 214695936 (204.75MB)
used = 12770016 (12.178436279296875MB)
free = 201925920 (192.57156372070312MB)
5.94795422676282% used
To Space:
capacity = 214695936 (204.75MB)
used = 0 (0.0MB)
free = 214695936 (204.75MB)
0.0% used
concurrent mark-sweep generation:
capacity = 5368709120 (5120.0MB)
used = 92828240 (88.52790832519531MB)
free = 5275880880 (5031.472091674805MB)
1.729060709476471% used
53021 interned Strings occupying 5857896 bytes.
top -i查看进程情况
GC回收情况
[root@ecs-c6-xlarge-2-linux-20210731223134 bin]# jstat -gc 25404 5000
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
209664.0 209664.0 17859.9 0.0 1677824.0 969037.6 5242880.0 90652.9 98816.0 95939.1 11264.0 10765.5 230 4.016 0 0.000 4.016
209664.0 209664.0 17859.9 0.0 1677824.0 1048582.4 5242880.0 90652.9 98816.0 95939.1 11264.0 10765.5 230 4.016 0 0.000 4.016
209664.0 209664.0 17859.9 0.0 1677824.0 1126707.5 5242880.0 90652.9 98816.0 95939.1 11264.0 10765.5 230 4.016 0 0.000 4.016
目前,系统稳定运行,继续观察。