1、简介
最近负责的一个平台项目,主要功能是使用模型打分,最后排序返回,最近响应时间经常出问题,有些机器在早上会出现响应时间变长,没有查到原因,现在将整体现象和部分逻辑功能描述一下,希望各路大神帮忙分析。
平台现状:
机器配置:5台机器,机器物理内存128g,jvm配置 32g,新生代7.5g,gc使用cms
Java HotSpot(TM) 64-Bit Server VM (25.102-b14) for linux-amd64 JRE (1.8.0_102-b14), built on Jun 22 2016 18:43:17 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
Memory: 4k page, physical 132244088k(35649492k free), swap 33554428k(33529800k free)
jvm部分参数配置:
-XX:+AlwaysPreTouch -XX:AutoBoxCacheMax=20000 -XX:CMSInitiatingOccupancyFraction=75 -XX:+CMSParallelInitialMarkEnabled -XX:ErrorFile=/apps/logs/osp/hs_err_%p.log -XX:+ExplicitGCInvokesConcurrent -XX:+GCLockerInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/apps/logs/osp/ -XX:InitialHeapSize=34359738368 -XX:InitialTenuringThreshold=4 -XX:+ManagementServer -XX:MaxDirectMemorySize=2147483648 -XX:MaxHeapSize=34359738368 -XX:MaxMetaspaceSize=536870912 -XX:MaxTenuringThreshold=4 -XX:MetaspaceSize=536870912 -XX:NewRatio=3 -XX:OldPLABSize=16 -XX:-OmitStackTraceInFastThrow -XX:ParGCCardsPerStrideChunk=1024 -XX:+ParallelRefProcEnabled -XX:+PerfDisableSharedMem -XX:+PrintCommandLineFlags -XX:+PrintGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:ReservedCodeCacheSize=251658240 -XX:-TieredCompilation -XX:+UnlockDiagnosticVMOptions -XX:-UseBiasedLocking -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:-UseCounterDecay -XX:+UseGCTaskAffinity -XX:+UseParNewGC
业务:
平台请求峰值在50/s,峰值请求时段ygc平均25s一次,加载的模型文件存储在堆内hashmap,存储空间大约占用15g(模型文件本身很大,最大1.1g),每天晚上定时更新 模型文件,最大的模型文件大约在凌晨4点更新,更新这儿因为之前业务会将模型文件县加载到ArrayList,然后再封装成hashmap,相当于模型文件会存储两份,等到加载完会清除掉ArrayList, 只保留hashmap。
问题描述:
五台机器不固定出现响应慢的情况,今天可能是1、2,重启后可能是2、3
每天凌晨4点更新大模型文件时,会触发cms gc,因为模型太大,导致concurrent mode failure,这时候会继续触发cms,经历两三次concurrent mode failure,最后cms成功回收掉内存,但是接着以后有问题的机器的ygc 变了,原来eden区百分之百,新生代89%就回收
ParNew: 6728007K->16851K(7549760K), 0.0127716 secs]
之后变成新生代100% 才回收:
ParNew: 7549759K->22038K(7549760K), 0.0131831 secs
没有问题的机器,ygc回收依然跟之前一样。这中间 不正常的机器是通过 cms 回收掉内存,正常的机器在最后触发了一次 fullgc 回收掉内存。
接着分析,发现服务响应时间变长是在更新完大模型文件后,最开始响应时间有10ms变成40ms,慢慢的变成1s甚至3s,而且不是 随机的,是在每隔一段时间,后来发现每次慢响应结束的时间跟 ygc结束的时间大致一致,响应开始慢都是在ygc前几秒,但是 ygc 清理时间没有问题。
以10:00:15 的ygc 为例
[GC (Allocation Failure) 2018-08-07T10:00:15.745+0800: 58657.032: [ParNew: 7549759K->41540K(7549760K), 0.0140605 secs] 24027593K->16519445K(32715584K), 0.0142925 secs] [Times: user=0.20 sys=0.00, real=0.02 secs]
对应的响应时间:
[2018-08-07 10:00:15.793] [thread-165] [-8288830327696948464 10.200.52.121 10.200.52.121:40412 10.200.43.89:8080 43910 *.rerank()_1.0.0 200 0 28669 28329 0ms 0ms 1ms 27ms 67ms]
[2018-08-07 10:00:15.794] [thread-96] [-8357411984988133029 10.214.11.166 10.214.11.71:57478 10.200.43.89:8080 35074 *.rerank()_1.0.0 200 0 28713 28357 2ms 0ms 0ms 2079ms 2081ms]
[2018-08-07 10:00:15.795] [thread-115] [-9109509378261846079 10.208.40.79 10.208.40.79:48106 10.200.43.89:8080 23738 *.rerank()_1.0.0 200 0 28631 28330 3ms 0ms 0ms 89ms 236ms]
[2018-08-07 10:00:15.796] [thread-145] [-8323442836846079900 10.200.52.117 10.200.52.117:44302 10.200.43.89:8080 43418 *.rerank()_1.0.0 200 0 28675 28330 0ms 0ms 0ms 34ms 120ms]
[2018-08-07 10:00:15.797] [thread-160] [-8357867963540901220 10.200.52.121 10.200.52.121:39612 10.200.43.89:8080 44120 *.rerank()_1.0.0 200 0 40556 40233 1ms 0ms 0ms 36ms 89ms]
[2018-08-07 10:00:15.799] [thread-68] [-8292392222823446982 10.200.52.121 10.200.52.121:39612 10.200.43.89:8080 44113 *.rerank()_1.0.0 200 0 56546 56231 0ms 0ms 0ms 2437ms 2661ms]
[2018-08-07 10:00:15.800] [thread-184] [-8305021004930562527 10.200.52.119 10.200.52.119:19262 10.200.43.89:8080 9622 *.rerank()_1.0.0 200 0 28646 28330 5ms 0ms 1ms 187ms 238ms]
[2018-08-07 10:00:15.801] [thread-174] [-8338195371701998057 10.200.52.125 10.200.52.125:10554 10.200.43.89:8080 43726 *.rerank()_1.0.0 200 0 2517 2195 0ms 0ms 0ms 3ms 4ms]
[2018-08-07 10:00:15.801] [thread-158] [-8295171650903359196 10.208.40.84 10.208.40.84:57058 10.200.43.89:8080 4841 *.rerank()_1.0.0 200 0 56512 56231 0ms 0ms 0ms 2537ms 2768ms]
[2018-08-07 10:00:15.801] [thread-46] [-9129656671070767850 10.200.52.126 10.200.52.126:42266 10.200.43.89:8080 44160 *.rerank()_1.0.0 200 0 54028 53719 2ms 0ms 0ms 250ms 584ms]
[2018-08-07 10:00:15.802] [thread-16] [-8346462911910830241 10.208.40.76 10.208.40.76:63356 10.200.43.89:8080 23666 *.rerank()_1.0.0 200 0 56561 56229 0ms 0ms 3ms 39ms 77ms]
[2018-08-07 10:00:15.807] [thread-140] [-8296785686930875249 10.200.52.117 10.200.52.117:45540 10.200.43.89:8080 44073 *.rerank()_1.0.0 200 0 56557 56230 6ms 0ms 0ms 46ms 134ms]
[2018-08-07 10:00:15.809] [thread-62] [-8330646766934052660 10.208.40.82 10.208.40.82:28314 10.200.43.89:8080 23539 *.rerank()_1.0.0 200 0 42622 42279 2ms 0ms 2ms 31ms 98ms]
[2018-08-07 10:00:15.816] [thread-9] [-8295019583619171506 10.200.52.117 10.200.52.117:45540 10.200.43.89:8080 44072 *.rerank()_1.0.0 200 0 42626 42280 2ms 0ms 1ms 55ms 218ms]
[2018-08-07 10:00:15.827] [thread-178] [-8305067902129936638 10.208.40.83 10.208.40.83:15934 10.200.43.89:8080 23693 *.rerank()_1.0.0 200 0 28653 28329 0ms 0ms 0ms 20ms 20ms]
[2018-08-07 10:00:15.872] [thread-122] [-8341096053262505245 10.214.1.6 10.214.1.37:58726 10.200.43.89:8080 35194 *.rerank()_1.0.0 200 0 7657 7336 0ms 0ms 0ms 7ms 7ms]
[2018-08-07 10:00:15.889] [thread-34] [-9100444074134353504 10.214.27.245 10.214.27.122:37014 10.200.43.89:8080 34898 *.rerank()_1.0.0 200 0 910 732 0ms 0ms 0ms 1ms 1ms]
[2018-08-07 10:00:15.921] [thread-50] [-8294018645016351829 10.208.40.81 10.208.40.81:22062 10.200.43.89:8080 23516 *.rerank()_1.0.0 200 0 28652 28329 1ms 0ms 0ms 14ms 15ms]
[2018-08-07 10:00:15.957] [thread-119] [-9132118080387802790 10.200.52.122 10.200.52.122:61174 10.200.43.89:8080 43952 *.rerank()_1.0.0 200 0 3008 2660 0ms 0ms 0ms 3ms 4ms]
[2018-08-07 10:00:15.967] [thread-167] [-8290516706153420296 10.200.52.122 10.200.52.122:61174 10.200.43.89:8080 43951 *.rerank()_1.0.0 200 0 42595 42279 0ms 0ms 0ms 29ms 30ms]
[2018-08-07 10:00:15.976] [thread-21] [-8357788417187591931 10.208.40.84 10.208.40.84:57058 10.200.43.89:8080 4843 *.rerank()_1.0.0 200 0 28678 28328 0ms 0ms 0ms 14ms 14ms]
[2018-08-07 10:00:16.047] [thread-1] [-8301189045126531877 10.214.27.245 10.214.27.122:43674 10.200.43.89:8080 34860 *.rerank()_1.0.0 200 0 28693 28355 0ms 0ms 1ms 11ms 13ms]
[2018-08-07 10:00:16.061] [thread-124] [-8289545017613729362 10.200.52.119 10.200.52.119:19262 10.200.43.89:8080 9623 *.rerank()_1.0.0 200 0 42596 42278 0ms 0ms 0ms 15ms 17ms]
[2018-08-07 10:00:16.083] [thread-78] [-8331749461723576302 10.200.52.122 10.200.52.122:61174 10.200.43.89:8080 43953 *.rerank()_1.0.0 200 0 28652 28327 0ms 0ms 0ms 9ms 9ms]
[2018-08-07 10:00:16.089] [thread-97] [-8355833413237133244 10.200.52.119 10.200.52.119:65224 10.200.43.89:8080 44185 *.rerank()_1.0.0 200 0 4775 4427 0ms 0ms 0ms 7ms 8ms]
[2018-08-07 10:00:16.132] [thread-84] [-8298776592520918742 10.214.11.167 10.214.11.22:45330 10.200.43.89:8080 35016 *.rerank()_1.0.0 200 0 28671 28355 0ms 0ms 0ms 22ms 23ms]
[2018-08-07 10:00:16.140] [thread-112] [-8355451418324574725 10.214.11.166 10.214.11.71:52898 10.200.43.89:8080 35695 *.rerank()_1.0.0 200 0 13803 13471 0ms 0ms 0ms 14ms 14ms]
[2018-08-07 10:00:16.176] [thread-181] [-8317321704782385856 10.208.40.83 10.208.40.83:14200 10.200.43.89:8080 23768 *.rerank()_1.0.0 200 0 37209 36884 0ms 0ms 0ms 18ms 18ms]
[2018-08-07 10:00:16.183] [thread-134] [-8329585309970522186 10.214.25.50 10.214.25.176:46984 10.200.43.89:8080 582 *.rerank()_1.0.0 200 0 2539 2232 0ms 0ms 0ms 5ms 5ms]
对应 ygc 时间 0.0142925 secs
同时,使用 jstat jcutil 一秒监控一次的数据:
非正常机器(Eden区满了之后需要四秒才做回收):
S S E O M CCS YGC YGCT FGC FGCT GCT
3.51 3.51 55.88 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 59.35 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 64.84 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 68.40 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 72.20 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 74.82 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 78.40 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 83.64 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 87.87 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 91.42 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 96.51 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
3.51 3.51 99.66 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
21.72 21.85 100.00 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
38.89 39.05 100.00 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
57.46 57.68 100.00 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
78.06 78.38 100.00 65.56 16.79 0 1520 81.162 14 331.849 413.01099999999997
4.06 4.06 3.11 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 7.98 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 11.77 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 15.31 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 18.46 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 22.79 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 26.79 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 31.30 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 34.87 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 38.15 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 41.82 65.56 16.79 0 1521 81.176 14 331.849 413.025
4.06 4.06 45.41 65.56 16.79 0 1521 81.176 14 331.849 413.025
正常机器(Eden区满了立马回收):
S S E O M CCS YGC YGCT FGC FGCT GCT
0.34 0.34 70.50 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 72.48 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 75.19 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 79.18 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 83.25 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 86.90 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 90.91 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 94.94 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.34 0.34 98.14 65.49 16.79 0 1552 70.31 12 384.598 454.908
0.82 0.82 1.42 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 4.85 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 8.16 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 12.08 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 16.31 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 21.45 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 26.60 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 29.62 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 33.99 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 37.71 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 41.35 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 46.32 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 49.87 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 53.91 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 58.62 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 61.42 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 66.22 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 70.84 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 73.24 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 77.48 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 81.86 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 86.28 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 90.45 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 94.98 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.82 0.82 98.17 65.49 16.79 0 1553 70.322 12 384.598 454.92
0.61 0.61 1.25 65.49 16.79 0 1554 70.334 12 384.598 454.932
0.61 0.61 4.43 65.49 16.79 0 1554 70.334 12 384.598 454.932
0.61 0.61 8.14 65.49 16.79 0 1554 70.334 12 384.598 454.932
0.61 0.61 11.87 65.49 16.79 0 1554 70.334 12 384.598 454.932
0.61 0.61 16.30 65.49 16.79 0 1554 70.334 12 384.598 454.932
机器 load:
具体问题就是 上面描述的,请各路大神帮忙分析,附上 gc日志:
链接: https://pan.baidu.com/s/1Nd1EE7F91vZl7VisOYxv0Q 密码: 8v95