Preface
The weekly routine release rolled around as usual. I opened the release system, familiar as ever, and opened the service monitoring dashboard alongside it, so everyone could see that I am a young man who takes his business seriously. Glug, glug, glug, a flurry of operations... wait, what is going on here? Why are there so many dubbo request timeouts to the downstream service? I glanced at my "Rolex", then back at the monitoring: 400+ request timeouts a minute, ten times the previous number, now up in the hundreds...
The story took place late this Tuesday night. We were about to ship a feature that would lengthen the IO time of individual threads in the service; that part was anticipated. During the canary rollout, I noticed the number of dubbo request timeouts to the downstream service had gone up. My first move was to check the monitoring of the underlying storage, but its p999 latency was perfectly normal. With years of experience behind me, a voice in my head said: the service's GC must be acting up.
(The request timeout count on the right is 10x the previous level; the monitoring data is redacted because it involves confidential company business.)
Then I looked at the service's GC metrics, and sure enough, the time per GC had increased quite a bit.
I immediately reached out to the downstream service's owners, who told me this level of timeouts was within the acceptable range, and I finally let out a sigh of relief and went off to handle other work. I may have looked calm the whole time, but inside I had been panicking.
At this point the clever reader might say: if all else fails, just scale out. But am I the type to give in that easily? Isn't this exactly the moment to test my inner kung fu? I may spend my days doing CRUD, but what programmer doesn't love digging into the tech underneath? That is every programmer's idea of romance! So once the work on my plate was done, I started on GC tuning.
The tuning process
- Dump the GC logs. Add the -XX:+PrintGCDetails flag to the JVM startup parameters so the timing details of every GC get written out (see the first sketch after this list).
- Analyze the logs to pin down which GC phases account for the rise in GC time (a helper sketch for this also follows the list).
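
For the first step, a minimal sketch of what the launch command might look like, using JDK 8 era flags that match the log format shown below; the log path, rotation settings, and app.jar are illustrative assumptions, not values from the incident:

    java -XX:+UseG1GC \
         -XX:+PrintGCDetails \
         -XX:+PrintGCDateStamps \
         -Xloggc:/data/logs/gc.log \
         -XX:+UseGCLogFileRotation \
         -XX:NumberOfGCLogFiles=5 \
         -XX:GCLogFileSize=20M \
         -jar app.jar

-XX:+PrintGCDateStamps adds wall-clock timestamps, which makes it easier to line the GC log up with the timeout spikes on the monitoring dashboard.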
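For the second step, eyeballing a single entry works, but comparing phase timings before and after a release is easier with a small script. Below is a minimal sketch in Java (not the tooling actually used), assuming the JDK 8 PrintGCDetails format shown underneath and a hypothetical gc.log path; it aggregates the Avg value of every parallel-phase line:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class GcPhaseStats {
        // Matches parallel-phase lines such as
        // "[Object Copy (ms): Min: 18.3, Avg: 19.0, Max: 19.6, ...]"
        private static final Pattern PHASE =
                Pattern.compile("\\[([A-Za-z ]+) \\(ms\\):.*?Avg: ([0-9.]+)");

        public static void main(String[] args) throws IOException {
            String path = args.length > 0 ? args[0] : "gc.log"; // hypothetical default
            Map<String, double[]> stats = new HashMap<>();      // phase -> {sum, count, max}
            for (String line : Files.readAllLines(Paths.get(path))) {
                Matcher m = PHASE.matcher(line);
                if (!m.find()) continue;
                double avg = Double.parseDouble(m.group(2));
                double[] s = stats.computeIfAbsent(m.group(1).trim(), k -> new double[3]);
                s[0] += avg;
                s[1]++;
                s[2] = Math.max(s[2], avg);
            }
            // Print each phase's event count, mean of the per-GC Avg, and worst Avg.
            stats.forEach((name, s) -> System.out.printf(
                    "%-22s events=%.0f mean(avg)=%.2fms max(avg)=%.2fms%n",
                    name, s[1], s[0] / s[1], s[2]));
        }
    }

A jump in one phase's mean between two log files points straight at the culprit, instead of scrolling through thousands of pause entries by hand.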
The GC log from the time:
[GC pause (G1 Evacuation Pause) (young), 0.0962103 secs]
   [Parallel Time: 23.3 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 146441.2, Avg: 146441.3, Max: 146441.3, Diff: 0.1]
      [Ext Root Scanning (ms): Min: 1.5, Avg: 1.8, Max: 2.4, Diff: 1.0, Sum: 7.2]
      [Update RS (ms): Min: 1.0, Avg: 1.5, Max: 1.7, Diff: 0.6, Sum: 5.9]
         [Processed Buffers: Min: 27, Avg: 34.8, Max: 41, Diff: 14, Sum: 139]
      [Scan RS (ms): Min: 0.3, Avg: 0.3, Max: 0.3, Diff: 0.0, Sum: 1.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.5, Max: 1.1, Diff: 1.1, Sum: 1.9]
      [Object Copy (ms): Min: 18.3, Avg: 19.0, Max: 19.6, Diff: 1.2, Sum: 76.1]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Termination Attempts: Min: 1, Avg: 1.8, Max: 3, Diff: 2, Sum: 7]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 0.3]
      [GC Worker Total (ms): Min: 23.1, Avg: 23.2, Max: 23.2, Diff: 0.1, Sum: 92.7]
      [GC Worker End (ms): Min: 146464.4, Avg: 146464.4,