最近设计开发的系统本周正式生产上线了。这周的主要工作除了完成整个链路打通之外,其他时间都花在生产系统的细节调教上了。比如分布式环境下一致性的问题该怎么解决(基本没得解决);再比如昵称中各种诡异的特殊符号导致的异常等等。
目前系统运行的情况大概这样:
QPS在2000左右,正常压测时200K的QPS基本表现正常,枫哥说理论的设计上限大概在500K左右,所以现在的QPS基本是毛毛雨。
响应时间上,当时期望是压缩到2ms以内,但实际现在看起来稳定在2ms左右了,有一定差距但基本也还能用,整体上比较正常。
日志层面上,经过这两天的优化,基本已经消除了所有的Exception。除了有几个调皮的使用error作为昵称的用户之外,已经基本搜索不到error日志了。
响应时间分布上,第二行是写入效率,第三行是查询效率。写入效率可以暂时不管,毕竟系统的特性是写少读多。
读取时间的99.9线在3.9ms基本也还是能接受,不过99.999线到50ms这个就有点不太能接受了。因为系统的每一行代码都是我自己亲自写的,所以整个业务流程执行时间超过10ms,那是绝对是不可能的。
为什么会出现极少的请求响应时间明显升高这么多呢?排查了下,大致定位了两个地方可能会存在问题:
-
MySQL数据库会出现极少情况的查询缓慢的现象,比如下面的图中可以很明显看到查询数据耗时170ms了:
-
另一个就是JVM的垃圾回收了,对于平均响应时间在ms级别的服务来讲,垃圾回收的STW(Stop The World)中断,真的是太扎眼了。
为了能直观观察垃圾回收导致的中断时间,我在生产服务上增加了部分参数:
prod:
args: [
'-XX:+AlwaysPreTouch',
'-XX:+UseG1GC',
'-XX:ParallelGCThreads=4',
'-XX:ConcGCThreads=1',
'-Xloggc:gc.log',
'-XX:+PrintGCDetails',
'-XX:+PrintGCApplicationStoppedTime',
'-Xms6g',
'-Xmx6g',
'-Xss256k',
'-XX:MetaspaceSize=256m',
'-XX:MaxMetaspaceSize=256m',
...
]
上面这段配置是JVM垃圾回收相关的调优参数
-
-XX:+UseG1GC:使用G1垃圾回收器
-
-XX:ParallelGCThreads=4:并发线程数4,官方推荐和物理核心数量保持一致
-
-Xloggc:gc.log:打印垃圾回收日志到gc.log文件中
-
-XX:+PrintGCDetails:打印垃圾回收的详细日志信息
-
-XX:+PrintGCApplicationStoppedTime:打印每次业务线程被中断的时长
将上面参数调整后发布至生产后,主要观察gc.log文件,贴出文件的部分内容展示如下:
4725.374: Total time for which application threads were stopped: 0.0004082 seconds, Stopping threads took: 0.0000912 seconds
4725.375: Total time for which application threads were stopped: 0.0009361 seconds, Stopping threads took: 0.0000177 seconds
4726.375: Total time for which application threads were stopped: 0.0003315 seconds, Stopping threads took: 0.0000699 seconds
4753.889: Total time for which application threads were stopped: 0.0012248 seconds, Stopping threads took: 0.0000997 seconds
4758.889: Total time for which application threads were stopped: 0.0003172 seconds, Stopping threads took: 0.0000676 seconds
4780.966: Total time for which application threads were stopped: 0.0003868 seconds, Stopping threads took: 0.0000940 seconds
4780.966: Total time for which application threads were stopped: 0.0001544 seconds, Stopping threads took: 0.0000170 seconds
4803.916: Total time for which application threads were stopped: 0.0003575 seconds, Stopping threads took: 0.0000808 seconds
4803.917: Total time for which application threads were stopped: 0.0001241 seconds, Stopping threads took: 0.0000108 seconds
4803.917: Total time for which application threads were stopped: 0.0001276 seconds, Stopping threads took: 0.0000098 seconds
4803.917: Total time for which application threads were stopped: 0.0001207 seconds, Stopping threads took: 0.0000083 seconds
4803.917: Total time for which application threads were stopped: 0.0001238 seconds, Stopping threads took: 0.0000094 seconds
4803.917: Total time for which application threads were stopped: 0.0001376 seconds, Stopping threads took: 0.0000076 seconds
4813.889: Total time for which application threads were stopped: 0.0014204 seconds, Stopping threads took: 0.0001986 seconds
4813.900: Total time for which application threads were stopped: 0.0004117 seconds, Stopping threads took: 0.0001083 seconds
4836.306: Total time for which application threads were stopped: 0.0003935 seconds, Stopping threads took: 0.0000729 seconds
4836.306: Total time for which application threads were stopped: 0.0001860 seconds, Stopping threads took: 0.0000518 seconds
4838.307: Total time for which application threads were stopped: 0.0004173 seconds, Stopping threads took: 0.0000994 seconds
4850.490: [GC pause (G1 Evacuation Pause) (young), 0.0110336 secs]
[Parallel Time: 8.6 ms, GC Workers: 4]
[GC Worker Start (ms): Min: 4850490.5, Avg: 4850490.5, Max: 4850490.5, Diff: 0.0]
[Ext Root Scanning (ms): Min: 1.1, Avg: 1.6, Max: 2.7, Diff: 1.6, Sum: 6.3]
[Update RS (ms): Min: 0.4, Avg: 1.3, Max: 1.6, Diff: 1.3, Sum: 5.2]
[Processed Buffers: Min: 27, Avg: 43.0, Max: 54, Diff: 27, Sum: 172]
[Scan RS (ms): Min: 0.2, Avg: 0.2, Max: 0.2, Diff: 0.0, Sum: 0.8]
[Code Root Scanning (ms): Min: 0.0, Avg: 0.4, Max: 0.8, Diff: 0.7, Sum: 1.5]
[Object Copy (ms): Min: 4.5, Avg: 4.9, Max: 5.3, Diff: 0.9, Sum: 19.7]
[Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
[Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
[GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.0, Sum: 0.2]
[GC Worker Total (ms): Min: 8.4, Avg: 8.4, Max: 8.4, Diff: 0.0, Sum: 33.7]
[GC Worker End (ms): Min: 4850498.9, Avg: 4850498.9, Max: 4850498.9, Diff: 0.1]
[Code Root Fixup: 0.1 ms]
[Code Root Purge: 0.0 ms]
[Clear CT: 0.5 ms]
[Other: 1.9 ms]
[Choose CSet: 0.0 ms]
[Ref Proc: 0.4 ms]
[Ref Enq: 0.0 ms]
[Redirty Cards: 0.2 ms]
[Humongous Register: 0.0 ms]
[Humongous Reclaim: 0.0 ms]
[Free CSet: 0.9 ms]
[Eden: 3668.0M(3668.0M)->0.0B(3668.0M) Survivors: 18.0M->18.0M Heap: 3746.9M(6144.0M)->78.5M(6144.0M)]
[Times: user=0.04 sys=0.00, real=0.00 secs]
非G1评估阶段的中断基本都在1ms之内,基本可以忽略不计了。
再查看下每次GC回收时的停顿时间:
59.596: [GC pause (G1 Evacuation Pause) (young), 0.0349198 secs]
74.724: [GC pause (G1 Evacuation Pause) (young), 0.0359142 secs]
168.096: [GC pause (G1 Evacuation Pause) (young), 0.0269115 secs]
189.029: [GC pause (G1 Evacuation Pause) (young), 0.0123414 secs]
328.759: [GC pause (G1 Evacuation Pause) (young), 0.0127940 secs]
527.381: [GC pause (G1 Evacuation Pause) (young), 0.0131356 secs]
807.420: [GC pause (G1 Evacuation Pause) (young), 0.0143278 secs]
1089.651: [GC pause (G1 Evacuation Pause) (young), 0.0162936 secs]
1369.647: [GC pause (G1 Evacuation Pause) (young), 0.0147197 secs]
1647.889: [GC pause (G1 Evacuation Pause) (young), 0.0164485 secs]
1924.121: [GC pause (G1 Evacuation Pause) (young), 0.0184118 secs]
2199.834: [GC pause (G1 Evacuation Pause) (young), 0.0192690 secs]
2473.893: [GC pause (G1 Evacuation Pause) (young), 0.0190029 secs]
2746.734: [GC pause (G1 Evacuation Pause) (young), 0.0178737 secs]
3017.829: [GC pause (G1 Evacuation Pause) (young), 0.0181908 secs]
3287.533: [GC pause (G1 Evacuation Pause) (young), 0.0197779 secs]
3552.442: [GC pause (G1 Evacuation Pause) (young), 0.0175253 secs]
3816.262: [GC pause (G1 Evacuation Pause) (young), 0.0165769 secs]
4078.101: [GC pause (G1 Evacuation Pause) (young), 0.0128744 secs]
4339.099: [GC pause (G1 Evacuation Pause) (young), 0.0116070 secs]
4594.891: [GC pause (G1 Evacuation Pause) (young), 0.0122078 secs]
4850.490: [GC pause (G1 Evacuation Pause) (young), 0.0110336 secs]
基本的停顿时间都在10 ~ 40 ms之间,垃圾回收周期大概在5min左右(主要是现在流量也不大在,这个周期还是挺长的)。
还有啥方法可以用来降低中断时间吗?网上找了下,基本没找到啥有用的办法。G1的垃圾回收停顿时间大部分经验值都是在10ms ~ 600ms之间。目前系统大部分情况可以保证只停顿10ms多一点的样子,已经很不错了。
哎,也难怪业界对ZGC抱有如此高的期望了。对于G1来讲,想达到100%请求都在ms级别完成响应,实在是一句"成妾做不到"啊!