记一次外部agent侵入导致的OOM排查过程

业务系统从公有云在迁移到专有云后,有几台服务器一直有问题,直接导致迁移进度无进展。通过

`jstat -gcutil pid  5000`

查看,发现应用程序的老年代已满,fullgc一直在增长,反而ygc很小。

猜测:

  1. 对象生成没有经过年轻代,直接进了老年代,程序里有大对象;

  2. 年轻代里没有连续的空间,导致无法创建数组直接进入老年代;

然后

jmap -heap pid

查看到应用的堆有2g,但是新生代只有167MB,新生代太小,不符合1:2的配置

Server compiler detected.
JVM version is 25.281-b09

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 2147483648 (2048.0MB)
   NewSize                  = 174456832 (166.375MB)
   MaxNewSize               = 174456832 (166.375MB)
   OldSize                  = 1973026816 (1881.625MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 536870912 (512.0MB)
   CompressedClassSpaceSize = 134217728 (128.0MB)
   MaxMetaspaceSize         = 536870912 (512.0MB)
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 157024256 (149.75MB)
   used     = 115393952 (110.04824829101562MB)
   free     = 41630304 (39.701751708984375MB)
   73.48797882538607% used
Eden Space:
   capacity = 139591680 (133.125MB)
   used     = 97961376 (93.42324829101562MB)
   free     = 41630304 (39.701751708984375MB)
   70.17708791813381% used
From Space:
   capacity = 17432576 (16.625MB)
   used     = 17432576 (16.625MB)
   free     = 0 (0.0MB)
   100.0% used
To Space:
   capacity = 17432576 (16.625MB)
   used     = 0 (0.0MB)
   free     = 17432576 (16.625MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 1973026816 (1881.625MB)
   used     = 320379248 (305.53746032714844MB)
   free     = 1652647568 (1576.0875396728516MB)
   16.237957102352937% used

这么小的新生代,不合理。先调整jvm参数的新生代为768m,并且把gc日志的明细输出。

-Xmn768m
-XX:+PrintGC -XX:+DisableExplicitGC -XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/app/logs/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/

事故再次发生后

2021-04-23T08:36:12.948+0800: 71534.555: [Full GC (Last ditch collection) 
2021-04-23T08:36:12.948+0800: 71534.555: [CMS: 1638219K->1659362K(1926784K), 2.6197429 secs] 1780257K->1659362K(2080128K), [Metaspace: 232942K->232942K(626688K)], 2.6200527 secs] [Times: user=2.57 sys=0.00, real=2.62 secs] 
Heap after GC invocations=755 (full 647):
 par new generation   total 153344K, used 0K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
  eden space 136320K,   0% used [0x0000000080000000, 0x0000000080000000, 0x0000000088520000)
  from space 17024K,   0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
  to   space 17024K,   0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
 concurrent mark-sweep generation total 1926784K, used 1659362K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 232905K, capacity 524136K, committed 524288K, reserved 626688K
  class space    used 18997K, capacity 28795K, committed 28800K, reserved 131072K
}
2021-04-23T08:36:15.570+0800: 71537.177: Application time: 0.0020953 seconds
{Heap before GC invocations=755 (full 647):
 par new generation   total 153344K, used 122K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
  eden space 136320K,   0% used [0x0000000080000000, 0x000000008001e968, 0x0000000088520000)
  from space 17024K,   0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
  to   space 17024K,   0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
 concurrent mark-sweep generation total 1926784K, used 1659362K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 232914K, capacity 524144K, committed 524288K, reserved 626688K
  class space    used 18998K, capacity 28800K, committed 28800K, reserved 131072K
2021-04-23T08:36:15.576+0800: 71537.183: [Full GC (Metadata GC Threshold) 
2021-04-23T08:36:15.576+0800: 71537.183: [CMS[YG occupancy: 122 K (153344 K)]
2021-04-23T08:36:16.706+0800: 71538.313: [weak refs processing, 0.0020164 secs]
2021-04-23T08:36:16.708+0800: 71538.315: [class unloading, 0.0453498 secs]
2021-04-23T08:36:16.753+0800: 71538.360: [scrub symbol table, 0.0375497 secs]
2021-04-23T08:36:16.791+0800: 71538.398: [scrub string table, 0.0051911 secs]: 1659362K->1659218K(1926784K), 1.5252660 secs] 1659484K->1659341K(2080128K), [Metaspace: 232914K->232914K(626688K)], 1.5254988 secs] [Times: user=1.53 sys=0.00, real=1.52 secs] 
Heap after GC invocations=756 (full 648):
 par new generation   total 153344K, used 122K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
  eden space 136320K,   0% used [0x0000000080000000, 0x000000008001e968, 0x0000000088520000)
  from space 17024K,   0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
  to   space 17024K,   0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
 concurrent mark-sweep generation total 1926784K, used 1659218K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 232914K, capacity 524144K, committed 524288K, reserved 626688K
  class space    used 18998K, capacity 28800K, committed 28800K, reserved 131072K
}
{Heap before GC invocations=756 (full 648):
 par new generation   total 153344K, used 122K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
  eden space 136320K,   0% used [0x0000000080000000, 0x000000008001e968, 0x0000000088520000)
  from space 17024K,   0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
  to   space 17024K,   0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
 concurrent mark-sweep generation total 1926784K, used 1659218K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 232914K, capacity 524144K, committed 524288K, reserved 626688K
  class space    used 18998K, capacity 28800K, committed 28800K, reserved 131072K
2021-04-23T08:36:17.101+0800: 71538.708: [Full GC (Last ditch collection) 
2021-04-23T08:36:17.101+0800: 71538.708: [CMS: 1659218K->1659176K(1926784K), 2.3683046 secs] 1659341K->1659176K(2080128K), [Metaspace: 232914K->232914K(626688K)], 2.3685772 secs] [Times: user=2.37 sys=0.00, real=2.37 secs] 
Heap after GC invocations=757 (full 649):
 par new generation   total 153344K, used 0K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
  eden space 136320K,   0% used [0x0000000080000000, 0x0000000080000000, 0x0000000088520000)
  from space 17024K,   0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
  to   space 17024K,   0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
 concurrent mark-sweep generation total 1926784K, used 1659176K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 232913K, capacity 524142K, committed 524288K, reserved 626688K
  class space    used 18998K, capacity 28799K, committed 28800K, reserved 131072K
}
2021-04-23T08:36:19.471+0800: 71541.078: Application time: 0.0011945 seconds
2021-04-23T08:36:19.472+0800: 71541.079: [GC (CMS Initial Mark) [1 CMS-initial-mark: 1659176K(1926784K)] 1659401K(2080128K), 0.0156439 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2021-04-23T08:36:19.488+0800: 71541.095: [CMS-concurrent-mark-start]
2021-04-23T08:36:19.510+0800: 71541.117: Application time: 0.0219117 seconds
2021-04-23T08:36:37.221+0800: 71558.828: Application time: 0.0803242 seconds
{Heap before GC invocations=757 (full 650):
 par new generation   total 153344K, used 29816K [0x0000000080000000, 0x000000008a660000, 0x000000008a660000)
  eden space 136320K,  21% used [0x0000000080000000, 0x0000000081d1e390, 0x0000000088520000)
  from space 17024K,   0% used [0x00000000895c0000, 0x00000000895c0000, 0x000000008a660000)
  to   space 17024K,   0% used [0x0000000088520000, 0x0000000088520000, 0x00000000895c0000)
 concurrent mark-sweep generation total 1926784K, used 1659176K [0x000000008a660000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 232940K, capacity 524160K, committed 524288K, reserved 626688K
  class space    used 18999K, capacity 28800K, committed 28800K, reserved 131072K
2021-04-23T08:36:37.223+0800: 71558.830: [Full GC (Metadata GC Threshold) 
2021-04-23T08:36:37.223+0800: 71558.830: [CMS2021-04-23T08:36:38.338+0800: 71559.945: [CMS-concurrent-mark: 1.165/18.850 secs] [Times: user=10.13 sys=1.41, real=18.85 secs] 
 (concurrent mode failure): 1659176K->1668506K(1926784K), 3.6348258 secs] 1688993K->1668506K(2080128K), [Metaspace: 232940K->232940K(626688K)], 3.6351573 secs] [Times: user=3.63 sys=0.02, real=3.64 secs] 
 .....

发现几个现象:

  1. 年轻代不往里写数据;

  2. 不停的fullgc;

  3. 每次fullgc后并没有释放太多的内存;

堆也打下来了,把堆拉下来,用jprofile打开,查看大对象

AgentClassLoader ??? 我们项目启动参数没有加agent啊。再往下点

`com.qt.memchk.MemCheck` 

让开发全局搜索了下,没有发现这个类。查看了下系统里所有java应用,发现了云盾相关的监控,难道是这个?先让运维把其中各一个节点的这两个监控kill掉(一个是两个服务A/B,一个服务两个节点A1/A2,B1/B2)kill掉的是A1和B1。

项目重启,两个小时没问题。触发了一次fullgc,老年代也没有问题。这个时候已经是4月25日了下班了。4月26日早上9点多,开发又说服务爆了。我查看这四台服务器。发现两台fullgc相对正常,两台fullgc都1w+了。当时没注意看,以为就是云盾影响的。还很高兴的给开发说,解决了,是阿里云盾的问题。

我把项目kill重启,这个时候发现不是停掉云盾的机器fullgc了。

想了下还是在机器里安装一个arthas吧,看看具体是什么玩意。查看了下classLoader,发现arthas的ClassLoader进来了, 还出现了AgentClassLoader。

这个时候引起我的注意了,arthas是通过attach的方式来监控的,那AgentClassLoader一定是一个Agent并且是通过attach进来的。

堆里一定有我没注意的东西。继续查看,选择对应的类,右键选择Use Selected Objects

直接点ok。

这是什么鬼?系统里怎么会有这个服务?

找运维,运维说这个是青藤云的agent,主要是做主机安全的。找安全部门,并让安全部门赶紧拉群。

`带着两个问题问对方`
`1,侵入了应用程序(这个是为什么?);`
`2,强引用,导致应用程序无法gc掉;`
`对方回答:`
`1. 这个是为了进行内存webshell检测,所以利用java attach机制在应用中运行了一个jar包`
`2. 这个问题目前我们已经发现,正在进行修复,会尽快上线`

无法理解,一个主机监控服务,尽然能侵入到业务应用中,而且我们的人还不知道这件事,多可怕。让运维把青藤云的服务都停了。重启,一直到现在也没再发现问题。

后续问题:

  1. 当-Xms -Xmx 都是2g的时候,没有设置Xmn ,启动后自由167mb。

  2. 90来次ygc就发生了2次fullgc

1 和我理解的不太一样,不设置xmn 系统应该走默认的1:2的分配机制,然并卵。我又查看了下初始化的参数

`java -XX:+PrintFlagsInitial|grep NewSize`

发现eden区的大小和这个值很接近。

有空了把剩下的两个问题处理下。

想了解更多,请关注公众号5ycode
公众号图片

评论 4
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值