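0. The Problem
After a new version went live, memory usage and inbound traffic both shot up with no obvious cause. Some endpoints started reporting OOM exceptions, and then the Pod crashed and went into an endless restart loop.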
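1. Preparation
The Pod was already hooked up to NewRelic and Graylog, but neither pointed to the real culprit, so the only option left was to go into the Pod's container and capture a memory dump. Our base image is built on Alpine 3.18, so after entering the container I ran the following commands to install the necessary tools.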
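# Our image is based on the runtime image, so the SDK has to be installed manually for the later steps.
# bash is installed as well: the later steps use it interactively, because the built-in sh is awkward to work with.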
apk add dotnet6-sdk bash
# Install the dump tool
dotnet tool install --global dotnet-dump
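If the shell cannot find dotnet-dump right after the install, the tools directory is probably not on PATH yet. A minimal sketch, assuming the tool was installed for the current user and therefore lives in the default global tools location, $HOME/.dotnet/tools (the TOOLS_DIR variable is just a name chosen for this example):

```shell
# Assumption: --global tools are installed under "$HOME/.dotnet/tools".
TOOLS_DIR="$HOME/.dotnet/tools"

# Append the directory to PATH only if it is not already there.
case ":$PATH:" in
  *":$TOOLS_DIR:"*) ;;                    # already on PATH, nothing to do
  *) export PATH="$PATH:$TOOLS_DIR" ;;    # append once for this session
esac
```

This only affects the current shell session, which is enough for a one-off debugging session inside the container.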
The container's ENTRYPOINT runs the .NET program directly, so its PID is normally 1. If you are not sure of the process ID, list the processes in the container first.
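One possible way to look up the process ID, as a sketch rather than the exact command used here: dotnet-dump ps lists the .NET processes that a dump can be collected from. The pid_cmdline helper below is a hypothetical name introduced for this example; it reads /proc, so it assumes a Linux container.

```shell
# Print the command line of the given PID by reading /proc (Linux only).
# Handy for confirming that PID 1 really is the dotnet process.
pid_cmdline() {
  tr '\0' ' ' < "/proc/$1/cmdline"
}

# Prefer the tool's own process listing when it is installed;
# otherwise fall back to inspecting PID 1 by hand.
if command -v dotnet-dump >/dev/null 2>&1; then
  dotnet-dump ps
else
  pid_cmdline 1
fi
```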
I then ran dotnet-dump collect -p 1
to collect the dump, but got the following error:
/build# dotnet-dump collect -p 1
Writing full to /build/core_20240307_090401
Write dump failed - HRESULT: 0x00000000.
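After some searching, it turned out that the Pod did not have sufficient privileges to perform the dump. The fix was to edit the YAML definition of the Rollout (or Deployment), add the following securityContext,
and apply it. After that, the dump could be collected correctly.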
securityContext:
  capabilities:
    add:
      - SYS_PTRACE
      - SYS_ADMIN
  seccompProfile:
    type: RuntimeDefault
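Once the Pod has restarted with the new definition, one way to confirm that the capabilities took effect is to decode the effective capability mask (the CapEff line in /proc/1/status). This is a sketch under stated assumptions: has_cap is a hypothetical helper name, and the bit numbers come from linux/capability.h, where CAP_SYS_PTRACE is 19 and CAP_SYS_ADMIN is 21.

```shell
# Succeed if bit number $2 is set in the hexadecimal capability mask $1.
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# CapEff is the effective capability mask of the process, printed in hex.
mask=$(awk '/^CapEff:/ {print $2}' /proc/1/status 2>/dev/null)

# Bit numbers from linux/capability.h: CAP_SYS_PTRACE=19, CAP_SYS_ADMIN=21.
if [ -n "$mask" ]; then
  if has_cap "$mask" 19; then echo "SYS_PTRACE: yes"; else echo "SYS_PTRACE: no"; fi
  if has_cap "$mask" 21; then echo "SYS_ADMIN: yes"; else echo "SYS_ADMIN: no"; fi
fi
```

If either line prints "no", the securityContext most likely did not reach the container that runs the application.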
Running dotnet-dump collect -p 1 again
produced the dump file. I copied it to a mounted NFS volume and then downloaded it to my local machine for debugging.
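2. Investigation
With the dump file in hand, there are several tools that can analyze it. Here I used the dotnet-dump
command. I am on macOS, and with dotnet-dump
I can start analyzing right away. You can also open the dump with Visual Studio, dotMemory, or WinDbg, whichever you prefer.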
Run dotnet-dump analyze <dump file path>
to enter the interactive session:
Loading core dump: D:\dotNET_Dumps\\core_20240307_142201 ...
Ready to process analysis commands. Type 'help' to list available commands or 'help [command]' to get detailed help on a command.
Type 'quit' or 'exit' to exit the session.
First, let's look at the current state of the GC heap:
> eeheap -gc
========================================
Number of GC Heaps: 3
----------------------------------------
Heap 0 (00007faa2a73b6b0)
generation 0 starts at 7fa2495932e8
generation 1 starts at 7fa2458279f0
generation 2 starts at 7fa232703000
ephemeral segment allocation context: none
Small object heap
segment begin allocated committed allocated size committed size
7fa232702000 7fa232703000 7fa249be4020 7fa252174000 0x174e1020 (390991904) 0x1fa72000 (531046400)
Large object heap starts at 7fa3b2703000
segment begin allocated committed allocated size committed size
7fa3b2702000 7fa3b2703000 7fa3e3dfc348 7fa3e3dfd000 0x316f9348 (829395784) 0x316fb000 (829403136)
Pinned object heap starts at 7fa6b2703000
segment begin allocated committed allocated size committed size
7fa6b2702000 7fa6b2703000 7fa6b27d4bb8 7fa6b27d5000 0xd1bb8 (859064) 0xd3000 (864256)
------------------------------
Heap 1 (00007faa2a68b6e0)
generation 0 starts at 7fa2c75ae080
generation 1 starts at 7fa2c40eec00
generation 2 starts at 7fa2b2703000
ephemeral segment allocation context: none
Small object heap
segment begin allocated committed allocated size committed size
7fa2b2702000 7fa2b2703000 7fa2c9b1ebb0 7fa2d00b8000 0x1741bbb0 (390183856) 0x1d9b6000 (496721920)
Large object heap starts at 7fa4b2703000
segment begin allocated committed allocated size committed size
7fa4b2702000 7fa4b2703000 7fa4e3f804f0 7fa4e3f81000 0x3187d4f0 (830985456) 0x3187f000 (830992384)
Pinned object heap starts at 7fa7b2703000
segment begin allocated committed allocated size committed size
7fa7b2702000 7fa7b2703000 7fa7b2703018 7fa7b2704000 0x18 (24) 0x2000 (8192)
------------------------------
Heap 2 (00007faa2a5db720)
generation 0 starts at 7fa3466d0298
generation 1 starts at 7fa343173ee0
generation 2 starts at 7fa332703000
ephemeral segment allocation context: none
Small object heap
segment begin allocated committed allocated size committed size
7fa332702000 7fa332703000 7fa348631878 7fa34f736000 0x15f2e878 (368240760) 0x1d034000 (486752256)
Large object heap starts at 7fa5b2703000
segment begin allocated committed allocated size committed size
7fa5b2702000 7fa5b2703000 7fa5e519c3b0 7fa5e519d000 0x32a993b0 (849974192) 0x32a9b000 (849981440)
Pinned object heap starts at 7fa8b2703000
segment begin allocated committed allocated size committed size
7fa8b2702000 7fa8b2703000 7fa8b270c0f0 7fa8b2714000 0x90f0 (37104) 0x12000 (73728)
------------------------------
GC Allocated Heap Size: Size: 0xda315cf0 (3660668144) bytes.
GC Committed Heap Size: Size: 0xeff58000 (4025843712) bytes.
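To put the numbers in this output side by side, the "allocated size" of the three Large object heap segments can be summed and compared with the GC Allocated Heap Size. A small sketch using only figures copied from the output above (the variable names are chosen for this example):

```shell
# "allocated size" of the Large object heap segment in Heap 0, 1 and 2.
loh_total=$((829395784 + 830985456 + 849974192))

# "GC Allocated Heap Size" reported at the end of the output.
heap_total=3660668144

# Integer percentage of the allocated GC heap that sits on the LOH.
loh_percent=$((loh_total * 100 / heap_total))

echo "LOH: $loh_total bytes, about ${loh_percent}% of the allocated GC heap"
```

The LOH accounts for roughly two thirds of the allocated heap, split almost evenly across the three heaps.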
There are 3 GC heaps, and most of the memory sits on the LOH (large object heap). Let's use dumpheap -stat -min 85000
to see how many objects are at least 85,000 bytes in size:
> dumpheap -stat -min 85000
Statistics:
MT Count TotalSize Class Name
7fa9b9be29c0 1 85,112 Serilog.Events.LogEventPropertyValue[]
7fa9ba87d710 1 117,464 Microsoft.AspNetCore.Routing.Matching.DfaState[]
7fa9b327b110 2 261,648 System.Object[]
7fa9b3348080 2 849,380 System.Int32[]
7fa9bb1e29f8 5 1,441,912 ***.Core.***.*************[]
7fa9b334d2e0 6 1,939,370 System.String
7fa9bb3589a0 1 2,097,176 ***.Core.***.***.***[]
7fa9b5200528 9 2,228,440 ***.Core.***.***[]
7fa9b5206200 20 3,670,496 ***.Core.***.***[]
7fa9bb3625e8 1 4,506,048 System.Collections.Generic.Dictionary<System.String, ***.***.***.***.***>+Entry[]
7fa9b338edd0 20 9,716,748 System.Char[]
7faa2cb14350 76 13,295,160 Free
7fa9b3d60c98 1,100 2,464,160,840 System.Byte[]
Total 1,244 objects, 2,504,369,794 bytes
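A quick check of the Statistics table makes the imbalance explicit. The sketch below uses only the System.Byte[] row and the Total line from the output above; the variable names are chosen for this example.

```shell
byte_array_total=2464160840   # TotalSize of System.Byte[]
byte_array_count=1100         # Count of System.Byte[]
all_total=2504369794          # bytes on the "Total" line

# Integer share of all large objects taken by byte arrays, in percent.
share=$((byte_array_total * 100 / all_total))

# Average size of a single byte array, in bytes.
average=$((byte_array_total / byte_array_count))

# Size in GiB, using awk for the floating-point division.
gib=$(awk -v b="$byte_array_total" 'BEGIN { printf "%.2f", b / (1024 * 1024 * 1024) }')

echo "System.Byte[]: ${gib} GiB, ${share}% of large objects, average ${average} bytes each"
```

Byte arrays make up nearly all of the large-object bytes, with an average size of a little over 2 MB per array.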
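The output shows 1,100 System.Byte[] instances of at least 85,000 bytes each, adding up to 2,464,160,840 bytes (almost 2.3 GiB), so this is where the problem lies. Next, use dumpheap -type System.Byte[]
to list the individual objects and get their addresses: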
7fa5d5175480 7fa9b3d60c98 18,749,311
7fa5d6356c20 7fa9b3d60c98 6,734,857
7fa5d69c3050 7fa9b3d60c98 878,704
7fa5d6a998e0 7fa9b3d60c98 174,565
7fa5d6ad21c0 7fa9b3d60c98 18,749,311
7fa5d7cb3960 7fa9b3d60c98 6,734,857
7fa5d831fd90 7fa9b3d60c98 10,670,254
7fa5d8d4ce60 7fa9b3d60c98 10,670,254
7fa5d9779f30 7fa9b3d60c98 18,749,311
7fa5da95b6d0 7fa9b3d60c98 18,749,311
7fa5dbb3ce70 7fa9b3d60c98 1,931,776
7fa5dbd6b8e0 7fa9b3d60c98 6,842,488
7fa5dc3f2178 7fa9b3d60c98 7,773,830
7fa5dcb5c020 7fa9b3d60c98 7,773,830
7fa5dd2c5ec8 7fa9b3d60c98 7,773,830
7fa5dda2fd70 7fa9b3d60c98 12,585,235
7fa5de6306a8 7fa9b3d60c98 1,889,260
7fa5de7fdab8 7fa9b3d60c98 1,172,106
7fa5de91bd68 7fa9b3d60c98 134,508
7fa5de94dff8 7fa9b3d60c98 8,857,584
7fa5df1c0808 7fa9b3d60c98 6,842,488
7fa5df8470a0 7fa9b3d60c98 6,842,488
7fa5dfecd938 7fa9b3d60c98 6,842,488
7fa5e05541d0 7fa9b3d60c98 8,857,584
7fa5e0dc69e0 7fa9b3d60c98 7,773,449
7fa5e1530710 7fa9b3d60c98 7,773,449
7fa5e1c9a440 7fa9b3d60c98 980,321
7fa5e1d899c8 7fa9b3d60c98 1,052,316
7fa5e1e8a888 7fa9b3d60c98 1,052,316
7fa5e1f8b748 7fa9b3d60c98 7,373,509
7fa5e2693a30 7fa9b3d60c98 7