Original author: dm03514
SRE: Debugging Simple Memory Leaks in Go
Memory leaks are a class of bugs where memory is not released even after it is no longer needed. They are often explicit and highly visible, which makes them a great place to begin learning how to debug. Go is a language particularly well suited to identifying memory leaks because of its powerful toolchain, which ships with amazingly capable tools (pprof) that make pinpointing memory usage easy.
I’m hoping this post will illustrate how to visually identify a memory leak, narrow it down to a specific process, correlate the process leak with work, and finally find the source of the leak using pprof. The example in this post is contrived so that the root cause of the memory leak can be identified simply. The pprof overview is intentionally brief; it aims to illustrate what the tool is capable of and isn’t an exhaustive overview of its features.
The service used to generate the data in this post is available here.
What is a memory leak?
If memory grows unbounded and never reaches a steady state then there is probably a leak. The key here is that memory grows without ever reaching a steady state, and eventually causes problems through explicit crashes or by impacting system performance.
Memory leaks can happen for any number of reasons. There can be logical leaks where data structures grow unbounded, leaks caused by the complexities of poor object-reference handling, or many other causes. Regardless of the source, many memory leaks elicit a visually noticeable pattern: the “sawtooth”.
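To make the “unbounded data structure” case concrete, here is a minimal, hypothetical sketch (the names and sizes are illustrative, not taken from the post’s service): a handler appends a payload to a package-level slice on every request and never trims it, so the heap grows until the process is killed or restarted.

package main

import (
	"net/http"
	"sync"
)

// Illustrative only: a package-level slice that is appended to on every
// request and never trimmed -- the classic unbounded-growth leak.
var (
	mu      sync.Mutex
	history [][]byte
)

func leakyHandler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	history = append(history, make([]byte, 10*1024)) // retained forever
	mu.Unlock()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", leakyHandler)
	http.ListenAndServe(":8080", nil)
}

Restarting the process frees everything, which is exactly why the graphs later in this post drop sharply on each restart and then climb again.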
Debug Process
This blog post is focused on exploring how to identify and pinpoint the root cause of a Go memory leak. We’ll be focusing primarily on the characteristics of memory leaks, how to identify them, and how to determine their root cause using Go’s tooling. Because of this, our actual debug process will be relatively superficial and informal.
The goal of our analysis is to progressively narrow the scope of the problem by whittling away possibilities until we have enough information to form and propose a hypothesis. Once we have enough data and a reasonable scope for the cause, we should form a hypothesis and try to invalidate it with data.
Each step will try to either pinpoint a cause of the issue or rule out a non-cause. Along the way we’ll be forming a series of hypotheses; they will necessarily be general at first and then progressively more specific. This is loosely based on the scientific method. Brendan Gregg does an amazing job of covering different methodologies for system investigation (primarily focused on performance).
Just to reiterate we’ll try to:
- Ask a question
- Form a hypothesis
- Analyze the hypothesis
- Repeat until the root cause is found
Identification
How do we even know if there is a problem (i.e. a memory leak)? Explicit errors are direct indicators of an issue. For memory leaks, common indicators are OOM errors or explicit system crashes.
OOM errors
Errors are the most explicit indicator of a problem. While user-generated errors have the potential to produce false positives if their logic is off, an OOM error is the OS literally indicating that something is using too much memory. In the error below this manifests as cgroup limits being reached and the container being killed.
dmesg
[14808.063890] main invoked oom-killer: gfp_mask=0x24000c0, order=0, oom_score_adj=0 [7/972]
[14808.063893] main cpuset=34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6 mems_allowed=0
[14808.063899] CPU: 2 PID: 11345 Comm: main Not tainted 4.4.0-130-generic #156-Ubuntu
[14808.063901] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[14808.063902] 0000000000000286 ac45344c9134371f ffff8800b8727c88 ffffffff81401c43
[14808.063906] ffff8800b8727d68 ffff8800b87a5400 ffff8800b8727cf8 ffffffff81211a1e
[14808.063908] ffffffff81cdd014 ffff88006a355c00 ffffffff81e6c1e0 0000000000000206
[14808.063911] Call Trace:
[14808.063917] [<ffffffff81401c43>] dump_stack+0x63/0x90
[14808.063928] [<ffffffff81211a1e>] dump_header+0x5a/0x1c5
[14808.063932] [<ffffffff81197dd2>] oom_kill_process+0x202/0x3c0
[14808.063936] [<ffffffff81205514>] ? mem_cgroup_iter+0x204/0x3a0
[14808.063938] [<ffffffff81207583>] mem_cgroup_out_of_memory+0x2b3/0x300
[14808.063941] [<ffffffff8120836d>] mem_cgroup_oom_synchronize+0x33d/0x350
[14808.063944] [<ffffffff812033c0>] ? kzalloc_node.constprop.49+0x20/0x20
[14808.063947] [<ffffffff81198484>] pagefault_out_of_memory+0x44/0xc0
[14808.063967] [<ffffffff8106d622>] mm_fault_error+0x82/0x160
[14808.063969] [<ffffffff8106dae9>] __do_page_fault+0x3e9/0x410
[14808.063972] [<ffffffff8106db32>] do_page_fault+0x22/0x30
[14808.063978] [<ffffffff81855c58>] page_fault+0x28/0x30
[14808.063986] Task in /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6 killed as a result of limit of /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6
[14808.063994] memory: usage 204800kB, limit 204800kB, failcnt 4563
[14808.063995] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[14808.063997] kmem: usage 7524kB, limit 9007199254740988kB, failcnt 0
[14808.063998] Memory cgroup stats for /docker/34186d9bd07706222bd427bb647ceed81e8e108eb653ff73c7137099fca1cab6: cache:108KB rss:197168KB rss_huge:0KB mapped_file:4KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:197168KB inactive_file:88KB active_file:4KB unevictable:0KB
[14808.064008] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[14808.064117] [10517] 0 10517 74852 4784 32 5 0 0 go
[14808.064121] [11344] 0 11344 97590 46185 113 5 0 0 main
[14808.064125] Memory cgroup out of memory: Kill process 11344 (main) score 904 or sacrifice child
[14808.083306] Killed process 11344 (main) total-vm:390360kB, anon-rss:183712kB, file-rss:1016kB
Question: Is the error a regular repeating issue?
Hypothesis: OOM errors are significant enough that they should rarely occur. There is a memory leak in one of the processes.
Prediction: Either the process memory limit has been set too low and there was an uncharacteristic bump, or there is a larger issue.
Test: Upon further inspection there are quite a few OOM errors, suggesting this is a serious issue and not a one-off. Check the system memory for a historical view of memory usage.
System Memory
The next step after identifying a potential problem is to get an idea of system-wide memory usage. Memory leaks frequently display a “sawtooth” pattern. The spikes correspond to the application running while the dips correspond to a service restart.
A sawtooth is characteristic of a memory leak, especially when it corresponds with service deploys. I’m using a test project to illustrate the leak, but even a slow leak would look like a sawtooth if the time range is zoomed out far enough. Over a smaller time range it would look like a gradual rise followed by a drop-off on process restart.
The graph above shows the example of a sawtooth memory growth. Memory continually grows without flatlining. This is a smoking gun for memory issues.
Question: Which process (or processes) is (are) responsible for the memory growth?
Test: Analyze per-process memory. There could be information in the dmesg logs indicating which process, or class of processes, is the target of the OOM killer.
Per Process Memory
Once a memory leak is suspected, the next step is to identify a process that is contributing to, or causing, the system memory growth. Having per-process historical memory metrics is a crucial requirement (for container-based systems, resource metrics are available through a tool like cAdvisor). Go’s Prometheus client provides per-process memory metrics by default, which is where the graph below gets its data.
The below graph shows a process which is very similar to the system sawtooth memory leak graph above: continual growth until process restarts.
Memory is a critical resource and can be used to indicate abnormal resource usage, or could be used as a dimension for scaling. Additionally, having memory stats helps inform how to set container-based (cgroup) memory limits. The specifics of the graph values above can be found in the metric source code. After the process has been identified, it’s time to dig in and find out which specific part of the code is responsible for this memory growth.
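For reference, here is a minimal sketch of how such per-process metrics can be exposed with the Go Prometheus client (the port and endpoint path are assumptions, not the post’s actual setup). The default registry already includes the Go runtime and process collectors, so serving it is enough to get go_memstats_* and process_resident_memory_bytes series.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry registers Go runtime and process collectors,
	// so /metrics exposes go_memstats_* and process_resident_memory_bytes
	// without any extra instrumentation.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}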
Root Cause Analysis / Source Analysis
Go Memory Analysis
Once again Prometheus gives us detailed information about the Go runtime and what our process is doing. The chart shows that bytes are continually allocated to the heap until a restart. Each dip corresponds with the service process restart.
Question: Which part(s) of the application is(are) leaking memory?
Hypothesis: There’s a memory leak in a routine which is continually allocating memory to the heap (global variable or pointer, potentially visible through escape analysis)
Test: Correlate the memory usage with an event.
Correlation With Work
Establishing a correlation will help to partition the problem space by answering: is this happening online (in relation to transactions) or in the background?
One way to determine this could be to start the service and let it idle without applying any transactional load. Is the service leaking? If so it could be the framework or a shared library. Our example happens to have a strong correlation with transactional workload.
HTTP Request Count
The above graph shows the count of HTTP requests. It directly matches the system memory growth in both shape and time, which makes HTTP request handling a good place to start digging.
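The post doesn’t show how its request counter is produced, but as a hedged sketch, a per-request counter like the one graphed above could be exported with the Prometheus client roughly as follows (the metric name and middleware wiring are assumptions):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

var httpRequests = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "Total number of HTTP requests handled.",
})

func init() {
	prometheus.MustRegister(httpRequests)
}

// countRequests wraps any handler and increments the counter per request,
// which is what makes the correlation with memory growth possible.
func countRequests(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		httpRequests.Inc()
		next.ServeHTTP(w, r)
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	http.Handle("/", countRequests(ok))
	http.ListenAndServe(":8080", nil)
}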
Question: Which parts of the application are responsible for the heap allocations?
Hypothesis: There is an HTTP handler that is continually allocating memory to the heap.
Test: Periodically analyze heap allocations while the program is running in order to track memory growth.
Go Memory Allocations
In order to inspect how much memory is being allocated and the source of those allocations we’ll use pprof. pprof is an absolutely amazing tool and one of the main reasons that I personally use Go. In order to use it we’ll have to first enable it, and then take some snapshots. If you’re already serving HTTP, enabling it is literally as easy as:
import _ "net/http/pprof"
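The blank import registers the pprof handlers on http.DefaultServeMux, so a service that already serves HTTP off the default mux gets the /debug/pprof/* endpoints for free. A minimal sketch, assuming port 8080 to match the curl commands below:

package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Any server using http.DefaultServeMux now also serves the pprof
	// endpoints, e.g. /debug/pprof/heap used in the snapshots below.
	http.ListenAndServe(":8080", nil)
}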
Once pprof is enabled we’ll periodically take heap snapshots throughout the life of process memory growth. Taking a heap snapshot is just as trivial:
curl http://localhost:8080/debug/pprof/heap > heap.0.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.1.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.2.pprof
sleep 30
curl http://localhost:8080/debug/pprof/heap > heap.3.pprof
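If hitting the HTTP endpoint isn’t convenient, heap profiles can also be written programmatically from inside the process under investigation with runtime/pprof. A hedged sketch (the file names, interval, and count are arbitrary and simply mirror the curl loop above):

package main

import (
	"fmt"
	"net/http"
	"os"
	"runtime/pprof"
	"time"
)

// writeHeapProfiles dumps a heap profile every interval so that growth
// between snapshots can be compared later with go tool pprof.
func writeHeapProfiles(interval time.Duration, count int) {
	for i := 0; i < count; i++ {
		f, err := os.Create(fmt.Sprintf("heap.%d.pprof", i))
		if err == nil {
			_ = pprof.Lookup("heap").WriteTo(f, 0)
			f.Close()
		}
		time.Sleep(interval)
	}
}

func main() {
	go writeHeapProfiles(30*time.Second, 4) // runs alongside the service
	http.ListenAndServe(":8080", nil)
}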
The goal is to get an idea of how memory is growing throughout the life of the program. Let’s inspect the most recent heap snapshot:
$ go tool pprof pprof/heap.3.pprof
Local symbolization failed for main: open /tmp/go-build598947513/b001/exe/main: no such file or directory
Some binary filenames not available. Symbolization may be incomplete.
Try setting PPROF_BINARY_PATH to the search path for local binaries.
File: main
Type: inuse_space
Time: Jul 30, 2018 at 6:11pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) svg
Generating report in profile002.svg
(pprof) top20
Showing nodes accounting for 410.75MB, 99.03% of 414.77MB total
Dropped 10 nodes (cum <= 2.07MB)
      flat  flat%   sum%        cum   cum%
  408.97MB 98.60% 98.60%   408.97MB 98.60%  bytes.Repeat
    1.28MB  0.31% 98.91%   410.25MB 98.91%  main.(*RequestTracker).Track
    0.50MB  0.12% 99.03%   414.26MB 99.88%  net/http.(*conn).serve
         0     0% 99.03%   410.25MB 98.91%  main.main.func1
         0     0% 99.03%   410.25MB 98.91%  net/http.(*ServeMux).ServeHTTP
         0     0% 99.03%   410.25MB 98.91%  net/http.HandlerFunc.ServeHTTP
         0     0% 99.03%   410.25MB 98.91%  net/http.serverHandler.ServeHTTP
This is absolutely amazing. pprof defaults to Type: inuse_space, which displays all the objects that are currently in memory at the time of the snapshot. We can see here that bytes.Repeat is directly responsible for 98.60% of all of our memory!
The line below the bytes.Repeat entry shows:
    1.28MB  0.31% 98.91%   410.25MB 98.91%  main.(*RequestTracker).Track
This is really interesting: it shows that Track itself holds 1.28MB, or 0.31%, but is responsible for 98.91% of all in-use memory! Furthermore, we can see that the http machinery has even less memory in use but is responsible for even more than Track (since Track is called from it).
pprof exposes many ways to introspect and visualize memory (in-use memory size, in-use number of objects, allocated memory size, allocated number of objects). It also allows listing the Track method and showing how much memory each line is responsible for:
(pprof) list Track
Total: 414.77MB
ROUTINE ======================== main.(*RequestTracker).Track in /vagrant_data/go/src/github.com/dm03514/grokking-go/simple-memory-leak/main.go
    1.28MB   410.25MB (flat, cum) 98.91% of Total
         .          .     19:
         .          .     20:func (rt *RequestTracker) Track(req *http.Request) {
         .          .     21:	rt.mu.Lock()
         .          .     22:	defer rt.mu.Unlock()
         .          .     23:	// alloc 10KB for each track
    1.28MB   410.25MB     24:	rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))
         .          .     25:}
         .          .     26:
         .          .     27:var (
         .          .     28:	requests RequestTracker
         .          .     29:
This directly pinpoints the culprit:
    1.28MB   410.25MB     24:	rt.requests = append(rt.requests, bytes.Repeat([]byte("a"), 10000))
pprof can also generate a visual report of the textual information above:
(pprof) svg
Generating report in profile003.svg
This clearly shows the current objects occupying the process memory. Now that we have the culprit, Track, we can verify that it is allocating memory to a global variable without ever cleaning it up, and fix the root issue.
Resolution: Memory was being continually allocated to a global variable on each HTTP request and never released.
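One hedged way to fix this particular pattern (an illustration of the idea, not necessarily how the post’s service was ultimately changed) is to stop retaining the per-request payload entirely and keep only bounded state, for example a counter:

package tracking

import (
	"net/http"
	"sync"
)

// RequestTracker keeps only constant-size state per request instead of
// retaining a 10KB payload forever (illustrative fix, not the repo's code).
type RequestTracker struct {
	mu    sync.Mutex
	count uint64
}

func (rt *RequestTracker) Track(req *http.Request) {
	rt.mu.Lock()
	defer rt.mu.Unlock()
	rt.count++ // memory stays flat regardless of request volume
}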
I hope that this post was able to illustrate the power of visually identifying memory leaks and a loose process for progressively narrowing down their source. Finally, I hope that it touched on the power of pprof for Go memory introspection and analysis. As always, I would love any feedback. Thank you for reading.