【译】 Golang 中的垃圾回收:Go Traces（二）

最新推荐文章于 2023-06-17 17:25:02 发布

shankusu2017

最新推荐文章于 2023-06-17 17:25:02 发布

阅读量249

点赞数

分类专栏： golang 文章标签： go gc

原文链接：https://www.ardanlabs.com/blog/2019/05/garbage-collection-in-go-part2-gctraces.html

版权

golang 专栏收录该内容

116 篇文章 10 订阅

订阅专栏

以下内容转载自 https://juejin.cn/post/6844903892212269064 https://www.ardanlabs.com/blog/2019/05/garbage-collection-in-go-part2-gctraces.html

这系列的文章如下

第一部分，我花了时间去描述了golang 垃圾收集器的行为，并说明了程序运行时候收集器所产生的延迟。我分享了怎么样去产生以及去解释GC traces。并展示了堆内内存如何改变，并且解释了GC不同阶段对延迟代价的影响。

最后的结论是，如果你减少了堆的压力，你就会减少延迟代价并提高了程序性能。我也指出了，降低收集速率，推迟收集，设法增加两次收集的时间间隔并不是很好的策略。即使是收集速率很快，但是以一致的速度进行，也能保持程序以最佳性能运行。

这部分里，我会带你通过一个实际的web应用并向你展示怎么样生成GC traces和程序状态。然后我会解释这些工具的输出内容，这样你可以找到方法去提高应用程序的性能。

运行应用

看一下go training的web应用程序

图1.1

github.com/ardanlabs/g…

图一是应用程序的样子。这个程序从不同的地方下载三个rss 订阅集合，并且允许用户去进行搜索。build之后，启动程序

$ go build
$ GOGC=off ./project > /dev/null
复制代码

L1中我们设置了GOGC变量为off，也就是关闭垃圾回收。日志重定向到/dev/null。随着程序运行，我们可以向server中发送请求了。

$ hey -m POST -c 100 -n 10000 "http://localhost:5000/search?term=topic&cnn=on&bbc=on&nyt=on"
复制代码

L2展示了通过hey使用100个连接发送了10k的请求。所有请求发送到服务端，这个过程结果产生如下结果

图1.2

图2.2给出了关闭GC处理requests的可视化的过程。10k请求的处理大概花费了4188ms，也就是服务端每秒处理大概2387次请求。

打开GC

如果开启GC的话会怎么样

$ GODEBUG=gctrace=1 ./project > /dev/null
复制代码

L3中GOGC移除了，使用了GODEBUG变量，我们可以看到GC traces。 GODEBUG设置之后runtime就会在每次收集的时候生成GC trace。现在我们再次跑同样10k的请求。所有请求发送到服务端，我们可以看到GC traces和hey提供的信息。

$ GODEBUG=gctrace=1 ./project > /dev/null
gc 3 @3.182s 0%: 0.015+0.59+0.096 ms clock, 0.19+0.10/1.3/3.0+1.1 ms cpu, 4->4->2 MB, 5 MB goal, 12 P
.
.
.
gc 2553 @8.452s 14%: 0.004+0.33+0.051 ms clock, 0.056+0.12/0.56/0.94+0.61 ms cpu, 4->4->2 MB, 5 MB goal, 12 P
复制代码

L4中展示了从程序运行开始的第三次和最后一次的GC trace。我没展示前两次的collection因为在这些收集发生之后，负载信息才会通过服务发送。最后一次collection表明了处理请求一共产生了2551次收集（没算前两个）

下面是我对每次trace进行了拆分

gc 2553 @8.452s 14%: 0.004+0.33+0.051 ms clock, 0.056+0.12/0.56/0.94+0.61 ms cpu, 4->4->2 MB, 5 MB goal, 12 P

gc 2553     : The 2553 GC runs since the program started
@8.452s     : Eight seconds since the program started
14%         : Fourteen percent of the available CPU so far has been spent in GC

// wall-clock
0.004ms     : STW        : Write-Barrier - Wait for all Ps to reach a GC safe-point.
0.33ms      : Concurrent : Marking
0.051ms     : STW        : Mark Term     - Write Barrier off and clean up.

// CPU time
0.056ms     : STW        : Write-Barrier
0.12ms      : Concurrent : Mark - Assist Time (GC performed in line with allocation)
0.56ms      : Concurrent : Mark - Background GC time
0.94ms      : Concurrent : Mark - Idle GC time
0.61ms      : STW        : Mark Term

4MB         : Heap memory in-use before the Marking started
4MB         : Heap memory in-use after the Marking finished
2MB         : Heap memory marked as live after the Marking finished
5MB         : Collection goal for heap memory in-use after Marking finished

// Threads
12P         : Number of logical processors or threads used to run Goroutines.
复制代码

L5展示了最后一次collection的实际数字。多亏hey,下面是我们看到的运行时的性能结果。

Requests            : 10,000
------------------------------------------------------
Requests/sec        : 1,882 r/s   - Hey
Total Duration      : 5,311ms     - Hey
Percent Time in GC  : 14%         - GC Trace
Total Collections   : 2,551       - GC Trace
------------------------------------------------------
Total GC Duration   : 744.54ms    - (5,311ms * .14)
Average Pace of GC  : ~2.08ms     - (5,311ms / 2,551)
Requests/Collection : ~3.98 r/gc  - (10,000 / 2,511)
复制代码

L6给出了结果。下面给出了过程的可视化，来看一下发生了什么。

图1.3

图1.3给出了可视化的发生过程。开启回收器之后，相同10k的请求，它必须去处理大概~2.5k次collection。每次collection平均处理速度大概是2ms，处理这些collection增加了额外大概1.1秒的延迟。

图1.4

图1.4对比了两种情况下应用程序的运行情况。

减少分配

我们最好是可以看看堆内存的分配情况，来看看是否有可以去掉的non-productive的分配

go tool pprof http://localhost:5000/debug/pprof/allocs
复制代码

L7使用了pprof去调用/debug/pprof/allocs接口，然后取从运行程序里拉取内存分配信息。这个接口的存在是因为用了下面的代码。

import _ "net/http/pprof"

go func() {
    http.ListenAndServe("localhost:5000", http.DefaultServeMux)
}()
复制代码

导入net/http/pprof去绑定系统的默认服务。然后使用http.ListenAndServer携带http.DefaultServerMux去开启服务。

(pprof) top 6 -cum
Showing nodes accounting for 0.56GB, 5.84% of 9.56GB total
Dropped 80 nodes (cum <= 0.05GB)
Showing top 6 nodes out of 51
      flat  flat%   sum%        cum   cum%
         0     0%     0%     4.96GB 51.90%  net/http.(*conn).serve
    0.49GB  5.11%  5.11%     4.93GB 51.55%  project/service.handler
         0     0%  5.11%     4.93GB 51.55%  net/http.(*ServeMux).ServeHTTP
         0     0%  5.11%     4.93GB 51.55%  net/http.HandlerFunc.ServeHTTP
         0     0%  5.11%     4.93GB 51.55%  net/http.serverHandler.ServeHTTP
    0.07GB  0.73%  5.84%     4.55GB 47.63%  project/search.rssSearch
复制代码

L9的底下出现了rssSearch的方法。这个方法分配了4.55GB的内存。接下来使用list去看一下这个方法的内部细节

L10

(pprof) list rssSearch
Total: 9.56GB
ROUTINE ======================== project/search.rssSearch in project/search/rss.go
   71.53MB     4.55GB (flat, cum) 47.63% of Total


         .          .    117:	// Capture the data we need for our results if we find ...
         .          .    118:	for _, item := range d.Channel.Items {
         .     4.48GB    119:		if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {
   48.53MB    48.53MB    120:			results = append(results, Result{
         .          .    121:				Engine:  engine,
         .          .    122:				Title:   item.Title,
         .          .    123:				Link:    item.Link,
         .          .    124:				Content: item.Description,
         .          .    125:			})
复制代码

L10列出了代码，第119行出现了大量的内存分配

L11

         .     4.48GB    119:		if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {
复制代码

L11 给出了出现问题的代码。整个函数分配了4.55GB的数据，但这一行就分配了4.48GB。接下来看一下这一行代码做了什么，然后看看有哪里可以优化的地方。

L12

117 // Capture the data we need for our results if we find the search term.
118 for _, item := range d.Channel.Items {
119     if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {
120         results = append(results, Result{
121             Engine:  engine,
122             Title:   item.Title,
123             Link:    item.Link,
124             Content: item.Description,
125        })
126    }
127 }
复制代码

L12列出了tight loop里的代码。调用strings.ToLower会产生内存分配，因为需要创造新的strings，而且它们需要分配到堆内存上。在循环里调用strings.ToLower根本没有必要，因为完全可以在循环外面进行处理。

119行代码可以做一下调整来去掉不必要的内存分配。

L13

// Before the code change.
if strings.Contains(strings.ToLower(item.Description), strings.ToLower(term)) {

// After the code change.
if strings.Contains(item.Description, term) {
复制代码

注意：在把item塞进缓存之前已经把item.Description进行lower处理了，这部分代码没列出来。新塞进来item会每15分钟缓存一次。把term进行lower处理直接在循环外面进行就可以了

L13里给出了我们是怎么把strings.ToLower从tight loop中拿掉的。重新编译项目然后运行，然后再次拿10k个requests去请求。

L14

$ go build
$ GODEBUG=gctrace=1 ./project > /dev/null
gc 3 @6.156s 0%: 0.011+0.72+0.068 ms clock, 0.13+0.21/1.5/3.2+0.82 ms cpu, 4->4->2 MB, 5 MB goal, 12 P
.
.
.
gc 1404 @8.808s 7%: 0.005+0.54+0.059 ms clock, 0.060+0.47/0.79/0.25+0.71 ms cpu, 4->5->2 MB, 5 MB goal, 12 P
复制代码

L14中，在调整了代码之后，相同10K请求下，现在只有1402次的collections次数了。

L15

With Extra Allocations              Without Extra Allocations
======================================================================
Requests            : 10,000        Requests            : 10,000
----------------------------------------------------------------------
Requests/sec        : 1,882 r/s     Requests/sec        : 3,631 r/s
Total Duration      : 5,311ms       Total Duration      : 2,753 ms
Percent Time in GC  : 14%           Percent Time in GC  : 7%
Total Collections   : 2,551         Total Collections   : 1,402
----------------------------------------------------------------------
Total GC Duration   : 744.54ms      Total GC Duration   : 192.71 ms
Average Pace of GC  : ~2.08ms       Average Pace of GC  : ~1.96ms
Requests/Collection : ~3.98 r/gc    Requests/Collection : 7.13 r/gc
复制代码

L15展示了和上一次结果的对比。下面提供更加清晰的可视化图例来看看发生了什么。

图1.5

图1.5展示了对比图。这一次收集器运行少了1149次（1402 vs 2551）在相同10k请求情况下。GC的时间比率从14%降到了7%。collection节省了%74的时间，这让应用程序处理速度提高了48%。

图1.6

学习到什么

在上一篇文章中提到，提升收集器的效率在于减少堆内存的压力。记住，压力可以定义为：在给定时间内，应用程序在可用堆内存上的分配有多快。当压力减少，收集器产生的延时就会减少。这些延迟会拖慢你的应用程序。

L16

With Extra Allocations              Without Extra Allocations
======================================================================
Requests            : 10,000        Requests            : 10,000
----------------------------------------------------------------------
Requests/sec        : 1,882 r/s     Requests/sec        : 3,631 r/s
Total Duration      : 5,311ms       Total Duration      : 2,753 ms
Percent Time in GC  : 14%           Percent Time in GC  : 7%
Total Collections   : 2,551         Total Collections   : 1,402
----------------------------------------------------------------------
Total GC Duration   : 744.54ms      Total GC Duration   : 192.71 ms
Average Pace of GC  : ~2.08ms       Average Pace of GC  : ~1.96ms
Requests/Collection : ~3.98 r/gc    Requests/Collection : 7.13 r/gc
复制代码

L16是上面两个带GC版本的程序运行对比。很明显移除了4.48GB的不必要分配使程序运行更快。有趣的事情是，两个版本里，每次垃圾收集的平均速度时间几乎是相同的，大概都是在2ms左右。我们改变的事情是在于每次收集之间做了更多的work处理。应用程序从3.98r/gc提升到7.13r/gc(请求次数/gc)。工作量提高了79.1%。

在两次收集之间处理更多work会帮助减少收集的次数，可以看到gc次数从2551减少到了1402，大概45%的降低。应用程序的gc时间减少了74%，从745ms减少到193ms，也就是总时间的14%减少到7%。如果你运行关闭gc的优化版本代码，性能差异只有13%，应用程序处理时间从2753ms降到2398ms。