The Go pprof Tool

The pprof tool

What is pprof

pprof is a tool for analyzing and visualizing performance data.

pprof reads a collection of profiling samples in profile.proto format and generates reports to visualize and help analyze the data. It can produce both text and graphical reports.

profile.proto is a protocol buffer that describes a set of call stacks and symbolization information. A common use is to represent sampled call stacks from statistical profiling.

Profiles can be read from a local file or over HTTP. Multiple profiles can be merged or compared.

Using pprof

pprof lives in two places in the Go standard library:

  • runtime/pprof (a minimal usage sketch follows below)
  • net/http/pprof
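
For completeness, here is a minimal sketch of the runtime/pprof variant, which the rest of this article does not use. It writes a CPU profile to a local file (the file name cpu.prof is just an example) that can later be opened with go tool pprof cpu.prof.

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Create a file to hold the CPU profile; the name is arbitrary.
	f, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Start sampling and stop when main returns.
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// ... the code you want to profile goes here ...
}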

Using net/http/pprof

Implement a web service with the following code:

package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

// counter appends 100000 integers to a zero-length slice, forcing the
// runtime to repeatedly grow the backing array as the slice expands.
func counter() {
	slice := make([]int, 0)
	for i := 0; i < 100000; i++ {
		slice = append(slice, i)
	}
}

func httpGet(w http.ResponseWriter, r *http.Request) {
	counter()
}

func main() {
	http.HandleFunc("/get", httpGet)
	http.ListenAndServe("localhost:8000", nil)
}

Then build and run it, and start profiling:

$ go build netmain.go
$ ./netmain &

$ go tool pprof http://localhost:8000/debug/pprof/heap
Fetching profile over HTTP from http://localhost:8000/debug/pprof/heap
Saved profile in /Users/zhaoyong/pprof/pprof.alloc_objects.alloc_space.inuse_objects.inuse_space.005.pb.gz
Type: inuse_space
Time: Jul 26, 2021 at 7:14am (CST)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 902.59kB, 100% of 902.59kB total
Showing top 10 nodes out of 13
      flat  flat%   sum%        cum   cum%
  902.59kB   100%   100%   902.59kB   100%  compress/flate.NewWriter
         0     0%   100%   902.59kB   100%  compress/gzip.(*Writer).Write
         0     0%   100%   902.59kB   100%  net/http.(*ServeMux).ServeHTTP
         0     0%   100%   902.59kB   100%  net/http.(*conn).serve
         0     0%   100%   902.59kB   100%  net/http.HandlerFunc.ServeHTTP
         0     0%   100%   902.59kB   100%  net/http.serverHandler.ServeHTTP
         0     0%   100%   902.59kB   100%  net/http/pprof.Index
         0     0%   100%   902.59kB   100%  net/http/pprof.handler.ServeHTTP
         0     0%   100%   902.59kB   100%  runtime/pprof.(*Profile).WriteTo
         0     0%   100%   902.59kB   100%  runtime/pprof.(*profileBuilder).build
(pprof)

This shows the program's heap memory usage and allocation.

Go's built-in memory profiler lets us sample the memory usage of a running system. There are four related sample indexes:

  • inuse_objects: look at this when you suspect too many live objects are resident in memory
  • inuse_space: look at this when the application's RSS seems too large
  • alloc_objects: look at this when a large volume of historical allocations has driven up CPU or memory usage
  • alloc_space: look at this when memory usage has spiked at some point in the past

Each index can be selected with the corresponding flag; the following inspects alloc_space:

go tool pprof -alloc_space  http://localhost:8000/debug/pprof/heap
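
The other indexes are selected the same way; assuming the same server, the full set of heap sample-index flags looks like this:

go tool pprof -inuse_space   http://localhost:8000/debug/pprof/heap
go tool pprof -inuse_objects http://localhost:8000/debug/pprof/heap
go tool pprof -alloc_objects http://localhost:8000/debug/pprof/heap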

A few of the output fields deserve explanation:

top: one of pprof's interactive commands. By default it shows the top 10 entries; top 20 shows 20 rows.

flat/flat%: the time or memory attributed directly to the function itself, and its share of the total. For example, compress/flate.NewWriter's flat allocation is 902.59kB, which is 100% of the total shown above.

cum/cum%: the cumulative value for the function plus everything it calls, and its share of the total. For example, the cumulative share of compress/flate.NewWriter is 100%. (If f itself costs 1s and it calls g which costs 2s, then f's flat is 1s and its cum is 3s.)

Of course, you can also simply open http://127.0.0.1:8000/debug/pprof/ in a browser to get an overview and pick what to inspect:

/debug/pprof/

Types of profiles available:

Count  Profile
44     allocs
0      block
0      cmdline
7      goroutine
44     heap
0      mutex
0      profile
13     threadcreate
0      trace
       full goroutine stack dump

Profile Descriptions:

allocs: A sampling of all past memory allocations
block: Stack traces that led to blocking on synchronization primitives
cmdline: The command line invocation of the current program
goroutine: Stack traces of all current goroutines
heap: A sampling of memory allocations of live objects. You can specify the gc GET parameter to run GC before taking the heap sample.
mutex: Stack traces of holders of contended mutexes
profile: CPU profile. You can specify the duration in the seconds GET parameter. After you get the profile file, use the go tool pprof command to investigate the profile.
threadcreate: Stack traces that led to the creation of new OS threads
trace: A trace of execution of the current program. You can specify the duration in the seconds GET parameter. After you get the trace file, use the go tool trace command to investigate the trace.

From here you can pick exactly the profile you care about.
Selecting profile on this page downloads a CPU profile, which can then be analyzed with the pprof tool; running the web command inside pprof generates a call graph of the running program, which is much more intuitive.
Before viewing the graphical output you need to install graphviz. On macOS:

brew install graphviz

Once that is done, typing the web command at the pprof prompt generates an SVG call graph.
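For reference, a typical sequence looks like this (the 30-second sampling window is just an example):

$ go tool pprof http://localhost:8000/debug/pprof/profile?seconds=30
(pprof) web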

Simulating Concurrent Load

Here we use the wrk tool to generate concurrent load (an install note follows the flag list below):
wrk -c20 -t20 -d30s http://localhost:8000/get

  • -c20: use 20 concurrent connections
  • -t20: use 20 threads to issue requests
  • -d30s: run the test for 30 seconds
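
If wrk is not already installed, on macOS it can typically be installed with Homebrew (this assumes a Homebrew setup, matching the graphviz step above):

brew install wrk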

Here are the baseline statistics, which we can compare against later:

Running 30s test @ http://localhost:8000/get
  20 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.36ms    3.62ms  42.07ms   65.63%
    Req/Sec   197.45     32.92   330.00     68.80%
  118208 requests in 30.07s, 8.45MB read
Requests/sec:   3931.53
Transfer/sec:    287.95KB

Download the profile file in the browser and analyze it:
Type: cpu
Time: Jul 26, 2021 at 8:54am (CST)
Duration: 30.03s, Total samples = 34.10s (113.54%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top20
Showing nodes accounting for 30.83s, 90.41% of 34.10s total
Dropped 245 nodes (cum <= 0.17s)
Showing top 20 nodes out of 119
      flat  flat%   sum%        cum   cum%
    10.93s 32.05% 32.05%     10.93s 32.05%  runtime.memmove
     4.19s 12.29% 44.34%      4.23s 12.40%  syscall.syscall
     2.61s  7.65% 51.99%      2.61s  7.65%  runtime.usleep
     1.89s  5.54% 57.54%      1.89s  5.54%  runtime.memclrNoHeapPointers
     1.65s  4.84% 62.38%      1.66s  4.87%  runtime.kevent
     1.53s  4.49% 66.86%      1.53s  4.49%  runtime.pthread_kill
     1.45s  4.25% 71.11%     17.68s 51.85%  main.counter
     1.37s  4.02% 75.13%      1.37s  4.02%  runtime.asyncPreempt
     1.23s  3.61% 78.74%      1.23s  3.61%


Now let's look at runtime.memmove, the biggest entry. From the call graph produced by web, the memmove calls come mainly from the counter function: appending to a zero-length slice means the runtime keeps allocating a larger backing array and moving the existing elements into it, which produces a large amount of memmove work. That is where we start optimizing.
Here we change the slice's initial allocation to 100000:

slice := make([]int, 100000)
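
As an aside not covered in the original experiment: make([]int, 100000) sets the length to 100000 (zeroing every element), so the appends in counter still grow the slice beyond 100000 elements. If the goal were only to reserve capacity up front, the line would look like this instead:

slice := make([]int, 0, 100000) // length 0, capacity 100000: append fills it without reallocating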

After running the load test again, the numbers look like this:

Running 30s test @ http://localhost:8000/get
  20 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.72ms    4.97ms  50.64ms   78.91%
    Req/Sec   161.15     33.73   280.00     70.83%
  96590 requests in 30.10s, 6.91MB read
Requests/sec:   3209.47
Transfer/sec:    235.07KB

The profile data:

Showing top 10 nodes out of 99
      flat  flat%   sum%        cum   cum%
    51.81s 43.81% 43.81%     51.81s 43.81%  runtime.memmove
    13.29s 11.24% 55.05%     13.29s 11.24%  runtime.memclrNoHeapPointers
    13.04s 11.03% 66.07%     13.19s 11.15%  syscall.syscall
     5.60s  4.74% 70.81%      5.60s  4.74%  runtime.pthread_kill
     4.98s  4.21% 75.02%      4.99s  4.22%  runtime.asyncPreempt
     4.68s  3.96% 78.98%      4.68s  3.96%  runtime.pthread_cond_wait
     4.59s  3.88% 82.86%      4.59s  3.88%  runtime.usleep
     3.76s  3.18% 86.04%     77.21s 65.29%  main.counter
     3.26s  2.76% 88.80%      3.27s  2.77%  runtime.kevent
     2.61s  2.21% 91.00%      2.61s  2.21%  runtime.pthread_cond_signal

Overall throughput dropped. The profile shows that memmove increased sharply; the cause is probably related to GC and system memory allocation. Shrinking the initial slice allocation to 1000 immediately brings throughput back up.
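Concretely, the allocation in counter becomes:

slice := make([]int, 1000)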

Running 30s test @ http://localhost:8000/get
  20 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.14ms    3.46ms  32.02ms   64.85%
    Req/Sec   205.56     33.36   373.00     63.98%
  123046 requests in 30.07s, 8.80MB read
Requests/sec:   4092.56
Transfer/sec:    299.75KB

Next, let's check what happens with garbage collection turned off. Disable GC manually:

// Disable GC (requires importing "runtime/debug")
debug.SetGCPercent(-1)

With GC disabled, throughput rises slightly, but the latency tail gets worse (the max latency roughly doubles):

Running 30s test @ http://localhost:8000/get
  20 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.89ms    3.77ms  62.83ms   80.45%
    Req/Sec   223.38     42.17   370.00     68.62%
  133731 requests in 30.07s, 9.57MB read
Requests/sec:   4447.47
Transfer/sec:    325.74KB

After several rounds of testing, the sweet spot for the initial allocation is around 1000.
With the size set to 1000:
Running 30s test @ http://localhost:8000/get
  20 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.15ms    3.48ms  37.80ms   65.03%
    Req/Sec   205.09     32.57   340.00     65.03%
  122758 requests in 30.07s, 8.78MB read
Requests/sec:   4083.08
Transfer/sec:    299.05KB

Code: https://github.com/zyong/go.programming/tree/main/pprof

References:

  • https://mp.weixin.qq.com/s/szSAAyWqK3Bz2NxVxJfwXw
  • https://github.com/google/pprof/blob/master/doc/README.md#interpreting-the-callgraph
  • https://morsmachine.dk/http2-causalprof
  • https://zhuanlan.zhihu.com/p/141640004