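Installing and Using Valgrind
- Installing Valgrind
- Memory problems Valgrind can detect
- Memory checking with Valgrind Memcheck
- Cachegrind: a cache and branch-prediction profiler
- Callgrind + gprof2dot + Graphviz: generating graphical performance data
- Thread checking with Helgrind
- Profiling heap and stack usage with Massif
- gprof + gprof2dot + Graphviz: generating graphical performance data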
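Installing Valgrind
(1) wget http://www.valgrind.org/downloads/valgrind-3.11.0.tar.bz2   # download the source package
(2) bzip2 -d valgrind-3.11.0.tar.bz2
(3) tar xvf valgrind-3.11.0.tar
(4) Change into the extracted valgrind-3.11.0 directory and run the following commands as the superuser:
sudo ./configure
sudo make
sudo make install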
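(5) Configure the environment variables
Change to the /etc/profile.d directory and, as the superuser, create the file valgrind.sh
with the following content:
#!/bin/sh
VALGRIND_ROOT=/home/Lyndon/valgrind-3.11.0
VALGRIND_INCLUDE=/usr/local/include/valgrind
VALGRIND_LIB=/usr/local/lib/valgrind
export VALGRIND_ROOT VALGRIND_INCLUDE VALGRIND_LIB
Make valgrind.sh executable with sudo chmod +x valgrind.sh, then load it into the current shell with . ./valgrind.sh (running ./valgrind.sh on its own sets the variables only in a child shell, so they are lost when the script exits).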
Memory problems Valgrind can detect
Memory checking with Valgrind Memcheck
- Debugging a program running under Valgrind with gdb
valgrind --tool=memcheck --vgdb=yes --vgdb-error=0 ./prog
gdb ./prog (another shell)
(gdb) target remote | vgdb
(gdb) target remote | vgdb --pid=2479
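You can then debug your program in gdb while it runs under Valgrind. Use the --pid form when more than one Valgrind process is running.
- Making a service program exit under Valgrind without modifying its code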
valgrind --tool=memcheck --vgdb=yes --vgdb-error=0 ./prog
Let the program run normally. When it needs to exit:
gdb ./prog (another shell)
(gdb) target remote | vgdb
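Set a breakpoint on the last function the service executes before exiting,
then use gdb to force a return so that the program exits normally.
Running in the background: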
nohup valgrind --error-limit=no --suppressions=suppress ./prog >nohup.out &
- Watching memory leaks in real time with gdb + Valgrind
valgrind --vgdb=yes --vgdb-error=0 ./main
(gdb) target remote | vgdb
(gdb) monitor leak_check full reachable any   (send this command repeatedly to see up-to-date memory statistics)
-------------------------------
(gdb) monitor leak_check full reachable any
==13181== 1,777,869 bytes in 592,623 blocks are definitely lost in loss record 4 of 5
==13181== at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==13181== by 0x40060B: c (main.c:20)
==13181== by 0x400651: b (main.c:29)
==13181== by 0x400676: main (main.c:36)
==13181==
==13181== 8,999,980 bytes in 899,998 blocks are definitely lost in loss record 5 of 5
==13181== at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==13181== by 0x4005C9: a (main.c:11)
==13181== by 0x400647: b (main.c:28)
==13181== by 0x400676: main (main.c:36)
==13181==
==13181== LEAK SUMMARY:
==13181== definitely lost: 10,777,849 bytes in 1,492,621 blocks
==13181== indirectly lost: 0 bytes in 0 blocks
==13181== possibly lost: 10 bytes in 1 blocks
==13181== still reachable: 16 bytes in 3 blocks
==13181== suppressed: 0 bytes in 0 blocks
==13181==
----------------------------------------
You can also list the blocks of a single loss record with the command:
(gdb) monitor block_list 5
The argument is the loss record number, which appears in the leak_check output. For example, in
/*loss record 5 of 5*/
the first 5 is the loss record number. If necessary, you can inspect just
this record and ignore the other information.
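--trace-children=yes    trace child processes
--log-file=<filename> --xml-file=<filename>    names of the text-format and XML-format reports (--xml-file takes effect together with --xml=yes). A name such as %p_log.memcheck is recommended: Valgrind expands %p to the process ID, so that every process gets its own report.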
Cachegrind: a cache and branch-prediction profiler
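Cachegrind checks how efficiently your application uses the CPU's multi-level caches. By analysing cache usage you can adjust the code and branch structure and retune modules to get the best cache performance. An L1 miss typically costs around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs in the region of 10 to 30 cycles. Detailed cache and branch profiling helps you understand how your program interacts with the machine and therefore how to make it faster. Results differ between CPU types. Example Cachegrind results follow:
- Get the overall cache statistics summary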
heyongxin@8d9f0352366a:~/test$ valgrind --tool=cachegrind ./main
==20205== Cachegrind, a cache and branch-prediction profiler
==20205== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==20205== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==20205== Command: ./main
==20205==
--20205-- warning: L3 cache found, using its data for the LL simulation.
main() function()
==20205== brk segment overflow in thread #1: can't grow to 0x4a43000
==20205== (see section Limitations in user manual)
==20205== NOTE: further instances of this message will not be shown
------call a ()------
-----call c ()------
---- call b()------
exit success
==20205==
==20205== I refs: 401,581,319
==20205== I1 misses: 790
==20205== LLi misses: 784
==20205== I1 miss rate: 0.00%
==20205== LLi miss rate: 0.00%
==20205==
==20205== D refs: 126,063,974 (66,646,287 rd + 59,417,687 wr)
==20205== D1 misses: 904,531 ( 3,431 rd + 901,100 wr)
==20205== LLd misses: 902,558 ( 1,982 rd + 900,576 wr)
==20205== D1 miss rate: 0.7% ( 0.0% + 1.5% )
==20205== LLd miss rate: 0.7% ( 0.0% + 1.5% )
==20205==
==20205== LL refs: 905,321 ( 4,221 rd + 901,100 wr)
==20205== LL misses: 903,342 ( 2,766 rd + 900,576 wr)
==20205== LL miss rate: 0.2% ( 0.0% + 1.5% )
- Get the global and function-level counts
The function-by-function counts are more useful to look at, as they pinpoint which functions are causing large numbers of counts. However, beware that inlining can make these counts misleading. If a function f is always inlined, counts will be attributed to the functions it is inlined into, rather than itself. However, if you look at the line-by-line annotations for f you’ll see the counts that belong to f. (This is hard to avoid, it’s how the debug info is structured.) So it’s worth looking for large numbers in the line-by-line annotations.
heyongxin@8d9f0352366a:~/test$ cg_annotate cachegrind.out.20205
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 4-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 4194304 B, 64 B, 16-way associative
Command: ./main
Data file: cachegrind.out.20205
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
401,581,319 790 784 66,646,287 3,431 1,982 59,417,687 901,100 900,576 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
266,399,455 20 20 34,200,219 3 2 41,399,894 900,024 899,915 ???:_int_malloc
72,000,070 5 5 21,600,006 4 0 5,400,002 0 0 ???:malloc
28,800,000 2 2 1,800,000 0 0 5,400,000 0 0 ???:usleep
9,000,014 1 1 900,003 2 1 3,600,003 1 0 /home/heyongxin/test/main.c:a
9,000,014 0 0 900,003 0 0 3,600,003 0 0 /home/heyongxin/test/main.c:c
9,000,000 0 0 1,800,000 0 0 0 0 0 ???:__nanosleep_nocancel
3,600,191 21 17 3,600,091 6 3 27 1 1 ???:???
3,600,000 1 1 1,800,000 0 0 0 0 0 ???:nanosleep
- Get the Line-by-line Counts
The line-by-line source code annotations are much more useful. In our experience, the best place to start is by looking at the Ir numbers. They simply measure how many instructions were executed for each line, and don’t include any cache information, but they can still be very useful for identifying bottlenecks.
heyongxin@8d9f0352366a:~/test$ cg_annotate --auto=yes cachegrind.out.20205 main.c
--------------------------------------------------------------------------------
I1 cache: 32768 B, 64 B, 4-way associative
D1 cache: 32768 B, 64 B, 8-way associative
LL cache: 4194304 B, 64 B, 16-way associative
Command: ./main
Data file: cachegrind.out.20205
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds: 0.1 100 100 100 100 100 100 100 100
Include dirs:
User annotated: main.c
Auto-annotation: on
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
401,581,319 790 784 66,646,287 3,431 1,982 59,417,687 901,100 900,576 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw file:function
--------------------------------------------------------------------------------
266,399,455 20 20 34,200,219 3 2 41,399,894 900,024 899,915 ???:_int_malloc
72,000,070 5 5 21,600,006 4 0 5,400,002 0 0 ???:malloc
28,800,000 2 2 1,800,000 0 0 5,400,000 0 0 ???:usleep
9,000,014 1 1 900,003 2 1 3,600,003 1 0 /home/heyongxin/test/main.c:a
9,000,014 0 0 900,003 0 0 3,600,003 0 0 /home/heyongxin/test/main.c:c
9,000,000 0 0 1,800,000 0 0 0 0 0 ???:__nanosleep_nocancel
3,600,191 21 17 3,600,091 6 3 27 1 1 ???:???
3,600,000 1 1 1,800,000 0 0 0 0 0 ???:nanosleep
--------------------------------------------------------------------------------
-- Auto-annotated source: /home/heyongxin/test/main.c
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
. . . . . . . . . #include <stdio.h>
. . . . . . . . . #include <unistd.h>
. . . . . . . . . #include <stdlib.h>
. . . . . . . . .
. . . . . . . . . int i=900000;
. . . . . . . . . int j=900000;
. . . . . . . . . void a()
3 0 0 0 0 0 1 0 0 {
4,500,006 1 1 900,001 2 1 900,001 0 0 while(j--)
. . . . . . . . . {
2,700,000 0 0 0 0 0 1,800,000 1 0 char* p=(char*)malloc(10);
1,800,000 0 0 0 0 0 900,000 0 0 usleep(100);
. . . . . . . . . }
2 0 0 0 0 0 1 0 0 printf("------call a ()------\n");
3 0 0 2 0 0 0 0 0 }
. . . . . . . . . void c()
3 0 0 0 0 0 1 0 0 {
4,500,006 0 0 900,001 0 0 900,001 0 0 while(i--)
. . . . . . . . . {
2,700,000 0 0 0 0 0 1,800,000 0 0 char* p=(char*)malloc(3);
1,800,000 0 0 0 0 0 900,000 0 0 usleep(100);
. . . . . . . . . }
2 0 0 0 0 0 1 0 0 printf("-----call c ()------\n");
3 0 0 2 0 0 0 0 0 }
. . . . . . . . .
. . . . . . . . . void b()
2 1 1 0 0 0 1 0 0 {
2 0 0 0 0 0 1 0 0 a();
2 0 0 0 0 0 1 0 0 c();
2 0 0 0 0 0 1 0 0 printf("---- call b()------\n");
3 0 0 2 0 0 0 0 0 }
. . . . . . . . .
. . . . . . . . . int main(void)
2 1 1 0 0 0 1 0 0 {
2 0 0 0 0 0 1 0 0 printf(" main() function() \n");
2 0 0 0 0 0 1 0 0 b();
2 0 0 0 0 0 1 0 0 printf("exit success \n");
1 0 0 0 0 0 0 0 0 return 0;
2 0 0 2 0 0 0 0 0 }
--------------------------------------------------------------------------------
-- User-annotated source: main.c
--------------------------------------------------------------------------------
No information has been collected for main.c
--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
--------------------------------------------------------------------------------
4 0 0 3 0 0 12 0 0 percentage of events annotated
Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data. A single TLB could be provided for access to both instructions and data, or a separate instruction TLB (ITLB) and data TLB (DTLB) can be provided. The data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.). However, the TLB cache is part of the memory management unit (MMU) and not directly related to the CPU caches.
Cachegrind collects the following statistics (the abbreviation for each is given in parentheses):
• I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr).
• D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr).
• D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
• Conditional branches executed (Bc) and conditional branches mispredicted (Bcm).
• Indirect branches executed (Bi) and indirect branches mispredicted (Bim).
Note that the total number of D1 misses is given by D1mr + D1mw, and the total number of LL misses is given by ILmr + DLmr + DLmw. These sums match the "D1 misses" and "LL misses" lines of the summary output shown earlier.
Callgrind + gprof2dot + Graphviz: generating graphical performance data
callgrind_annotate:
Reads in the profile data and prints a sorted list of functions, optionally with source annotation.
callgrind_control:
This command enables you to interactively observe and control the status of a program currently running under Callgrind’s control, without stopping the program. You can get statistics information as well as the current stacktrace, and you can request zeroing of counters or dumping of profile data.
Note:
Callgrind’s ability to detect function calls and returns depends on the instruction set of the platform it is run on. It works best on x86 and amd64, and unfortunately currently does not work so well on PowerPC, ARM, Thumb or MIPS code. This is because there are no explicit call or return instructions in these instruction sets, so Callgrind has to rely on heuristics to detect calls and returns.
- Tool usage
valgrind --tool=callgrind ./test
callgrind_annotate callgrind.out.85095
python ../../gprof2dot-2017.9.19/gprof2dot.py -f callgrind \
callgrind.out.85095 | dot -Tsvg -o report.svg
callgrind_annotate callgrind.out.85095 (shell output)
heyongxin@linux-oc89:~/test/callgrind> callgrind_annotate callgrind.out.85095
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.85095' (creator: callgrind-3.13.0)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 10893861
Trigger: Program termination
Profiled target: ./test (PID 85095, part 1)
Events recorded: Ir
Events shown: Ir
Event sort order: Ir
Thresholds: 99
Include dirs:
User annotated:
Auto-annotation: off
--------------------------------------------------------------------------------
Ir
--------------------------------------------------------------------------------
36,368,828 PROGRAM TOTALS
--------------------------------------------------------------------------------
Ir file:function
--------------------------------------------------------------------------------
19,200,000 ???:usleep [/lib64/libc-2.11.3.so]
6,000,000 ???:__nanosleep_nocancel [/lib64/libc-2.11.3.so]
2,400,000 ???:nanosleep [/lib64/libc-2.11.3.so]
2,130,000 test.c:test_f32 [/home/heyongxin/test/callgrind/test]
2,130,000 test.c:test_f31 [/home/heyongxin/test/callgrind/test]
1,420,000 test.c:test_f21 [/home/heyongxin/test/callgrind/test]
1,420,000 test.c:test_f22 [/home/heyongxin/test/callgrind/test]
710,004 test.c:test_f11 [/home/heyongxin/test/callgrind/test]
710,000 test.c:test_f12 [/home/heyongxin/test/callgrind/test]
- Generated result
- Result generated by kcachegrind
heyongxin@linux-oc89:~/jTTS-6.3.0/bin> callgrind_control -e -b
PID 90634: ./jTTSService4.exe
sending command status internal to pid 90634
Totals: Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim
Th 1 56,004,130 15,619,132 7,833,614 497,213 157,949 20,610 6,029 23,490 6,043 6,516,248 302,957 1,048,341 316,495
Th 2 17,107,736 5,123,871 3,619,998 309,867 40,150 14,542 3,271 1,296 2,578 2,136,839 100,493 528,451 209,473
Th 3 157,598,060 51,245,720 33,457,528 2,227,975 577,022 233,265 59,651 15,633 9,613 16,950,557 760,041 5,091,547 1,488,914
Th 4 325,601,799 97,720,521 74,782,937 976,076 8,793,776 10,419,767 120,247 5,362,104 9,474,172 35,181,904 1,252,680 2,424,142 828,478
Frame: Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim Backtrace for Thread 1
[ 0] 5,162 952 678 235 139 0 66 65 0 1,088 170 . . nanosleep (138 x)
[ 1] 7,322 1,083 1,086 308 139 0 97 65 0 1,084 168 . . usleep (136 x)
[ 2] 45,227,568 13,295,408 7,136,954 495,154 85,525 14,784 4,060 2,379 1,796 5,054,186 242,086 1,035,385 315,614 main (1 x)
[ 3] 45,238,849 13,298,687 7,138,535 495,218 85,615 14,813 4,118 2,388 1,818 5,055,966 242,232 1,035,463 315,644 (below main) (1 x)
[ 4] 45,241,424 13,299,456 7,138,758 495,221 85,634 14,813 4,121 2,390 1,818 5,056,433 242,267 1,035,466 315,646 0x0000000000420e60 (1 x)
[ 5] . . . . . . . . . . . . . 0x0000000000000b00
Frame: Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw Bc Bcm Bi Bim Backtrace for Thread 2
[ 0] 4,699 843 603 414 259 0 2 58 0 964 267 . . select (121 x)
[ 1] 17,105,240 5,123,076 3,619,734 309,812 40,045 14,525 3,265 1,295 2,564 2,136,382 100,438 528,446 209,469 ListenThreadProc(void*) (1 x)
[ 2] 17,107,727 5,123,869 3,619,997 309,865 40,149 14,542 3,270 1,296 2,578 2,136,836 100,491 528,450 209,472 start_thread (3 x)
[ 3] . . . . . . . . . . . . . clone
Thread checking with Helgrind
1. Helgrind only checks programs whose threads are POSIX threads (pthreads).
2. It detects potential deadlocks arising from lock ordering problems.
3. It detects data races: accessing memory without adequate locking or synchronisation.
heyongxin@linux-oc89:~/test/deadlock> valgrind --tool=helgrind ./deadlock
==98724== Helgrind, a thread error detector
==98724== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==98724== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==98724== Command: ./deadlock
==98724==
before in thread1
before in thread 2
^C==98724==
==98724== Process terminating with default action of signal 2 (SIGINT)
==98724== at 0x5306F95: pthread_join (in /lib64/libpthread-2.11.3.so)
==98724== by 0x4C2AF45: pthread_join_WRK (hg_intercepts.c:553)
==98724== by 0x4C2B01D: pthread_join (hg_intercepts.c:572)
==98724== by 0x400A40: main (deadlock.c:64)
==98724== ---Thread-Announcement------------------------------------------
==98724==
==98724== Thread #2 was created
==98724== at 0x55F5D2E: clone (in /lib64/libc-2.11.3.so)
==98724== by 0x5305950: do_clone (in /lib64/libpthread-2.11.3.so)
==98724== by 0x5305F37: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.11.3.so)
==98724== by 0x4C32AFF: pthread_create_WRK (hg_intercepts.c:427)
==98724== by 0x4C32C87: pthread_create@* (hg_intercepts.c:460)
==98724== by 0x400A0C: main (deadlock.c:60)
==98724==
==98724== ----------------------------------------------------------------
==98724==
==98724== Thread #2: Exiting thread still holds 1 lock
==98724== at 0x530D294: __lll_lock_wait (in /lib64/libpthread-2.11.3.so)
==98724== by 0x5308618: _L_lock_1008 (in /lib64/libpthread-2.11.3.so)
==98724== by 0x530842C: pthread_mutex_lock (in /lib64/libpthread-2.11.3.so)
==98724== by 0x4C2B4A5: mutex_lock_WRK (hg_intercepts.c:902)
==98724== by 0x4C2B58F: pthread_mutex_lock (hg_intercepts.c:925)
==98724== by 0x400842: print (deadlock.c:9)
==98724== by 0x4008D8: thread1 (deadlock.c:28)
==98724== by 0x4C32D10: mythread_wrapper (hg_intercepts.c:389)
==98724== by 0x53067B5: start_thread (in /lib64/libpthread-2.11.3.so)
==98724==
==98724==
==98724== For counts of detected and suppressed errors, rerun with: -v
==98724== Use --history-level=approx or =none to gain increased speed, at
==98724== the cost of reduced accuracy of conflicting-access information
==98724== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 41 from 22)