Valgrind + gprof: Installation and Use for Performance Analysis

Latest recommended article published on 2024-08-15 09:49:39

dodonei

Categories: C\C++, Performance Optimization, CC++. Tags: Valgrind, memory leak, performance optimization, gprof

Article link: https://blog.csdn.net/dodonei/article/details/79806931

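Installing and Using Valgrind

Valgrind installation
Memory problems Valgrind can detect
Memory checking with valgrind memcheck
Cachegrind: a cache and branch-prediction profiler
Callgrind + gprof2dot + graphviz: generating graphical performance data
Thread checking with Helgrind
Profiling the heap and stack with Massif
gprof + gprof2dot + graphviz: generating graphical performance data

Valgrind installation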

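(1) wget http://www.valgrind.org/downloads/valgrind-3.11.0.tar.bz2 # download the source tarball
(2) bzip2 -d valgrind-3.11.0.tar.bz2
(3) tar xvf valgrind-3.11.0.tar
(4) In the extracted valgrind-3.11.0 directory, run the following commands (strictly, only make install needs root privileges):
    sudo ./configure
    sudo make
    sudo make install
(5) Configure the environment variables
    Change to the /etc/profile.d directory and, as root, create the file valgrind.sh
    with the following content:
    #!/bin/sh
    VALGRIND_ROOT=/home/Lyndon/valgrind-3.11.0
    VALGRIND_INCLUDE=/usr/local/include/valgrind
    VALGRIND_LIB=/usr/local/lib/valgrind
    export VALGRIND_ROOT VALGRIND_INCLUDE VALGRIND_LIB
Make valgrind.sh executable with sudo chmod +x valgrind.sh, then load it into the current shell with . ./valgrind.sh (running ./valgrind.sh directly sets the variables only in a child process, so they are lost as soon as the script exits).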

Memory problems Valgrind can detect

Memory checking with valgrind memcheck

Debugging a program running under Valgrind with gdb

valgrind --tool=memcheck --vgdb=yes --vgdb-error=0 ./prog
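gdb ./prog    (in another shell)
(gdb) target remote | vgdb
(gdb) target remote | vgdb --pid=2479    (use this form instead when several Valgrind processes are running; 2479 is the PID of the target)
Then you can debug your program with gdb while it runs under Valgrind.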

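Making a service program exit under Valgrind without modifying its code

valgrind --tool=memcheck --vgdb=yes --vgdb-error=0 ./prog
Let the program run normally. When it needs to exit:
gdb ./prog    (in another shell)
(gdb) target remote | vgdb
Set a breakpoint in the last function the service executes before it exits.
Then force a return through gdb so that the program exits normally.
Running in the background: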
nohup valgrind --error-limit=no --suppressions=suppress ./prog >nohup.out &

Viewing memory leaks in real time with gdb + Valgrind

valgrind --vgdb=yes --vgdb-error=0 ./main
(gdb) target remote | vgdb
(gdb) monitor leak_check full reachable any    (send this command repeatedly to see up-to-date leak statistics)
-------------------------------
(gdb) monitor leak_check full reachable any
==13181== 1,777,869 bytes in 592,623 blocks are definitely lost in loss record 4 of 5
==13181==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==13181==    by 0x40060B: c (main.c:20)
==13181==    by 0x400651: b (main.c:29)
==13181==    by 0x400676: main (main.c:36)
==13181== 
==13181== 8,999,980 bytes in 899,998 blocks are definitely lost in loss record 5 of 5
==13181==    at 0x4C29C23: malloc (vg_replace_malloc.c:299)
==13181==    by 0x4005C9: a (main.c:11)
==13181==    by 0x400647: b (main.c:28)
==13181==    by 0x400676: main (main.c:36)
==13181== 
==13181== LEAK SUMMARY:
==13181==    definitely lost: 10,777,849 bytes in 1,492,621 blocks
==13181==    indirectly lost: 0 bytes in 0 blocks
==13181==      possibly lost: 10 bytes in 1 blocks
==13181==    still reachable: 16 bytes in 3 blocks
==13181==         suppressed: 0 bytes in 0 blocks
==13181==
----------------------------------------
You can also list only the blocks that belong to a single loss record with the command:
(gdb) monitor block_list 7
The argument is the loss record number, which appears in the leak report. In
"loss record 5 of 5" above, the first 5 is the record number, so pass 5 to inspect
that record. If necessary, you can look at just this one record and ignore
the rest of the report.

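--trace-children=yes    Trace child processes. Valgrind always follows the child of a fork; this option makes it also follow programs started via exec.
--log-file=<filename> --xml-file=<filename>    Names of the text report and the XML report (the XML report also requires --xml=yes). A name such as %p_log.memcheck is recommended: Valgrind expands %p to the process ID, so every process gets its own report.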

Cachegrind: a cache and branch-prediction profiler

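Cachegrind checks how efficiently your application uses the CPU's multi-level caches. By analysing cache usage you can adjust the code structure and branch structure and re-tune modules so that cache performance is as good as possible. On a modern machine an L1 miss typically costs around 10 cycles, an LL miss can cost as much as 200 cycles, and a mispredicted branch costs in the region of 10 to 30 cycles. Detailed cache and branch profiling can help you understand how your program interacts with the machine and therefore how to make it faster; different CPU types will give different results. An example Cachegrind run is shown below: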

Getting the overall cache statistics summary

heyongxin@8d9f0352366a:~/test$ valgrind --tool=cachegrind ./main
==20205== Cachegrind, a cache and branch-prediction profiler
==20205== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==20205== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==20205== Command: ./main
==20205== 
--20205-- warning: L3 cache found, using its data for the LL simulation.
 main() function() 
==20205== brk segment overflow in thread #1: can't grow to 0x4a43000
==20205== (see section Limitations in user manual)
==20205== NOTE: further instances of this message will not be shown
------call a ()------
-----call c ()------
---- call b()------
exit success 
==20205== 
==20205== I   refs:      401,581,319
==20205== I1  misses:            790
==20205== LLi misses:            784
==20205== I1  miss rate:        0.00%
==20205== LLi miss rate:        0.00%
==20205== 
==20205== D   refs:      126,063,974  (66,646,287 rd   + 59,417,687 wr)
==20205== D1  misses:        904,531  (     3,431 rd   +    901,100 wr)
==20205== LLd misses:        902,558  (     1,982 rd   +    900,576 wr)
==20205== D1  miss rate:         0.7% (       0.0%     +        1.5%  )
==20205== LLd miss rate:         0.7% (       0.0%     +        1.5%  )
==20205== 
==20205== LL refs:           905,321  (     4,221 rd   +    901,100 wr)
==20205== LL misses:         903,342  (     2,766 rd   +    900,576 wr)
==20205== LL miss rate:          0.2% (       0.0%     +        1.5%  )

Getting the global and function-level counts
The function-by-function counts are more useful to look at, as they pinpoint which functions are causing large numbers of counts. However, beware that inlining can make these counts misleading. If a function f is always inlined, counts will be attributed to the functions it is inlined into, rather than to f itself. However, if you look at the line-by-line annotations for f you'll see the counts that belong to f. (This is hard to avoid; it's how the debug info is structured.) So it's worth looking for large numbers in the line-by-line annotations.

heyongxin@8d9f0352366a:~/test$ cg_annotate cachegrind.out.20205 
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 4-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         4194304 B, 64 B, 16-way associative
Command:          ./main
Data file:        cachegrind.out.20205
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr  D1mr  DLmr         Dw    D1mw    DLmw 
--------------------------------------------------------------------------------
401,581,319  790  784 66,646,287 3,431 1,982 59,417,687 901,100 900,576  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr  D1mr DLmr         Dw    D1mw    DLmw  file:function
--------------------------------------------------------------------------------
266,399,455   20   20 34,200,219     3    2 41,399,894 900,024 899,915  ???:_int_malloc
 72,000,070    5    5 21,600,006     4    0  5,400,002       0       0  ???:malloc
 28,800,000    2    2  1,800,000     0    0  5,400,000       0       0  ???:usleep
  9,000,014    1    1    900,003     2    1  3,600,003       1       0  /home/heyongxin/test/main.c:a
  9,000,014    0    0    900,003     0    0  3,600,003       0       0  /home/heyongxin/test/main.c:c
  9,000,000    0    0  1,800,000     0    0          0       0       0  ???:__nanosleep_nocancel
  3,600,191   21   17  3,600,091     6    3         27       1       1  ???:???
  3,600,000    1    1  1,800,000     0    0          0       0       0  ???:nanosleep

Getting the line-by-line counts
The line-by-line source code annotations are much more useful. In our experience, the best place to start is by looking at the Ir numbers. They simply measure how many instructions were executed for each line, and don't include any cache information, but they can still be very useful for identifying bottlenecks.

heyongxin@8d9f0352366a:~/test$ cg_annotate --auto=yes cachegrind.out.20205 main.c
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 4-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         4194304 B, 64 B, 16-way associative
Command:          ./main
Data file:        cachegrind.out.20205
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Thresholds:       0.1 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   main.c
Auto-annotation:  on

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr  D1mr  DLmr         Dw    D1mw    DLmw 
--------------------------------------------------------------------------------
401,581,319  790  784 66,646,287 3,431 1,982 59,417,687 901,100 900,576  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir I1mr ILmr         Dr  D1mr DLmr         Dw    D1mw    DLmw  file:function
--------------------------------------------------------------------------------
266,399,455   20   20 34,200,219     3    2 41,399,894 900,024 899,915  ???:_int_malloc
 72,000,070    5    5 21,600,006     4    0  5,400,002       0       0  ???:malloc
 28,800,000    2    2  1,800,000     0    0  5,400,000       0       0  ???:usleep
  9,000,014    1    1    900,003     2    1  3,600,003       1       0  /home/heyongxin/test/main.c:a
  9,000,014    0    0    900,003     0    0  3,600,003       0       0  /home/heyongxin/test/main.c:c
  9,000,000    0    0  1,800,000     0    0          0       0       0  ???:__nanosleep_nocancel
  3,600,191   21   17  3,600,091     6    3         27       1       1  ???:???
  3,600,000    1    1  1,800,000     0    0          0       0       0  ???:nanosleep

--------------------------------------------------------------------------------
-- Auto-annotated source: /home/heyongxin/test/main.c
--------------------------------------------------------------------------------
       Ir I1mr ILmr      Dr D1mr DLmr        Dw D1mw DLmw 

        .    .    .       .    .    .         .    .    .  #include <stdio.h>
        .    .    .       .    .    .         .    .    .  #include <unistd.h>
        .    .    .       .    .    .         .    .    .  #include <stdlib.h>
        .    .    .       .    .    .         .    .    .  
        .    .    .       .    .    .         .    .    .  int i=900000;
        .    .    .       .    .    .         .    .    .  int j=900000;
        .    .    .       .    .    .         .    .    .  void a()
        3    0    0       0    0    0         1    0    0  {
4,500,006    1    1 900,001    2    1   900,001    0    0   while(j--)
        .    .    .       .    .    .         .    .    .   {
2,700,000    0    0       0    0    0 1,800,000    1    0       char* p=(char*)malloc(10);
1,800,000    0    0       0    0    0   900,000    0    0       usleep(100);
        .    .    .       .    .    .         .    .    .   }
        2    0    0       0    0    0         1    0    0   printf("------call a ()------\n");
        3    0    0       2    0    0         0    0    0  }
        .    .    .       .    .    .         .    .    .  void c()
        3    0    0       0    0    0         1    0    0  {
4,500,006    0    0 900,001    0    0   900,001    0    0   while(i--)
        .    .    .       .    .    .         .    .    .   {   
2,700,000    0    0       0    0    0 1,800,000    0    0       char* p=(char*)malloc(3);
1,800,000    0    0       0    0    0   900,000    0    0       usleep(100);
        .    .    .       .    .    .         .    .    .   }
        2    0    0       0    0    0         1    0    0   printf("-----call c ()------\n");
        3    0    0       2    0    0         0    0    0  }
        .    .    .       .    .    .         .    .    .  
        .    .    .       .    .    .         .    .    .  void b()
        2    1    1       0    0    0         1    0    0  {
        2    0    0       0    0    0         1    0    0   a();
        2    0    0       0    0    0         1    0    0   c();
        2    0    0       0    0    0         1    0    0   printf("---- call b()------\n");
        3    0    0       2    0    0         0    0    0  }
        .    .    .       .    .    .         .    .    .  
        .    .    .       .    .    .         .    .    .  int main(void)
        2    1    1       0    0    0         1    0    0  {
        2    0    0       0    0    0         1    0    0   printf(" main() function() \n");
        2    0    0       0    0    0         1    0    0   b();
        2    0    0       0    0    0         1    0    0   printf("exit success \n");
        1    0    0       0    0    0         0    0    0   return 0;
        2    0    0       2    0    0         0    0    0  }

--------------------------------------------------------------------------------
-- User-annotated source: main.c
--------------------------------------------------------------------------------
  No information has been collected for main.c

--------------------------------------------------------------------------------
Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw 
--------------------------------------------------------------------------------
 4    0    0  3    0    0 12    0    0  percentage of events annotated

Most modern desktop and server CPUs have at least three independent caches: an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation lookaside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data. A single TLB could be provided for access to both instructions and data, or a separate Instruction TLB (ITLB) and data TLB (DTLB) can be provided.[4] The data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.; see also multi-level caches below). However, the TLB cache is part of the memory management unit (MMU) and not directly related to the CPU caches.
• I cache reads (Ir, which equals the number of instructions executed), I1 cache read misses (I1mr) and LL cache instruction read misses (ILmr).
• D cache reads (Dr, which equals the number of memory reads), D1 cache read misses (D1mr), and LL cache data read misses (DLmr).
• D cache writes (Dw, which equals the number of memory writes), D1 cache write misses (D1mw), and LL cache data write misses (DLmw).
• Conditional branches executed (Bc) and conditional branches mispredicted (Bcm); collected only with --branch-sim=yes.
• Indirect branches executed (Bi) and indirect branches mispredicted (Bim); collected only with --branch-sim=yes.
Note that the total number of D1 misses is given by D1mr + D1mw, and the total number of LL misses by ILmr + DLmr + DLmw. In the summary above this gives 3,431 + 901,100 = 904,531 D1 misses and 784 + 1,982 + 900,576 = 903,342 LL misses.

Callgrind + gprof2dot + graphviz: generating graphical performance data

callgrind_annotate:
Reads in the profile data and prints a sorted list of functions, optionally with source annotation.
callgrind_control:
This command enables you to interactively observe and control the status of a program currently running under Callgrind's control, without stopping the program. You can get statistics information as well as the current stack trace, and you can request zeroing of counters or dumping of profile data.
Note:
Callgrind's ability to detect function calls and returns depends on the instruction set of the platform it is run on. It works best on x86 and amd64, and unfortunately currently does not work so well on PowerPC, ARM, Thumb or MIPS code. This is because there are no explicit call or return instructions in these instruction sets, so Callgrind has to rely on heuristics to detect calls and returns.

Tool usage

valgrind --tool=callgrind ./test
callgrind_annotate callgrind.out.85095
python ../../gprof2dot-2017.9.19/gprof2dot.py  -f callgrind \
callgrind.out.85095 | dot -Tsvg -o report.svg

callgrind_annotate callgrind.out.85095 (shell output)
heyongxin@linux-oc89:~/test/callgrind> callgrind_annotate callgrind.out.85095 
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.85095' (creator: callgrind-3.13.0)
--------------------------------------------------------------------------------
I1 cache: 
D1 cache: 
LL cache: 
Timerange: Basic block 0 - 10893861
Trigger: Program termination
Profiled target:  ./test (PID 85095, part 1)
Events recorded:  Ir
Events shown:     Ir
Event sort order: Ir
Thresholds:       99
Include dirs:     
User annotated:   
Auto-annotation:  off

--------------------------------------------------------------------------------
        Ir 
--------------------------------------------------------------------------------
36,368,828  PROGRAM TOTALS

--------------------------------------------------------------------------------
        Ir  file:function
--------------------------------------------------------------------------------
19,200,000  ???:usleep [/lib64/libc-2.11.3.so]
 6,000,000  ???:__nanosleep_nocancel [/lib64/libc-2.11.3.so]
 2,400,000  ???:nanosleep [/lib64/libc-2.11.3.so]
 2,130,000  test.c:test_f32 [/home/heyongxin/test/callgrind/test]
 2,130,000  test.c:test_f31 [/home/heyongxin/test/callgrind/test]
 1,420,000  test.c:test_f21 [/home/heyongxin/test/callgrind/test]
 1,420,000  test.c:test_f22 [/home/heyongxin/test/callgrind/test]
   710,004  test.c:test_f11 [/home/heyongxin/test/callgrind/test]
   710,000  test.c:test_f12 [/home/heyongxin/test/callgrind/test]

Generated result
Result generated by kcachegrind

heyongxin@linux-oc89:~/jTTS-6.3.0/bin> callgrind_control -e -b
PID 90634: ./jTTSService4.exe
sending command status internal to pid 90634

  Totals:         Ir         Dr         Dw      I1mr      D1mr       D1mw    ILmr      DLmr      DLmw         Bc       Bcm        Bi       Bim 
   Th 1   56,004,130 15,619,132  7,833,614   497,213   157,949     20,610   6,029    23,490     6,043  6,516,248   302,957 1,048,341   316,495 
   Th 2   17,107,736  5,123,871  3,619,998   309,867    40,150     14,542   3,271     1,296     2,578  2,136,839   100,493   528,451   209,473 
   Th 3  157,598,060 51,245,720 33,457,528 2,227,975   577,022    233,265  59,651    15,633     9,613 16,950,557   760,041 5,091,547 1,488,914 
   Th 4  325,601,799 97,720,521 74,782,937   976,076 8,793,776 10,419,767 120,247 5,362,104 9,474,172 35,181,904 1,252,680 2,424,142   828,478 

  Frame:          Ir         Dr         Dw      I1mr      D1mr       D1mw    ILmr      DLmr      DLmw         Bc       Bcm        Bi       Bim Backtrace for Thread 1
   [ 0]        5,162        952        678       235       139          0      66        65         0      1,088       170         .         . nanosleep (138 x)
   [ 1]        7,322      1,083      1,086       308       139          0      97        65         0      1,084       168         .         . usleep (136 x)
   [ 2]   45,227,568 13,295,408  7,136,954   495,154    85,525     14,784   4,060     2,379     1,796  5,054,186   242,086 1,035,385   315,614 main (1 x)
   [ 3]   45,238,849 13,298,687  7,138,535   495,218    85,615     14,813   4,118     2,388     1,818  5,055,966   242,232 1,035,463   315,644 (below main) (1 x)
   [ 4]   45,241,424 13,299,456  7,138,758   495,221    85,634     14,813   4,121     2,390     1,818  5,056,433   242,267 1,035,466   315,646 0x0000000000420e60 (1 x)
   [ 5]            .          .          .         .         .          .       .         .         .          .         .         .         . 0x0000000000000b00


  Frame:          Ir         Dr         Dw      I1mr      D1mr       D1mw    ILmr      DLmr      DLmw         Bc       Bcm        Bi       Bim Backtrace for Thread 2
   [ 0]        4,699        843        603       414       259          0       2        58         0        964       267         .         . select (121 x)
   [ 1]   17,105,240  5,123,076  3,619,734   309,812    40,045     14,525   3,265     1,295     2,564  2,136,382   100,438   528,446   209,469 ListenThreadProc(void*) (1 x)
   [ 2]   17,107,727  5,123,869  3,619,997   309,865    40,149     14,542   3,270     1,296     2,578  2,136,836   100,491   528,450   209,472 start_thread (3 x)
   [ 3]            .          .          .         .         .          .       .         .         .          .         .         .         . clone

Thread checking with Helgrind

1. Helgrind only checks programs whose threads are POSIX threads (pthreads).
2. It detects potential deadlocks arising from lock ordering problems.
3. It detects data races: accessing memory without adequate locking or synchronisation.

heyongxin@linux-oc89:~/test/deadlock> valgrind --tool=helgrind ./deadlock
==98724== Helgrind, a thread error detector
==98724== Copyright (C) 2007-2017, and GNU GPL'd, by OpenWorks LLP et al.
==98724== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==98724== Command: ./deadlock
==98724== 
before in thread1
before in thread 2
^C==98724== 
==98724== Process terminating with default action of signal 2 (SIGINT)
==98724==    at 0x5306F95: pthread_join (in /lib64/libpthread-2.11.3.so)
==98724==    by 0x4C2AF45: pthread_join_WRK (hg_intercepts.c:553)
==98724==    by 0x4C2B01D: pthread_join (hg_intercepts.c:572)
==98724==    by 0x400A40: main (deadlock.c:64)
==98724== ---Thread-Announcement------------------------------------------
==98724== 
==98724== Thread #2 was created
==98724==    at 0x55F5D2E: clone (in /lib64/libc-2.11.3.so)
==98724==    by 0x5305950: do_clone (in /lib64/libpthread-2.11.3.so)
==98724==    by 0x5305F37: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.11.3.so)
==98724==    by 0x4C32AFF: pthread_create_WRK (hg_intercepts.c:427)
==98724==    by 0x4C32C87: pthread_create@* (hg_intercepts.c:460)
==98724==    by 0x400A0C: main (deadlock.c:60)
==98724== 
==98724== ----------------------------------------------------------------
==98724== 
==98724== Thread #2: Exiting thread still holds 1 lock
==98724==    at 0x530D294: __lll_lock_wait (in /lib64/libpthread-2.11.3.so)
==98724==    by 0x5308618: _L_lock_1008 (in /lib64/libpthread-2.11.3.so)
==98724==    by 0x530842C: pthread_mutex_lock (in /lib64/libpthread-2.11.3.so)
==98724==    by 0x4C2B4A5: mutex_lock_WRK (hg_intercepts.c:902)
==98724==    by 0x4C2B58F: pthread_mutex_lock (hg_intercepts.c:925)
==98724==    by 0x400842: print (deadlock.c:9)
==98724==    by 0x4008D8: thread1 (deadlock.c:28)
==98724==    by 0x4C32D10: mythread_wrapper (hg_intercepts.c:389)
==98724==    by 0x53067B5: start_thread (in /lib64/libpthread-2.11.3.so)
==98724== 
==98724== 
==98724== For counts of detected and suppressed errors, rerun with: -v
==98724== Use --history-level=approx or =none to gain increased speed, at
==98724== the cost of reduced accuracy of conflicting-access information
==98724== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 41 from 22)