Valgrind underground

5. Let's Go Deeper

Valgrind simulates an Intel x86 processor and runs our test program on this synthetic processor. The synthetic processor and the real one are not exactly the same. Valgrind is compiled into a shared object, valgrind.so. A shell script, valgrind, sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library into any subsequently executed dynamically-linked ELF binary, which permits the program to be debugged.

The dynamic linker calls the initialization function of Valgrind. The synthetic CPU then takes control from the real CPU. There may be other .so files in memory, and the dynamic linker calls the initialization function of each of them. The dynamic linker then calls the main of the loaded program. When main returns, the synthetic CPU calls the finalization function of valgrind.so. During the execution of the finalization function, a summary of all detected errors is printed and memory leaks are checked. When the finalization function exits, control passes back from the synthetic CPU to the real one.
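The start-up and shut-down order described above rests on a general feature of ELF shared objects: code that runs before main and code that runs after it. The following is a minimal sketch of that mechanism using the GCC/Clang constructor and destructor attributes. It is not Valgrind's code, and tool_init, tool_fini, tool_state and tool_is_running are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical state kept by the preloaded library. */
enum { TOOL_NOT_LOADED, TOOL_RUNNING, TOOL_FINISHED };
static int tool_state = TOOL_NOT_LOADED;

/* Runs before main(): called while the object is being loaded.
   The attribute is a GCC/Clang extension, not standard C. */
__attribute__((constructor))
static void tool_init(void)
{
    tool_state = TOOL_RUNNING;
    fprintf(stderr, "tool: init, taking control\n");
}

/* Runs after main() returns or exit() is called. */
__attribute__((destructor))
static void tool_fini(void)
{
    assert(tool_state == TOOL_RUNNING);   /* fini must follow init */
    tool_state = TOOL_FINISHED;
    fprintf(stderr, "tool: fini, printing error summary\n");
}

/* Lets the program ask whether the init function has already run. */
int tool_is_running(void)
{
    return tool_state == TOOL_RUNNING;
}
```

If a main function calls tool_is_running(), it should already see the running state, because the constructor has completed by the time main starts. The closing message appears only after main has returned.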


5.1. How Valgrind Tracks Validity of Each Byte

For every byte processed, the synthetic processor maintains 9 bits: 8 'V' bits and 1 'A' bit. The 'V' bits indicate the validity of the 8 bits in the byte, and the 'A' bit indicates the validity of the byte's address. The valid-value (V) bits are checked only in two situations:


when data is used for address generation,


when a control flow decision is to be made.


In either of these situations, if the data is found to be undefined, an error report is generated. No error report is generated while merely copying or adding undefined data.
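The rule above (stay silent on copy and add, report only at the point of use) can be pictured with a toy model. This is a sketch under simplifying assumptions, not Valgrind's implementation: shadow_byte, shadow_copy, shadow_add, shadow_branch_if_nonzero and error_count are hypothetical names, and the add rule is deliberately coarse.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* A toy shadowed byte: the data plus one 'V' bit per data bit.
   In this sketch a V bit of 1 means "defined". */
typedef struct {
    uint8_t value;
    uint8_t vbits;
} shadow_byte;

int error_count = 0;   /* number of reports issued so far */

/* Copying moves the V bits along with the data and never reports. */
shadow_byte shadow_copy(shadow_byte src)
{
    return src;
}

/* Adding also stays silent. Simplification: if any input bit is
   undefined, every bit of the result is marked undefined. */
shadow_byte shadow_add(shadow_byte a, shadow_byte b)
{
    shadow_byte r;
    r.value = (uint8_t)(a.value + b.value);
    r.vbits = (a.vbits == 0xFF && b.vbits == 0xFF) ? 0xFF : 0x00;
    return r;
}

/* A control flow decision is one of the two places where the
   V bits are actually checked. */
int shadow_branch_if_nonzero(shadow_byte cond)
{
    if (cond.vbits != 0xFF) {
        error_count++;
        fprintf(stderr, "report: branch depends on undefined value\n");
    }
    return cond.value != 0;
}

/* Self-checking walk-through of the rules above. */
void shadow_demo(void)
{
    shadow_byte garbage = { 0x5A, 0x00 };   /* uninitialized: no V bits set */
    shadow_byte one = { 1, 0xFF };          /* fully defined constant */
    shadow_byte copy = shadow_copy(garbage);
    shadow_byte sum = shadow_add(copy, one);

    assert(error_count == 0);               /* copy and add are silent */
    (void)shadow_branch_if_nonzero(sum);
    assert(error_count == 1);               /* the branch is reported */
}
```

Calling shadow_demo() once leaves error_count at 1: the undefined byte travels through the copy and the addition unnoticed and is reported only when a branch depends on it.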

However, the case with floating-point data is different. During a floating-point read instruction the 'V' bits corresponding to the data are checked. Thus copying an uninitialized value produces an error in the case of floating-point numbers.

#include <stdlib.h>

int main()
{
        int *p, *a;
        p = malloc(10*sizeof(int));
        a = malloc(10*sizeof(int));
        a[3] = p[3];
        free(a);
        free(p);
        return 0;
}

/* produces no errors */

#include <stdlib.h>

int main()
{
        float *p, *a;
        p = malloc(10*sizeof(float));
        a = malloc(10*sizeof(float));
        a[3] = p[3];
        free(a);
        free(p);
        return 0;
}

/* produces error */

All bytes that are in memory but not in the CPU have an associated valid-address (A) bit, which indicates whether the corresponding memory location is accessible by the program. When a program starts, the 'A' bits corresponding to each global variable are set. When a call to malloc, new or any other memory-allocating function is made, the 'A' bits corresponding to the allocated bytes are set. Upon freeing the allocated block using free, delete or delete[], the corresponding 'A' bits are cleared. During a system call the 'A' bits are changed appropriately.

When values are loaded from memory, Valgrind checks the 'A' bit corresponding to each byte. If the 'A' bit of a byte is set, its 'V' bits are checked. If the 'V' bits are not set, an error is generated and the 'V' bits are then set to indicate validity; this avoids a long chain of errors. If the 'A' bit corresponding to a loaded byte is 0, its 'V' bits are forced to be set, despite the value being invalid.
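The bookkeeping for 'A' bits can be sketched with a toy address space. The code below models only the invalid-address path described above (report the access, then force the loaded value to look valid). It is an illustration under assumptions, not Valgrind's data structure, and every name in it (toy_abit, toy_vbits, toy_mark_allocated, toy_mark_freed, toy_load_vbits) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE 64            /* size of the toy address space */

uint8_t toy_abit[TOY_MEM_SIZE];    /* 1 = address may be accessed */
uint8_t toy_vbits[TOY_MEM_SIZE];   /* 0xFF = all 8 value bits valid */
int toy_addr_errors = 0;           /* invalid-address reports so far */

/* Allocation makes the bytes addressable but leaves them undefined. */
void toy_mark_allocated(size_t start, size_t len)
{
    assert(start <= TOY_MEM_SIZE && len <= TOY_MEM_SIZE - start);
    memset(toy_abit + start, 1, len);
    memset(toy_vbits + start, 0x00, len);
}

/* Freeing clears the 'A' bits again. */
void toy_mark_freed(size_t start, size_t len)
{
    assert(start <= TOY_MEM_SIZE && len <= TOY_MEM_SIZE - start);
    memset(toy_abit + start, 0, len);
}

/* Load one byte and return the V bits that the loaded value carries. */
uint8_t toy_load_vbits(size_t addr)
{
    assert(addr < TOY_MEM_SIZE);
    if (!toy_abit[addr]) {
        toy_addr_errors++;   /* report the invalid read ...            */
        return 0xFF;         /* ... and force the value to look valid  */
    }
    return toy_vbits[addr];  /* addressable: pass the stored V bits on */
}
```

With a 20-byte block marked as allocated at offset 0, a load at offset 20 increments toy_addr_errors and returns 0xFF, so a later use of that value raises no further undefined-value report. A load at offset 0 returns 0x00 and raises no address error.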

Have a look at the following program and run it.

#include <stdlib.h>

int main()
{
        int *p, j;
        p = malloc(5*sizeof(int));
        j = p[5];
        if (p[5] == 1)
                j = p[5]+1;
        free(p);
        return 0;
}

Here two errors occur. Both are due to accessing p[5], the location sizeof(int)*5 bytes past the start of the block, which is not allocated to the program. During the execution of j = p[5], since that address is invalid, the 'V' bits of the 4 bytes starting at that location are forced to be set. Therefore no uninitialized-value error is reported, either during the execution of j = p[5] or during the execution of if (p[5] == 1).


5.2. Cache Profiling

Modern x86 machines use two levels of caching, L1 and L2. L1 is a split cache that consists of an instruction cache (I1) and a data cache (D1). L2 is a unified cache.

The configuration of a cache means its size, associativity and number of lines. If the data requested by the processor is present in the upper level, the request is called a hit. If the data is not found in the upper level, the request is called a miss, and the lower level in the hierarchy is then accessed to retrieve the block containing the requested data. In modern machines L1 is searched first for the data or instruction requested by the processor. If it is a hit, the data or instruction is copied to a register in the processor. Otherwise L2 is searched. If that is a hit, the data or instruction is copied to L1 and from there to a register. If the request to L2 is also a miss, main memory has to be accessed.
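The lookup order described above can be sketched as a small simulator. This is a toy model under stated assumptions, not the simulator Valgrind uses: it assumes a 16384-byte, 4-way set-associative cache with 32-byte lines and LRU replacement, it gives both levels the same geometry only to keep the code short, and toy_cache, toy_cache_init, toy_cache_access and toy_lookup are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Geometry assumed for this sketch: 16384 B, 32 B lines, 4-way. */
#define LINE_SIZE  32u
#define WAYS       4u
#define CACHE_SIZE 16384u
#define SETS       (CACHE_SIZE / (LINE_SIZE * WAYS))   /* 128 sets */

typedef struct {
    uint64_t tag[SETS][WAYS];
    uint8_t  valid[SETS][WAYS];
    uint64_t last_used[SETS][WAYS];   /* timestamp for LRU replacement */
    uint64_t clock;
    uint64_t refs, misses;
} toy_cache;

void toy_cache_init(toy_cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Returns 1 on a hit and 0 on a miss. On a miss the line is brought in,
   replacing an empty way or the least recently used way of its set. */
int toy_cache_access(toy_cache *c, uint64_t addr)
{
    uint64_t line = addr / LINE_SIZE;
    unsigned set = (unsigned)(line % SETS);
    uint64_t tag = line / SETS;
    unsigned victim = 0;

    c->refs++;
    c->clock++;
    for (unsigned w = 0; w < WAYS; w++) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->last_used[set][w] = c->clock;
            return 1;
        }
        /* Empty ways have timestamp 0, so they are chosen first. */
        if (c->last_used[set][w] < c->last_used[set][victim])
            victim = w;
    }
    c->misses++;
    c->valid[set][victim] = 1;
    c->tag[set][victim] = tag;
    c->last_used[set][victim] = c->clock;
    return 0;
}

/* Search L1 first, then L2. Returns the level that served the
   request: 1, 2, or 0 when main memory had to be accessed. */
int toy_lookup(toy_cache *l1, toy_cache *l2, uint64_t addr)
{
    if (toy_cache_access(l1, addr))
        return 1;
    if (toy_cache_access(l2, addr))
        return 2;
    return 0;
}
```

A first access to an address misses in both levels and is served by memory. A second access that falls within the same 32-byte line is an L1 hit and never reaches L2, which is why L2 sees far fewer references than L1.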

Valgrind can simulate the cache, meaning it can display what happens in the cache while a program is running. For this, first compile your program with the -g option as usual. Then use the shell script cachegrind instead of valgrind.

Sample output:

==7436== I1  refs:      12,841
==7436== I1  misses:       238
==7436== L2i misses:       237
==7436== I1  miss rate:   1.85%
==7436== L2i miss rate:   1.84%
==7436==
==7436== D   refs:       5,914  (4,626 rd + 1,288 wr)
==7436== D1  misses:       357  (  324 rd +    33 wr)
==7436== L2d misses:       352  (  319 rd +    33 wr)
==7436== D1  miss rate:    6.0% (  7.0%   +   2.5%  )
==7436== L2d miss rate:    5.9% (  6.8%   +   2.5%  )
==7436==
==7436== L2 refs:          595  (  562 rd +    33 wr)
==7436== L2 misses:        589  (  556 rd +    33 wr)
==7436== L2 miss rate:     3.1% (  3.1%   +   2.5%  )

L2i misses is the number of instruction misses that occur in the L2 cache.

L2d misses is the number of data misses that occur in the L2 cache.

Total number of data references = number of reads + number of writes.

Miss rate is the fraction of references that miss, that is, that are not found in the cache level in question.
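The rates in the sample output can be reproduced from the counts. The helper below is a minimal sketch and miss_rate_percent is a hypothetical name. One observation, offered as a reading of the sample numbers and not as a documented rule: the overall L2 miss rate of 3.1% matches 589 misses divided by all instruction and data references (12,841 + 5,914), not 589 divided by the 595 L2 refs.

```c
#include <assert.h>

/* Miss rate in percent: misses divided by references. */
double miss_rate_percent(unsigned long misses, unsigned long refs)
{
    assert(refs > 0);   /* a rate is undefined without references */
    return 100.0 * (double)misses / (double)refs;
}
```

For example, miss_rate_percent(238, 12841) gives about 1.85 for I1, miss_rate_percent(357, 5914) gives about 6.0 for D1, and miss_rate_percent(589, 12841 + 5914) gives about 3.1 for L2.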

The shell script cachegrind also produces a file, cachegrind.out, that contains line-by-line cache profiling information in a form that is not human-readable. The program vg_annotate interprets this information. If vg_annotate is run without any arguments, it reads the file cachegrind.out and produces human-readable output.

When C, C++ or assembly source files are passed as input to vg_annotate, it displays the number of cache reads, writes, misses, etc. for each source line.

I1 cache:         16384 B, 32 B, 4-way associative
D1 cache:         16384 B, 32 B, 4-way associative
L2 cache:         262144 B, 32 B, 8-way associative
Command:          ./a.out
Events recorded:  Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Events shown:     Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Event sort order: Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
Thresholds:       99 0 0 0 0 0 0 0 0
Include dirs:
User annotated:   valg_flo.c
Auto-annotation:  off

User-annotated source: valg_flo.c:

Ir I1mr I2mr Dr D1mr D2mr Dw D1mw D2mw
 .    .    .  .    .    .  .    .    .  #include<stdlib.h>
 .    .    .  .    .    .  .    .    .  int main()
 3    1    1  .    .    .  1    0    0  {
 .    .    .  .    .    .  .    .    .          float *p, *a;
 6    1    1  .    .    .  3    0    0          p = malloc(10*sizeof(float));
 6    0    0  .    .    .  3    0    0          a = malloc(10*sizeof(float));
 6    1    1  3    1    1  1    1    1          a[3] = p[3];
 4    0    0  1    0    0  1    0    0          free(a);
 4    0    0  1    0    0  1    0    0          free(p);
 2    0    0  2    0    0  .    .    .  }


Ir = Total instruction cache reads.


I1mr = I1 cache read misses.


I2mr = L2 cache instruction read misses.

