Linux debugging and performance tuning
LTT: Linux Trace Toolkit, for a summary
date;ps;date
The time command can be used to measure the actual (elapsed) running time of a program.
gettimeofday() returns the current time in seconds and microseconds.
A profile shows how much time is spent in each subroutine or function.
gprof
▪Before programs can be profiled using gprof, they must be compiled with the -pg gcc option.
$ gcc -pg -o sample1 sample1.c
▪The sample1 program prints the prime numbers up to 50,000.
▪When the sample1 program is run, the gmon.out file is created
▪$ gprof -b ./sample1
–The -b option eliminates the text output that explains the data output provided by gprof
▪You can use the output from gprof to increase this program's performance by changing the code to perform faster
#include <stdlib.h>
#include <stdio.h>

int prime(int num);

int main(void)
{
    int i;
    int colcnt = 0;   /* primes printed so far on the current row */

    for (i = 2; i <= 50000; i++)
        if (prime(i)) {
            colcnt++;
            if (colcnt % 9 == 0) {   /* end the row after nine primes */
                printf("%5d\n", i);
                colcnt = 0;
            } else
                printf("%5d", i);
        }
    putchar('\n');
    return 0;
}

int prime(int num)
{
    /* Return 1 if num is prime, 0 otherwise (tries every divisor below num). */
    int i;

    for (i = 2; i < num; i++)
        if (num % i == 0)
            return 0;
    return 1;
}
▪Next we can use the gcov program to look at the actual number of times each line of the program was executed (See Chapter 2)
▪Build the sample1 program with two additional options:
$ gcc -pg -fprofile-arcs -ftest-coverage -o sample1 sample1.c
▪Running sample1 and creating gcov output
▪./sample1
▪gcov sample1.c
▪Running gcov on the source code produces the file sample1.c.gcov. It shows the actual number of times each line of the program was executed
oprofile tools
▪A performance counter is the part of a microprocessor that measures and gathers performance-relevant events on the microprocessor.
▪The number and type of available events differ significantly between existing microprocessors.
▪These counters impose no overhead on the system
▪One performance area of concern is cache misses. The following section describes different types of coding areas that can cause cache misses
▪Cache misses are costly; try to minimize them by following these suggestions:
▪Keep frequently accessed data together.
▪Store and access frequently used data in flat, sequential data structures and avoid pointer indirection.
▪Access data sequentially.
▪If the program is accessing data sequentially, each cache miss brings in n words that will all be used. If the program is accessing only every nth word, each miss brings in unneeded data, degrading performance.
▪Avoid simultaneously traversing several large buffers of data
▪There can be cache conflicts between the buffers. Instead, pack the contents sequentially into one buffer whenever possible. If you are using vertex arrays, try to use interleaved arrays.
▪Some framebuffers have cache-like behaviors as well.
▪It is a good idea to group geometry so that the drawing is done to one part of the screen at a time.
Padding
▪Some compilers (or compiler options) automatically pad structures.
▪Referencing a data structure that spans two cache blocks may incur two misses, even if the structure itself is smaller than the block size.
▪Padding structures to a multiple of the block size and aligning them on a block boundary can eliminate these "misalignment" misses
Aligning
▪Alignment is a little more difficult, since the structure's address must be a multiple of the cache block size.
▪Aligning statically declared structures generally requires compiler support. Some versions of malloc() return cache-block-aligned memory
▪The programmer can align dynamically allocated structures using simple pointer arithmetic
Packing
▪Packing is the opposite of padding
▪By packing an array into the smallest space possible, the programmer increases locality, which can reduce both conflict and capacity misses.
Loop Grouping
▪Numeric programs often consist of several operations on the same data, coded as multiple loops over the same arrays
▪Combining these loops may increase the program's temporal locality and frequently reduces the number of capacity misses
ophelp
code optimization
requires: summary of all debug tools
cache size? 16K?