Linux gprof tool
https://www.ibm.com/developerworks/cn/linux/l-gnuprof.html
Tensilica
Profiling with the Xtensa ISS has several advantages over hardware profiling:
- You do not need to compile the Xtensa program with special options (e.g., ‘-hwpg’) before profiling it.
- There is no instrumentation code added to the Xtensa program, so the profile results are not distorted by any extra code.
- The Xtensa ISS can easily record the execution of every instruction, so there is no need to rely on statistical approximations like PC-sampling.
- Instead of counting execution cycles, the Xtensa ISS can optionally record profile data for other events, such as cache misses. You can then use xt-gprof or Xplorer to view a profile of these other events.
benchmark
- pipeline interlock
Consider the following instructions, where the second depends on the value loaded by the first:
LD adr -> r10
AND r10,r3 -> r11
The data read from the address adr is not present in the data cache until after the Memory Access stage of the LD instruction. By this time, the AND instruction is already through the ALU. To resolve this would require the data from memory to be passed backwards in time to the input to the ALU. This is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled - they are prevented from flopping their inputs and so stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.
This NOP is termed a pipeline bubble since it floats in the pipeline, like an air bubble, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock.
- branch delay
performance tuning
- data alignment for vectorization
void sum(int *a, int *b, int *c, int n)
{
    #pragma aligned (a, 8)
    #pragma aligned (b, 8)
    #pragma aligned (c, 8)
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
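As used in these notes, the aligned pragmas read as a promise to the compiler that a, b, and c sit on 8-byte boundaries; nothing in the loop itself checks that promise, so the caller has to keep it. The sketch below is one way to do that in portable C11 on a host machine. It assumes aligned_alloc is available, and the helper names is_aligned, alloc_ints_aligned8, and sum_checked are hypothetical, not part of any Xtensa API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns nonzero when p sits on an align-byte boundary. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}

/* Hypothetical helper: allocate room for n ints on an 8-byte boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the byte count is rounded up to the next multiple of 8. */
static int *alloc_ints_aligned8(size_t n)
{
    size_t bytes = (n * sizeof(int) + 7u) & ~(size_t)7u;
    if (bytes == 0)
        bytes = 8;
    return aligned_alloc(8, bytes);
}

/* Same loop as sum(), with the alignment promise checked at run time
 * by an assertion instead of being stated through pragmas. */
static void sum_checked(int *a, const int *b, const int *c, int n)
{
    assert(is_aligned(a, 8) && is_aligned(b, 8) && is_aligned(c, 8));
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

A debug build could call sum_checked in place of sum to catch a misaligned pointer, such as one that points 4 bytes into an aligned buffer, before the vectorized version is trusted with it.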
- Controlling Vectorization Through Pragmas
The concurrent pragma asserts that each iteration of the loop is independent of all other iterations. This pragma will often make a loop vectorizable.
void copy (int *a, int *b, int n)
{
    int i;
    #pragma concurrent
    for (i = 0; i < n; i++)
        a[i] = b[i];
}
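Whether the iterations of the copy loop really are independent depends on how a and b relate in memory. The sketch below illustrates this in plain C: if the iterations are independent, running them in reverse order must give the same result. The function names and the offset parameter are hypothetical, chosen only for this illustration. A matching result does not prove independence, but a mismatch shows that the assertion made by the pragma would be false.

```c
#include <assert.h>
#include <string.h>

/* Same loop body as copy(), iterations run from first to last. */
static void copy_forward(int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}

/* Same loop body, iterations run from last to first. */
static void copy_backward(int *a, const int *b, int n)
{
    for (int i = n - 1; i >= 0; i--)
        a[i] = b[i];
}

/* Returns 1 when both iteration orders leave the buffer in the same
 * state. The destination starts `offset` elements after the source
 * inside one buffer, so a small nonzero offset makes the two overlap. */
static int order_independent(int offset)
{
    enum { N = 4 };
    int x[2 * N], y[2 * N];

    for (int i = 0; i < 2 * N; i++)
        x[i] = y[i] = i;

    copy_forward(x + offset, x, N);
    copy_backward(y + offset, y, N);
    return memcmp(x, y, sizeof x) == 0;
}
```

With an offset of 4 the source and destination do not overlap and both orders agree. With an offset of 1 each iteration reads an element that the previous iteration wrote, the two orders disagree, and marking such a loop as concurrent would be incorrect.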