Xtensa simulation environment tuning analysis (ISS profile)


OString2024


Category: DSP

Article link: https://blog.csdn.net/huntershuai/article/details/100541647


Linux gprof tool

https://www.ibm.com/developerworks/cn/linux/l-gnuprof.html

Tensilica

Profiling with the Xtensa ISS has several advantages over hardware profiling:

- You do not need to compile the Xtensa program with special options (e.g., ‘-hwpg’) before profiling it.
- There is no instrumentation code added to the Xtensa program, so the profile results are not distorted by any extra code.
- The Xtensa ISS can easily record the execution of every instruction, so there is no need to rely on statistical approximations like PC-sampling.
- Instead of counting execution cycles, the Xtensa ISS can optionally record profile data for other events, such as cache misses. You can then use xt-gprof or Xplorer to view a profile of these other events.

benchmark

pipeline interlock

Forwarding does not resolve every data hazard, however. Consider the following instructions:
LD adr -> r10
AND r10,r3 -> r11
The data read from the address adr is not available from the data cache until after the Memory Access stage of the LD instruction. By that time, the AND instruction has already passed through the ALU. Resolving this would require passing the data from memory backwards in time to the ALU input, which is not possible. The solution is to delay the AND instruction by one cycle. The data hazard is detected in the decode stage, and the fetch and decode stages are stalled: they are prevented from latching new inputs into their flip-flops, so they stay in the same state for a cycle. The execute, access, and write-back stages downstream see an extra no-operation instruction (NOP) inserted between the LD and AND instructions.
This NOP is termed a pipeline bubble since it floats in the pipeline, like an air bubble, occupying resources but not producing useful results. The hardware to detect a data hazard and stall the pipeline until the hazard is cleared is called a pipeline interlock.

branch delay

performance tuning

data alignment for vectorization

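void sum(int *a, int *b, int *c, int n)
{
    /* Promise the compiler that a, b and c are 8-byte aligned,
       so the vectorizer does not need to handle misaligned data. */
#pragma aligned (a, 8)
#pragma aligned (b, 8)
#pragma aligned (c, 8)
    int i;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}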

Controlling Vectorization Through Pragmas
#pragma concurrent asserts that each iteration of the loop is independent of all other iterations. This pragma will often make a loop vectorizable.

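void copy(int *a, int *b, int n)
{
    int i;
    /* Assert that iterations are independent, i.e. a[] and b[]
       do not overlap, so the loop can be vectorized. */
#pragma concurrent
    for (i = 0; i < n; i++)
        a[i] = b[i];
}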
