C-LEVEL Optimization

Latest recommended article published on 2022-11-06 23:56:28

sac761

Article tags: VEC-C, CEVA, vector technology code optimization, C-LEVEL optimization

Article link: https://blog.csdn.net/sac761/article/details/75125512

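1. Use -INLINE

Use this option when functions in your code are already hand-optimized by the developer and you do not want the compiler to restructure them on its own. It controls function inlining through a set of switches, with the following usage: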
Usage: -INLINE:
Switches:
={on|off}
none
all
must=routine_name[,routine_name]*
never=routine_name[,routine_name]*
static={on|off}

For example:

-INLINE:never=foo:must=bar
-INLINE:static=on

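2. Minimize the number of function parameters

When a function takes too many parameters, pack them into a structure and pass a pointer to it. For example: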

Instead of:
void init_func(int n,
short lim,
int x,
int y,
int z,
short *p1,
short *p2,
int *p3,
int *p4,
int *p5);

Use:

typedef struct {
    int n;
    short lim;
    int x;
    int y;
    int z;
    short *p1;
    short *p2;
    int *p3;
    int *p4;
    int *p5;
} params_t;
void init_func(params_t *args);
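
As a rough illustration of how a caller might use the structure form, here is a minimal sketch. The body of init_func, the caller setup_example, and the field values are all hypothetical; the original gives only the prototype.

```c
typedef struct {
    int n;
    short lim;
    int x;
    int y;
    int z;
    short *p1;
    short *p2;
    int *p3;
    int *p4;
    int *p5;
} params_t;

/* Hypothetical body: the original shows only the prototype.
 * As a stand-in, it zeroes the first n elements of p1. */
void init_func(params_t *args)
{
    int i;
    for (i = 0; i < args->n; i++) {
        args->p1[i] = 0;
    }
}

/* Hypothetical caller: fill the structure once, then pass one pointer
 * instead of ten separate arguments. */
void setup_example(short *buf, int len)
{
    params_t args = {0};   /* fields not set below stay zero / null */
    args.n = len;
    args.lim = 100;        /* arbitrary illustrative value */
    args.p1 = buf;
    init_func(&args);
}
```

With this form only a single pointer has to be passed at the call site, and new parameters can be added to the structure without changing the function signature.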

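3. Choose sensible optimization levels

Best combination for reducing cycle count: -O4 -Os0
Best combination for reducing code size: -O3 -Os4
Use -O4 on critical code (kernels) to get the best performance
Use -O3 -Os[1-4] on non-critical code to get the best overall balance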

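4. Reduce the use of address-taken, static, and global variables

The compiler does not allocate registers for these three kinds of variables, so operations on them are slow. Reduce how often they are accessed. For example: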
Instead of:

int global_counter;
void func(int *p)
{
    int i;
    for (i=0; i<…; i++)
    {
        foo();
        global_counter++;
    }
}

Use:

int global_counter;
void func(int *p)
{
    int i;
    int local_counter=0;
    for (i=0; i<…; i++)
    {
        foo(); // foo() doesn't access global_counter
        local_counter++;
    }
    global_counter += local_counter;
}
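
The example above covers a global variable. The same idea applies to a variable whose address is taken, and could be sketched as follows. The helper get_gain and the function scale_sum are hypothetical names introduced only for illustration.

```c
/* Hypothetical helper: writes a gain value through a pointer. */
static void get_gain(int *gain)
{
    *gain = 3;
}

int scale_sum(const int *data, int n)
{
    int gain;      /* address-taken: &gain is passed to get_gain() */
    int g;         /* plain local copy, eligible for a register */
    int sum = 0;
    int i;

    get_gain(&gain);
    g = gain;      /* read the address-taken variable once, before the loop */
    for (i = 0; i < n; i++) {
        sum += data[i] * g;   /* the loop touches only plain locals */
    }
    return sum;
}
```

The loop body then works only on plain locals, which is the property the tip is after.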

5. Keep the loop body as small as possible
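
The original gives no example for this tip. One common way to shrink a loop body is to move work that does not change between iterations out of the loop, roughly as in the sketch below. The function names and the expression are hypothetical, and compilers often perform this hoisting themselves in simple cases like this one.

```c
/* Before: the factor (a * b + 1) is written inside the loop body. */
void scale_before(int *out, const int *in, int n, int a, int b)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = in[i] * (a * b + 1);
    }
}

/* After: the loop-invariant factor is computed once, outside the loop. */
void scale_after(int *out, const int *in, int n, int a, int b)
{
    int i;
    int k = a * b + 1;
    for (i = 0; i < n; i++) {
        out[i] = in[i] * k;
    }
}
```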

6. Make the loop trip count known before the loop starts

Instead of:

for (i=0; i<foo();i++)
{
...
}

Use:

int limit = foo();
for (i=0; i<limit;i++)
{
…
}

Instead of:

while ((*p != 0) && (i<200))
{
    …
    i++;
}

Use:

while (i<200)
{
    if (*p == 0)
        break;
    …
    i++;
}
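
A complete, compilable version of the second form might look like the sketch below. The function bounded_len, the MAX_LEN macro, and the use of p[i] for the elided loop body are assumptions made for illustration.

```c
#define MAX_LEN 200   /* same bound as the 200 in the example above */

/* Counts the elements before the first zero, examining at most MAX_LEN. */
int bounded_len(const short *p)
{
    int i = 0;
    while (i < MAX_LEN) {   /* single, simple loop condition */
        if (p[i] == 0)
            break;          /* data-dependent exit moved into the body */
        i++;
    }
    return i;
}
```

The loop condition now depends only on the counter, so the maximum trip count is visible to the compiler; the data-dependent test becomes an early exit.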

7. Use intrinsics and library functions

For example, use

memcpy( x, y, LIMIT*sizeof( int ) ); 
memset( arr, 0, LIMIT*sizeof( int ) );

rather than assigning the elements one by one in a hand-written loop.
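
A small self-contained sketch of the two calls is shown below. The value of LIMIT and the wrapper function reset_and_copy are hypothetical; the original does not define them.

```c
#include <string.h>

#define LIMIT 8   /* hypothetical size; the original does not define LIMIT */

void reset_and_copy(int *x, const int *y, int *arr)
{
    /* instead of: for (i = 0; i < LIMIT; i++) x[i] = y[i]; */
    memcpy(x, y, LIMIT * sizeof(int));

    /* instead of: for (i = 0; i < LIMIT; i++) arr[i] = 0; */
    memset(arr, 0, LIMIT * sizeof(int));
}
```

Note that memcpy requires the source and destination not to overlap; memmove is the standard alternative when they might.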

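8. Control loop unrolling

#pragma dsp_ceva_unroll=N tells the compiler to unroll the loop N times
#pragma dsp_ceva_trip_count=N tells the compiler that the estimated trip count is N
#pragma dsp_ceva_trip_count_factor=N tells the compiler that the iteration count is divisible by N
#pragma dsp_ceva_trip_count_min=N tells the compiler that the iteration count is at least N
-OPT:unroll_times_max=1 tells the compiler that the maximum unroll count is 1 globally
#pragma dsp_ceva_unroll=1 tells the compiler that the maximum unroll count is 1 for one specific loop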

Example 1:

void foo(int* in, int* out, int N)
{
    int i = 0;
    for(i=0; i<N;i++)
#pragma dsp_ceva_unroll=1
    {
        *out = *in++;
        out++;
    }
}

The generated assembly is:

; Guarding if may be created in 
; cases where software pipeline 
; optimization occurs
… 
; Loop Body 
PCU.bkrep {ds1} lci0.ui 
SC0.nop 
{
LS0.ld (r3.ui).i +#4, modu0.i
LS0.st modu0.ui, (r4.ui).i+#4
}

Example 2:

void foo(int* in, int* out, int N)
{
    int i = 0;
    for(i=0; i<N;i++)
#pragma dsp_ceva_unroll=2
    {
        *out = *in++;
        out++;
    }
}

The generated assembly is:

; Guarding if may be created in 
; cases where software pipeline 
; optimization occurs
 … 
 SC0.cmp {le} modu0.i, #0, pr0
 || SC1.mov r0.i, r4.i
 || SC2.mov r1.i, r3.i
 || SC3.shiftr r5.i, #0x1, r1.i
 SC0.nop
 PCU.brr {t} #BB18_foo ,?pr0
BB11_foo:; Remainder loop
 LS0.ld (r4.ui).i +#4, modu0.i
 || SC0.shiftr r5.i, #0x1, r1.i
 LS0.st modu0.ui, (r3.ui).i+#4
 … 
 ; Loop Body
 PCU.bkrep lci0.ui
 {
 LS0.ld (r4.ui).i +#4, modu1.i
 LS0.st modu1.ui, (r3.ui).i+#4
 LS0.ld (r4.ui).i +#4, modu0.i
 LS0.st modu0.ui, (r3.ui).i+#4
 }

Example 3:

void foo(int* in, int* out, int N)
{
    int i = 0;
    for(i=0; i<N;i++)
#pragma dsp_ceva_unroll=2
#pragma dsp_ceva_trip_count_min=2
    {
        *out = *in++;
        out++;
    }
}

The generated assembly is:

; Guarding if will not be created in this case 
 … 
 SC0.cmp {le} modu0.i, #0, pr0
 || SC1.mov r0.i, r4.i
 || SC2.mov r1.i, r3.i
 || SC3.shiftr r5.i, #0x1, r1.i
 SC0.nop
 PCU.brr {t} #BB18_foo ,?pr0
BB11_foo:; Remainder loop
 LS0.ld (r4.ui).i +#4, modu0.i
 || SC0.shiftr r5.i, #0x1, r1.i
 LS0.st modu0.ui, (r3.ui).i+#4
 … 
 ; Loop Body
 PCU.bkrep lci0.ui
 {
 LS0.ld (r4.ui).i +#4, modu1.i
 LS0.st modu1.ui, (r3.ui).i+#4
 LS0.ld (r4.ui).i +#4, modu0.i
 LS0.st modu0.ui, (r3.ui).i+#4
 }

Example 4:

void foo(int* in, int* out, int N)
{
    int i = 0;
    for(i=0; i<N;i++)
#pragma dsp_ceva_unroll=2
#pragma dsp_ceva_trip_count_min=2
#pragma dsp_ceva_trip_count_factor=2
    {
        *out = *in++;
        out++;
    }
}

The generated assembly is:

 ; Loop Body
 PCU.bkrep lci0.ui
 {
 LS0.ld (r4.ui).i +#4, modu1.i
 LS0.st modu1.ui, (r3.ui).i+#4
 LS0.ld (r4.ui).i +#4, modu0.i
 LS0.st modu0.ui, (r3.ui).i+#4
 }
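
To make the structure of the listings easier to follow, the sketch below writes out in plain C roughly what an unroll factor of 2 amounts to: a guard for an empty loop, a remainder iteration for odd N, and a main body that copies two elements per pass. This is an interpretation of the listings, not compiler output, and the name foo_unrolled is hypothetical.

```c
/* Plain-C picture of the copy loop from the examples, unrolled by 2. */
void foo_unrolled(int *in, int *out, int N)
{
    int pairs;

    if (N <= 0)        /* guard: nothing to do for an empty loop */
        return;

    if (N & 1) {       /* remainder: one leftover iteration when N is odd */
        *out++ = *in++;
    }

    /* main body: two copies per pass, N / 2 passes */
    for (pairs = N >> 1; pairs > 0; pairs--) {
        *out++ = *in++;
        *out++ = *in++;
    }
}
```

Read this way, a guaranteed minimum trip count is what allows the guard to be dropped, and a trip count known to be a multiple of 2 is what allows the remainder step to be dropped, leaving only the main body as in Example 4.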

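9. Tell the compiler that pointers do not alias

Use one of the following:
1) -OPT:alias=restrict
The memory that each pointer points to does not overlap with any other; all pointers are strictly independent
2) -OPT:alias=strongly_typed
Pointers to different data types point to different, independent memory
3) Qualify the function parameters with the restrict keyword
For example: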

void vec_mem_copy( int *__restrict__ p1, int *__restrict__ p2, int n)
{
    …
}

Or compile with -OPT:alias=restrict.

The generated assembly is shown below.
Without telling the compiler that the memory does not alias:

…
; 5 cycles per iteration
PCU.bkrep lci0.ui
{
LS0.ld (r4.ui).i +#4, modu3.i
SC0.nop
SC0.nop
SC0.add modu3.i, #80, modu3.i
LS0.st modu3.ui, (r3.ui).i+#
…
; Additional unrolled iterations
}

After telling the compiler that the memory does not alias:

… 
; 1 cycle per iteration
PCU.bkrep lci0.ui
 { 
 LS0.st r2.ui, (r3.ui).i+#4
 || LS1.ld (r4.ui).i +#4, r2.i
 || SC0.add r0.i, #80, modu2.i
…
; Additional unrolled iterations
 }
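
For reference, a complete function using the qualifier could look like the sketch below. The name vec_add_const and the loop body are assumptions, since the original elides the body of vec_mem_copy; the constant 80 is chosen only to mirror the add of #80 visible in the listings. The __restrict__ spelling follows the original and is accepted by GCC-compatible compilers.

```c
/* Adds a constant to every element of in and stores the result in out.
 * The caller promises that in and out do not overlap. */
void vec_add_const(int *__restrict__ out, const int *__restrict__ in, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        out[i] = in[i] + 80;   /* 80 mirrors the add #80 in the listings */
    }
}
```

Because the qualifier is a promise made by the programmer, calling such a function with overlapping buffers is undefined behavior; the compiler does not check it.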