DSP代码优化

1 注意使用ccs自带的优化工具
1.1 选择恰当的编译器选项

  • 必须要用的选项 –O[2|3]
  • 可以使用-mt(要确保写的数据和读的数据在内存空间上没有重合)
  • -mh<num>  Specify speculative load byte count threshold
  • 如果源代码里含有永远不会执行的代码,使用选项-mo Place each function in a separate subsection
  • 如果考虑可执行程序的大小,加上-ms[0-3]。
  • 不要加上-g –gp –ss –ml3 –mu
  • -s[-k -al]、–mw 、(-on2 -o3)、-consultant 可以在产生分析信息的同时不影响生成代码的性能 

1.2 确保循环中的次数变量(一般for(i; i<N; i++), 此处的i)是有符号整数。

1.3 给编译器提供尽可能多的信息,如限定符restrict, 编译指示MUST_ITERATE、_nasserts()等。

1.4 内存中数据结构的对齐,可使用DATA_ALIGN、 STRUCT_ALIGN等

1.5 参考Compiler Consultant和.nfo文件的建议信息

2 利用优化器的意见
当编译选项中有-s时,在生成的*.asm文件中会有优化器的意见。如"C:/CCStudio_v3.3/C6000/cgtools/bin/cl6x" -g -k -s -on2 -o3 -mt -mw -mv6400+ --mem_model:data=near --consultant -@"Debug.lkf" "lesson_c.c"。其中,
void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1,w2;
    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 =  xptr[i] * w1;
        w_vec2 =  yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}

生成的优化器意见为:
;** --------------------------------------------------------------------------*
;** 27    -----------------------    w1 = *zptr;
;** 28    -----------------------    w2 = zptr[1];
;** 29    -----------------------    if ( N <= 0 ) goto g4;
;** --------------------------------------------------------------------------*
;**      -----------------------    U$17 = xptr;
;**      -----------------------    U$20 = yptr;
;**      -----------------------    U$26 = w_sum;
;** 31    -----------------------    L$1 = N;
;**      -----------------------    #pragma MUST_ITERATE(1, 1099511627775, 1)
;**      -----------------------    #pragma LOOP_FLAGS(4096u)
;**    -----------------------g3:
;** 31    -----------------------    *U$26++ = _mpy(*U$17++, w1)+_mpy(*U$20++, w2)>>15;
;** 29    -----------------------    if ( --L$1 ) goto g3;
;**    -----------------------g4:
;**      -----------------------    return;
从中可以看出,编译器自动加入了N是否为0的判断。如果改为:
void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1,w2;
    w1 = zptr[0];
    w2 = zptr[1];
    #pragma MUST_ITERATE(1)  //告诉编译器至少循环一次(编译器就不会再判断N是否为0)
    for (i = 0; i < N; i++)
    {
        w_vec1 =  xptr[i] * w1;
        w_vec2 =  yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
如下意见中已经没有了对N是否为0的判断:
;** --------------------------------------------------------------------------*
;** 27    -----------------------    w1 = *zptr;
;** 28    -----------------------    w2 = zptr[1];
;**      -----------------------    U$15 = xptr;
;**      -----------------------    U$18 = yptr;
;**      -----------------------    U$24 = w_sum;
;** 32    -----------------------    L$1 = N;
;**      -----------------------    #pragma MUST_ITERATE(1, 4294967295, 1)
;**      -----------------------    #pragma LOOP_FLAGS(4096u)
;**    -----------------------g2:
;** 32    -----------------------    *U$24++ = _mpy(*U$15++, w1)+_mpy(*U$18++, w2)>>15;
;** 30    -----------------------    if ( --L$1 ) goto g2;
;**      -----------------------    return;

3 利用软件流水信息优化循环
void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1,w2;
    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++)
    {
        w_vec1 =  xptr[i] * w1;
        w_vec2 =  yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
上述代码的编译软件流水信息为:
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 30                        ;循环开始的行数
;*      Loop opening brace source line   : 31
;*      Loop closing brace source line   : 35
;*      Known Minimum Trip Count         : 1                ;循环最小次数
;*      Known Max Trip Count Factor      : 1                ;循环的因子,循环次数是循环因子的倍数,如果知道循环因子的话,便于编译器自动铺开(unroll)代码
;*      Loop Carried Dependency Bound(^) : 0           ;内存读写瓶颈,如果有的话,后面的汇编代码注释里相应语句含有^标志
;*      Unpartitioned Resource Bound     : 2               ;资源瓶颈
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                   A-side   B-side
;*      .L units                         0        0     
;*      .S units                        1        0     
;*      .D units                        2*       1     
;*      .M units                       1        1     
;*      .X cross paths             1        0     
;*      .T address paths         2*       1     
;*      Long read paths          0        0     
;*      Long write paths          0        0     
;*      Logical  ops (.LS)         0        0     (.L or .S unit)
;*      Addition ops (.LSD)      1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)           1        0     
;*      Bound(.L .S .D .LS .LSD)     2*       1                 ;资源使用不平衡(没有完全利用可用的计算能力)
;*
;*      Searching for software pipeline schedule at ...
;*      ii = 2  Schedule found with 6 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |   ******                       |    * **                        |
;*       1: |   *  ***                       |    ****                        |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages     : 0
;*      Collapsed prolog stages     : 0
;*      Minimum required memory pad : 0 bytes
;*
;*      Minimum safe trip count     : 1
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION                    ;需要加-mw选项
;*
;*        $C$C23:
;*   0              LDH     .D2T2   *B6++,B5          ; |32|              ;一次装载16bit,浪费了剩余16bit带宽(假设数据带宽为32bit)
;*   1              LDH     .D1T1   *A6++,A5          ; |32| 
;*   2              NOP             3
;*   5              MPY     .M2     B5,B7,B4          ; |32| 
;*   6              MPY     .M1     A5,A8,A4          ; |32| 
;*   7              NOP             1
;*   8              ADD     .L1X    B4,A4,A3          ; |32| 
;*   9              SHR     .S1     A3,15,A3          ; |32| 
;*  10              STH     .D1T1   A3,*A7++          ; |32| 
;*         ||           SPBR            $C$C23
;*  11              NOP             1
;*  12              ; BRANCHCC OCCURS {$C$C23}        ; |30|          ;一次循环需要12时钟周期(没有展开)
;*----------------------------------------------------------------------------*

修改为下面的代码时:
#define WORD_ALIGNED(x)  (_nassert(((int)(x) & 0x3) == 0))
#define DWORD_ALIGNED(x) (_nassert(((int)(x) & 0x7) == 0))

void lesson3_c(short * restrict xptr, short * restrict yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;
    WORD_ALIGNED(xptr); //保证内存装载的带宽
    WORD_ALIGNED(yptr);    
                               
    w1 = zptr[0];
    w2 = zptr[1];
    #pragma MUST_ITERATE(48, , 2);   //已知最小循环次数=48,循环因子factor=2, 可以铺开代码
    for (i = 0; i < N; i++)
    {
        w_vec1 =  xptr[i] * w1;
        w_vec2 =  yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
上述代码的编译软件流水信息为:
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 59
;*      Loop opening brace source line   : 60
;*      Loop closing brace source line   : 64
;*      Loop Unroll Multiple             : 4x                       ;循环铺开的次数
;*      Known Minimum Trip Count         : 12                    
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 4
;*      Partitioned Resource Bound(*)    : 4
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     2        2     
;*      .D units                     4*       4*                     ;铺开循环保证了资源使用的平衡
;*      .M units                     2        2     
;*      .X cross paths               2        2     
;*      .T address paths             4*       4*    
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          2        2     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        1     
;*      Bound(.L .S .D .LS .LSD)     3        3     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 4 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |    *****        ***            |    ******      ****            |
;*       1: |   *******         *            |    ******      ** *            |
;*       2: |   *******      ****            |    * ****      **              |
;*       3: |   *******      ****            |    ******      **              |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages     : 0
;*      Collapsed prolog stages     : 0
;*      Minimum required memory pad : 0 bytes
;*
;*      Minimum safe trip count     : 1 (after unrolling)
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C24:
;*   0              LDW     .D1T1   *A7++(8),A9       ; |61|             ;每次装载一个字
;*     ||           LDW     .D2T2   *B6++(8),B17      ; |61| 
;*   1              LDW     .D1T1   *A4++(8),A3       ; |61| 
;*   2              NOP             1
;*   3              LDW     .D2T2   *B16++(8),B4      ; |61| 
;*   4              NOP             2
;*   6              MPY2    .M1     A9,A8,A17:A16     ; |61| 
;*   7              MPY2    .M1     A3,A8,A19:A18     ; |61| 
;*     ||           MPY2    .M2     B17,B7,B5:B4      ; |61| 
;*   8              MPY2    .M2     B4,B7,B19:B18     ; |61| 
;*   9              NOP             2
;*  11              ADD     .L2X    B4,A16,B17        ; |61| 
;*  12              ADD     .L2X    B18,A18,B5        ; |61| 
;*     ||           SHR     .S2     B17,15,B4         ; |61| 
;*     ||           ADD     .L1X    B5,A17,A3         ; |61| 
;*  13              SHR     .S2     B5,15,B4          ; |61| 
;*     ||           ADD     .L1X    B19,A19,A19       ; |61| 
;*     ||           STH     .D2T2   B4,*B9++(8)       ; |61| 
;*     ||           SHR     .S1     A3,15,A18         ; |61| 
;*  14              STH     .D2T2   B4,*B8++(8)       ; |61| 
;*     ||           SHR     .S1     A19,15,A9         ; |61| 
;*     ||           STH     .D1T1   A18,*A5++(8)      ; |61| 
;*  15              STH     .D1T1   A9,*A6++(8)       ; |61| 
;*     ||           SPBR            $C$C24
;*  16              ; BRANCHCC OCCURS {$C$C24}        ; |59|       ;4次循环使用16个时钟周期(循环展开,一次循环展开为4次)
;*----------------------------------------------------------------------------*

4 consultant advice 和 *.nfo文件
当编译时加上--consultant和–on2 –o3,可以查看相应的consultant advice 和*.nfo文件。打开profile,运行程序,这是可以查看viewer的consultant内容。*.nfo文件编译时便可以生成,内容没有consultant全。下面是lesson3_c.nfo的内容

TMS320C6x C/C++ Optimizer               v6.0.8
Build Number 1GKUL-JA0KH827-RSAQQ-TAV-ZAZG_W_Q_Y

        ======File-level Analysis Summary======

extern void _lesson3_c() is called from 0 sites in this file.
    It appears to be inlineable (size = 58 units)
    It calls these functions:
    <NONE>

        ======= End file-level Analysis =======

extern void _lesson3_c() is called from 0 sites in this file.
    It appears to be inlineable (size = 58 units)
    It calls these functions:
    <NONE>

ADVICE: In function lesson3_c()
    in the 'for' loop with loop variable 'i' at lines 39-44
    for the statement w_sum[i] = _mpy(xptr[i], w1)+_mpy(yptr[i], w2)>>15; at line 41

    The address of w_sum[i] for the first iteration of the loop is &w_sum[0].
    This pointer is aligned to a 16 bit boundary.

    Consider adding an assertion just before the loop:

        _nassert( ((int)w_sum % 4) == 0 );  /* 32-bit aligned */
       or    _nassert( ((int)w_sum % 8) == 0 );  /* 64-bit aligned */

    to specify that multiple elements of w_sum[i]
    may be accessed in parallel.

ADVICE: In function lesson3_c()
    in the 'for' loop with loop variable 'i' at lines 39-44
    for the statement w_sum[i] = _mpy(xptr[i], w1)+_mpy(yptr[i], w2)>>15; at line 41

    The address of yptr[i] for the first iteration of the loop is &yptr[0].
    This pointer is aligned to a 32 bit boundary.

    Consider adding an assertion just before the loop:

        _nassert( ((int)yptr % 8) == 0 );  /* 64-bit aligned */

    to specify that multiple elements of yptr[i]
    may be accessed in parallel.

ADVICE: In function lesson3_c()
    in the 'for' loop with loop variable 'i' at lines 39-44
    for the statement w_sum[i] = _mpy(xptr[i], w1)+_mpy(yptr[i], w2)>>15; at line 41

    The address of xptr[i] for the first iteration of the loop is &xptr[0].
    This pointer is aligned to a 32 bit boundary.

    Consider adding an assertion just before the loop:

        _nassert( ((int)xptr % 8) == 0 );  /* 64-bit aligned */

    to specify that multiple elements of xptr[i]
    may be accessed in parallel.
<<NULL MIX DOMAIN>>

== END OF INFO OUTPUT==

转载于“https://blog.csdn.net/tercel_zhang/article/details/52944286

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值