A Discussion of Program Optimization Methods

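A cache is a system design technique that speeds up programs by exploiting the principle of locality in computer systems. It takes advantage of both spatial locality and temporal locality.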

Cache organization

Direct-mapped caches perform poorly relative to set-associative caches when multiple memory references conflict with each other, that is, when they line up in the same cache lines.
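To make "lining up in the same cache line" concrete, here is a minimal sketch of how a direct-mapped cache picks a line for an address. The line size (32 bytes) and cache size (16 KB) are assumptions chosen for illustration, not the parameters of any particular Xtensa configuration, and the function names are hypothetical.

```c
#include <stdint.h>

#define LINE_SIZE  32u      /* assumed bytes per cache line */
#define CACHE_SIZE 16384u   /* assumed total cache size (16 KB) */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

/* Index of the cache line that an address maps to in a direct-mapped cache. */
unsigned cache_line_index(uint32_t addr)
{
    return (addr / LINE_SIZE) % NUM_LINES;
}

/* Two addresses conflict when they map to the same line
   but come from different memory blocks. */
int addresses_conflict(uint32_t a, uint32_t b)
{
    return cache_line_index(a) == cache_line_index(b)
        && (a / LINE_SIZE) != (b / LINE_SIZE);
}
```

Under these assumptions, two addresses exactly one cache size apart (for example 0x1000 and 0x5000) evict each other in a direct-mapped cache, while a set-associative cache could hold both.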

cache miss

Instruction Cache miss

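Cache miss simulation
[Figure: cache miss simulation result]
After reordering the linker process, that is, reorganizing how the program is linked, the simulation gives:
[Figure: cache miss simulation result after link reordering]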

Data Cache miss


  1. Branch penalty:
  • The tool chain might replace a 16-bit instruction with an equivalent 24-bit instruction, or might add padding to unexecuted regions of code, in order to better align branch targets.

  • A 5-stage configuration will suffer significantly fewer branch delays
    than a 7-stage one.

  2. Prefetch
    The prefetch unit can prefetch ahead up to four cache lines.
#include <xtensa/hal.h>
int xthal_set_cache_prefetch(unsigned long long mode);

mode can be the following:

• XTHAL_PREFETCH_ENABLE (enable cache prefetch)
• XTHAL_PREFETCH_DISABLE (disable cache prefetch)
• XTHAL_ICACHE_PREFETCH_OFF (disable instruction cache prefetch)
• XTHAL_ICACHE_PREFETCH_LOW (enable, less aggressive prefetch)
• XTHAL_ICACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
• XTHAL_ICACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
• XTHAL_ICACHE_PREFETCH(n) (explicitly set the InstCtl field of the PREFCTL register to 0..15. See the Prefetch Architectural Additions section of the Prefetch Unit Option chapter in the Xtensa Microprocessor Data Book for details.)

and for data cache:

• XTHAL_DCACHE_PREFETCH_OFF (disable data cache prefetch)
• XTHAL_DCACHE_PREFETCH_LOW (enable, less aggressive prefetch)
• XTHAL_DCACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
• XTHAL_DCACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
• XTHAL_DCACHE_PREFETCH(n) (explicitly set the DataCtl field of the PREFCTL register to 0..15. See the Prefetch Architectural Additions section of the Prefetch Unit Option chapter in the Xtensa Microprocessor Data Book for details.)
• XTHAL_DCACHE_PREFETCH_L1_OFF (prefetch data to prefetch buffers only)
• XTHAL_DCACHE_PREFETCH_L1 (on configurations that support it, prefetch directly to L1 data cache)
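  • Software prefetch
    GCC provides a builtin for this:
// rw is a compile-time constant, either 1 or 0: 1 means write (w), 0 means read (r).
// locality is a compile-time constant from 0 (no temporal locality) to 3 (high temporal locality).
void __builtin_prefetch( const void *addr, int rw, int locality );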
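As an illustration of software prefetch, the sketch below sums an array while asking for data a fixed distance ahead of the current element. The helper name sum_with_prefetch and the distance of 16 elements are assumptions for this example; a useful distance depends on the target's memory latency and line size, and on some targets the builtin may compile to nothing.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 16  /* assumed look-ahead in elements; tune per target */

/* Hypothetical helper: sums n ints, prefetching ahead of the read position. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* rw = 0: prefetch for reading; locality = 3: high temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint: the result is the same with or without it, so it can be added or removed while measuring cache misses.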
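  3. Loop organization
    Iterating over an array sequentially makes better use of the cache's spatial locality and raises the cache hit rate.
    Of the two loops below, the first has better spatial locality: the elements of each row sit at consecutive, increasing addresses, so walking along a row lets every fetched or prefetched cache line be used in full, which raises the cache hit rate.

    #include <stdlib.h>

    int hang = 1024*8;   /* number of elements in each arr[j] */
    int lie = 1024*8;    /* number of arr[j] rows */
    int i, j;
    int **arr = (int **)malloc(sizeof(int*) * lie);
    for(j = 0; j < lie; j++)
         arr[j] = (int *)calloc(hang, sizeof(int));   /* each row is contiguous */

    /* Loop 1: the inner loop walks one row, touching consecutive addresses. */
    for(j = 0; j < lie; j++)
    {
         for(i = 0; i < hang; i++)
         {
              arr[j][i] ++;
         }
    }

    /* Loop 2: the inner loop jumps to a different row on every access. */
    for(i = 0; i < hang; i++)
    {
         for(j = 0; j < lie; j++)
         {
              arr[j][i] ++;
         }
    }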
    

Other optimization directions:

  1. Avoid Short Scalar Datatypes
  2. Use Locals Instead of Globals
    If a global variable is never assigned inside a loop and is only passed along as an argument, copy it into a local variable before the loop.

Doing so saves a load of g into a register on every loop iteration.

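int g;
void fred(int, int);
void foo()
{
  int i;
  for (i=0; i<100; i++){
    fred(i,g);         /* g is reloaded from memory on every iteration */
  }
}

After optimization:

int g;
void fred(int, int);
void foo()
{
  int i, local_g=g;    /* load g once, before the loop */
  for (i=0; i<100; i++){
    fred(i,local_g);
  }
}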

Or use the pure attribute:

int g;
void __attribute__ ((pure)) fred(int, int);
void foo()
{
  int i;
  for (i=0; i<100; i++){
    fred(i,g);
  }
}
  • If the function fred reads but does not write global variables,
    you can mark the function with the pure attribute.
  • If the function fred does not read or write any global variables other than its
    function arguments, you can instead use the stronger const attribute. For this example, both const and pure will eliminate the load.
  • In other examples where the variable is written in the calling function, the use of
    const will eliminate a store but pure will not, because a pure function may still read the variable.
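The difference between the two attributes can be sketched with a pair of small functions, following GCC's documented meaning of const (the result depends only on the arguments) and pure (global memory may be read but not written). The names scale, square, scaled and sum_scaled_squares are hypothetical and exist only for this example.

```c
int scale = 3;  /* hypothetical global, read by scaled() */

/* const: the result depends only on the argument; no global memory is read. */
__attribute__((const)) int square(int x)
{
    return x * x;
}

/* pure: reads the global 'scale' but has no side effects. */
__attribute__((pure)) int scaled(int x)
{
    return x * scale;
}

/* Because neither callee writes memory, the compiler is free to keep
   values in registers across the calls inside this loop. */
int sum_scaled_squares(int n)
{
    int i, total = 0;
    for (i = 0; i < n; i++)
        total += scaled(square(i));
    return total;
}
```

Marking a function const when it actually reads a global would be a bug, since the compiler could then reuse a stale result after the global changes.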
  3. Use Arrays Instead of Pointers
    for (i=0; i<100; i++)
     *p++ = ...
    
    After optimization:
    for (i=0; i<100; i++)
     p[i] = ...
    

In every iteration of the first loop, *p is being assigned, but so is the pointer p. Depending on circumstances, the assignment to the pointer can hinder optimization. In some cases it is possible that the assignment to *p changes the value of the pointer itself, forcing the compiler to generate code to reload and increment the pointer during each iteration. In other cases, the compiler cannot prove that the pointer is not used outside the loop, and the compiler must therefore generate code after the loop to update the pointer with its incremented value. To avoid these types of problems, it is better to use arrays rather than pointers, as shown above.
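A complete version of the two loop styles might look like the sketch below. The function names and the fill operation are assumptions chosen for illustration, since the original loops elide the right-hand side of the assignment.

```c
#include <stddef.h>

/* Pointer style: both *p and p itself change on every iteration. */
void fill_with_pointer(int *p, size_t n, int value)
{
    size_t i;
    for (i = 0; i < n; i++)
        *p++ = value;
}

/* Array style: p stays loop-invariant and only the index changes. */
void fill_with_index(int *p, size_t n, int value)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i] = value;
}
```

Both functions store the same values; the indexed form simply gives the compiler an easier loop to analyze.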

  4. Minimizing Conditionals
    Use a lookup array in place of conditional tests, and keep the number of conditionals to a minimum.

Every taken branch incurs at least a two-cycle penalty.
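As one possible illustration of replacing a conditional with a lookup array, the sketch below converts a value in the range 0..15 to its hexadecimal digit in two ways. The function names are hypothetical, and whether the table version is actually faster depends on the target and on what the compiler does with the branch.

```c
/* Branching version: valid for v in 0..15. */
char hex_digit_branch(unsigned v)
{
    if (v < 10)
        return (char)('0' + v);
    return (char)('a' + (v - 10));
}

/* Lookup version: one table load, no conditional branch. */
char hex_digit_lookup(unsigned v)
{
    static const char table[] = "0123456789abcdef";
    return table[v & 0xFu];   /* mask keeps the index inside the table */
}
```

The table costs a little data memory, so this trade works best when the table is small and stays in the data cache.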

5. Use Direct Calls

Avoid indirect calls. These are calls via function pointers. Particularly with IPA, the compiler is not able to analyze indirect calls and must assume that an indirect function might cause unknown side effects like modifying global or pointer variables. Even without IPA, every indirect call requires that the address of the call be loaded, leading to additional overhead.

In short, avoid function pointers where a direct call will do.
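The contrast between the two call styles can be sketched as follows. The functions add_one, apply_indirect and apply_direct are hypothetical names used only for this example.

```c
static int add_one(int x)
{
    return x + 1;
}

/* Indirect call: the callee is only known at run time, so the compiler
   must load the target address and assume unknown side effects. */
int apply_indirect(int (*fn)(int), int x)
{
    return fn(x);
}

/* Direct call: the callee is visible, so it can be analyzed or inlined. */
int apply_direct(int x)
{
    return add_one(x);
}
```

When the set of possible callees is small and known at build time, calling them by name keeps the call analyzable.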

  6. Passing Function Parameters

Consider a situation where you want to write a function that computes the value of some variable, x, in the caller. You can either have the function return a value and assign the result of the function to x, or alternatively, you can pass the address of x into the function and have the function assign *x directly inside the function. It is better to have the function return a value. By passing the address of a variable into the function, the compiler must assume that the address is saved away by the function and any pointer dereference anywhere in the program might actually change the value of x.

Similarly, scalar variables should always be passed by value. Passing the address of a scalar variable forces the compiler to conservatively assume that the address of the variable is saved by the function.

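  • For simple (scalar) parameters, if the function is expected to modify the value, prefer returning the new value over writing through a pointer.
  • For structure-like parameters, pass by reference or by pointer to reduce the overhead of copying the arguments onto the stack.
  • Avoid variadic functions; they are especially inefficient.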
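The guidelines above can be sketched in a few lines. The function names, the arithmetic, and the struct vec3 type are assumptions made up for this example.

```c
/* Discouraged for scalars: the caller's variable escapes through its address. */
void compute_into(int input, int *out)
{
    *out = input * 2 + 1;
}

/* Preferred for scalars: pass by value and return the result. */
int compute(int input)
{
    return input * 2 + 1;
}

struct vec3 { int x, y, z; };   /* hypothetical structure type */

/* Structures are passed by pointer to const to avoid copying them. */
int dot(const struct vec3 *a, const struct vec3 *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z;
}
```

A caller would write x = compute(5); rather than compute_into(5, &x);, which keeps x eligible to stay in a register.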