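A cache is a system design technique that speeds up programs by exploiting the principle of locality in computer systems. It takes advantage of both spatial locality and temporal locality.
Cache organization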
Direct-mapped caches perform poorly relative to set associative caches when multiple memory references conflict with each other, that is, when they line up in the same cache lines.
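To make the conflict case concrete, here is a minimal sketch of how a direct-mapped cache chooses a line for an address. The geometry (16 KB cache, 32-byte lines) is an assumed value picked for illustration, not taken from any particular core, and `cache_line_index` and `addresses_conflict` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry, for illustration only. */
#define CACHE_SIZE 16384u               /* total bytes */
#define LINE_SIZE  32u                  /* bytes per cache line */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

static_assert(CACHE_SIZE % LINE_SIZE == 0, "cache must hold whole lines");

/* Hypothetical helper: the line a byte address maps to in a direct-mapped cache. */
unsigned cache_line_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}

/* Two addresses conflict when they belong to different memory blocks
   but map to the same cache line, so each access evicts the other. */
int addresses_conflict(uintptr_t a, uintptr_t b)
{
    return cache_line_index(a) == cache_line_index(b)
        && (a / LINE_SIZE) != (b / LINE_SIZE);
}
```

With this geometry, two addresses exactly `CACHE_SIZE` bytes apart always conflict. In a set associative cache of the same size, both blocks could stay resident in the same set.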
cache miss
Instruction Cache miss
Cache miss simulation
After the reorder linker process, that is, after reorganizing the link step, the simulation result is as follows:
Data Cache miss
- branch penalty:
- The tool-chain might replace a 16-bit instruction with an equivalent 24-bit instruction, or might add padding to unexecuted regions of code, in order to better align branch targets.
- A 5-stage configuration will suffer significantly fewer branch delays than a 7-stage one.
- prefetch
Cache prefetch can fetch ahead by up to four cache lines.
#include <xtensa/hal.h>
int xthal_set_cache_prefetch(unsigned long long mode);
mode can be one of the following:
• XTHAL_PREFETCH_ENABLE (enable cache prefetch)
• XTHAL_PREFETCH_DISABLE (disable cache prefetch)
• XTHAL_ICACHE_PREFETCH_OFF (disable instruction cache prefetch)
• XTHAL_ICACHE_PREFETCH_LOW (enable, less aggressive prefetch)
• XTHAL_ICACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
• XTHAL_ICACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
• XTHAL_ICACHE_PREFETCH(n) (explicitly set the InstCtl field of the PREFCTL register to 0..15. See the Prefetch Architectural Additions section of the Prefetch Unit Option chapter in the Xtensa Microprocessor Data Book for details.)
and for data cache:
• XTHAL_DCACHE_PREFETCH_OFF (disable data cache prefetch)
• XTHAL_DCACHE_PREFETCH_LOW (enable, less aggressive prefetch)
• XTHAL_DCACHE_PREFETCH_MEDIUM (enable, midway aggressive prefetch)
• XTHAL_DCACHE_PREFETCH_HIGH (enable, more aggressive prefetch)
• XTHAL_DCACHE_PREFETCH(n) (explicitly set the DataCtl field of the PREFCTL register to 0..15. See the Prefetch Architectural Additions section of the Prefetch Unit Option chapter in the Xtensa Microprocessor Data Book for details.)
• XTHAL_DCACHE_PREFETCH_L1_OFF (prefetch data to prefetch buffers only)
• XTHAL_DCACHE_PREFETCH_L1 (on configurations that support it, prefetch directly to L1 data cache)
- software prefetch
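GCC builtin:
// rw is a compile-time constant, either 1 or 0: 1 means the access is a write (w), 0 means it is a read (r).
// locality is a compile-time constant from 0 (no temporal locality) to 3 (high temporal locality).
// Both rw and locality are optional; they default to 0 and 3.
void __builtin_prefetch( const void *addr, int rw, int locality );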
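As a rough sketch of how the builtin might be used, the loop below hints at data a few iterations ahead of the element being read. It assumes GCC or a compatible compiler such as Clang. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are made up for illustration, and a good distance depends on the hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed prefetch distance in elements; tune for the target hardware. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting that data a few iterations ahead will be read soon. */
long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    size_t i;

    assert(data != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* rw = 0: the data will be read; locality = 3: high temporal locality */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        }
        total += data[i];
    }
    return total;
}
```

The prefetch is only a hint and never changes the result. For a purely sequential walk like this one, hardware prefetch may already cover the access pattern, so software prefetch tends to be more useful for irregular accesses.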
- Loop organization
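Iterating over an array sequentially makes better use of the cache's spatial locality and improves the cache hit rate.
Of the two loops below, the first has better spatial locality: the elements of each row are stored at consecutive addresses, so walking along a row in the inner loop touches memory sequentially and lets each row be cache-prefetched, which raises the hit rate. The second loop jumps to a different row on every inner iteration.
#include <stdlib.h>
int hang = 1024*8;  /* elements per row */
int lie = 1024*8;   /* number of rows */
int i, j;
int **arr = (int **)malloc(sizeof(int*) * lie);
for(j = 0; j < lie; j++)
    arr[j] = (int *)calloc(hang, sizeof(int));  /* allocate and zero each row */
/* Loop 1: the inner loop walks along one row (sequential addresses) */
for(j = 0; j < lie; j++) {
    for(i = 0; i < hang; i++) {
        arr[j][i] ++;
    }
}
/* Loop 2: the inner loop jumps from row to row (strided addresses) */
for(i = 0; i < hang; i++) {
    for(j = 0; j < lie; j++) {
        arr[j][i] ++;
    }
}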
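The two traversal orders can also be sketched on a contiguous two-dimensional array, where the difference in access pattern is easier to see. The array size and the function names below are assumptions chosen for illustration. Both functions compute the same result and differ only in the order in which memory is touched.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 256

/* Row-by-row: the inner loop touches consecutive addresses. */
void increment_row_major(int g[ROWS][COLS])
{
    int r, c;
    assert(g != NULL);
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            g[r][c]++;
}

/* Column-by-column: the inner loop strides by COLS * sizeof(int) bytes. */
void increment_col_major(int g[ROWS][COLS])
{
    int r, c;
    assert(g != NULL);
    for (c = 0; c < COLS; c++)
        for (r = 0; r < ROWS; r++)
            g[r][c]++;
}

/* Sum of all elements, used to check that both orders do the same work. */
long grid_sum(int g[ROWS][COLS])
{
    long total = 0;
    int r, c;
    assert(g != NULL);
    for (r = 0; r < ROWS; r++)
        for (c = 0; c < COLS; c++)
            total += g[r][c];
    return total;
}
```

Timing the two functions on the target, for example with a cycle counter, is the usual way to confirm how much the access order matters for a given cache configuration.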
Other optimization directions:
- Avoid Short Scalar Datatypes
- Use Locals Instead of Globals
If a global variable is not assigned inside a loop and is only passed along as an argument, copy it into a local variable before the loop. Doing so saves a load of g into a register on every loop iteration.
int g;
void foo()
{
int i;
for (i=0; i<100; i++){
fred(i,g);
}
}
After optimization:
int g;
void foo()
{
int i, local_g=g;
for (i=0; i<100; i++){
fred(i,local_g);
}
}
Alternatively, use the pure attribute:
int g;
void __attribute__ ((pure)) fred(int, int);
void foo()
{
int i;
for (i=0; i<100; i++){
fred(i,g);
}
}
- If the function fred does not read or write any global variables, so that its result depends only on its function arguments, you can mark the function with the const attribute.
- If the function fred reads but does not write global variables, you can instead use the pure attribute. For this example, both const and pure will eliminate the load.
- In other examples where the variable is written in the calling function, the use of const will eliminate a store but pure will not.
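The difference between the two attributes, as GCC defines them, can be sketched as follows. The functions `square`, `scaled`, and `sum_scaled_squares` and the global `scale` are hypothetical names used only for this illustration.

```c
#include <assert.h>

int scale = 3;  /* global that scaled() reads but never writes */

/* const: the result depends only on the argument values;
   the function does not read or write global memory. */
__attribute__((const)) int square(int x)
{
    return x * x;
}

/* pure: the function may read global memory (here: scale)
   but has no side effects. */
__attribute__((pure)) int scaled(int x)
{
    return x * scale;
}

/* Neither callee writes global memory, so the compiler does not have to
   assume that globals change across these calls. */
int sum_scaled_squares(int n)
{
    int i, total = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        total += scaled(i) + square(i);
    return total;
}
```

Marking a function const or pure is a promise to the compiler. If the function actually violates it, the generated code may be wrong, so the attribute should only be used when the property really holds.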
- Use Arrays Instead of Pointers
for (i=0; i<100; i++) *p++ = ...
In every iteration of the loop, *p is being assigned, but so is the pointer p. Depending on circumstances, the assignment to the pointer can hinder optimization. In some cases it is possible that the assignment to *p changes the value of the pointer itself, forcing the compiler to generate code to reload and increment the pointer during each iteration. In other cases, the compiler cannot prove that the pointer is not used outside the loop, and the compiler must therefore generate code after the loop to update the pointer with its incremented value. To avoid these types of problems, it is better to use arrays rather than pointers, as shown below.
After optimization:
for (i=0; i<100; i++) p[i] = ...
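A compilable sketch of the two forms is given here for comparison. The function names are hypothetical. Both functions write the same values, and the only difference is whether the pointer itself is modified inside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer form: both *p and p itself are updated on every iteration. */
void fill_with_pointer(int *p, size_t n, int value)
{
    size_t i;

    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++)
        *p++ = value;
}

/* Array form: the base address stays fixed and only the index changes. */
void fill_with_index(int *p, size_t n, int value)
{
    size_t i;

    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++)
        p[i] = value;
}
```

Whether the two forms compile to different code depends on the compiler and optimization level, so it is worth checking the generated assembly for hot loops.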
- Minimizing Conditionals
Use a lookup array in place of conditional tests, and keep the number of conditionals to a minimum.
Every taken branch incurs at least a two-cycle penalty.
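As one possible illustration of replacing a conditional with a lookup array, the sketch below converts a value from 0 to 15 into a hexadecimal digit in two ways. The function names are hypothetical.

```c
#include <assert.h>

/* Branching version: the result is selected by a conditional. */
char hex_digit_branch(unsigned v)
{
    assert(v < 16);  /* range check, compiled out when NDEBUG is defined */
    if (v < 10)
        return (char)('0' + v);
    return (char)('a' + (v - 10));
}

/* Table version: the value indexes a constant array, so no branch is needed. */
char hex_digit_table(unsigned v)
{
    static const char digits[] = "0123456789abcdef";

    assert(v < 16);  /* range check, compiled out when NDEBUG is defined */
    return digits[v];
}
```

The table costs one load from memory, so it pays off mainly when the branch is hard to predict or the chain of conditions is long.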
- Use Direct Calls
Avoid indirect calls, that is, calls made through function pointers. Particularly with interprocedural analysis (IPA), the compiler is not able to analyze indirect calls and must assume that an indirect function might cause unknown side effects such as modifying global or pointer variables. Even without IPA, every indirect call requires that the address of the called function be loaded, leading to additional overhead.
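One way to turn an indirect call into direct calls, sketched under the assumption that the set of possible callees is small and known at compile time, is to select among them explicitly. All names below are hypothetical.

```c
#include <assert.h>

enum op { OP_ADD, OP_SUB };

int add(int a, int b) { return a + b; }
int sub(int a, int b) { return a - b; }

/* Indirect: the callee is only known at run time, through a function pointer. */
int apply_indirect(int (*fn)(int, int), int a, int b)
{
    assert(fn != 0);
    return fn(a, b);
}

/* Direct: every call site names its callee, so the compiler can analyze
   or inline it. */
int apply_direct(enum op op, int a, int b)
{
    switch (op) {
    case OP_ADD: return add(a, b);
    case OP_SUB: return sub(a, b);
    }
    assert(0 && "unknown op");
    return 0;
}
```

This trades the indirect call for a conditional, so it is a judgment call: it tends to help when the callees are small enough to inline and the selector is predictable.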
- Passing Function Parameters
Consider a situation where you want to write a function that computes the value of some variable, x, in the caller. You can either have the function return a value and assign the result of the function to x, or alternatively, you can pass the address of x into the function and have the function assign *x directly inside the function. It is better to have the function return a value. By passing the address of a variable into the function, the compiler must assume that the address is saved away by the function and any pointer dereference anywhere in the program might actually change the value of x.
Similarly, scalar variables should always be passed by value. Passing the address of a scalar variable forces the compiler to conservatively assume that the address of the variable is saved by the function.
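- For simple parameters, if the function is expected to modify the value, prefer to deliver the new value through the return value.
- For parameters such as structs, prefer passing by reference or by pointer, which reduces the overhead of pushing arguments onto the stack.
- Avoid variadic functions; they are especially inefficient.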
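The parameter-passing guidelines above can be sketched as follows. The struct `sample` and the function names are hypothetical and exist only to contrast the styles.

```c
#include <assert.h>
#include <stddef.h>

struct sample {
    int values[16];
    int count;      /* number of valid entries in values */
};

/* Preferred for scalars: take the value, return the result. */
int clamp_to_byte(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return x;
}

/* Shown for contrast: the caller's variable has its address taken,
   so the compiler must be more conservative about it. */
void clamp_to_byte_inplace(int *x)
{
    assert(x != NULL);
    *x = clamp_to_byte(*x);
}

/* Larger aggregates: pass a pointer to const instead of copying the struct. */
int sample_total(const struct sample *s)
{
    int i, total = 0;

    assert(s != NULL && s->count >= 0 && s->count <= 16);
    for (i = 0; i < s->count; i++)
        total += s->values[i];
    return total;
}
```

A caller would write `x = clamp_to_byte(x);` rather than `clamp_to_byte_inplace(&x);`, which keeps `x` eligible to stay in a register.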