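How it works: processors are sensitive to branch instructions. Every conditional branch goes through branch prediction, and a loop is itself a branch, because each iteration ends with a compare-and-jump that decides whether to continue. That per-iteration overhead slows the program down. By analogy, think of the processor as a racer and the code as the track. The racer has to slow down at every curve (that is, at every branch point), so a track with fewer curves takes less time to finish. The loop condition is one such curve: the more iterations, the more curves, and the longer the loop takes to run.
So to improve efficiency, you can unroll the loop to a moderate degree, but pay attention to how well the unrolled body matches the cache line size.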
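As a minimal sketch of what unrolling looks like in general, the function below sums an array four elements per iteration and then finishes any leftover elements in a plain loop, so it also works when the length is not a multiple of 4. The name `sum_unrolled4` and its signature are illustrative assumptions, not part of the experiment that follows.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n ints with the loop unrolled by 4 (illustrative helper). */
long long sum_unrolled4(const int *p, size_t n)
{
    assert(p != NULL || n == 0);
    long long x = 0;   /* wide accumulator so large sums do not overflow */
    size_t i = 0;

    /* Main body: one loop-condition check for every four elements. */
    for (; i + 4 <= n; i += 4) {
        x += p[i];
        x += p[i + 1];
        x += p[i + 2];
        x += p[i + 3];
    }

    /* Remainder: the last n % 4 elements, one at a time. */
    for (; i < n; i++) {
        x += p[i];
    }
    return x;
}
```

Calling `sum_unrolled4(arr, len)` gives the same result as a plain loop. The remainder loop is what keeps the unrolled version correct for any length.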
The experiments below were run on Ubuntu.
![](https://i-blog.csdnimg.cn/blog_migrate/365a19c1f877fe414efd8d30af05f6af.png)
#include <stdio.h>
#include <sys/time.h>
/* Baseline: no unrolling, one element per iteration. */
void test1(void) {
    struct timeval start1;
    struct timeval end1;
    long long x = 0;   /* 0 + 1 + ... + 79999 exceeds INT_MAX, so use a wider type */
    int p[80000];
    for (int i = 0; i < 80000; i++) {
        p[i] = i;
    }
    gettimeofday(&start1, NULL);
    for (int a = 0; a < 80000; a++) {
        x += p[a];
    }
    gettimeofday(&end1, NULL);
    /* Elapsed time in microseconds. */
    long speedtime = (long)((end1.tv_sec - start1.tv_sec) * 1000000 + (end1.tv_usec - start1.tv_usec));
    printf("%lld gettimeofday = %ld\n", x, speedtime);
}
/* Unrolled by 2: two elements per iteration. */
void test2(void) {
    struct timeval start1;
    struct timeval end1;
    long long x = 0;   /* 0 + 1 + ... + 79999 exceeds INT_MAX, so use a wider type */
    int p[80000];
    for (int i = 0; i < 80000; i++) {
        p[i] = i;
    }
    gettimeofday(&start1, NULL);
    for (int a = 0; a < 80000; a += 2) {
        x += p[a];
        x += p[a + 1];
    }
    gettimeofday(&end1, NULL);
    /* Elapsed time in microseconds. */
    long speedtime = (long)((end1.tv_sec - start1.tv_sec) * 1000000 + (end1.tv_usec - start1.tv_usec));
    printf("%lld gettimeofday = %ld\n", x, speedtime);
}
/* Unrolled by 4: four elements per iteration. */
void test3(void) {
    struct timeval start1;
    struct timeval end1;
    long long x = 0;   /* 0 + 1 + ... + 79999 exceeds INT_MAX, so use a wider type */
    int p[80000];
    for (int i = 0; i < 80000; i++) {
        p[i] = i;
    }
    gettimeofday(&start1, NULL);
    for (int a = 0; a < 80000; a += 4) {
        x += p[a];
        x += p[a + 1];
        x += p[a + 2];
        x += p[a + 3];
    }
    gettimeofday(&end1, NULL);
    /* Elapsed time in microseconds. */
    long speedtime = (long)((end1.tv_sec - start1.tv_sec) * 1000000 + (end1.tv_usec - start1.tv_usec));
    printf("%lld gettimeofday = %ld\n", x, speedtime);
}
/* Unrolled by 8: eight elements per iteration. */
void test4(void) {
    struct timeval start1;
    struct timeval end1;
    long long x = 0;   /* 0 + 1 + ... + 79999 exceeds INT_MAX, so use a wider type */
    int p[80000];
    for (int i = 0; i < 80000; i++) {
        p[i] = i;
    }
    gettimeofday(&start1, NULL);
    for (int a = 0; a < 80000; a += 8) {
        x += p[a];
        x += p[a + 1];
        x += p[a + 2];
        x += p[a + 3];
        x += p[a + 4];
        x += p[a + 5];
        x += p[a + 6];
        x += p[a + 7];
    }
    gettimeofday(&end1, NULL);
    /* Elapsed time in microseconds. */
    long speedtime = (long)((end1.tv_sec - start1.tv_sec) * 1000000 + (end1.tv_usec - start1.tv_usec));
    printf("%lld gettimeofday = %ld\n", x, speedtime);
}
/* Unrolled by 16: sixteen elements per iteration. */
void test5(void) {
    struct timeval start1;
    struct timeval end1;
    long long x = 0;   /* 0 + 1 + ... + 79999 exceeds INT_MAX, so use a wider type */
    int p[80000];
    for (int i = 0; i < 80000; i++) {
        p[i] = i;
    }
    gettimeofday(&start1, NULL);
    for (int a = 0; a < 80000; a += 16) {
        x += p[a];
        x += p[a + 1];
        x += p[a + 2];
        x += p[a + 3];
        x += p[a + 4];
        x += p[a + 5];
        x += p[a + 6];
        x += p[a + 7];
        x += p[a + 8];
        x += p[a + 9];
        x += p[a + 10];
        x += p[a + 11];
        x += p[a + 12];
        x += p[a + 13];
        x += p[a + 14];
        x += p[a + 15];
    }
    gettimeofday(&end1, NULL);
    /* Elapsed time in microseconds. */
    long speedtime = (long)((end1.tv_sec - start1.tv_sec) * 1000000 + (end1.tv_usec - start1.tv_usec));
    printf("%lld gettimeofday = %ld\n", x, speedtime);
}
int main(void)
{
    test1();
    test2();
    test3();
    test4();
    test5();
    return 0;
}
Functions test1 through test5 unroll the loop by factors of 1, 2, 4, 8, and 16, respectively.
The results are as follows:
![](https://i-blog.csdnimg.cn/blog_migrate/b06317545af1bb315e463738084be7c5.png)
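Each figure above comes from a single run timed with `gettimeofday`, so a few microseconds of difference can easily be noise. One way to make such comparisons steadier, sketched here as an assumption rather than as the method used for the screenshot, is to time each variant several times with the monotonic clock and keep the fastest run. The names `elapsed_ns`, `best_of`, `struct sum_job`, and `sum_job_run` are hypothetical helpers introduced for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Elapsed nanoseconds between two timespec readings. */
int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

/* Run fn(arg) reps times and return the fastest run, in nanoseconds. */
int64_t best_of(int reps, void (*fn)(void *), void *arg)
{
    assert(reps > 0 && fn != NULL);
    int64_t best = INT64_MAX;
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        fn(arg);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        int64_t ns = elapsed_ns(t0, t1);
        if (ns < best) {
            best = ns;
        }
    }
    return best;
}

/* Hypothetical workload: sum an int array and store the result,
   so the compiler cannot discard the loop as unused work. */
struct sum_job {
    const int *p;
    int n;
    long long result;
};

void sum_job_run(void *arg)
{
    struct sum_job *job = arg;
    long long x = 0;
    for (int i = 0; i < job->n; i++) {
        x += job->p[i];
    }
    job->result = x;
}
```

A call such as `best_of(100, sum_job_run, &job)` then gives one number per variant. Taking the minimum filters out runs that were disturbed by the scheduler or by a cold cache.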
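Comparing the runs, unrolling by 4 is the fastest. Why does unrolling further make things slower?
This comes back to the cache line size mentioned at the start. Assuming a 16-byte cache line and a 4-byte int, unrolling by 4 means each iteration reads exactly 16 bytes, so it only touches a single cache line (provided the array is aligned to a line boundary). Note that 16 bytes is the figure assumed here. Many current x86 processors use 64-byte lines, so check the actual value on your machine, for example with `getconf LEVEL1_DCACHE_LINESIZE` on Linux.
With 8-way unrolling, each iteration needs 32 bytes, which under that assumption means loading two cache lines. That takes longer, so efficiency drops.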
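Since the argument depends on the cache line size, it helps to check that value on the machine under test. The sketch below assumes Linux with glibc, where `sysconf(_SC_LEVEL1_DCACHE_LINESIZE)` is available as an extension. It can report 0 or -1 when the size is unknown, so the helper takes a fallback value. The function names `l1d_line_size` and `ints_per_line` are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* L1 data cache line size in bytes, or fallback if it cannot be queried. */
long l1d_line_size(long fallback)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE   /* glibc extension, not in POSIX */
    long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (sz > 0) {
        return sz;
    }
#endif
    return fallback;
}

/* Number of int elements that fit in one cache line of line_bytes bytes. */
size_t ints_per_line(long line_bytes)
{
    assert(line_bytes > 0);
    return (size_t)line_bytes / sizeof(int);
}
```

With a 4-byte int, `ints_per_line(16)` is 4 and `ints_per_line(64)` is 16, so the unroll factor that covers exactly one line depends on what `l1d_line_size` reports for the machine.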