让我们反编译看看GCC 4.8用它做什么
没有__builtin_expect
#include "stdio.h"
#include "time.h"
int main() {
/* Use time to prevent it from being optimized away. */
int i = !time(NULL);
if (i)
printf("%d\n", i);
puts("a");
return 0;
}
使用GCC 4.8.2 x86_64 Linux编译和反编译:
gcc -c -O3 -std=gnu11 main.c
objdump -dr main.o
输出:
0000000000000000 :
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 75 14 jne 24
10: ba 01 00 00 00 mov $0x1,%edx
15: be 00 00 00 00 mov $0x0,%esi
16: R_X86_64_32 .rodata.str1.1
1a: bf 01 00 00 00 mov $0x1,%edi
1f: e8 00 00 00 00 callq 24
20: R_X86_64_PC32 __printf_chk-0x4
24: bf 00 00 00 00 mov $0x0,%edi
25: R_X86_64_32 .rodata.str1.1+0x4
29: e8 00 00 00 00 callq 2e
2a: R_X86_64_PC32 puts-0x4
2e: 31 c0 xor %eax,%eax
30: 48 83 c4 08 add $0x8,%rsp
34: c3 retq
内存中的指令顺序没有变化:首先是printf然后是puts和retq返回。
用__builtin_expect
现在将if (i)替换为:
if (__builtin_expect(i, 0))
我们得到:
0000000000000000 :
0: 48 83 ec 08 sub $0x8,%rsp
4: 31 ff xor %edi,%edi
6: e8 00 00 00 00 callq b
7: R_X86_64_PC32 time-0x4
b: 48 85 c0 test %rax,%rax
e: 74 11 je 21
10: bf 00 00 00 00 mov $0x0,%edi
11: R_X86_64_32 .rodata.str1.1+0x4
15: e8 00 00 00 00 callq 1a
16: R_X86_64_PC32 puts-0x4
1a: 31 c0 xor %eax,%eax
1c: 48 83 c4 08 add $0x8,%rsp
20: c3 retq
21: ba 01 00 00 00 mov $0x1,%edx
26: be 00 00 00 00 mov $0x0,%esi
27: R_X86_64_32 .rodata.str1.1
2b: bf 01 00 00 00 mov $0x1,%edi
30: e8 00 00 00 00 callq 35
31: R_X86_64_PC32 __printf_chk-0x4
35: eb d9 jmp 10
printf(编译为__printf_chk)移动到函数的最后,在puts之后,返回以改进其他答案所提到的分支预测。
所以它基本上是相同的:
int i = !time(NULL);
if (i)
goto printf;
puts:
puts("a");
return 0;
printf:
printf("%d\n", i);
goto puts;
这种优化不是用-O0完成的。
但是写一个运行速度比without5ѭ快的例子好运,那些时候CPU非常聪明。我天真的尝试就在这里。