(Shamelessly copied, amateur-level notes; read at your own risk.)
In code optimization, multi-versioning is a well-known approach to generating code that adapts to a changing execution context: several versions of an original code snippet are generated at compile time, each the result of a different set of optimizing transformations. When the resulting code runs, a runtime decision selects the appropriate version to execute.
Loops get the same kind of treatment: different optimized versions are used in different runtime situations.
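A hand-rolled sketch of the idea (hypothetical kernel and selection criterion; real compilers key the runtime check on CPU features, alignment, or trip count):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>

// Version 1: straightforward scalar accumulation.
long sum_simple(const int* p, std::size_t n) {
    return std::accumulate(p, p + n, 0L);
}

// Version 2: "optimized" variant with two accumulators, two elements per iteration.
long sum_blocked(const int* p, std::size_t n) {
    long s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) { s0 += p[i]; s1 += p[i + 1]; }
    if (i < n) s0 += p[i];          // leftover element
    return s0 + s1;
}

// Both versions exist in the binary; a runtime check picks one.
// The alignment test here is a made-up stand-in for a real context check.
long sum_ints(const int* p, std::size_t n) {
    bool aligned = reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
    return aligned ? sum_blocked(p, n) : sum_simple(p, n);
}
```

Both versions compute the same result; only the cost differs, which is exactly why the selection can be deferred to runtime.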
GCC
loop vectorization
#include <array>
#include <vector>
#include <algorithm>
void foo(std::array<int, 256> &v) {
  std::generate(begin(v), end(v), [](){ return 1; });
}
Note: shamelessly copied from https://www.youtube.com/watch?v=aB4ZxvPcTUg
# gcc11.2.0 -O2
foo(std::array<int, 256ul>&):
leaq 1024(%rdi), %rax # end(v)
.L2:
movl $1, (%rdi) # store one int
addq $4, %rdi # advance by one int
cmpq %rax, %rdi # reached end(v)?
jne .L2
ret
As you can see, this is just an ordinary scalar loop.
# gcc11.2.0 -O2 -ftree-vectorize
foo(std::array<int, 256ul>&):
# -----------------------------------------------
# | 0x1 | 0x1 | 0x1 | 0x1 | %xmm0
# -----------------------------------------------
movdqa .LC0(%rip), %xmm0 # Move Aligned Double Quadword: 2 quadwords = 16 bytes = 128 bits; xmm registers are 128 bits wide
leaq 1024(%rdi), %rax # (rdi * 1 + 0)+1024
.L2:
movups %xmm0, (%rdi) # Move Unaligned Packed Single-Precision Floating-Point Values
addq $16, %rdi # advance 16 bytes: 4 ints stored per iteration (vectorized, not unrolled)
cmpq %rax, %rdi # reached end(v)?
jne .L2
ret
.LC0: # local label: the 128-bit constant {1, 1, 1, 1}
.long 1
.long 1
.long 1
.long 1
Adding -ftree-vectorize enables loop vectorization. If we additionally pass -march=core-avx2, GCC enables AVX2 and uses 256-bit ymm registers.
# gcc11.2.0 -O2 -ftree-vectorize -march=core-avx2
foo(std::array<int, 256ul>&):
pushq %rbp
vpbroadcastd .LC1(%rip), %ymm0 # broadcast the dword 1 to all 8 lanes of %ymm0
leaq 1024(%rdi), %rax
movq %rsp, %rbp
.L2:
vmovdqu %ymm0, (%rdi) # Move Unaligned Packed Integer Values: 32 bytes per store
addq $32, %rdi # advance 32 bytes: 8 ints per iteration
cmpq %rax, %rdi # reached end(v)?
jne .L2
vzeroupper # avoid AVX-to-SSE transition penalties
popq %rbp
ret
.LC1:
.long 1
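The effect of the wider registers can be mimicked in plain C++ (a sketch; fill_ones_stride is a made-up helper, and memcpy stands in for the movups/vmovdqu stores): the SSE loop strides 16 bytes (4 ints) per iteration, the AVX2 loop 32 bytes (8 ints).

```cpp
#include <cstddef>
#include <cstring>

// stride_ints is 4 for the SSE (xmm) version and 8 for the AVX2 (ymm) version.
void fill_ones_stride(int* p, std::size_t n, std::size_t stride_ints) {
    static const int ones[8] = {1, 1, 1, 1, 1, 1, 1, 1}; // like .LC0 / .LC1
    std::size_t i = 0;
    for (; i + stride_ints <= n; i += stride_ints)
        std::memcpy(p + i, ones, stride_ints * sizeof(int)); // one "vector" store
    for (; i < n; ++i) p[i] = 1; // scalar tail (empty when stride divides n)
}
```

Doubling the stride halves the number of compare-and-branch iterations, which is where the AVX2 version wins.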
A slightly more complex loop vectorization
Above, GCC could optimize the original code cleanly: 256 ints divide evenly into 128-bit (4-int) vector stores. But if we change std::array<int, 256> to std::array<int, 259>, we get the following code:
foo(std::array<int, 259ul>&):
movdqa .LC0(%rip), %xmm0
movq %rdi, %rax
leaq 1024(%rdi), %rdx
.L2:
movups %xmm0, (%rax)
addq $16, %rax
cmpq %rax, %rdx
jne .L2
movq .LC1(%rip), %rax # extra epilogue for the 3 leftover elements:
movl $1, 1032(%rdi) # a 4-byte store writes element 258
movq %rax, 1024(%rdi) # an 8-byte store writes elements 256 and 257
ret
.LC0:
.long 1
.long 1
.long 1
.long 1
.set .LC1,.LC0 # .LC1 aliases .LC0; its first 8 bytes are two 1s
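In C++ terms, what GCC generated for 259 elements looks like this (a sketch; fill_259 is a made-up name, and memcpy stands in for the movups/movq/movl stores):

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// 64 iterations of 16-byte stores cover elements 0..255 (1024 bytes);
// the three leftover ints are peeled into one 8-byte and one 4-byte store.
void fill_259(std::array<int, 259>& v) {
    const int ones[4] = {1, 1, 1, 1};          // .LC0
    char* base = reinterpret_cast<char*>(v.data());
    for (std::size_t off = 0; off != 1024; off += 16)
        std::memcpy(base + off, ones, 16);     // movups %xmm0, (%rax)
    std::memcpy(base + 1024, ones, 8);         // movq: elements 256 and 257
    std::memcpy(base + 1032, ones, 4);         // movl: element 258
}
```

Because the trip count is a compile-time constant, GCC can peel the remainder into straight-line stores instead of emitting a second loop.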
A more complex loop vectorization
With std::array<int, 256>, the size to vectorize is known at compile time. If we change std::array to std::vector, the situation gets more complicated.
#include <array>
#include <vector>
#include <algorithm>
void foo(std::vector<int> &v) {
  std::generate(begin(v), end(v), [](){ return 1; });
}
The generated assembly is as follows:
foo(std::vector<int, std::allocator<int> >&):
movq 8(%rdi), %rcx # end (v._M_finish)
movq (%rdi), %rax # begin (v._M_start)
cmpq %rcx, %rax # if the vector is empty, return immediately
je .L1
leaq -4(%rcx), %rsi # end - 4 -> %rsi (address of the last element)
movq %rax, %rdx
subq %rax, %rsi # %rsi - %rax -> %rsi: byte distance from begin to the last element
movq %rsi, %rdi # copy the byte distance
shrq $2, %rdi # / 4 -> index of the last element (size - 1)
addq $1, %rdi # + 1 -> %rdi = element count (std::vector size)
cmpq $8, %rsi
jbe .L6 # `jbe` is the same as `jle`, except that it performs an unsigned comparison.
# if the byte distance <= 8 (i.e. size <= 3 elements), jump straight to the scalar loop .L6
movq %rdi, %rsi # element count
movdqa .LC0(%rip), %xmm0
shrq $2, %rsi # count / 4 = number of vector iterations
salq $4, %rsi # * 16 bytes per iteration
addq %rax, %rsi # begin + bytes -> %rsi = end address of the vector loop
.L4: # vectorization loop version
movups %xmm0, (%rdx) # store 4 ints
addq $16, %rdx # advance 16 bytes: 4 ints (128 bits) per iteration
cmpq %rsi, %rdx # reached the vector-loop end address?
jne .L4
movq %rdi, %rdx #
andq $-4, %rdx # count & ~3 (clear the low 2 bits): elements covered by the vector loop
leaq (%rax,%rdx,4), %rax # %rax = begin + %rdx * 4: first element not yet written
cmpq %rdx, %rdi # did the vector loop cover every element?
je .L1 # yes: done; otherwise fall through to the scalar loop
.L6:
movl $1, (%rax) # scalar loop: handles small vectors (size <= 3) and the remainder after the vector loop
addq $4, %rax
cmpq %rax, %rcx
jne .L6
.L1:
ret
.LC0:
.long 1
.long 1
.long 1
.long 1
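The control flow above, rewritten as C++ (a sketch of what the assembly does, assuming 4-byte int; fill_ones_like_asm is a made-up name, not GCC output):

```cpp
#include <cstddef>
#include <cstring>

void fill_ones_like_asm(int* begin, int* end) {
    if (begin == end) return;                        // je .L1: empty vector
    std::size_t last_off = (end - 1 - begin) * sizeof(int); // leaq/subq: bytes to last element
    std::size_t n = last_off / 4 + 1;                // shrq/addq: element count (%rdi)
    int* p = begin;
    if (last_off > 8) {                              // jbe .L6 taken when size <= 3
        const int ones[4] = {1, 1, 1, 1};            // .LC0 constant
        std::size_t vec_n = n & ~std::size_t{3};     // andq $-4: round down to multiple of 4
        for (int* vec_end = begin + vec_n; p != vec_end; p += 4)
            std::memcpy(p, ones, 16);                // .L4: one 16-byte store per iteration
        if (vec_n == n) return;                      // cmpq/je: no remainder
    }
    for (; p != end; ++p) *p = 1;                    // .L6: scalar loop
}
```

This is multi-versioning in miniature: one function body contains a vectorized version and a scalar version, and a cheap runtime test on the size picks between them.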
Passing -fopt-info-all prints the optimization details:
# -O2 -ftree-vectorize -fopt-info-all -g3
/opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_algo.h:4431:22: optimized: loop vectorized using 16 byte vectors
<source>:5:6: note: vectorized 1 loops in function.
<source>:7:1: note: ***** Analysis failed with vector mode V2DI
<source>:7:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
V2DI and V16QI deserve explanation: both are GCC vector modes (lane count x element type), e.g. V2DI is 2 x 64-bit ints and V16QI is 16 x 8-bit ints, both 128 bits wide; for details see VECTOR_MODES (INT, 16); /* V16QI V8HI V4SI V2DI */ in the GCC sources. If AVX2 is enabled, the vector modes become V32QI and V4DI.
Notice that although the loop was vectorized, it was not unrolled. We will come back to loop unrolling later.