[loop optimization] loop versioning

Shamelessly borrowed from other sources by an amateur; these are personal notes, so read at your own risk.

In code optimization, multi-versioning is a well-known approach to generate code that may adapt to a changing execution context: several versions of an original code snippet are generated at compile-time, each one being the result of some different optimizing transformations. When launching the resulting code, some runtime decisions are performed in order to select the convenient version to be run.

Loops get a similar treatment: the compiler emits several versions of the same loop and picks the one that suits the runtime situation.
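
To make the idea concrete before looking at compiler output, here is a minimal hand-written sketch of loop versioning. It is not taken from the original notes: the runtime condition chosen here (whether the two buffers overlap) and the names add_no_alias, add_may_alias, disjoint and add_versioned are all illustrative assumptions, and __restrict is a compiler extension rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fast version: the caller promises that dst and src do not overlap,
// so the compiler is free to vectorize (__restrict is a compiler extension).
static void add_no_alias(int* __restrict dst, const int* __restrict src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}

// Safe version: makes no assumption about overlap.
static void add_may_alias(int* dst, const int* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}

// True when the byte ranges [a, a + n) and [b, b + n) do not overlap.
static bool disjoint(const int* a, const int* b, std::size_t n) {
    const auto pa = reinterpret_cast<std::uintptr_t>(a);
    const auto pb = reinterpret_cast<std::uintptr_t>(b);
    const std::uintptr_t bytes = n * sizeof(int);
    return pa + bytes <= pb || pb + bytes <= pa;
}

// The "versioned" loop: a runtime check selects one of the two versions.
// Returns true when the fast version was taken (hypothetical, for inspection only).
bool add_versioned(int* dst, const int* src, std::size_t n) {
    if (disjoint(dst, src, n)) {
        add_no_alias(dst, src, n);
        return true;
    }
    add_may_alias(dst, src, n);
    return false;
}
```

Calling add_versioned on two separate arrays takes the fast path, while calling it with dst = buf + 1 and src = buf falls back to the safe loop. A compiler that versions a loop does the same thing automatically: it emits both loops plus the guarding check.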

GCC

loop vectorization

#include <array>
#include <vector>
#include <algorithm>

void foo(std::array<int, 256> &v) {
    std::generate(begin(v), end(v), [](){ return 1;});
}

Note: taken wholesale from https://www.youtube.com/watch?v=aB4ZxvPcTUg

# gcc11.2.0 -O2
foo(std::array<int, 256ul>&):
        leaq    1024(%rdi), %rax # end(v)
.L2:
        movl    $1, (%rdi)       # store 1 into the current element
        addq    $4, %rdi         # advance to the next int
        cmpq    %rax, %rdi       # reached end(v)?
        jne     .L2
        ret

As you can see, the code above is just an ordinary scalar loop.

# gcc11.2.0 -O2 -ftree-vectorize
foo(std::array<int, 256ul>&):
        #  -----------------------------------------------
        # |    0x1    |    0x1    |    0x1    |    0x1    | %xmm0   
        #  -----------------------------------------------
        movdqa  .LC0(%rip), %xmm0   # Move Aligned Double Quadword: loads 16 bytes (128 bits); xmm registers are 128 bits wide
        leaq    1024(%rdi), %rax    # rax = rdi + 1024 = end(v)
.L2:
        movups  %xmm0, (%rdi)       # Move Unaligned Packed Single-Precision Floating-Point; here it simply stores 16 bytes
        addq    $16, %rdi           # each iteration stores 4 ints (this is vectorization, not loop unrolling)
        cmpq    %rax, %rdi          # reached end(v)?
        jne     .L2 
        ret
.LC0:                               # local label for the constant vector {1, 1, 1, 1}
        .long   1
        .long   1
        .long   1
        .long   1

Adding -ftree-vectorize enables loop vectorization. Adding -march=core-avx2 on top of that enables AVX2, whose ymm registers are 256 bits wide, so each iteration stores 8 ints.

# gcc11.2.0 -O2 -ftree-vectorize -march=core-avx2
foo(std::array<int, 256ul>&):
        pushq   %rbp
        vpbroadcastd    .LC1(%rip), %ymm0
        leaq    1024(%rdi), %rax
        movq    %rsp, %rbp
.L2:
        vmovdqu %ymm0, (%rdi)
        addq    $32, %rdi
        cmpq    %rax, %rdi
        jne     .L2
        vzeroupper
        popq    %rbp
        ret
.LC1:
        .long   1

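A slightly more complex loop vectorization

Above, GCC optimized the original code: 256 ints divide evenly into 128-bit vectors of 4 ints each, so the vectorized loop covers the whole array. If we change std::array<int, 256> to std::array<int, 259>, we get the following code: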

foo(std::array<int, 259ul>&):
        movdqa  .LC0(%rip), %xmm0
        movq    %rdi, %rax
        leaq    1024(%rdi), %rdx
.L2:
        movups  %xmm0, (%rax)
        addq    $16, %rax
        cmpq    %rax, %rdx
        jne     .L2
        movq    .LC1(%rip), %rax  # extra epilogue for the 3 leftover elements: rax = two ints {1, 1}
        movl    $1, 1032(%rdi)    # v[258] = 1
        movq    %rax, 1024(%rdi)  # v[256] = v[257] = 1
        ret
.LC0:
        .long   1
        .long   1
        .long   1
        .long   1
        .set    .LC1,.LC0
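
The split that GCC chose for 259 elements can be written out by hand. The following is only a sketch of the same arithmetic in plain C++, assuming a 4-byte int; the names kElems, kLanes, kVecElems, kTail and fill_259 are made up for illustration, and the inner group of four scalar stores merely stands in for one 16-byte movups.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kElems = 259;                          // std::array<int, 259>
constexpr std::size_t kLanes = 4;                            // ints per 128-bit store
constexpr std::size_t kVecElems = kElems / kLanes * kLanes;  // 256, covered by the vector loop
constexpr std::size_t kTail = kElems - kVecElems;            // 3, left for the epilogue

static_assert(sizeof(int) == 4, "sketch assumes a 4-byte int");
static_assert(kVecElems == 256, "64 vector iterations of 4 ints");
static_assert(kTail == 3, "three leftover elements");

void fill_259(std::array<int, kElems>& v) {
    std::size_t i = 0;
    // Main loop: 64 iterations, each standing in for one 16-byte store.
    for (; i < kVecElems; i += kLanes) {
        v[i] = 1;
        v[i + 1] = 1;
        v[i + 2] = 1;
        v[i + 3] = 1;
    }
    assert(i * sizeof(int) == 1024);  // byte offset where the epilogue starts
    // Epilogue, fully unrolled because the tail length is a compile-time constant.
    v[258] = 1;  // movl $1, 1032(%rdi)
    v[256] = 1;  // movq %rax, 1024(%rdi) writes v[256] and v[257] together
    v[257] = 1;
}
```

Because the trip count is a compile-time constant, no runtime check is needed here: the compiler knows exactly how many vector iterations run and how many elements are left over.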

CFG

A more complex loop vectorization

With std::array<int, 256>, the size to vectorize is known at compile time. If we change std::array to std::vector, things get more complicated.

#include <array>
#include <vector>
#include <algorithm>

void foo(std::vector<int> &v) {
    std::generate(begin(v), end(v), [](){ return 1;});
}

The generated assembly looks like this:

foo(std::vector<int, std::allocator<int> >&):
        movq    8(%rdi), %rcx     # rcx = end(v)
        movq    (%rdi), %rax      # rax = begin(v)
        cmpq    %rcx, %rax        # if the std::vector is empty, return immediately
        je      .L1            
        leaq    -4(%rcx), %rsi    # rsi = end - 4 (address of the last element)
        movq    %rax, %rdx        # rdx = begin, the store cursor for the vector loop
        subq    %rax, %rsi        # rsi = rsi - rax = (n - 1) * 4 bytes, where n = v.size()
        movq    %rsi, %rdi        # rdi = (n - 1) * 4
        shrq    $2, %rdi          # rdi = rdi >> 2 = n - 1
        addq    $1, %rdi          # rdi = n, the element count
        cmpq    $8, %rsi          # compare (n - 1) * 4 with 8
        jbe     .L6               # `jbe` is the unsigned counterpart of `jle`:
                                  # if (n - 1) * 4 <= 8, i.e. n <= 3, jump to the scalar loop .L6
        movq    %rdi, %rsi        # rsi = n
        movdqa  .LC0(%rip), %xmm0 # xmm0 = {1, 1, 1, 1}
        shrq    $2, %rsi          # rsi = n / 4, the number of vector iterations
        salq    $4, %rsi          # rsi = (n / 4) * 16 bytes
        addq    %rax, %rsi        # rsi = begin + (n / 4) * 16, the end address of the vector loop
.L4:                              # vectorized version of the loop
        movups  %xmm0, (%rdx)     # store 4 ints at once
        addq    $16, %rdx         # advance 16 bytes: 4 ints (128 bits) per iteration
        cmpq    %rsi, %rdx        # reached the end address of the vector loop?
        jne     .L4               
        movq    %rdi, %rdx        # rdx = n
        andq    $-4, %rdx         # rdx = n & ~3: clear the low 2 bits (mask 0b...11111100); fewer than 4 elements remain
        leaq    (%rax,%rdx,4), %rax # rax = begin + rdx * 4, address of the first element not yet written
        cmpq    %rdx, %rdi          # if n is a multiple of 4 the vector loop covered everything; otherwise fall through to the scalar loop
        je      .L1
.L6:
        movl    $1, (%rax)     # scalar loop: handles n <= 3 and the up-to-3 leftover elements
        addq    $4, %rax
        cmpq    %rax, %rcx
        jne     .L6
.L1:
        ret
.LC0:
        .long   1
        .long   1
        .long   1
        .long   1
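
The control flow of this assembly can be mirrored in plain C++. This is a hedged reconstruction of what the listing appears to do, not GCC's internal representation; fill_versioned and its return value (the number of elements written by the 4-wide loop) are hypothetical and exist only to make the chosen path observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns how many elements were written by the 4-wide loop (illustrative only).
std::size_t fill_versioned(std::vector<int>& v) {
    int* p = v.data();
    const std::size_t n = v.size();
    if (n == 0) return 0;                               // cmpq %rcx, %rax / je .L1

    std::size_t i = 0;
    if (n > 3) {                                        // jbe .L6 skips this block when n <= 3
        const std::size_t vec_n = n & ~std::size_t{3};  // andq $-4, %rdx
        for (; i < vec_n; i += 4) {                     // .L4: one 16-byte store per iteration
            p[i] = 1;
            p[i + 1] = 1;
            p[i + 2] = 1;
            p[i + 3] = 1;
        }
    }
    const std::size_t vectorized = i;

    for (; i < n; ++i) p[i] = 1;                        // .L6: scalar loop for the rest
    return vectorized;
}
```

For a vector of 259 elements the wide loop writes 256 of them and the scalar loop the remaining 3; for 3 elements or fewer only the scalar loop runs. Unlike the std::array case, the choice between the two loops has to be made at runtime, which is exactly the versioning this note is about.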

Passing -fopt-info-all prints the optimization details, as shown below:

# -O2 -ftree-vectorize -fopt-info-all -g3
/opt/compiler-explorer/gcc-11.2.0/include/c++/11.2.0/bits/stl_algo.h:4431:22: optimized: loop vectorized using 16 byte vectors
<source>:5:6: note: vectorized 1 loops in function.
<source>:7:1: note: ***** Analysis failed with vector mode V2DI
<source>:7:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI

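What needs explaining here is V2DI and V16QI: these are two different vector modes. For details see VECTOR_MODES (INT, 16); /* V16QI V8HI V4SI V2DI */. With AVX2 enabled, the corresponding vector modes would be V32QI and V4DI.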
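
The mode names encode the lane count and the lane width: QI, HI, SI and DI are GCC's 1-, 2-, 4- and 8-byte integer modes, so V4SI means 4 lanes of 32-bit integers. As a rough illustration, the same shapes can be spelled with the vector_size attribute from GCC's vector extensions (also accepted by Clang). The typedef names and the helper splat_one are chosen here for illustration and are not part of the original notes.

```cpp
#include <cassert>
#include <cstdint>

// 16-byte (128-bit) integer vectors, one per mode listed in VECTOR_MODES (INT, 16).
typedef std::int8_t  v16qi __attribute__((vector_size(16)));  // V16QI: 16 x 8-bit
typedef std::int16_t v8hi  __attribute__((vector_size(16)));  // V8HI:   8 x 16-bit
typedef std::int32_t v4si  __attribute__((vector_size(16)));  // V4SI:   4 x 32-bit
typedef std::int64_t v2di  __attribute__((vector_size(16)));  // V2DI:   2 x 64-bit

static_assert(sizeof(v16qi) == 16, "all four types occupy one 128-bit register");
static_assert(sizeof(v8hi) == 16, "all four types occupy one 128-bit register");
static_assert(sizeof(v4si) == 16, "all four types occupy one 128-bit register");
static_assert(sizeof(v2di) == 16, "all four types occupy one 128-bit register");

// Number of lanes in a V4SI vector.
constexpr int kV4SILanes = sizeof(v4si) / sizeof(std::int32_t);

// The constant that .LC0 holds in the listings above: {1, 1, 1, 1}.
v4si splat_one() {
    v4si r = {1, 1, 1, 1};
    return r;
}
```

The vectorized loops shown earlier store ints, so they use V4SI; the log lines about V2DI and V16QI report other modes that the vectorizer also tried or skipped during its analysis.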

We can see that although the loop was vectorized, no loop unrolling was applied. We will come back to loop unrolling later.
