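A Look at 64-bit Program Execution from the Assembly Level: a likely-Hinted Compiler Optimization Case and an Analysis of the Underlying Implementation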

breaksoftware

Published 2024-09-03 00:15:00


Series: A Look at 64-bit Program Execution from the Assembly Level. Tags: assembly, C++

Article link: https://blog.csdn.net/breaksoftware/article/details/141599792


Outline

Code
Analysis

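In "Modern C++: Using Branch Prediction to Optimize Code Performance" we showed how the likely attribute hints the compiler to optimize, but we also noted that the resulting optimization was not a reordering of the branches. So what optimization did the compiler actually perform to improve overall performance by 20%?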

Code

Let's first review the code:

#include <chrono>
#include <cmath>
#include <iomanip>
#include <iostream>
#include <random>
#include <functional>
 
namespace with_attributes {
    constexpr double pow(double x, long long n) noexcept {
        if (n <= 0) [[unlikely]]
            return 1;
        else [[likely]]
            return x * pow(x, n - 1);
    }
} // namespace with_attributes
 
namespace no_attributes {
    constexpr double pow(double x, long long n) noexcept {
        if (n <= 0)
            return 1;
        else
            return x * pow(x, n - 1);
    }
} // namespace no_attributes

double calc(double x, std::function<double(double, long long)> f) noexcept {
    constexpr long long precision{16LL};
    double y{};
    for (auto n{0LL}; n < precision; n += 2LL)
        y += f(x, n);
    return y;
}

double gen_random() noexcept {
    static std::random_device rd;
    static std::mt19937 gen(rd());
    static std::uniform_real_distribution<double> dis(-1.0, 1.0);
    return dis(gen);
}
 
volatile double sink{}; // ensures a side effect
 
int main() {
    auto benchmark = [](auto fun, auto rem)
    {
        const auto start = std::chrono::high_resolution_clock::now();
        for (auto y{1ULL}; y != 500'000'000ULL; ++y)
            sink = calc(gen_random(), fun);
        const std::chrono::duration<double> diff =
            std::chrono::high_resolution_clock::now() - start;
        std::cout << "Time: " << std::fixed << std::setprecision(6) << diff.count()
                  << " sec " << rem << std::endl; 
    };
 
    benchmark(with_attributes::pow, "(with attributes)");
    benchmark(no_attributes::pow, "(without attributes)");
    benchmark(with_attributes::pow, "(with attributes)");
    benchmark(no_attributes::pow, "(without attributes)");
    benchmark(with_attributes::pow, "(with attributes)");
    benchmark(no_attributes::pow, "(without attributes)");
}

And the result of running it:
[Image: screenshot of the run output]

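Analysis

Now let's dig into the real reason for the performance gain.

As the saying goes, before the source code there are no secrets. For a C++ program, though, that "code" means the assembly code. So we need to look at how with_attributes::pow and no_attributes::pow are implemented at the low level.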

with_attributes::pow

[Image: disassembly of with_attributes::pow]

no_attributes::pow

[Image: disassembly of no_attributes::pow]

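Analysis

We can see that with_attributes::pow and no_attributes::pow have a similar structure at the assembly level, but neither follows the structure of the C++ code. Compiled exactly as the C++ is written, the function would look something like this: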

test   %rdi,%rdi
jle XXXXX_RET_ADDRESS
sub    $0x1,%rdi
movsd  %xmm0,0x8(%rsp)
call pow
……

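Because we compiled with -O3 optimization enabled, the compiler optimized both with_attributes::pow and no_attributes::pow.

Let's walk through how no_attributes::pow was optimized:

test %rdi,%rdi: tests n.
jle 0x555555555c20 <_ZN13no_attributes3powEdx+112>: if n is less than or equal to 0, jump to offset +112, which returns 1.
movapd %xmm0,%xmm1: if n is greater than 0, copy x into the xmm1 register.
cmp $0x1,%rdi: compares n with 1.
je 0x555555555c1c <_ZN13no_attributes3powEdx+108>: if n is 1, jump to offset +108, which returns x, because pow(x, 1) = x * pow(x, 0) = x.
cmp $0x2,%rdi: compares n with 2.
je 0x555555555c18 <_ZN13no_attributes3powEdx+104>: if n is 2, jump to offset +104, which performs one x * x multiplication (mulsd %xmm1,%xmm0), because pow(x, 2) = x * pow(x, 1) = x * x * pow(x, 0) = x * x * 1 = x * x.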

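The rest of the logic follows the same pattern.
[Image: disassembly screenshot]

However, once n is greater than or equal to 5, the logic changes: the function recursively calls pow(x, n - 5) and, once it has the result, multiplies it by x five times in a row.
[Image: disassembly screenshot]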

So the optimization applied to this code is to replace recursive calls with direct operations.
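To make the shape of that optimized code easier to picture, here is a rough C++ rendering of the control flow described above. It is a hand-written sketch and not compiler output: the names naive_pow and pow_unrolled5 are made up for illustration, and the cases n == 3 and n == 4 are filled in by analogy with the n == 1 and n == 2 cases.

```cpp
// Sketch only: C++ that mirrors the control flow described for the optimized
// no_attributes::pow. naive_pow and pow_unrolled5 are hypothetical names.

// The recursion as written in the source: one multiplication per call.
constexpr double naive_pow(double x, long long n) noexcept {
    return n <= 0 ? 1.0 : x * naive_pow(x, n - 1);
}

// Small exponents are handled directly; larger ones recurse in steps of 5.
constexpr double pow_unrolled5(double x, long long n) noexcept {
    if (n <= 0) return 1.0;            // test / jle: n <= 0 returns 1
    if (n == 1) return x;              // cmp $0x1 / je: pow(x, 1) = x
    if (n == 2) return x * x;          // cmp $0x2 / je: one multiplication
    if (n == 3) return x * x * x;      // assumed by analogy
    if (n == 4) return x * x * x * x;  // assumed by analogy
    // n >= 5: one recursive call for n - 5, then five multiplications by x.
    double r = pow_unrolled5(x, n - 5);
    r *= x; r *= x; r *= x; r *= x; r *= x;
    return r;
}

// Compile-time checks that both versions agree on a few exact values.
static_assert(pow_unrolled5(2.0, 0) == 1.0);
static_assert(pow_unrolled5(2.0, 4) == 16.0);
static_assert(pow_unrolled5(2.0, 14) == naive_pow(2.0, 14));
```

With this shape, pow_unrolled5(x, 14) needs only three invocations (n = 14, 9, 4), where the recursion as written in the source would need fifteen.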

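Looking at the likely-optimized code, we find that it has the same structure. So why is the likely version 20% faster than the unannotated one?

The reason is that the likely version uses a chain of 8 multiplications by x, which eliminates more recursive calls than the unannotated version's chain of 5.
[Image: disassembly screenshot]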
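As a rough illustration of how much recursion each depth saves, the sketch below counts function invocations for one run of calc(), which evaluates pow for n = 0, 2, ..., 14. The helpers count_calls and total_calls are hypothetical. The model also assumes that a version with depth d handles every n < d without recursing and otherwise recurses once on n - d; that matches the 5-step version described above and is only assumed, by analogy, for the 8-step version.

```cpp
// Sketch only: counts invocations of a pow() that performs up to `depth`
// multiplications per call. count_calls and total_calls are hypothetical.

constexpr long long count_calls(long long n, long long depth) noexcept {
    // Exponents below `depth` need a single call; otherwise the function
    // recurses once on n - depth.
    return n < depth ? 1 : 1 + count_calls(n - depth, depth);
}

// Total invocations for one calc() run, which uses n = 0, 2, ..., 14.
constexpr long long total_calls(long long depth) noexcept {
    long long total = 0;
    for (long long n = 0; n < 16; n += 2)
        total += count_calls(n, depth);
    return total;
}

static_assert(total_calls(1) == 64);  // recursion as written: one multiplication per call
static_assert(total_calls(5) == 16);  // five multiplications per call
static_assert(total_calls(8) == 12);  // eight multiplications per call
```

Under these assumptions, one calc() run drops from 16 invocations to 12 when the depth goes from 5 to 8. This is a call count only, not a timing model, so it shows the direction of the difference and does not account for the measured 20% on its own.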
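To summarize: in this case, the likely annotation led the compiler to apply the same optimization to a greater depth, using direct operations to cut down on recursive calls.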
