[leveldb] Assembly-Level Performance Optimization, Seen Through EncodeFixed64

馒头2870

Categories: leveldb, c/c++, databases

Article link: https://blog.csdn.net/sjc2870/article/details/112710375

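Preface

I have been reading the leveldb source code recently, and the way EncodeFixed64 is written puzzled me. Why make it so verbose and obscure? Isn't storing an integer in a char array something a single sprintf can do? After digging deeper, it turns out things are not that simple.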

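Starting from the git history

Looking on GitHub at the most recent commit that touches this function, we find "Remove leveldb::port::kLittleEndian".
The commit message says: Clang 10 includes the optimization described in https://bugs.llvm.org/show_bug.cgi?id=41761. This means that the platform-independent implementations of {Decode,Encode}Fixed{32,64}() compile to a single instruction on recent Clang and GCC.
So the change seems to have been driven by compiler optimizations. Let's look at that URL.
It says: LLVM merges consecutive byte loads into a larger load where appropriate, but unfortunately does not do the same for stores. Having this optimization would let me write platform-independent code for loading/storing {little,big}-endian integers. Without it, I need to explicitly write a separate path for little-endian platforms. DecodeFixed32 and DecodeFixed64 are compiled to a single mov instruction on x86_64, but EncodeFixed32 and EncodeFixed64 are not. GCC 8 and later optimize both stores and loads, while earlier versions only optimize loads. MSVC optimizes neither. (Here, store and load refer to the Encode and Decode families of functions, respectively.)
Reading this carefully: at the time of the report, LLVM optimized only the Decode functions, not the Encode functions.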
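To make the load/store distinction concrete, here is a minimal sketch of the two portable forms next to a host-order shortcut. The function names (EncodeFixed64Portable, DecodeFixed64Portable, EncodeFixed64HostOrder) are my own for illustration and are not leveldb's identifiers. The memcpy version is only an example of the kind of little-endian-specific path the bug report wanted to avoid writing, and it matches the portable version only on little-endian hosts.

```cpp
#include <cstdint>
#include <cstring>

// Portable store: writes 8 bytes, least significant byte first, on any host.
inline void EncodeFixed64Portable(char* dst, uint64_t value) {
  uint8_t* const buf = reinterpret_cast<uint8_t*>(dst);
  buf[0] = static_cast<uint8_t>(value);
  buf[1] = static_cast<uint8_t>(value >> 8);
  buf[2] = static_cast<uint8_t>(value >> 16);
  buf[3] = static_cast<uint8_t>(value >> 24);
  buf[4] = static_cast<uint8_t>(value >> 32);
  buf[5] = static_cast<uint8_t>(value >> 40);
  buf[6] = static_cast<uint8_t>(value >> 48);
  buf[7] = static_cast<uint8_t>(value >> 56);
}

// Portable load: reassembles the value from 8 little-endian bytes.
inline uint64_t DecodeFixed64Portable(const char* src) {
  const uint8_t* const buf = reinterpret_cast<const uint8_t*>(src);
  return (static_cast<uint64_t>(buf[0])) |
         (static_cast<uint64_t>(buf[1]) << 8) |
         (static_cast<uint64_t>(buf[2]) << 16) |
         (static_cast<uint64_t>(buf[3]) << 24) |
         (static_cast<uint64_t>(buf[4]) << 32) |
         (static_cast<uint64_t>(buf[5]) << 40) |
         (static_cast<uint64_t>(buf[6]) << 48) |
         (static_cast<uint64_t>(buf[7]) << 56);
}

// Host-order shortcut (assumption: equivalent to the portable store only on
// little-endian hosts; on big-endian hosts the byte order would differ).
inline void EncodeFixed64HostOrder(char* dst, uint64_t value) {
  std::memcpy(dst, &value, sizeof(value));
}
```

Whether the portable versions collapse to a single instruction depends on the compiler and version, as the bug report describes, so it is worth checking the generated assembly for your own toolchain.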
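The bug report links to another page, https://godbolt.org/z/45S0ID, so let's look at that as well.
[Image: Compiler Explorer screenshot]
The image does not display very clearly, but the overall picture is this: the code is on the left, and on the right is the assembly produced by different compiler versions. Code this long compiles to a single instruction on x86-64 gcc 8.2, ARM gcc 8.2, and x86-64 clang trunk.
We can also check how many instructions the sprintf approach we first thought of compiles to. Here we use only encode_fixed64 as the example for comparison.
[Image: Compiler Explorer output for encode_fixed64 and encode_sprintf]
This image is not very clear either, so here is the x86-64 gcc 8.2 output: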

encode_fixed64(char*, unsigned long):
        mov     QWORD PTR [rdi], rsi
        ret
.LC0:
        .string "%lu"
encode_sprintf(char*, unsigned long):
        mov     rdx, rsi
        xor     eax, eax
        mov     esi, OFFSET FLAT:.LC0
        jmp     sprintf

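As you can see, encode_fixed64 compiles to a single mov instruction, while encode_sprintf compiles to four instructions, the last of which is a jmp into sprintf. The real work still has to happen inside sprintf (parsing the format string and converting the value to decimal digits), on top of the jump itself, so its cost cannot compare with a single mov. Keep in mind that the two functions do not produce the same bytes: sprintf writes a variable-length decimal string, while encode_fixed64 writes exactly 8 bytes in little-endian order, so this is a comparison of cost, not of equivalent results.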
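The difference in output between the two approaches is easy to see in a small sketch. The helper names EncodeAsText and EncodeAsFixed64 are hypothetical, introduced only for this illustration, and they return std::string purely so the results are easy to inspect.

```cpp
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Decimal text form: the length depends on the value
// (between 1 and 20 digits for a uint64_t).
std::string EncodeAsText(uint64_t value) {
  char buf[32];
  const int n = std::snprintf(buf, sizeof(buf), "%" PRIu64, value);
  return std::string(buf, static_cast<std::size_t>(n));
}

// Fixed-width binary form: always 8 bytes, least significant byte first.
std::string EncodeAsFixed64(uint64_t value) {
  std::string out(8, '\0');
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<char>(static_cast<uint8_t>(value >> (8 * i)));
  }
  return out;
}
```

For the value 128, the text form is the three characters "128", while the fixed form is eight bytes starting with 0x80 followed by seven zero bytes. A reader of the fixed form always knows to consume exactly 8 bytes, which is what makes it suitable for an on-disk format.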

Compiling it ourselves

At this point we can use our own compiler to see the difference. My compiler is GCC 8.3.0.
Here is the test code:

#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Store value into ptr as 8 bytes, least significant byte first
// (little-endian), independent of the host's byte order.
void encode_fixed64(char *ptr, uint64_t value) {
    uint8_t *const buf = reinterpret_cast<uint8_t*>(ptr);
    buf[0] = static_cast<uint8_t>(value);
    buf[1] = static_cast<uint8_t>(value >> 8);
    buf[2] = static_cast<uint8_t>(value >> 16);
    buf[3] = static_cast<uint8_t>(value >> 24);
    buf[4] = static_cast<uint8_t>(value >> 32);
    buf[5] = static_cast<uint8_t>(value >> 40);
    buf[6] = static_cast<uint8_t>(value >> 48);
    buf[7] = static_cast<uint8_t>(value >> 56);
}

// Store value into ptr as a NUL-terminated decimal string.
void encode_sprintf(char *ptr, uint64_t value) {
    std::sprintf(ptr, "%" PRIu64, value);
}

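After compiling to assembly (g++ -O2 -S test.cc -o test.i), we can look at the contents of test.i. (The file holds assembly; .s is the more conventional extension for it.)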

        .file   "test.cc"
        .text
        .p2align 4,,15
        .globl  _Z14encode_fixed64Pcm
        .type   _Z14encode_fixed64Pcm, @function
_Z14encode_fixed64Pcm:
.LFB16:
        .cfi_startproc
        movq    %rsi, (%rdi)
        ret
        .cfi_endproc
.LFE16:
        .size   _Z14encode_fixed64Pcm, .-_Z14encode_fixed64Pcm
        .section        .rodata.str1.1,"aMS",@progbits,1
.LC0:
        .string "%lu"
        .text
        .p2align 4,,15
        .globl  _Z14encode_sprintfPcm
        .type   _Z14encode_sprintfPcm, @function
_Z14encode_sprintfPcm:
.LFB17:
        .cfi_startproc
        movq    %rsi, %rdx
        xorl    %eax, %eax
        movl    $.LC0, %esi
        jmp     sprintf
        .cfi_endproc
.LFE17:
        .size   _Z14encode_sprintfPcm, .-_Z14encode_sprintfPcm
        .ident  "GCC: (GNU) 8.3.0"
        .section        .note.GNU-stack,"",@progbits

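As you can see, this is nearly identical to the online compiler's output.
Finally, let's run each of the two functions one hundred million times and see how large the performance gap really is.
[Image: benchmark output]
sprintf took more than 6 s, while encode_fixed64 was reported as 0 seconds.
Even with the loop count raised to one billion, encode_fixed64 still reported 0 seconds. That makes sense given that each call costs only a single mov instruction, although a reading of exactly 0 suggests the optimizer may also have inlined the call and dropped the repeated identical stores. The figure is best read as "too cheap to measure with this loop" rather than as the true cost of a billion stores.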
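If you want the loop to really perform one store per iteration, one common approach is to vary the value and add a compiler barrier. The following is a minimal sketch under that assumption. The function name time_encode_fixed64 is hypothetical, the empty inline-asm barrier is a GCC/Clang extension (other compilers skip it here), and I have not measured what numbers it produces.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>

// Same shift-based store as in the test code.
inline void encode_fixed64(char* ptr, uint64_t value) {
  uint8_t* const buf = reinterpret_cast<uint8_t*>(ptr);
  buf[0] = static_cast<uint8_t>(value);
  buf[1] = static_cast<uint8_t>(value >> 8);
  buf[2] = static_cast<uint8_t>(value >> 16);
  buf[3] = static_cast<uint8_t>(value >> 24);
  buf[4] = static_cast<uint8_t>(value >> 32);
  buf[5] = static_cast<uint8_t>(value >> 40);
  buf[6] = static_cast<uint8_t>(value >> 48);
  buf[7] = static_cast<uint8_t>(value >> 56);
}

// Runs `iterations` encodes into buf and returns the elapsed time in seconds.
// buf must point to at least 8 writable bytes.
double time_encode_fixed64(std::size_t iterations, char* buf) {
  const auto beg = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i) {
    // A different value each iteration, so the stores are not all identical.
    encode_fixed64(buf, static_cast<uint64_t>(i));
#if defined(__GNUC__) || defined(__clang__)
    // Empty asm with a "memory" clobber: the compiler must assume buf's
    // contents are observed here, so it cannot drop the store above.
    __asm__ __volatile__("" : : "r"(buf) : "memory");
#endif
  }
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - beg).count();
}
```

Calling time_encode_fixed64(1000000000, buf) with a buffer of at least 8 bytes would then give a figure that reflects actual stores. The barrier itself adds no instructions, but it does limit what the optimizer may reorder around it.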

Appendix: test code

#include <chrono>
#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <iostream>

constexpr std::size_t N = 1000000000;  // one billion iterations

// Store value into ptr as 8 bytes, least significant byte first.
void encode_fixed64(char *ptr, uint64_t value) {
    uint8_t *const buf = reinterpret_cast<uint8_t*>(ptr);
    buf[0] = static_cast<uint8_t>(value);
    buf[1] = static_cast<uint8_t>(value >> 8);
    buf[2] = static_cast<uint8_t>(value >> 16);
    buf[3] = static_cast<uint8_t>(value >> 24);
    buf[4] = static_cast<uint8_t>(value >> 32);
    buf[5] = static_cast<uint8_t>(value >> 40);
    buf[6] = static_cast<uint8_t>(value >> 48);
    buf[7] = static_cast<uint8_t>(value >> 56);
}

// Store value into ptr as a NUL-terminated decimal string.
void encode_sprintf(char *ptr, uint64_t value) {
    std::sprintf(ptr, "%" PRIu64, value);
}

int main() {
    // steady_clock is monotonic, which suits measuring intervals.
    using Clock = std::chrono::steady_clock;
    char buf[64] = {0};

    // Time N calls to encode_fixed64.
    auto beg = Clock::now();
    for (std::size_t i = 0; i < N; ++i) {
        encode_fixed64(buf, 128);
    }
    auto end = Clock::now();
    std::cout << "encode_fixed64 took "
              << std::chrono::duration<double>(end - beg).count()
              << " seconds" << std::endl;

    // Time N calls to encode_sprintf.
    beg = Clock::now();
    for (std::size_t i = 0; i < N; ++i) {
        encode_sprintf(buf, 128);
    }
    end = Clock::now();
    std::cout << "encode_sprintf took "
              << std::chrono::duration<double>(end - beg).count()
              << " seconds" << std::endl;
}

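Summary

Such a small function turns out to hold so much. Without digging in, I would never have known this kind of optimization was possible. Having looked into it, my respect for leveldb's authors has grown even more: optimizing at this level, and revising the code as compilers evolve, is seriously impressive. (So this is the real power of C++? Love it.) From now on, when I need to store a number in a char array as fixed-width bytes, no more sprintf; I'll just throw an encode_fixed64 at it. Style points ++++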
