GCC乘法优化测试-CSDN博客

这个标题有点大，我并没有看过GCC的源码，所以也不知道很准确的GCC对乘法的优化。但是我做了几个测试，发现GCC的强大之处，与君共享。
突然想到做这个测试，是因为前段时间有个面试，出了一个很基础的问题，y = x * 63怎么优化。这种是学校的学生都会做的问题： y = (x << 6) - x; 可是这种方法到底能提升多少效率？有数据有真相。
我测试的平台是在Linux 2.6.32，GCC 4.6。
下面一段是测试的代码

#include <sys/time.h>
#include <ctime>
#include <cstdio>
#include <cstdlib>

const int USEC_PER_SEC = 1000000 ;

int main(void)
{
    const int count = 1000000 ;
    const int factor = 63 ;

    printf("test count is %d\n", count);

    srand(time(NULL));
    const int cases = 4096 ;
    int init[cases] ;
    for (int i = 0 ; i < cases ; i++)
        init[i] = rand() % 49999 ;

    int tmp ;

    struct timeval start, end ;

    gettimeofday(&start, NULL) ;
    tmp = 0;
    for (int i = 0 ; i < count ; i++)
        for (int j = 0 ; j < cases ; j++)
            tmp += init[j] * factor ;

    gettimeofday(&end, NULL) ;
    end.tv_sec -= start.tv_sec ;
    end.tv_usec -= start.tv_usec ;
    if (end.tv_usec < 0)
    {
        end.tv_sec-- ;
        end.tv_usec += USEC_PER_SEC ;
    }
    printf("result tmp is %d\n", tmp) ;
    printf("nomal mul: %d.%06d\n", end.tv_sec, end.tv_usec) ;

    gettimeofday(&start, NULL) ;
    tmp = 0;
    for (int i = 0 ; i < count ; i++)
    {
        for (int j = 0 ; j < cases; j++)
        {
            const int a = init[j] ;
            tmp += (a << 6) - a ; // y = x * 63优化的代码
        }
    }
    gettimeofday(&end, NULL) ;
    end.tv_sec -= start.tv_sec ;
    end.tv_usec -= start.tv_usec ;
    if (end.tv_usec < 0)
    {
        end.tv_sec-- ;
        end.tv_usec += USEC_PER_SEC ;
    }
    printf("result tmp is %d\n", tmp) ;
    printf("optimized mul: %d.%06d\n", end.tv_sec, end.tv_usec) ;

    return 0 ;
}

测试的数据很不理想，用 gcc -O0 编译，就是不优化，得出的结果是：

test count is 1000000
result tmp is -1332858560
nomal mul: 8.261260
result tmp is -1332858560
optimized mul: 8.483564

用gcc -O2优化后的结果是：

test count is 1000000
result tmp is -1897458560
nomal mul: 2.133206
result tmp is -1897458560
optimized mul: 2.133081

其实结果已经很明显了，即使是-O0的选项，gcc还是对乘法做了优化。看看-O0对y = x * 63和y = (x << 6) - x生成的汇编代码是怎么样的：

.L6:
    .loc 1 29 0
    movl    -32(%rbp), %eax
    cltq
    movl    -16480(%rbp,%rax,4), %edx
    movl    %edx, %eax
    sall    $6, %eax            ; 这里其实是y = x * 63
    subl    %edx, %eax
    addl    %eax, -44(%rbp)
    .loc 1 28 0
    addl    $1, -32(%rbp)
.L5:
    cmpl    $4095, -32(%rbp)
    setle   %al
    testb   %al, %al
    jne .L6
; 此处略去几千字符
.LBB8:
    .loc 1 48 0
    movl    -24(%rbp), %eax
    cltq
    movl    -16480(%rbp,%rax,4), %eax
    movl    %eax, -20(%rbp)
    .loc 1 49 0
    movl    -20(%rbp), %eax
    sall    $6, %eax           ; 这一段才是y = (x << 6) - x
    subl    -20(%rbp), %eax
    addl    %eax, -44(%rbp)

很遗憾，即使用-O0，gcc还是对乘法做了优化，可能这种优化太简单了吧。-O2 优化的汇编代码自然不用说，这段是类似的。
通过x * 63这种代码优化，突然想起来，我们经常做的 x * 1000能不能也用这种优化。于是也测试了一把，x * 1000改成加法操作：

y = (x << 10) - (x << 4) - (x << 3) ;

就是

y = x * 1024 - x * 16 - x * 8 ;

下面是不做优化编译的结果：

test count is 1000000
result tmp is -1742412800
nomal mul: 7.867746
result tmp is -1742412800
optimized mul: 10.117456

很明显，跟我期望的结果相反，乘法的CPU指令周期是多少不太清楚，但是看测试结果，已经不能用这种方法优化了。
再看看O2优化的结果：

test count is 1000000
result tmp is 1581241344
nomal mul: 2.138208
result tmp is 1581241344
optimized mul: 2.129256

为什么会这么快？
看看汇编代码吧。

.L4:
    movl    (%rax), %esi
    addq    $4, %rax
    imull   $1000, %esi, %edx ; 直接用乘法指令
    addl    %edx, %r14d
    cmpq    %rbx, %rax
; 优化后的代码
.L8:        
    movl    (%rax), %edx
    addq    $4, %rax
    imull   $1000, %edx, %edx  ; 优化后直接改成了乘以1000，好牛X
    addl    %edx, %r13d
    cmpq    %rax, %rbx