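1. Introduction
A few days ago some friends and I were debating whether switch-case or if-else is the better choice, and it reminded me that CPython once used goto to optimize its switch-case dispatch, with a reported speedup of 15-20%. The technique behind that is computed goto.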
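2. Use case
CPython has a bytecode interpreter whose source is Python/ceval.c. Its core is a loop over a byte stream: it reads one opcode, performs the matching operation, and moves on to the next. Following reference [1], the core code is modeled below.
3. Code and test
First, the macro definitions and the switch-case implementation: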
#define OP_HALT 0x0
#define OP_INC 0x1
#define OP_DEC 0x2
#define OP_MUL2 0x3
#define OP_DIV2 0x4
#define OP_ADD7 0x5
#define OP_NEG 0x6
#define OP_MAX 0x7
int bm_dispatch::interp_switch(const uint8_t *code, int initval)
{
    int pc = 0;         /* index of the next opcode */
    int val = initval;  /* the single accumulator */
    while (1) {
        switch (code[pc++]) {
        case OP_HALT:
            return val;
        case OP_INC:
            val++;
            break;
        case OP_DEC:
            val--;
            break;
        case OP_MUL2:
            val *= 2;
            break;
        case OP_DIV2:
            val /= 2;
            break;
        case OP_ADD7:
            val += 7;
            break;
        case OP_NEG:
            val = -val;
            break;
        default:
            /* unknown opcode: stop as well */
            return val;
        }
    }
}
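To make the opcode semantics concrete, here is a small sketch that runs a hand-written program through the same switch-based loop. It is written as a free function rather than a bm_dispatch member so that it compiles on its own; the names run_switch and demo_switch and the sample program are illustrative assumptions, not part of the original benchmark.

```cpp
#include <cassert>
#include <cstdint>

// Opcode values copied from the #define list above.
enum : uint8_t { OP_HALT = 0, OP_INC, OP_DEC, OP_MUL2, OP_DIV2, OP_ADD7, OP_NEG };

// Free-function version of the switch interpreter (hypothetical name).
int run_switch(const uint8_t *code, int val)
{
    assert(code != nullptr);
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_INC:  val++;      break;
        case OP_DEC:  val--;      break;
        case OP_MUL2: val *= 2;   break;
        case OP_DIV2: val /= 2;   break;
        case OP_ADD7: val += 7;   break;
        case OP_NEG:  val = -val; break;
        case OP_HALT:
        default:      return val;  // unknown opcodes also stop the loop
        }
    }
}

// Sample program (an assumption for illustration), started with val = 1.
int demo_switch()
{
    const uint8_t program[] = {
        OP_INC, OP_MUL2, OP_ADD7, OP_NEG, OP_DIV2, OP_HALT
    };
    return run_switch(program, 1);
}
```

Starting from 1, the accumulator goes 2, 4, 11, -11, -5, so demo_switch() returns -5 (integer division truncates toward zero).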
Next, the computed-goto implementation:
int bm_dispatch::interp_cgoto(const uint8_t *code, int initval)
{
    /*
     * The indices of labels in the dispatch_table are the relevant opcodes.
     * &&label and goto *ptr are GNU extensions (GCC/Clang). Unlike the
     * switch version there is no default branch, so every opcode must be
     * smaller than OP_MAX.
     */
    static void *dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg
    };
#define DISPATCH() goto *dispatch_table[code[pc++]]
    int pc = 0;
    int val = initval;
    DISPATCH();
    /* The loop never iterates: every handler returns or jumps via DISPATCH(). */
    while (1) {
    do_halt:
        return val;
    do_inc:
        val++;
        DISPATCH();
    do_dec:
        val--;
        DISPATCH();
    do_mul2:
        val *= 2;
        DISPATCH();
    do_div2:
        val /= 2;
        DISPATCH();
    do_add7:
        val += 7;
        DISPATCH();
    do_neg:
        val = -val;
        DISPATCH();
    }
#undef DISPATCH
}
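Computed goto relies on the GNU "labels as values" extension (&&label and goto *ptr), which GCC and Clang support but which is not standard C++. The sketch below restates the dispatch as a free function and adds a validation pass, since the dispatch table has no fallback for out-of-range opcodes. The names run_cgoto, program_is_valid and demo_cgoto are illustrative assumptions, not part of the original code.

```cpp
#include <cstddef>
#include <cstdint>

enum : uint8_t {
    OP_HALT = 0, OP_INC, OP_DEC, OP_MUL2, OP_DIV2, OP_ADD7, OP_NEG, OP_MAX
};

// Hypothetical helper: true if every opcode indexes inside the dispatch
// table and the program ends with OP_HALT, so dispatch cannot run off the end.
bool program_is_valid(const uint8_t *code, size_t len)
{
    if (len == 0 || code[len - 1] != OP_HALT) return false;
    for (size_t i = 0; i < len; ++i)
        if (code[i] >= OP_MAX) return false;
    return true;
}

// Computed-goto interpreter; needs GCC or Clang (labels as values).
int run_cgoto(const uint8_t *code, int val)
{
    // Entry i holds the address of the handler for opcode i.
    static void *const table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg
    };
    int pc = 0;
#define NEXT() goto *table[code[pc++]]
    NEXT();
do_inc:  val++;      NEXT();
do_dec:  val--;      NEXT();
do_mul2: val *= 2;   NEXT();
do_div2: val /= 2;   NEXT();
do_add7: val += 7;   NEXT();
do_neg:  val = -val; NEXT();
do_halt: return val;
#undef NEXT
}

// Validate first, then dispatch without per-opcode checks.
// Returns 0 for an invalid program (a simplification for this demo).
int demo_cgoto()
{
    const uint8_t program[] = {
        OP_INC, OP_MUL2, OP_ADD7, OP_NEG, OP_DIV2, OP_HALT
    };
    if (!program_is_valid(program, sizeof(program))) return 0;
    return run_cgoto(program, 1);
}
```

For the sample program the accumulator goes 1, 2, 4, 11, -11, -5, so demo_cgoto() returns -5. Validating once up front keeps the hot path free of range checks, which is the point of the technique.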
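The test uses the google-benchmark framework.
To keep the comparison fair, both functions receive the same random buffer as input in each test,
and each of the two functions is executed 1,000,000 times.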
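#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <benchmark/benchmark.h>

/* Assumption: SIZE_MB is not defined in the original listing; 1 MiB is used here. */
#ifndef SIZE_MB
#define SIZE_MB (1024 * 1024)
#endif

class bm_dispatch : public ::benchmark::Fixture
{
public:
    int val_a = 0;
    int val_b = 0;
    uint8_t code[SIZE_MB] = {0};

    bm_dispatch()
    {
        srandom(time(NULL));
        val_a = val_b = random();
        /* random code; the last byte stays 0 (OP_HALT) as a terminator.
         * Note that 0 (OP_HALT) is also among the random values. */
        for (size_t ix = 0; ix < sizeof(code) - 1; ++ix) {
            code[ix] = random() % OP_MAX;
        }
    }

    void SetUp(const ::benchmark::State &)
    {
    }

    void TearDown(const ::benchmark::State &)
    {
    }

    int interp_switch(const uint8_t *code, int initval);
    int interp_cgoto(const uint8_t *code, int initval);

    /* Run the switch interpreter 1,000,000 times on the same buffer. */
    int bm_switch_x100w()
    {
        for (int ix = 0; ix < 100 * 10000; ++ix) {
            val_a = interp_switch(code, val_a);
        }
        return val_a;
    }

    /* Run the computed-goto interpreter 1,000,000 times on the same buffer. */
    int bm_cgoto_x100w()
    {
        for (int ix = 0; ix < 100 * 10000; ++ix) {
            val_b = interp_cgoto(code, val_b);
        }
        return val_b;
    }
};

/* The registration is not shown in the original; this is a reconstruction
 * based on the benchmark names that appear in the results. */
BENCHMARK_F(bm_dispatch, interp_switch)(benchmark::State &st)
{
    for (auto _ : st) {
        int r = bm_switch_x100w();
        benchmark::DoNotOptimize(r);
    }
}

BENCHMARK_F(bm_dispatch, interp_cgoto)(benchmark::State &st)
{
    for (auto _ : st) {
        int r = bm_cgoto_x100w();
        benchmark::DoNotOptimize(r);
    }
}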
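One property of the fixture is worth checking before reading the numbers: the opcodes are drawn uniformly from 0 to OP_MAX - 1, so about one byte in seven is OP_HALT, and an interpreter call returns at the first zero byte instead of walking the whole buffer. The sketch below estimates how many opcodes a single call executes. It uses std::mt19937 with a fixed seed in place of random() to be reproducible, and the helper names are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

constexpr uint8_t OP_HALT = 0x0;
constexpr unsigned OP_MAX = 0x7;

// Number of opcodes one interpreter call executes, counting the OP_HALT
// that ends it. If the buffer holds no OP_HALT, its length is returned.
size_t executed_opcodes(const std::vector<uint8_t> &code)
{
    for (size_t i = 0; i < code.size(); ++i)
        if (code[i] == OP_HALT) return i + 1;
    return code.size();
}

// Fill buffers the way the fixture does (last byte left as 0 = OP_HALT)
// and average the executed length over `trials` random buffers.
double mean_executed_opcodes(size_t buf_size, int trials, unsigned seed)
{
    assert(trials > 0);
    std::mt19937 rng(seed);
    double total = 0.0;
    for (int t = 0; t < trials; ++t) {
        std::vector<uint8_t> code(buf_size, 0);
        for (size_t i = 0; i + 1 < code.size(); ++i)
            code[i] = static_cast<uint8_t>(rng() % OP_MAX);
        total += static_cast<double>(executed_opcodes(code));
    }
    return total / trials;
}
```

With a halt probability of 1/7 per byte, the expected length is 7 opcodes per call, and mean_executed_opcodes(1024, 20000, 42) should land close to that. The 1,000,000 calls therefore replay one short opcode prefix over and over, which may be part of why the two dispatch styles often measure the same.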
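4. Test results
Build command (GCC honors only the last -O option on the command line, so -O3 -Os effectively builds with -Os):
g++ -std=c++11 -Wall -O3 -Os -o bm_dispatch bm_dispatch.cc \
    -I/usr/local/hawk/include \
    -lbenchmark -lbenchmark_main -pthread
The test machine has an i5-6500 CPU. Watch out for this warning: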
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
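It means CPU frequency scaling is enabled. Switch every core to the performance governor by hand (writing to sysfs requires root):
cpu_num=$(cat /proc/cpuinfo | grep "processor" | grep ":" | wc -l)
for ((ix = 0; ix < ${cpu_num}; ix++)); do
    echo "performance" >/sys/devices/system/cpu/cpu${ix}/cpufreq/scaling_governor
done
Results: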
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
bm_dispatch/interp_switch   24719496 ns     24719191 ns           28
bm_dispatch/interp_cgoto    19727398 ns     19727291 ns           35
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
bm_dispatch/interp_switch    4618213 ns      4618215 ns          153
bm_dispatch/interp_cgoto     3442024 ns      3442025 ns          203
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
bm_dispatch/interp_switch   19258560 ns     19244362 ns           36
bm_dispatch/interp_cgoto    19258885 ns     19244355 ns           36
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
bm_dispatch/interp_switch   10058817 ns     10057068 ns           70
bm_dispatch/interp_cgoto    10109119 ns     10108997 ns           69
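Judging by the last column (Iterations, where more means faster), computed goto was ahead in only 2 or 3 of 5 runs (four are shown above), and in several runs the two implementations were tied.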
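To put numbers on the comparison, the relative time saved by computed goto can be worked out from the reported wall-clock times. The sketch below does that arithmetic for the four runs listed in the results; the struct and function names are assumptions chosen for this illustration.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark run: wall-clock time per iteration for each implementation.
struct Run {
    double switch_ns;
    double cgoto_ns;
};

// Time saved by computed goto, as a percentage of the switch time.
// Positive means computed goto was faster.
double speedup_percent(const Run &r)
{
    assert(r.switch_ns > 0.0);
    return (r.switch_ns - r.cgoto_ns) / r.switch_ns * 100.0;
}

// "Time" column values copied from the four result tables.
const Run kRuns[] = {
    {24719496.0, 19727398.0},
    { 4618213.0,  3442024.0},
    {19258560.0, 19258885.0},
    {10058817.0, 10109119.0},
};

// Print one signed percentage per run.
void print_speedups()
{
    for (const Run &r : kRuns)
        std::printf("%+.1f%%\n", speedup_percent(r));
}
```

For these four runs the result is roughly 20% and 25% in favor of computed goto, followed by two runs within about half a percent of each other. The spread between runs is larger than the effect being measured, so the figures are better read as a hint than as a stable measurement.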
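5. Summary
In this user-space experiment the gain was not a consistent 15%: some runs favored computed goto and others showed no difference.
This may be down to the test method. For instance, the random buffer contains OP_HALT about once every seven bytes, so each interpreter call returns after only a handful of opcodes. A proper analysis would probably also mean reading the generated assembly.
Computed goto is a low-level optimization that pays off only for workloads dispatching upwards of ten million ops per second;
for ordinary applications, code readability should still come first.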
References:
[1] Eli Bendersky, "Computed goto for efficient dispatch tables", https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables