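This thread caught my attention because it deals with a simple problem that still requires a lot of work (CPU cycles), even on a modern CPU. One day I found myself standing there with the same #%"# problem: I had to flip millions of bytes. However, I know that all my target systems are modern Intel-based, so let's start optimizing to the extreme!
So I used Matt J's lookup code as the baseline. The system I'm benchmarking on is an i7 Haswell 4700EQ.
Matt J's lookup bitflip of 400,000,000 bytes: around 0.272 seconds.
I then went ahead and tried to see whether Intel's ISPC compiler could vectorize the algorithm in rese.c.
I won't bore you with my findings here, since I tried a lot of things to help the compiler along. Anyhow, I ended up at around 0.15 seconds to bitflip 400,000,000 bytes. That is a big reduction, but still far too slow for my application.
So, people, let me present the fastest Intel-based bitflip in the world. Timing: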
Time to bitflip 400000000 bytes: 0.050082 seconds!

// Bitflip using AVX2 - The fastest Intel based bitflip in the world!!
// Made by Anders Cedronius 2014 (anders.cedronius (you know what) gmail.com)

#include <stdio.h>   // printf
#include <stdlib.h>  // rand
#include <math.h>    // ceil
#include <omp.h>     // omp_get_wtime

using namespace std;

#define DISPLAY_HEIGHT 4
#define DISPLAY_WIDTH 32
#define NUM_DATA_BYTES 400000000

// Constants (first we got the mask, then the high order nibble look up table and last we got the low order nibble lookup table)
__attribute__ ((aligned(32))) static unsigned char k1[32*3]={
0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,0x0f,
0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f,0x00,0x08,0x04,0x0c,0x02,0x0a,0x06,0x0e,0x01,0x09,0x05,0x0d,0x03,0x0b,0x07,0x0f,
0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0,0x00,0x80,0x40,0xc0,0x20,0xa0,0x60,0xe0,0x10,0x90,0x50,0xd0,0x30,0xb0,0x70,0xf0};

// The data to be bitflipped (+32 to avoid the quantization out of memory problem)
__attribute__ ((aligned(32))) static unsigned char data[NUM_DATA_BYTES+32]={};

extern "C" {
void bitflipbyte(unsigned char[],unsigned int,unsigned char[]);
}

int main()
{
    // Fill the buffer with pseudo-random test data
    for (unsigned int i = 0; i < NUM_DATA_BYTES; i++)
    {
        data[i] = rand();
    }

    // Debug output: the first DISPLAY_HEIGHT rows of DISPLAY_WIDTH bytes before the flip
    printf ("\r\nData in(start):\r\n");
    for (unsigned int j = 0; j < DISPLAY_HEIGHT; j++)
    {
        for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
        {
            printf ("0x%02x,",data[i+(j*DISPLAY_WIDTH)]);
        }
        printf ("\r\n");
    }

    printf ("\r\nNumber of 32-byte chunks to convert: %u\r\n",(unsigned int)ceil(NUM_DATA_BYTES/32.0));

    // Time only the call into the assembly routine
    double start_time = omp_get_wtime();
    bitflipbyte(data,(unsigned int)ceil(NUM_DATA_BYTES/32.0),k1);
    double end_time = omp_get_wtime();

    // Debug output: the same bytes after the flip
    printf ("\r\nData out:\r\n");
    for (unsigned int j = 0; j < DISPLAY_HEIGHT; j++)
    {
        for (unsigned int i = 0; i < DISPLAY_WIDTH; i++)
        {
            printf ("0x%02x,",data[i+(j*DISPLAY_WIDTH)]);
        }
        printf ("\r\n");
    }

    printf("\r\n\r\nTime to bitflip %d bytes: %f seconds\r\n\r\n",NUM_DATA_BYTES, end_time-start_time);

    // return with no errors
    return 0;
}
The printfs are only there for debugging.
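Here is the workhorse:

bits 64
global bitflipbyte

bitflipbyte:
    vmovdqa ymm2, [rdx]        ; ymm2 = 0x0f nibble mask (first 32 bytes of k1)
    add rdx, 20h
    vmovdqa ymm3, [rdx]        ; ymm3 = table: reversed nibble in the low 4 bits
    add rdx, 20h
    vmovdqa ymm4, [rdx]        ; ymm4 = table: reversed nibble in the high 4 bits
bitflipp_loop:
    vmovdqa ymm0, [rdi]        ; load 32 bytes of data
    vpand ymm1, ymm2, ymm0     ; ymm1 = low nibbles
    vpandn ymm0, ymm2, ymm0    ; ymm0 = high nibbles (low nibbles cleared)
    vpsrld ymm0, ymm0, 4h      ; move the high nibbles down to bits 0-3
    vpshufb ymm1, ymm4, ymm1   ; look up low nibbles: reversed, placed in the high half
    vpshufb ymm0, ymm3, ymm0   ; look up high nibbles: reversed, placed in the low half
    vpor ymm0, ymm0, ymm1      ; combine the two halves
    vmovdqa [rdi], ymm0        ; store the result in place
    add rdi, 20h               ; advance to the next 32-byte chunk
    dec rsi                    ; one chunk fewer to go
    jnz bitflipp_loop
    ret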
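The code takes 32 bytes and masks out the nibbles; the high nibbles are shifted right by 4. It then uses vpshufb with ymm4 / ymm3 as lookup tables. I could use a single lookup table, but then I would have to shift left before ORing the nibbles back together.
There are even faster ways of flipping the bits. But I'm bound to a single thread and the CPU, so this is the fastest I could achieve. Can you make a faster version?
Please make no comments about using the Intel C/C++ compiler intrinsic equivalent commands.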
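For checking the assembly output on a small buffer, a scalar version of the same nibble lookup can be handy. The following is only a sketch: the names `kRevLow`, `kRevHigh`, `reverse_byte`, `reverse_bytes_scalar` and `self_check` are made up for illustration and are not part of the code above. The two tables are the first 16 entries of the second and third rows of `k1`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names; the values are copied from rows 2 and 3 of k1.
// kRevLow[n]  = bit-reversed nibble n, placed in the low 4 bits.
// kRevHigh[n] = bit-reversed nibble n, placed in the high 4 bits.
static const uint8_t kRevLow[16] = {
    0x00, 0x08, 0x04, 0x0c, 0x02, 0x0a, 0x06, 0x0e,
    0x01, 0x09, 0x05, 0x0d, 0x03, 0x0b, 0x07, 0x0f};
static const uint8_t kRevHigh[16] = {
    0x00, 0x80, 0x40, 0xc0, 0x20, 0xa0, 0x60, 0xe0,
    0x10, 0x90, 0x50, 0xd0, 0x30, 0xb0, 0x70, 0xf0};

// Reverse the bit order of one byte, mirroring the steps of the AVX2 loop.
inline uint8_t reverse_byte(uint8_t b) {
    const uint8_t lo = b & 0x0f;         // mask out the low nibble (vpand)
    const uint8_t hi = (b & 0xf0) >> 4;  // isolate and shift the high nibble (vpandn + vpsrld)
    // Two table lookups (vpshufb) and a final OR (vpor).
    return static_cast<uint8_t>(kRevHigh[lo] | kRevLow[hi]);
}

// Reverse the bits of every byte in buf, one byte at a time.
inline void reverse_bytes_scalar(uint8_t* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = reverse_byte(buf[i]);
    }
}

// Spot check on a few values that are easy to verify by hand.
inline void self_check() {
    assert(reverse_byte(0x01) == 0x80);
    assert(reverse_byte(0x0f) == 0xf0);
    assert(reverse_byte(0xa5) == 0xa5);  // 1010 0101 reads the same both ways
}
```

One way to use it: copy the first few hundred bytes of `data` before calling `bitflipbyte`, run `reverse_bytes_scalar` on the copy, and compare the two results. This is far slower than the AVX2 loop and is meant for verification only.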