The documentation for the _mm512_shuffle_epi8 intrinsic reads as follows:
Synopsis
__m512i _mm512_shuffle_epi8 (__m512i a, __m512i b)
#include <immintrin.h>
Instruction: vpshufb zmm, zmm, zmm
CPUID Flags: AVX512BW
Description
Shuffle packed 8-bit integers in a according to shuffle control mask in the corresponding 8-bit element of b, and store the results in dst.
Operation
FOR j := 0 to 63
    i := j*8
    IF b[i+7] == 1
        dst[i+7:i] := 0
    ELSE
        index[5:0] := b[i+3:i] + (j & 0x30)
        dst[i+7:i] := a[index*8+7:index*8]
    FI
ENDFOR
dst[MAX:512] := 0
Performance
Architecture | Latency | Throughput (CPI) |
Icelake | - | 1 |
Skylake | 1 | 1 |
Even after reading this I was still confused, so I wrote the following simple program as an experiment:
#include <stdio.h>
#include <immintrin.h>

/* Print the 64 bytes of a vector in memory order (lowest byte first). */
void printU8(const unsigned char *result)
{
    for (int i = 0; i < 64; i++)
    {
        printf("%02d ", result[i]);
    }
    printf("\n");
}

int main(void)
{
    /* _mm512_set_epi8 takes its arguments from the highest byte down to the
       lowest, so in memory `in` reads 63, 62, ..., 0. */
    const __m512i in = _mm512_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
                                       16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
                                       32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,
                                       48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63);
    /* The same 16 control bytes are repeated for each of the four groups. */
    const __m512i mask = _mm512_set_epi8(8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7,
                                         8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7,
                                         8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7,
                                         8,9,10,11,12,13,14,15,0,1,2,3,4,5,6,7);
    const __m512i out = _mm512_shuffle_epi8(in, mask);

    printf("In:\t");
    printU8((const unsigned char *)&in);
    printf("Mask:\t");
    printU8((const unsigned char *)&mask);
    printf("Out:\t");
    printU8((const unsigned char *)&out);
    return 0;
}
Compiled with ICC, the program prints the following:
In: 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Mask: 07 06 05 04 03 02 01 00 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 15 14 13 12 11 10 09 08
Out: 56 57 58 59 60 61 62 63 48 49 50 51 52 53 54 55 40 41 42 43 44 45 46 47 32 33 34 35 36 37 38 39 24 25 26 27 28 29 30 31 16 17 18 19 20 21 22 23 08 09 10 11 12 13 14 15 00 01 02 03 04 05 06 07
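Seen this way, the instruction actually processes four groups of data at once, and each group holds sixteen 8-bit integers (128 bits). The split into four groups comes from the + (j & 0x30) term in the pseudocode: it adds the starting offset of the group (0, 16, 32, or 48) to the 4-bit index taken from the mask. After that the rest is easy to follow: each mask byte selects the position within the same group of In whose value is stored into the result. Leaving aside the case where bit 7 of a mask byte is set (the result byte is then zeroed), the per-group pseudocode is:
FOR j := 0 to 15
    out[j] := in[mask[j]]
ENDFOR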
The figure below makes this even clearer:
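So what this instruction really does is reorder sixteen 8-bit integers in any way you like, for example to convert byte order, and it handles four such groups at once.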