【SIMD】NEON-intrinsic指令集总结

张西瓜.�

已于 2023-07-17 11:46:02 修改

阅读量1.5k

点赞数 1

文章标签：开发语言 c++

于 2023-07-12 10:58:12 首次发布

本文链接：https://blog.csdn.net/weixin_43730667/article/details/131640154

版权

文章详细介绍了SIMD技术，用于提升CPU处理多数据的性能，通过向量指令和多核机制优化计算。在ARMv8AArch64架构中，使用S,D,Q寄存器处理不同数据类型，并展示了编程方式和循环步长的优化。此外，文章还讨论了自动向量化，包括OpenMP编译指导语句和NEON指令集，用于高效编程和计算操作。

摘要由CSDN通过智能技术生成

一、SIMD

在这里插入图片描述

1. 概念

SIMD是使用一条指令同时处理多个数据的计算技术，可以理解为将原先的一次计算一个值改为一次并行计算多个值。这种并行技术是通过将多个数据打包传输至某一向量寄存器,用CPU核心中多个ALU一次计算处理该个向量寄存器来实现的。

2. CPU多级存储架构

在这里插入图片描述

3. 提升性能

提高流水线（基本优化）
提高Cache效率（基本优化）
向量指令，提高向量长度
多核机制
多节点机制

4. 指令集

Intel：MMX SSE AVX
ARM：NEON SVE

5. 求和向量化实例

在这里插入图片描述

二、ARMv8 AArch64结构

1.向量寄存器

S寄存器标量 single
D寄存器矢量 doubleword
Q寄存器矢量 quadword

2.数据类型

整数类型
- Byte 8bits
- Halfword 16bits
- Word 32bits
- DoubleWord 64bits
- Quadword 128bits
浮点类型
- Half-precision 16bits
- Single-precision 32bits
- Double-precision 64bits

3. 寄存器如何命名

在这里插入图片描述

通过操作寄存器来完成循环体计算，降低访存开销
对寄存器多通道多数据进行计算，增强并行性

3.编程方式

在这里插入图片描述

4. 循环步长

在这里插入图片描述

for (int i = 2; i < 2 + (n + 1 - 2) / 16 * 16; i +=16)

循环代码流程

在这里插入图片描述

声明向量类型变量
load
向量计算
store
https://developer.arm.com/architectures/instruction-sets/intrinsics/

自动向量化（编译&&openmp)

自动向量化

1. 自动向量化

在这里插入图片描述

2. 打印中间程序

在这里插入图片描述

编译指导语句（OpenMP)

1. 基本语句

// 1. pragma omp simd
#pragma omp simd reduction(+:sum)
for (i = 0; i < n; i ++)
	sum += val[i];
	
// 2. 子句
collapse(2) 
private(s) 
reduction(+:sum)

// 3. 依赖项最大距离
safelen(length)

#pragma omp simd safelen(4)
for (i = 0; i < (N - 4); i ++)
	a[i] = a[i + 4] + b[i] * c[i];
	
//4. 指定向量通道数
simdlen(length)

#pragma omp simd simdlen(4)
for (i = 0; i < N; i ++)
	c[i] = a[i] * b[i];

// 5. 线性映射
linear(list[:linear-step])

#pragma omp simd simdlen(4) linear(i: 4)
for (i = 0; i < N; i += 4)
	c[i] = a[i] * b[i];

// 6. 多线程多数据混合
#pragma omp parallel for simd simdlen(4)

2. 效果

在这里插入图片描述

NEON-intrinsic指令集

向量类型

在这里插入图片描述

命名规则

在这里插入图片描述

mod

//1. 饱和计算
q

//2.  >> 1
h

//3. 扩大向量长度,double运算，要结合l
d
int32x4_t vqdmull_s16(int16x4_t a, int16x4_t b)

//4. 四舍五入计算 (a[i] + b[i] + 1) >> 1
rh
int16x4_t vrhadd_s16(int16x4_t a, int16x4_t b)

//5. 向量两束进行两两和操作
p
int16x4_t vpadd_s16(int16x4_t a, int16x4_t b)

operate

在这里插入图片描述

shape

l //long 128
n //naroow
wide //wide
_high
_n
_lane

在这里插入图片描述

flags

在这里插入图片描述

type

在这里插入图片描述

q
h
d
rh
p

ld st add sub mul mla mls eq gt adn or shl shr cvt dup mov

l //long 128
n //naroow
wide //wide
_high
_n
_lane

u:uint u8 u16 u32
s:int s8 s16 s32
f:float f32 f64

存取操作

解决rgb->bgr的转换问题

交叉存取interleaving

LD1
LD2
LD3: LD3R LD3{}[4] LD3
LD4

在这里插入图片描述

#include <arm_neon.h>   
int main (void)  
{  
    uint8x8x3_t v; // 这表示3个向量。
    // 每个向量有8个8位数据通道。
    unsigned char A [24]; // 这个数组表示一个24位RGB图像。
    v = vld3_u8 (A); // 这将从数组A中解交错24位图像
    // 并将它们存储在3个单独的向量中
    // v.val [0] 是V中的第一个向量。它是红色通道
    // v.val [1] 是V中的第二个向量。它是绿色通道
    // v.val [2] 是V中的第三个向量。它是蓝色通道。
    // 将红色通道加倍
    v.val [0] = vadd_u8 (v.val [0],v.val [0]);   
    vst3_u8 (A, v); // 将向量存回数组，红色通道加倍。
    return 0;  
}

数据加载

int16x4 vld1_s16(int16_6 const* ptr)
int32x2x2_t vld2_s32(int32_t const* ptr)
int16x4_t vld1_lane_s16(int16_t const* ptr, int16x4_t src, const int ilane)
uint16x4x2_t vld2_dup_u16(uint16_t const* ptr)

在这里插入图片描述

数据存储

vst1_s16(int16_t* ptr, int16x4_t val);
vst1_lane_u16(uint16_t* ptr. uint16x4_t val. const int lane);

在这里插入图片描述

算数操作

初始化向量寄存器

// 1. 一次性初始化向量所有通道
int9x9_t vcreate_s8(uint64_t)
uint8x8_t v;
v = vcreate_u8(0x0102030405060708);

// 2. 直接赋值(操作符重载）
float32x4_t vec={1.0, 2.0, 3.0, 4.0};

// 3. 取值(操作符重载）
float val = vec[0];

// 4. 使用单一数值初始寄存器所有通道
// vdup_n_type: 用类型为type的数值，初始化同数据类型输出向量的所有通道元素。
uint16x4_t vdup_n_u16(uint16_t value)
// vdup_lane_type: 用元素类型为type的vec向量中指定第ilane通道元素，初始化同数据类型输
// 出向量的所有通道元素
int16x4_t vdup_lane_s16(int16x4_t vec, const int ilane)

// 5. mov操作
// 用类型为type的数值，初始化同数据类型输出向量的所有通道元素。
float32x2_t vmov_n_f32(float32_t value)
uint16x4_t vdup_n_u16(uint16_t value) // 等同

加法

int32x2_t vadd_s32(int32x2_t a, int32x2_t b)
// vadd_type: r[i] = a[i] + b[i]
// vqadd_type: r[i] = sat(a[i] + b[i]) 饱和指令

int16x4_t vaddhn_s32(int32x4_t a, int32x4_t b)
// vaddhn_type: 结果vector元素的位数减半，是输入vector位数的一半
// vaddl_type: 加法运算结果位数加倍，目的防止溢出
// vaddw_type: 两个输入vector元素位数不一致，第一个vector宽度需大于第二个vector

int8x8_t vhadd_s8(int8x8_t a, int8x8_t b)
// vhadd_type: 相加结果再除2。
// vrhadd_type: 相加结果再除2(实现四舍五入)。r[i] = (a[i] + b[i] + 1) >> 1

uint16x4_t vpadd_u16(uint16x4_t a, uint16x4_t b)
//  vpadd_type: 向量a与向量b进行pairwise运算。

int16x4_t vpaddl_s8(int8x8_t a)
// vpaddl_type: 将单个输入vector内的数据进行pairwise运算，同时结果vector的位数宽度
//加倍。如r[0] = a[0] + a[1], ..., r[3] = a[6] + a[7]

int32x2_t vpadal_s16(int32x2_t a, int16x4_t b)
// vpadal_type: r[0] = a[0] + (b[0] + b[1]), ..., r[1] = a[1] + (b[2] + b[3]);
Vadda

Vaddv：r=a[0]+a[1]+…+a[n]

减法

vsub_type
// r[i] = a[i] – b[i]。向量减
vsubl_type
vsubw_type
vsubhn_type
// 相减结果，结果向量长度缩短一半, r[i] = a[i] – b[i]

vqsub_type
// 饱和指令 r[i] = sat(a[i] – b[i])
vhsub_type
// 相减结果再除2。r[i] = (a[i] – b[i]) >> 1
vrsubhn_type
// 相减结果再加1，结果向量长度再缩短一半。r[i] = a[i] – b[i] + 1

乘法

vmul_type
//r[i] = a[i] * b[i]
// uint16x4_t vmul_u16(uint16x4_t a, uint16x4_t b)

vmul_n_type
//r[i] = a[i] * b
// float16x4_t vmul_n_f16(float16x4_t a, float16_t b)

vmul_lane_type: r[i] = a[i] * b[ilane]
// uint16x4_t vmul_lane_u16(uint16x4_t a, uint16x4_t b, const int ilane)

vmull_type
// 变长乘法运算，为了防止溢出
// int32x4_t vmull_s16(int16x4_t a, int16x4_t b)
// vqdmull_lane_s16
// bit位变长，乘法，并左移，运算，参与运算的值是有符号数
//（所以可能溢出）,当结果溢出时，取饱和值

除法

vrecpe_type
// 求近似倒数，type是f32或者u32。 vrecpe_type计算倒数能保证千分之一
// 左右的精度，如1.0的倒数为0.998047。
// float32x4_t recip = vrecpeq_f32(float32x4_t src)
// 此时能达到千分之一左右的精度，如1.0的倒数为0.998047

vrecps_f32
// (牛顿 - 拉夫逊迭代)，两个vector乘积的倒数
// float32x2_t vrecps_f32(float32x2_t a, float32x2_t b

// 先执行vrecpe求出src的低精度倒数rec
rec = vrecpeq_f32(src)
// 使用vrecps，执行下句后能达到百万分之一左右精度结果recip，如1.0的倒数为0.999996
recip1 = vmulq_f32 (vrecpsq_f32 (src, rec), rec);
// 再次执行vrecps后，能基本能达到完全精度，如1.0的倒数为1.000000
recip2 = vmulq_f32 (vrecpsq_f32 (src, recip1), recip1)

vsqrt_type
//计算输入值的平方根，r[i]=sqrt(a[i])。输入、输出向量为整数时，结果为近似值
int16x4_t vsqrt_f32(int16x4_t a)

vrsqrte_type
// 计算输入值的平方根的倒数，r[i]=1/sqrt(a[i]) 。
float32x2_t vrsqrte_f32(float32x2_t a)

乘加

vmla_type: r[i] = a[i] + b[i] * c[i]
// int8x8_t vmla_s8(int8x8_t a, int8x8_t b, int8x8_t c)
vmla_n_type: r[i] = a[i] + b[i] * c
// float32x2_t vmla_n_f32(float32x2_t a, float32x2_t b, float32_t c)
vmla_lane_type: r[i] = a[i] + b[i] * c[ilane]
// int16x4_t vmla_lane_s16(int16x4_t a, int16x4_t b, int16x4_t c, const int lane)
vmlal_type: r[i] = a[i] + b[i] * c[i]，先将b*c结果向量的bit长度增长，再与a相加
// int16x8_t vmlal_s8(int16x8_t a, int8x8_t b, int8x8_t c)
vfma_f32
// r[i] = a[i] + b[i] * c[i] ，在加法之前，b[i] 、c[i]相乘的结果不会被四舍五入
// float32x4_t vfmaq_f32(float32x4_t a, float32x4_t b, float32x4_t c)
vfma_lane_f32
// r[i] = a[i] + b[i] * c[ilane] 
// float32x2_t vfma_lane_f32(float32x2_t a, float32x2_t b, float32x2_t c, const int ilane)

//fma只有浮点数

乘减

vmls_type
// r[i] = a[i] - b[i] * c[i]
vmls_n_type
// r[i] = a[i] - b[i] * c
vmls_lane_type
// r[i] = a[i] - b[i] * c[ilane]
vmlsl_type
// r[i] = a[i] - b[i] * c[i] ，并将结果进行向量长度增长
vfms_f32
// r[i] = a[i] - b[i] * c[i] ， 在减法之前，bi、ci相乘的结果不会被四舍五入

比较操作

// 相等 equal to
vceq_type
// 大于, greater than
vcgt_type
// 小于 ,less than
vclt_type
// 大于或等于, greater than or equal to
vcge_type
// 小于或等于
vcle_type

float32x4_t vbslq_f32(uint32x4_t a, float32x4_t b, float32x4_t c)
// 当a中某通道元素不为0时，对应通道处结果向量取b
// 当a中某通道元素为0时，对应通道处结果向量取c

逻辑操作

// 位运算 只操作整形
vand_type
// 与，r[i] = a[i] & b[i]，b[i]作为
// uint16x4_t vand_u16(uint16x4_t a, uint16x4_t b)
vorr_type
// 或，r[i] = a[i] | b[i]
veor_type
// 异或， 对应bit位相同则结果为0，否则为1
vmvn_type
// 非，r[i] = ~a[i] ，结合异或指令可以成为同或
// 注意：只有整型向量可以进行位运算

绝对值

vabs_type
// r[i] = |a[i]|
vqabs_type
// r[i] = sat(|a[i]|)
vabd_type
// r[i] = |a[i] – b[i]|
vabdl_type
// 长指令
vaba_type
// r[i] = a[i] + |b[i] – c[i]|
vabal_type
// 长指令

最大值最小值

vmax_type
// r[i] = a[i] >= b[i] ? a[i] : b[i]
vpmax_type
// r[0] = a[0] >= a[1] ? a[0] : a[1], ..., r[4] = b[0] >= b[1] ? b[0]: b[1], ...
vmin_type
// r[i] = a[i]<= b[i] ? a[i] : b[i]
vpmin_type
// r[0] = a[0] <= a[1] ? a[0] : a[1], ..., r[4] = b[0] <= b[1] ? b[0] : b[1], ...
vmaxv_type
// r = max(a[i])

移位运算

• 左移
• vshl_type: r[i] = a[i] << b[i] ,如果b[i]是负数，则变成右移
uint16x4_t vshl_u16 (uint16x4_t a, int16x4_t b)
• vshl_n_type: r[i] = a[i] << b 
uint16x4_t vshl_n_u16(uint16x4_t a, const int b)
• 右移
• vshr_type: r[i] = a[i] >> b[i] ,如果b[i]是负数，则变成左移
• vshr_n_type: r[i] = a[i] >> b

取相反数

• vneg_type: r[i] = -a[i]
• vqneg_type: r[i] = sat(-a[i])

数据类型转换

在64位和128位寄存器间转移整型数据
• mov操作
• uint8x8_t vmovn_u16(uint16x8_t a)
• vmovn_type: 用旧vector创建一个新vector，新vector的元素bit位是旧vector的一半。新
vector元素只保留旧vector元素的低半部分。
• vqmovn_type: 同vmovn_type类似。但如果旧vector元素的值超过新vector元素的最大
有效范围，则新vector元素就取最大有效值。否则新vector元素就等于旧vector元素的值。
• vqmovun_type: 作用与vqmovn_type类似，但它输入的是有符号vector，输出的是无符
号vector。

• int16x8_t vmovl_s8(int8x8_t a)
• vmovl_type: 将vector的元素bit位扩大到原来的两倍，元素值不变


在64位和128位寄存器间转移浮点型数据
在同长度寄存器间转移浮点型与整型数据
• cvt操作
• vcvt_type1_type2：将数据从type2转换成type1。注意，在float转换到uint时，是向下取整，
且如果是负数，则转换后为0
• 如：int32x2_t vcvt_s32_f32(float32x2_t a)
表示从32位浮点型转换到32位整型

获取高半位和低半位

• 将128位寄存器的一半数据，转移到64位寄存器，元素位数不变
• vget_low_type: 获取128bit vector的低半部分元素，输出的是元素类型相同的64bit vector
int8x8_t vget_low_s8(int8x16_t a)
• vget_high_type: 获取128bit vector的高半部分元素，输出的是元素类型相同的64bit vector
float16x4_t vget_high_f16(float16x8_t a)
• vcombine_type: vcombine可以将两个64位向量连接成一个128位向量。输出向量的高半部分由第二个输入向量元素组成。
float32x4_t ret = vcombine_f32(a, b) 可以将两个 float32x2_t 类型的向量 a 和 b 连接成一个 float32x4_t 类型的向量ret，a中元素传给了位ret向量的低半位。

获取、设置neon寄存器某个通道的值

• vget_lane_type: 获取元素类型为type的vector中指定的某个元素值。
uint8_t vget_lane_u8(uint8x8_t v, const int ilane)
等同于： uint8_t val=vec[ilane]
• vset_lane_type: 设置元素类型为type的vector中指定的某个元素的值，并返回新vector
uint8x8_t vset_lane_u8(uint8_t a, uint8x8_t v, const int ilane)

寄存器数据重排

包括能对两个向量寄存器进行转置的vtrn函数，能将向量解释成其他数据类型的vreinterpret函数，能对向量各元素进行反转的vrev。

vtrn_type：两个向量的两个通道为一组，将第一个向量的前个通道与第二个向量的后个通道
数值进行交换。
• 如：int8x8x2_t vtrnq_s8(int16x4_t D0, int16x4_t D1)


• DstType vreinterpret_DstType_SrcType(SrcType Src)
类似于C/C++的指针类型强制转换，寄存器总长度和内部二进制数据不发生变化，但是元素
数值类型发生改变
• 例如：float64x2_t vreinterpretq_f64_f32(float32x4_t a)
将float32x4_t寄存器重新解释为float64x2_t类型。


REV指令：以向量寄存器中指定bit数为单位，单位范围内前后交换各通
道数据，并赋值给新寄存器
uint8x16_t vrev64q_u8(uint8x16_t vec)

vext_type: 取第2个输入vector的低n个元素放入新vector的高位，新vector剩下的元素取自第
1个输入vector最高的几个元素(可实现vector内元素位置的移动)
src1 = {1,2,3,4,5,6,7,8}
src2 = {9,10,11,12,13,14,15,16}
dst = vext_type(src1,src2,3)时，则dst = {4,5,6,7,8, 9,10,11}

点积

uint32x4_t vdotq_u32(uint32x4_t c, uint8x16_t a, uint8x16_t b)
• 例如，对第一组四个元素执行的操作为：
• c[0] = c[0] +（（a[0] * b[0]）+（a[1] * b[1] +（a[2] * b[2] +（a[3] * b[3]））