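Introduction
ARM introduced the VFP (Vector Floating Point) instruction extension in the v5 architecture to accelerate floating-point arithmetic. Starting with the v7 architecture it introduced Neon technology, which accelerates computation with vector instructions and supersedes the older VFP vector mode. At its core, Neon is SIMD (single instruction, multiple data): one instruction operates on several data elements at once.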
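Neon registers
Take ARMv7 (AArch32) as an example: the general-purpose registers are 32 bits wide, while the vector registers used by Neon are 64 or 128 bits wide. ARMv7 has sixteen 128-bit registers, called Q registers. Each Q register can also be viewed as two 64-bit registers, called D registers, so there are thirty-two D registers in total. When writing code, keep track of how many of these dedicated registers are in use at the same time: using too many causes register spilling, which can turn an intended optimization into a slowdown.
Neon fundamentally operates on 128-bit registers, so with 32-bit floats a single instruction can process four values at once. Depending on the width of the data being processed, a Neon register can also be thought of as a collection of smaller registers: it can be used as 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit registers.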
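To make the "four floats per 128-bit register" point concrete, here is a minimal sketch of an element-wise array addition. The function name `add_f32` and the `HAVE_NEON` macro are made up for this illustration. The scalar loop doubles as a fallback so the code also builds on targets without Neon.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Element-wise addition: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* One 128-bit Q register holds four 32-bit floats,
       so each iteration handles four elements. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar tail (and the whole loop when Neon is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The tail loop matters in practice: array lengths are rarely an exact multiple of the lane count.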
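ARM's official documentation describes Neon registers in terms of register -> vector -> lane. In its words: "The Neon registers contain vectors of elements of the same data type. The same element position in the input and output registers is referred to as a lane." That is, a Neon register holds a vector of elements that all share one data type, and the same element position in the input and output registers is called a lane. The number of lanes therefore depends on the length of the vector in the register and on the bit width of each element.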
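A 128-bit Neon register can contain the following elements (the operand suffixes shown are the AArch64 assembly arrangement specifiers):
- sixteen 8-bit elements, operand suffix .16B, where B stands for byte
- eight 16-bit elements, operand suffix .8H, where H stands for halfword
- four 32-bit elements, operand suffix .4S, where S stands for a (single) word
- two 64-bit elements, operand suffix .2D, where D stands for doubleword
A 64-bit Neon register can contain the following elements:
- eight 8-bit elements, operand suffix .8B, where B stands for byte
- four 16-bit elements, operand suffix .4H, where H stands for halfword
- two 32-bit elements, operand suffix .2S, where S stands for a (single) word
- one 64-bit element, operand suffix .1D, where D stands for doubleword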
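The lane counts in the two lists above all follow from one division: register width over element width. The sketch below spells that out. The `lane_count` helper is a hypothetical name for this illustration. On a Neon target, the compile-time checks tie the same numbers to the intrinsic vector types, assuming the GCC/Clang behavior where `sizeof` of a vector type equals the register size in bytes.

```c
#include <stdint.h>

/* Lanes = register width in bits / element width in bits. */
unsigned lane_count(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
/* Compile-time checks (C11) that the vector types match the lists above. */
_Static_assert(sizeof(int8x16_t) / sizeof(int8_t)  == 16, ".16B");
_Static_assert(sizeof(int16x8_t) / sizeof(int16_t) == 8,  ".8H");
_Static_assert(sizeof(int32x4_t) / sizeof(int32_t) == 4,  ".4S");
_Static_assert(sizeof(int64x2_t) / sizeof(int64_t) == 2,  ".2D");
_Static_assert(sizeof(int8x8_t)  / sizeof(int8_t)  == 8,  ".8B");
_Static_assert(sizeof(int16x4_t) / sizeof(int16_t) == 4,  ".4H");
_Static_assert(sizeof(int32x2_t) / sizeof(int32_t) == 2,  ".2S");
#endif
```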
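Neon data types
The data types defined by Neon are all named according to the following rule:
(type)x(lanes)_t
For example, int16x4_t is a vector of four int16_t lanes, and float32x4_t is a vector of four 32-bit floats.
64-bit data types (D register):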
typedef __simd64_int8_t int8x8_t;
typedef __simd64_int16_t int16x4_t;
typedef __simd64_int32_t int32x2_t;
typedef __builtin_neon_di int64x1_t;
typedef __simd64_float16_t float16x4_t;
typedef __simd64_float32_t float32x2_t;
typedef __simd64_poly8_t poly8x8_t;
typedef __simd64_poly16_t poly16x4_t;
typedef __builtin_neon_poly64 poly64x1_t;
typedef __simd64_uint8_t uint8x8_t;
typedef __simd64_uint16_t uint16x4_t;
typedef __simd64_uint32_t uint32x2_t;
typedef __builtin_neon_udi uint64x1_t;
128-bit data types (Q register):
typedef __simd128_int8_t int8x16_t;
typedef __simd128_int16_t int16x8_t;
typedef __simd128_int32_t int32x4_t;
typedef __simd128_int64_t int64x2_t;
typedef __simd128_float16_t float16x8_t;
typedef __simd128_float32_t float32x4_t;
typedef __simd128_poly8_t poly8x16_t;
typedef __simd128_poly16_t poly16x8_t;
typedef __simd128_uint8_t uint8x16_t;
typedef __simd128_uint16_t uint16x8_t;
typedef __simd128_uint32_t uint32x4_t;
typedef __simd128_uint64_t uint64x2_t;
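As a rough illustration of how the D-register and Q-register types relate, the sketch below loads eight int16_t values into one int16x8_t, splits it into its two int16x4_t halves, and adds the halves with a widening add so the result is an int32x4_t. The function name `sum8_s16` and the `HAVE_NEON` macro are invented for this example. The scalar branch is only a fallback for non-Neon builds.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Returns p[0] + p[1] + ... + p[7] without 16-bit overflow. */
int32_t sum8_s16(const int16_t *p)
{
#if HAVE_NEON
    int16x8_t q  = vld1q_s16(p);       /* Q register: 8 x int16 */
    int16x4_t lo = vget_low_s16(q);    /* D register: lanes 0..3 */
    int16x4_t hi = vget_high_s16(q);   /* D register: lanes 4..7 */
    int32x4_t s  = vaddl_s16(lo, hi);  /* widening add: 4 x int32 */
    return vgetq_lane_s32(s, 0) + vgetq_lane_s32(s, 1)
         + vgetq_lane_s32(s, 2) + vgetq_lane_s32(s, 3);
#else
    int32_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += p[i];
    return sum;
#endif
}
```

Note how the type names track the data: 16x8 becomes two 16x4 halves, and the widened result is 32x4, still filling a 128-bit register.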
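There are also structured data types, which combine several of the basic vector types above into a struct. They are usually mapped onto a group of vector registers, for example: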
typedef struct int8x8x2_t
{
int8x8_t val[2];
} int8x8x2_t;
typedef struct int8x16x2_t
{
int8x16_t val[2];
} int8x16x2_t;
typedef struct int16x4x2_t
{
int16x4_t val[2];
} int16x4x2_t;
typedef struct int16x8x2_t
{
int16x8_t val[2];
} int16x8x2_t;
typedef struct int32x2x2_t
{
int32x2_t val[2];
} int32x2x2_t;
typedef struct int32x4x2_t
{
int32x4_t val[2];
} int32x4x2_t;
typedef struct int64x1x2_t
{
int64x1_t val[2];
} int64x1x2_t;
typedef struct int64x2x2_t
{
int64x2_t val[2];
} int64x2x2_t;
typedef struct uint8x8x2_t
{
uint8x8_t val[2];
} uint8x8x2_t;
typedef struct uint8x16x2_t
{
uint8x16_t val[2];
} uint8x16x2_t;
typedef struct uint16x4x2_t
{
uint16x4_t val[2];
} uint16x4x2_t;
typedef struct uint16x8x2_t
{
uint16x8_t val[2];
} uint16x8x2_t;
typedef struct uint32x2x2_t
{
uint32x2_t val[2];
} uint32x2x2_t;
typedef struct uint32x4x2_t
{
uint32x4_t val[2];
} uint32x4x2_t;
typedef struct uint64x1x2_t
{
uint64x1_t val[2];
} uint64x1x2_t;
typedef struct uint64x2x2_t
{
uint64x2_t val[2];
} uint64x2x2_t;
typedef struct float16x4x2_t
{
float16x4_t val[2];
} float16x4x2_t;
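One common place these struct types show up is in the interleaving load/store intrinsics. The following sketch uses `vld2_u8`, which returns a uint8x8x2_t, to split interleaved byte pairs into two separate arrays. The function name `deinterleave8_u8` and the `HAVE_NEON` macro are assumptions made for this example, and the scalar branch exists only so it builds without Neon.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Splits 8 interleaved pairs (x0,y0,x1,y1,...,x7,y7) into xs[8] and ys[8]. */
void deinterleave8_u8(const uint8_t src[16], uint8_t xs[8], uint8_t ys[8])
{
#if HAVE_NEON
    /* vld2_u8 reads 16 bytes and de-interleaves them into two D registers:
       val[0] gets the even-indexed bytes, val[1] the odd-indexed bytes. */
    uint8x8x2_t pairs = vld2_u8(src);
    vst1_u8(xs, pairs.val[0]);
    vst1_u8(ys, pairs.val[1]);
#else
    for (int i = 0; i < 8; ++i) {
        xs[i] = src[2 * i];
        ys[i] = src[2 * i + 1];
    }
#endif
}
```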
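Note the poly8x8_t, poly16x8_t, and similar data types. Here "poly" stands for polynomial: each bit of an element is one coefficient of a polynomial, so, for example, $x^3 + x^2 + 1$ is represented as 1101b.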
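To show what polynomial arithmetic means for these types, here is a small sketch of a carry-less multiply, where partial products are combined with XOR instead of addition. On Neon it uses `vmul_p8`, which keeps the low 8 bits of each lane's polynomial product. The function name `polymul8_low` and the `HAVE_NEON` macro are made up for this illustration.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Low 8 bits of the polynomial (carry-less) product of a and b. */
uint8_t polymul8_low(uint8_t a, uint8_t b)
{
#if HAVE_NEON
    /* Broadcast each operand to all 8 lanes, multiply, read lane 0. */
    poly8x8_t va = vdup_n_p8((poly8_t)a);
    poly8x8_t vb = vdup_n_p8((poly8_t)b);
    return (uint8_t)vget_lane_p8(vmul_p8(va, vb), 0);
#else
    uint8_t r = 0;
    for (int i = 0; i < 8; ++i)
        if (b & (1u << i))
            r ^= (uint8_t)(a << i);   /* XOR instead of add: no carries */
    return r;
#endif
}
```

For instance, multiplying 1101b ($x^3 + x^2 + 1$) by 11b ($x + 1$) gives 10111b ($x^4 + x^2 + x + 1$), i.e. `polymul8_low(0x0D, 0x03)` is 0x17, whereas the ordinary integer product 13 * 3 is 39.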
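Looking up Neon intrinsics
All Neon intrinsics can be searched on ARM's official website: https://developer.arm.com/architectures/instruction-sets/intrinsics
Take the vaddq_s8 intrinsic as an example.
Its entry describes the operation, the corresponding assembly instruction, the arguments, the return value, pseudocode, and so on. Pay particular attention to the Architectures field, which lists the ARM architectures on which the intrinsic is available. Before using an intrinsic, confirm that your target architecture supports it.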
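One way to act on the Architectures field is to select code paths at compile time. The sketch below assumes the GCC/Clang predefined macros `__aarch64__` and `__ARM_NEON`. It uses `vaddvq_s32`, which is listed for A64 only, on AArch64, and falls back to a pairwise add on ARMv7 Neon. The function names `neon_target` and `sum4_s32` are hypothetical.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Reports which code path this translation unit was compiled for. */
const char *neon_target(void)
{
#if defined(__aarch64__)
    return "AArch64 Advanced SIMD";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "AArch32 Neon";
#else
    return "no Neon";
#endif
}

/* Returns p[0] + p[1] + p[2] + p[3]. */
int32_t sum4_s32(const int32_t *p)
{
#if defined(__aarch64__)
    /* Across-vector add: available on A64 only. */
    return vaddvq_s32(vld1q_s32(p));
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* ARMv7 has no across-vector add, so use a pairwise add instead. */
    int32x4_t q = vld1q_s32(p);
    int32x2_t s = vpadd_s32(vget_low_s32(q), vget_high_s32(q));
    return vget_lane_s32(s, 0) + vget_lane_s32(s, 1);
#else
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```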
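Learning the naming convention of the intrinsics also helps when searching for them and makes them easier to understand. As documented by Arm, the naming pattern is:
ret v[p][q][r]name[u][n][q][x][_high][_lane | _n][_result]_type(args)
The parts of the name mean the following: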
- ret - the return type of the function
- v - short for vector and is present on all the intrinsics
- p - indicates a pairwise operation
- q - indicates a saturating operation (with the exception of vqtb[l][x] in AArch64 operations, where the q indicates 128-bit index and result operands)
- r - indicates a rounding operation
- name - the descriptive name of the basic operation
- u - indicates signed-to-unsigned saturation
- n - indicates a narrowing operation
- q - postfixing the name indicates an operation on 128-bit vectors
- x - indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s or d (that is 8, 16, 32, or 64 bits)
- _high - In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64-bits of the source operand(s). For narrowing, it refers to the top 64-bits of the destination operand
- _n - indicates a scalar operand supplied as an argument
- _lane - indicates a scalar operand taken from the lane of a vector. _laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width
- type - the primary operand type in short form
- args - the function’s arguments
In the naming pattern, [value] means value may be present, and left | right means only left or right would appear.
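As an illustration of the `_n` and `_lane` parts of the naming rules, the sketch below multiplies a vector by a scalar in two ways: `vmulq_n_f32` takes the scalar as an ordinary argument, while `vmulq_lane_f32` takes it from a chosen lane of a second, 64-bit vector. The function names `scale4_n` and `scale4_lane1` and the `HAVE_NEON` macro are invented for this example.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* dst[i] = src[i] * k, with the scalar passed as an argument (_n). */
void scale4_n(float dst[4], const float src[4], float k)
{
#if HAVE_NEON
    vst1q_f32(dst, vmulq_n_f32(vld1q_f32(src), k));
#else
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i] * k;
#endif
}

/* dst[i] = src[i] * coeff[1], with the scalar taken from lane 1 (_lane). */
void scale4_lane1(float dst[4], const float src[4], const float coeff[2])
{
#if HAVE_NEON
    float32x2_t c = vld1_f32(coeff);   /* D register holding two floats */
    /* The lane index must be a compile-time constant. */
    vst1q_f32(dst, vmulq_lane_f32(vld1q_f32(src), c, 1));
#else
    for (int i = 0; i < 4; ++i)
        dst[i] = src[i] * coeff[1];
#endif
}
```

The `_lane` form is handy when several coefficients already sit in one register, for example the taps of a small filter.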
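From these rules we can read vaddq_s8 as follows: it is an addition (add) of vectors of int8 elements (s8), and the q after the name means it operates on 128-bit vector registers. A single call therefore adds two vectors of sixteen int8 values each, lane by lane.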
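A minimal sketch of `vaddq_s8` in use, next to its saturating counterpart `vqaddq_s8` to show what the leading q in the naming rules changes. The function names `add16_wrap` and `add16_sat` and the `HAVE_NEON` macro are hypothetical. The scalar fallbacks assume the usual GCC/Clang behavior that converting an out-of-range value to int8_t wraps modulo 256.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Lane-wise add of 16 int8 values; overflow wraps around. */
void add16_wrap(int8_t dst[16], const int8_t a[16], const int8_t b[16])
{
#if HAVE_NEON
    vst1q_s8(dst, vaddq_s8(vld1q_s8(a), vld1q_s8(b)));
#else
    for (int i = 0; i < 16; ++i)
        dst[i] = (int8_t)(uint8_t)((uint8_t)a[i] + (uint8_t)b[i]);
#endif
}

/* Lane-wise add of 16 int8 values; overflow clamps to [-128, 127]. */
void add16_sat(int8_t dst[16], const int8_t a[16], const int8_t b[16])
{
#if HAVE_NEON
    vst1q_s8(dst, vqaddq_s8(vld1q_s8(a), vld1q_s8(b)));
#else
    for (int i = 0; i < 16; ++i) {
        int s = a[i] + b[i];
        if (s > INT8_MAX) s = INT8_MAX;
        if (s < INT8_MIN) s = INT8_MIN;
        dst[i] = (int8_t)s;
    }
#endif
}
```

With 100 + 100 in one lane, the wrapping version yields -56 and the saturating version yields 127.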
Conclusion
Finally, here are a few links for learning and using Neon:
ARM intrinsics: https://developer.arm.com/architectures/instruction-sets/intrinsics
ARM Neon programming quick reference: https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/arm-neon-programming-quick-reference