[① ARM Neon]: Introduction to Neon Intrinsics

李71~李先森

Last modified: 2023-10-08 11:17:23

Tags: ARM development

First published: 2023-09-28 13:55:55

Article link: https://blog.csdn.net/eng_ingli/article/details/133353973

Preface

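ARM introduced the VFP (Vector Floating Point) instruction extension in the v5 architecture to accelerate floating-point arithmetic. Starting with the v7 architecture it introduced Neon, which accelerates computation with vector instructions and supersedes the earlier VFP mode. At its core, Neon is a SIMD (Single Instruction, Multiple Data) technology.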

Neon registers

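Take ARMv7 (AArch32) as an example: the general-purpose registers are 32 bits wide, whereas the vector registers used by Neon are 64 or 128 bits wide. ARMv7 provides sixteen 128-bit registers, called Q registers. Each 128-bit register can be viewed as two 64-bit registers, which gives thirty-two 64-bit registers, called D registers. When writing code, keep track of how many of these dedicated registers you use: using too many causes register spilling, which can make the "optimized" code slower than the original.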

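Fundamentally, Neon operates on 128-bit registers, so when working with 32-bit floating-point values it can process four of them at once. Depending on the width of the data being processed, a Neon register can also be thought of as a family of registers, that is, it can be used as an 8-bit, 16-bit, 32-bit, 64-bit or 128-bit register.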
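As a minimal sketch of "four 32-bit floats per 128-bit register", the function below adds two float arrays four lanes at a time. The name add_f32 is hypothetical, and the scalar fallback is an assumption added here so that the example also compiles on targets without Neon.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *dst, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* One 128-bit Q register holds four 32-bit floats, so step by 4. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);       /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);       /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));   /* 4 additions at once  */
    }
#endif
    /* Scalar tail (and the whole loop when Neon is unavailable). */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

With n = 6, the Neon path handles the first four elements in one vector operation and the scalar loop handles the remaining two.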
ARM's official documentation describes Neon registers in terms of register, vector and lane. In its words: "The Neon registers contain vectors of elements of the same data type. The same element position in the input and output registers is referred to as a lane." The number of lanes therefore depends on the length of the vector held in the register and the bit width of each element.
A 128-bit Neon register can hold the following elements:

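16 8-bit elements, operand suffix .16B, where B stands for byte
8 16-bit elements, operand suffix .8H, where H stands for halfword
4 32-bit elements, operand suffix .4S, where S stands for word
2 64-bit elements, operand suffix .2D, where D stands for doubleword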

A 64-bit Neon register can hold the following elements:

8 8-bit elements, operand suffix .8B, where B stands for byte
4 16-bit elements, operand suffix .4H, where H stands for halfword
2 32-bit elements, operand suffix .2S, where S stands for word
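The lane counts in the two lists above are simply the register width divided by the element width. A small sketch of that arithmetic follows; neon_lane_count is a hypothetical helper, not part of any Arm header.

```c
#include <stddef.h>

/* Hypothetical helper: lanes = register width / element width (in bits).
   Returns 0 when the element width does not evenly divide the register. */
size_t neon_lane_count(size_t reg_bits, size_t elem_bits)
{
    if (elem_bits == 0 || reg_bits % elem_bits != 0)
        return 0;
    return reg_bits / elem_bits;
}
```

For instance, neon_lane_count(128, 8) gives the 16 lanes of the .16B arrangement, and neon_lane_count(64, 32) gives the 2 lanes of .2S.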

Neon data types

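All data types defined by Neon are named according to the following rule (for example, int16x4_t is a vector of four 16-bit signed integers):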

(type)x(lanes)_t

64-bit data types (D-register):

typedef __simd64_int8_t int8x8_t;
typedef __simd64_int16_t int16x4_t;
typedef __simd64_int32_t int32x2_t;
typedef __builtin_neon_di int64x1_t;
typedef __simd64_float16_t float16x4_t;
typedef __simd64_float32_t float32x2_t;
typedef __simd64_poly8_t poly8x8_t;
typedef __simd64_poly16_t poly16x4_t;
typedef __builtin_neon_poly64 poly64x1_t;
typedef __simd64_uint8_t uint8x8_t;
typedef __simd64_uint16_t uint16x4_t;
typedef __simd64_uint32_t uint32x2_t;
typedef __builtin_neon_udi uint64x1_t;

128-bit data types (Q-register):

typedef __simd128_int8_t int8x16_t;
typedef __simd128_int16_t int16x8_t;
typedef __simd128_int32_t int32x4_t;
typedef __simd128_int64_t int64x2_t;
typedef __simd128_float16_t float16x8_t;
typedef __simd128_float32_t float32x4_t;
typedef __simd128_poly8_t poly8x16_t;
typedef __simd128_poly16_t poly16x8_t;
typedef __simd128_uint8_t uint8x16_t;
typedef __simd128_uint16_t uint16x8_t;
typedef __simd128_uint32_t uint32x4_t;
typedef __simd128_uint64_t uint64x2_t;
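To make the naming concrete, here is a small sketch that loads four int32 values into an int32x4_t (a Q register), splits it into two int32x2_t halves (D registers), and sums the lanes. The function name sum4_s32 is hypothetical, and the scalar branch is an assumption added so the example builds on non-Neon targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: returns v[0] + v[1] + v[2] + v[3]. */
int32_t sum4_s32(const int32_t v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int32x4_t q  = vld1q_s32(v);        /* int32 x 4 lanes: one Q register */
    int32x2_t lo = vget_low_s32(q);     /* lanes 0..1: one D register      */
    int32x2_t hi = vget_high_s32(q);    /* lanes 2..3: the other D half    */
    int32x2_t d  = vadd_s32(lo, hi);    /* {v0 + v2, v1 + v3}              */
    return vget_lane_s32(d, 0) + vget_lane_s32(d, 1);
#else
    return v[0] + v[1] + v[2] + v[3];   /* scalar fallback */
#endif
}
```

The type names tell you the shape at each step: x4 in a 128-bit register, x2 in a 64-bit one.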

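There are also structured data types, which combine the basic data types above into a struct and are usually mapped onto a group of vector registers, for example: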

typedef struct int8x8x2_t
{
  int8x8_t val[2];
} int8x8x2_t;

typedef struct int8x16x2_t
{
  int8x16_t val[2];
} int8x16x2_t;

typedef struct int16x4x2_t
{
  int16x4_t val[2];
} int16x4x2_t;

typedef struct int16x8x2_t
{
  int16x8_t val[2];
} int16x8x2_t;

typedef struct int32x2x2_t
{
  int32x2_t val[2];
} int32x2x2_t;

typedef struct int32x4x2_t
{
  int32x4_t val[2];
} int32x4x2_t;

typedef struct int64x1x2_t
{
  int64x1_t val[2];
} int64x1x2_t;

typedef struct int64x2x2_t
{
  int64x2_t val[2];
} int64x2x2_t;

typedef struct uint8x8x2_t
{
  uint8x8_t val[2];
} uint8x8x2_t;

typedef struct uint8x16x2_t
{
  uint8x16_t val[2];
} uint8x16x2_t;

typedef struct uint16x4x2_t
{
  uint16x4_t val[2];
} uint16x4x2_t;

typedef struct uint16x8x2_t
{
  uint16x8_t val[2];
} uint16x8x2_t;

typedef struct uint32x2x2_t
{
  uint32x2_t val[2];
} uint32x2x2_t;

typedef struct uint32x4x2_t
{
  uint32x4_t val[2];
} uint32x4x2_t;

typedef struct uint64x1x2_t
{
  uint64x1_t val[2];
} uint64x1x2_t;

typedef struct uint64x2x2_t
{
  uint64x2_t val[2];
} uint64x2x2_t;

typedef struct float16x4x2_t
{
  float16x4_t val[2];
} float16x4x2_t;
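One common place these struct types show up is in the interleaving load intrinsics. As a hedged sketch, vld2_u8 reads 16 interleaved bytes and returns a uint8x8x2_t whose val[0] holds the even-indexed bytes and val[1] the odd-indexed ones. The function name deinterleave8 is hypothetical, and the scalar branch is an assumption for non-Neon builds.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: split 16 interleaved bytes into two 8-byte streams. */
void deinterleave8(const uint8_t src[16], uint8_t even[8], uint8_t odd[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x8x2_t pair = vld2_u8(src);  /* val[0] = src[0,2,4,...]   */
                                      /* val[1] = src[1,3,5,...]   */
    vst1_u8(even, pair.val[0]);
    vst1_u8(odd,  pair.val[1]);
#else
    for (int i = 0; i < 8; ++i) {     /* scalar fallback */
        even[i] = src[2 * i];
        odd[i]  = src[2 * i + 1];
    }
#endif
}
```

The two members of the struct occupy two D registers, which is what "mapped onto a group of vector registers" means in practice.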

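Note also the data types such as poly8x8_t and poly16x8_t. Here poly stands for polynomial: each element holds the coefficients (0 or 1) of a polynomial, so for example $x^3+x^2+1$ is represented as 1101b.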
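To show how polynomial elements differ from ordinary integers, the sketch below multiplies two 8-bit polynomials with coefficients modulo 2 (additions become XOR, with no carries) and keeps the low 8 bits, which is the per-lane behavior of vmul_p8. The names poly8_mul_ref and poly8_mul are hypothetical; the scalar reference is an illustration written for this article, not code from an Arm header.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: carry-less multiply, low 8 bits of the product. */
uint8_t poly8_mul_ref(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & (1u << i))
            r ^= (uint8_t)(a << i);   /* XOR instead of add: no carries */
    }
    return r;
}

/* Same result via vmul_p8 when Neon is available. */
uint8_t poly8_mul(uint8_t a, uint8_t b)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    poly8x8_t pa = vreinterpret_p8_u8(vdup_n_u8(a));
    poly8x8_t pb = vreinterpret_p8_u8(vdup_n_u8(b));
    return vget_lane_u8(vreinterpret_u8_p8(vmul_p8(pa, pb)), 0);
#else
    return poly8_mul_ref(a, b);
#endif
}
```

For example, 1101b times 11b, that is $(x^3+x^2+1)(x+1) = x^4+x^2+x+1$, gives 10111b (0x17), whereas the ordinary integer product 13 * 3 is 39 (100111b).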

Looking up Neon intrinsics

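All Neon intrinsics can be searched on ARM's official website: https://developer.arm.com/architectures/instruction-sets/intrinsics.
Take the vaddq_s8 intrinsic as an example. Its entry describes the operation, the corresponding assembly instruction, the arguments, the return value, pseudocode, and so on. Pay particular attention to the Architectures field, which lists the Arm architectures on which the intrinsic is available; before using an intrinsic, confirm that your target architecture supports it.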
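Learning the naming convention of the intrinsics also makes them easier to search for and to understand. The general pattern given in ARM's documentation is ret v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type(args), where: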

ret - the return type of the function
v - short for vector and is present on all the intrinsics
p - indicates a pairwise operation ([value] means value may be present)
q - indicates a saturating operation (with the exception of vqtb[l][x] in AArch64 operations, where the q indicates 128-bit index and result operands)
r - indicates a rounding operation
name - the descriptive name of the basic operation
u - indicates signed-to-unsigned saturation
n - indicates a narrowing operation
q - postfixing the name indicates an operation on 128-bit vectors
x - indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s or d (that is 8, 16, 32, or 64 bits)
_high - In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64-bits of the source operand(s). For narrowing, it refers to the top 64-bits of the destination operand
_n - indicates a scalar operand supplied as an argument
_lane - indicates a scalar operand taken from the lane of a vector. _laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width (left | right means only left or right would appear)
type - the primary operand type in short form
args - the function’s arguments

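From these rules we can read vaddq_s8 as follows: it adds vectors of int8 elements, and it operates on 128-bit vector registers, so a single call adds two sets of 16 int8 values lane by lane.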
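As a sketch of how the name maps to behavior, the two functions below apply vaddq_s8 (plain add, trailing q for 128-bit vectors, s8 for int8 lanes) and vqaddq_s8 (the same with a leading q for saturation) to 16 lanes. The function names add16_s8 and qadd16_s8 are hypothetical, and the scalar branches are assumptions added so that the example compiles without Neon.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Lane-wise add of 16 int8 values; overflow wraps around. */
void add16_s8(const int8_t a[16], const int8_t b[16], int8_t out[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_s8(out, vaddq_s8(vld1q_s8(a), vld1q_s8(b)));
#else
    for (int i = 0; i < 16; ++i)   /* scalar fallback, wraps modulo 256 */
        out[i] = (int8_t)(uint8_t)((uint8_t)a[i] + (uint8_t)b[i]);
#endif
}

/* Lane-wise saturating add: results are clamped to [-128, 127]. */
void qadd16_s8(const int8_t a[16], const int8_t b[16], int8_t out[16])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_s8(out, vqaddq_s8(vld1q_s8(a), vld1q_s8(b)));
#else
    for (int i = 0; i < 16; ++i) { /* scalar fallback with clamping */
        int sum = a[i] + b[i];
        if (sum > INT8_MAX) sum = INT8_MAX;
        if (sum < INT8_MIN) sum = INT8_MIN;
        out[i] = (int8_t)sum;
    }
#endif
}
```

With a lane holding 100 + 100, the plain add wraps while the saturating variant clamps to 127, which is the difference the leading q in the name signals.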

Conclusion

Finally, here are a few websites for learning and using Neon:

ARM intrinsics: https://developer.arm.com/architectures/instruction-sets/intrinsics
ARM Neon programming quick reference: https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/arm-neon-programming-quick-reference