使用Neon intrinsics优化C代码

七点八客

已于 2025-01-03 19:05:14 修改

阅读量1.3k

点赞数 32

分类专栏：学习ARM架构-A系列文章标签： Neon 架构 arm开发指令集

于 2024-03-19 18:48:29 首次发布

本文链接：https://blog.csdn.net/qq_16106195/article/details/136851480

版权

学习ARM架构-A系列专栏收录该内容

10 篇文章

订阅专栏

概述

本文介绍了如何在 C 或 C++ 代码中使用 Neon intrinsics，以利用 Armv8 架构中的 Advanced SIMD 技术。另外，也提供了一些使用 Neon 的示例，并介绍了他们的用途。

本文对使用 Advanced SIMD 技术的底层软件工程师、库开发者和其他开发人员很有用。

本文的最后一部分可以测试你是否已经理解以下基本概念：

了解 Neon 是什么，以及使用 Neon 的不同方式
了解在 C 语言中使用 Neon intrinsics 的基础知识
了解在哪里可以找到 Neon intrinsics 参考实现和 Neon 指令集

什么是Neon

Neon 是 Arm Advanced SIMD 架构的实现，Neon 的目的是加速数据处理：

32 个128 位向量寄存器，每个向量寄存器都可以包含多个数据通道
SIMD 指令可同时对多个数据通道进行操作

使用 Neon 技术的应用包括多媒体和信号处理、3D图形、语音、图像处理或其他对定点和浮点性能至关重要的应用。

作为编程者，你有多种方式使用 Neon 技术：

支持 Neon 的开源库，例如 Arm Compute Library
编译器的自动矢量化特性可以利用 Neon 技术自动优化你的代码
Neon intrinsics内建函数，编译器用相应的 Neon 指令进行了封装
对于经验丰富的程序员来说，为了获得极佳的性能，手动编写 Neon 汇编也是一种方法

本文重点介绍在 AArch64 上使用 Neon intrinsics，但它们也可以编译成 AArch32，更多关于 AArch32 Neon的信息请参考 Neon简介。

为什么使用Neon intrinsics

intrinsics是编译器已经实现的内部函数。Neon intrinsics 是在arm_neon.h中定义的一组 C 和 C++ 函数，Arm编译器和GCC均支持。使用这些函数你无需直接编写汇编代码，因为在调用这些函数时，编译器会自动生成相关的 Neon 指令。另外，编译器会处理寄存器分配和流水线优化，因此可以避免编写汇编程序遇到的许多问题。

Neon Intrinsics Reference列举了所有的 Neon intrinsics 内建函数。Neon intrinsics 工程规范参见Arm C Language Extensions (ACLE) 。

使用 Neon intrinsics 内建函数有很多好处：

功能强大：intrinsics 使程序员可以直接运用 Neon 指令集，而无需手写汇编代码
可移植：手写的 Neon 汇编指令可能需要针对不同的处理器进行重写。使用 Neon intrinsics 内建函数的C/C++代码可以支持编译新的处理器或者执行状态（例如，从AArch64移植到AArch32），只需要更改很少的代码甚至无需更改
灵活：程序员可以在需要时利用 Neon，在不需要时使用 C/C++，避免许多底层问题

然而，intrinsics 并非是所有情况的最佳选择：

学习使用 Neon intrinsics 比导入一个库或者直接使用编译器要花费些时间
尽管编写汇编比较难，但手动优化的汇编代码可以带来极佳的性能

接下来看一下使用 Neon intrinsics 实现的C函数示例。这些示例并不能反映应用的所有情况，但他们展现了如何使用 intrinsics。

示例1-RGB解交错

考虑一个24位 RGB 图像，其中图像是一组像素，每个像素点都有一个红色、蓝色和绿色元素。在内存中，这可能如下图所示：

RGB图像像素

由于 RGB 数据是交错的，因此访问和操作三个独立的颜色通道给编程者带来了问题。在简单的情况下，我们可以通过运用模 3 操作来实现单色通道运算。然而，对于更复杂的操作，例如傅立叶变换，提取和分割通道会更合理。

假定在内存中有一个 RGB 值数组，我们希望对其进行解交错并将这些值存放到单独的颜色数组中。一个可能的 C 程序如下所示：

void rgb_deinterleave_c(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) 
{
    /*
    * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
    */
    for (int i=0; i < len_color; i++) {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i+1];
        b[i] = rgb[3*i+2];
	}
}

但这有一个问题。使用 Arm Compiler 6 在优化等级 ‑O3（非常高的优化）下进行编译，检查汇编代码发现并没有使用 Neon 指令或寄存器，每个8位值都存储在单独的64位通用寄存器中。考虑到 Neon 寄存器为128位宽，在示例中每个寄存器可以保存16个8位值，重新使用 Neon intrinsics 实现，应该可以带来好的效果。

void rgb_deinterleave_neon(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) 
{
    /*
    * Take the elements of "rgb" and store the individual colors "r", "g", and "b"
    */
    int num8x16 = len_color / 16;
    uint8x16x3_t intlv_rgb;
    for (int i=0; i < num8x16; i++) {
        intlv_rgb = vld3q_u8(rgb+3*16*i);
        vst1q_u8(r+16*i, intlv_rgb.val[0]);
        vst1q_u8(g+16*i, intlv_rgb.val[1]);
        vst1q_u8(b+16*i, intlv_rgb.val[2]);
    }
}

在上面示例中，我们使用了以下类型和 intrinsics：

代码	意义	为什么使用
uint8x16_t	一个包含16个8位无符号整型的数组	一个 uint8x16_t 可以填充到128位寄存器。即使在C代码中，也可以确保不存在浪费的寄存器位
uint8x16x3_t	一个包含3个 uint8x16_t 元素的结构	在循环中临时保存当前颜色值
vld3q_u8	将内存中3*16字节的连续区域加载到 uint8x16_t 中，每个字节都交替存放在每个 uint8x16_t 数组中	在底层代码中，这个 intrinsics 会保证生成 LD3 指令，其将给定地址的值加载到三个 Neon 寄存器中
vst1q_u8	将一个 uint8x16_t 存储到指定地址	存储一个128位寄存器值

可以使用以下命令在 Arm 机器上编译和反汇编：

gcc -g -o3 rgb.c -o exe_rgb_o3
objdump -d exe_rgb_o3 > disasm_rgb_o3

示例2-矩阵乘法

矩阵乘法在许多数据密集型程序运用较多，其简单、重复执行算术运算，如下图所示：

矩阵乘法

矩阵乘法过程如下：

A：在第一个矩阵中取一行
B：执行该行与第二个矩阵中列的点积
C：将结果存储在新矩阵的相应行和列中

对于32位浮点矩阵，乘法可以写为：

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) 
{
	for (int i_idx=0; i_idx < n; i_idx++) {
		for (int j_idx=0; j_idx < m; j_idx++) {
			C[n*j_idx + i_idx] = 0;
			for (int k_idx=0; k_idx < k; k_idx++) {
				C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
			}
		}
	}
}

假设内存中矩阵按列优先布局。也就是说，nxm矩阵M表示为数组M_array，其中Mij = M_array[n*j + i]。

该代码不是最理想的，因为它没有充分利用 Neon。我们可以通过使用 intrinsics 进行改进，首先看一个比较简单的矩阵，然后再转向更大的矩阵。

以下代码使用 intrinsics 将两个4x4矩阵相乘。由于我们要处理的数据较少且固定，这些值都可以立即放入处理器的 Neon 寄存器中，因此我们可以在循环中进行展开：

void matrix_multiply_4x4_neon(float32_t *A, float32_t *B, float32_t *C) 
{
    // these are the columns A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;
    // these are the columns B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;
    // these are the columns C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;
    A0 = vld1q_f32(A);
    A1 = vld1q_f32(A+4);
    A2 = vld1q_f32(A+8);
    A3 = vld1q_f32(A+12);
    // Zero accumulators for C values
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);
    // Multiply accumulate in 4x1 blocks, i.e. each column in C
    B0 = vld1q_f32(B);
    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
    vst1q_f32(C, C0);
    B1 = vld1q_f32(B+4);
    C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
    C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
    C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
    C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
    vst1q_f32(C+4, C1);
    B2 = vld1q_f32(B+8);
    C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
    C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
    C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
    C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
    vst1q_f32(C+8, C2);
    B3 = vld1q_f32(B+12);
    C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
    C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
    C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
    C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
    vst1q_f32(C+12, C3);
}

选择乘以固定大小的4x4矩阵有几个原因：

某些应用程序特别需要4x4矩阵，例如图形或相对论物理
Neon 向量寄存器保存四个32位值，因此容易优化程序

看一下使用的 intrinsics：

代码	意义	为什么使用
float32x4_t	一个包含4个32位浮点数的数组	一个 float32x4_t 可以填充到128位寄存器。即使在C代码中，也可以确保不存在浪费的寄存器位
vld1q_f32(…)	将4个32位浮点数加载到 float32x4_t 中的函数	为了从 A 和 B 中获取我们需要的矩阵值
vfmaq_lane_f32(…)	使用融合乘法累加指令的函数，将一个 float32x4_t 中的每个元素与另一个 float32x4_t 中指定通道的元素相乘，然后将结果与第三个 float32x4_t 相加，最后返回结果	由于矩阵行列点积是一组乘法和加法，因此该运算十分合适
vst1q_f32(…)	一个在指定地址存储 float32x4_t 的函数	计算后保存结果

使用Neon实现矩阵AxB

上面官方文档中图示例看的不是太懂，接下来自己分析一下整个计算流程。

假定在C语言中矩阵使用数组进行表示，以列优先的原则，例如A[]={a₀, a₁, …, a₁₅}，B[]={b₀, b₁, …, b₁₅}，如下所示。
$A=\left[ \begin{matrix} a_0 & a_4 & a_8 & a_{12} \\ a_1 & a_5 & a_9 & a_{13} \\ a_2 & a_6 & a_{10} & a_{14} \\ a_3 & a_7 & a_{11} & a_{15} \end{matrix} \right] \tag{1}$

$B=\left[ \begin{matrix} b_0 & b_4 & b_8 & b_{12} \\ b_1 & b_5 & b_9 & b_{13} \\ b_2 & b_6 & b_{10} & b_{14} \\ b_3 & b_7 & b_{11} & b_{15} \end{matrix} \right] \tag{2}$

$C=\left[ \begin{matrix} c_0 & c_4 & c_8 & c_{12} \\ c_1 & c_5 & c_9 & c_{13} \\ c_2 & c_6 & c_{10} & c_{14} \\ c_3 & c_7 & c_{11} & c_{15} \end{matrix} \right] \tag{3}$

根据矩阵乘法规则，矩阵C中的每个元素等于A的该元素所在行与B的该元素所在列相乘再求和，如下所示。
$C=AB=\left[ \begin{matrix} a_0b_0+a_4b_1+a_8b_2+a_{12}b_3 & a_0b_4+a_4b_5+a_8b_6+a_{12}b_7 & a_0b_8+a_4b_9+a_8b_{10}+a_{12}b_{11} & a_0b_{12}+a_4b_{13}+a_8b_{14}+a_{12}b_{15} \\ a_1b_0+a_5b_1+a_9b_2+a_{13}b_3 & a_1b_4+a_5b_5+a_9b_6+a_{13}b_7 & a_1b_8+a_5b_9+a_9b_{10}+a_{13}b_{11} & a_1b_{12}+a_5b_{13}+a_9b_{14}+a_{13}b_{15} \\ a_2b_0+a_6b_1+a_{10}b_2+a_{14}b_3 & a_2b_4+a_{6}b_5+a_{10}b_6+a_{14}b_7 & a_2b_8+a_6b_9+a_{10}b_{10}+a_{14}b_{11} & a_2b_{12}+a_6b_{13}+a_{10}b_{14}+a_{14}b_{15} \\ a_3b_0+a_7b_1+a_{11}b_2+a_{15}b_3 & a_3b_4+a_{7}b_5+a_{11}b_6+a_{15}b_7 & a_3b_8+a_7b_9+a_{11}b_{10}+a_{15}b_{11} & a_3b_{12}+a_7b_{13}+a_{11}b_{14}+a_{15}b_{15} \end{matrix} \right] \tag{4}$

将矩阵按列划分，即A=[A0, A1, A2, A3]，这样一列刚好可以加载到float32x4_t变量中去。
$\begin{array}{|c|c|c|}\hline A0&A1&A2&A3\\ \hline a_0&a_4&a_8&a_{12}\\ a_1&a_5&a_9&a_{13}\\ a_2&a_6&a_{10}&a_{14}\\ a_3&a_7&a_{11}&a_{15}\\ \hline \end{array}\tag{5}$
比如A0就可以表示如下：
$A0=\left[ \begin{matrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{matrix} \right] \tag{6}$
函数命令C0 = vfmaq_laneq_f32(C0, A0, B0, 0)，其是将A0中的每个元素与B0中通道为0的元素相乘，然后再与C0中的相应元素相加，最后返回结果存储到C0中，由于命令C0 = vmovq_n_f32(0)将C0初始值为0，最终：
$C0=\left[ \begin{matrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{matrix} \right] =\left[ \begin{matrix} a_0b_0 \\ a_1b_0 \\ a_2b_0 \\ a_3b_0 \end{matrix} \right]\tag{7}$
当执行第一轮如下代码，即计算矩阵C的第一列C0：

C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
C0 = vfmaq_laneq_f32(C0, A3, B0, 3);

可以得到第一列C0数据为：
$C0=\left[ \begin{matrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{matrix} \right] =\left[ \begin{matrix} a_0b_0+a_4b_1+a_8b_2+a_{12}b_3 \\ a_1b_0+a_5b_1+a_9b_2+a_{13}b_3 \\ a_2b_0+a_6b_1+a_{11}b_2+a_{14}b_3 \\ a_3b_0+a_7b_1+a_{12}b_2+a_{15}b_3 \end{matrix} \right]\tag{8}$
后续C1、C2和C3列计算类似，最终可以计算得到矩阵C的结果如上面公式（4）所示。

现在我们可以实现更大矩阵的乘法，将其视为由许多4x4的矩阵块组成。这种方法的缺点就是它只适用于行列都是四的倍数的矩阵，但对于其他矩阵，你也可以通过用零填充实现。

下面列出了更为通用的矩阵乘法代码。内部结构变化很小，主要是增加了循环和地址计算。与4x4矩阵一样，我们对B的每一列分配了唯一的变量，尽管我们可以使用一个变量并重新加载。这样做是让编译器给B分配了不同的寄存器，在指令流水线的作用下，可以实现当在计算某一列的时候，处理器可以同时加载另外一列的数据。

void matrix_multiply_neon(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) 
{
    /*
    * Multiply matrices A and B, store the result in C.
    * It is the user's responsibility to make sure the matrices are compatible.
    */
    int A_idx;
    int B_idx;
    int C_idx;
    // these are the columns of a 4x4 sub matrix of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;
    // these are the columns of a 4x4 sub matrix of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;
    // these are the columns of a 4x4 sub matrix of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;
    for (int i_idx=0; i_idx<n; i_idx+=4 {
        for (int j_idx=0; j_idx<m; j_idx+=4) {
            // zero accumulators before matrix op
            c0=vmovq_n_f32(0);
            c1=vmovq_n_f32(0);
            c2=vmovq_n_f32(0);
            c3=vmovq_n_f32(0);
            for (int k_idx=0; k_idx<k; k_idx+=4) {
                // compute base index to 4x4 block
                a_idx = i_idx + n*k_idx;
                b_idx = k*j_idx k_idx;
                // load most current a values in row
                A0=vld1q_f32(A+A_idx);
                A1=vld1q_f32(A+A_idx+n);
                A2=vld1q_f32(A+A_idx+2*n);
                A3=vld1q_f32(A+A_idx+3*n);
                // multiply accumulate 4x1 blocks, i.e. each column C
                B0=vld1q_f32(B+B_idx);
                C0=vfmaq_laneq_f32(C0,A0,B0,0);
                C0=vfmaq_laneq_f32(C0,A1,B0,1);
                C0=vfmaq_laneq_f32(C0,A2,B0,2);
                C0=vfmaq_laneq_f32(C0,A3,B0,3);
                B1=v1d1q_f32(B+B_idx+k);
                C1=vfmaq_laneq_f32(C1,A0,B1,0);
                C1=vfmaq_laneq_f32(C1,A1,B1,1);
                C1=vfmaq_laneq_f32(C1,A2,B1,2);
                C1=vfmaq_laneq_f32(C1,A3,B1,3);
                B2=vld1q_f32(B+B_idx+2*k);
                C2=vfmaq_laneq_f32(C2,A0,B2,0);
                C2=vfmaq_laneq_f32(C2,A1,B2,1);
                C2=vfmaq_laneq_f32(C2,A2,B2,2);
                C2=vfmaq_laneq_f32(C2,A3,B3,3);
                B3=vld1q_f32(B+B_idx+3*k);
                C3=vfmaq_laneq_f32(C3,A0,B3,0);
                C3=vfmaq_laneq_f32(C3,A1,B3,1);
                C3=vfmaq_laneq_f32(C3,A2,B3,2);
                C3=vfmaq_laneq_f32(C3,A3,B3,3);
			}
            //Compute base index for stores
            C_idx = n*j_idx + i_idx;
            vstlq_f32(C+C_idx, C0);
            vstlq_f32(C+C_idx+n,Cl);
            vstlq_f32(C+C_idx+2*n,C2);
            vstlq_f32(C+C_idx+3*n,C3);
		}
	}
}

编译上面的函数，并与C代码实现进行比较：

由于利用了 Advanced SIMD 技术，对于给定矩阵相乘，其算术指令更少，而通常C代码做不到这一点。
用FMLA指令替代了FMUL指令。
更少的循环，当使用 intrinsics，循环展开更容易。
然而由于内存分配和数据类型初始化，存在不必要的加载和存储（例如float32x4_t），这些纯C代码不需要。

下面是完整的代码示例：

/*
* Copyright (C) Arm Limited, 2019 All rights reserved.
*
* The example code is provided to you as an aid to learning when working
* with Arm-based technology, including but not limited to programming tutorials.
* Arm hereby grants to you, subject to the terms and conditions of this Licence,
* a non-exclusive, non-transferable, non-sub-licensable, free-of-charge licence,
* to use and copy the Software solely for the purpose of demonstration and
* evaluation.
*
* You accept that the Software has not been tested by Arm therefore the Software
* is provided "as is", without warranty of any kind, express or implied. In no
* event shall the authors or copyright holders be liable for any claim, damages
* or other liability, whether in action or contract, tort or otherwise, arising
* from, out of or in connection with the Software or the use of Software.
*/
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
#include <arm_neon.h>
#define BLOCK_SIZE 4

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) 
{
    for (int i_idx=0; i_idx<n; i_idx++) {
            for (int j_idx=0; j_idx<m; j_idx++) {
                C[n*j_idx + i_idx] = 0;
                for (int k_idx=0; k_idx<k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx +
                k_idx];
            }
        }
    }
}

void matrix_multiply_neon(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) 
{
    /*
    * Multiply matrices A and B, store the result in C.
    * It is the user's responsibility to make sure the matrices are compatible.
    */
    int A_idx;
    int B_idx;
    int C_idx;
    // these are the columns of a 4x4 sub matrix of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;
    // these are the columns of a 4x4 sub matrix of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;
    // these are the columns of a 4x4 sub matrix of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;
    for (int i_idx=0; i_idx<n; i_idx+=4 {
        for (int j_idx=0; j_idx<m; j_idx+=4) {
            // zero accumulators before matrix op
            c0=vmovq_n_f32(0);
            c1=vmovq_n_f32(0);
            c2=vmovq_n_f32(0);
            c3=vmovq_n_f32(0);
            for (int k_idx=0; k_idx<k; k_idx+=4) {
                // compute base index to 4x4 block
                a_idx = i_idx + n*k_idx;
                b_idx = k*j_idx k_idx;
                // load most current a values in row
                A0=vld1q_f32(A+A_idx);
                A1=vld1q_f32(A+A_idx+n);
                A2=vld1q_f32(A+A_idx+2*n);
                A3=vld1q_f32(A+A_idx+3*n);
                // multiply accumulate 4x1 blocks, i.e. each column C
                B0=vld1q_f32(B+B_idx);
                C0=vfmaq_laneq_f32(C0,A0,B0,0);
                C0=vfmaq_laneq_f32(C0,A1,B0,1);
                C0=vfmaq_laneq_f32(C0,A2,B0,2);
                C0=vfmaq_laneq_f32(C0,A3,B0,3);
                B1=v1d1q_f32(B+B_idx+k);
                C1=vfmaq_laneq_f32(C1,A0,B1,0);
                C1=vfmaq_laneq_f32(C1,A1,B1,1);
                C1=vfmaq_laneq_f32(C1,A2,B1,2);
                C1=vfmaq_laneq_f32(C1,A3,B1,3);
                B2=vld1q_f32(B+B_idx+2*k);
                C2=vfmaq_laneq_f32(C2,A0,B2,0);
                C2=vfmaq_laneq_f32(C2,A1,B2,1);
                C2=vfmaq_laneq_f32(C2,A2,B2,2);
                C2=vfmaq_laneq_f32(C2,A3,B3,3);
                B3=vld1q_f32(B+B_idx+3*k);
                C3=vfmaq_laneq_f32(C3,A0,B3,0);
                C3=vfmaq_laneq_f32(C3,A1,B3,1);
                C3=vfmaq_laneq_f32(C3,A2,B3,2);
                C3=vfmaq_laneq_f32(C3,A3,B3,3);
			}
            //Compute base index for stores
            C_idx = n*j_idx + i_idx;
            vstlq_f32(C+C_idx, C0);
            vstlq_f32(C+C_idx+n,Cl);
            vstlq_f32(C+C_idx+2*n,C2);
            vstlq_f32(C+C_idx+3*n,C3);
		}
	}
}
         
void matrix_multiply_4x4_neon(float32_t *A, float32_t *B, float32_t *C) 
{
    // these are the columns A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;
    // these are the columns B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;
    // these are the columns C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;
    A0 = vld1q_f32(A);
    A1 = vld1q_f32(A+4);
    A2 = vld1q_f32(A+8);
    A3 = vld1q_f32(A+12);
    // Zero accumulators for C values
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);
    // Multiply accumulate in 4x1 blocks, i.e. each column in C
    B0 = vld1q_f32(B);
    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
    vst1q_f32(C, C0);
    B1 = vld1q_f32(B+4);
    C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
    C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
    C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
    C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
    vst1q_f32(C+4, C1);
    B2 = vld1q_f32(B+8);
    C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
    C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
    C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
    C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
    vst1q_f32(C+8, C2);
    B3 = vld1q_f32(B+12);
    C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
    C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
    C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
    C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
    vst1q_f32(C+12, C3);
}

void print_matrix(float32_t *M, uint32_t cols, uint32_t rows) 
{
    for (int i=0; i<rows; i++) {
        for (int j=0; j<cols; j++) {
        	printf("%f ", M[j*rows + i]);
        }
        printf("\n");
    }
    printf("\n");
}
         
void matrix_init_rand(float32_t *M, uint32_t numvals) 
{
    for (int i=0; i<numvals; i++) {
    	M[i] = (float)rand()/(float)(RAND_MAX);
    }
}
         
void matrix_init(float32_t *M, uint32_t cols, uint32_t rows, float32_t val) 
{
    for (int i=0; i<rows; i++) {
        for (int j=0; j<cols; j++) {
        	M[j*rows + i] = val;
        }
    }
}
         
bool f32comp_noteq(float32_t a, float32_t b) 
{
    if (fabs(a-b) < 0.000001) {
    	return false;
    }
    return true;
}
         
bool matrix_comp(float32_t *A, float32_t *B, uint32_t rows, uint32_t cols) 
{
    float32_t a;
    float32_t b;
    for (int i=0; i<rows; i++) {
        for (int j=0; j<cols; j++) {
            a = A[rows*j + i];
            b = B[rows*j + i];
            if (f32comp_noteq(a, b)) {
            	printf("i=%d, j=%d, A=%f, B=%f\n", i, j, a, b);
            	return false;
            }
        }
    }
    return true;
}
         
int main() 
{
    uint32_t n = 2*BLOCK_SIZE; // rows in A
    uint32_t m = 2*BLOCK_SIZE; // cols in B
    uint32_t k = 2*BLOCK_SIZE; // cols in a and rows in b
    float32_t A[n*k];
    float32_t B[k*m];
    float32_t C[n*m];
    float32_t D[n*m];
    float32_t E[n*m];
    bool c_eq_asm;
    bool c_eq_neon;
    matrix_init_rand(A, n*k);
    matrix_init_rand(B, k*m);
    matrix_init(C, n, m, 0);
    print_matrix(A, k, n);
    print_matrix(B, m, k);
    //print_matrix(C, n, m);
    matrix_multiply_c(A, B, E, n, m, k);
    printf("C\n");
    print_matrix(E, n, m);
    printf("===============================\n");
    matrix_multiply_neon(A, B, D, n, m, k);
    printf("Neon\n");
    print_matrix(D, n, m);
    c_eq_neon = matrix_comp(E, D, n, m);
    printf("Neon equal to C? %d\n", c_eq_neon);
    printf("===============================\n");
}

可以使用以下命令在 Arm 机器上编译和反汇编：

gcc -g -o3 matrix.c -o exe_matrix_o3
objdump -d exe_ matrix _o3 > disasm_matrix_o3

示例3-碰撞检测

这个示例使用 Neon intrinsics 实现了一个简单的碰撞检测算法，游戏软件可以利用这个算法识别物体何时接触或碰撞。

为了判断两个物体何时碰撞，我们将其视为圆形。如果首先考虑沿单个轴的碰撞检测，我们可以看到当这些圆重叠时会发生碰撞：

一根轴上的碰撞检测

假设两个圆的半径分别为r1和r2：

当两个圆之间的距离d大于r1和r2的和时不会发生碰撞
当两个圆之间的距离d小于等于r1和r2的和时会发生碰撞
二维空间中，每个圆心都可以用坐标 (x,y) 表示。考虑c1和c2两个圆，其圆心坐标分别为 (c1x, c1y) 和 (c2x, c2y)。

二维空间中碰撞检测

二维碰撞检测与单轴碰撞检测相同，为了计算两个圆心之间的距离，作一个虚线三角形，如下所示：

利用毕达哥拉斯定理计算圆之间的距离

三角形的高度是圆的y坐标之间的绝对差
三角形的底边是圆的x坐标之间的绝对差
三角形的斜边d是圆C1和C2圆心之间的距离

为了计算斜边d，我们使用毕达哥拉斯定理，斜边的平方等于其他两侧的平方和：
$d^2=(c1y-c2y)^2+(c1x-c2x)^2$

无矢量运算的算法实现

碰撞检测算法是获取两个圆心之间的距离，如果这个距离小于它们的半径之和，则两个圆会发生碰撞。下面是没有运用矢量的示例：

#include <stdio.h>
struct circle
{
    float radius;
    float x;
    float y;
};

bool does_collide(circle& c1, circle& c2)
{
    // Two circles collide if the distance from c1 to c2 is less
    // than the sum of their radii, or equivalently if the squared
    // distance is less than the square of the sum of the radii.
    float delta_x = c1.x - c2.x;
    float delta_y = c1.y - c2.y;
    float deltas_squared = (delta_x * delta_x) + (delta_y * delta_y);
    float radius_sum_squared = (c1.radius * c1.radius) + (c2.radius * c2.radius);
    return deltas_squared <= radius_sum_squared;
}

int main()
{
    circle c1;
    c1.radius = 2.0;
    c1.x = 2.0;
    c1.y = 4.0;
    circle c2;
    c2.radius = 1.0;
    c2.x = 6.0;
    c2.y = 1.0;
    if (does_collide(c1, c2)) {
    	printf("Circles collide\n");
    } else {
    	printf("Circles do not collide\n");
    }
    return (0);
}

上面代码会创建两个圆，如下：

c1：半径为2，坐标为（2,4）
c2：半径为1，坐标为（6,1）

x坐标之间的差为4（6-2），y坐标之间的差为3（4-1）。那么圆心之间的距离为5，计算如下图：

碰撞检测示例

圆心之间的距离 (5) 大于圆半径之和，因此两个圆不会发生碰撞。

将上述C代码编译生成汇编代码如下：

ldr s0, [x0] // Load c1.x (first data at c1 base address)
ldr s1, [x1] // Load c2.x (first data at c2 base address)
fsub s0, s0, s1 // Calculate (c1.x - c2.x)
ldr s2, [x0, 4] // Load c1.y (offset 4 from c1 base address)
ldr s1, [x1, 4] // Load c2.y (offset 4 from c2 base address)
fsub s2, s2, s1 // Calculate (c1.y - c2.y)
ldr s1, [x0, 8] // Load c1.radius (offset 8 from c1 base address)
ldr s3, [x1, 8] // Load c2.radius (offset 8 from c2 base address)
fmul s0, s0, s0 // Calculate (c1.x - c2.x)^2
fmul s2, s2, s2 // Calculate (c1.y - c2.y)^2
fadd s0, s0, s2 // Calculate ((c1.x - c2.x)^2 + (c1.y - c2.y)^2 )
fmul s1, s1, s1 // Calculate c1.radius^2
fmul s3, s3, s3 // Calculate c2.radius^2
fadd s1, s1, s3 // Calculate (c1.radius^2 + c2.radius^2)
fcmpe s0, s1 // Test ((c1.x - c2.x)^2 + (c1.y - c2.y)^2 ) against
// (c1.radius^2 + c2.radius^2)
cset w0, mi // If the result is negative (mi),
// set the result (w0) to 1
ret

使用Neon intrinsics进行基本矢量运算

Neon 指令可以同时对多个数据执行相同的计算。我们的碰撞检测算法包含多个实例，对不同的数据执行相同的数学运算，这些操作是可使用 Neon intrinsics 进行优化：

减法
从 c1.x 中减去 c2.x，计算 x 坐标的差值
从 c1.y 中减去 c2.y，计算 y 坐标的差值
乘法
- c1.radius 乘以 c1.radius，计算其本身的平方值
- c2.radius 乘以 c2.radius，计算其本身的平方值

以下代码使用 Neon intrinsics 优化了碰撞检测算法：

#include <arm_neon.h>
#include <stdio.h>
struct circle
{
    float x;
    float y;
    float radius;
} __attribute__((aligned (64)));

bool does_collide_neon(circle const& c1, circle const& c2)
{
    float32x2_t c1_coords = vld1_f32(&c1.x);
    float32x2_t c2_coords = vld1_f32(&c2.x);
    float32x2_t deltas = vsub_f32(c1_coords, c2_coords);
    float32x2_t deltas_squared = vmul_f32(deltas, deltas);
    float sum_deltas_squared = vpadds_f32(deltas_squared);
    float radius_sum = c1.radius + c2.radius;
    float radius_sum_squared = radius_sum * radius_sum;
    return sum_deltas_squared <= radius_sum_squared;
}

int main()
{
    circle c1;
    c1.radius = 2.0;
    c1.x = 2.0;
    c1.y = 4.0;
    circle c2;
    c2.radius = 1.0;
    c2.x = 6.0;
    c2.y = 1.0;
    if (does_collide_neon(c1, c2)) {
    	printf("Circles collide\n");
    } else {
    	printf("Circles do not collide\n");
    }
    return (0);
}

上述代码使用了双通道64位 Neon 寄存器，每个寄存器包含32位浮点数。使用 Neon intrinsics 的代码部分如下：

将每个圆的 x 和 y 坐标分别加载到两个单独 Neon 寄存器的不同通道中
```
float32x2_t c1_coords = vld1_f32(&c1.x);
float32x2_t c2_coords = vld1_f32(&c2.x);
```
vld1_f32函数将两个32位浮点数加载到 Neon 寄存器中，如下图所示：
作减法获取差值
```
float32x2_t deltas = vsub_f32(c1_coords, c2_coords);
```
vsub_f32函数将两个 Neon 寄存器中的对应通道相减，然后将结果放入目标寄存器的通道中，如下图所示：
将 delta 与其自身相乘，获得平方值：
```
float32x2_t deltas_squared = vmul_f32(deltas, deltas);
```
vmul_f32 函数将两个 Neon 寄存器中的对应通道相乘，然后将结果放入目标寄存器的通道中，如下图所示：

vmul_f32

对矢量求和得到一个标量值，即两个圆之间距离的平方：
```
float sum_deltas_squared = vpadds_f32(deltas_squared);
```
vpadds_f32函数将 Neon 寄存器中的两个浮点向量元素相加，得到一个标量结果。

编译上述C代码生成汇编代码如下：

ldr d0, [x0] // Load c1.x and c1.y into two lanes of vector d0
// first two data values at c1 base address
ldr d1, [x1] // Load c2.x and c2.y into two lanes of vector d1
// first two data values at c2 base address
fsub v0.2s, v0.2s, v1.2s // Vector subtract gets x and y deltas
fmul v0.2s, v0.2s, v0.2s // Square both deltas at the same time
faddp s0, v0.2s // Add across vector to get sum of squares
ldr s1, [x0, 8] // Code from here is scalar, same as before...
ldr s2, [x1, 8]
fadd s1, s1, s2
fmul s1, s1, s1
fcmpe s0, s1
cset w0, mi
ret

该代码在计算距离平方时并行化减法和乘法运算。

使用Neon intrinsics进行高级矢量运算

上一节的实现可能不会比无矢量运算的实现更快。并行减法和乘法运算的性能受到内存布局的限制，加载向量寄存器需要多条指令。此外，在需要执行跨通道加法操作之前，我们只并行化了两个操作，一个减法和一个乘法。

circle结构的声明意味着数据是交错的，这会抑制矢量运算。一种方法是更改我们的数据布局以提供解交错数据。

以下示例展示了这种方法，它创建了一个新的结构Circles，指向包括解交错的 x、y 和半径的数组指针。这种方法可以检查单个圆和多个圆之间的碰撞。

#include <arm_neon.h>
#include <stdio.h>

// Structure containing data for a single collider circle
struct circle
{
    float x;
    float y;
    float radius;
};

// Structure containing an array of pointers to data for multiple circles
struct circles
{
    size_t size;
    float* xs;
    float* ys;
    float* radii;
};

void does_collide_neon_deinterleaved(circles const& input, circle& collider, bool* out)
{
    // Duplicate the collider properties in 3 separate 4-lane vector registers
    float32x4_t c1_x = vdupq_n_f32(collider.x);
    float32x4_t c1_y = vdupq_n_f32(collider.y);
    float32x4_t c1_r = vdupq_n_f32(collider.radius);
    for (size_t offset = 0; offset != input.size; offset += 4)
    {
    	// Perform 4 collision tests at a time
        float32x4_t x = vld1q_f32(input.xs + offset);
        float32x4_t y = vld1q_f32(input.ys + offset);
        float32x4_t delta_x = vsubq_f32(c1_x, x);
        float32x4_t delta_y = vsubq_f32(c1_y, y);
        float32x4_t delta_x_squared = vmulq_f32(delta_x, delta_x);
        float32x4_t delta_y_squared = vmulq_f32(delta_y, delta_y);
        float32x4_t sum_deltas_squared = vaddq_f32(delta_x_squared, delta_y_squared);
        float32x4_t r = vld1q_f32(input.radii + offset);
        float32x4_t radius_sum = vaddq_f32(c1_r, r);
        float32x4_t radius_sum_squared = vmulq_f32(radius_sum, radius_sum);
        uint32x4_t mask = vcltq_f32(sum_deltas_squared, radius_sum_squared);
        // Unpack the results in each lane
        out[offset] = 1 & vgetq_lane_u32(mask, 0);
        out[offset + 1] = 1 & vgetq_lane_u32(mask, 1);
        out[offset + 2] = 1 & vgetq_lane_u32(mask, 2);
        out[offset + 3] = 1 & vgetq_lane_u32(mask, 3);
    }
}

int main()
{
    int num_input = 4;
    float input_x[num_input] __attribute__((aligned (64)));
    float input_y[num_input] __attribute__((aligned (64)));
    float input_r[num_input] __attribute__((aligned (64)));
    bool output[num_input] __attribute__((aligned (64)));
    // Set up the data for multiple circles
    for (int i = 0; i < num_input; i++) {
        input_x[i] = i*2;
        input_y[i] = i*3;
        input_r[i] = i;
        output[i] = 0;
    }
    // Create input object containing pointers to array data for multiple circles
    circles c1;
    c1.size = num_input;
    c1.radii = input_r;
    c1.xs = input_x;
    c1.ys = input_y;
    // Create collider object containing data for a single circle
    circle c2;
    c2.radius = 5.0;
    c2.x = 10.0;
    c2.y = 10.0;
    // Test whether the collider circle collides with any of the input circles,
    // returning results in output
    does_collide_neon_deinterleaved(c1, c2, output);
    // Iterate over the returned output data and display results
    for (int i = 0; i < num_input; i++) {
        if (output[i]) {
        	printf("Circle %d at (%.1f, %.1f) with radius %.1f collides\n", i, input_x[i],
        	input_y[i], input_r[i]);
        } else {
        	printf("Circle %d at (%.1f, %.1f) with radius %.1f does not collide\n", i,
        	input_x[i], input_y[i], input_r[i]);
        }
    }
    return (0);
}

使用 Neon intrinsics 的代码部分如下：

将x，y和半径拷贝到三个 Neon 寄存器中：
```
float32x4_t c1_x = vdupq_n_f32(collider.x);
float32x4_t c1_y = vdupq_n_f32(collider.y);
float32x4_t c1_r = vdupq_n_f32(collider.radius);
```
vdupq_n_f32函数将一个标量值复制到 Neon 寄存器的所有通道中，如下图所示：
接下来将四个圆的 x 和 y 数据加载到两个不同 Neon 的单独通道中：
```
float32x4_t x = vld1q_f32(input.xs + offset);
float32x4_t y = vld1q_f32(input.ys + offset);
```
vld1q_f32函数将连续内存地址中的四个32位值加载到 Neon 寄存器的四个 32 位通道。地址是通过在基地址加上偏移量来计算的x 和 y 数据的地址指针。每次循环迭代时偏移量都会增加4，因为每个循环处理四个圆。下图显示了第二次迭代循环，使用偏移量4：

将四个输入x值分别减去c1_x，得到delta_x，delta_y类似：

float32x4_t delta_x = vsubq_f32(c1_x, x);
float32x4_t delta_y = vsubq_f32(c1_y, y);

vsubq_f32

将delta_x和delta_y分别与自身相乘，得到平方值：

float32x4_t delta_x_squared = vmulq_f32(delta_x, delta_x);
float32x4_t delta_y_squared = vmulq_f32(delta_y, delta_y);

将每个通道的delta_x_squared和delta_y_squared求和，得到距离的平方：
```
float32x4_t sum_deltas_squared = vaddq_f32(delta_x_squared, delta_y_squared);
```

类似可以计算半径的平方：

float32x4_t r = vld1q_f32(input.radii + offset);
float32x4_t radius_sum = vaddq_f32(c1_r, r);
float32x4_t radius_sum_squared = vmulq_f32(radius_sum, radius_sum);

将每个圆的radius_sum_squared分别与sum_deltas_squared相比：
```
uint32x4_t mask = vcltq_f32(sum_deltas_squared, radius_sum_squared);
```
vcltq_f32 intrinsics函数将第一个 Neon 寄存器的每个通道与第二个 Neon 寄存器相应通道相比。如果第一个值小于第二个值，则对应存储结果的通道为1，否则为0。

提取每个通道的值，并存储到out数组：

out[offset] = 1 & vgetq_lane_u32(mask, 0);
out[offset + 1] = 1 & vgetq_lane_u32(mask, 1);
out[offset + 2] = 1 & vgetq_lane_u32(mask, 2);
out[offset + 3] = 1 & vgetq_lane_u32(mask, 3);

性能结果

当分析性能时，为了防止编译器进行自动矢量优化，所有函数都用 GCC 的noinline属性修饰，这可以保证性能提升是由 Neon intrinsics 带来的。下表总结了上面三种实现方式的性能差异：

Function	Time-per-invocation	Speedup
does_collide	2.724ns	1x
does_collide_neon	2.717ns	1.003x
does_collide_neon_deinterleaved	0.925s	2.945x

为了计算Time-per-invocation，每个函数都执行了16,384次碰撞测试，并做了100,000次以上实验。上面的代码均使用-O3编译，在 Samsung S20上运行测试。

does_collide_neon加速是名义上的。但是does_collide_neon_deinterleaved却提供了近3倍的性能加速。

编程约定

宏

为了使用 intrinsics，必须支持 Advanced SIMD 架构，并且某些特定指令都可能启用或者不启用。当下面的宏使能时，相应的功能可用：

__ARM_NEON：表示编译器支持 Advanced SIMD，AArch64 始终为1。
__ARM_NEON_FP：表示支持 Neon 浮点运算，AArch64 始终为1。
__ARM_FEATURE_CRYPTO：表示加密指令可用，因此使用这些的加密 Neon intrinsics 也是可用的。
__ARM_FEATURE_FMA：表示融合乘法累加指令可用，因此使用这些的 Neon intrinsics 也是可用的。

上面列举的并不详细，更多宏请参考Arm C Language Extensions文档。

类型

arm_neon.h提供了三大数据类型，按照如下格式：

baseW_t：标量数据类型
baseWxL_t：矢量数据类型
baseWxLxN_t：矢量数组数据类型

其中：

base：基本数据类型
W：基本类型的宽度
L：矢量数据类型的实例数量，例如数组的长度

一般来说， W和L是64或128位的矢量数据类型，因此完全适合放进 Neon 寄存器。N适合那些在多个寄存器上操作的指令。在上面代码中，有三个实例：

uint8_t
uint8x16_t
uint8x16x3_t

函数

根据 Arm C Language Extensions，arm_neon.h中的函数原型遵循通用格式。一般如下：

ret v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type(args)

ret：函数的返回类型
v：vector的缩写，在所有 intrinsics 函数中存在
p：表示成对操作。（[value]意味着value可能存在）
q：表示饱和操作（除了vqtb[l][x]之外，在AArch64 操作中，q表示操作128位索引和结果
r：表示四舍五入运算
name：基本操作的描述性名称。通常这是 Advanced SIMD 指令，但并非必须如此
u：表示有符号到无符号的饱和
n：表示缩窄操作
q：该后缀表示对128位向量进行操作
x：表示 AArch64 中的 Advanced SIMD 标量运算。可以是b、h、s或者d（即8位，16位，32位或者64位）
_high：在 AArch64 中，用于128位操作数的加宽和缩窄操作。对于加宽128位操作数，high是指源操作数的高64位。对于缩窄操作，它指的是目标操作数的高64位
_n：表示参数是是标量
_lane：从向量通道中取标量操作。_laneq表示从128位宽的输入向量通道中获取标量操作数
type：主要操作类型的缩写
args：函数参数

检查你的知识

什么是Neon？

Neon 是 Arm 架构 Advanced SIMD 扩展的实现。所有符合 Armv8‑A 架构的处理器（例如Cortex‑A76或Cortex‑A57）均包括 Neon。从程序员视角看，Neon 提供额外32个128位寄存器，以及在8、16、32或64位寄存器通道上运行的指令。
为了使用 Neon intrinsics 内建函数，必须在C代码中包含哪个文件？

arm_neon.h。#include <arm_neon.h>必须在使用任何 Neon intrinsics 内建函数之前。
函数int8x16_t vmulq_s8(int8x16_t a, int8x16_t b)的功能？

mul暗示 intrinsics 使用 MUL指令。根据参数和返回值的类型（16 个字节的有符号整数），我们可能会猜测此 intrinsics 的汇编实现为MUL Vd.16B, Vn.16B, Vm.16B。所以这个函数将a和b的相应元素相乘并返回结果，检查上面的定义表明这确实是正确的。
本文实验中定义的deinterleave函数只能对16个8位无符号整数块进行操作。如果有一个uint8_t数组，长度不
是16的倍数，如何进行更改？只更改数组但不更改函数，或者更只更改函数但不更改数组？

如果是更改数组但不更改函数，请用零填充数组。这是最简单的选择，但必须考虑对其他函数的影响。如果是更改函数但不更改数组，请对每个16字节数组使用 Neon deinterleave，然后对余数使用C deinterleave 。
数据类型float64_t、poly64x2_t和int8x8x3_t代表什么？

float64_t是标量类型，即64位浮点类型，poly64x2_t是两个64位多项式标量的向量类型，int8x8x3_t是由8个8位有符号整数组成的数组类型，数组长度为3。