向量化运算和 EIGEN_MAKE_ALIGNED_OPERATOR_NEW

泠山

已于 2023-08-24 15:40:10 修改

阅读量1.3k

点赞数 6

分类专栏： # 向量化运算文章标签： eigen SLAM c++

于 2022-10-14 16:25:22 首次发布

原文链接：https://zhuanlan.zhihu.com/p/93824687

版权

向量化运算专栏收录该内容

5 篇文章 2 订阅

订阅专栏

EIGEN_MAKE_ALIGNED_OPERATOR_NEW

1 缘起
2 什么是向量化运算？
3 解析
4 再谈Eigen
- 4.1 文档内容
- - 为什么这是需要的？
  - 为什么 C++17 就不需要这样写了？
5 总结

Reference:

相关文章：

相关概念：

AVX(Advanced Vector Extensions)：高级向量扩展指令集；
SSE(Streaming SIMD Extensions)：因特网数据流单指令序列扩展。
SIMD(Single Instruction Multiple Data)：一条指令操作多个数据。是CPU基本指令集的扩展。主要用于提供小碎数据的并行操作(fine grain parallelism)。比如说图像处理，图像的数据常用的数据类型是RGB565、RGBA8888、YUV422等格式，这些格式的数据特点是一个像素点的一个分量总是用小于等于８bit的数据表示的。如果使用传统的处理器做计算，虽然处理器的寄存器是32位或是64位的，处理这些数据确只能用于他们的低8位，似乎有点浪费。如果把64位寄存器拆成8个8位寄存器就能同时完成8个操作，计算效率提升了8倍。SIMD指令的初衷就是这样的，只不过后来慢慢cover的功能越来越多。

本文用到的所有示例代码已上传GitHub：jingedawang/AlignmentExample。

1 缘起

Eigen 是一个非常常用的矩阵运算库，至少对于 SLAM 的研究者来说不可或缺。然而，向来乖巧的 Eigen 近来却频频闹脾气，把我的程序折腾得死去活来，我却是丈二和尚摸不着头脑。

简单说说我经历的灵异事件。我的程序原本在NVIDIA TX2上跑的好好的，直到有一天，我打算把它放到服务器上，看看传说中的RTX 2080GPU能不能加速一把。结果悲剧发生了，编译正常，但是一运行就立即double free。我很是吃惊，怎么能一行代码都没执行就崩了呢。但崩了就是崩了，一定是哪里有bug，我用valgrind检查内存问题，发现种种线索都指向g2o。g2o是一个SLAM后端优化库，里面封装了大量SLAM相关的优化算法，内部使用了Eigen进行矩阵运算。阴差阳错之间，我发现关闭-march=native这个编译选项后就能正常运行(set(CMAKE_CXX_FLAGS “${CMAKE_CXX_FLAGS} -O3 -march=native”))，而这个编译选项其实是告诉编译器当前的处理器支持哪些SIMD指令集(是set(CMAKE_CXX_FLAGS “-march=native”)这个告诉编译器当前处理器支持哪些SIMD指令集)，Eigen中又恰好使用了SSE、AVX等指令集进行向量化加速。此时，机智的我发现Eigen文档中有一章叫做Alignment issues，里面提到了某些情况下Eigen对象可能没有内存对齐，从而导致程序崩溃。现在，证据到齐，基本可以确定我遇到的真实问题了：编译安装g2o时，默认没有使用-march=native，因此里面的Eigen代码没有使用向量化加速，所以它们并没有内存对齐(这里的理解应该有点问题，开启了 -O3 时就是使用编译器向量化优化了的。但是因为没有开启 -march=native，所以编译器默认做的是 $16$ 字节的对齐优化；而开启 -march=native 后，知道电脑上有 AVX，所以使用的是 $32$ 字节对齐的优化)。而在我的程序中，启用了向量化加速，所有的Eigen对象都是内存对齐的。两个程序链接起来之后，g2o中未对齐的Eigen对象一旦传递到我的代码中，向量化运算的指令就会触发异常。解决方案很简单，要么都用-march=native，要么都不用。(-march=native：让编译器自动判断当前硬件支持的指令)

这件事就这么过去了，但我不能轻易放过它，毕竟花费了那么多时间找bug。后来我又做了一些深入的探究，这篇文章就来谈谈向量化和内存对齐里面的门道。

2 什么是向量化运算？

计算机架构有多种分类方法，其中最著名的是1966年由Flynn提出的分类法，称为Flynn分类法。Flynn分类法根据指令和数据进入CPU的方式，将计算机架构分为四种不同的类型：

单指令流单数据流(SISD, Single Instruction stream Single Data stream)
单指令流多数据流(SIMD, Single Instruction stream Multiple Data stream)
多指令流单数据流(MISD, Multiple Instruction stream Single Data stream)
多指令流多数据流(MIMD, Multiple Instruction stream Multiple Data stream)

向量化运算就是用SSE、AVX等SIMD(Single Instruction Multiple Data)指令集，实现一条指令对多个操作数的运算，从而提高代码的吞吐量，实现加速效果。(简单来说，向量化计算就是将一个loop(处理一个array的时候每次处理1个数据共处理N次)，转换为vectorization(处理一个array的时候每次同时处理n[一般为2,4,8…]个数据共处理N/n次))

在这里插入图片描述

SSE是一个系列，包括从最初的SSE到最新的SSE4.2，支持同时操作 $16$ 字节的数据，即 $4$ 个float或者 $2$ 个double。(为什么没有SSE5而改用了后面的AVX?其实SSE5是存在的，只不过存在于AMD处理器内。前几代SSE处理器都是Intel先出，随后AMD跟随，然而到SSE4之后，突然AMD先出了SSE5，抢了Intel的先。Intel不能忍，于是SSE5就没了。历史可参考这篇文章：SIMD指令集)AVX也是一个系列，它是 SSE 的升级版，支持同时操作 $32$ 字节的数据，即 $8$ 个 float或者 $4$ 个 double( $4\times64bits=256$ )。SSE 和 AVX 都是在 Intel 处理器上的指令集；在ARM上，这个称之为NEON。
在这里插入图片描述

~~但向量化运算是有前提的，那就是内存对齐~~(C++17 之前是需要的，但是 C++17 新增了一个新特性，即过度对齐的动态内存分配 Dynamic memory allocation for over-aligned data？内存对齐的主要目的还是使CPU能够对变量进行快速的访问)。SSE的操作数，必须 $16$ bytes对齐，而 AVX 的操作数，必须 $32$ bytes对齐。也就是说，如果我们有 $4$ 个float数，必须把它们放在连续的且首地址为 $16$ 的倍数的内存空间中，才能调用SSE的指令进行运算。

2.1 数据类型

SSE有3个数据类型(128:128-bit integer registers)：

__m128: float
__m128d: double
__m128i：integer

AVX有3个数据类型(256:256-bit integer registers)：

__m256: float
__m256d: double
__m256i：integer

在这里插入图片描述

2.2 时间测试

现使用AVX展现一下SIMD向量化计算的魔力：

#include <immintrin.h>
#include <iostream>
#include <chrono>

#define NUM (8)
#define LOOP_NUM (200000000l)

void mulAndAdd() {
    float input[8] = {1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8};
    float weight[8] = {2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9};
    float bias[8] = {11.1,12.2,13.3,14.4,15.5,16.6,17.7,18.8};
    float output[8] = {0};
    auto start = std::chrono::system_clock::now();
    for(int j=0; j<LOOP_NUM; j++){
        for(int i=0; i<NUM; i++){
            output[i] += weight[i] * input[i] + bias[i];
        }
    }
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "Loop result: ";
    for(int i=0; i<NUM; i++){
        std::cout<<output[i]<<" ";
    }
    std::cout<< "\nelapsed time: " << elapsed_seconds.count() << "s\n";
}

void mulAndAddVec() {
    __attribute__ ((aligned (32))) float input[8] = {1.1,2.2,3.3,4.4,5.5,6.6,7.7,8.8};
    __attribute__ ((aligned (32))) float weight[8] = {2.2,3.3,4.4,5.5,6.6,7.7,8.8,9.9};
    __attribute__ ((aligned (32))) float bias[8] = {11.1,12.2,13.3,14.4,15.5,16.6,17.7,18.8};
    __attribute__ ((aligned (32))) float output[8];

    auto start = std::chrono::system_clock::now();
    __m256 i = _mm256_load_ps(input);
    __m256 w = _mm256_load_ps(weight);
    __m256 b = _mm256_load_ps(bias);
    __m256 o;
    for(int j=0; j<LOOP_NUM; j++){
        o += _mm256_fmadd_ps(i, w, b);
    }
    _mm256_store_ps(output, o);
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end-start;
    std::cout << "Vectorization result: ";
    for(int i=0; i<NUM; i++){
        std::cout<<output[i]<<" ";
    }
    std::cout<< "\nelapsed time: " << elapsed_seconds.count() << "s\n";
}

输出结果为：

Loop result: 2.68435e+08 5.36871e+08 5.36871e+08 1.07374e+09 1.07374e+09 2.14748e+09 2.14748e+09 2.14748e+09 
elapsed time: 2.98301s
Vectorization result: 2.68435e+08 5.36871e+08 1.0812e+24 1.07374e+09 1.07374e+09 2.14748e+09 2.14748e+09 2.14748e+09 
elapsed time: 0.474711s

可以看到使用向量化加速后，程序运行速度快了不少。（如果使用 Release 版本，程序默认开启 -O3 时，编译器做了向量化运算，这时候第一个 elapsed time 会和下面差不多，查看汇编可以看到做了 AVX 的优化）

2.3 EXTRA

在 2.2 节中，CMake 使用的指令为：

set(CMAKE_CXX_FLAGS “-march=native”)

那么如果使用 -O3 协助优化呢？

set(CMAKE_CXX_FLAGS “${CMAKE_CXX_FLAGS} -O3 -march=native”)

重新运行一下上述程序，输出结果为：

Loop result: 2.68435e+08 5.36871e+08 5.36871e+08 1.07374e+09 1.07374e+09 2.14748e+09 2.14748e+09 2.14748e+09 
elapsed time: 0.189158s
Vectorization result: 2.68435e+08 5.36871e+08 5.36871e+08 1.07374e+09 1.07374e+09 2.14748e+09 2.14748e+09 2.14748e+09 
elapsed time: 0.187188s

-O3 给自动做了优化，关于 -O3内容，可见：-O1,-O2,-O3编译优化知多少（需要注意一点，cmake 的参数设置为 Release，即 set(CMAKE_BUILD_TYPE Release) 的话就默认开 -O3 的优化了）

2.3 部分常用指令

基于 SIMD 指令集的 SIMD Intrinsics(SIMD 内建函数)是各SIMD指令集提供的一套 C++ 语言接口，详细接口可查询Intrinsics。

_mm256_load_ps: Loads packed single-precision floating point values (float32 values) from the 256-bit aligned memory location pointed to by *a, into a destination float32 vector, which is retured by the intrinsic.

extern __m256 _mm256_load_ps(float const *a);

_mm256_load_si256: Loads integer values from the 256-bit aligned memory location pointed to by *a, into a destination integer vector, which is returned by the intrinsic.

extern __m256i _mm256_load_si256(__m256i const *a);

_mm256_load_pd: Loads packed double-precision floating point values (float64 values) from the 256-bit aligned memory location pointed to by *a, into a destination float64 vector, which is retured by the intrinsic.

extern __m256d _mm256_load_pd(double const *a);

_mm256_loadu_ps: Loads packed single-precision floating point values (float32 values) from the 256-bit unaligned memory location pointed to by *a, into a destination float32 vector, which is retured by the intrinsic.

extern __m256 _mm256_loadu_ps(float const *a);

_mm256_store_ps: Performs a store operation by moving packed single-precision floating point values (float32 values) from a float32 vector, b, to a 256-bit aligned memory location, pointed to by *a.

extern void _mm256_store_ps(float *a, __m256 b);

__attribute__((align_value(alignment))): (Linux* OS:)This keyword can be added to a pointer typedef declaration to specify the alignment value of pointers declared for that pointer type.
It tells the compiler that the data referenced by the designated pointer is aligned by the indicated value, and the compiler can generate code based on that assumption. If this attribute is used incorrectly, and the data is not aligned to the designated value, the behavior is undefined.
(Windows* OS:)__declspec(align_value(alignment))
_mm256_fmadd_ps: Performs a set of SIMD multiply-add computation on packed single-precision floating-point values using three source vectors/operands, a, b, and c. Corresponding values in two operands, a and b, are multiplied and the infinite precision intermediate results are added to corresponding values in the third operand, after which the final results are rounded to the nearest float32 values.

extern __m256 _mm256_fmadd_ps(__m256 a, __m256 b, __m256 c);

Intrinsics 的总结可以看这篇文章：Crunching Numbers with AVX and AVX2

3 解析

3.1 A Simple Example

为了给没接触过向量化编程的同学一些直观的感受，这里写了一个简单的示例程序：

// unaligned_vectorization.cpp
// gcc编译支持AVX2指令的编程。程序中需要使用头文件<immintrin.h>和<avx2intrin.h>，
// 这样通过调用其中定义的一些函数，达到使用AVX2指令的目的，
// 即用C/C++调用SIMD指令（单指令多数据）。
#include <immintrin.h> 
#include <iostream>
 
// 同时计算4对double的和
int main() {
 
  double input1[4] = {1, 1, 1, 1};
  double input2[4] = {1, 2, 3, 4};
  double result[4];
 
  std::cout << "address of input1: " << input1 << std::endl;
  std::cout << "address of input2: " << input2 << std::endl;
 
  __m256d a = _mm256_load_pd(input1); // 加载操作数
  __m256d b = _mm256_load_pd(input2);
  __m256d c = _mm256_add_pd(a, b); // 进行向量化运算
 
  _mm256_store_pd(result, c); // 读取运算结果到result中
 
  std::cout << result[0] << " " << result[1] << " " 
    << result[2] << " " << result[3] << std::endl;
 
  return 0;
}
// unaligned_vectorization.cpp

这段代码使用AVX中的向量化加法指令，同时计算 $4$ 对double的和。这 $4$ 对数保存在input1和input2中。 _mm256_load_pd指令用来加载操作数，_mm256_add_pd指令进行向量化运算，最后， _mm256_store_pd指令读取运算结果到result中。可惜的是，程序运行到第一个_mm256_load_pd处就崩溃了。崩溃的原因正是因为输入变量没有内存对齐。我特意打印出了两个输入变量的地址，结果如下(开了 -O3 或 Release 后这里就不会崩溃了，编译器默认做了优化)：

 address of input1: 0x7ffeef431ef0
 address of input2: 0x7ffeef431f10

上一节提到了 AVX 要求 $32$ 字节对齐，我们可以把这两个输入变量的地址除以 $32$ ，看是否能够整除。结果发现 0x7ffeef431ef0 和 0x7ffeef431f10 都不能整除。当然，其实直接看倒数第二位是否是偶数即可，是偶数就可以被 $32$ 整除，是奇数则不能被 $32$ 整除。

如何让输入变量内存对齐呢？我们知道，对于局部变量来说，它们的内存地址是在编译期确定的，也就是由编译器决定。所以我们只需要告诉编译器，给input1和input2申请空间时请让首地址32字节对齐，这需要通过预编译指令来实现。不同编译器的预编译指令是不一样的，比如gcc的语法为__attribute__((aligned(32)))，MSVC的语法为 __declspec(align(32)) 。以gcc语法为例，做少量修改，就可以得到正确的代码：

#include <immintrin.h>
#include <iostream>
 
int main() {
 
  __attribute__ ((aligned (32))) double input1[4] = {1, 1, 1, 1};
  __attribute__ ((aligned (32))) double input2[4] = {1, 2, 3, 4};
  __attribute__ ((aligned (32))) double result[4];
 
  std::cout << "address of input1: " << input1 << std::endl;
  std::cout << "address of input2: " << input2 << std::endl;
 
  __m256d a = _mm256_load_pd(input1);
  __m256d b = _mm256_load_pd(input2);
  __m256d c = _mm256_add_pd(a, b);
 
  _mm256_store_pd(result, c);
 
  std::cout << result[0] << " " << result[1] << " " 
    << result[2] << " " << result[3] << std::endl;
 
  return 0;
}
// aligned_vectorization.cpp

输出结果为：

address of input1: 0x7ffc5ca2e640
address of input2: 0x7ffc5ca2e660
2 3 4 5

可以看到，这次的两个地址都是 $32$ 的倍数，而且最终的运算结果也完全正确。

虽然上面的代码正确实现了向量化运算，但实现方式未免过于粗糙。每个变量声明前面都加上一长串预编译指令看起来就不舒服。我们尝试重构一下这段代码。

3.2 重构

首先，最容易想到的是，把内存对齐的 double 数组声明成一种自定义数据类型，如下所示：

using aligned_double4 = __attribute__ ((aligned (32))) double[4]; //为一个类型起一个简洁的名字

aligned_double4 input1 = {1, 1, 1, 1};
aligned_double4 input2 = {1, 2, 3, 4};
aligned_double4 result;

这样看起来清爽多了。更进一步，如果 $4$ 个 double 是一种经常使用的数据类型的话，我们就可以把它封装为一个Vector4d类，这样，用户就完全看不到内存对齐的具体实现了，像下面这样：

#include <immintrin.h>
#include <iostream>
 
class Vector4d {
  using aligned_double4 = __attribute__ ((aligned (32))) double[4];
public:
  Vector4d() {
  }
 
  Vector4d(double d1, double d2, double d3, double d4) {
    data[0] = d1;
    data[1] = d2;
    data[2] = d3;
    data[3] = d4;
  }
 
  aligned_double4 data;
};
 
Vector4d operator+ (const Vector4d& v1, const Vector4d& v2) {
  __m256d data1 = _mm256_load_pd(v1.data);
  __m256d data2 = _mm256_load_pd(v2.data);
  __m256d data3 = _mm256_add_pd(data1, data2);
  Vector4d result;
  _mm256_store_pd(result.data, data3);
  return result;
}
 
std::ostream& operator<< (std::ostream& o, const Vector4d& v) {
  o << "(" << v.data[0] << ", " << v.data[1] << ", " << v.data[2] << ", " << v.data[3] << ")";
  return o;
}
 
int main() {
  Vector4d input1 = {1, 1, 1, 1}; // 栈空间上
  Vector4d input2 = {1, 2, 3, 4};
  Vector4d result = input1 + input2;
 
  std::cout << result << std::endl;
 
  return 0;
}
// encapsulated_vectorization.cpp

这段代码实现了Vector4d类，并把向量化运算放在了operator+中，主函数变得非常简单。

但不要高兴得太早，这个Vector4d其实有着严重的漏洞，如果我们动态创建对象，程序仍然会崩溃，比如这段代码：

int main() {
  Vector4d* input1 = new Vector4d{1, 1, 1, 1}; // 堆空间上
  Vector4d* input2 = new Vector4d{1, 2, 3, 4};
 
  std::cout << "address of input1: " << input1->data << std::endl;
  std::cout << "address of input2: " << input2->data << std::endl;
 
  Vector4d result = *input1 + *input2;
 
  std::cout << result << std::endl;
 
  delete input1;
  delete input2;
  return 0;
}
// unaligned_heap_vectorization.cpp

崩溃前的输出为：

address of input1: 0x1ceae70
address of input2: 0x1ceaea0

很诡异吧，似乎刚才我们设置的内存对齐都失效了，这两个输入变量的内存首地址又不是32的倍数了。

3.3 Heap vs Stack

问题的根源在于不同的对象的创建方式。直接声明的对象是存储在栈上的，其内存地址由编译器在编译时确定，因此预编译指令会生效。但用new动态创建的对象则存储在堆中，其地址在运行时确定。C++的运行时库并不会关心预编译指令声明的对齐方式，我们需要更强有力的手段来确保内存对齐。

C++提供的new关键字是个好东西，它避免了C语言中丑陋的malloc操作，但同时也隐藏了实现细节。如果我们翻看C++官方文档，可以发现new Vector4d实际上做了两件事情，第一步申请sizeof(Vector4d)大小的空间(空间已经提前申请好了，所以不起作用)，第二步调用Vector4d的构造函数。要想实现内存对齐，我们必须修改第一步申请空间的方式才行。好在第一步其实调用了operator new这个函数，我们只需要重写这个函数，就可以实现自定义的内存申请，下面是添加了该函数后的 Vector4d 类。

class Vector4d {
  using aligned_double4 = __attribute__ ((aligned (32))) double[4];
public:
  Vector4d() {
  }
 
  Vector4d(double d1, double d2, double d3, double d4) {
    data[0] = d1;
    data[1] = d2;
    data[2] = d3;
    data[3] = d4;
  }
 
  void* operator new (std::size_t count) { // Eigen中也是这么写的Eigen/src/Core/util/Memory.h中的函数 handmade_aligned_malloc
    void* original = ::operator new(count + 32);
    void* aligned = reinterpret_cast<void*>((reinterpret_cast<size_t>(original) & ~size_t(32 - 1)) + 32);
    *(reinterpret_cast<void**>(aligned) - 1) = original;
    return aligned;
  }
 
  void operator delete (void* ptr) {
    ::operator delete(*(reinterpret_cast<void**>(ptr) - 1));
  }
 
  aligned_double4 data;
};
// aligned_heap_vectorization.cpp

operator new 的实现还是有些技巧的，我们来详细解释一下。首先，根据 C++ 标准的规定，operator new的参数count是要开辟的空间的大小。为了保证一定可以得到 count 大小且 $32$ 字节对齐的内存空间，我们把实际申请的内存空间扩大到 count + 32。可以想象，在这count + 32字节空间中，一定存在首地址为32的倍数的连续count字节的空间。
所以，第二行代码，我们通过对申请到的原始地址original做一些位运算，先找到比original小且是 $32$ 的倍数的地址，然后加上 $32$ ，就得到了我们想要的对齐后的地址，记作aligned。
接下来，第三行代码很关键，它把原始地址的值保存在了aligned地址的前一个位置中，之所以要这样做，是因为我们还需要自定义释放内存的函数operator delete。毕竟aligned地址并非真实申请到的地址，所以在该地址上调用默认的delete 是会出错的。可以看到，我们在代码中也定义了一个operator delete，传入的参数正是前面operator new返回的对齐的地址。这时候，保存在aligned前一个位置的原始地址就非常有用了，我们只需要把它取出来，然后用标准的 delete 释放该内存即可。

为了方便大家理解这段代码，有几个细节需要特地强调一下。::operator new 中的::代表全局命名空间，因此可以调用到标准的 operator new。第三行需要先把 aligned 强制转换为 void** 类型，这是因为我们希望在aligned的前一个位置保存一个void*类型的地址，既然保存的元素是地址，那么该位置对应的地址就是地址的地址，也就是 void**。

这是一个不大不小的trick，C++的很多内存管理方面的处理经常会有这样的操作。但不知道细心的你是否发现了这里的一个问题：reinterpret_cast<void**>(aligned) - 1这个地址是否一定在我们申请的空间中呢？换句话说，它是否一定大于original呢？之所以存在这个质疑，是因为这里的-1其实是对指针减一。要知道，在64位计算机中，指针的长度是8字节，所以这里得到的地址其实是reinterpret_cast<size_t>(aligned) - 8。看出这里的区别了吧，对指针减1相当于对地址的值减 $8$ 。所以仔细想想，如果original到aligned的距离小于8字节的话，这段代码就会对申请的空间以外的内存赋值，可怕吧。

其实没什么可怕的，为什么我敢这样讲，因为Eigen就是这样实现的。这样做依赖于现代编译器的一个共识：所有的内存分配都默认 $16$ 字节对齐。这个事实可以解释很多问题，首先，永远不用担心original到aligned的距离会不会小于 $8$ 了，它会稳定在 $16$ ，这足够保存一个指针。其次，为什么我们用AVX指令集举例，而不是SSE？因为SSE要求 $16$ 字节对齐，而现代编译器已经默认 $16$ 字节对齐了(见C++ 字节对齐)，那这篇文章就没办法展开了。最后，为什么我的代码在NVIDIA TX2上运行正常而在服务器上挂掉了？因为TX2中是ARM处理器，里面的向量化指令集NEON也只要求 $16$ 字节对齐(Eigen 使用的AVX要32字节)。

3.4 还有坑？

如果你以为到这里就圆满结束了，那可是大错特错。还有个天坑没展示给大家，下面的代码中，我的自定义类 Point 包含了一个 Vector4d 的成员(包含了类的类)，这时候，噩梦又出现了。(这个相当重要！)

class Point {
public:
  Point(Vector4d position) : position(position) {
  }
 
  Vector4d position;
};
 
int main() {
  Vector4d* input1 = new Vector4d{1, 1, 1, 1};
  Vector4d* input2 = new Vector4d{1, 2, 3, 4};
 
  Point* point1 = new Point{*input1};
  Point* point2 = new Point{*input2};
 
  std::cout << "address of point1: " << point1->position.data << std::endl;
  std::cout << "address of point2: " << point2->position.data << std::endl;
 
  Vector4d result = point1->position + point2->position;
 
  std::cout << result << std::endl;
 
  delete input1;
  delete input2;
  delete point1;
  delete point2;
  return 0;
}
// malicious_aligned_heap_vectorization.cpp

输出的地址又不再是 $32$ 的倍数了，程序戛然而止。我们分析一下为什么会这样。在主函数中，new Point 动态创建了一个 Point 对象。前面提到过，这个过程分为两步，第一步申请 Point 对象所需的空间，即 sizeof(Point) 大小的空间，第二步调用 Point 的构造函数。我们寄希望于第一步申请到的空间恰好让内部的 position 对象对齐，这是不现实的。因为整个过程中并不会调用 Vector4d 的 operator new，调用的只有 Point 的 operator new，而这个函数我们并没有重写。

可惜的是，此处并没有足够优雅的解决方案，唯一的方案是在 Point 类中也添加自定义 operator new，这就需要用户的协助，类库的作者已经无能为力了。不过类库的作者能做的，是尽量让用户更方便地添加 operator new，比如封装为一个宏定义，用户只需要在 Point 类中添加一句宏即可。最后，完整的代码如下。

#include <immintrin.h>
#include <iostream>
 
#define ALIGNED_OPERATOR_NEW \
  void* operator new (std::size_t count) { \
    void* original = ::operator new(count + 32); \
    void* aligned = reinterpret_cast<void*>((reinterpret_cast<size_t>(original) & ~size_t(32 - 1)) + 32); \
    *(reinterpret_cast<void**>(aligned) - 1) = original; \
    return aligned;\
  } \
  void operator delete (void* ptr) { \
    ::operator delete(*(reinterpret_cast<void**>(ptr) - 1)); \
  }
 
class Vector4d {
  using aligned_double4 = __attribute__ ((aligned (32))) double[4];
public:
  Vector4d() {
  }
 
  Vector4d(double d1, double d2, double d3, double d4) {
    data[0] = d1;
    data[1] = d2;
    data[2] = d3;
    data[3] = d4;
  }
 
  ALIGNED_OPERATOR_NEW // 注意这句话
 
  aligned_double4 data;
};
 
Vector4d operator+ (const Vector4d& v1, const Vector4d& v2) {
  __m256d data1 = _mm256_load_pd(v1.data);
  __m256d data2 = _mm256_load_pd(v2.data);
  __m256d data3 = _mm256_add_pd(data1, data2);
  Vector4d result;
  _mm256_store_pd(result.data, data3);
  return result;
}
 
std::ostream& operator<< (std::ostream& o, const Vector4d& v) {
  o << "(" << v.data[0] << ", " << v.data[1] << ", " << v.data[2] << ", " << v.data[3] << ")";
  return o;
}
 
class Point {
public:
  Point(Vector4d position) : position(position) {
  }
 
  ALIGNED_OPERATOR_NEW // 注意这句话
 
  Vector4d position;
};
 
int main() {
  Vector4d* input1 = new Vector4d{1, 1, 1, 1};
  Vector4d* input2 = new Vector4d{1, 2, 3, 4};
 
  Point* point1 = new Point{*input1};
  Point* point2 = new Point{*input2};
 
  std::cout << "address of point1: " << point1->position.data << std::endl;
  std::cout << "address of point2: " << point2->position.data << std::endl;
 
  Vector4d result = point1->position + point2->position;
 
  std::cout << result << std::endl;
 
  delete input1;
  delete input2;
  delete point1;
  delete point2;
  return 0;
}
// antimalicious_aligned_heap_vectorization.cpp

这段代码中，宏定义 ALIGNED_OPERATOR_NEW 包含了 operator new 和 operator delete，它们对所有需要内存对齐的类都适用。因此，无论是需要内存对齐的类，还是包含了这些类的类，都需要添加这个宏。

4 再谈Eigen

在Eigen官方文档中有这么一页内容讲的很详细 Structures Having Eigen Members：
在这里插入图片描述

有没有觉得似曾相识？Eigen 对该问题的解决方案与我们不谋而合。这当然不是巧合，事实上，本文的灵感正是来源于 Eigen。但 Eigen 只告诉了我们应该怎么做，没有详细讲解其原理。本文则从问题的提出，到具体的解决方案，一一剖析，希望可以给大家一些更深的理解。

4.1 文档内容

如果定义了一个结构，其中成员是固定大小的可向量化的特征类型，必须确保调用它的 operator new 分配了正确对齐的缓冲区(同上面示例)。

如果在 C++17 模式下编译，并且使用的是足够新的编译器(例如，GCC>=7)，那么一切都由编译器负责，EIGEN_MAKE_ALIGNED_OPERATOR_NEW 就不需要添加了。否则，必须重载它的 operator new 以便它生成正确对齐的指针(例如，Vector4d 和 AVX 的 $32$ 字节对齐)。幸运的是，Eigen 提供了宏 EIGEN_MAKE_ALIGNED_OPERATOR_NEW 来做这些。

为什么这是需要的？

比如代码是这个样子的：

class Foo
{
  ...
  Eigen::Vector4d v;
  ...
};
...
Foo *foo = new Foo;

一个 Eigen::Vector4d 由 $4$ 个 double 组成，也就是 $256$ 位。这正是 AVX 寄存器的大小，这使得它可以使用 AVX 对这个向量进行各种操作。但是 AVX 指令(至少是Eigen使用的那些，它们是快速的)需要 $256$ 位对齐。否则你会得到一个 segmentation fault。

因此，Eigen 自己会通过两件事确保 Eigen::Vector4d $256$ 位对齐：

Eigen 需要对 Eigen::Vector4d 的数组(包含 $4$ 个 double )进行 $256$ 位对齐。这可以通过 alignas 关键字完成(同上面示例中的 __attribute__ ((aligned (32)))，但是 alignas 更通用)；
Eigen重载 Eigen::Vector4d 的 operator new，所以它将始终返回 $256$ 位对齐的指针。(C++17 就不再需要这样写了)

因此，通常情况下，你不需要担心任何事情，Eigen 会为你处理新的操作符的对齐… 除了一种情况，当你有一个像上面那样的类 Foo，并且你像上面那样动态地分配一个新的 Foo，那么，由于 Foo 没有对齐操作符 new，返回的指针 foo 不一定是 $256$ 位对齐的。成员 v 的 alignment 属性是相对于类 Foo 的开始的。如果 foo 指针没有对齐，那么成员 v 的 alignment 属性是相对于类 Foo 的开始的。如果 foo 指针没有对齐，那么 foo->v 也不会被对齐。解决方案是让类 Foo 拥有一个对齐的运算符 new，也就是前面提到的 EIGEN_MAKE_ALIGNED_OPERATOR_NEW。这个解释也适用于SSE/NEON/MSA/Altivec/VSX目标，它们需要 $16$ 字节的对齐，以及AVX512，它需要 $64$ 字节的对齐，用于 $64$ 字节的固定大小对象。(这与 3.4 节表述一致)

为什么 C++17 就不需要这样写了？

前面的写法并不能算作 bug。这更像是 C++ 语言规范的内在问题，在 C++17 中已经通过 Dynamic memory allocation for over-aligned data 的特性解决了这个问题。

5 总结

最后做一个简短的总结。对于基本数据类型和自定义类型，我们需要用预编译指令来保证栈内存的对齐，用重写operator new的方式保证堆内存对齐。对于嵌套的自定义类型，申请栈内存时会自动保证其内部数据类型的对齐，而申请堆内存时仍然需要重写 operator new。

有一种特殊情况本文并未提到，如果使用 std::vector，需要传入自定义内存申请器，即std::vector<Vector4d, AlignedAllocator>，其中 AlignedAllocator 是我们自定义的内存申请器。这是因为，std::vector 中使用了动态申请的空间保存数据，因此默认的 operator new 是无法让其内存对齐的。在无法重写std::vector类的operator new的情况下，标准库提供了自定义内存申请器的机制，让用户可以以自己的方式申请内存。具体做法本文就不再展开了，理解了前面的内容，这个问题应该很容易解决。