Pseudocode for the Cholesky algorithm (rows is the matrix dimension):

for (column = 0; column < rows; column++) {
  for (row = column; row < rows; row++) {
    float sum = 0;
    for (k = 0; k < column; k++)
      sum += L[row][k] * L[column][k];
    if (row == column)
      L[row][column] = sqrt(A[row][row] - sum);
    else
      L[row][column] = (A[row][column] - sum) / L[column][column];
  }
}
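As a reference point, here is a minimal host-side C++ version of this pseudocode (a sketch of my own, handy for checking the FPGA results; it assumes A is a symmetric positive-definite rows x rows matrix and fills in the lower triangle of L):

#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Straightforward reference factorization: A = L * L^T
void CholeskyReference(const Matrix &A, Matrix &L, int rows) {
  for (int column = 0; column < rows; column++) {
    for (int row = column; row < rows; row++) {
      float sum = 0;
      for (int k = 0; k < column; k++)
        sum += L[row][k] * L[column][k];
      if (row == column)
        L[row][column] = std::sqrt(A[row][row] - sum);
      else
        L[row][column] = (A[row][column] - sum) / L[column][column];
    }
  }
}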
These experiment notes are based on the FPGA-optimized version of the streaming Cholesky algorithm written in Intel oneAPI C++ SYCL.
The key points of optimizing the Cholesky algorithm on an FPGA are:
1. Data streams from global memory into FPGA on-chip memory through a pipe, avoiding repeated global-memory reads; the kernel is designed to be stall-free, and results leave the kernel through another pipe.
2. The ivdep loop-dependence hint and the pipelined Triangular Loop optimization insert a number of dummy computations to sidestep read-after-write (RAW) hazards, preventing the compiler from raising the loop initiation interval (II).
3. Memory banking and private copies let multiple loop iterations execute concurrently without bank conflicts.
Analysis of the streaming code:
The template takes rows (the number of rows), pipe_size (the number of matrix elements transferred per pipe read), and AIn and LOut (the input-data pipe and the output-result pipe). raw_latency is the number of delayed (dummy) computations needed to resolve the read-after-write hazard; it ties into the compiler loop-dependence optimization and the Triangular Loop optimization mentioned above.
template <typename T, // The datatype for the computation
bool is_complex, // True if T is ac_complex<X>
int rows, // Number of rows==columns in the A matrices
int raw_latency, // Read after write (RAW) latency (in iterations) of
// the triangular loop of this function.
// This value depends on the FPGA target, the
// datatype, the target frequency, etc.
// This value will have to be tuned for optimal
// performance. Refer to the Triangular Loop
// design pattern tutorial.
// In general, find a high value for which the
// compiler is able to achieve an II of 1 and
// go down from there.
int pipe_size, // Number of elements read/write per pipe operation
// to read the input matrix
typename AIn, // A matrix input pipe, receive pipe_size
// elements from the pipe with each read
typename LOut // L matrix output pipe, send one element to the
// pipe with each write.
// Only lower-left elements of L are
// sent in row order, starting with row 0.
>
The code below checks the functional preconditions and computes the total number of L-matrix elements that must be transferred. Because rows is known at compile time and SYCL kernels cannot size local arrays dynamically, the kernel has to fix all array sizes up front.
// Functional assertions
static_assert(rows >= 4,
"Only matrices of size 4x4 and over are supported");
static_assert(pipe_size >= 1,
"The pipe must be able to contain at least one element");
// Set the computation type to T or ac_complex<T> depending on the value
// of is_complex
using TT = std::conditional_t<is_complex, ac_complex<T>, T>;
constexpr int kColumns = rows;
// Number of lower-left elements in the L output matrix
constexpr int kLMatrixSize = kColumns * (kColumns + 1) / 2;
The next block sets up the memory banking: each bank is kBankwidth bytes wide, the actual size of one pipe read of pipe_size elements, and there are rows / pipe_size banks, so one column's worth of the incoming matrix (one row of a_load[rows][kColumns]) is divided among the banks.
The local arrays l_result_compute[rows][kColumns] and l_result_compute_copy[rows][kColumns] get four private copies each, letting the pipeline access the result arrays from several overlapping iterations in parallel.
constexpr short kBankwidth = pipe_size * sizeof(TT);
constexpr unsigned short kNumBanks = rows / pipe_size;
// When specifying numbanks for a memory, it must be a power of 2.
// Unused banks will be automatically optimized away.
constexpr short kNumBanksNextPow2 =
fpga_tools::Pow2(fpga_tools::CeilLog2(kNumBanks));
[[intel::numbanks(kNumBanksNextPow2)]] // NO-FORMAT: Attribute
[[intel::bankwidth(kBankwidth)]] // NO-FORMAT: Attribute
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
[[intel::max_replicates(1)]] // NO-FORMAT: Attribute
TT a_load[rows][kColumns];
// Two copies of L to be able to load two complete rows per iteration
// Multiple private copies to be able to overlap multiple loop
// iterations
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute[rows][kColumns];
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute_copy[rows][kColumns];
[[intel::private_copies(2)]] // NO-FORMAT: Attribute
TT l_result[kLMatrixSize];
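For illustration, assume TT = float, pipe_size = 4 and rows = 8 (values chosen here, not fixed by the code):

kBankwidth        = 4 * sizeof(float)  = 16 bytes
kNumBanks         = 8 / 4              = 2
kNumBanksNextPow2 = Pow2(CeilLog2(2))  = 2

so each bank holds exactly one pipe read's worth of data, and one column's worth of matrix data (a row of a_load) is spread across the two banks.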
Next, pipe data is copied into the local array. Intel's unrolled-loop template helper is used twice: once over the number of reads per column (kLoopIterPerColumn) and once over the elements per read (pipe_size). The data lands in on-chip memory, which speeds up memory access in the later compute stage.
FPGA registers (fpga_reg) are inserted into the unrolled copy path to avoid large fanout: a large fanout forces the FPGA to lay down longer physical routing, hurting efficiency and fMAX. Explicitly inserting registers keeps the distribution tree shallow and raises the achievable frequency, at the cost of some latency. See the fpga_reg tutorial on reducing fanout.
// Copy a matrix from the pipe to a local memory
// Number of pipe reads of pipe_size required to read a full column
constexpr int kExtraIteration = ((rows % pipe_size) != 0) ? 1 : 0;
constexpr int kLoopIterPerColumn = (rows / pipe_size) + kExtraIteration;
// Number of pipe reads of pipe_size to read all the matrices
constexpr int kLoopIter = kLoopIterPerColumn * kColumns;
// Size in bits of the loop iterator over kLoopIter iterations
constexpr int kLoopIterBitSize =
fpga_tools::BitsForMaxValue<kLoopIter + 1>();
[[intel::initiation_interval(1)]] // NO-FORMAT: Attribute
for (ac_int<kLoopIterBitSize, false> li = 0; li < kLoopIter; li++) {
fpga_tools::NTuple<TT, pipe_size> pipe_read = AIn::read();
int write_idx = li % kLoopIterPerColumn;
fpga_tools::UnrolledLoop<kLoopIterPerColumn>([&](auto k) {
fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
if constexpr (k * pipe_size + t < kColumns) {
if (write_idx == k) {
a_load[li / kLoopIterPerColumn][k * pipe_size + t] =
pipe_read.template get<t>();
}
}
// Delay data signals to create a vine-based data distribution
// to lower signal fanout.
pipe_read.template get<t>() =
sycl::ext::intel::fpga_reg(pipe_read.template get<t>());
});
write_idx = sycl::ext::intel::fpga_reg(write_idx);
});
}
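To make the indexing concrete, assume rows = 8 and pipe_size = 4 (illustrative values): then kExtraIteration = 0, kLoopIterPerColumn = 2 and kLoopIter = 16. On iteration li = 5, write_idx = 5 % 2 = 1 and li / kLoopIterPerColumn = 2, so the pipe's four elements are written to a_load[2][4..7]; the k = 0 instance of the unrolled loop skips its store because write_idx != 0.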
Next comes the main body of the computation, i.e. the concrete implementation of the pseudocode above.
In the pseudocode, column is the outer loop and row the inner loop. Following the Triangular Loop optimization strategy, the two loops are merged into one, and the total trip count (kTotalIterations) is the original trip count plus the number of dummy iterations. The dummy-iteration count is derived from raw_latency, a value found experimentally (tuned so that the compiler just achieves a loop initiation interval II of 1).
// Computation of the number of iterations required for the triangular
// loop. Refer to the triangular_loop tutorial for details on how
// to compute these.
constexpr int kRegularIterations = kColumns * (kColumns + 1) / 2;
constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
constexpr int kExtraIterationsToRemove =
kColumns >= raw_latency
? 0
: (raw_latency - kColumns) * (raw_latency - kColumns + 1) / 2;
constexpr int kTotalIterations =
kRegularIterations + kExtraIterations - kExtraIterationsToRemove;
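A quick check with assumed values rows = 8 and raw_latency = 5:

kRegularIterations       = 8 * 9 / 2 = 36
kExtraIterations         = 4 * 5 / 2 = 10
kExtraIterationsToRemove = 0 (since kColumns >= raw_latency)
kTotalIterations         = 36 + 10 - 0 = 46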
The compute stage mirrors the pseudocode. First sum is computed: the k loop is fully unrolled over kColumns lanes, but the number of useful products actually depends on the current column, so a bool in each unrolled lane decides whether that lane contributes.
When computing the result at position (row, column), only values at indices no greater than row and column are involved.
Each lane's product is stored in to_add and accumulated into sum.
int column = 0;
int row = 0;
TT div_term{0};
[[intel::ivdep(raw_latency)]] // NO-FORMAT: Attribute
for (int iteration = 0; iteration < kTotalIterations; iteration++) {
// Perform the dot product of the elements of the two rows indexed by
// row and column from element 0 to column
TT sum = 0;
fpga_tools::UnrolledLoop<kColumns>([&](auto k) {
TT to_add;
bool should_compute = k < column;
TT mul_lhs = should_compute ? l_result_compute[row][k] : T{0};
TT mul_rhs = should_compute ? l_result_compute_copy[column][k] : T{0};
if constexpr (is_complex) {
to_add = mul_lhs * mul_rhs.conj();
} else {
to_add = mul_lhs * mul_rhs;
}
sum += to_add;
});
Then A[row][row] (on the diagonal) or A[row][column] is loaded, and diff, the value left after subtracting sum, is precomputed.
When row and column are equal, a square root would be enough, but for throughput and a shorter compute latency the SYCL reciprocal square root function rsqrt is used instead, since the downstream multiplications are cheaper than divisions; the diagonal result therefore still needs a final reciprocal, applied later.
When row and column differ, the previously computed reciprocal square root is reused: simply multiply it by diff.
TT a_loaded = (row < rows) ? a_load[row][column] : TT{0};
TT diff = a_loaded - sum;
// Only do useful work for meaningful iterations
TT to_store;
if (row == column) {
// Perform the reciprocal sqrt rather than the sqrt because:
// - it has a shorter latency and will reduce the RAW latency
// of the loop
// - the result of the sqrt is used as a divisor which is also
// a long operation, so replacing x/sqrt by x*rsqrt will save
// latency
// - the diagonal elements will need to be inverted later, but we
// can do that while outside this loop when we transfer the L
// matrix to the pipe
if constexpr (is_complex) {
div_term = {sycl::rsqrt(diff.r()), 0};
} else {
div_term = sycl::rsqrt(diff);
}
to_store = div_term;
} else {
to_store = diff * div_term;
}
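Writing out what this branch actually stores (s denotes the dot-product sum; the notation is mine):

diagonal (row == column):    div_term = rsqrt(A[j][j] - s) = 1 / L[j][j]   (the reciprocal is stored)
off-diagonal (row > column): diff * div_term = (A[i][j] - s) / L[j][j]     (already the correct L[i][j])

So off-diagonal entries come out correct immediately, and only the diagonal entries, which hold 1 / L[j][j], must be inverted when L is written to the pipe.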
The result is then stored into the two working copies and into the final output array.
if (column <= row) {
// Store the results to two working copies of L to be able to read
// two complete rows at each iteration
l_result_compute[row][column] = to_store;
l_result_compute_copy[row][column] = to_store;
// Store the result to the output matrix
if constexpr (is_complex) {
l_result[row * (row + 1) / 2 + column] = to_store.conj();
} else {
l_result[row * (row + 1) / 2 + column] = to_store;
}
}
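The packed index row * (row + 1) / 2 + column lays the lower triangle out row by row: (0,0), (1,0), (1,1), (2,0), ... For example, element (row = 3, column = 1) lands at l_result[3 * 4 / 2 + 1] = l_result[7].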
Finally, the row and column indices are updated. The min is what schedules the dummy computations: when a new column starts, row restarts at min(column, rows - raw_latency), so short columns are padded with dummy iterations (row < column) that keep writes and dependent reads at least raw_latency iterations apart.
// Update loop indexes
if (row == (rows - 1)) {
column = column + 1;
row = sycl::min(column, rows - raw_latency);
} else {
row = row + 1;
}
} // end of iteration
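A minimal host-side sketch (my own checking code, not part of the sample) that replays this index update and confirms the trip count matches the kTotalIterations formula, using the same assumed values rows = 8 and raw_latency = 5 as above:

#include <algorithm>
#include <cassert>

int main() {
  constexpr int rows = 8, raw_latency = 5;
  // Same formulas as in the kernel above
  constexpr int kRegularIterations = rows * (rows + 1) / 2;
  constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
  constexpr int kExtraIterationsToRemove =
      rows >= raw_latency
          ? 0
          : (raw_latency - rows) * (raw_latency - rows + 1) / 2;
  constexpr int kTotalIterations =
      kRegularIterations + kExtraIterations - kExtraIterationsToRemove;

  int column = 0, row = 0, iterations = 0;
  while (column < rows) {
    iterations++;  // one pipelined loop iteration (real or dummy)
    if (row == rows - 1) {
      column++;
      row = std::min(column, rows - raw_latency);  // row < column => dummy
    } else {
      row++;
    }
  }
  assert(iterations == kTotalIterations);  // 46 == 46 for these values
  return 0;
}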
Finally the results are post-processed and the correct values are written to the output pipe, which concludes the streaming kernel's work for this matrix.
// Go over the L matrix and write each element to the pipe
int l_idx = 0;
[[intel::loop_coalesce(2)]] // NO-FORMAT: Attribute
for (int row = 0; row < rows; row++) {
for (int column = 0; column <= row; column++) {
TT to_write;
TT current_l_value = l_result[l_idx];
// The diagonal elements need to be inverted as the
// inversion was removed from the above compute loop
// to reduce the RAW latency
if (row == column) {
if constexpr (is_complex) {
to_write = {1 / current_l_value.r(), 0};
}
else{
to_write = 1 / current_l_value;
}
} else {
to_write = current_l_value;
}
LOut::write(to_write);
l_idx++;
}
}
The complete FPGA-optimized streaming Cholesky code:
#ifndef __STREAMING_CHOLESKY_HPP__
#define __STREAMING_CHOLESKY_HPP__
#include "constexpr_math.hpp"
#include "tuple.hpp"
#include "unrolled_loop.hpp"
namespace fpga_linalg {
/*
Cholesky decomposition - Computes L such that A=LL* where:
- A is the input matrix (hermitian, positive definite)
- L is a lower triangular matrix
- L* is the conjugate transpose of L
The input and output matrices are consumed/produced from/to pipes.
*/
template <typename T, // The datatype for the computation
bool is_complex, // True if T is ac_complex<X>
int rows, // Number of rows==columns in the A matrices
int raw_latency, // Read after write (RAW) latency (in iterations) of
// the triangular loop of this function.
// This value depends on the FPGA target, the
// datatype, the target frequency, etc.
// This value will have to be tuned for optimal
// performance. Refer to the Triangular Loop
// design pattern tutorial.
// In general, find a high value for which the
// compiler is able to achieve an II of 1 and
// go down from there.
int pipe_size, // Number of elements read/write per pipe operation
// to read the input matrix
typename AIn, // A matrix input pipe, receive pipe_size
// elements from the pipe with each read
typename LOut // L matrix output pipe, send one element to the
// pipe with each write.
// Only lower-left elements of L are
// sent in row order, starting with row 0.
>
struct StreamingCholesky {
void operator()() const {
// Functional assertions
static_assert(rows >= 4,
"Only matrices of size 4x4 and over are supported");
static_assert(pipe_size >= 1,
"The pipe must be able to contain at least one element");
// Set the computation type to T or ac_complex<T> depending on the value
// of is_complex
using TT = std::conditional_t<is_complex, ac_complex<T>, T>;
constexpr int kColumns = rows;
// Number of lower-left elements in the L output matrix
constexpr int kLMatrixSize = kColumns * (kColumns + 1) / 2;
// Compute Cholesky decompositions as long as matrices are given as inputs
while (1) {
// Break memories up to store pipe_size elements per bank
constexpr short kBankwidth = pipe_size * sizeof(TT);
constexpr unsigned short kNumBanks = rows / pipe_size;
// When specifying numbanks for a memory, it must be a power of 2.
// Unused banks will be automatically optimized away.
constexpr short kNumBanksNextPow2 =
fpga_tools::Pow2(fpga_tools::CeilLog2(kNumBanks));
[[intel::numbanks(kNumBanksNextPow2)]] // NO-FORMAT: Attribute
[[intel::bankwidth(kBankwidth)]] // NO-FORMAT: Attribute
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
[[intel::max_replicates(1)]] // NO-FORMAT: Attribute
TT a_load[rows][kColumns];
// Two copies of L to be able to load two complete rows per iteration
// Multiple private copies to be able to overlap multiple loop
// iterations
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute[rows][kColumns];
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute_copy[rows][kColumns];
[[intel::private_copies(2)]] // NO-FORMAT: Attribute
TT l_result[kLMatrixSize];
// Copy a matrix from the pipe to a local memory
// Number of pipe reads of pipe_size required to read a full column
constexpr int kExtraIteration = ((rows % pipe_size) != 0) ? 1 : 0;
constexpr int kLoopIterPerColumn = (rows / pipe_size) + kExtraIteration;
// Number of pipe reads of pipe_size to read all the matrices
constexpr int kLoopIter = kLoopIterPerColumn * kColumns;
// Size in bits of the loop iterator over kLoopIter iterations
constexpr int kLoopIterBitSize =
fpga_tools::BitsForMaxValue<kLoopIter + 1>();
[[intel::initiation_interval(1)]] // NO-FORMAT: Attribute
for (ac_int<kLoopIterBitSize, false> li = 0; li < kLoopIter; li++) {
fpga_tools::NTuple<TT, pipe_size> pipe_read = AIn::read();
int write_idx = li % kLoopIterPerColumn;
fpga_tools::UnrolledLoop<kLoopIterPerColumn>([&](auto k) {
fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
if constexpr (k * pipe_size + t < kColumns) {
if (write_idx == k) {
a_load[li / kLoopIterPerColumn][k * pipe_size + t] =
pipe_read.template get<t>();
}
}
// Delay data signals to create a vine-based data distribution
// to lower signal fanout.
pipe_read.template get<t>() =
sycl::ext::intel::fpga_reg(pipe_read.template get<t>());
});
write_idx = sycl::ext::intel::fpga_reg(write_idx);
});
}
// Computation of the number of iterations required for the triangular
// loop. Refer to the triangular_loop tutorial for details on how
// to compute these.
constexpr int kRegularIterations = kColumns * (kColumns + 1) / 2;
constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
constexpr int kExtraIterationsToRemove =
kColumns >= raw_latency
? 0
: (raw_latency - kColumns) * (raw_latency - kColumns + 1) / 2;
constexpr int kTotalIterations =
kRegularIterations + kExtraIterations - kExtraIterationsToRemove;
// Compute the L matrix
int column = 0;
int row = 0;
TT div_term{0};
[[intel::ivdep(raw_latency)]] // NO-FORMAT: Attribute
for (int iteration = 0; iteration < kTotalIterations; iteration++) {
// Perform the dot product of the elements of the two rows indexed by
// row and column from element 0 to column
TT sum = 0;
fpga_tools::UnrolledLoop<kColumns>([&](auto k) {
TT to_add;
bool should_compute = k < column;
TT mul_lhs = should_compute ? l_result_compute[row][k] : T{0};
TT mul_rhs = should_compute ? l_result_compute_copy[column][k] : T{0};
if constexpr (is_complex) {
to_add = mul_lhs * mul_rhs.conj();
} else {
to_add = mul_lhs * mul_rhs;
}
sum += to_add;
});
TT a_loaded = (row < rows) ? a_load[row][column] : TT{0};
TT diff = a_loaded - sum;
// Only do useful work for meaningful iterations
TT to_store;
if (row == column) {
// Perform the reciprocal sqrt rather than the sqrt because:
// - it has a shorter latency and will reduce the RAW latency
// of the loop
// - the result of the sqrt is used as a divisor which is also
// a long operation, so replacing x/sqrt by x*rsqrt will save
// latency
// - the diagonal elements will need to be inverted later, but we
// can do that while outside this loop when we transfer the L
// matrix to the pipe
if constexpr (is_complex) {
div_term = {sycl::rsqrt(diff.r()), 0};
} else {
div_term = sycl::rsqrt(diff);
}
to_store = div_term;
} else {
to_store = diff * div_term;
}
if (column <= row) {
// Store the results to two working copies of L to be able to read
// two complete rows at each iteration
l_result_compute[row][column] = to_store;
l_result_compute_copy[row][column] = to_store;
// Store the result to the output matrix
if constexpr (is_complex) {
l_result[row * (row + 1) / 2 + column] = to_store.conj();
} else {
l_result[row * (row + 1) / 2 + column] = to_store;
}
}
// Update loop indexes
if (row == (rows - 1)) {
column = column + 1;
row = sycl::min(column, rows - raw_latency);
} else {
row = row + 1;
}
} // end of iteration
// Go over the L matrix and write each element to the pipe
int l_idx = 0;
[[intel::loop_coalesce(2)]] // NO-FORMAT: Attribute
for (int row = 0; row < rows; row++) {
for (int column = 0; column <= row; column++) {
TT to_write;
TT current_l_value = l_result[l_idx];
// The diagonal elements need to be inverted as the
// inversion was removed from the above compute loop
// to reduce the RAW latency
if (row == column) {
if constexpr (is_complex) {
to_write = {1 / current_l_value.r(), 0};
}
else{
to_write = 1 / current_l_value;
}
} else {
to_write = current_l_value;
}
LOut::write(to_write);
l_idx++;
}
}
} // end of while(1)
} // end of operator
}; // end of struct
} // namespace fpga_linalg
#endif /* __STREAMING_CHOLESKY_HPP__ */
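For context, a hypothetical host-side usage sketch (the pipe names, kernel name and parameter values below are illustrative assumptions, not part of this header). Since the functor runs a while(1) loop, the kernel is submitted once and then fed matrices through the pipes:

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include "streaming_cholesky.hpp"

constexpr int kRows = 8;         // matrix dimension
constexpr int kRawLatency = 110; // tune per target, see the comments above
constexpr int kPipeSize = 4;     // elements per pipe transaction

using PipeType = fpga_tools::NTuple<float, kPipeSize>;
using APipe = sycl::ext::intel::pipe<class APipeId, PipeType, 3>;
using LPipe = sycl::ext::intel::pipe<class LPipeId, float, 3>;

void LaunchCholesky(sycl::queue &q) {
  // Submit once; the kernel keeps consuming A matrices from APipe and
  // producing L factors on LPipe until the program terminates.
  q.single_task<class CholeskyKernel>(
      fpga_linalg::StreamingCholesky<float, /*is_complex=*/false, kRows,
                                     kRawLatency, kPipeSize, APipe, LPipe>{});
}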