Optimizing a Streaming Cholesky Algorithm on FPGA

Pseudocode for the Cholesky decomposition:

// rows == number of rows == number of columns of A
for (column = 0; column < rows; column++) {
  for (row = column; row < rows; row++) {
    float sum = 0;
    for (k = 0; k < column; k++)
      sum += L[row][k] * L[column][k];

    if (row == column)
      L[row][column] = sqrt(A[row][row] - sum);
    else
      L[row][column] = (A[row][column] - sum) / L[column][column];
  }
}
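For reference, here is a minimal host-side version of this pseudocode (an illustrative sketch assuming a row-major std::vector<float> layout; it is not part of the FPGA sample, but it is handy as a golden model when checking the kernel's output):

#include <cmath>
#include <vector>

// Reference (non-FPGA) Cholesky: computes the lower-triangular L with A = L * L^T.
// A and L are n x n matrices stored row-major in std::vector<float>.
std::vector<float> CholeskyReference(const std::vector<float> &A, int n) {
  std::vector<float> L(n * n, 0.0f);
  for (int column = 0; column < n; column++) {
    for (int row = column; row < n; row++) {
      float sum = 0.0f;
      for (int k = 0; k < column; k++)
        sum += L[row * n + k] * L[column * n + k];

      if (row == column)
        L[row * n + column] = std::sqrt(A[row * n + row] - sum);
      else
        L[row * n + column] = (A[row * n + column] - sum) / L[column * n + column];
    }
  }
  return L;
}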

These notes are based on the FPGA-optimized streaming Cholesky implementation written with Intel oneAPI C++ and SYCL.

The key points of optimizing the Cholesky algorithm on an FPGA are:

1. Data streams through a pipe from global memory into FPGA on-chip memory, so the kernel never reads global memory directly and can be designed to be stall-free; the results leave the kernel through an output pipe (see the pipe sketch after this list).

2. Apply the loop-dependency relaxation (ivdep) together with the pipelined Triangular Loop optimization: a fixed number of dummy iterations is inserted to keep read-after-write (RAW) dependent accesses far enough apart, which stops the compiler from raising the loop initiation interval (II).

3. Use memory banking and private copies of the local arrays so that multiple loop iterations can be in flight concurrently without bank conflicts.
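As a minimal sketch of point 1, an inter-kernel pipe in SYCL for FPGA can be declared as below (the identifiers, the 4 elements per transaction and the capacity of 3 are illustrative assumptions, not values taken from the sample):

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include "tuple.hpp"  // fpga_tools::NTuple, from the sample's include directory

// Hypothetical pipe carrying 4 floats per transaction, minimum capacity 3.
class AMatrixPipeID;  // unique type whose only job is to identify the pipe
using AMatrixPipe =
    sycl::ext::intel::pipe<AMatrixPipeID, fpga_tools::NTuple<float, 4>, 3>;

// Producer kernel:            AMatrixPipe::write(chunk);
// Consumer (Cholesky) kernel: auto chunk = AMatrixPipe::read();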

Analysis of the streaming kernel code:

The template parameters are: rows, the number of matrix rows; pipe_size, the number of matrix elements transferred per pipe read; AIn and LOut, the pipes for the input data and the output results; and raw_latency, the maximum number of delayed (dummy) iterations needed to resolve the read-after-write (RAW) hazard. The last parameter ties into the loop-dependency and Triangular Loop optimizations mentioned above.

template <typename T,       // The datatype for the computation
          bool is_complex,  // True if T is ac_complex<X>
          int rows,         // Number of rows==columns in the A matrices
          int raw_latency,  // Read after write (RAW) latency (in iterations) of
                            // the triangular loop of this function.
                            // This value depends on the FPGA target, the
                            // datatype, the target frequency, etc.
                            // This value will have to be tuned for optimal
                            // performance. Refer to the Triangular Loop
                            // design pattern tutorial.
                            // In general, find a high value for which the
                            // compiler is able to achieve an II of 1 and
                            // go down from there.
          int pipe_size,    // Number of elements read/write per pipe operation
                            // to read the input matrix
          typename AIn,     // A matrix input pipe, receive pipe_size
                            // elements from the pipe with each read
          typename LOut     // L matrix output pipe, send one element to the
                            // pipe with each write.
                            // Only lower-left elements of L are
                            // sent in row order, starting with row 0.
          >

The code below selects the computation type and computes the total number of L-matrix elements that will be transferred out (the lower-triangular element count). Because rows is known at compile time, and a SYCL kernel cannot size its local arrays dynamically, all array sizes must be fixed up front.

// Functional assertions 
static_assert(rows >= 4,
                "Only matrices of size 4x4 and over are supported");
static_assert(pipe_size >= 1,
                "The pipe must be able to contain at least one element");

// Set the computation type to T or ac_complex<T> depending on the value
// of is_complex
using TT = std::conditional_t<is_complex, ac_complex<T>, T>;

constexpr int kColumns = rows;

// Number of lower-left elements in the L output matrix
constexpr int kLMatrixSize = kColumns * (kColumns + 1) / 2;

The next block sets up memory banking: each bank is kBankwidth = pipe_size * sizeof(TT) bytes wide, i.e. exactly one pipe read, and there are rows / pipe_size banks, so one column of the a_load[rows][kColumns] matrix is spread across the banks.

The local array l_result_compute[rows][kColumns] (and its copy) is given four private copies, letting the pipeline keep several overlapping iterations in flight, each accessing the result array in parallel.

constexpr short kBankwidth = pipe_size * sizeof(TT);
constexpr unsigned short kNumBanks = rows / pipe_size;

// When specifying numbanks for a memory, it must be a power of 2.
// Unused banks will be automatically optimized away.
constexpr short kNumBanksNextPow2 =
    fpga_tools::Pow2(fpga_tools::CeilLog2(kNumBanks));

[[intel::numbanks(kNumBanksNextPow2)]]  // NO-FORMAT: Attribute
[[intel::bankwidth(kBankwidth)]]        // NO-FORMAT: Attribute
[[intel::private_copies(4)]]            // NO-FORMAT: Attribute
[[intel::max_replicates(1)]]            // NO-FORMAT: Attribute
TT a_load[rows][kColumns];

// Two copies of L to be able to load two complete rows per iteration
// Multiple private copies to be able to overlap multiple loop
// iterations
[[intel::private_copies(4)]]  // NO-FORMAT: Attribute
TT l_result_compute[rows][kColumns];
[[intel::private_copies(4)]]  // NO-FORMAT: Attribute
TT l_result_compute_copy[rows][kColumns];

[[intel::private_copies(2)]]  // NO-FORMAT: Attribute
TT l_result[kLMatrixSize];
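As a quick sanity check of the banking arithmetic, here is a worked example with assumed values rows = 8, pipe_size = 4 and TT = float (illustrative numbers only, not the sample's defaults):

// Assumed configuration: rows = 8, pipe_size = 4, TT = float.
constexpr int kRowsEx = 8;
constexpr int kPipeSizeEx = 4;
constexpr short kBankwidthEx = kPipeSizeEx * sizeof(float);    // 16 bytes per bank
constexpr unsigned short kNumBanksEx = kRowsEx / kPipeSizeEx;  // 2 banks
static_assert(kBankwidthEx == 16, "one bank holds exactly one pipe read of 4 floats");
static_assert(kNumBanksEx == 2, "an 8-element column is split across 2 banks");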

Next, the pipe data is copied into the local array. Intel's unrolled-loop helper is used: the copy is unrolled over the number of pipe reads per column (kLoopIterPerColumn) and over the elements of each read (pipe_size), and the data is stored into on-chip memory so the subsequent compute stage has efficient access.

FPGA registers (fpga_reg) are inserted into the unrolled copy to avoid large fanout: high-fanout signals force longer routing at the physical level, which lowers efficiency and fMAX. Explicitly inserting registers breaks the distribution tree into stages and improves the achievable frequency, at the cost of some extra latency; see the fpga_reg tutorial on reducing fanout.

// Copy a matrix from the pipe to a local memory
// Number of pipe reads of pipe_size required to read a full column
constexpr int kExtraIteration = ((rows % pipe_size) != 0) ? 1 : 0;
constexpr int kLoopIterPerColumn = (rows / pipe_size) + kExtraIteration;
// Number of pipe reads of pipe_size to read all the matrices
constexpr int kLoopIter = kLoopIterPerColumn * kColumns;
// Size in bits of the loop iterator over kLoopIter iterations
constexpr int kLoopIterBitSize =
    fpga_tools::BitsForMaxValue<kLoopIter + 1>();

[[intel::initiation_interval(1)]]  // NO-FORMAT: Attribute
for (ac_int<kLoopIterBitSize, false> li = 0; li < kLoopIter; li++) {
  fpga_tools::NTuple<TT, pipe_size> pipe_read = AIn::read();

  int write_idx = li % kLoopIterPerColumn;

  fpga_tools::UnrolledLoop<kLoopIterPerColumn>([&](auto k) {
    fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
      if constexpr (k * pipe_size + t < kColumns) {
        if (write_idx == k) {
          a_load[li / kLoopIterPerColumn][k * pipe_size + t] =
              pipe_read.template get<t>();
        }
      }

      // Delay data signals to create a vine-based data distribution
      // to lower signal fanout.
      pipe_read.template get<t>() =
          sycl::ext::intel::fpga_reg(pipe_read.template get<t>());
    });

    write_idx = sycl::ext::intel::fpga_reg(write_idx);
  });
}
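For context, a matching producer has to push A into AIn in exactly the order this copy loop expects: kColumns groups of kLoopIterPerColumn pipe writes, each carrying pipe_size elements. In the sample this is done by a separate kernel that reads the matrix from global memory; the sketch below is only illustrative (it reuses the consumer's constants, and a_matrix is an assumed source array laid out so that a_matrix[i][j] lands in a_load[i][j]):

// Producer sketch mirroring the copy loop above.
for (int li = 0; li < kLoopIter; li++) {
  int group = li / kLoopIterPerColumn;  // which first index of a_load gets filled
  int chunk = li % kLoopIterPerColumn;  // which pipe_size chunk inside that group

  fpga_tools::NTuple<TT, pipe_size> pipe_write;
  fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
    int idx = chunk * pipe_size + t;
    // Pad the last chunk with zeros when rows is not a multiple of pipe_size;
    // the consumer simply drops these padding elements.
    pipe_write.template get<t>() = (idx < kColumns) ? a_matrix[group][idx] : TT{0};
  });
  AIn::write(pipe_write);
}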

Next comes the main body of the computation, i.e. the actual implementation of the pseudocode above.

In the original pseudocode, column is the outer loop and row is the inner loop. Following the Triangular Loop optimization, the two loops are collapsed into one loop whose total trip count (kTotalIterations) is the original number of iterations plus a number of dummy iterations. The dummy count is derived from raw_latency, which is tuned experimentally: pick the value that is just large enough for the loop to achieve an II of 1.

// Computation of the number of iterations required for the triangular
// loop. Refer to the triangular_loop tutorial for details on how
// to compute these.
constexpr int kRegularIterations = kColumns * (kColumns + 1) / 2;
constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
constexpr int kExtraIterationsToRemove =
    kColumns >= raw_latency
        ? 0
        : (raw_latency - kColumns) * (raw_latency - kColumns + 1) / 2;
constexpr int kTotalIterations =
    kRegularIterations + kExtraIterations - kExtraIterationsToRemove;
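As a concrete check of these formulas, here is a worked example with assumed values kColumns = 8 and raw_latency = 5 (illustrative only):

// Assumed: kColumns = 8, raw_latency = 5.
constexpr int kColumnsEx = 8;
constexpr int kRawLatencyEx = 5;
constexpr int kRegularEx = kColumnsEx * (kColumnsEx + 1) / 2;      // 36 useful iterations
constexpr int kExtraEx = (kRawLatencyEx - 1) * kRawLatencyEx / 2;  // 10 dummy iterations
constexpr int kRemoveEx =
    kColumnsEx >= kRawLatencyEx
        ? 0
        : (kRawLatencyEx - kColumnsEx) * (kRawLatencyEx - kColumnsEx + 1) / 2;  // 0 here
static_assert(kRegularEx + kExtraEx - kRemoveEx == 46,
              "an 8x8 matrix with raw_latency 5 gives 46 collapsed-loop iterations");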

The compute stage follows the same steps as the pseudocode. First sum is computed: k is unrolled kColumns times, but whether each unrolled lane actually contributes depends on the current column, selected by the bool predicate should_compute.

When computing the result at position (row, column), the value only depends on L elements whose indices are at most row and column.

Each partial product is placed in to_add and accumulated into sum.

int column = 0;
int row = 0;
TT div_term{0};
[[intel::ivdep(raw_latency)]]  // NO-FORMAT: Attribute
for (int iteration = 0; iteration < kTotalIterations; iteration++) {
  // Perform the dot product of the elements of the two rows indexed by
  // row and column from element 0 to column

  TT sum = 0;
  fpga_tools::UnrolledLoop<kColumns>([&](auto k) {
    TT to_add;
    bool should_compute = k < column;
    TT mul_lhs = should_compute ? l_result_compute[row][k] : T{0};
    TT mul_rhs = should_compute ? l_result_compute_copy[column][k] : T{0};

    if constexpr (is_complex) {
      to_add = mul_lhs * mul_rhs.conj();
    } else {
      to_add = mul_lhs * mul_rhs;
    }
    sum += to_add;
  });

Next, A[row][column] is loaded from a_load (on the diagonal this is A[row][row]) and sum is subtracted from it, giving diff.

When row equals column, only a square root is needed, but for efficiency and lower latency the SYCL reciprocal square root sycl::rsqrt is used instead: its result is later used as a divisor, and a multiply is cheaper than a divide, so x / sqrt(d) becomes x * rsqrt(d). The stored diagonal value of course has to be inverted again at the end.

When row and column differ, the previously computed reciprocal square root is reused: the off-diagonal element is simply diff multiplied by div_term.

TT a_loaded = (row < rows) ? a_load[row][column] : TT{0};
TT diff = a_loaded - sum;

// Only do useful work for meaningful iterations
TT to_store;
if (row == column) {
  // Perform the reciprocal sqrt rather than the sqrt because:
  // - it has a shorter latency and will reduce the RAW latency
  //   of the loop
  // - the result of the sqrt is used as a divisor which is also
  //   a long operation, so replacing x/sqrt by x*rsqrt will save
  //   latency
  // - the diagonal elements will need to be inverted later, but we
  //   can do that while outside this loop when we transfer the L
  //   matrix to the pipe
  if constexpr (is_complex) {
    div_term = {sycl::rsqrt(diff.r()), 0};
  } else {
    div_term = sycl::rsqrt(diff);
  }
  to_store = div_term;
} else {
  to_store = diff * div_term;
}
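A tiny host-side check of the deferred-inversion trick (a sketch using standard C++ math in place of sycl::rsqrt): the diagonal iteration stores rsqrt(d), off-diagonal iterations compute diff * rsqrt(d), which equals the pseudocode's diff / sqrt(d), and inverting the stored diagonal at output time recovers sqrt(d).

#include <cassert>
#include <cmath>

int main() {
  float d = 9.0f;     // diff on a diagonal iteration (positive for an SPD input)
  float diff = 6.0f;  // diff on a later off-diagonal iteration in the same column

  float div_term = 1.0f / std::sqrt(d);  // what the loop stores on the diagonal
  float off_diag = diff * div_term;      // off-diagonal result, multiply instead of divide

  assert(std::fabs(off_diag - diff / std::sqrt(d)) < 1e-6f);  // matches the pseudocode
  assert(std::fabs(1.0f / div_term - std::sqrt(d)) < 1e-6f);  // inversion restores sqrt(d)
  return 0;
}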

The result is then stored into the two working copies of L and into the final output array.

if (column <= row) {
  // Store the results to two working copies of L to be able to read
  // two complete rows at each iteration
  l_result_compute[row][column] = to_store;
  l_result_compute_copy[row][column] = to_store;
  // Store the result to the output matrix
  if constexpr (is_complex) {
    l_result[row * (row + 1) / 2 + column] = to_store.conj();
  } else {
    l_result[row * (row + 1) / 2 + column] = to_store;
  }
}

Finally, the row and column indices are updated. The min() decides whether dummy iterations are inserted: when a column finishes, row restarts at min(column, rows - raw_latency), so once fewer than raw_latency useful iterations remain in the column, extra iterations with row < column are executed. They store nothing (the column <= row guard fails) but keep the loop-carried dependency at least raw_latency iterations apart.

// Update loop indexes
if (row == (rows - 1)) {
    column = column + 1;
    row = sycl::min(column, rows - raw_latency);
} else {
    row = row + 1;
}

}  // end of iteration
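The collapsed-loop index walk can be simulated on the host to see exactly where the dummy iterations land. A small illustrative program (assuming rows = 8 and raw_latency = 5, and using std::min in place of sycl::min):

#include <algorithm>
#include <cstdio>

int main() {
  constexpr int rows = 8;
  constexpr int raw_latency = 5;
  constexpr int kTotalIterations = 46;  // from the worked example above

  int column = 0, row = 0, useful = 0, dummy = 0;
  for (int it = 0; it < kTotalIterations; it++) {
    if (column <= row) useful++; else dummy++;  // dummy: row < column, nothing is stored

    if (row == rows - 1) {
      column++;
      row = std::min(column, rows - raw_latency);
    } else {
      row++;
    }
  }
  std::printf("useful=%d dummy=%d\n", useful, dummy);  // prints: useful=36 dummy=10
  return 0;
}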

Finally, the results are written to the output pipe and the streaming computation for this matrix is done. The diagonal elements are inverted here (undoing the rsqrt stored by the compute loop) before being written out.

// Go over the L matrix and write each element to the pipe
int l_idx = 0;
[[intel::loop_coalesce(2)]]  // NO-FORMAT: Attribute
for (int row = 0; row < rows; row++) {
  for (int column = 0; column <= row; column++) {
    TT to_write;
    TT current_l_value = l_result[l_idx];
    // The diagonal elements need to be inverted as the
    // inversion was removed from the above compute loop
    // to reduce the RAW latency
    if (row == column) {
      if constexpr (is_complex) {
        to_write = {1 / current_l_value.r(), 0};
      } else {
        to_write = 1 / current_l_value;
      }
    } else {
      to_write = current_l_value;
    }
    LOut::write(to_write);

    l_idx++;
  }
}
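On the receiving side, a consumer can rebuild the lower-triangular L simply by mirroring this write order, since the element at linear index l_idx corresponds to (row, column) with l_idx = row * (row + 1) / 2 + column. A minimal sketch (the dense rows x rows destination array L is an assumption):

// Consumer sketch: read the kLMatrixSize elements in row order and scatter
// them into a dense lower-triangular matrix; the upper-right part stays zero.
for (int row = 0; row < rows; row++) {
  for (int column = 0; column <= row; column++) {
    L[row][column] = LOut::read();
  }
}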

The complete FPGA-optimized streaming Cholesky code follows:

#ifndef __STREAMING_CHOLESKY_HPP__
#define __STREAMING_CHOLESKY_HPP__

#include "constexpr_math.hpp"
#include "tuple.hpp"
#include "unrolled_loop.hpp"

namespace fpga_linalg {

/*
  Cholesky decomposition - Computes L such that A=LL* where:
  - A is the input matrix (hermitian, positive definite)
  - L is a lower triangular matrix
  - L* is the conjugate transpose of L

  The input and output matrices are consumed/produced from/to pipes.
*/
template <typename T,       // The datatype for the computation
          bool is_complex,  // True if T is ac_complex<X>
          int rows,         // Number of rows==columns in the A matrices
          int raw_latency,  // Read after write (RAW) latency (in iterations) of
                            // the triangular loop of this function.
                            // This value depends on the FPGA target, the
                            // datatype, the target frequency, etc.
                            // This value will have to be tuned for optimal
                            // performance. Refer to the Triangular Loop
                            // design pattern tutorial.
                            // In general, find a high value for which the
                            // compiler is able to achieve an II of 1 and
                            // go down from there.
          int pipe_size,    // Number of elements read/write per pipe operation
                            // to read the input matrix
          typename AIn,     // A matrix input pipe, receive pipe_size
                            // elements from the pipe with each read
          typename LOut     // L matrix output pipe, send one element to the
                            // pipe with each write.
                            // Only lower-left elements of L are
                            // sent in row order, starting with row 0.
          >
struct StreamingCholesky {
  void operator()() const {
    // Functional assertions 
    static_assert(rows >= 4,
                  "Only matrices of size 4x4 and over are supported");
    static_assert(pipe_size >= 1,
                  "The pipe must be able to contain at least one element");

    // Set the computation type to T or ac_complex<T> depending on the value
    // of is_complex
    using TT = std::conditional_t<is_complex, ac_complex<T>, T>;

    constexpr int kColumns = rows;

    // Number of lower-left elements in the L output matrix
    constexpr int kLMatrixSize = kColumns * (kColumns + 1) / 2;

    // Compute Cholesky decompositions as long as matrices are given as inputs
    while (1) {
      // Break memories up to store pipe_size elements per bank
      constexpr short kBankwidth = pipe_size * sizeof(TT);
      constexpr unsigned short kNumBanks = rows / pipe_size;

      // When specifying numbanks for a memory, it must be a power of 2.
      // Unused banks will be automatically optimized away.
      constexpr short kNumBanksNextPow2 =
          fpga_tools::Pow2(fpga_tools::CeilLog2(kNumBanks));

      [[intel::numbanks(kNumBanksNextPow2)]]  // NO-FORMAT: Attribute
      [[intel::bankwidth(kBankwidth)]]        // NO-FORMAT: Attribute
      [[intel::private_copies(4)]]            // NO-FORMAT: Attribute
      [[intel::max_replicates(1)]]            // NO-FORMAT: Attribute
      TT a_load[rows][kColumns];

      // Two copies of L to be able to load two complete rows per iteration
      // Multiple private copies to be able to overlap multiple loop
      // iterations
      [[intel::private_copies(4)]]  // NO-FORMAT: Attribute
      TT l_result_compute[rows][kColumns];
      [[intel::private_copies(4)]]  // NO-FORMAT: Attribute
      TT l_result_compute_copy[rows][kColumns];

      [[intel::private_copies(2)]]  // NO-FORMAT: Attribute
      TT l_result[kLMatrixSize];

      // Copy a matrix from the pipe to a local memory
      // Number of pipe reads of pipe_size required to read a full column
      constexpr int kExtraIteration = ((rows % pipe_size) != 0) ? 1 : 0;
      constexpr int kLoopIterPerColumn = (rows / pipe_size) + kExtraIteration;
      // Number of pipe reads of pipe_size to read all the matrices
      constexpr int kLoopIter = kLoopIterPerColumn * kColumns;
      // Size in bits of the loop iterator over kLoopIter iterations
      constexpr int kLoopIterBitSize =
          fpga_tools::BitsForMaxValue<kLoopIter + 1>();

      [[intel::initiation_interval(1)]]  // NO-FORMAT: Attribute
      for (ac_int<kLoopIterBitSize, false> li = 0; li < kLoopIter; li++) {
        fpga_tools::NTuple<TT, pipe_size> pipe_read = AIn::read();

        int write_idx = li % kLoopIterPerColumn;

        fpga_tools::UnrolledLoop<kLoopIterPerColumn>([&](auto k) {
          fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
            if constexpr (k * pipe_size + t < kColumns) {
              if (write_idx == k) {
                a_load[li / kLoopIterPerColumn][k * pipe_size + t] =
                    pipe_read.template get<t>();
              }
            }

            // Delay data signals to create a vine-based data distribution
            // to lower signal fanout.
            pipe_read.template get<t>() =
                sycl::ext::intel::fpga_reg(pipe_read.template get<t>());
          });

          write_idx = sycl::ext::intel::fpga_reg(write_idx);
        });
      }

      // Computation of the number of iterations required for the triangular
      // loop. Refer to the triangular_loop tutorial for details on how
      // to compute these.
      constexpr int kRegularIterations = kColumns * (kColumns + 1) / 2;
      constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
      constexpr int kExtraIterationsToRemove =
          kColumns >= raw_latency
              ? 0
              : (raw_latency - kColumns) * (raw_latency - kColumns + 1) / 2;
      constexpr int kTotalIterations =
          kRegularIterations + kExtraIterations - kExtraIterationsToRemove;

      // Compute the L matrix
      int column = 0;
      int row = 0;
      TT div_term{0};
      [[intel::ivdep(raw_latency)]] // NO-FORMAT: Attribute
      for (int iteration = 0; iteration < kTotalIterations; iteration++) {
        // Perform the dot product of the elements of the two rows indexed by
        // row and column from element 0 to column

        TT sum = 0;
        fpga_tools::UnrolledLoop<kColumns>([&](auto k) {
          TT to_add;
          bool should_compute =  k < column;
          TT mul_lhs = should_compute ? l_result_compute[row][k] : T{0};
          TT mul_rhs = should_compute ? l_result_compute_copy[column][k] : T{0};
          
          if constexpr (is_complex) {
            to_add = mul_lhs * mul_rhs.conj();
          } else {
            to_add = mul_lhs * mul_rhs;
          }
          sum += to_add;
        });

        TT a_loaded = (row < rows) ? a_load[row][column] : TT{0};
        TT diff = a_loaded - sum;

        // Only do useful work for meaningful iterations
        TT to_store;
        if (row == column) {
          // Perform the reciprocal sqrt rather than the sqrt because:
          // - it has a shorter latency and will reduce the RAW latency
          //   of the loop
          // - the result of the sqrt is used as a divisor which is also
          //   a long operation, so replacing x/sqrt by x*rsqrt will save
          //   latency
          // - the diagonal elements will need to be inverted later, but we
          //   can do that while outside this loop when we transfer the L
          //   matrix to the pipe
          if constexpr (is_complex) {
            div_term = {sycl::rsqrt(diff.r()), 0};
          } else {
            div_term = sycl::rsqrt(diff);
          }
          to_store = div_term;
        } else {
          to_store = diff * div_term;
        }

        if (column <= row) {
          // Store the results to two working copies of L to be able to read
          // two complete rows at each iteration
          l_result_compute[row][column] = to_store;
          l_result_compute_copy[row][column] = to_store;
          // Store the result to the output matrix
          if constexpr (is_complex) {
            l_result[row * (row + 1) / 2 + column] = to_store.conj();
          } else {
            l_result[row * (row + 1) / 2 + column] = to_store;
          }
        }

        // Update loop indexes
        if (row == (rows - 1)) {
          column = column + 1;
          row = sycl::min(column, rows - raw_latency);
        } else {
          row = row + 1;
        }

      }  // end of iteration

      // Go over the L matrix and write each element to the pipe
      int l_idx = 0;
      [[intel::loop_coalesce(2)]] // NO-FORMAT: Attribute
      for (int row = 0; row < rows; row++) {
        for (int column = 0; column <= row; column++) {
          TT to_write;
          TT current_l_value = l_result[l_idx];
          // The diagonal elements need to be inverted as the
          // inversion was removed from the above compute loop
          // to reduce the RAW latency
          if (row == column) {
            if constexpr (is_complex) {
              to_write = {1 / current_l_value.r(), 0};
            }
            else{
              to_write = 1 / current_l_value;
            }

          } else {
            to_write = current_l_value;
          }
          LOut::write(to_write);

          l_idx++;
        }
      }

    }  // end of while(1)
  }    // end of operator
};     // end of struct

}  // namespace fpga_linalg

#endif /* __STREAMING_CHOLESKY_HPP__ */
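To round things off, here is a minimal sketch of how the kernel functor might be instantiated and launched. The pipe names, queue setup and the values rows = 8, raw_latency = 64, pipe_size = 4 are illustrative assumptions; in the real sample this is wired up together with producer and consumer kernels that move data between global memory and the pipes.

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

#include "streaming_cholesky.hpp"
#include "tuple.hpp"

// Hypothetical pipe identifiers and pipe types.
class APipeID;
class LPipeID;
using AMatrixPipe =
    sycl::ext::intel::pipe<APipeID, fpga_tools::NTuple<float, 4>, 3>;
using LMatrixPipe = sycl::ext::intel::pipe<LPipeID, float, 3>;

void LaunchCholesky(sycl::queue &q) {
  // rows = 8, raw_latency = 64, pipe_size = 4 are assumed tuning values.
  q.single_task<class CholeskyKernel>(
      fpga_linalg::StreamingCholesky<float, /*is_complex=*/false, /*rows=*/8,
                                     /*raw_latency=*/64, /*pipe_size=*/4,
                                     AMatrixPipe, LMatrixPipe>{});
  // A producer kernel (not shown) feeds AMatrixPipe and a consumer drains
  // LMatrixPipe; the Cholesky kernel itself loops forever in while(1).
}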
