Pseudocode for the Cholesky algorithm (rows is the matrix dimension):

for (column = 0; column < rows; column++) {
  for (row = column; row < rows; row++) {
    float sum = 0;
    for (k = 0; k < column; k++)
      sum += L[row][k] * L[column][k];
    if (row == column)
      L[row][column] = sqrt(A[row][row] - sum);
    else
      L[row][column] = (A[row][column] - sum) / L[column][column];
  }
}
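As a reference point, here is a minimal host-side C++ version of this pseudocode (a sketch of my own, handy for checking the FPGA results; it assumes A is a symmetric positive-definite rows x rows matrix and fills in the lower triangle of L):

#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Straightforward reference factorization: A = L * L^T
void CholeskyReference(const Matrix &A, Matrix &L, int rows) {
  for (int column = 0; column < rows; column++) {
    for (int row = column; row < rows; row++) {
      float sum = 0;
      for (int k = 0; k < column; k++)
        sum += L[row][k] * L[column][k];
      if (row == column)
        L[row][column] = std::sqrt(A[row][row] - sum);
      else
        L[row][column] = (A[row][column] - sum) / L[column][column];
    }
  }
}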
These experiment notes are based on the FPGA-optimized version of the streaming Cholesky algorithm written in Intel oneAPI C++ SYCL.
The key points of optimizing the Cholesky algorithm on an FPGA are:
1. Data streams from global memory into FPGA on-chip memory through a pipe, avoiding repeated global-memory reads; the kernel is designed to be stall-free, and results leave the kernel through another pipe.
2. The ivdep loop-dependence hint and the pipelined Triangular Loop optimization insert a number of dummy computations to sidestep read-after-write (RAW) hazards, preventing the compiler from raising the loop initiation interval (II).
3. Memory banking and private copies let multiple loop iterations execute concurrently without bank conflicts.
Analysis of the streaming code:
The template takes rows (the number of rows), pipe_size (the number of matrix elements transferred per pipe read), and AIn and LOut (the input-data pipe and the output-result pipe). raw_latency is the number of delayed (dummy) computations needed to resolve the read-after-write hazard; it ties into the compiler loop-dependence optimization and the Triangular Loop optimization mentioned above.
template <typename T, // The datatype for the computation
bool is_complex, // True if T is ac_complex<X>
int rows, // Number of rows==columns in the A matrices
int raw_latency, // Read after write (RAW) latency (in iterations) of
// the triangular loop of this function.
// This value depends on the FPGA target, the
// datatype, the target frequency, etc.
// This value will have to be tuned for optimal
// performance. Refer to the Triangular Loop
// design pattern tutorial.
// In general, find a high value for which the
// compiler is able to achieve an II of 1 and
// go down from there.
int pipe_size, // Number of elements read/write per pipe operation
// to read the input matrix
typename AIn, // A matrix input pipe, receive pipe_size
// elements from the pipe with each read
typename LOut // L matrix output pipe, send one element to the
// pipe with each write.
// Only lower-left elements of L are
// sent in row order, starting with row 0.
>
The code below checks the functional preconditions and computes the total number of L-matrix elements that must be transferred. Because rows is known at compile time and SYCL kernels cannot size local arrays dynamically, the kernel has to fix all array sizes up front.
// Functional assertions
static_assert(rows >= 4,
"Only matrices of size 4x4 and over are supported");
static_assert(pipe_size >= 1,
"The pipe must be able to contain at least one element");
// Set the computation type to T or ac_complex<T> depending on the value
// of is_complex
using TT = std::conditional_t<is_complex, ac_complex<T>, T>;
constexpr int kColumns = rows;
// Number of lower-left elements in the L output matrix
constexpr int kLMatrixSize = kColumns * (kColumns + 1) / 2;
The next block sets up the memory banking: each bank is kBankwidth bytes wide, the actual size of one pipe read of pipe_size elements, and there are rows / pipe_size banks, so one column's worth of the incoming matrix (one row of a_load[rows][kColumns]) is divided among the banks.
The local arrays l_result_compute[rows][kColumns] and l_result_compute_copy[rows][kColumns] get four private copies each, letting the pipeline access the result arrays from several overlapping iterations in parallel.
constexpr short kBankwidth = pipe_size * sizeof(TT);
constexpr unsigned short kNumBanks = rows / pipe_size;
// When specifying numbanks for a memory, it must be a power of 2.
// Unused banks will be automatically optimized away.
constexpr short kNumBanksNextPow2 =
fpga_tools::Pow2(fpga_tools::CeilLog2(kNumBanks));
[[intel::numbanks(kNumBanksNextPow2)]] // NO-FORMAT: Attribute
[[intel::bankwidth(kBankwidth)]] // NO-FORMAT: Attribute
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
[[intel::max_replicates(1)]] // NO-FORMAT: Attribute
TT a_load[rows][kColumns];
// Two copies of L to be able to load two complete rows per iteration
// Multiple private copies to be able to overlap multiple loop
// iterations
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute[rows][kColumns];
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute_copy[rows][kColumns];
[[intel::private_copies(2)]] // NO-FORMAT: Attribute
TT l_result[kLMatrixSize];
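For illustration, assume TT = float, pipe_size = 4 and rows = 8 (values chosen here, not fixed by the code):

kBankwidth        = 4 * sizeof(float)  = 16 bytes
kNumBanks         = 8 / 4              = 2
kNumBanksNextPow2 = Pow2(CeilLog2(2))  = 2

so each bank holds exactly one pipe read's worth of data, and one column's worth of matrix data (a row of a_load) is spread across the two banks.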
Next, pipe data is copied into the local array. Intel's unrolled-loop template helper is used twice: once over the number of reads per column (kLoopIterPerColumn) and once over the elements per read (pipe_size). The data lands in on-chip memory, which speeds up memory access in the later compute stage.
FPGA registers (fpga_reg) are inserted into the unrolled copy path to avoid large fanout: a large fanout forces the FPGA to lay down longer physical routing, hurting efficiency and fMAX. Explicitly inserting registers keeps the distribution tree shallow and raises the achievable frequency, at the cost of some latency. See the fpga_reg tutorial on reducing fanout.
// Copy a matrix from the pipe to a local memory
// Number of pipe reads of pipe_size required to read a full column
constexpr int kExtraIteration = ((rows % pipe_size) != 0) ? 1 : 0;
constexpr int kLoopIterPerColumn = (rows / pipe_size) + kExtraIteration;
// Number of pipe reads of pipe_size to read all the matrices
constexpr int kLoopIter = kLoopIterPerColumn * kColumns;
// Size in bits of the loop iterator over kLoopIter iterations
constexpr int kLoopIterBitSize =
fpga_tools::BitsForMaxValue<kLoopIter + 1>();
[[intel::initiation_interval(1)]] // NO-FORMAT: Attribute
for (ac_int<kLoopIterBitSize, false> li = 0; li < kLoopIter; li++) {
fpga_tools::NTuple<TT, pipe_size> pipe_read = AIn::read();
int write_idx = li % kLoopIterPerColumn;
fpga_tools::UnrolledLoop<kLoopIterPerColumn>([&](auto k) {
fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
if constexpr (k * pipe_size + t < kColumns) {
if (write_idx == k) {
a_load[li / kLoopIterPerColumn][k * pipe_size + t] =
pipe_read.template get<t>();
}
}
// Delay data signals to create a vine-based data distribution
// to lower signal fanout.
pipe_read.template get<t>() =
sycl::ext::intel::fpga_reg(pipe_read.template get<t>());
});
write_idx = sycl::ext::intel::fpga_reg(write_idx);
});
}
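To make the indexing concrete, assume rows = 8 and pipe_size = 4 (illustrative values): then kExtraIteration = 0, kLoopIterPerColumn = 2 and kLoopIter = 16. On iteration li = 5, write_idx = 5 % 2 = 1 and li / kLoopIterPerColumn = 2, so the pipe's four elements are written to a_load[2][4..7]; the k = 0 instance of the unrolled loop skips its store because write_idx != 0.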
Next comes the main body of the computation, i.e. the concrete implementation of the pseudocode above.
In the pseudocode, column is the outer loop and row the inner loop. Following the Triangular Loop optimization strategy, the two loops are merged into one, and the total trip count (kTotalIterations) is the original trip count plus the number of dummy iterations. The dummy-iteration count is derived from raw_latency, a value found experimentally (tuned so that the compiler just achieves a loop initiation interval II of 1).
// Computation of the number of iterations required for the triangular
// loop. Refer to the triangular_loop tutorial for details on how
// to compute these.
constexpr int kRegularIterations = kColumns * (kColumns + 1) / 2;
constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
constexpr int kExtraIterationsToRemove =
kColumns >= raw_latency
? 0
: (raw_latency - kColumns) * (raw_latency - kColumns + 1) / 2;
constexpr int kTotalIterations =
kRegularIterations + kExtraIterations - kExtraIterationsToRemove;
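A quick check with assumed values rows = 8 and raw_latency = 5:

kRegularIterations       = 8 * 9 / 2 = 36
kExtraIterations         = 4 * 5 / 2 = 10
kExtraIterationsToRemove = 0 (since kColumns >= raw_latency)
kTotalIterations         = 36 + 10 - 0 = 46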
The compute stage mirrors the pseudocode. First sum is computed: the k loop is fully unrolled over kColumns lanes, but the number of useful products actually depends on the current column, so a bool in each unrolled lane decides whether that lane contributes.
When computing the result at position (row, column), only values at indices no greater than row and column are involved.
Each lane's product is stored in to_add and accumulated into sum.
int column = 0;
int row = 0;
TT div_term{0};
[[intel::ivdep(raw_latency)]] // NO-FORMAT: Attribute
for (int iteration = 0; iteration < kTotalIterations; iteration++) {
// Perform the dot product of the elements of the two rows indexed by
// row and column from element 0 to column
TT sum = 0;
fpga_tools::UnrolledLoop<kColumns>([&](auto k) {
TT to_add;
bool should_compute = k < column;
TT mul_lhs = should_compute ? l_result_compute[row][k] : T{0};
TT mul_rhs = should_compute ? l_result_compute_copy[column][k] : T{0};
if constexpr (is_complex) {
to_add = mul_lhs * mul_rhs.conj();
} else {
to_add = mul_lhs * mul_rhs;
}
sum += to_add;
});
Then A[row][row] (on the diagonal) or A[row][column] is loaded, and diff, the value left after subtracting sum, is precomputed.
When row and column are equal, a square root would be enough, but for throughput and a shorter compute latency the SYCL reciprocal square root function rsqrt is used instead, since the downstream multiplications are cheaper than divisions; the diagonal result therefore still needs a final reciprocal, applied later.
When row and column differ, the previously computed reciprocal square root is reused: simply multiply it by diff.
TT a_loaded = (row < rows) ? a_load[row][column] : TT{0};
TT diff = a_loaded - sum;
// Only do useful work for meaningful iterations
TT to_store;
if (row == column) {
// Perform the reciprocal sqrt rather than the sqrt because:
// - it has a shorter latency and will reduce the RAW latency
// of the loop
// - the result of the sqrt is used as a divisor which is also
// a long operation, so replacing x/sqrt by x*rsqrt will save
// latency
// - the diagonal elements will need to be inverted later, but we
// can do that while outside this loop when we transfer the L
// matrix to the pipe
if constexpr (is_complex) {
div_term = {sycl::rsqrt(diff.r()), 0};
} else {
div_term = sycl::rsqrt(diff);
}
to_store = div_term;
} else {
to_store = diff * div_term;
}
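Writing out what this branch actually stores (s denotes the dot-product sum; the notation is mine):

diagonal (row == column):    div_term = rsqrt(A[j][j] - s) = 1 / L[j][j]   (the reciprocal is stored)
off-diagonal (row > column): diff * div_term = (A[i][j] - s) / L[j][j]     (already the correct L[i][j])

So off-diagonal entries come out correct immediately, and only the diagonal entries, which hold 1 / L[j][j], must be inverted when L is written to the pipe.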
The result is then stored into the two working copies and into the final output array.
if (column <= row) {
// Store the results to two working copies of L to be able to read
// two complete rows at each iteration
l_result_compute[row][column] = to_store;
l_result_compute_copy[row][column] = to_store;
// Store the result to the output matrix
if constexpr (is_complex) {
l_result[row * (row + 1) / 2 + column] = to_store.conj();
} else {
l_result[row * (row + 1) / 2 + column] = to_store;
}
}
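The packed index row * (row + 1) / 2 + column lays the lower triangle out row by row: (0,0), (1,0), (1,1), (2,0), ... For example, element (row = 3, column = 1) lands at l_result[3 * 4 / 2 + 1] = l_result[7].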
Finally, the row and column indices are updated. The min is what schedules the dummy computations: when a new column starts, row restarts at min(column, rows - raw_latency), so short columns are padded with dummy iterations (row < column) that keep writes and dependent reads at least raw_latency iterations apart.
// Update loop indexes
if (row == (rows - 1)) {
column = column + 1;
row = sycl::min(column, rows - raw_latency);
} else {
row = row + 1;
}
} // end of iteration
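A minimal host-side sketch (my own checking code, not part of the sample) that replays this index update and confirms the trip count matches the kTotalIterations formula, using the same assumed values rows = 8 and raw_latency = 5 as above:

#include <algorithm>
#include <cassert>

int main() {
  constexpr int rows = 8, raw_latency = 5;
  // Same formulas as in the kernel above
  constexpr int kRegularIterations = rows * (rows + 1) / 2;
  constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
  constexpr int kExtraIterationsToRemove =
      rows >= raw_latency
          ? 0
          : (raw_latency - rows) * (raw_latency - rows + 1) / 2;
  constexpr int kTotalIterations =
      kRegularIterations + kExtraIterations - kExtraIterationsToRemove;

  int column = 0, row = 0, iterations = 0;
  while (column < rows) {
    iterations++;  // one pipelined loop iteration (real or dummy)
    if (row == rows - 1) {
      column++;
      row = std::min(column, rows - raw_latency);  // row < column => dummy
    } else {
      row++;
    }
  }
  assert(iterations == kTotalIterations);  // 46 == 46 for these values
  return 0;
}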
Finally the results are post-processed and the correct values are written to the output pipe, which concludes the streaming kernel's work for this matrix.
// Go over the L matrix and write each element to the pipe
int l_idx = 0;
[[intel::loop_coalesce(2)]] // NO-FORMAT: Attribute
for (int row = 0; row < rows; row++) {
for (int column = 0; column <= row; column++) {
TT to_write;
TT current_l_value = l_result[l_idx];
// The diagonal elements need to be inverted as the
// inversion was removed from the above compute loop
// to reduce the RAW latency
if (row == column) {
if constexpr (is_complex) {
to_write = {1 / current_l_value.r(), 0};
}
else{
to_write = 1 / current_l_value;
}
} else {
to_write = current_l_value;
}
LOut::write(to_write);
l_idx++;
}
}
The complete FPGA-optimized streaming Cholesky code:
#ifndef __STREAMING_CHOLESKY_HPP__
#define __STREAMING_CHOLESKY_HPP__
#include "constexpr_math.hpp"
#include "tuple.hpp"
#include "unrolled_loop.hpp"
namespace fpga_linalg {
/*
Cholesky decomposition - Computes L such that A=LL* where:
- A is the input matrix (hermitian, positive definite)
- L is a lower triangular matrix
- L* is the conjugate transpose of L
The input and output matrices are consumed/produced from/to pipes.
*/
template <typename T, // The datatype for the computation
bool is_complex, // True if T is ac_complex<X>
int rows, // Number of rows==columns in the A matrices
int raw_latency, // Read after write (RAW) latency (in iterations) of
// the triangular loop of this function.
// This value depends on the FPGA target, the
// datatype, the target frequency, etc.
// This value will have to be tuned for optimal
// performance. Refer to the Triangular Loop
// design pattern tutorial.
// In general, find a high value for which the
// compiler is able to achieve an II of 1 and
// go down from there.
int pipe_size, // Number of elements read/write per pipe operation
// to read the input matrix
typename AIn, // A matrix input pipe, receive pipe_size
// elements from the pipe with each read
typename LOut // L matrix output pipe, send one element to the
// pipe with each write.
// Only lower-left elements of L are
// sent in row order, starting with row 0.
>
struct StreamingCholesky {
void operator()() const {
// Functional assertions
static_assert(rows >= 4,
"Only matrices of size 4x4 and over are supported");
static_assert(pipe_size >= 1,
"The pipe must be able to contain at least one element");
// Set the computation type to T or ac_complex<T> depending on the value
// of is_complex
using TT = std::conditional_t<is_complex, ac_complex<T>, T>;
constexpr int kColumns = rows;
// Number of lower-left elements in the L output matrix
constexpr int kLMatrixSize = kColumns * (kColumns + 1) / 2;
// Compute Cholesky decompositions as long as matrices are given as inputs
while (1) {
// Break memories up to store pipe_size elements per bank
constexpr short kBankwidth = pipe_size * sizeof(TT);
constexpr unsigned short kNumBanks = rows / pipe_size;
// When specifying numbanks for a memory, it must be a power of 2.
// Unused banks will be automatically optimized away.
constexpr short kNumBanksNextPow2 =
fpga_tools::Pow2(fpga_tools::CeilLog2(kNumBanks));
[[intel::numbanks(kNumBanksNextPow2)]] // NO-FORMAT: Attribute
[[intel::bankwidth(kBankwidth)]] // NO-FORMAT: Attribute
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
[[intel::max_replicates(1)]] // NO-FORMAT: Attribute
TT a_load[rows][kColumns];
// Two copies of L to be able to load two complete rows per iteration
// Multiple private copies to be able to overlap multiple loop
// iterations
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute[rows][kColumns];
[[intel::private_copies(4)]] // NO-FORMAT: Attribute
TT l_result_compute_copy[rows][kColumns];
[[intel::private_copies(2)]] // NO-FORMAT: Attribute
TT l_result[kLMatrixSize];
// Copy a matrix from the pipe to a local memory
// Number of pipe reads of pipe_size required to read a full column
constexpr int kExtraIteration = ((rows % pipe_size) != 0) ? 1 : 0;
constexpr int kLoopIterPerColumn = (rows / pipe_size) + kExtraIteration;
// Number of pipe reads of pipe_size to read all the matrices
constexpr int kLoopIter = kLoopIterPerColumn * kColumns;
// Size in bits of the loop iterator over kLoopIter iterations
constexpr int kLoopIterBitSize =
fpga_tools::BitsForMaxValue<kLoopIter + 1>();
[[intel::initiation_interval(1)]] // NO-FORMAT: Attribute
for (ac_int<kLoopIterBitSize, false> li = 0; li < kLoopIter; li++) {
fpga_tools::NTuple<TT, pipe_size> pipe_read = AIn::read();
int write_idx = li % kLoopIterPerColumn;
fpga_tools::UnrolledLoop<kLoopIterPerColumn>([&](auto k) {
fpga_tools::UnrolledLoop<pipe_size>([&](auto t) {
if constexpr (k * pipe_size + t < kColumns) {
if (write_idx == k) {
a_load[li / kLoopIterPerColumn][k * pipe_size + t] =
pipe_read.template get<t>();
}
}
// Delay data signals to create a vine-based data distribution
// to lower signal fanout.
pipe_read.template get<t>() =
sycl::ext::intel::fpga_reg(pipe_read.template get<t>());
});
write_idx = sycl::ext::intel::fpga_reg(write_idx);
});
}
// Computation of the number of iterations required for the triangular
// loop. Refer to the triangular_loop tutorial for details on how
// to compute these.
constexpr int kRegularIterations = kColumns * (kColumns + 1) / 2;
constexpr int kExtraIterations = (raw_latency - 1) * raw_latency / 2;
constexpr int kExtraIterationsToRemove =
kColumns >= raw_latency
? 0
: (raw_latency - kColumns) * (raw_latency - kColumns + 1) / 2;
constexpr int kTotalIterations =
kRegularIterations + kExtraIterations - kExtraIterationsToRemove;
// Compute the L matrix
int column = 0;
int row = 0;
TT div_term{0};
[[intel::ivdep(raw_latency)]] // NO-FORMAT: Attribute
for (int iteration = 0; iteration < kTotalIterations; iteration++) {
// Perform the dot product of the elements of the two rows indexed by
// row and column from element 0 to column
TT sum = 0;
fpga_tools::UnrolledLoop<kColumns>([&](auto k) {
TT to_add;
bool should_compute = k < column;
TT mul_lhs = should_compute ? l_result_compute[row][k] : T{0};
TT mul_rhs = should_compute ? l_result_compute_copy[column][k] : T{0};
if constexpr (is_complex) {
to_add = mul_lhs * mul_rhs.conj();
} else {
to_add = mul_lhs * mul_rhs;
}
sum += to_add;
});
TT a_loaded = (row < rows) ? a_load[row][column] : TT{0};
TT diff = a_loaded - sum;
// Only do useful work for meaningful iterations
TT to_store;
if (row == column) {
// Perform the reciprocal sqrt rather than the sqrt because:
// - it has a shorter latency and will reduce the RAW latency
// of the loop
// - the result of the sqrt is used as a divisor which is also
// a long operation, so replacing x/sqrt by x*rsqrt will save
// latency
// - the diagonal elements will need to be inverted later, but we
// can do that while outside this loop when we transfer the L
// matrix to the pipe
if constexpr (is_complex) {
div_term = {sycl::rsqrt(diff.r()), 0};
} else {
div_term = sycl::rsqrt(diff);
}
to_store = div_term;
} else {
to_store = diff * div_term;
}
if (column <= row) {
// Store the results to two working copies of L to be able to read
// two complete rows at each iteration
l_result_compute[row][column] = to_store;
l_result_compute_copy[row][column] = to_store;
// Store the result to the output matrix
if constexpr (is_complex) {
l_result[row * (row + 1) / 2 + column] = to_store.conj();
} else {
l_result[row * (row + 1) / 2 + column] = to_store;
}
}
// Update loop indexes
if (row == (rows - 1)) {
column = column + 1;
row = sycl::min(column, rows - raw_latency);
} else {
row = row + 1;
}
} // end of iteration
// Go over the L matrix and write each element to the pipe
int l_idx = 0;
[[intel::loop_coalesce(2)]] // NO-FORMAT: Attribute
for (int row = 0; row < rows; row++) {
for (int column = 0; column <= row; column++) {
TT to_write;
TT current_l_value = l_result[l_idx];
// The diagonal elements need to be inverted as the
// inversion was removed from the above compute loop
// to reduce the RAW latency
if (row == column) {
if constexpr (is_complex) {
to_write = {1 / current_l_value.r(), 0};
}
else{
to_write = 1 / current_l_value;
}
} else {
to_write = current_l_value;
}
LOut::write(to_write);
l_idx++;
}
}
} // end of while(1)
} // end of operator
}; // end of struct
} // namespace fpga_linalg
#endif /* __STREAMING_CHOLESKY_HPP__ */
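For context, a hypothetical host-side usage sketch (the pipe names, kernel name and parameter values below are illustrative assumptions, not part of this header). Since the functor runs a while(1) loop, the kernel is submitted once and then fed matrices through the pipes:

#include <sycl/sycl.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>
#include "streaming_cholesky.hpp"

constexpr int kRows = 8;         // matrix dimension
constexpr int kRawLatency = 110; // tune per target, see the comments above
constexpr int kPipeSize = 4;     // elements per pipe transaction

using PipeType = fpga_tools::NTuple<float, kPipeSize>;
using APipe = sycl::ext::intel::pipe<class APipeId, PipeType, 3>;
using LPipe = sycl::ext::intel::pipe<class LPipeId, float, 3>;

void LaunchCholesky(sycl::queue &q) {
  // Submit once; the kernel keeps consuming A matrices from APipe and
  // producing L factors on LPipe until the program terminates.
  q.single_task<class CholeskyKernel>(
      fpga_linalg::StreamingCholesky<float, /*is_complex=*/false, kRows,
                                     kRawLatency, kPipeSize, APipe, LPipe>{});
}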