GPU Core-Assisted Acceleration via Flexible Data Compression

This report introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which aims to exploit underutilized GPU resources to address the memory bandwidth bottleneck. CABA launches helper threads (assist warps) that perform hardware-assisted data compression to accelerate application execution, putting to use resources that would otherwise sit idle during compute, memory, and data-dependence stalls. Using Base-Delta-Immediate (BDI) compression, CABA improves system performance by 41.7% on average on a set of bandwidth-sensitive GPU applications.
A report on the paper "A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps," which presents a comprehensive design and evaluation.
Off-chip memory bandwidth is one of the main bottlenecks of GPU execution: while warps wait on memory, the GPU's computational resources sit idle. The paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which puts unutilized on-chip resources to work on alleviating this idleness.


CABA employs on-chip hardware that is available but underutilized, and offers flexibility in the choice of (hardware-based) compression algorithm for different applications; if an application cannot benefit from compression, CABA can simply disable it. Unutilized compute resources arise from compute, memory, and data-dependence stalls. Unutilized on-chip memory is bounded by the available registers and shared memory, the hard limits on the number of threads and thread blocks per core, and the number of thread blocks allowed by occupancy. CABA's helper threads are low overhead and must be treated differently from regular threads. To keep overhead low, helper threads must be easy to manage (to enable, trigger, and kill) and flexible enough to adapt to the runtime behavior of the regular program while communicating with the original threads.


Assist warps execute code that compresses data to speed up application execution, and share the same context as the regular warps to simplify scheduling and data communication. An assist warp compresses a cache block before it is written to memory, and decompresses it before the block is placed into the cache.
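The two hook points above can be sketched as follows. This is a toy software model, not the paper's hardware design: `compress` and `decompress` stand in for whichever CABA compression routine is active, and `dram` is a plain dictionary playing the role of off-chip memory.

```python
# Toy model of where assist warps act on the memory path:
# compress on write-back to DRAM, decompress on cache fill.

def compress(line):
    # placeholder for a real algorithm such as BDI
    return ("compressed", list(line))

def decompress(blob):
    tag, line = blob
    assert tag == "compressed"
    return line

dram = {}  # stands in for off-chip memory

def write_back(addr, line):
    # the assist warp runs here, before the block leaves the core
    dram[addr] = compress(line)

def cache_fill(addr):
    # the assist warp runs here, before the block enters the cache
    return decompress(dram[addr])

write_back(0x40, [1, 2, 3, 4])
assert cache_fill(0x40) == [1, 2, 3, 4]
```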


The CABA framework is a hardware/software co-design: a pure-software approach would have high overhead, while a pure-hardware one would make register allocation and data communication more difficult. At the hardware level, sequences of instructions are dynamically inserted into the execution stream. The authors track and manage these instructions at the granularity of a warp, which they call an assist warp. An assist warp does not own a separate context; it shares both a context and a warp ID with a regular warp. Different actions require different numbers of registers in the helper thread, and these registers have short lifetimes. The subroutine of an assist warp can be written either with CUDA extensions using PTX instructions or by the microarchitecture in the internal GPU instruction format. There are three main hardware additions: the Assist Warp Store, the Assist Warp Controller, and the Assist Warp Buffer.
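To make the roles of the three structures concrete, here is a minimal software model: the Assist Warp Store holds assist-warp routines, the Assist Warp Controller maps trigger events to routines, and a list stands in for the Assist Warp Buffer that stages their instructions for issue. The event and instruction names are illustrative, not the paper's encodings.

```python
# Toy model of the three CABA hardware additions.

class AssistWarpStore:
    """Holds assist-warp routines, keyed by trigger event."""
    def __init__(self):
        self.routines = {}

    def register(self, event, instructions):
        self.routines[event] = instructions

class AssistWarpController:
    """On a trigger event, looks up the routine in the store and
    stages its instructions into the buffer for issue."""
    def __init__(self, store):
        self.store = store
        self.buffer = []  # stands in for the Assist Warp Buffer

    def trigger(self, event):
        self.buffer.extend(self.store.routines.get(event, []))

aws = AssistWarpStore()
aws.register("writeback", ["load_line", "bdi_compress", "store_line"])
awc = AssistWarpController(aws)
awc.trigger("writeback")
assert awc.buffer == ["load_line", "bdi_compress", "store_line"]
```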


To compress the data, Base-Delta-Immediate (BDI) compression is used. BDI represents a cache line with low dynamic range using a common base (or multiple bases) and an array of deltas. The authors view a cache line as a set of fixed-size values; decompression is then simply a masked vector addition of the deltas to the appropriate bases.
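A minimal sketch of the single-base case, assuming a cache line viewed as eight 4-byte values with 1-byte signed deltas; the exact value and delta sizes BDI supports are configurable in the real scheme, and this omits the immediate (second base) and the masking hardware.

```python
# Minimal single-base Base-Delta-Immediate (BDI) sketch.

def bdi_compress(values, delta_bytes=1):
    """Try to encode `values` as (base, deltas); return None if any
    delta does not fit in `delta_bytes` signed bytes."""
    base = values[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [v - base for v in values]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None  # not compressible with this base/delta size

def bdi_decompress(base, deltas):
    """Decompression is a vector addition of deltas to the base."""
    return [base + d for d in deltas]

# A low-dynamic-range line compresses; its deltas all fit in one byte.
line = [0x1000, 0x1004, 0x1008, 0x1010, 0x1003, 0x1001, 0x100C, 0x1020]
compressed = bdi_compress(line)
assert compressed is not None
assert bdi_decompress(*compressed) == line
```

The space saving comes from storing one full-width base plus narrow deltas instead of eight full-width values; lines whose deltas overflow the chosen width are simply stored uncompressed.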


Using CABA for memory compression improves system performance by 41.7% on average on a set of bandwidth-sensitive GPU applications.