A Practical Implementation of Redundant Multithreading on GPUs

A brief report on the paper "Real-World Design and Evaluation of Compiler-Managed GPU Redundant Multithreading".


This paper uses Redundant Multithreading (RMT) to provide an efficient software-only reliability solution for GPUs. The modifications are made at the kernel level and come in two strategies: Intra-Group RMT and Inter-Group RMT. The paper also shows that GPU RMT performance depends on the particular behavior of each kernel and on the required sphere of replication (SoR).


The authors' approach differs from other work in three ways. First, it assumes that storage and transfers to off-chip resources are already protected, and targets an on-chip protection domain. Second, it focuses on fault detection on the GPU rather than on the CPU (RMT was originally used on CPUs). Third, it is a software-only GPU reliability solution rather than a hardware one, since hardware is expensive to implement and inflexible, and GPU simulators do not reflect real hardware well.


So what is RMT? Redundant Multithreading is a little like RAID 1, except that threads are duplicated rather than data. GPUs are subject to two fault modes: permanent (hard) faults and transient (soft) faults. Faults originate at the physical level and are hard to avoid. With RMT, however, the probability of two faults producing identical errors at the same time is small enough to be ignored. RMT replicates every value that enters the sphere of replication (SoR) and compares the outputs before a single correct copy leaves the SoR, just like RAID 1. In this paper RMT is implemented in GPU kernels written in OpenCL, so there is a transformation from ordinary OpenCL kernels into RMT programs for error detection.
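
The replicate-then-compare idea can be illustrated with a small host-side sketch. This is not the paper's OpenCL transformation; it is a plain Python model under stated assumptions, and the names `run_with_rmt`, `square_all`, and `make_faulty` are hypothetical helpers invented for illustration. A "kernel" here is just a function over a list, and a fault is modeled as a single flipped bit in one of the two executions.

```python
def run_with_rmt(kernel, inputs):
    """Run `kernel` twice on replicated inputs; compare before output leaves the SoR."""
    # Replicate the values entering the sphere of replication (SoR).
    primary_in = list(inputs)
    redundant_in = list(inputs)
    primary_out = kernel(primary_in)
    redundant_out = kernel(redundant_in)
    # Compare the two results before a single copy is allowed to leave the SoR.
    if primary_out != redundant_out:
        raise RuntimeError("RMT mismatch: possible fault detected")
    return primary_out


def square_all(values):
    """Stand-in kernel body (hypothetical): square every element."""
    return [v * v for v in values]


def make_faulty(kernel, flip_on_call=1):
    """Wrap `kernel` so that one chosen call flips a bit in its first output."""
    calls = {"n": 0}

    def wrapped(values):
        calls["n"] += 1
        out = kernel(values)
        if calls["n"] == flip_on_call:
            out[0] ^= 1  # model a transient fault as a single bit flip
        return out

    return wrapped
```

Calling `run_with_rmt(square_all, [1, 2, 3])` returns the result normally, while `run_with_rmt(make_faulty(square_all), [1, 2, 3])` raises, because only one of the two executions is corrupted. Note that the sketch detects the error but does not correct it, which matches the detection-only goal described above.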


The first strategy is Intra-Group RMT, which comes in two flavors: Intra-Group+LDS and Intra-Group-LDS. +LDS means the LDS is inside the SoR, so it is duplicated and protected; -LDS means the LDS is outside the SoR and is not. Other structures, such as the scalar register file (SRF), the scalar unit (SU), and the instruction fetch, decode, and scheduling logic, are all outside the SoR and are not protected by Intra-Group RMT. Intra-Group RMT makes three kernel modifications: the work-item ID is modified to create a pair of identical, redundant work-items; when the LDS is included in the SoR, its allocation is doubled and redundant loads and stores are mapped to their own copy; and communication and output comparison are added.


The Intra-Group flavors do not perform very well, because memory operations in kernels such as DCT and MM take a lot of time. For some applications the inter-work-item communication is also costly, and doubling the size of the work-groups takes time too. Although RMT executes twice as many work-items, the increase in power consumption is small, less than 2%. The cost of redundant computation can be hidden behind Intra-Group RMT latency, while the instruction fetch, scheduling, and decode logic of each CU can be considered inside the SoR.
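
The three Intra-Group modifications can be modeled in a few lines. The sketch below is a sequential Python simulation, not the paper's generated code. It assumes that adjacent physical work-items in a doubled work-group form a redundant pair and that the redundant copy's LDS region sits right after the original one; both choices, and the names `remap_id`, `lds_address`, and `run_intra_group_plus_lds`, are assumptions made for illustration.

```python
def remap_id(physical_id):
    """Map a physical work-item ID in the doubled group to (logical ID, is_redundant)."""
    return physical_id // 2, physical_id % 2 == 1


def lds_address(logical_addr, is_redundant, lds_size):
    """+LDS flavor: redundant accesses go to the second half of a doubled allocation."""
    return logical_addr + (lds_size if is_redundant else 0)


def run_intra_group_plus_lds(data):
    """Simulate one work-group of len(data) logical work-items under Intra-Group+LDS."""
    n = len(data)
    lds = [0] * (2 * n)   # LDS allocation doubled because the LDS is inside the SoR
    comm = [None] * n     # hypothetical per-pair communication buffer
    output = [None] * n   # stand-in for global memory, outside the SoR

    # Phase 1: every physical work-item computes and stores into its own LDS copy.
    for pid in range(2 * n):
        lid, redundant = remap_id(pid)
        lds[lds_address(lid, redundant, n)] = data[lid] * 2  # stand-in kernel body

    # Phase 2: the redundant work-item of each pair publishes its value.
    for pid in range(2 * n):
        lid, redundant = remap_id(pid)
        if redundant:
            comm[lid] = lds[lds_address(lid, redundant, n)]

    # Phase 3: the primary work-item compares, then stores outside the SoR.
    for pid in range(2 * n):
        lid, redundant = remap_id(pid)
        if not redundant:
            value = lds[lds_address(lid, redundant, n)]
            if value != comm[lid]:
                raise RuntimeError(f"mismatch detected for work-item {lid}")
            output[lid] = value
    return output, lds
```

The phases stand in for the barriers a real kernel would need between the stores, the exchange, and the comparison. The sketch also makes the cost visible: twice the work-items, twice the LDS, and an extra communication step before every store that leaves the SoR.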


The second strategy is Inter-Group RMT. Its kernel modifications are: explicit synchronization is added to coordinate communication between work-items; the work-item ID is modified to avoid deadlock; and the communication buffers are placed in global memory, so communication between work-items is more expensive for Inter-Group RMT than for Intra-Group RMT. The poor performance of Inter-Group RMT is caused by using global memory for inter-work-item communication, whose cost is extremely high. CU under-utilization is also related to the number of work-groups launched.
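
A rough model of these Inter-Group modifications follows. It is a sequential Python simulation with no real concurrency, and it rests on an assumption the report does not spell out: that deadlock is avoided by ignoring the hardware work-group ID and instead taking a ticket from a global counter, so the producer of each redundant pair always starts before its consumer. The function name `run_inter_group` and the flag-based handshake are hypothetical.

```python
def run_inter_group(data, group_size, launch_order):
    """Simulate redundant work-group pairs that communicate through global memory.

    `launch_order` lists hardware work-group IDs in the order the scheduler
    happens to start them; it must contain twice the original group count.
    """
    num_groups = len(data) // group_size
    counter = 0                      # stand-in for a global atomic counter
    comm = [None] * len(data)        # communication buffer in global memory
    ready = [False] * num_groups     # per-pair "data is ready" flag in global memory
    output = [None] * len(data)

    for _hw_group in launch_order:   # the hardware ID is deliberately ignored
        ticket = counter             # models an atomic fetch-and-increment
        counter += 1
        logical_group, is_consumer = ticket // 2, ticket % 2 == 1
        base = logical_group * group_size
        values = [data[base + i] * 2 for i in range(group_size)]  # stand-in body

        if not is_consumer:
            # Producer: publish results through global memory, then raise the flag.
            comm[base:base + group_size] = values
            ready[logical_group] = True
        else:
            # Consumer: a real kernel would spin on the flag here. Ticket order
            # guarantees the producer has already started, so the wait can end.
            if not ready[logical_group]:
                raise RuntimeError("consumer started before producer: would deadlock")
            if values != comm[base:base + group_size]:
                raise RuntimeError(f"mismatch detected in work-group {logical_group}")
            output[base:base + group_size] = values
    return output
```

For example, `run_inter_group([1, 2, 3, 4], 2, [3, 0, 2, 1])` completes even though the hardware IDs arrive out of order. Every value crosses global memory once more than in the original kernel, which is consistent with the communication cost described above.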