LLFI is a compiler-based fault injection framework that uses the LLVM compiler to instrument programs for fault injection. Nvidia’s CUDA compiler, NVCC, is also based on LLVM and compiles LLVM IR to the PTX representation, which is then compiled to SASS machine code by Nvidia’s backend compiler. At first glance, integrating LLFI and NVCC to build a GPU-based fault injector seems straightforward. However, two practical challenges arise:
- Challenge One: Access to LLVM IR
NVCC does not expose the LLVM IR and transforms it directly to PTX code. LLFI relies on the IR to perform instrumentation for fault injection, and thus cannot inject faults into the IR used by NVCC.
- Challenge Two: Fault Injection in Multi-threaded Programs
GPU programs are multi-threaded, often consisting of hundreds of threads, so faults need to be injected into a random thread at runtime. However, LLFI does not support injecting faults into multi-threaded programs.
To address these challenges, the following solutions were implemented:
- Solution One: Intercepting LLVM IR
By attaching a dynamic library to NVCC that intercepts its call to the LLVM compilation module, we can invoke LLFI’s instrumentation passes to instrument the program. We then return the instrumented LLVM IR to NVCC, which proceeds with the rest of the compilation process to transform it to PTX code.
- Solution Two: Fault Injection into Random Threads
A threadID field is added to the profiling data collected by LLFI to uniquely identify each thread. A thread is then randomly chosen from all threads in the program to inject faults into at runtime. Additionally, information on the kernel call executed and the total number of kernel calls is added to the profiling data to help choose which kernel calls to inject faults into.
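The thread-selection logic can be sketched in host-side C++ as follows. This is a minimal illustration of choosing a (kernel call, threadID) target uniformly over all executed threads; the KernelProfile layout and function names are assumptions for exposition, not LLFI-GPU’s actual data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical profiling record: one entry per dynamic kernel call.
struct KernelProfile {
    uint64_t numThreads;  // threads launched by this kernel call
};

// Pick a (kernelCallIndex, threadID) pair uniformly at random from
// all threads executed across all kernel calls in the program.
std::pair<size_t, uint64_t>
pickInjectionTarget(const std::vector<KernelProfile>& profile,
                    std::mt19937_64& rng) {
    uint64_t total = 0;
    for (const auto& k : profile) total += k.numThreads;
    std::uniform_int_distribution<uint64_t> dist(0, total - 1);
    uint64_t pick = dist(rng);
    for (size_t i = 0; i < profile.size(); ++i) {
        if (pick < profile[i].numThreads)
            return {i, pick};  // kernel call i, thread 'pick' within it
        pick -= profile[i].numThreads;
    }
    return {0, 0};  // unreachable when total > 0
}
```

Sampling over the flattened thread space (rather than picking a kernel call first) keeps the selection uniform even when kernel calls launch very different numbers of threads.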
LLFI-GPU operates as follows:
- Profiling: LLFI-GPU first profiles the program to gather data on the total number of kernel calls, the number of threads per kernel call, and the total number of instructions executed by each kernel thread.
- Instrumentation: It then creates an instrumented version of the program by inserting fault injection functions into the CUDA portion of the program’s code. This is similar to LLFI but restricted to the CUDA portion.
- Fault Injection: Based on the profiling data, LLFI-GPU selects a random thread within a random kernel call and a random dynamic instruction executed by that thread. The chosen instruction is selected uniformly from all executed instructions. For the selected instruction, LLFI-GPU injects a fault by overwriting the result value with a faulty version, such as flipping a single bit, and then continues running the application.
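The single-bit-flip fault model in the last step can be sketched as follows. This is a simplified host-side illustration; in LLFI-GPU the equivalent operation is applied to the selected instruction’s result inside the instrumented kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Flip one randomly chosen bit of a 32-bit value, emulating a
// single-bit soft error in an instruction's result.
uint32_t flipRandomBit(uint32_t value, std::mt19937& rng) {
    std::uniform_int_distribution<int> dist(0, 31);  // bit position
    return value ^ (1u << dist(rng));
}
```

At runtime, the injected code checks whether the current thread, kernel call, and dynamic instruction count match the selected target before applying the flip, so exactly one fault is injected per run.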
Error Propagation Analysis (EPA)
After injecting a fault, LLFI-GPU tracks memory data at every kernel boundary for the analysis of error propagation.
The process is as follows:
- Memory Allocation and Deallocation: Memory is allocated on the device using cudaMalloc() before launching kernels and deallocated using cudaFree() at the end of the program.
- Data Saving: After each kernel invocation, LLFI-GPU saves all memory data allocated on the GPU to disk. This step corresponds to lines 6-13 in the figure.
- Error Comparison: Later, the saved data after each kernel call is compared with data from a golden run. Any differences are marked as a result of error propagation.
By performing this comparison at kernel boundaries, non-determinism introduced by thread interleaving within the GPU is avoided.
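The comparison step can be sketched as a byte-wise diff of the memory dumps taken at matching kernel boundaries. This is a minimal illustration; the in-memory representation of the dumps and the function name are assumptions for exposition.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compare a faulty run's memory dump against the golden run's dump
// taken at the same kernel boundary; return the byte offsets that
// differ. A non-empty result marks error propagation into memory.
std::vector<size_t> diffDumps(const std::vector<uint8_t>& golden,
                              const std::vector<uint8_t>& faulty) {
    std::vector<size_t> corrupted;
    size_t n = std::min(golden.size(), faulty.size());
    for (size_t i = 0; i < n; ++i)
        if (golden[i] != faulty[i]) corrupted.push_back(i);
    return corrupted;
}
```

Because dumps are captured only after a kernel completes, the contents of a deterministic kernel’s output do not depend on how its threads interleaved, which is what makes this comparison stable.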
Limitations
Our fault injections are performed at the LLVM IR level rather than at the SASS or PTX code levels. One potential drawback of this approach is that downstream compiler optimizations may change both the number and order of instructions, or even remove the fault injection code we inserted. To mitigate this effect, we ensured that our fault injection pass is applied after various optimization passes in the LLVM IR code.
Additionally, during the profiling phase, LLFI-GPU records all executed instructions. We ensure that every target instruction for fault injection appears in this set, and inject faults only into these instructions.
Despite these precautions, backend optimizations after the IR is generated may change the mapping of instructions at the machine assembly levels (e.g., SASS level). This could result in different absolute values of the Silent Data Corruption (SDC) rate for fault injections performed at different levels. However, our focus is on obtaining insights into error propagation intrinsic to applications rather than deriving precise SDC rates.
Metrics
The use of kernel invocations as a metric ensures that the study’s findings remain robust and applicable across different platforms, which is crucial for developing effective application-level error-resilience mechanisms.