Abdul Rehman Anwer, Guanpeng Li, Karthik Pattabiraman, Michael Sullivan, Timothy Tsai and Siva Hari, ACM International Conference on High-Performance Computing, Networking, Storage, and Analysis (SC), 2020 (Acceptance Rate: 25.1%) [PDF | Talk] (Code)
Abstract: Graphics Processing Units (GPUs) are popular for reliability-conscious uses in high-performance computing (HPC). Fault injection (FI) techniques are generally used to determine the reliability profiles of programs in the presence of soft errors, but these techniques are highly resource- and time-intensive. Prior research developed a model called TRIDENT to analytically predict Silent Data Corruption (SDC) probabilities of single-threaded CPU applications without requiring any FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures compared to CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. Further, there are GPU-specific behaviors that must be modeled. In this paper, we propose GPU-TRIDENT, a novel technique that is both accurate and scalable for modeling error propagation in GPU programs.