Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi, To appear in the proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014, Monterey, CA. (Acceptance rate: 30%) [PDF | Talk]
Abstract: While GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging because their massive parallelism makes it difficult to obtain representative results in a time-efficient way.
This paper makes three key contributions. First, it presents the design of a fault injection methodology to evaluate end-to-end reliability properties of application kernels running on GPUs. Second, it introduces a fault injection tool that uses real GPU hardware and offers a good balance between the representativeness and the efficiency of the fault injection experiments. Finally, it characterizes the error resilience of twelve GPGPU applications.