GPU-Qin: A Methodology for Evaluating the Error Resilience of GPGPU Applications

Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi. To appear in the Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014, Monterey, CA. (Acceptance rate: 30%) [PDF | Talk]

Abstract: While GPUs (Graphics Processing Units) have gained wide adoption as accelerators for general-purpose applications (GPGPU), the end-to-end reliability implications of their use have not been quantified. Fault injection is a widely used method for evaluating the reliability of applications. However, building a fault injector for GPGPU applications is challenging because their massive parallelism makes it difficult to obtain representative results while remaining time-efficient.

This paper makes three key contributions. First, it presents the design of a fault injection methodology to evaluate end-to-end reliability properties of application kernels running on GPUs. Second, it introduces a fault injection tool that uses real GPU hardware and offers a good balance between the representativeness and the efficiency of the fault injection experiments. Finally, it characterizes the error resilience of twelve GPGPU applications.
