A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications

Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi, IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016. [ PDF ]

This paper supersedes our conference paper.

Abstract: The wide adoption of graphics processing units (GPUs) as accelerators for general-purpose applications makes the end-to-end reliability implications of their use increasingly significant. Fault injection is a widely adopted method to evaluate the resilience of applications. However, building a fault injector for general-purpose GPU applications is challenging due to their massive parallelism, which makes it difficult to achieve representativeness while being time-efficient.

This paper makes four key contributions. First, it presents a fault-injection methodology to evaluate the end-to-end reliability properties of application kernels running on GPUs. Second, it introduces GPU-Qin, a fault-injection tool that uses real GPU hardware and offers a tunable and efficient balance between the representativeness and the cost of a fault-injection campaign. Third, it characterizes the error resilience characteristics of seventeen application kernels. Finally, it provides preliminary insights on correlations between the algorithmic properties of applications and their error resilience.
