Evaluating the Error Resilience of Parallel Programs

Bo Fang, Karthik Pattabiraman, Matei Ripeanu and Sudhanva Gurumurthi, To appear in the Workshop on Fault Tolerance for HPC at Extreme Scale (FTXS), 2014. Co-located with the DSN 2014 conference.

Abstract: As a consequence of increasing hardware fault rates, HPC systems face significant challenges in terms of reliability. Evaluating the error resilience of HPC applications is an essential step for building efficient fault-tolerant mechanisms for these applications. In this paper, we propose a methodology to characterize the resilience of OpenMP programs using fault-injection experiments. We find that the error resilience of OpenMP applications depends on the program structure and thread model; hence, these need to be taken into account while characterizing error resilience. We also report preliminary results about the correlation between the application’s error resilience and the algorithm(s) used in the application.
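
To make the idea of a fault-injection experiment concrete, the following is a minimal sketch and not the tool or methodology used in the paper. It assumes a single-bit-flip fault model applied to one input of a toy sequential kernel, and it classifies each run against a fault-free "golden" result. All names (`flip_bit`, `kernel`, `classify`, `campaign`) and the three outcome labels are hypothetical choices made for illustration; a real study of OpenMP programs would inject faults into running threads and would also have to account for crashes and hangs.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0-63) in the IEEE-754 double representation of value."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def kernel(data, fault=None):
    """Toy workload (assumption): sum of squares.

    fault is an optional (index, bit) pair; the element at that index has
    one bit flipped before it is used.
    """
    total = 0.0
    for i, x in enumerate(data):
        if fault is not None and fault[0] == i:
            x = flip_bit(x, fault[1])
        total += x * x
    return total


def classify(golden: float, observed: float, tol: float = 1e-9) -> str:
    """Compare a faulty run against the fault-free (golden) result."""
    if not math.isfinite(observed):
        return "detected"  # stand-in for an outcome a simple check would catch
    if abs(observed - golden) <= tol * max(1.0, abs(golden)):
        return "benign"    # the fault was masked
    return "sdc"           # silent data corruption: wrong output, no symptom


def campaign(data, trials: int, seed: int = 0) -> dict:
    """Inject one random bit flip per trial and count the outcomes."""
    rng = random.Random(seed)
    golden = kernel(data)
    counts = {"benign": 0, "sdc": 0, "detected": 0}
    for _ in range(trials):
        fault = (rng.randrange(len(data)), rng.randrange(64))
        counts[classify(golden, kernel(data, fault))] += 1
    return counts


if __name__ == "__main__":
    print(campaign([1.0, 2.0, 3.0, 4.0], trials=1000))
```

In this sketch the fraction of "benign" runs serves as a rough resilience estimate for the toy kernel. Even here the outcome depends on the computation itself: a flipped sign bit is always masked because each input is squared, which hints at why resilience may correlate with the algorithm used.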