BlockWatch: Leveraging Similarity in Parallel Programs for Error Detection

Jiesheng Wei and Karthik Pattabiraman, Proceedings of the IEEE Workshop on Silicon Errors in Logic - System Effects (SELSE), 2012. [ PDF File | Talk ]

This paper is superseded by our DSN 2012 paper.

Abstract: The scaling of silicon devices has exacerbated the unreliability of modern computer systems, and power constraints have necessitated the involvement of software in hardware error detection. Simultaneously, the multi-core revolution has impelled software to become parallel. Therefore, there is a compelling need to protect parallel programs from hardware errors.
The tasks of parallel programs have significant similarity in control data due to the use of high-level programming models. In this study, we propose BLOCKWATCH, which leverages the similarity in parallel programs’ control data to detect hardware errors. BLOCKWATCH statically extracts the similarity among different threads of a parallel program and checks the similarity at runtime. We evaluate BLOCKWATCH on seven SPLASH-2 benchmarks to measure its error detection coverage. We find that BLOCKWATCH provides an average silent data corruption (SDC) coverage of 98% for faults in the control data.
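
The runtime half of the idea can be illustrated with a toy sketch. This is not the authors' implementation: the `SimilarityMonitor` class, the `worker` function, the `loop_guard` branch name, and the `flip` fault-injection flag are all hypothetical, and the sketch assumes the simplest kind of similarity, namely a branch whose condition depends only on shared data, so every thread is expected to take the same direction. Each thread reports its branch outcome, and the monitor flags any branch on which the threads disagree.

```python
import threading
from collections import defaultdict


class SimilarityMonitor:
    """Hypothetical monitor that collects branch outcomes reported by threads."""

    def __init__(self):
        self._lock = threading.Lock()
        # branch_id -> {thread_id: outcome}
        self._outcomes = defaultdict(dict)

    def report(self, branch_id, thread_id, outcome):
        """Record the direction a thread took at a given branch."""
        with self._lock:
            self._outcomes[branch_id][thread_id] = outcome

    def check(self):
        """Return the ids of branches on which threads disagreed."""
        with self._lock:
            return sorted(
                branch_id
                for branch_id, per_thread in self._outcomes.items()
                if len(set(per_thread.values())) > 1
            )


def worker(thread_id, n_iters, monitor, flip=False):
    # The condition depends only on shared data (n_iters), so all
    # threads are assumed to take the same direction at this branch.
    taken = n_iters > 0
    if flip:
        # Assumption: model a hardware fault as a flipped branch outcome.
        taken = not taken
    monitor.report("loop_guard", thread_id, taken)


def run(n_threads, faulty_thread=None):
    """Run the workers and return the branches that violated similarity."""
    monitor = SimilarityMonitor()
    threads = [
        threading.Thread(target=worker, args=(t, 8, monitor, t == faulty_thread))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return monitor.check()
```

In this sketch, `run(4)` reports no violations, while `run(4, faulty_thread=2)` flags `loop_guard` because one thread's outcome deviates from the others. Here the expected similarity is written by hand; in the approach described in the abstract, it is extracted statically from the program.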