Talk: Tolerating Silent Data Corruption (SDC) causing Hardware Faults Through Software Techniques

Talk given in the Electrical and Computer Engineering Department, Georgia Tech, June 2014. (Also, at IBM T. J. Watson Research, Aug 2014) [ PDF ].

Abstract: Commodity software is designed with the assumption that the hardware is fault-free, and hence software hardly ever needs to deal with hardware errors. However, this assumption is increasingly difficult to satisfy as CMOS devices scale to ever-smaller sizes and as manufacturing variations increase. In addition, traditional solutions such as guard-banding and dual modular redundancy (DMR) are rendered impractical by stringent power constraints. Therefore, there is a need to develop low-overhead software approaches for protecting programs from hardware errors.

In this talk, I will describe our work on automated compiler and runtime techniques to detect hardware faults that cause Silent Data Corruptions (SDCs) in applications. SDCs are particularly insidious because they leave no visible indication that the application’s result is wrong. We also investigate how static and dynamic program features, as well as the choice of algorithm, influence SDC rates. An important aspect of our work is building robust and accurate fault injectors to explore these questions. I will conclude by outlining some of the research challenges and opportunities in this area.
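
As a rough illustration of what a fault-injection experiment measures, the Python sketch below flips a single bit in one input value of a toy program and classifies the outcome as benign (the output is unchanged), a crash (the fault is visible), or an SDC (the output is wrong with no indication). This is not the injector described in the talk. The histogram workload, the function names (`histogram`, `classify`, `campaign`), and the single-bit-flip fault model are all assumptions made for illustration.

```python
from collections import Counter


def histogram(data, fault=None):
    """Count 8-bit values into 16 bins of width 16.

    fault is an optional (index, bit) pair: one bit of data[index] is
    flipped as the value is read (a hypothetical injection point).
    """
    bins = [0] * 16
    for i, v in enumerate(data):
        if fault is not None and fault[0] == i:
            v ^= 1 << fault[1]  # inject: flip a single bit of the value
        bins[v >> 4] += 1       # IndexError if the corrupted value is > 255
    return bins


def classify(data, fault):
    """Compare a faulty run against the fault-free ("golden") run."""
    golden = histogram(data)
    try:
        observed = histogram(data, fault)
    except IndexError:
        return "crash"          # the fault is visible to the user
    if observed == golden:
        return "benign"         # the fault was masked by the program
    return "sdc"                # wrong output, no visible indication


def campaign(data, bits=12):
    """Exhaustively inject one fault per (element, bit) and tally outcomes."""
    return Counter(
        classify(data, (i, b))
        for i in range(len(data))
        for b in range(bits)
    )


if __name__ == "__main__":
    print(campaign([3, 18, 200, 255]))
```

In this toy workload, flips in the low four bits are masked by the shift, flips in bits 4 to 7 move a value into the wrong bin without any error, and flips above bit 7 raise an exception. The injection here is exhaustive only because the fault space is tiny.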