Talk: Good Enough Systems: Tolerating (most) hardware errors using software

Talk given at the University of Pittsburgh, CS department colloquium. Feb 22, 2012. [ PDF File ]

Abstract: Commodity software is designed with the assumption that the hardware is fault-free, and hence software hardly ever needs to deal with hardware errors. However, this assumption is increasingly difficult to satisfy as CMOS devices scale to ever smaller sizes and as manufacturing variations increase. In addition, traditional solutions such as guard-banding and dual modular redundancy (DMR) are rendered impractical by stringent power constraints. Therefore, there is a need to develop low-overhead approaches for protecting software from hardware errors.

However, it is challenging to ensure that all hardware errors are masked from the software without incurring excessively high power and performance overheads. In this talk, I will present the approach my group has been taking, in which we allow software to occasionally deviate from its correct behaviour under hardware errors, provided the deviation does not cause unsafe or egregious consequences. We call this the “good enough” approach, after an article that appeared in Wired magazine (+). I will present two systems we have developed based on this approach, and the research challenges it exposes.

This is joint work with Jiesheng Wei (UBC), Song Liu (Northwestern), Thomas Moscibroda and Ben Zorn (MSR).

+ “The Good Enough Revolution: When cheap and simple is just fine”, Wired Magazine, Sep 2009.