Guanpeng Li, Karthik Pattabiraman, Siva Kumar Sastry Hari, Michael Sullivan, and Timothy Tsai. To appear in the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018. (Acceptance Rate for Regular Papers: 25%) [PDF (coming soon) | Talk]
Abstract: As technology scales to lower feature sizes, devices become more susceptible to soft errors. Soft errors can lead to silent data corruptions (SDCs), seriously compromising the reliability of a system. Traditional hardware-only techniques to avoid SDCs are energy hungry, and hence not suitable for commodity systems. Researchers have proposed selective software-based protection techniques to tolerate hardware faults at lower costs. However, these techniques either use expensive fault injections or inaccurate analytical models to determine which parts of a program must be protected for preventing SDCs.
In this work, we construct a three-level model, TRIDENT, that captures error propagation at the static data dependency, control-flow and memory levels, based on empirical observations of error propagations in programs. TRIDENT is implemented as part of a compiler, and can predict both the overall SDC probability of a given program, and the SDC probabilities of individual instructions of programs, without fault injections. We find that TRIDENT is nearly as accurate as fault injection, but is much faster and more scalable. We also demonstrate the use of TRIDENT to guide selective instruction duplication to efficiently mitigate SDCs under a given performance overhead bound.