Guanpeng Li and Karthik Pattabiraman, To appear in the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2018. (Acceptance Rate for Regular Papers: 25%)
Abstract: Transient hardware faults are increasing in computer systems due to shrinking feature sizes. Traditional methods mitigate such faults through hardware duplication, which incurs huge overheads in performance and energy consumption. Therefore, researchers have explored software solutions such as selective instruction duplication, which require fine-grained analysis of instruction vulnerabilities to Silent Data Corruptions (SDCs). These vulnerabilities are typically evaluated via Fault Injection (FI), which is often highly time-consuming. Hence, most studies confine their evaluations to a single input for each program. However, there is often significant variation in the SDC probabilities of both the overall program and individual instructions across inputs, which compromises the correctness of results obtained with a single input.
In this work, we study the variation of SDC probabilities across different inputs of a program, and identify the reasons for the variations. Based on these observations, we propose a model, VTRIDENT, which predicts the variations in programs’ SDC probabilities for a given set of inputs without any FIs. We find that VTRIDENT is nearly as accurate as FI in identifying the variations in SDC probabilities across inputs. We demonstrate the use of VTRIDENT to bound the overall SDC probability of a program under multiple inputs, while performing FI on only a single input.