Guanpeng Li, Qining Lu and Karthik Pattabiraman, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2015. (Acceptance rate: 22%)
Abstract: As the rate of transient hardware faults increases, researchers have investigated software techniques to tolerate these faults. An important class comprises faults that cause long-latency crashes (LLCs): faults that can persist for a long time in the program before causing it to crash. In this paper, we develop a technique to automatically find the program locations where LLC-causing faults originate, so that these locations can be protected to bound the program's crash latency.
We first identify, through an empirical study, the program code patterns that are responsible for the majority of LLC-causing faults. We then build CRASHFINDER, a tool that finds LLC locations by statically searching the program for these patterns and then refining the static analysis results with dynamic analysis and selective fault injection. We find that CRASHFINDER identifies more than 90% of the LLC-causing locations in a program while reducing analysis time by an average of 9.29 orders of magnitude compared to exhaustive fault injection, and that it produces no false positives.
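The three-stage refinement described in the abstract can be pictured as a funnel that narrows a set of candidate locations. The following is only a minimal sketch under stated assumptions, not CRASHFINDER's actual implementation: the function `find_llc_locations`, its parameters, and the toy location names are all hypothetical, and the dynamic stage is assumed here to simply discard candidates that never execute in a profiled run.

```python
from typing import Callable, Iterable, List, Set


def find_llc_locations(
    locations: Iterable[str],
    matches_static_pattern: Callable[[str], bool],
    executed_locations: Set[str],
    injection_causes_llc: Callable[[str], bool],
) -> List[str]:
    """Hypothetical three-stage filter; every name here is illustrative."""
    # Stage 1 (static): keep locations that match a known LLC code pattern.
    candidates = [loc for loc in locations if matches_static_pattern(loc)]
    # Stage 2 (dynamic, assumed): drop candidates not seen in a profiled run.
    candidates = [loc for loc in candidates if loc in executed_locations]
    # Stage 3 (selective fault injection): inject only at the survivors
    # and keep those where the injected fault leads to a long-latency crash.
    return [loc for loc in candidates if injection_causes_llc(loc)]


# Toy data: four locations, three match a pattern, two of those execute,
# and one is confirmed by the (stubbed) fault-injection check.
toy_result = find_llc_locations(
    locations=["a", "b", "c", "d"],
    matches_static_pattern=lambda loc: loc in {"a", "b", "c"},
    executed_locations={"a", "b", "d"},
    injection_causes_llc=lambda loc: loc == "a",
)
print(toy_result)
```

In this toy run, fault injection is attempted at only two of the four locations, which illustrates why filtering before injection is cheaper than injecting exhaustively.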