DIEBA: Diagnosing Intermittent Errors By BackTracing Application Failures

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, Proceedings of the IEEE Workshop on Silicon Errors in Logic, System Effects (SELSE), 2012.

Intermittent hardware faults have emerged as a leading cause of system failures in the real world. Unlike transient faults, intermittent faults recur at the same location, and need to be diagnosed in order to mitigate their effects. However, unlike permanent faults, intermittent faults are non-deterministic, which makes them challenging to diagnose through traditional methods. We propose a software-based technique to Diagnose Intermittent hardware Errors in microprocessors by Backtracing Application state at the time of a failure (DIEBA). We focus on faults that occur in the micro-architectural units of a core and that either result in program failure or are detected by software or hardware detectors. We evaluate DIEBA through fault-injection experiments and show that it successfully diagnoses 70% of the errors that result in failures or detections.