Formal Diagnosis of Hardware Transient Errors in Programs

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, Workshop on Silicon Errors in Logic, System Effects (SELSE), 2010.

As silicon technology continues to scale down and validation costs continue to rise, more processors with vulnerable parts are shipped to customers. Detailed information about which architectural units fail relatively often is critical feedback for thread-scheduling algorithms and for fault detection and recovery mechanisms.

We present a technique that identifies the instructions responsible for a program failure from failure symptoms such as program crashes and failing detectors. Our technique employs formal verification and is unique in that it requires neither hardware support nor special instrumentation of the code. We find that careful engineering of the program’s error detectors and their locations greatly increases the chances of diagnosing soft errors. We further show that, for two applications, we can diagnose up to 80% of faults using 1 to 4 fault detectors.
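
To make the idea of symptom-based diagnosis concrete, the following is a minimal sketch and not the method of the paper: it replaces formal verification with brute-force enumeration of single bit flips on a toy program. The program, the 8-bit register width, the detector names, and the helpers `run` and `diagnose` are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: brute-force enumeration stands in for the formal
# verification used in the paper. All names below are hypothetical.
MASK = 0xFF  # assume 8-bit registers

# Each instruction is (destination register, function of the register file).
PROGRAM = [
    ("a", lambda r: 5),
    ("b", lambda r: 3),
    ("c", lambda r: (r["a"] + r["b"]) & MASK),
    ("d", lambda r: (r["c"] * 2) & MASK),
]

# Each detector is (name, index of the instruction it runs after, predicate).
DETECTORS = [
    ("c_is_small", 2, lambda r: r["c"] < 16),
    ("d_is_double_c", 3, lambda r: r["d"] == (r["c"] * 2) & MASK),
]


def run(fault=None):
    """Execute PROGRAM, optionally flipping one bit of one instruction's result.

    `fault` is (instruction index, bit position) or None.
    Returns the name of the first failing detector, or None if none fails.
    """
    regs = {}
    for i, (dest, fn) in enumerate(PROGRAM):
        value = fn(regs)
        if fault is not None and fault[0] == i:
            value ^= 1 << fault[1]  # transient single-bit error in the result
        regs[dest] = value
        for name, after, check in DETECTORS:
            if after == i and not check(regs):
                return name
    return None


def diagnose(symptom):
    """Return indices of instructions whose single-bit fault reproduces `symptom`."""
    return sorted(
        {i for i in range(len(PROGRAM)) for bit in range(8) if run((i, bit)) == symptom}
    )


if __name__ == "__main__":
    print(diagnose("c_is_small"))
    print(diagnose("d_is_double_c"))
```

In this toy setting, a failure of `c_is_small` is consistent with a fault in any of the first three instructions, whereas a failure of `d_is_double_c` is consistent only with a fault in the last one, which illustrates how the choice and placement of detectors affects how precisely a fault can be localized.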