Majid Dadashi, Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, To appear in the Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014. (Accept Rate: 30%)
Abstract: Intermittent hardware faults are hard to diagnose as they occur non-deterministically. Hardware-only diagnosis techniques incur significant power and area overheads. On the other hand, software-only diagnosis techniques have low power and area overheads, but have limited visibility into many micro-architectural structures and hence cannot diagnose faults in them.
To overcome these limitations, we propose a hardware-software integrated framework for diagnosing intermittent faults. The hardware part of our framework, called SCRIBE, continuously records the resource usage information of every instruction in the processor and exposes it to the software layer. SCRIBE incurs a performance overhead of 12% and a power overhead of 9%, on average. The software part of our framework, called SIED, backtracks from the program’s crash dump to find the faulty micro-architectural resource. Our technique has an average accuracy of 84% in diagnosing the faulty resource, which in turn enables fine-grained deconfiguration with less than 2% performance loss after deconfiguration.
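As a rough illustration of the idea only, and not the algorithm used by SIED, the sketch below assumes the software layer already has (a) a per-instruction resource usage log of the kind SCRIBE is described as exposing and (b) a set of suspect instructions obtained by backtracking from the crash. The function name `rank_resources`, the log layout, and the resource names such as `ALU0` are all hypothetical. It simply counts how often each resource appears among the suspect instructions and ranks the resources by that count.

```python
from collections import Counter


def rank_resources(usage_log, suspect_ids):
    """Rank resources by how many suspect instructions used them.

    usage_log   -- hypothetical mapping: instruction id -> set of resource names
    suspect_ids -- hypothetical list of instruction ids reached by backtracking
                   from the crash point
    """
    counts = Counter()
    for instr_id in suspect_ids:
        # Instructions missing from the log contribute nothing.
        counts.update(usage_log.get(instr_id, ()))
    # Most frequently used resource first.
    return counts.most_common()


# Toy log: every resource name here is made up for illustration.
usage_log = {
    0: {"ALU0", "ROB3"},
    1: {"ALU1", "ROB4"},
    2: {"ALU0", "LSQ2"},
    3: {"ALU0", "ROB5"},
}
suspects = [0, 2, 3]

ranking = rank_resources(usage_log, suspects)
print(ranking[0])  # the resource shared by the most suspect instructions
```

A simple frequency count like this is only a stand-in: it shows why recording which resource each instruction used makes it possible for software to narrow a crash down to one candidate resource that could then be deconfigured.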