Intermittent Hardware Errors Recovery: Modeling and Evaluation

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, To appear in the Proceedings of the International Conference on the Quantitative Evaluation of Systems (QEST), 2012.

The frequency of hardware errors is increasing due to shrinking feature sizes, higher levels of integration, and increasing design complexity. Intermittent errors are those that occur non-deterministically at the same location. It has been shown that intermittent hardware errors contribute to about 39% of total hardware failures. Recovering from intermittent hardware errors is challenging because their characteristics differ from those of transient and permanent errors.

In this paper, we evaluate the impact of different intermittent error recovery scenarios on processor performance. To achieve this, we build a system model that consists of (1) a model of a fault-tolerant processor and (2) a few models of intermittent hardware faults. Because little is known about the exact characteristics of intermittent faults, our fault models are based on insights from related work at the physical level. We find that the frequency of the intermittent error and the relative importance of the error location play an important role in choosing the recovery action that maximizes the processor’s performance.