Abstract: Extreme CMOS technology scaling is raising significant concerns about the reliability of computer systems. Intermittent hardware errors are non-deterministic bursts of errors that occur in the same physical location. Recent studies have found that 40% of processor failures in real-world machines are due to intermittent hardware errors. Studying the effects of intermittent faults on programs is a critical step toward building fault-tolerance techniques of reasonable accuracy and cost. In this work, we characterize the impact of intermittent hardware faults on programs using fault-injection campaigns in a microarchitectural processor simulator. We find that 80% of the non-benign intermittent hardware errors activate a hardware trap in the processor, and the remaining 20% cause Silent Data Corruptions (SDCs). We have also investigated the possibility of using the program state at failure time in software-based diagnosis techniques, and found that much of the erroneous data is intact and can be used to identify the source of the error.
Phone: 1-604-827-4245 (Please do NOT leave voicemail!)
Address: Room 4048, Fred Kaiser Building, 2332 Main Mall, Vancouver, BC V6T 1Z4