Error Resilient Software

Summary

As future computer systems push the envelope on performance under stringent power constraints, reliability is often compromised. In particular, future systems are more vulnerable to transient and intermittent hardware faults caused by particle strikes, temperature effects, and manufacturing variations. Computer systems have traditionally dealt with such faults through techniques such as hardware duplication; however, these techniques are becoming prohibitively expensive in terms of energy. Our research instead focuses on software techniques for protecting computer systems from hardware faults. The rationale is that many studies have found errors to be progressively masked as they propagate up the system stack, so that only a fraction of the errors occurring at the circuit level ever affect the application (see figure below). It is therefore more economical to deploy protection at the application level, where one can precisely target the errors that actually matter.

Figure: Propagation of errors in the system stack

Our work in this area spans three directions. First, we have developed automated compiler-based techniques to identify the critical data in a program that must be protected to prevent erroneous outcomes. For soft-computing applications, i.e., applications that can tolerate some deviation in their output (e.g., multimedia applications), these outcomes are what we call Egregious Data Corruptions (EDCs) [DSN’13]. For general-purpose applications, these outcomes are Silent Data Corruptions (SDCs) [CASES’14]. We have also explored the use of data similarity in parallel programs to efficiently detect errors in a program’s control data [DSN’12].
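
As a concrete illustration of what protecting identified critical data can look like, the sketch below shows duplication-based checking written by hand in C++: a critical computation is performed on two independent copies and the copies are compared, so that a disagreement exposes a corruption before it silently propagates to the output. This is a generic sketch of the idea, not the compiler-based technique from the papers above; the names checked_offset and on_sdc_detected are invented for the example.

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical handler invoked when the redundant copies disagree.
    static void on_sdc_detected(const char *site) {
        std::fprintf(stderr, "mismatch detected at %s\n", site);
        std::abort();
    }

    // A "critical" computation performed twice on duplicated values; a
    // mismatch between the copies indicates that a fault corrupted one.
    static long checked_offset(long base, long stride, long i) {
        volatile long primary   = base + stride * i;
        volatile long redundant = base + stride * i;  // duplicated computation
        if (primary != redundant) on_sdc_detected("checked_offset");
        return primary;
    }

    int main() {
        std::printf("offset = %ld\n", checked_offset(0x1000, 8, 42));
        return 0;
    }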

The second thrust is the use of hardware and software techniques to diagnose intermittent faults. Intermittent faults are recurring, non-deterministic faults that arise from temperature effects or manufacturing variations. Their non-determinism makes them challenging to diagnose, yet diagnosing them is important for enabling fine-grained hardware reconfiguration of the processor [QEST’12]. We have analyzed how intermittent faults propagate in programs using fault-injection studies [IEEE TR], and, based on these studies, proposed a hardware-software co-designed technique for diagnosing intermittent faults in processors [DSN’14a].

The final thrust in this area is building robust fault injectors for evaluating software error-resilience techniques. To this end, we have built three fault injectors. The first, LLFI, is based on the LLVM compiler. LLFI is highly configurable, makes it easy to map a fault’s propagation back to the source code, and is accurate compared with assembly-level injection [DSN’14b] [SELSE’13]. The second, GPU-Qin, enables efficient injection of faults on real GPU hardware; its main advantage is that it allows one to study the end-to-end behaviour of GPGPU applications [ISPASS’14]. Finally, we have built PINFI, which injects faults into program binaries using Intel’s Pin instrumentation tool. All three fault injectors have been released publicly.
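
To make the evaluation methodology concrete, the sketch below shows, in plain C++, the experiment these tools automate: run the program once to obtain a fault-free ("golden") output, re-run it while flipping a single bit in one randomly chosen dynamic value, and classify the outcome by comparing the two outputs. It is a toy illustration rather than code from LLFI, GPU-Qin, or PINFI; the workload and the names run, target_iter, and target_bit are invented for the example, and the crash and hang outcomes that real injectors also record are omitted.

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    static bool     inject      = false;  // whether this run injects a fault
    static uint64_t target_iter = 0;      // dynamic instance chosen for injection
    static int      target_bit  = 0;      // bit of the value to flip

    // Workload under test: a checksum over n values. Each loop iteration
    // stands in for one dynamic instruction that an injector could target.
    static uint64_t run(uint64_t n) {
        uint64_t sum = 0;
        for (uint64_t i = 0; i < n; ++i) {
            uint64_t v = i * i + 7;
            if (inject && i == target_iter)
                v ^= (uint64_t)1 << target_bit;  // single-bit flip in the value
            sum += v;
        }
        return sum;
    }

    int main() {
        const uint64_t n = 1000;

        uint64_t golden = run(n);          // fault-free ("golden") output

        std::srand(2014);                  // fixed seed for reproducibility
        inject      = true;
        target_iter = (uint64_t)std::rand() % n;
        target_bit  = std::rand() % 64;
        uint64_t faulty = run(n);          // one faulty run

        if (faulty == golden)
            std::puts("benign: the injected fault was masked");
        else
            std::puts("SDC: the output silently differs from the golden run");
        return 0;
    }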

Collaborators

Industry: Sudhanva Gurumurthi (AMD), Pradip Bose, Chen-Yong Cher, Meeta Gupta and Jude Rivers (IBM)

Colleagues: Sathish Gopalakrishnan and Matei Ripeanu

Students: Bo Fang, Guanpeng Li

Alumni: Layali Rashid, Jiesheng Wei, Anna Thomas, Majid Dadashi, Qining Lu, Nithya Narayanamurthi

Funding

NSERC, MITACS, AMD, and Lockheed Martin
