Tag Archives: 2007

Position Paper – ToleRace: Tolerating and Detecting Races

Rahul Nagpal, Karthik Pattabiraman, Darko Kirovski and Benjamin Zorn, Second Workshop on Software Tools for Multi-core Systems (STMCS), 2007.
[ PDF File | Talk ]

This paper is super-ceded by the following conference paper

This paper introduces ToleRace, a software tool that increases the reliability of multi-threaded programs by tolerating or detecting race conditions. ToleRace modifies the implementation of critical sections at runtime to provide the following benefits. ToleRace allows programs with certain classes of races to operate as though the race did not exist. ToleRace probabilistically allows programmers to detect many of the remaining races when they happen, with low performance overhead. ToleRace achieves its ability to tolerate and detect races by judiciously duplicating shared data inside a critical section, thereby providing an illusion of atomicity when the shared data is updated. Our early experiments reveal that the performance overhead of ToleRace is considerably lower than existing dynamic race detection tools.

FPGA Hardware Implementation of Statically Derived Error Detectors

Peter Klemperer, Shelley Chen, Karthik Pattabiraman, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Workshop on Dependable and Secure Nanocomputing (WDSN), 2007.
[ PDF File | Talk ]

This paper is superceded by the following conference paper.

Abstract: Previous software-only error detection techniques have provided high-coverage, low-latency detection but suffer significant performance overheads with a large percentage of benign detections. This paper presents a FPGA hardware implementation of application-aware data error detectors. The detectors are automatically derived at compile time and executed in hardware at runtime, minimizing the performance overhead. We implement the static detectors using the Reliability and Security Engine, which provides a standard interface for developing reliability and security hardware modules. An initial, proof-of-concept model shows that there is only a 2% performance penalty when the detectors are implemented in hardware.

Critical Variable Recomputation for Transient Error Detection

Karthik Pattabiraman, Zbigniew Kalbarcyk and Ravishankar Iyer, Workshop on Silicon Errors in Logic – System Effects (SELSE), 2007.
[ PDF File | Talk ]

This paper is super-ceded by the following conference paper

Abstract: This paper presents a technique to derive and implement error detectors to protect an application from data errors. The error detectors are derived automatically using compiler-based static analysis from the backward program slice of critical variables in the program. Critical variables are defined as those that are highly sensitive to errors, and deriving error detectors for these variables provides high coverage for errors in any data value used in the program. The error detectors take the form of checking expressions and are optimized for each control flow path followed at runtime. The derived detectors are implemented using a combination of hardware and software.

Processor-Level Selective Replication

Nithin Nakka, Karthik Pattabiraman and Ravishankar Iyer, Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2007.
[ PDF File | Talk ]

Abstract: Full duplication of an entire application (through spatial or temporal redundancy) would detect many errors that are benign to the application from the perspective of the end-user. It has also been seen that duplication has upto 30% performance overhead and needs significant introduction of hardware to synchronize the replicas. In order to overcome the drawbacks of performance overhead and detection of “benign” faults, we propose a processor-level technique called Selective Replication, which provides the application the capability to choose where in its application stream and to what degree it requires replication. Recent work on static analysis and fault-injection based experiments on applications reveals that certain variables in the application are critical to its crash- and hang-free execution. If it can be ensured that the computation of these variables is error-free, then a high degree of crash/hang coverage can be achieved at a low performance overhead to the application. The Selective Replication technique provides an ideal platform for validating this claim. The technique is compared against complete duplication as provided in current architectural level techniques. The results show that with about 59% less overhead than full duplication selective replication detects 97% of the data errors and 87% of the instruction errors that were covered by full duplication. It also reduces the detection of errors benign to the final outcome of the application by 17.8% as compared to full duplication.

Automated Derivation of Application-aware Error Detectors using Static Analysis

Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar Iyer, Proceedings of the IEEE International Online Testing Symposium (IOLTS), 2007. [ PDF File | Talk ]

Abstract: This paper presents a technique to derive and implement error detectors to protect an application from data errors. The error detectors are derived automatically using compiler-based static analysis from the backward program slice of critical variables in the program. Critical variables are defined as those that are highly sensitive to errors, and deriving error detectors for these variables provides high coverage for errors in any data value used in the program. The error detectors take the form of checking expressions and are optimized for each control flow path followed at runtime. The derived detectors are implemented using a combination of hardware and software.Experiments show that the derived detectors incur low performance overheads while achieving high detection coverage for errors that impact the application.

This paper is superceded by the following journal paper.