Tag Archives: reliability

Processor-level Selective Replication

Nithin Nakka, Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar Iyer, Workshop on Silicon Errors in Logic – System Effects (SELSE), 2006.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: Even though replication has been widely used to provide fault tolerance, the underlying hardware is unaware of the application executing on it. The application cannot choose to use redundancy for a specific code section and run in a normal, unreplicated mode for the rest of the code. In this paper we propose Processor-level Selective Replication, a mechanism to dynamically configure the degree of instruction-level replication according to the application's demands. The application can choose to replicate only those code sections that are critical to its crash-free execution, which decreases the impact on performance. It is also known that many processor-level faults do not lead to failures observable in the application outcome, so selective replication also decreases the number of false positives.

FPGA Hardware Implementation of Statically Derived Error Detectors

Peter Klemperer, Shelley Chen, Karthik Pattabiraman, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Workshop on Dependable and Secure Nanocomputing (WDSN), 2007.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: Previous software-only error detection techniques have provided high-coverage, low-latency detection, but suffer significant performance overheads and a large percentage of benign detections. This paper presents an FPGA hardware implementation of application-aware data error detectors. The detectors are automatically derived at compile time and executed in hardware at runtime, minimizing the performance overhead. We implement the static detectors using the Reliability and Security Engine, which provides a standard interface for developing reliability and security hardware modules. An initial proof-of-concept model shows that there is only a 2% performance penalty when the detectors are implemented in hardware.

Critical Variable Recomputation for Transient Error Detection

Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar Iyer, Workshop on Silicon Errors in Logic – System Effects (SELSE), 2007.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: This paper presents a technique to derive and implement error detectors to protect an application from data errors. The error detectors are derived automatically using compiler-based static analysis from the backward program slice of critical variables in the program. Critical variables are defined as those that are highly sensitive to errors, and deriving error detectors for these variables provides high coverage for errors in any data value used in the program. The error detectors take the form of checking expressions and are optimized for each control flow path followed at runtime. The derived detectors are implemented using a combination of hardware and software.
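
The checking expressions described above are generated by a compiler; as a rough illustration of the idea (not code from the paper), the Python sketch below shows what a path-specific check for one critical variable might look like. All names here (compute_index, check_index) are invented for this sketch.

    # Hypothetical illustration of a derived checking expression. The check
    # recomputes the critical variable from its backward slice, specialized
    # to the control-flow path actually taken, and compares the two values.

    def compute_index(base, offset, scale):
        # Original (protected) computation of the critical variable `idx`.
        if scale > 1:
            idx = base + offset * scale   # path A
        else:
            idx = base + offset           # path B
        return idx

    def check_index(idx, base, offset, scale):
        # Checking expression: recompute `idx` along the same path and compare.
        expected = base + offset * scale if scale > 1 else base + offset
        if idx != expected:
            raise RuntimeError("error detected in critical variable 'idx'")

    idx = compute_index(1000, 7, 4)
    check_index(idx, 1000, 7, 4)   # raises only if a transient error corrupted idx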

Modeling Coordinated Checkpointing for Large-Scale Supercomputers

Long Wang, Karthik Pattabiraman, Lawrence Votta, Christopher Vick, Alan Wood, Zbigniew Kalbarczyk and Ravishankar Iyer, Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2005.
[ PDF File | Talk ]

Abstract: Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today’s systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic Activity Networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.
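
The paper models checkpointing with Stochastic Activity Networks; as a much simpler back-of-the-envelope companion (not the paper's model), the sketch below uses a standard first-order analytic model with exponentially distributed failures, plus Young's approximation for a near-optimal checkpoint interval. The MTBF and checkpoint-cost numbers are made up.

    import math

    def expected_runtime(work, interval, ckpt_cost, mtbf):
        # First-order estimate of expected completion time for `work` seconds
        # of computation, checkpointing every `interval` seconds at a cost of
        # `ckpt_cost` seconds, under exponential failures with mean `mtbf`.
        # Recovery time is ignored; a failure restarts the current segment.
        segment = interval + ckpt_cost
        e_segment = (math.exp(segment / mtbf) - 1.0) * mtbf
        return (work / interval) * e_segment

    def young_interval(ckpt_cost, mtbf):
        # Young's approximation for the near-optimal checkpoint interval.
        return math.sqrt(2.0 * ckpt_cost * mtbf)

    mtbf = 6 * 3600   # hypothetical 6-hour system MTBF
    cost = 300        # hypothetical 5-minute checkpoint cost
    opt = young_interval(cost, mtbf)
    print(f"near-optimal interval: {opt / 60:.0f} min")
    print(f"expected runtime for a 24h job: "
          f"{expected_runtime(24 * 3600, opt, cost, mtbf) / 3600:.1f} h")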

Toward Application-aware Security and Reliability

Ravishankar Iyer, Zbigniew Kalbarczyk, Karthik Pattabiraman, William Healey, Wen-Mei Hwu, Peter Klemperer and Reza Farivar, IEEE Security and Privacy Magazine, January 2007 (Invited). [ PDF File ]

No abstract is available.

Here is a news article in the Chicago Tribune that describes this work.

Samurai: Protecting Critical Data in Unsafe Languages

Karthik Pattabiraman, Vinod Grover and Benjamin G. Zorn, Proceedings of the European Conference on Computer Systems (EuroSys), 2008.
[ PDF File | Talk ]

SymPLFIED: Symbolic Program-Level Fault Injection and Error Detection Framework

Karthik Pattabiraman, Nithin Nakka, Zbigniew Kalbarczyk and Ravishankar Iyer, Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2008.
This paper won the William C. Carter Award for the best paper at the conference.
[ PDF File | Talk ]
You can find the tech report for the conference paper here.

Abstract: This paper introduces SymPLFIED, a program-level framework which allows specification of arbitrary error detectors and the verification of their efficacy against hardware errors. SymPLFIED comprehensively enumerates all transient hardware errors in registers, memory and computation (expressed as value errors) that potentially evade detection and cause program failure. The framework uses symbolic execution to abstract the state of erroneous values in the program and model checking to comprehensively find all errors that evade detection. We demonstrate the use of SymPLFIED on a widely deployed aircraft collision avoidance application, tcas. Our results show that the SymPLFIED framework can be used to uncover hard-to-detect corner cases caused by transient errors in programs that may not be exposed by random fault-injection based validation.
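
SymPLFIED itself combines symbolic execution with model checking; the toy Python sketch below (invented for illustration, not the actual tool) captures only the core enumeration idea: represent an erroneous value with an abstract ERR token, inject it at every single-error site in a tiny register program, and report errors that evade a detector yet corrupt the output.

    ERR = "ERR"  # abstract token standing for "any erroneous value"

    def run(program, inputs, corrupt_at, corrupt_reg, check_reg, check_at):
        # Execute a tiny register program, injecting ERR into corrupt_reg just
        # before instruction corrupt_at; a detector checks check_reg right
        # after instruction check_at. Returns (detected, final registers).
        regs = dict(inputs)
        detected = False
        for pc, (op, dst, a, b) in enumerate(program):
            if pc == corrupt_at:
                regs[corrupt_reg] = ERR
            x, y = regs.get(a, a), regs.get(b, b)
            regs[dst] = ERR if ERR in (x, y) else (x + y if op == "add" else x * y)
            if pc == check_at and regs.get(check_reg) == ERR:
                detected = True
        return detected, regs

    program = [("add", "t", "x", "y"),    # t = x + y
               ("mul", "out", "t", "t")]  # out = t * t
    inputs = {"x": 2, "y": 3}
    _, baseline = run(program, inputs, -1, "x", "t", 0)  # fault-free run

    # Enumerate every single error site (instruction, register) and report
    # those that evade the detector on 't' yet corrupt the output.
    for pc in range(len(program)):
        for reg in ("x", "y", "t"):
            detected, final = run(program, inputs, pc, reg, "t", 0)
            if not detected and final["out"] != baseline["out"]:
                print(f"undetected failure: {reg} corrupted before instruction {pc}")

Here the only escape is corrupting t after it has been checked but before it is consumed, which is exactly the kind of corner case exhaustive enumeration surfaces and random fault injection can miss.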

The Coordinated Science Lab at UIUC published an article about this paper.

Processor-Level Selective Replication

Nithin Nakka, Karthik Pattabiraman and Ravishankar Iyer, Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2007.
[ PDF File | Talk ]

Abstract: Full duplication of an entire application (through spatial or temporal redundancy) detects many errors that are benign to the application from the perspective of the end-user. Duplication also incurs up to 30% performance overhead and requires the introduction of significant hardware to synchronize the replicas. To overcome these drawbacks of performance overhead and detection of “benign” faults, we propose a processor-level technique called Selective Replication, which gives the application the capability to choose where in its instruction stream and to what degree it requires replication. Recent work on static analysis and fault-injection based experiments on applications reveals that certain variables in the application are critical to its crash- and hang-free execution. If it can be ensured that the computation of these variables is error-free, then a high degree of crash/hang coverage can be achieved at a low performance overhead to the application. The Selective Replication technique provides an ideal platform for validating this claim. The technique is compared against complete duplication as provided in current architecture-level techniques. The results show that, with about 59% less overhead than full duplication, selective replication detects 97% of the data errors and 87% of the instruction errors that were covered by full duplication. It also reduces the detection of errors benign to the final outcome of the application by 17.8% as compared to full duplication.
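
As a rough software analogue of the idea (the paper's mechanism operates at the processor/instruction level; the decorator below is invented for this sketch), a critical function can be executed twice and its results compared, while non-critical code runs unreplicated:

    import functools

    def replicated(fn):
        # Hypothetical software analogue of selective replication: run the
        # critical section twice (temporal redundancy) and compare results;
        # a mismatch signals a transient error.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            first = fn(*args, **kwargs)
            second = fn(*args, **kwargs)
            if first != second:
                raise RuntimeError(f"transient error detected in {fn.__name__}")
            return first
        return wrapper

    @replicated
    def update_balance(balance, delta):   # critical to crash-free execution
        return balance + delta

    def log_message(msg):                 # non-critical: runs unreplicated
        print(msg)

    log_message(f"new balance: {update_balance(100, 25)}")

Note that this software analogue assumes the replicated section is deterministic and side-effect-free; the hardware mechanism in the paper avoids that restriction by replicating and comparing at the instruction level.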

Application-based Metrics for Strategic Placement of Detectors

Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar K. Iyer, Proceedings of the International Symposium on Pacific-Rim Dependable Computing (PRDC), 2005.
[ PDF File | Talk ]

Abstract: The goal of this study is to provide low-latency detection and prevent error propagation due to value errors. This paper introduces metrics to guide the strategic placement of detectors and evaluates (using fault injection) the coverage provided by ideal detectors embedded at program locations selected using the computed metrics. The computation is represented in the form of a Dynamic Dependence Graph (DDG), a directed acyclic graph that captures the dynamic dependencies among the values produced during the course of program execution. The DDG is employed to model error propagation in the program and to derive metrics (e.g., value fanout or lifetime) for detector placement. The coverage of the detectors placed is evaluated using fault injections in real programs, including two large SPEC95 integer benchmarks (gcc and perl). Results show that a small number of detectors, strategically placed, can achieve a high degree of detection coverage.
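
As a toy illustration of the placement metrics (the graph and numbers below are invented), fanout and lifetime can be computed directly from a dynamic dependence graph and used to rank candidate detector locations:

    # Each node is one dynamic value; its entry records the definition time
    # and the (use time, consumer) pairs. Fanout = number of uses; lifetime =
    # dynamic distance from definition to last use.
    ddg = {
        "a1": (0, [(1, "b1"), (2, "c1"), (9, "d1")]),
        "b1": (1, [(3, "c2")]),
        "c1": (2, [(3, "c2")]),
        "c2": (3, [(4, "d1")]),
        "d1": (9, []),
    }

    def fanout(node):
        return len(ddg[node][1])

    def lifetime(node):
        def_time, uses = ddg[node]
        return max((t for t, _ in uses), default=def_time) - def_time

    # Values with high fanout or long lifetime propagate errors widely,
    # so checking them first yields the most coverage per detector.
    ranked = sorted(ddg, key=lambda n: (fanout(n), lifetime(n)), reverse=True)
    print("place detectors at:", ranked[:2])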

Dynamic Derivation of Application-specific Error Detectors and their Hardware Implementation

Karthik Pattabiraman, Giacinto Paulo Saggese, Daniel Chen, Zbigniew Kalbarczyk and Ravishankar Iyer, Proceedings of the European Dependable Computing Conference (EDCC), 2006. [ PDF File | Talk ]

Abstract: This paper proposes a novel technique for preventing a wide range of data errors from corrupting the execution of applications. The proposed technique enables automated derivation of fine-grained, application-specific error detectors. An algorithm based on dynamic traces of application execution is developed for extracting the set of error detector classes, parameters, and locations in order to maximize the error detection coverage for a target application. The paper also presents an automatic framework for synthesizing the set of detectors in hardware to enable low-overhead run-time checking of the application execution. Coverage (evaluated using fault injection) of the error detectors derived using the proposed methodology, the additional hardware resources needed, and performance overhead for several benchmark programs are also reported.
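
To make the derivation step concrete, the hedged sketch below shows one plausible detector of this kind, a value-range check learned from a dynamic trace; the trace format and function names are invented, and the paper additionally selects among several detector classes and synthesizes the chosen detectors in hardware.

    def derive_range_detector(trace):
        # Training phase: observe the values a program location produces
        # and record their range.
        lo, hi = min(trace), max(trace)
        def detector(value):
            # Runtime phase: flag any value outside the observed range.
            return lo <= value <= hi
        return detector

    training_trace = [12, 17, 9, 14, 11, 16]     # values seen at one location
    check = derive_range_detector(training_trace)

    assert check(13)          # an in-range value passes
    assert not check(4096)    # a high-order bit flip falls outside the range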

This paper is superseded by the following journal paper.