Tag Archives: 2005

Modeling Coordinated Checkpointing for Large-Scale Supercomputers

Long Wang, Karthik Pattabiraman, Lawrence Votta, Christopher Vick, Alan Wood, Zbigniew Kalbarczyk and Ravishankar Iyer, Proceedings of the International Conference on Dependable Systems and Networks (DSN), 2005.
[ PDF File | Talk ]

Abstract: Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today’s systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic Activity Networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.

Application-based Metrics for Strategic Placement of Detectors

Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar K. Iyer, Proceedings of the International Symposium on Pacific-Rim Dependable Computing (PRDC), 2005.
[ PDF File | Talk ]

Abstract: The goal of this study is to provide low-latency detection and prevent error propagation due to value errors. This paper introduces metrics to guide the strategic placement of detectors and evaluates (using fault injection) the coverage provided by ideal detectors embedded at program locations selected using the computed metrics. The computation is represented in the form of a Dynamic Dependence Graph (DDG), a directed-acyclic graph that captures the dynamic dependencies among the values produced during the course of program execution. The DDG is employed to model error propagation in the program and to derive metrics (e.g., value fanout or lifetime) for detector placement. The coverage of the detectors placed is evaluated using fault injections in real programs, including two large SPEC95 integer benchmarks (gcc and perl). Results show that a small number of detectors, strategically placed, can achieve a high degree of detection coverage.