Tag Archives: workshop

BlockWatch: Leveraging Similarity in Parallel Programs for Error Detection

Jiesheng Wei and Karthik Pattabiraman, Proceedings of the IEEE Workshop on Silicon Errors in Logic, System Effects (SELSE), 2012. [ PDF File | Talk ]

DIEBA: Diagnosing Intermittent Errors By BackTracing Application Failures

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, Proceedings of the IEEE Workshop on Silicon Errors in Logic, System Effects (SELSE), 2012. [ PDF File | Talk ]

Comparing the Effects of Transient and Intermittent Faults on Programs

Jiesheng Wei, Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, Workshop on Dependable and Secure Nano-Systems (WDSN), 2011. [ PDF File | Talk ]

Formal Diagnosis of Hardware Transient Errors in Programs

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, Workshop on Silicon Errors in Logic, System Effects (SELSE), 2010. [ PDF File ][ Talk Slides ]

Towards Understanding the Effects of Intermittent Hardware Faults on Programs

Layali Rashid, Karthik Pattabiraman and Sathish Gopalakrishnan, Proceedings of the IEEE International Workshop on Dependable and Secure Nano-computing (WDSN), 2010. [ PDF File | Talk ]

Automated Derivation and Hardware Implementation of Application-Specific Error Detectors

Karthik Pattabiraman, Giacinto Paolo Saggese, Daniel Chen, Zbigniew Kalbarczyk and Ravishankar Iyer, Workshop on Reliability Issues in High-Performance Computing (HPCRI), 2006.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: This paper proposes a novel technique for automated derivation of fine-grained, application-specific error detectors. An algorithm based on dynamic traces of application execution is developed for extracting the optimal set of error detectors for a target application. An automatic framework is proposed for synthesizing the derived detectors in hardware and enabling low-overhead run-time checking of the application execution. Coverage (evaluated using fault injection) of the error detectors obtained using the proposed methodology, the additional hardware resources, and performance overhead for several benchmark programs are also reported.
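As a rough illustration of what deriving detectors from dynamic traces can look like, the sketch below learns a simple value range per variable from fault-free runs and flags values outside that range. The range-check form, the trace layout, and the names `derive_range_detectors` and `check` are assumptions made for this example; the abstract does not specify the detector form or the selection algorithm.

```python
def derive_range_detectors(traces):
    """Derive a (min, max) bound per variable from fault-free traces.

    `traces` is a list of dicts mapping a variable name to the value
    observed at one point of a run (an assumed, simplified trace format).
    """
    bounds = {}
    for snapshot in traces:
        for name, value in snapshot.items():
            low, high = bounds.get(name, (value, value))
            bounds[name] = (min(low, value), max(high, value))
    return bounds


def check(bounds, name, value):
    """Return True if `value` lies within the bounds learned for `name`."""
    low, high = bounds[name]
    return low <= value <= high


# Hypothetical fault-free trace of a small loop.
golden_traces = [
    {"i": 0, "total": 0},
    {"i": 1, "total": 5},
    {"i": 2, "total": 12},
]
detectors = derive_range_detectors(golden_traces)
```

A value such as `total = 4096` would fail `check(detectors, "total", 4096)`, while any value between 0 and 12 passes. Range checks of this kind can miss errors that stay inside the learned bounds.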

Position Paper – ToleRace: Tolerating and Detecting Races

Rahul Nagpal, Karthik Pattabiraman, Darko Kirovski and Benjamin Zorn, Second Workshop on Software Tools for Multi-core Systems (STMCS), 2007.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: This paper introduces ToleRace, a software tool that increases the reliability of multi-threaded programs by tolerating or detecting race conditions. ToleRace modifies the implementation of critical sections at runtime to provide two benefits. First, it allows programs with certain classes of races to operate as though the race did not exist. Second, it allows programmers to detect, probabilistically, many of the remaining races when they happen, with low performance overhead. ToleRace tolerates and detects races by judiciously duplicating shared data inside a critical section, thereby providing an illusion of atomicity when the shared data is updated. Our early experiments reveal that the performance overhead of ToleRace is considerably lower than that of existing dynamic race detection tools.
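The idea of duplicating shared data inside a critical section can be sketched as follows. This is a simplified, single-threaded simulation and not the tool's implementation: the `Shared` class, the `critical_section` helper, and the `interfere` hook (which stands in for an unsynchronized write by another thread) are all hypothetical names introduced for illustration.

```python
class Shared:
    """A shared variable that threads would normally guard with a lock."""

    def __init__(self, value):
        self.value = value


def critical_section(shared, update, interfere=None):
    """Run `update` on a private copy of the shared value.

    Returns True if the shared value changed behind our back while we
    were inside the critical section (a detected race), else False.
    """
    snapshot = shared.value   # reference copy taken on entry
    working = snapshot        # private working copy used by the section
    working = update(working)

    if interfere is not None:
        interfere(shared)     # simulated racy write from another thread

    raced = shared.value != snapshot  # detection: compare with the entry copy
    shared.value = working            # tolerance: publish the private result
    return raced


counter = Shared(10)
clean = critical_section(counter, lambda v: v + 1)
racy = critical_section(
    counter,
    lambda v: v + 1,
    interfere=lambda s: setattr(s, "value", 99),
)
```

In this sketch the racy write of 99 is both reported (`racy` is `True`) and overwritten, so the counter ends at 12, as if the critical section had run atomically. Detection is not guaranteed in general: an interfering write that happens to store the same value as the entry copy goes unnoticed.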

Processor-level Selective Replication

Nithin Nakka, Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar Iyer, Workshop on Silicon Errors in Logic - System Effects (SELSE), 2006.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: Even though replication has been widely used in providing fault tolerance, the underlying hardware is unaware of the application executing on it. The application cannot choose to use redundancy for a specific code section and run in a normal, unreplicated mode for the rest of the code. In this paper we propose Processor-level Selective Replication, a mechanism to dynamically configure the degree of instruction-level replication according to the application's demands. The application can choose to replicate only the code sections that are critical to its crash-free execution, which reduces the performance impact of replication. It is also known that many processor-level faults do not lead to failures observable in the application outcome, so selective replication also decreases the number of false positives.
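The paper describes a processor-level mechanism; the sketch below is only a software analogy of the selective part, assuming a hypothetical `replicated` decorator that the application applies to the functions it considers critical. Marked functions run twice and the results are compared, while everything else runs once.

```python
import functools


class ReplicaMismatch(Exception):
    """Raised when two executions of a replicated section disagree."""


def replicated(fn):
    """Run `fn` twice and compare the results (duplicate execution)."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        first = fn(*args, **kwargs)
        second = fn(*args, **kwargs)
        if first != second:
            raise ReplicaMismatch(fn.__name__)
        return first

    return wrapper


@replicated
def update_balance(balance, amount):
    # Critical section: a silent error here would corrupt program state.
    return balance + amount


def format_report(balance):
    # Non-critical section: runs once, with no replication overhead.
    return f"balance={balance}"


new_balance = update_balance(100, 25)
report = format_report(new_balance)
```

Only `update_balance` pays the cost of the second execution. This analogy assumes the replicated function is deterministic and free of side effects, a restriction that applies to this software sketch.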

FPGA Hardware Implementation of Statically Derived Error Detectors

Peter Klemperer, Shelley Chen, Karthik Pattabiraman, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Workshop on Dependable and Secure Nanocomputing (WDSN), 2007.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: Previous software-only error detection techniques have provided high-coverage, low-latency detection but suffer significant performance overheads with a large percentage of benign detections. This paper presents an FPGA hardware implementation of application-aware data error detectors. The detectors are automatically derived at compile time and executed in hardware at runtime, minimizing the performance overhead. We implement the static detectors using the Reliability and Security Engine, which provides a standard interface for developing reliability and security hardware modules. An initial, proof-of-concept model shows that there is only a 2% performance penalty when the detectors are implemented in hardware.

Critical Variable Recomputation for Transient Error Detection

Karthik Pattabiraman, Zbigniew Kalbarczyk and Ravishankar Iyer, Workshop on Silicon Errors in Logic – System Effects (SELSE), 2007.
[ PDF File | Talk ]

This paper is superseded by the following conference paper.

Abstract: This paper presents a technique to derive and implement error detectors to protect an application from data errors. The error detectors are derived automatically using compiler-based static analysis from the backward program slice of critical variables in the program. Critical variables are defined as those that are highly sensitive to errors, and deriving error detectors for these variables provides high coverage for errors in any data value used in the program. The error detectors take the form of checking expressions and are optimized for each control flow path followed at runtime. The derived detectors are implemented using a combination of hardware and software.
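A checking expression of the kind described above can be illustrated with a hand-written sketch. In the paper the detectors are derived automatically by the compiler; here the backward slice of the critical variable is simply written out by hand, and the function `scaled_sum`, the exception `DataErrorDetected`, and the `inject_fault` hook are hypothetical names used only for this example.

```python
class DataErrorDetected(Exception):
    """Raised when a critical variable disagrees with its recomputed value."""


def scaled_sum(a, b, inject_fault=None):
    # Original computation; `critical` is the critical variable.
    t = a + b
    critical = t * 2

    # Hypothetical hook that corrupts the value, standing in for a
    # transient hardware error between computation and use.
    if inject_fault is not None:
        critical = inject_fault(critical)

    # Checking expression: recompute the critical variable from the
    # inputs in its backward slice (here, `a` and `b`) and compare.
    expected = (a + b) * 2
    if critical != expected:
        raise DataErrorDetected(f"critical={critical}, expected={expected}")

    return critical


result = scaled_sum(3, 4)
```

Calling `scaled_sum(3, 4, inject_fault=lambda v: v ^ 1)` flips the low bit of the critical variable and raises `DataErrorDetected`. The check cannot catch an error that corrupts `a` or `b` before both computations read them, since the recomputation would then reproduce the same wrong value.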