EECE 513: Fault-tolerant Digital Systems

This is a graduate class that I offered in Fall 2010. You can find the latest offering here.


Logistics

Class: Tue/Thu, 10:30 AM to 12:00 PM in MCLD 207
Office hours: Wednesday, 3 to 4 PM in KAIS 4048

Course Synopsis

Our daily lives are becoming increasingly dependent on computer systems, from small embedded computers to large-scale data centers. Any disruption or malfunction of these systems can have devastating consequences for society as a whole. The reliability and availability of these systems are thus essential for our quality of life and for the smooth functioning of society. Therefore, it is important to build computer systems that operate correctly in the face of errors and failures.

This course focuses on the design of fault-tolerant and reliable computer systems. In particular, we will attempt to understand the root causes of faults in computer systems and their impact. We will study both traditional and cutting-edge techniques for providing fault tolerance and error resilience. Finally, we will explore practical applications of these techniques in the context of real systems.

An important thread that runs through the course is the evaluation of fault-tolerant systems. To this end, we will study techniques ranging from analytical modeling to empirical validation. The assignments will give you hands-on exposure to cutting-edge tools and techniques for dependability evaluation, and will prepare you for the final project. You are encouraged (but not required) to work on a project related to your research interests. The final project constitutes a significant part of the grade.


Assessment

| Weight | Component | Comments |
|---|---|---|
| 10% | Assignment 1 | Due in early October |
| 10% | Assignment 2 | Due in early November |
| 10% | Assignment 3 | Due in late November |
| 50% | Final Project | Project proposal + final report + presentation |
| 10% | Paper reviews | Roughly 2-3 papers every other week |
| 10% | Class participation | |


Learning Objectives

The goal of this course is to give you a sound footing in dependability techniques and their evaluation. At the end of the course, you should be able to (1) design highly dependable systems and rigorously justify the design trade-offs you make, and (2) evaluate the dependability of real-world systems using state-of-the-art tools and techniques.


Topics Covered


This is a tentative list of the topics to be covered in the course; it is subject to change.

Some slides are based on Prof. Saurabh Bagchi’s slides for “Fault Tolerant Computer System Design” (ECE 695B) at Purdue University. Used with permission.

| Topic | Lectures | Sub-topics |
|---|---|---|
| Introduction and Overview | 3 | Introduction to the course, basic concepts, sources of faults in computer systems |
| Modeling and Evaluation - 1 | 2 | Probability review and discrete probability, continuous probability and TMR |
| Hardware fault-tolerance | 2 | Architectural techniques |
| Modeling and Evaluation - 2 | 2 | Markov processes, Stochastic Activity Networks |
| Software fault-tolerance | 3 | N-version programming, recovery blocks, robust data structures and process pairs |
| Modeling and Evaluation - 3 | 2 | Fault injection: techniques and tools, formal methods |
| Parallel and Distributed systems | 4 | Checkpointing and recovery, Byzantine fault-tolerance and Paxos |
| Case Studies | 2 | Stratus and AT&T systems |


Textbooks

There is NO required textbook. However, the following books are recommended:

  1. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems – Design and Evaluation, 3rd edition, 1999, A.K. Peters, Limited.
  2. K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.


Paper Readings


NOTE: The paper-reading classes (starting on Sep 23) will run as follows.

  1. Each paper-reading class will discuss approximately two papers. Each class will have a discussion leader.
  2. By noon on the day before the discussion session, each of you should email me your reviews as PDF files.
  3. The discussion leader will summarize the papers as well as the reviews during the class. This counts for class participation marks.

| Topic | Date | Papers assigned | Leader |
|---|---|---|---|
| Logic-level techniques | Sep 23 | 1. Robust System Design with Built-in Soft-Error Resilience (ignore the sidebars); 2. Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies | Alex Brant (Slides) |
| Architectural techniques | Sep 30 | 1. A Fault-tolerant approach to micro-processor design; 2. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores | Joydip Das (Slides) |
| O.S.-level techniques | Oct 14 | 1. Recovering Device Drivers; 2. Failure resilience for device drivers | Taiseer Sadiq (Slides) |
| Application-level techniques | Oct 21 | 1. Whither Generic Recovery from Application Faults: A Fault-study using open-source software; 2. Rx: Treating Bugs as Allergies - a safe method to survive software failures | Frolin Ocariza (Slides) |
| Checkpointing and recovery | Nov 18 | 1. Checkpointing for peta-scale systems: A look into the future of rollback-recovery; 2. Modeling Coordinated Checkpointing for Large-Scale Systems | Jiesheng Wei (Slides) |
| Agreement Protocols | Nov 25 | 1. Practical Byzantine Fault-tolerance; 2. Paxos made live | Jeff Goeders (Slides) |


Assignments

| Number | Released | Due date | Tools |
|---|---|---|---|
| Assignment 1 | Sep 21 | Oct 7 | None |
| Assignment 2 | Oct 12 | Nov 3 | Mobius tool (password protected), Mobius Manual |
| Assignment 3 | Nov 16 | Nov 30 | SymPLFIED tool (password protected), Tcas application |

(For Assignment 3, you may want to refer to the following website for SimpleScalar installation instructions; thanks, Frolin.)

NOTE: We have obtained an academic license for Mobius, which means that you are free to use Mobius for the purposes of this class. If you need to use it outside of this class, please contact the developers of Mobius for a license.