EECE 513: Design of Fault-tolerant Systems


Logistics:


Class: Tue/Thu 10:30 AM to 12:00 PM at MCLD 207
Office hours: TBD in 4048 KAIS

Learning Objectives

  1. Define common terms such as availability, reliability and dependability.
  2. List common threats to dependability and their mitigation methods.
  3. Solve reliability block diagrams involving series, parallel and network configurations of components. Apply the laws of discrete probability to evaluate systems.
  4. Evaluate simple redundancy schemes using the laws of continuous probability, provided the failures are exponentially distributed.
  5. Apply fault-tolerance techniques such as error-correcting circuits and duplicate execution to the design of hardware systems.
  6. Model systems using Markov models and Stochastic Activity Networks (SANs).
  7. Apply fault-tolerance techniques such as N-version programming and robust data structures to the design of software systems.
  8. Evaluate the reliability of systems through fault injection and simulation.
  9. Apply fault-tolerance techniques such as checkpointing and Byzantine agreement to the design of parallel and distributed systems.
  10. Critique the design of real-world fault-tolerant systems such as Tandem and ESS.
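
To give a flavour of objectives 3 and 4, here is a minimal Python sketch of the standard series, parallel and triple modular redundancy (TMR) reliability formulas. It is not course material: the function names are made up for illustration, and it assumes components fail independently, failure times are exponentially distributed with a constant rate, and the TMR voter is perfect.

```python
import math


def series(reliabilities):
    """Series block: the system works only if every component works."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """Parallel block: the system fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


def exponential_reliability(failure_rate, t):
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def tmr(r):
    """2-out-of-3 majority voting with a perfect voter (assumed)."""
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # Hypothetical numbers: failure rate of 1e-4 per hour, 1000-hour mission.
    r = exponential_reliability(1e-4, 1000.0)
    print("single component:", r)
    print("two in series:   ", series([r, r]))
    print("two in parallel: ", parallel([r, r]))
    print("TMR:             ", tmr(r))
```

Note that under these assumptions TMR improves on a single component only while the component reliability is above 0.5; at exactly 0.5 the two are equal.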


Course Synopsis


This course focuses on the design of fault-tolerant and reliable computer systems. In particular, we will attempt to understand the root causes of faults in computer systems and their impact. We will study both traditional and cutting-edge techniques to provide fault-tolerance and error resilience. Finally, we will explore the practical applications of the techniques in the context of real systems.

An important thread that runs through the course is the evaluation of fault-tolerant systems. To this end, we will study techniques ranging from analytical modeling to empirical validation. The assignments will give you hands-on exposure to cutting-edge tools and techniques for dependability evaluation, and will prepare you for the final project. You are encouraged (but not required) to work on a project related to your research interests. The final project constitutes a significant part of the grade.

Assessment

Weightage | Component | Comments
40% | Project | Four milestones: proposal (5%), midterm report (10%), final presentation (10%) and final report (15%)
30% | Assignments | Three assignments each comprising 10% of the grade
20% | Paper reviews and discussion leading | Approximately two papers each in six sessions (15%), and leading discussion (5%)
10% | Class participation | In both lectures and discussions


Topics Covered

Topic | Dates | Readings/homework
Introduction | Sep 6 | Fill in this form about yourself
Dependability terms and definitions | Sep 8 | Basic concepts and taxonomy of dependable and secure computing
Sources of faults in computer systems | Sep 13 |
Redundancy and quantitative evaluation | Sep 15 and 22 | Review of basic probability, Trivedi chapters 1, 2, 3
State-based models | Sep 27 and 29; Oct 4, 11 and 13 | Trivedi chapters 6, 7 and 8. Mobius manual
Software fault tolerance | Oct 13, Oct 18, Oct 23 |
  1. An experimental evaluation of the assumption of independence in N-version programming, J. Knight and N. Leveson, IEEE Transactions on Software Engineering (TSE), 1986.
  2. Using simplicity to control complexity, L. Sha, IEEE Computer, 2000.
  3. Samurai: Protecting Critical Heap Data in Type-Unsafe Languages, K. Pattabiraman et al., EuroSys 2008.
Fault injection | Nov 1 and 3 |
  1. Fault injection techniques and tools, Hsueh et al.
  2. Emulation of Software Faults: A Field Data Study and Practical Approach, Duraes et al.
  3. NFTAPE: A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors, Stott et al.
  4. SymPLFIED: Symbolic Program Level Fault Injection and Error Detection Framework, Pattabiraman et al.
Guest lecture by Charng-da Lu | Nov 8 |
  1. Assessing fault sensitivity in MPI Applications, C. Lu et al., Supercomputing 2005.
  2. Reliability challenges in large systems, C. Lu et al., Future Generation Computer Systems, 2006.
Distributed systems | Nov 15 |
  1. The Byzantine generals problem, L. Lamport et al., ACM TOPLAS 1982.
  2. The part-time parliament, L. Lamport, ACM TOCS 1998.
  3. Paxos made simple, L. Lamport, ACM SIGACT News (Distributed Computing Column), 2001.
Architectural fault tolerance | Nov 22 |


Textbooks

There is NO required textbook. However, the following books are recommended:

  1. D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems – Design and Evaluation, 3rd edition, 1999, A.K. Peters, Limited.
  2. K. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edition, 2001, John Wiley & Sons.


Paper Readings

Sep 22: Fault Sources
  Papers:
    1. Why do Internet Services Fail and what can be done about it?, Oppenheimer et al., USITS 2003. (reviews)
    2. A large-scale study of failures in high-performance computing systems, Schroeder et al., DSN 2006. (reviews)
  Leaders: 1. Hootan Rashtian, 2. Farid Tabrizi

Oct 6: Hardware fault tolerance
  Papers:
    1. The StageNet Fabric for Constructing Resilient Multi-core Systems, S. Gupta et al., MICRO 2008. (reviews)
    2. Configurable isolation: Building High Availability Systems with Commodity Multi-core processors, N. Aggarwal et al., ISCA 2007. (reviews)
  Leaders: 1. Bo Fang, 2. Xiping Hu

Oct 20: Software fault tolerance – 1
  Papers:
    1. DieHard: probabilistic memory safety for unsafe languages, E. Berger et al., PLDI 2006. (reviews)
    2. Rx: Treating Bugs As Allergies - A Safe Method to Survive Software Failures, F. Qin et al., SOSP 2005. (reviews)
  Leaders: 1. Majid Dadashi, 2. Jin Hu

Oct 27: Software fault tolerance – 2
  Papers:
    1. Data Structure Repair Using Goal-Directed Reasoning, Demsky et al., ICSE 2005. (reviews)
    2. Automatically Finding Patches Using Genetic Programming, W. Weimer et al., ICSE 2009. (reviews)
  Leaders: 1. Anna Thomas, 2. Jacques Clapach

Nov 10: Parallel Systems
  Papers:
    1. Checkpointing for Peta-Scale Systems: A look into the future of Rollback Recovery, Elnozahy and Plank, TDSC 2004. (reviews)
    2. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources, Z. Chen et al., IPDPS 2006. (reviews)
  Leaders: 1. Hao Yang, 2. Kan Chen

Nov 17: Distributed Systems
  Papers:
    1. Paxos made live: An engineering perspective, T. Chandra et al., PODC 2007. (reviews)
    2. Upright Cluster Services, A. Clement et al., SOSP 2009. (reviews)
  Leaders: 1. Prabhjot Singh, 2. Richard Chen

Nov 24: Case Studies
  Papers:
    1. Commercial Fault Tolerance: a tale of two systems, W. Bartlett et al., TDSC 2004. (reviews)
    2. Chameleon: a software infrastructure for adaptive fault tolerance, Z. Kalbarczyk et al., TPDS 1999. (reviews)
  Leaders: 1. Mina Raymond, 2. Karthik Pattabiraman


Announcements


This is the place for course announcements. Please check for updates.

  • Sep 5th: Our first class will be on Sep 6 at 10:30 AM. See you there.
  • Sep 8th: Please read chapters 1-2 of the Trivedi book (see textbooks) for next Thursday’s class. Chapters 3 and 4 are optional reading.
  • Sep 13th: Project proposal documents (2 pages) are due Sep 29th. Please set up a time to talk with me about your project on or before Sep 22nd, or come to my office hours.
  • Sep 14th : The papers to be discussed on Sep 22nd have been posted. Instructions on doing the reviews and leading the discussions are posted. Failure to follow the instructions can result in you losing points.
  • Sep 17th : Examples of projects are posted here – these are only examples. Please define your own project based on your interests and after discussion with me.
  • Sep 21st : The class’s reviews for the first two papers have been posted. Please look at them before class tomorrow.
  • Sep 22nd: Two clarifications about review submission. First, if you’re the discussion leader of a session, you do not need to submit a review for either paper in that session. Second, when calculating the total marks for the reviews, I’ll automatically drop your lowest-scoring session and count the rest. So if you don’t turn in reviews for one session, that session will count as your lowest score and be dropped.
  • Sep 22nd : Assignment 1 has been posted here. It is due on Oct 13th, Thursday, in class. You are expected to work on it individually and turn in your solutions in written form.
  • Sep 28 : The papers to be discussed on Oct 6th have been posted. Please turn in your reviews by Oct 5th noon. Update (Oct 5): The reviews of the class have been posted.
  • Oct 6: I’ve posted a password-protected link to the Mobius tool here. Mobius is a tool for modeling systems through Stochastic Activity Networks (SANs). We will use Mobius for our second assignment. We have an academic license for using Mobius in this class. Do not distribute Mobius outside the class.
  • Oct 12: I’ve posted the papers to be discussed on Oct 20th. Reviews are due by noon on Oct 19th. Update (Oct 19) : Reviews have been posted.
  • Oct 12: Assignment 2 has been posted here. It is due on Nov 3rd, Thursday in class. You will need to install the Mobius tool for solving this assignment (see Oct 6th’s announcement).
    Update (Oct 27th): Assignment due date postponed to Nov 8, 2011.
  • Oct 13: Project midterm reports are due in class on Nov 8th, 2011. The report must contain the motivation, related work and experimental methodology you plan to use in your project, and must be no more than six pages long, including references. The format is the IEEE Computer Society double-column format, available here. Note that each group needs to submit only one report.
    Update (Oct 27th): Midterm reports due date postponed to Nov 10th, 2011.
  • Oct 19: The papers for discussion on Oct 27th have been posted. Reviews are due by noon on Oct 26.
    Update (Oct 26): The reviews for the papers have been posted.
  • Oct 30: Many of you have reported trouble running Mobius under Windows. For this reason, I suggest that you use Mobius on Linux or Mac OS X only.
  • Oct 31: We will have a guest lecture on Nov 8th at the regular class time. The speaker will be Charng-da Lu from the Center for Computational Research, SUNY Buffalo. The lecture will be in KAIS 2020.
  • Nov 2: The papers for discussion on Nov 10th have been posted. Reviews are due by noon on Nov 9.
    Update (Nov 9): The reviews have been posted.
  • Nov 7 : Assignment 3 has been posted here. It is due on December 1st, in class. You’ll also need to download the SymPLFIED framework and the tcas program for this assignment. Please do not distribute SymPLFIED outside this class.
  • Nov 9 : The papers for discussion on Nov 17th have been posted. Reviews are due by noon on Nov 16. Update (Nov 16): The reviews have been posted.
  • Nov 16: There will be a project presentation session for this class on December 8th, from 9 AM to noon. This takes the place of the final exam; there is no separate exam. Each group will present their project in this session (details to follow). You are expected to attend and participate in the discussion for the entire session. Please let me know as soon as possible if you cannot make any part of it.
  • Nov 16 : There will be no class on Nov 29 and Dec 1. Please use the time to work on your projects. You’ll need to hand in assignment 3 in my office on Dec 1st before noon.
  • Nov 16: We will have a paper discussion session on Nov 24. The papers have been posted. Reviews are due by noon on Nov 23rd. Update (Nov 23): The reviews have been posted.
  • Nov 17 : Assignment 3 has some corrections. First, when you execute normal.maude in phase 2, you don’t need to specify the scripts subdirectory in its path because you’re already in that directory (thanks to Anna for pointing this out). Also, Mina pointed out this useful link which explains how to install the gcc cross-compiler for SimpleScalar.
  • Nov 22: Because many of you said you had difficulties installing SimpleScalar-gcc, I’ve decided to make available the tcas.maude file, which you need to run with SymPLFIED to get the results for Assignment 3. In effect, this means that you only need to do phase 3 of Assignment 3. I’m also extending the assignment’s due date to Dec 8th, 9 AM.
    Update: You also need to copy this file into the tcas directory before the experiment.
  • Nov 22 : As discussed in class today, the final project reports are due by Dec 20th at noon. These should be at most 10 pages in IEEE Computer Society double column format (same as the midterm reports). You need to email these to me in PDF form with the subject line – EECE513: Project report.
  • Nov 24 : Today was the last class in this course. There will be no class next week – please work on your projects during this time. I really enjoyed teaching you all. I’d appreciate it if you can take a few minutes to complete the teaching evaluations for the class. Also, feel free to send any suggestions for improvement directly to me.
  • Nov 30 : This is a reminder that we will have the project presentations on Dec 8th from 9 AM to Noon. The presentations will be held in KAIS 4018. Each group will have 15 mins to present their work, at the end of which we’ll have time for questions (we’ll adhere strictly to the 15 minute constraint, so please practice your talk).
    I’ll provide coffee and light refreshments for the meeting. Please arrive at 8:45 AM to give yourself time to settle down. I’ll call upon the groups in random order, and if you’re not there at that time, you’ll receive a 0 for the presentation.
  • Dec 5: To minimize switching times during the presentations on Dec 8th, please email me your slides in PPT or PDF format by 6 AM on Dec 8th. I’ll load them onto my laptop before the meeting. If you absolutely cannot do this, I suggest you show up at 8:45 AM sharp with a memory stick containing the slides.
