CSI 5134 Fault Tolerance (Fall 2012)

  • Professor

Dr. Amiya Nayak
Room: SITE 5001 (ext. 2165)
E-Mail: nayak@uottawa.ca
Office Hours: Mondays 10:00-11:00

  • Lectures

Fridays       17:30-20:30    KED  B005

  • Prerequisites

Basic courses in Digital Logic, Computer Architecture, and Probability.

  • Course Objectives

This course covers hardware and software techniques for fault tolerance. Topics include modeling and evaluation techniques; error-detecting and error-correcting codes; module- and system-level fault detection mechanisms; design techniques for reliable, fault-tolerant, and fail-safe systems; software fault tolerance through recovery blocks and N-version programming; algorithm-based fault tolerance; checkpointing and recovery techniques; and a survey of practical fault-tolerant systems.

  • Recommended Reading

1)  I. Koren, C.M. Krishna, Fault-Tolerant Systems, Elsevier, 2007.
2)  M.L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis and Design, Wiley, 2002.
3)  D.K. Pradhan (ed.), Fault Tolerant Computer System Design, Computer Science Press, 2003.
4)  D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems: Design and Evaluation, A.K. Peters Limited, 1998.
5)  M.R. Lyu (ed.), Software Fault Tolerance, Wiley, 1995.

  • Marking Scheme

 

Assignments    20%
Project        30%
Final Exam     50%

  • Course Outline


Week 1: Fault Tolerance Terminology, Fault Classification and Models
History of fault-tolerant computing; causes and characteristics of faults, fault models.

Weeks 2-3: Reliability, Availability and Maintainability Analysis
Quantitative methods and reliability models such as combinatorial models and Markov models.

Weeks 4-5: Fault Tolerance Techniques
Various redundancy schemes and their evaluation, error-detecting and error-correcting codes.

Weeks 6-7: Fault Tolerance Techniques (contd.)
Design for testability, module- and system-level fault detection, test pattern generation algorithms, system diagnosis.

Weeks 8-9: Software Fault Tolerance
Software fault tolerance schemes such as recovery blocks, N-version programming, and N self-checking programming.

Weeks 10-11: Software Fault Tolerance (contd.)
Robust data structures, algorithm-based fault tolerance, checkpointing and recovery techniques, survey of practical fault-tolerant systems.

Week 12: Class presentations.