NASA Langley Formal Methods

The Design Error Problem

A computer system may fail to perform as expected because a physical component fails or because a design error is uncovered. For a system to be both ultra-reliable and safe, both of these potential causes of failure must be handled.

Established techniques exist for handling physical component failure; these techniques use redundancy and voting. The reliability assessment problem in the presence of physical faults is based on Markov modeling techniques and is well understood.
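
As a rough illustration of the redundancy-and-voting idea, here is a minimal Python sketch (not the Markov-based assessment methodology itself, which also models repair and reconfiguration) of the classic closed-form reliability of a triple modular redundant (TMR) system with a 2-of-3 majority voter, assuming independent components with a constant failure rate:

```python
# Minimal sketch: reliability of a triple modular redundant (TMR) system
# with a 2-of-3 majority voter, assuming independent component failures
# with constant failure rate lam (exponential lifetime model).
import math

def component_reliability(lam: float, t: float) -> float:
    """Probability that a single component survives a mission of t hours."""
    return math.exp(-lam * t)

def tmr_reliability(lam: float, t: float) -> float:
    """The system survives while at least 2 of 3 components are good:
    R_tmr = 3*R^2 - 2*R^3, where R is single-component reliability."""
    r = component_reliability(lam, t)
    return 3 * r**2 - 2 * r**3

# Illustrative numbers (not from the text): lam = 1e-4/hour, 10-hour mission.
print(component_reliability(1e-4, 10.0))  # ~0.99900  (simplex)
print(tmr_reliability(1e-4, 10.0))        # ~0.999997 (TMR masks one failure)
```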

The design error problem is a much greater threat. Unfortunately, no scientifically justifiable defense against this threat is currently used in practice. Three basic strategies are advocated for dealing with design errors:

  1. Testing (lots of it)
  2. Design Diversity (i.e., software fault tolerance: N-version programming, recovery blocks, etc.)
  3. Fault Avoidance (e.g., formal specification/verification, automatic program synthesis, reusable modules)

The problem with life testing is that, in order to measure ultra-reliability, one must test for exorbitant amounts of time. For example, to measure a 10^{-9} probability of failure for a 1-hour mission, one must test for more than 114,000 years.
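
A back-of-the-envelope check of that figure (a sketch of the arithmetic only; the careful statistical argument is in reference 1 below):

```python
# Back-of-the-envelope check of the life-testing claim: observing a
# failure probability of 10^-9 per 1-hour mission requires on the order
# of 1/10^-9 = 10^9 failure-free mission hours of testing.
HOURS_PER_YEAR = 24 * 365  # ignoring leap years

failure_prob_per_hour = 1e-9
required_hours = 1.0 / failure_prob_per_hour
required_years = required_hours / HOURS_PER_YEAR
print(f"{required_years:,.0f} years of testing")  # ~114,155 years
```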

Many advocate design diversity as a means to overcome the limitations of testing. The basic idea is to use separate design/implementation teams to produce multiple versions from the same specification. Then, non-exact threshold voters are used to mask the effect of a design error in one of the versions. The hope is that the design flaws will manifest errors independently or nearly so.
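
To make the voting step concrete, here is a hypothetical sketch of a non-exact threshold voter; the function name, the tolerance parameter eps, and the clustering strategy are illustrative assumptions, not a production voter design:

```python
# Hypothetical sketch of a non-exact (threshold) voter for N-version
# programming: version outputs "agree" when they lie within a tolerance
# eps, and the voter returns a value backed by a strict majority.
from typing import Optional, Sequence

def inexact_majority_vote(outputs: Sequence[float], eps: float) -> Optional[float]:
    """Return the median of the largest cluster of outputs lying within
    eps of some candidate output, or None when no strict majority agrees."""
    best_cluster = []
    for candidate in outputs:
        cluster = [x for x in outputs if abs(x - candidate) <= eps]
        if len(cluster) > len(best_cluster):
            best_cluster = cluster
    if 2 * len(best_cluster) > len(outputs):  # strict majority required
        return sorted(best_cluster)[len(best_cluster) // 2]
    return None  # no agreement: signal a (possibly common-mode) fault

# Version 2 harbors a design error; the other two out-vote it.
print(inexact_majority_vote([1.0001, 5.7000, 0.9999], eps=0.01))  # -> 1.0001
```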

By assuming independence, one can obtain estimates of reliability at ultra-reliable levels even though the individual versions have failure rates on the order of 10^{-4}. Unfortunately, the independence assumption has been rejected at the 99% confidence level in several experiments on low-reliability software.
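
To see why the independence assumption is so attractive, consider the arithmetic it licenses (illustrative: p matches the 10^{-4} figure above, and a 2-of-3 arrangement is assumed):

```python
# Why independence is tempting: with three versions that each fail with
# probability p = 1e-4 per mission, and failures assumed independent,
# a 2-of-3 voter fails only when two or more versions fail together.
p = 1e-4
p_system = 3 * p**2 * (1 - p) + p**3  # exactly two fail, or all three fail
print(p_system)  # ~3.0e-8: near ultra-reliable levels, "for free"
```

The catch, as the experiments just cited show, is that this estimate is only as trustworthy as the independence assumption it rests on.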

Furthermore, the independence assumption can never be validated for high-reliability software because of the exorbitant test times required. If one cannot assume independence, then one must measure the correlations between versions. This is infeasible as well: for the system to be ultra-reliable, the correlations must themselves lie in the ultra-reliable region, so measuring them requires as much testing time as life-testing the system itself. Therefore, it is not possible, within feasible amounts of testing time, to establish that design diversity achieves ultra-reliability. Consequently, design diversity can create an "illusion" of ultra-reliability without actually providing it.

From this analysis, we conclude that formal methods currently offer the most promising approach to the design error problem. Because the often-quoted reliability of 1 - 10^{-9} is beyond the range of quantification, we have no choice but to develop life-critical systems in the most rigorous manner available to us: the use of formal methods.


References

  1. Butler, Ricky W.; and Finelli, George B.: The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software. IEEE Transactions on Software Engineering, vol. 19, no. 1, Jan. 1993, pp. 3-12.

  2. Holloway, C. Michael: Why Engineers Should Consider Formal Methods. Proceedings of the 16th AIAA/IEEE Digital Avionics Systems Conference, Irvine, CA, October 26-30, 1997, Volume 1, pp. 1.3-16 to 1.3-22. (Takes a very different approach than that given on these pages.)

 
