CHAPTER 10



SOFTWARE SAFETY IN COMPUTING: A REVIEW AND ANALYSIS METHODS

B. S. DHILLON

Faculty of Engineering

University of Ottawa

Ottawa, Ontario, K1N 6N5

CANADA

Abstract: This paper presents various aspects of software safety, including facts and figures, software reliability versus software safety, computer-related myths, ways in which software can cause hazards, ten software hazard analysis methods, and practical software safety design-related guidelines.

Keywords: Software, Safety, Computers, Analysis methods, Hazard, Risk

1. Introduction

Today computers are used in a wide range of applications including domestic, transportation, defense, and space exploration. They are made up of both hardware and software components, and nowadays, in contrast to the era of first-generation computers, much more money is spent on developing computer software than hardware. For example, in 1955 software accounted for approximately 20% of the total computer cost, and by the mid-1980s the percentage had increased to around 90% [1]. Furthermore, the amount of money spent annually on software in the United States alone is estimated to be at least $300 billion, and the size of computer programs has increased from around 100 lines in the early 1950s to multi-million lines in the late 1990s [2, 3].

Needless to say, software has become a driving force in computing, along with concerns for its safety. In many applications, its proper functioning is so important that a simple malfunction may result in a large-scale loss of lives. For example, Paris commuter trains serving around 800,000 passengers daily depend on software signaling [4].

2. Facts and Figures and Examples of Software Error

Some of the facts, figures, and examples directly or indirectly related to software safety are as follows:

• The software industry in the United States is worth at least $300 billion annually [5].

• Over 70% of the organizations involved in software development develop their software by using ad hoc and unpredictable methods and techniques [6].

• A program of approximately 3,200 words requires an average code-plus-debug time of roughly 178 hours from a typical programmer [7].

• A software error caused a radioactive heavy water spill at a Nuclear Power Generating Station owned by Ontario Hydro, Canada [8].

• A software error in a French meteorological satellite resulted in the destruction of 72 of the 141 weather balloons [9].

• Two patients died and a third was severely injured due to software errors in a computer-controlled therapeutic radiation machine known as Therac 25 [10 – 13].

• A software error shut down a Patriot missile system during the Gulf War. Consequently, an enemy SCUD missile killed 27 persons and wounded 97 [8].

• Two persons died in incidents involving heart pacemakers due to software errors [10].

• A safety-related software issue was responsible for an instrument failure causing the SAAB JAS39 Gripen fighter plane to crash [8].

• A software error resulted in the failure of the computer-aided dispatch system of the London Ambulance Service by sending the wrong ambulance to an incident [14].

• An error in the blood data bank program permitted over 1,000 pints of blood possibly contaminated with Acquired Immune Deficiency Syndrome (AIDS) to be distributed [8].

3. Software Reliability versus Software Safety and Security versus Safety

Often, safety and reliability are equated, particularly with respect to software. However, there is an increasing trend to separate the two concepts. Broadly speaking, reliability-related requirements are concerned with making a system failure free, and safety-related requirements with making it mishap free. More specifically, reliability is concerned with every possible software error, whereas safety is concerned only with those errors that may lead to actual system hazards. It is to be noted that not all software errors lead to safety problems, and not all software functioning as per specification is safe [15, 16]. As per Ref. [17], severe mishaps have occurred while something was functioning exactly as per the requirement (i.e., without failure).

All in all, it may be said that there is a definite need for a distinctly different approach to safety-related problems; more specifically, for an approach that is complementary to standard reliability techniques and focuses on the malfunctions having the most serious consequences [18].

Although safety and security are not exactly the same, they are related through their negative-based control policies. A secure system does not allow control/information to be passed to unauthorized persons. More specifically, security is basically concerned with protecting authority and the information in question. In contrast, a safe system prevents the occurrence of injury through devices under its control and by protecting life and property [19].

Some of the similarities between safety and security are as follows [16]:

• Both are concerned with risks or threats: safety with threats to life or property, and security with threats to privacy or national security.

• Often, both safety and security involve negative requirements that may, from time to time, conflict with key functional or mission requirements.

• Both safety and security involve global system requirements that are difficult to deal with outside of the system context.

4. Computer Related Myths, Software Risk Increasing Ways and Categories, and Software Hazard Causing Ways

There are many myths associated with computers. Some of these, along with their clarifications or realities, are as follows [20]:

• Myth 1. It is easy to make changes to software.

Reality. Although it is easy to make changes to software, making changes without introducing errors is a very challenging task.

• Myth 2. Improvement in software reliability will lead to better safety.

Reality. Software reliability can be improved by eradicating software errors totally unrelated to system safety. It simply means that reliability is increased while safety remains at the same level.

• Myth 3. Reusing software improves safety.

Reality. Past experiences indicate that the reuse of proven software components can improve reliability, but with little or no effect on safety.

• Myth 4. Testing software can eradicate all the existing errors.

Reality. The shortcomings associated with software testing are well known. Furthermore, the large number of states of realistic software makes exhaustive testing impossible to perform.

• Myth 5. The application of computers lowers risk over mechanical systems.

Reality. Although computers possess the potential to reduce risk, not all applications of computers achieve this potential.

There are basically three ways, as shown in Fig. 1, in which software can increase risk [21]. Risks from which losses can result may be grouped under the following three categories [22]:

• Category I. This includes situations where workers are poorly trained to perform assigned tasks, projects are poorly planned or funded, the required manuals are not available, and so on. An example of such a situation is using a C programmer without adequate experience or training to develop Ada code under stringent schedule constraints.

• Category II. This includes situations where a number of environmental conditions may impact a software professional’s ability to perform his/her job effectively. Some examples of these conditions are screen glare, the incorrect software development tools for the job, inadequate computer hardware or memory, and inadequate lighting.

• Category III. This includes situations where there is no compliance with policies and procedures, standards are inadequate, standards are not used, etc. Two examples of such situations are installing a software patch into an operational system without proper testing of the patch, and writing software code prior to adequately completing the requirements definition.

Software can cause/contribute to a hazard in various ways as shown in Fig. 2 [16, 19].

5. Software Hazard Analysis Methods

Over the years, a large number of methods have been developed to perform various types of software hazard analysis. The most useful software hazard analysis methods are described below [23 – 26].

5.1 Software Sneak Circuit Analysis

This method is employed to highlight software logic that causes undesired outputs. More specifically, program source code is converted to topological network trees, and six basic patterns are used to model the code: return dome, single line, parallel line, trap, iteration loop, and entry dome. Each software mode is modeled using the basic patterns linked in a network tree flowing from top to bottom. The analyst(s) then ask questions concerning the use and interrelationships of the instructions that are considered components of the structure.

Answers to these questions provide important clues for identifying sneak conditions that may result in undesirable outputs. The analysts search for four basic kinds of software sneaks: wrong timing, presence of an undesired output, a program message that poorly describes the actual condition, and the undesired inhibit of an output.

The clue-generating questions are taken from the topograph representing the code segment. Whenever sneaks are found, the analysts perform investigative analyses to verify that the code does indeed produce the sneaks. Subsequently, the sneaks' impacts are assessed and appropriate corrective measures are recommended.

5.2 Code Walk-Through

This is a useful method for improving the safety and quality of software products and is basically a team effort among professionals such as software programmers, software engineers, program managers, and system safety specialists. Code walk-throughs are an in-depth review of the software under consideration through inspection and discussion of the software functionality. All logic branches and the function of each statement are discussed at significant length. More simply, this process provides a system of checks and balances on the software produced.

The team reviews the software functionality and compares it with the system requirements. This verifies that all software safety-associated requirements are properly implemented, in addition to determining functional accuracy. The method is described in detail in Ref. [26].

5.3 Nuclear Safety Cross-Check Analysis (NSCCA)

This is a comprehensive software safety analysis method originally developed to meet the requirements of the United States Air Force document entitled "Nuclear Surety Design Certification for Nuclear Weapon System Software and Firmware" [27]. The method takes an adversarial approach with the objective of showing, with a high degree of confidence, that the software will not cause an undesirable event. The NSCCA is composed of two main components: technical and procedural. The purpose of the technical component is to ensure that system safety-related requirements are fully satisfied. Similarly, the purpose of the procedural component is to provide effective protection and security for critical software elements.

The technical component analyzes and tests the software under consideration. First, the degree to which each software function affects safety goals is assessed, and then the software is broken down to its lowest-level functions. All of these lowest-level functions are reviewed, and the ones that do not affect critical events are not reviewed again. A criticality matrix is established by plotting software functions against safety objectives, and each matrix cell is assigned an influence rating categorized into three levels: high, medium, or low. Furthermore, each software function is assigned recommendations for applicable evaluation methods.
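As a rough illustration of the criticality matrix idea, the following Python sketch maps hypothetical software functions against hypothetical safety objectives; the function names, objectives, and ratings are assumptions made for illustration only.

# A minimal sketch of a criticality matrix: each cell holds a high/medium/low
# influence rating, as described above. All entries are hypothetical.

functions = ["arming logic", "telemetry formatting", "self-test scheduler"]
objectives = ["no inadvertent launch", "no unauthorized launch enable"]

criticality = {
    ("arming logic", "no inadvertent launch"): "high",
    ("arming logic", "no unauthorized launch enable"): "high",
    ("telemetry formatting", "no inadvertent launch"): "low",
    ("telemetry formatting", "no unauthorized launch enable"): "low",
    ("self-test scheduler", "no inadvertent launch"): "medium",
    ("self-test scheduler", "no unauthorized launch enable"): "low",
}

for func in functions:
    ratings = [criticality[(func, obj)] for obj in objectives]
    print(f"{func:22} {ratings}")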

The procedural component is concerned with security and control measures, in which factors such as background investigations for personnel clearances, facility security, configuration control and document security, and product controls are instituted.

Both of the above components are performed by an independent organization, i.e., one that is totally independent of the software developer. All in all, this is a useful method for assuring that the system software contains no incorrect design, programming, fabrication, or application that may result in unsafe conditions.

5.4 Proof of Correctness

This is another method used in performing software hazard analysis; it decomposes a given program into a number of logical segments, and for each segment input/output assertions are defined. The concerned software professional then verifies that, whenever a segment's input assertion is true before the segment executes, its associated output assertion is true afterwards, and that if all of the input assertions hold, then all output assertions also hold. All in all, this method uses mathematical theorem-proving concepts to verify that the program under consideration is consistent with its associated specifications.
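The following Python sketch illustrates how an input assertion and its associated output assertion pair up for a single logical segment. The segment (a hypothetical routine that clamps a commanded actuator position to a safe range) and its bounds are assumptions introduced only for illustration.

# A minimal sketch of the input/output-assertion idea for one hypothetical segment.

def limit_command(position_cm: float) -> float:
    """Clamp a commanded position to the assumed safe range [0, 100] cm."""
    # Input assertion: the caller supplies a numeric command.
    assert isinstance(position_cm, (int, float)), "input assertion violated"
    # Logical segment under proof.
    result = min(max(position_cm, 0.0), 100.0)
    # Output assertion: whenever the input assertion holds, the output
    # must lie inside the safe operating range.
    assert 0.0 <= result <= 100.0, "output assertion violated"
    return result

if __name__ == "__main__":
    print(limit_command(250.0))  # prints 100.0, the safe upper bound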

5.5 Event-Tree Analysis

This software hazard analysis method models the sequence of events resulting from a single initiating event. When applied to software, the initiating event is taken from a code segment that is safety critical, suspected of error, or suspected of code inefficiencies. Normally, the event-tree analysis (ETA) approach assumes that each event in the sequence is either a success or a failure.

Some of the important factors associated with this method are as follows [28]:

• Usually, ETA is used in performing analysis of more complex systems than the ones handled by the failure modes and effects analysis method.

• It is difficult to incorporate delayed success or recovery events.

• The method always leaves some possibility of missing important initiating events.

• It is an excellent approach for identifying undesirable events that need further investigation using fault tree analysis.

ETA is described in detail in Refs. [28, 29].
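As a minimal illustration of event-tree branching, the following Python sketch enumerates the outcome sequences that follow a single initiating event, assuming hypothetical probabilities and two subsequent events (detection and recovery) that each either succeed or fail.

# A minimal event-tree sketch; all probabilities are assumed values.

P_INITIATE = 0.01    # probability the initiating fault occurs (assumed)
P_DETECT   = 0.95    # probability the fault is detected (assumed)
P_RECOVER  = 0.90    # probability recovery succeeds, given detection (assumed)

sequences = {
    "fault detected and recovered":      P_INITIATE * P_DETECT * P_RECOVER,
    "fault detected, recovery fails":    P_INITIATE * P_DETECT * (1 - P_RECOVER),
    "fault undetected (unsafe outcome)": P_INITIATE * (1 - P_DETECT),
}

for outcome, probability in sequences.items():
    print(f"{outcome}: {probability:.5f}")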

5.6 Failure Modes and Effects Analysis (FMEA)

This is another useful method for performing software hazard analysis. The method was developed in the early 1950s to perform failure analysis of flight control systems [28] and, from time to time, it has also been used to perform software hazard analysis [30]. Basically, the FMEA approach requires listing all possible failure modes of each part and their effects on the associated subsystems, the system, etc. The method is described in Refs. [28, 31].
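The following sketch shows what a small FMEA-style listing might look like for a hypothetical software module; the items, failure modes, effects, and severities are illustrative assumptions, not data from this chapter.

# A minimal, hypothetical FMEA-style worksheet expressed as a list of records.

fmea_worksheet = [
    {"item": "input parser",   "failure mode": "rejects valid sensor frame",
     "effect on system": "stale data used for control", "severity": "major"},
    {"item": "input parser",   "failure mode": "accepts corrupted frame",
     "effect on system": "hazardous actuator command",  "severity": "critical"},
    {"item": "watchdog timer", "failure mode": "not serviced in time",
     "effect on system": "spurious safe shutdown",      "severity": "minor"},
]

for row in fmea_worksheet:
    print(f"{row['item']:15} | {row['failure mode']:30} | "
          f"{row['effect on system']:32} | {row['severity']}")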

5.7 Software Fault Tree Analysis

Fault tree analysis (FTA) is a powerful tool used for performing various types of safety analysis. It was developed by H.R. Watson of Bell Telephone Laboratories in 1962 to analyze the Minuteman Launch Control System from the safety aspect [31]. Software fault tree analysis (SFTA) is an offshoot of FTA and is used to analyze the safety of a software design. The main objective of SFTA is to demonstrate that the logic contained in the software design will not generate system safety-related failures, in addition to determining the environmental conditions that may lead the software to cause a safety malfunction [32].

SFTA proceeds in a similar manner to hardware FTA. More specifically, FTA begins by identifying an undesirable event (i.e., the top event) of a given system. Fault events that could cause the occurrence of the top event are generated and connected by logic operators, usually called OR and AND gates. The OR gate produces an output if one or more of its input fault events occur; in contrast, the AND gate produces an output only if all of its input fault events occur. The construction of a fault tree proceeds by generating fault events successively until the events need not be developed further. More specifically, the generation of a fault tree involves successively asking the question "How could this fault event occur?" A fault tree may simply be described as a logic structure relating the top event to the primary or basic events.
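As a minimal illustration of the gate logic, the following Python sketch evaluates OR and AND gates over assumed, independent basic-event probabilities; the events and numbers are hypothetical.

# A minimal sketch of OR/AND gate evaluation for independent basic events.
import math

def or_gate(probabilities):
    """Output occurs if at least one input event occurs (independent events)."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)

def and_gate(probabilities):
    """Output occurs only if all input events occur (independent events)."""
    return math.prod(probabilities)

# Hypothetical basic events: a software value error AND a failed hardware
# interlock, OR an out-of-range operator command.
p_top = or_gate([and_gate([0.001, 0.01]), 0.0005])
print(f"Top-event probability: {p_top:.6f}")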

SFTA also identifies software-hardware interfaces. Although fault trees for software and hardware are developed separately, they are linked together at their interfaces to permit analysis of the entire system. This is critical because software safety procedures cannot be developed in isolation; they must be considered as an element of total system safety. For example, the occurrence of a software error may result in a mishap only if a hardware or human failure occurs simultaneously.

All in all, although SFTA is a powerful hazard analysis tool, it is an expensive technique to use. FTA is described in detail in Ref. [31].

5.8 Cause and Effect Diagram (CAED)

This method was developed in the early 1950s by K. Ishikawa, a Japanese quality expert. The method is also known as the Ishikawa diagram or "fishbone diagram" because of its resemblance to the skeleton of a fish. More specifically, the right-hand side of the diagram, i.e., the fish head, represents the effect, and the left-hand side contains all the possible causes, which are linked to the central line called the "fish spine".

The cause and effect diagram can be very useful in determining the root causes of safety problems in software. The main steps used in developing a CAED are as follows [33]:

• Establish problem statement.

• Brainstorm to identify all possible causes.

• Establish major cause groups by stratifying into natural categories and the steps of the process.

• Construct the diagram by connecting the identified causes under appropriate process steps and fill in the problem or the effect in the diagram box (i.e., the fish head) on the extreme right-hand side.

• Refine cause categories by asking questions such as listed below.

. What causes this?

. What is the reason for the existence of this condition?

Among the advantages of the cause and effect diagram are that it is useful for identifying the root causes of a problem, for producing relevant ideas, and for presenting an orderly arrangement of theories.
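As a minimal sketch of the resulting structure, the following Python fragment groups hypothetical causes under major cause categories for an assumed effect; the effect, categories, and causes are illustrative only.

# A minimal, hypothetical fishbone-style grouping of causes behind one effect.

effect = "field software fails unsafely"

cause_groups = {
    "people":      ["inadequate training", "unclear responsibilities"],
    "methods":     ["requirements not defined before coding",
                    "patches installed without testing"],
    "tools":       ["wrong development tools for the job"],
    "environment": ["schedule pressure", "inadequate hardware or memory"],
}

print(f"Effect (fish head): {effect}")
for group, causes in cause_groups.items():
    print(f"  {group}: {', '.join(causes)}")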

5.9 Probability Tree Analysis

This method is quite useful for performing task analysis by diagrammatically representing critical human actions and other related events in software development. Often, this approach is used to conduct task analysis in the technique for human error rate prediction (THERP) [34]. Diagrammatic task analysis is represented by the branches of the probability tree. More specifically, the branching limbs of the tree denote each event's outcome (i.e., success or failure), and each branch is assigned an occurrence probability.
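The following Python sketch works through a small probability tree for two sequential human actions in a hypothetical software-release task; the per-action success probabilities are assumed values chosen only for illustration.

# A minimal worked probability tree for two sequential actions.

p_a = 0.98   # action A succeeds (e.g., correct build configuration selected) - assumed
p_b = 0.95   # action B succeeds, given A (e.g., safety checklist completed) - assumed

branches = {
    "A success, B success": p_a * p_b,
    "A success, B failure": p_a * (1 - p_b),
    "A failure":            (1 - p_a),        # tree terminates on failure of A
}

p_task_failure = 1.0 - branches["A success, B success"]
for branch, probability in branches.items():
    print(f"{branch}: {probability:.4f}")
print(f"Overall task failure probability: {p_task_failure:.4f}")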

The probability tree method has many advantages, including simplified mathematical computations, serving as a visibility tool, and the flexibility to incorporate (with some modifications) factors such as emotional stress, interaction stress, and interaction effects.

The method is described in detail in Ref. [34].

5.10 Markov Method

This is a widely used method in general reliability and safety studies, and it can also be used to perform various types of software safety studies. The method is named after a Russian mathematician, Andrei Andreyevich Markov (1856-1922), and is based on the following three assumptions [28, 35]:

• The probability of a transition from one system state to another in the finite time interval Δt is given by λΔt, where λ is the transition rate from one system state to another.

• The probability of more than one transition occurrence in the time interval Δt is negligible (i.e., (λΔt)(λΔt) → 0).

• All occurrences are independent of each other.

The application of this method to a simple software problem is demonstrated through the following example:

Example

Assume that a software system experiences 0.002 safety-related faults per month and their correction rate is 0.004 faults per month. Calculate the probability that the system will fail unsafely in the next six months.

Using the Markov method, we obtain the following equation [28]:

P_{us}(t) = \frac{\lambda}{\lambda + \mu}\left[1 - e^{-(\lambda + \mu)t}\right]    (1)

where

P_{us}(t) is the probability that the software system will fail unsafely at time t.

λ is the software system safety-related fault rate.

μ is the fault correction rate.

Substituting the given values into Equation (1) yields

P_{us}(6) = \frac{0.002}{0.002 + 0.004}\left[1 - e^{-(0.002 + 0.004)(6)}\right] = 0.0118

Thus, there is just over a one percent chance that the system will fail unsafely during the next six months.
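The following Python sketch reproduces the above result directly from Equation (1); the function name is introduced here only for illustration.

# A minimal sketch of the two-state Markov result of Equation (1).
import math

def p_unsafe(t_months: float, fault_rate: float, correction_rate: float) -> float:
    """Probability that the software system has failed unsafely by time t."""
    total = fault_rate + correction_rate
    return (fault_rate / total) * (1.0 - math.exp(-total * t_months))

print(round(p_unsafe(6, 0.002, 0.004), 4))  # 0.0118, as in the example above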

6. Practical Software Safety Design-Related Guidelines

Over the years professionals working in the field of computers have developed various useful software safety design-related practical guidelines. A careful application of these guidelines can help to improve software safety significantly. Some of these guidelines are as follows [36]:

• Remove unnecessary or obsolete code.

• Separate and isolate with care all non-safety-critical software modules from safety-critical modules.

• Prohibit safety-critical software patches throughout the development process.

• Develop software modules for monitoring critical software with respect to hazardous states, errors, faults, or timing problems.

• Incorporate operator validation or authorization for the execution of safety-critical commands.

• Incorporate appropriate provisions for detecting and logging system errors.

• Incorporate a mechanism for ensuring that safety-critical computer software elements and interfaces are under positive control at all times.

• Incorporate the need for a password, along with confirmation, prior to the execution of a safety-critical software module.

• Do not use all 0s or 1s for critical variables.

• Ensure that conditional statements satisfy all possible conditions and are under full software control.

• In situations where the correct sequence of software routines is crucial, make use of an authorizing bit pattern that is set sequentially by each routine in order (a sketch follows this list).

• Develop the software design so that it prevents unauthorized/inadvertent access to and/or modification of the code.

• Initialize spare memory with a bit pattern that, if ever accessed and executed, will direct the software toward a safe state.

• Incorporate a mechanism that causes the system to detect inadvertent jumps within, or into, safety-critical computer software elements and to return itself to a safe state.
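As a minimal sketch of the authorizing bit pattern guideline, the following Python fragment has each routine in a required sequence set its own bit, with the safety-critical action permitted only when the full pattern has been built in order; the routine names, bit assignments, and expected pattern are hypothetical.

# A minimal sketch of a sequentially built authorizing bit pattern.

EXPECTED_PATTERN = 0b111  # init (bit 0), validate (bit 1), arm (bit 2) - assumed

class SequenceGuard:
    def __init__(self):
        self.pattern = 0

    def mark(self, bit: int, required_prior: int) -> None:
        # A routine may set its bit only if all prior routines already ran.
        if self.pattern != required_prior:
            raise RuntimeError("routine executed out of sequence")
        self.pattern |= 1 << bit

    def authorized(self) -> bool:
        return self.pattern == EXPECTED_PATTERN

guard = SequenceGuard()
guard.mark(0, required_prior=0b000)   # initialization routine
guard.mark(1, required_prior=0b001)   # validation routine
guard.mark(2, required_prior=0b011)   # arming routine
print(guard.authorized())             # True only if run in the correct order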

References:

[1] Keene, S.J., Software Reliability Concepts, Annual Reliability and Maintainability Symposium Tutorial Notes, 1992, pp. 1 – 21.

[2] Gibbs, W.W., Software’s Chronic Crisis, Scientific American, September 1994, pp. 86 – 95.

[3] Bennett, K.H., Software Maintenance: A Tutorial, in Software Engineering, edited by M. Dorfman and R.H. Thayer, IEEE Computer Society Press, Los Alamitos, California, 1997, pp. 289 – 303.

[4] Cha, S.S., Management Aspect of Software Safety, Proceedings of the Eighth Annual Conference on Computer Assurance, 1993, pp. 35 – 40.

[5] Hopcroft, J.E., Krafft, D.B., Sizing the U.S. Industry, IEEE Spectrum, December 1987, pp. 58 – 62.

[6] Thayer, R.H., Software Engineering Project Management, in Software Engineering, edited by M. Dorfman and R.H. Thayer, IEEE Computer Society Press, Los Alamitos, California, 1997, pp. 358 – 371.

[7] Sackman, H., Erikson, W.J., Grant, E.E., Exploratory Experimentation Studies Comparing Online and Offline Programming Performance, Communications of the ACM, Vol. 11, 1968, pp. 3 – 11.

[8] Mendis, K.S., Software Safety and Its Relation to Software Quality Assurance, in Handbook of Software Quality Assurance, Edited by G.G. Schulmeyer and J.I. McManus, Prentice Hall, Inc., Upper Saddle River, New Jersey, 1999, pp. 669 – 679.

[9] Anonymous, Blow Balloons, Aviation Week & Space Technology, September 20, 1971, p. 17.

[10] Schneider, P., Hines, M.L.A., Classification of Medical Software, Proceedings of the IEEE Symposium on Applied Computing, 1990, pp. 20 – 27.

[11] Gowen, L.D., Yap, M.Y., Traditional Software Development’s Effects on Safety, Proceedings of the 6th Annual IEEE Symposium on Computer-Based Medical Systems, 1993, pp. 58 – 63.

[12] Joyce, E., Software Bugs: A Matter of Life and Liability, Datamation, Vol. 33, No. 10, 1987, pp. 88 – 92.

[13] Dhillon, B.S., Medical Device Reliability and Associated Areas, CRC Press, Boca Raton, Florida, 2000.

[14] Shaw, R., Safety-Critical Software and Current Standards Initiative, Computer Methods and Programs in Biomedicine, Vol. 44, 1994, pp. 5 – 22.

[15] Ericson, C.A., Software and System Safety, Proceedings of the 5th International System Safety Conference, 1981, pp. III B.1 – III B.11.

[16] Leveson, N.G., Software Safety: Why, What, and How, Computing Surveys, Vol. 18, No. 2, 1986, pp. 125 – 163.

[17] Roland, H.E., Moriarty, B., System Safety Engineering and Management, John Wiley and Sons, New York, 1983.

[18] Leveson, N.G., Software Safety in Computer-Controlled Systems, IEEE Computer, February 1984, pp. 48 – 55.

[19] Friedman, M.A., Voas, J.M., Software Assessment, John Wiley and Sons, New York, 1995.

[20] Leveson, N.G., Safeware, Addison-Wesley Publishing Company, Reading, Massachusetts, 1995.

[21] Leveson, N.G., Shimeall, T.J., Safety Verification of ADA Programs Using Software Fault Trees, IEEE Software, July 1991, pp. 48 – 59.

[22] Fortier, S.C., Michael, J.B., A Risk-Based Approach to Cost-Benefit Analysis of Software Safety Activities, Proceedings of the Eighth Annual Conference on Computer Assurance, 1993, pp. 53 – 60.

[23] Ippolito, L.M., Wallace, D.R., A Study on Hazard Analysis in High Integrity Software Standards and Guidelines, Report No. NISTIR 5589, National Institute of Standards and Technology, U.S. Department of Commerce, Washington, D.C., January 1995.

[24] Hammer, W., Price, D., Occupational Safety Management and Engineering, Prentice Hall, Inc., Upper Saddle River, New Jersey, 2001.

[25] Hansen, M.D., Survey of Available Software-Safety Analysis Techniques, Proceedings of the Annual Reliability and Maintainability Symposium, 1989, pp. 46 – 49.

[26] Sheriff, Y.S., Software Safety Analysis: The Characteristics of Efficient Technical Walk-Throughs, Microelectronics and Reliability, Vol. 32, No. 3, 1992, pp. 407 – 414.

[27] AFR-122-9, Nuclear Surety Design Certification for Nuclear Weapon System Software and Firmware, Department of the Air Force, Washington, D.C., August 1987.

[28] Dhillon, B.S., Design Reliability, CRC Press, Boca Raton, Florida, 1999.

[29] Cox, S.J., Tait, N.R.S., Reliability, Safety, Risk Management, Butterworth-Heinemann Ltd., London, 1991.

[30] IEEE 1228-1994, Software Safety Plans, Institute of Electrical and Electronic Engineers (IEEE), New York, May 1994.

[31] Dhillon, B.S., Singh, C., Engineering Reliability: New Techniques and Applications, John Wiley and Sons, New York, 1981.

[32] Leveson, N.G., Harvey, P.R., Analyzing Software Safety, IEEE Transactions on Software Engineering, Vol. 9, No. 5, 1983, pp. 569 – 579.

[33] Mears, P., Quality Improvement Tools and Techniques, McGraw Hill, Inc., New York, 1995.

[34] Swain, A.D., A Method for Performing a Human-Factors Reliability Analysis, Report No. SCR-685, Sandia Corporation, Albuquerque, New Mexico, August 1963.

[35] Dhillon, B. S., Reliability in Computer System Design, Ablex Publishing Corporation, Norwood, New Jersey, 1987.

[36] Keene, S.J., Assuring Software Safety, Proceedings of the Annual Reliability and Maintainability Symposium, 1992, pp. 274 – 279.

Fig. 1 Ways in which software can increase risk:

• Directing the system toward a hazardous state
• Failure to detect and take an appropriate corrective action to recover from a hazardous condition
• Failure to mitigate the damage after the occurrence of an accident

Fig. 2 Ways for software to cause hazards:

• Failure to perform a required function
• Performed a function not required
• Performed a function out-of-sequence
• Failure to recognize a hazardous situation requiring a corrective measure
• Poor timing of response for an adverse situation
• Poor response to a contingency
• Provided incorrect solution to a problem
