


International Association for the Management of Technology

Lausanne, Switzerland

March, 2001

Managing Complex Technical Systems

Working on a Bridge of Uncertainty

by

Eli Berniker, Ph.D.

Pacific Lutheran University

Tacoma, WA, USA

and

Frederick Wolf, Ph.D.

University of Puget Sound

Tacoma, WA, USA

Keywords: normal accidents, complexity, uncertainty, hazardous chemical releases, petroleum refineries, safety

Managing Complex Technical Systems

Working on a Bridge of Uncertainty

“Anyone can make something work. It takes an engineer to make it barely work.” Drew Worth, 1988 (personal communication)

Introduction

Our systems have become very safe and effective. However, as they continue to evolve toward increasing complexity and to proliferate, the present levels of safety will become unacceptable. A Boeing vice president stated that if the current accident rate in the very safe commercial airline industry is maintained while air travel increases as projected over the next few decades, we will experience the equivalent of one wide-body airplane crash a week. That is clearly unacceptable.

Similar projections might be ventured in a wide variety of industries operating complex, tightly coupled, high consequence technical systems. Despite the efforts of industry and government to improve safety, major industrial accidents are as likely today as they were 10 years ago. Since 1982, there has been no reduction in the fatality rate or major accident rate in US or European industry (Sellers, 1993; National Safety Council, 1998).

All such systems are socio-technical systems, i.e., technical systems operated by organized groups of people. Therefore, any general theory of complex systems risk must be anchored both in engineering, and consequently the natural sciences, and in organizational science. At present, the engineering, managerial, and organizational approaches to systems safety derive from incommensurable theoretical foundations and produce partial analyses. They cannot be readily integrated into a coherent theory of systems failure and, therefore, into a usefully complete model for systems safety. We will propose a general theory of complex systems failure, anchored in the laws of physics, that defines the fundamental challenges of systems safety and suggests links to the engineering and organizational disciplines relevant to these challenges.

We start with Normal Accident Theory (NAT) (Perrow, 1984, 1999) as a validated model of the relationship between technical system characteristics and their reliability and safety. Normal Accident Theory will be restated in terms of the Second Law of Thermodynamics as the basis for a general theory of systems failure relating systems complexity, coupling, and the necessary uncertainty associated with the stochastic nature of failure paths. The fundamental challenge in managing complex high-consequence systems is the need to operate safely under conditions of uncertainty and continuing failures.

We will review current theories of systems safety with particular emphasis on how they relate to uncertainty, failure, systems complexity and coupling. Revisiting NAT, we will discuss the research evidence for its validity. Finally, we will propose a general model for interdisciplinary collaboration to improve the management of complex high consequence systems.

Normal Accident Theory

Perrow defines normal accidents as the outcome of multiple interacting events that are incomprehensible to operators when they occur and result in catastrophic disruption of systems. These are systems failures rather than component failures. However, component failures can be catalysts of systems accidents. Normal accidents are rare but, according to Perrow, an inevitable result of systems designs.

Perrow defined interactive complexity and tight coupling as characteristics of engineered systems that lead to normal accidents. Complexity and coupling are also characteristics of organizations and, in the form of limited resource availability, act to tighten the effective coupling of engineered systems. As a social scientist, Perrow utilized a two-by-two matrix to classify industries according to these parameters. Interactive complexity can be understood as an indication of the number of possible states available to the system, which Perrow described in terms of close proximity, interconnected subsystems, common-mode connections, multiple interacting controls, indirect information, and limited understanding. It is important, for theoretical reasons, to differentiate the engineered characteristics of systems from those related to their human operators. Both indirect information and limited understanding can relate to engineering design and also to operating organizations. Intuitively, although not rigorously, interactive complexity suggests the number of ways a system might fail.

Coupling is descriptively defined in terms of the availability of buffers, resources, time, and information that can enable recovery from failures. Intuitively, coupling suggests the availability of possible recovery paths. Tight coupling implies limited recovery potential, while loose coupling implies many more recovery paths. Perrow (1999) wonderfully illustrates tightness of coupling in describing a very complex sequence of events at Three Mile Island that all occurred within eighteen seconds. Note that the vast majority of failures are minor incidents for which recovery is possible and adverse consequences rare. Yet, however rarely, normal accidents do occur, often with catastrophic consequences.

The research challenge was to convert this classification scheme into a useful metric that can be applied to engineered technical systems. NAT is, first and foremost, a theory about engineered systems, focusing "on the properties of systems themselves rather than on the errors that owners, designers, and operators make in operating them" (Perrow, 1999). Wolf (2001), in his research on petroleum refineries, operationalized complexity in terms of process parameters and their states. Coupling was operationalized in terms of resource availability, a constraining factor in all complex systems. Both allow NAT to be stated in thermodynamic terms.

The Second Law of Thermodynamics

The Second Law of Thermodynamics, a fundamental law of physics, states that all systems must move towards increasing entropy, or disorder. Schrodinger (1967) explained the apparent capacity of open systems, including living systems, to grow and maintain themselves by their ability to import "negative entropy." Such a process must convert more energy into entropy than the system retains as ordered energy for maintenance and growth. If the living (or open) system and its environment are taken together as the system, the Second Law is not violated.
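Stated compactly, using a standard open-system formulation rather than a quotation from Schrodinger, the requirement is that the entropy books balance across the system and its environment:

\[ \Delta S_{\mathrm{total}} = \Delta S_{\mathrm{system}} + \Delta S_{\mathrm{environment}} \ge 0 \]

The system may hold or even reduce its own entropy only by exporting at least as much entropy to its surroundings; a refinery does this continuously through the energy and maintenance resources it consumes.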

A petroleum refinery is a very high energy, complex technical system that must perpetually seek entropic equilibrium. The ultimate equilibrium entropic state of such a plant would be a smoking black hole. The resources invested in ongoing maintenance of the refinery are necessary to assure safe operation and to reverse the entropic effects of wear on the system.

Boltzmann formulated the equation for the entropy of a system (Schrodinger, 1967):

S = k ln D

Where:

S is a measure of the entropy (disorder) present in a system. The equilibrium state represents the end point of the system.

k is the Boltzmann constant, which for high-risk technologies would be expressed as the energy density of the system.

D is a "measure of the atomistic disorder" (Schrodinger, 1967).

For engineered systems, D is the total number of possible states accessible to the system.

Wolf (2001) operationalized system interactive complexity as Ci, an index of the possible states of the system that is a subset of D. Coupling was operationalized in terms of the slack resources available to the system. Greater resources loosen coupling and enable both prevention and recovery. Scarcity of resources limits recovery possibilities. In thermodynamic terms, resources represent the sources of negative entropy required to maintain the system and to recover from wear, damage, and failures.

It should be noted that this is not the first time Boltzmann's equation has been applied to organizational science. Schrodinger (1967), in What is Life, inverted Boltzmann's equation, coining the term "negative entropy" for order, which Bertalanffy (1968) in turn translated into information and organization as a foundation for general systems theory. The systems model so widely used in the organizational sciences owes much to the Second Law equation.

A General Theory of Systems Failure

The Second Law of Thermodynamics provides a firm basis for integrating Normal Accident Theory into a general theory of failure. The Second Law forces us to accept the inevitable risk of failure of technical systems. The Boltzmann equation makes clear that catastrophic failure is a consequence of the entropy, or disorder, associated with the equilibrium end state of systems. Thus, failure must be associated with uncertainty and with the impossibility of designing, building, or operating perfect technical systems. Leveson (1995) has argued that similar limitations apply to software, and Weinberg (1975) has demonstrated that such limitations are inherent in our scientific models. Weick (1990) recognized this character of technology:

The very complexity and incomprehensibility of new technologies may warrant a reexamination of our knowledge of the cause and effect relations in human actions (p. 150) [and] the unique twist of the new technologies is that the uncertainties are permanent rather than transient. (p. 152)

Based upon the Second Law, we may conclude:

Any system of sufficient complexity and tightness of coupling must, over time, exhibit entropic behavior and uncertainty resulting in its failure.

Failure is a certainty and, therefore, recovery and repair must be possible for the system to avoid end state equilibrium. In the case of complex, tightly coupled systems, we simply cannot know enough about them to anticipate all modes of failure and prevent them within the time and resources available.

The challenge is:

How can we design, build, and operate complex, high consequence, technical systems under conditions of necessary uncertainties and many possible paths to failure?

Theories of Failure

Two major disciplines, organizational science and engineering, provide alternative models of failure. We will test their underlying paradigms with respect to their implicit sources of failure and, therefore, their implied conditions for systems safety.

We note that each of these disciplines makes valuable contributions to systems safety. Yet, given necessary uncertainties, all must be deficient.

The Assumption of Human Error

Shared across engineering and organizational science is a focus on human error as the source of failure and a strong tendency to confound failure with error. Human error is a pervasive theme in the psychological and organizational literature. The assumption is that if human error can be controlled and prevented, system safety will be assured. Implicit in this view is a denial of systems uncertainty.

Turner and Pidgeon's (1997) research on 13 large industrial accidents demonstrates this focus on human error. They note that during the time immediately preceding an accident: 1) events were unnoticed or misunderstood because erroneous assumptions were made; 2) discrepant events were unnoticed or misunderstood as a result of problems in information handling in complex situations; 3) events that warned of danger passed unnoticed as a result of human reluctance to fear the worst; and/or 4) formal precautions were not up to date and violation of rules and procedures was accepted as the norm. Note that each of these failure categories relates to human error or cognitive processes.

According to Weick and Roberts (1993), “We suspect that normal accidents represent a breakdown of social process and comprehension rather than a failure of technology. Inadequate comprehension can be traced to flawed mind rather than flawed equipment” (p.378). Roberts (1989) has written, “It is not really clear that all high risk technologies will fail” (p. 287). By contrast, Leveson (1995) notes that systems can fail when they operate exactly as planned.

Weick, Sutcliffe and Obstfeld (1999) argue that:

Theoretically, a system with a well-developed capability for improvisation should be able to see threatening details in even the most complex environment, because, whatever they discover, will be something they can do something about.

The evidence presented in validating NAT will show that with respect to complex systems, this is a cognitively impossible task.

It is clear that we make errors and that many failures may be caused by our errors. Leveson (1995) has critiqued engineering attributions of accidents to human error, arguing that "the data may be biased and incomplete," that "positive actions are usually not recorded," and that attributions are often based on the premise that operators can overcome every emergency: "It appears…that the operator who does not prevent accidents caused by design deficiencies or lack of proper design controls is more likely to be blamed than the designer." Moreover, "separating operator error from design error is difficult and perhaps impossible."

She cites Rasmussen as arguing that “human error is not a useful term” and expresses particularly well our core disagreement with this school of thought:

The term human error implies that something can be done to humans to improve the state of affairs; however, the "erroneous" behavior is inextricably connected to the same behavior required for successful completion of the task. (Leveson, 1995, p. 102)

Industrial Safety

The assumptions of human error are linked with a perception, shared across the organizational and engineering disciplines, that incident rates are indicators of system-level risk and that managing them manages that risk.

In 1959, Heinrich published his classic work, "Industrial Accident Prevention." He described a 'foundation for industrial safety' program which has come to be known as 'Heinrich's Triangle' by safety practitioners in industry. Based on observations derived from a sample of 5,000 industrial accidents, Heinrich (1959) determined that "in a unit group of 330 accidents of the same kind and involving the same person, 300 result in no injury, 29 in minor injuries and 1 in a major loss" (p. 26). This observation led to the realization that a 'preventative opportunity' exists to manage safety: if the frequency of accidents that result in no injury can be reduced, the corresponding frequency of more serious accidents can also be reduced.
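As a minimal sketch of the arithmetic behind this 'preventative opportunity' (our illustration, not a calculation from Heinrich; the incident count below is hypothetical), the 300:29:1 ratio projects serious losses from a count of no-injury accidents:

# Heinrich's (1959) observed ratio for a unit group of 330 accidents
NO_INJURY, MINOR_INJURY, MAJOR_LOSS = 300, 29, 1

def projected_outcomes(no_injury_count):
    """Project minor injuries and major losses implied by Heinrich's ratio,
    assuming the ratio holds for the accident population in question."""
    scale = no_injury_count / NO_INJURY
    return scale * MINOR_INJURY, scale * MAJOR_LOSS

# Hypothetical example: 1,500 recorded no-injury accidents
minor, major = projected_outcomes(1500)
print(minor, major)  # 145.0 minor injuries, 5.0 major losses under the ratio

The critique that follows questions precisely this extrapolation when the 'major loss' of interest is a systems accident rather than an individual injury.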

Proponents of High Reliability Organizations share this focus, holding that tracking and correcting incidents and minor accidents is essential to assuring continued reliable operations. However, as Rasmussen (1990) notes, ". . . the error is a link in the chain, in most cases not the origin of the cause of events" (p. 1186).

There is no doubt that the safety records of complex systems can be improved when we focus on incidents and minor accidents. But we cannot assume that Heinrich's triangle, correlating rates of incidents and accidents, is evidence that reducing the frequency of non-injurious accidents results in a corresponding reduction in systems accidents. Wolf (2001) found that "Safety performance… was unrelated to the complexity of the process system," while his results strongly demonstrate that complex refineries experience more frequent hazardous chemical accidents. Therefore, the linkage between efforts to improve individual safety and the prevention of systems accidents is problematic.

Organizational Risk Management and Resource Models

Systems failure incidents may be understood as signals of impending risk. Responses to these warnings may be controlled or mediated by resource availability.

Marcus and Nichols (1996, 1999) described an organizational 'band of safety'. According to this model, organizations 'drift' within an acceptable performance envelope. Warnings of impending danger are signaled by increased rates of minor incidents and accidents. The organization can take action and correct the deviation, recognizing that "Correction depends on the magnitude of the signal, the sensitivity of detection and the width of the detection recovery zone" (Marcus and Nichols, 1996, p. 3).

Since a ‘drifting band’ varies temporally, it calls for a decision model that deals with time variation. According to March (1994), “Solutions are answers to problems that may or may not have been recognized. They can be characterized by their arrival times and their access to choice opportunities as well as by the resources they provide to decision makers who are trying to make their choices” (p. 200). As Rasmussen (1994) cautions, “During periods of economic pressure the safety boundary is likely to be breached because the accident margin will be unknown from normal work practice and will not reveal itself locally” (p. 29).

Rose (1990) identified a linkage between safety performance and resource allocation. In her study of airlines, "… lower profitability correlated with higher accident and incident rates" (p. 944). She attributed this to risky organizational behavior, including the use of older equipment, deferred upgrades and equipment modernization, the hiring of less experienced, lower salaried employees, and the use of low-bid outside contractors for aircraft maintenance.

Integrating these models suggests that the bandwidth of the safety boundary is determined (at least to some extent) by resource availability and economic performance. The bandwidth of the Marcus and Nichols model is established by the frequency of signals, including incidents and minor accidents, mishaps, etc., and the level of risk acceptable to the organization at a particular time.

Shrivastava's (1992) conclusions concerning the Bhopal accident are consistent with this view: "If a plant is strategically unimportant, it receives fewer resources and less management attention, which in turn, usually makes it less safe" (Shrivastava, 1992, p. 43). Stein and Kanter (1993) also recognized this phenomenon: "The culprit becomes a system under such pressure to perform that mistakes are encouraged, constructive actions undercut and information withheld" (p. 59). Davidson (1970) observed:

The supervisor has to make a decision as to whether to halt production when a question is raised on the safety of continued operation…he asks himself if it is unsafe to the point that he must shut down the operations. Nine times out of ten he is incapable of making the judgement. He is trained to keep production going…Always on the back of his mind is the knowledge he will be held responsible for the loss of production that will occur if he takes the cautious course of slowing or stopping operations. (p. 108)

The key assumptions underlying the organizational and managerial approaches to complex systems management are that we can organize for systems safety with the right structure, the right culture, and sufficient resources. The technical system is accepted as a given, and it is assumed that all of its attendant equivocality and uncertainty can be overcome by organizational means. Uncertainty is not seen as a fundamental characteristic of complex systems but as a problem that can be overcome with good organization.

Engineering Reliability

Reliability and safety are concepts based upon opposing logics in engineering. Reliability is a characteristic of items or parts expressed by the probability that they will perform their required functions in a specified manner over a given time period and under specified or assumed conditions. Reliability engineering uses a bottom-up approach, assuming that if the elements of a system are reliable, the whole system will be safe (Leveson, 1995).

The assumption that systems reliability is a function of individual component reliability leads to several design principles and a paradox (Sagan, 1996). Reliability engineers concerned primarily with failure rate reduction may utilize parallel redundancy of parts, standby sparing of units, safety factors and margins, reduced component stress levels, and timed replacements (Leveson, 1995). "While these techniques are often effective in increasing reliability, they do not necessarily increase safety" (Leveson, 1995). Safety is a systems-level phenomenon which must be evaluated in terms of combinations of events involving both incorrect and correct component behavior and the environmental conditions under which these occur. Safety is a function of the system functioning in relation to its operating environment. This view is more consistent with the Second Law.

The paradox of engineering reliability strategies is "the paradox of redundancy" identified by Sagan (1996). As the number of components increases, so does the exposure to catastrophic common-mode failures, and the probability that a catastrophic failure will occur increases.
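A hedged numerical sketch of the paradox follows (our illustration with assumed probabilities, not figures from Sagan or Leveson): independent redundancy drives the chance that every channel fails toward zero, but each added channel also adds exposure to a shared common-mode hazard, so total catastrophic risk can grow with the component count.

def catastrophic_risk(n_channels, p_independent=0.01, p_common_per_channel=0.001):
    """Illustrative model with assumed numbers: the system fails catastrophically
    if all n redundant channels fail independently, or if any channel triggers a
    shared common-mode failure (e.g., a common power or control connection)."""
    all_channels_fail = p_independent ** n_channels
    any_common_mode = 1 - (1 - p_common_per_channel) ** n_channels
    # Probability of the union of the two (assumed independent) catastrophic paths
    return all_channels_fail + any_common_mode - all_channels_fail * any_common_mode

for n in (1, 2, 4, 8):
    print(n, round(catastrophic_risk(n), 5))
# The independent-failure term vanishes quickly, but the common-mode term keeps
# growing with n, so beyond a point more redundancy means more catastrophic risk.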

In addition, the engineering reliability approach does not incorporate the operating organization and its capabilities. Hirschhorn (1982) notes that:

Engineers have not learned to design a system that effectively integrates worker intelligence with mechanical processes. They seldom understand that workers even in automated settings must nevertheless make decisions; rather they tend to regard workers as extensions of machines. (p. 107, cited in Leveson, 1995)

As a result, engineers often increase the incomprehensibility of complex systems. They design the monitoring, instrumentation, and feedback systems that inform operators about system states. Appropriate design of monitoring systems should be based upon an understanding of cognitive processes and individual and group decision making, domains that are generally excluded in an engineering education.

Engineering Design

Engineering design is a formal process of incremental steps that are integrated to yield a technical system. A design project begins with a feasibility study. The feasibility study is a formalized screening of alternatives that could satisfy the need which the engineering effort is intended to address (Dorf, 1996).

The preliminary design stage follows conceptual design. This stage provides enough detail to allow a value engineering review to be performed to evaluate the probable life cycle cost of the project. During this stage, hazard analysis techniques are used to review the design. While useful, none of these techniques is capable of identifying all potential failure modes and hazard scenarios associated with the nascent design. Risk balancing is part of the design process. Engineering is problem solving, "…applying factual information to the arts of design. What makes it intellectually exhilarating is resolving conflicts between performance requirements and reliability on one hand and the constraints of cost and legally mandated protection of human safety and environment on the other. Design becomes a tightrope exercise in tradeoffs" (Wenk, 1995, p. 22).

Safety factors or margins of error must be included in every design to compensate for uncertainties, which can arise from many factors. When a failure occurs, it yields information concerning the limitations of its design (Petroski, 1994). There is a Faustian bargain struck in the design process of all technical systems: the engineer must balance the need for reliability and safety against economic and physical constraints.
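A textbook way to express this balance (a standard formulation, not drawn from Petroski or Wenk) is the safety factor relating estimated capacity to anticipated load:

\[ SF = \frac{\text{estimated capacity}}{\text{anticipated load}}, \qquad SF > 1 \]

The excess above one is the margin held against uncertainties in materials, loads, models, and operating conditions; a larger margin buys confidence at the cost of material, weight, and money, which is exactly the tradeoff the design process must strike.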

Petroski (1992) warns not to expect an engineer to "…declare that a design is perfect and absolutely safe, such finality is incompatible with the whole process, practice and achievement of engineering" (p. 145).

Petroski (1992) describes the nature of engineering very clearly:

Engineering design shares certain characteristics with the positing of scientific theories, but instead of hypothesizing about the behavior of a given universe, whether of atoms, honeybees, or plants, engineers hypothesize about assemblages of concrete and steel that they arrange into a world of their own making. (p. 43)

Every design is therefore subject to uncertainty; its rejection is only determined at the time of its failure. Engineering design certainly recognizes necessary uncertainty and the limitations of reliability and systems safety. However, there is no conceptualization of the experimental settings within which their designs will be operated and tested. The operating organization is missing from their world view.

The space shuttle Challenger was the subject of several important works that deal with organizational culture. In 1988, Starbuck and Milliken wrote, "The most important lesson to learn from the Challenger disaster is not that some managers made the wrong decisions or how o-rings worked: the most important lesson is that fine-tuning makes failures very likely" (p. 335). Fine-tuning is "experimentation in the face of uncertainty" in the "context of very complex sociotechnical systems so its outcomes appear partially random."

Validating Normal Accident Theory

Our general theory of systems failure is derived from the extension of Normal Accident Theory to engineering principles. The validity of Normal Accident theory (NAT) requires demonstration.

NAT is defined and exemplified by Charles Perrow (1984, 1999) as a model that anticipates the failure of complex, tightly coupled technical systems. However, as an organizational study, neither complexity nor coupling was defined in physical terms. As discussed earlier, Wolf (2001; Wolf & Berniker, 1999) operationalized interactive complexity in terms of the Second Law of Thermodynamics and Boltzmann's equation for entropy. He then validated NAT on a population of petroleum refineries.

Interactive complexity and tightness of coupling are the core of Normal Accident Theory (Perrow, 1984). Perrow defines complexity by illustration rather than by rigorous definition. He illustrates his concept of interactive complexity with examples including “Branching paths, feedback loops, jumps from one linear sequence to another because of proximity…and connections that can multiply as other parts or units or subsystems are reached” (Perrow, 1984, p. 75). While not rigorous, it clearly conveys the idea that he is concerned with interactive complexity and not simply the number of parts, components or subsystems present. Perrow offers additional insight into his concept of complexity by describing what he terms ‘baffling interactions:’

These represent interactions that were not in our original design of the world and interactions that we, as operators, could not anticipate or guard against. What distinguishes these interactions is that they were not designed into the system by anybody; no one intended them to be linked. They baffle us because we acted in terms of our own designs of a world that we expected to exist - but the world was different. (Perrow, 1984, p. 175)

Perrow’s concept of interactive complexity uses qualitative descriptions of either ‘complex’ or ‘linear’ for a system. He defines complex systems by their characteristics, which include:

. . . component proximity, common-mode connections, interconnected subsystems, limited substitutions, feedback loops, multiple and interacting controls, indirect sources of information, and limited understanding of the processes involved. In contrast, linear systems are characterized by spatial segregation (of components and subsystems), dedicated connections, segregated subsystems, easy substitutions, few feedback loops, single purpose regulating controls, direct information, and extensive understanding of the process technology. (Perrow, 1984, p. 85)

Perrow also defines the term coupling descriptively. He states “Loosely coupled systems tend to have ambiguous or perhaps flexible performance standards” and “Loose coupling, then, allows certain parts of the system to express themselves according to their own logic or intents” (Perrow, 1984, p. 91). He identifies several characteristics of loose coupling associated with technology that include “Processing delays are possible, order of sequence can be changed, slack in resources is possible, buffers and redundancies available and substitution is available” (Perrow, 1984, p. 92). Perrow describes ‘tightly coupled’ systems as having “more time dependent processes, more invariant production sequences, only one way to a production goal” and “tightly coupled systems have little slack” (Perrow, 1984, p. 93).

Testing Normal Accident Theory

Wolf (2001; Wolf & Berniker, 1999) tested NAT on a population of petroleum refineries. He operationalized interactive complexity utilizing Boltzmann's equation for entropy, S = k log D. D represents the potential disorder of a system, or the total number of possible states of the refinery. The refineries were divided into process subsystems and the number of control nodes in each subsystem tabulated. The potential states at each node were defined and a complexity index for each subsystem, and for the whole refinery, was determined combinatorially.

Ciplant = Q1 × Q2 × … × Qn, where Qi = Qi1 × Qi2 × … × Qim     (1)

Where:

Ciplant = Complexity of a specific refinery

n = Number of nodes in a specific refinery

m = Number of parameters at node i

Qij = Number of possible states of parameter j at node i

Qi = Number of possible states of all parameters at node i

Ci = Number of possible states of all parameters of all nodes for the plant system

An appreciation for the cognitive complexity of an oil refinery may be gained from the fact that the criterion for dividing a population of 36 refineries into less and more complex groups is log Ci = 30.
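A minimal sketch of how equation (1) might be tabulated in code follows (our illustration; the node and parameter-state counts below are hypothetical, not data from Wolf's study):

import math

# Hypothetical control nodes: for each node, the number of possible states
# of each monitored parameter at that node (the Qij values of equation (1)).
refinery_nodes = {
    "crude_unit_node_1": [3, 4, 2],
    "crude_unit_node_2": [2, 2, 3],
    "alkylation_node_1": [5, 4, 4, 3],
    "alkylation_node_2": [6, 5, 3],
}

def node_states(parameter_states):
    """Qi: possible states of all parameters at one node (product of the Qij)."""
    q = 1
    for qij in parameter_states:
        q *= qij
    return q

def plant_complexity(nodes):
    """Ciplant: possible states of all parameters at all nodes, per equation (1)."""
    ci = 1
    for states in nodes.values():
        ci *= node_states(states)
    return ci

ci = plant_complexity(refinery_nodes)
print(ci, round(math.log10(ci), 1))  # 6220800, with a log10 index of about 6.8
# The classification criterion in Wolf's sample, log Ci = 30 (base 10 assumed here),
# indicates how much larger the indices of real refineries are than this toy example.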

Perrow (1999) notes that the research addresses a vagueness in NAT:

Is incomprehensibility on the part of the operator due solely to hidden, mysterious, and unanticipated interactions, or can it be the result of the sheer volume of the number of states that the system is forced to monitor? (p. 363). As Berniker comments in a letter:

We might postulate a notion of Organizational Cognitive Complexity; i.e. that the sheer number of nodes and states to be monitored exceeds the organizational capacity to anticipate or comprehend. Complexity indices for major refinery unit processes varied from 156,000,000 for alkylation/polymerization down to 90 for crude refining. The very use of a logarithmic scale to classify complexity suggests the cognitive challenges facing an organization.

Coupling may be thought of as the density of energy flow in the system. Higher energy flows represent the rate at which material flows through the refinery and its temperature in the various refining processes. Higher energy configurations are higher risk configurations. The system seeks to minimize its free energy. The more states accessible to the system (the greater its interactive complexity), the larger its entropy. Increasing temperature increases disorder. As technology becomes more interactively complex and is engineered with ever-increasing energy density (including, but not limited to, heat, pressure, transfer rate kinetics, exothermic reaction kinetics, etc.) (Perry, 1997; Pool, 1997), its risk of system accident increases. System accidents are a manifestation of the disorder required by the Second Law of Thermodynamics.

Coupling was operationalized, with respect to the refineries, in two different ways. The original research asked whether the refinery operated continuously, pumping its products into a pipeline, or discontinuously, delivering its products to tanks. This is a familiar operations distinction, continuous versus batch processing. Subsequently, the sample size was greatly expanded and the coupling parameter generalized in terms of available resources. In this second case, Return on Assets served as a measure of resource availability.

The outcome variable was Reportable Quantities of Hazardous Substance Releases (RQs). Such releases are very serious accidents. The quantity that must be reported before a release counts as an RQ varies with the substance's toxicity.

Research Results

The results of both research efforts are presented in Figures 1 and 2. Figure 1 demonstrates the very significant differences between the two major groups of refineries: those that are more complex (above log Ci = 30) and tightly coupled (continuous processing), and those that are less complex and loosely coupled (batch processing). The difference in RQ rates, 9.8/plant/year versus 0.3/plant/year, is significant at the p = 0.01 level (Wolf and Berniker, 1999).
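As an illustrative sketch only, not the authors' actual analysis, the release counts shown in Figure 1 can be compared with a simple two-sample Poisson rate test. The five-year 1993-97 window mentioned later in the text is assumed as the exposure period, and the pairing of the group sizes (17 and 16 plants) with the two quadrants is read from the figure and should be treated as an assumption.

from math import sqrt
from scipy.stats import norm

# Figure 1 counts (assumed pairing): more complex, tightly coupled refineries
# versus less complex, loosely coupled refineries, over an assumed 5-year window.
releases_complex, plants_complex = 829, 17
releases_simple, plants_simple = 26, 16
years = 5

exposure_c = plants_complex * years   # plant-years
exposure_s = plants_simple * years
rate_c = releases_complex / exposure_c   # roughly 9.8 RQs per plant-year
rate_s = releases_simple / exposure_s    # roughly 0.3 RQs per plant-year

# Normal approximation to a two-sample Poisson rate comparison
se = sqrt(releases_complex / exposure_c**2 + releases_simple / exposure_s**2)
z = (rate_c - rate_s) / se
p_value = 2 * norm.sf(abs(z))
print(round(rate_c, 1), round(rate_s, 1), round(z, 1), p_value)
# The gap between the two rates is so large that this rough comparison is
# consistent with the significance level reported in the text.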

Wolf then deepened the research by increasing the sample size to 70 refineries, reconceiving coupling in terms of available resources, operationalized as Return on Assets, and modifying the outcome variable, RQs, to take into account scale effects. These results are seen in Figure 2.

This sample of 70 refineries demonstrated that a significant relationship exists between complexity, resource availability (as measured by Return on Assets), and the number of reportable quantity releases experienced by the refineries (Wolf and Finnie, 2001). First, the outcome variable, RQ/10MBC (RQs per ten million barrels of crude processed), strongly demonstrates that the refineries were highly reliable systems. Given the average values for the less complex and more complex refineries, 1.6 and 2.3 RQs per ten million barrels of crude oil processed respectively, reportable quantity release accidents are rare events.

Yet these high-consequence systems behave in accordance with Normal Accident Theory. More complex refineries experienced significantly (p=.01) more RQs per ten million barrels of crude distilled than less complex refineries.

Second, as noted by Marcus and Nichols (1999), organizations must have signals to serve as warnings that safety may be drifting beyond a level of acceptability, and must make use of the resources available to take corrective action in a timely way. Refineries of equivalent complexity but with greater available resources should, on average, experience fewer reportable quantity accidents. Indeed, as shown in Figure 2, this was demonstrated at a high level of significance (p = .01) for the 70 refineries in the sample.

As Wolf (2001) noted:

“…as Perrow argues, poorly run organizations will have more discrete errors that are available for unexpected interactions that can defeat safety systems and thus will be more prone to “system accidents” as he refers to normal accidents. Less financially successful organizations have fewer resources available for operational exigencies such as preventative maintenance, replacement of aging equipment or modernization and less slack operating resources. Poor financial performance can also trigger risky actions including the use of low-bid outside contractors. But secondly, and Perrow does not note this, poorly run or financially starved organizations may be forced to operate with tighter coupling as a result of cost cutting measures. Tight coupling may not be just a requirement of the manufacturing process, but may be a managerial decision based upon budgetary stress or profit targets.”

Refineries operated by more financially successful firms experienced significantly fewer RQs/10MBC than those operated by less successful firms, when classified according to complexity. Regardless of how such resources are applied, when fewer resources are available, less corrective action is possible and the band of acceptable performance is forced wider. In this case, the organizations with fewer resources face the dilemma of having to cope with the signals of an impending safety boundary, evidenced by significantly higher incidences of reportable quantity releases, while lacking sufficient resources to take appropriate risk-mitigating actions.

Third, the data suggest there is some evidence that reliability can be improved, perhaps through organizational means. Three of the refineries classified as "more complex" experienced no RQs during the period 1993-97. All were operated by the same corporation. We do not know if this is the result of some systematic and deliberate effort to under-report such accidents, or if other factors, possibly organizational in nature, are involved. Additional work is planned to determine whether organizational factors were involved in the atypical performance of these three refineries.

We believe the validity of NAT as a useful model for the challenges presented by complex, high consequence, systems has been well demonstrated. The question remains, how are such systems to be safely operated?

Managing Complex High Consequence Systems

The primary challenge is:

How can we safely design, build, and operate complex, high consequence, technical systems under conditions of necessary uncertainties and many paths to failure?

This challenge should be addressed at two levels: research and operations. The research paradigms of the engineering and organizational sciences are incommensurate. Each defines phenomena in different terms excluding the other. Neither can move far in reducing uncertainty. There is a need to build a ‘bridge of uncertainty’ upon which both sets of disciplines can collaborate; a bridge between the physical and social sciences.

The foundation for that bridge is the recognition that neither set of disciplines can overcome fundamental uncertainties. However, by cooperating, they can push back the walls of uncertainty and make progress. Therefore, the acceptance of uncertainty, and the recognition by each discipline of its own fundamental limitations and of the complementary potential contributions of the other disciplines, is the 'bond' that makes such a multi-disciplinary bridge possible. Once any discipline claims to have "The Answer," it has walked off the bridge.

Information and Cognition

The point of articulation between the social and engineering sciences is where information is cognitively processed. Engineers, in designing systems, determine the nature of the information that can be drawn about the system, its instrumentation and control points, and the sets of potential actions that operators may take. Yet engineers are not trained in the cognitive processes of individuals or groups and, therefore, their designs are often suboptimal and, in some cases, even dangerous (Leveson, 1995).

Organizational and cognitive scientists seek to organize learning and decision making with little recognition of the technical significance of much systems information, i.e. the cause and effect relationships of potential actions and how those actions may be represented by the instrumentation of a system. Without technological understanding, subtle and important distinctions can be lost. Weick (1990) demonstrated the potential complexities of just a few words uttered by the KLM pilot in the Tenerife catastrophe. There is a need for greater technological understanding by organizational scientists and greater cognitive understanding by engineers.

The shared, collaborative, research agendas across this ‘bridge of uncertainty’ might be framed by the following questions: For engineering scientists, what principles should guide the design of complex, high consequence systems intended to be operated by competent teams with the capacity to learn? For organizational scientists, how would competent teams or operating groups be designed to generate technological knowledge about complex, high consequence systems?

Complex Systems Operation

We start with Petroski's (1992) argument that all engineering designs are hypotheses to be tested and the comments by Starbuck and Milliken (1988) that fine-tuning is "experimentation in the face of uncertainty." Thus, the operation of complex systems constitutes empirical testing of engineering design hypotheses, often in poorly designed experimental settings fraught with uncertainty and failures.

Thus, the challenge becomes:

How do we design, build, and operate complex, high consequence systems as competent experiments that will enable designers and operators together to push back the walls of uncertainty and prevent, in so far as possible, unrecoverable failures?

The designers might be guided by the following principles:

• Simplify as much as possible and avoid over design (Leveson, 1995)

• Avoid excessive redundancies that provide back-ups but also increase the paths to failure.

• Increase decoupling in order to increase the potential responses of operators and allow better isolation of subsystems and observation and control of variables.

• Increase information sources relevant to projected uncertainties.

Most important, designers should:

• make explicit those uncertainties that their design safety factors, redundancies, and controls are intended to address.

Questions and uncertainties are better shared with the operating organizations than designed around. Rather than assuming solutions, they should be empirically tested. Designers should set the experimental agenda to whatever extent possible.

From the organizational side, operating teams should engage in "High Reliability Organizing." When operating complex, high consequence systems, organizations cannot achieve high reliability, that is, they cannot eliminate uncertainty and risk. They can, however, achieve high reliability organizing, which is the process of gaining knowledge and pushing back the walls of uncertainty.

In describing High Reliability Organizing (HRO), we draw primarily from the work of Weick, Sutcliffe, and Obstfeld (1999). HRO calls for “mindfulness” as a general term for the many ways that different cognitive processes interact to create knowledge. These processes include:

• Preoccupation with failure

• Reluctance to simplify interpretations

• Sensitivity to operations

• Commitment to resilience

• Underspecification of structures

Consider these processes as guiding principles for the ongoing development and execution of competent experiments linked to unexpected events. The connection between systems uncertainty and mindfulness is captured by Schulman (cited in Weick, Sutcliffe, and Obstfeld, 1999). There is

Widespread recognition that all potential failure modes into which the highly complex technical systems could resolve themselves have yet to be experienced. Nor have they been exhaustively deduced.

Therefore, the primary focus should be on competent experimentation to produce knowledge. The budgets for operating groups should include provision for learning. These extra costs are likely to be far smaller than the costs of over design and the extra maintenance costs of over designed systems. Perhaps, as we enter the “knowledge economy,” accountants will be able to see knowledge creation as an investment rather than an overhead expense.

With respect to complex, high consequence systems, organizing will generally involve multi-professional teams because of the wide range of professional knowledge required. There has to be continuing involvement of design and maintenance engineers along with operators. Such teams are more complex than multi-skilled teams, where it is expected that all members can master all required skills. Given the challenge of competent experimentation, it should be apparent that operating roles will become professional, with demanding qualifications.

Managing complex high consequence technical systems becomes organizing ongoing competent experiments whose purpose is to push back the boundaries of system uncertainty by mindful attention to system performance, continuous questioning, and the systematic evaluation of what is known, what is in doubt, and what might remain unknown. The organization literature terms this a learning organization. In this case, the subjects are both organizational functioning and the functioning of a complex, tightly coupled system. C. West Churchman (1971) called for such a model when he wrote that systems design processes cannot be separated from their outcomes.

Figure 1

Four Quadrants of Normal Accident Theory and Refinery Reliability

As Reportable Quantity Accidental Hazardous Substance Releases

(Wolf and Berniker, 1999)


References

Bertalanffy, Ludwig von. (1968). General System Theory. New York: George Braziller.

Churchman, C. (1971). The Design of Inquiring Systems. New York: Basic Books.

Davidson, R. (1970). Peril on the Job. Washington, D.C.: Public Affairs Press.

Dorf, R. (1996). The Engineers Handbook. Boca Raton, FL: CRC/IEEE Press.

Heinrich, H. (1959). Industrial Accident Prevention (4th ed.). New York: McGraw-Hill.

Hirschhorn, L. (1982, January/February). The soul of a new worker. Working Papers, pp. 42-47. Cited in Leveson (1995).

Iveson, W. and Coombs, C. (1996). Handbook of Reliability Engineering and Management. New York: McGraw-Hill.

Leveson, N. G. (1995). Safeware: System Safety and Computers. Reading, MA: Addison-Wesley.

Mainzer, K. (1995). Thinking in Complexity. Berlin, Germany: Springer.

March, J. (1994). A Primer on Decision Making. New York: Free Press.

Marcus, A. and Nichols, M. (1996, August). Acquiring and Utilizing Knowledge in Response to Unusual Events in a Hazardous Industry. Paper presented at the Annual Meeting of the Academy of Management, Cincinnati, OH.

Marcus, A. and Nichols, M. (1999). On the Edge: Heeding Warnings of Unusual Events. Organization Science, 10 (1), 482-499.

National Safety Council. (1998). Accident Facts, 1998 Edition. Itasca, IL: Author.

Perrow, C. (1984). Normal Accidents. New York: Basic Books.

Perrow, C. (1999). Normal Accidents (2nd ed.). Princeton, NJ: Princeton University Press.

Perry, R. (1997). Perry's Chemical Engineer's Handbook. New York: McGraw-Hill.

Petroski, H. (1992). To Engineer is Human. New York: Vintage Books.

Petroski, H. (1994). Design Paradigms. Cambridge, England: Cambridge University.

Pool, R. (1997). Beyond Engineering. Oxford, England: Oxford University.

Rasmussen, J. (1990). The Role of Error in Organizing Behaviour. Ergonomics, 33 (10/11), 1185-1199.

Rasmussen, J. (1994). Open Peer Commentaries in High Risk Systems. Technology Studies, 1 (1), 25-34.

Roberts, K. (1989). The Significance of Perrow's Normal Accidents: Living with High Risk Technologies. Academy of Management Review, 14 (2), 285-289.

Rose, N. (1990). Profitability and Product Quality: Economic Determinants of Airline Safety Performance. Journal of Political Economy, 98 (5), 944-964.

Sagan, S. (1996). Redundancy, Reliability and Risk. Paper presented at the Annual Meeting of the Academy of Management, Vancouver, B.C.

Schrodinger, E. (1967). What is Life? Cambridge: Cambridge University.

Sellers, C. (1993). Why We Can't Manage Safety. Boston: Arthur D. Little.

Shrivastava, P. (1992). Bhopal: Anatomy of a Crisis (2nd ed.). London: Paul Chapman.

Starbuck, W. and Milliken, F. (1988). Challenger: Fine-Tuning the Odds Until Something Breaks. Journal of Management Studies, 25 (4), 319-340.

Stein, B. and Kanter, R. (1993). Why Good People Do Bad Things: A Retrospective of the Hubble Fiasco. Academy of Management Executive, 7 (4), 58-62.

Turner, B. and Pidgeon, N. (1997). Man-Made Disasters (2nd ed.). Oxford, England: Butterworth-Heinemann.

Weick, K. (1990). The Vulnerable System: An Analysis of the Tenerife Air Disaster. Journal of Management, 16 (3), 571-593.

Weick, K. and Roberts, K. (1993). Collective Mind in Organizations: Heedful Interrelating on Flight Decks. Administrative Science Quarterly, 38 (3), 357-381.

Weick, K., Sutcliffe, K. and Obstfeld, D. (1999). Organizing for High Reliability. Research in Organizational Behavior, 21 (1), 81-123.

Weinberg, G. M. (1975). An Introduction to General Systems Thinking. New York: Wiley.

Wenk, E. (1995). Making Waves. Urbana, IL: University of Illinois.

Wolf, F. (2001). Operationalizing and Testing Normal Accidents in Petrochemical Plants and Refineries. Production and Operations Management, 10 (1).

Wolf, F. and Berniker, E. (1999). Validating Normal Accident Theory: Chemical Accidents, Fires and Explosions in Petroleum Refineries. High Consequences Engineering Conference, Sandia National Laboratories, Albuquerque, NM, USA.

Wolf, F. and Finnie, B. (2001). Capacity Utilization, Resource Availability and Coupling: Organizational Factors Influencing Risk in Complex Technical Systems. Production and Operations Management. In review.

[Figure 1: a two-by-two matrix of coupling (looser to tighter) against complexity (less to more complex). Recoverable values from the figure: the more complex, tightly coupled quadrant shows 829 RQ releases, an average of 9.8/plant/year, and 3 catastrophic accidents; the less complex, loosely coupled quadrant shows 26 RQ releases, an average of 0.3/plant/year, and 0 catastrophic accidents; each of the two remaining quadrants shows 15 RQ releases, an average of 1.0/plant/year, and 0 catastrophic accidents. Quadrant group sizes shown include n = 17, n = 16, and n = 3 refineries.]
