Drift into Failure

Drift into Failure, by Sidney Dekker, is one of the most thought-provoking books I’ve read in a while.

“Thought provoking” is usually a shorthand used by buttered-up friends of the author to mean “I agree” or “he/she provided a great blurb for my dust jacket and now I’m returning the favour”.

But in this case, I found that the book provoked a lot of thought on my part. It tied to a lot of other books I’ve read in the past year or so, some of which I’ll name check.

So … what’s it about?

Dekker discusses how complex systems ‘fail’ in unforeseen ways. He characterises some of these failures as ‘drifts’. The system didn’t visibly zoom towards failure; there was no massive perturbation, no onrushing catastrophe, not even dark clouds on the horizon. In a drift-failure, the failure just happens, and only afterwards is there any chance of diagnosing the whys and hows.

Drift essentially crosses two fields of work. The first is reliability / failure studies and the second is complex systems. I’m not very familiar with reliability studies except through a Chinese-whispers version that has been transmitted via software operations literature. I feel that I have a more-than-nodding acquaintance with systems theory through a uni course and my own reading in that area.

To a reader unfamiliar with either body of thought, this book might be a bit difficult. Dekker isn’t really addressing the book to the layperson; it’s addressed to practitioners in the reliability/failure field. Dekker’s ultimate hypothesis is that a “Newtonian-Cartesian” approach to failure does not and cannot address failures in complex systems.

If you’re not from the reliability field, reading Dekker is a bit like being an atheist at a theological debate: interesting, but a little hard to follow in parts. But boy does he have lots of points to make.

I respectfully disagree

I don’t think Dekker quite nails his case down. For the rest of the review I will try to explain why. Hang on, because it’s a long, circuitous ride.


As I said above, Dekker posits that a Newtonian-Cartesian worldview can’t explain or predict failures in complex systems. Of most concern for yours truly is that, in addition to reaching for complex systems theory, he reaches out for postmodernism. I’m not a particular fan of postmodernism — I think that some of its insights can be usefully appropriated into modernist thinking, but its universalist claims are dangerously nigh to total bunkum. I don’t think Dekker needed it.

Dekker uses postmodernism to posit that failure is a negotiated label. A system isn’t “failed” until after a failure, and the very concept of failure is constructed as an agreement between observers and participants of the system. Hence: failure is subjective.

Well, yes. Certainly, failure is, after a fashion, transmitted backwards in time. But many of the systems humans build are purposive. The purpose is known in advance. Even before any further negotiation between subjects takes place, many failures are instantly recognisable as failures.

Local optimality, global optimality and failure

Dekker chose the “drift” metaphor because a system arrives at failure in small, locally-rational steps. In one case study, he examines Alaska Airlines flight 261 in great detail. In this case study, a series of small relaxations of safety standards eventually led to a catastrophic system failure (sudden, unpredictable loss of human life).

Dekker asks: when did the system fail?

  • Did it fail when the particular acme nuts failed?
  • When maintenance was not performed?
  • When times between scheduled maintenances were extended?
  • When the design was made without accounting for the possibility of the above?

This goes back to the distinction between proximal and ultimate causes, popular amongst both reliability studies practitioners and lawyers. The proximal cause is clearly the acme nuts failing … but in this case, Dekker says, where is the ultimate cause? It’s diffused across the entire system, across a series of locally optimal solutions.

Local and global optimality is a classic human problem. In Daniel Kahneman’s excellent book Thinking, Fast and Slow he metaphorically describes different ‘selves’. One self is a fast, almost subconscious self; an intuitive rationaliser. It excels at locally optimal solutions. A second ‘consciously rational’ self must be aroused purposefully. “Math is hard”, as Barbie says, so let’s go shopping. Hence we almost never actually engage that second self, even in situations where we think we have. Kahneman includes lots of little fiendish self-tests for the reader that abundantly prove his case.

Once you see the distinction between the locally optimal and the globally optimal, cases jump out of the woodwork everywhere you look. It’s funny, because I learnt the concept of local/global optima at university but never really clicked to it before reading Kahneman.
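The local/global distinction can be sketched with a greedy hill-climber in Python. Everything here is an illustrative assumption rather than anything from Dekker or Kahneman: the `landscape` function is a made-up terrain with a small hill at x=2 and a tall one at x=8, and `hill_climb` is a deliberately short-sighted agent that only ever looks one step to either side.

```python
def hill_climb(f, x, step=1):
    """Greedy search: move to a neighbour only if it scores strictly higher."""
    while True:
        best = max((x - step, x + step), key=f)
        if f(best) <= f(x):
            return x  # no neighbour improves on x: a local optimum
        x = best


def landscape(x):
    """Hypothetical terrain: a small hill at x=2 (height 3), a tall one at x=8 (height 10)."""
    return max(3 - abs(x - 2), 10 - 2 * abs(x - 8))


stuck = hill_climb(landscape, 0)          # starts near the small hill
lucky = hill_climb(landscape, 5)          # starts on the slope of the tall hill
best_overall = max(range(11), key=landscape)
```

Starting from 0, every step the climber takes is an improvement, yet it stops on the small hill and never finds the tall one. Each move was locally sensible; the outcome is globally poor.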

And as with optimality, so too rationality. What is rational locally may transpire to have irrational global consequences. Little agents optimising their corner of a large system can cause failed systems. Part of Dekker’s broader hypothesis is that assigning blame is a bit rich in such circumstances — everyone was just acting according to sensible rules within their own situation. It’s all so complicated, give them a break.

I’m not so sure. Take for example the question: “when was the system in a failed state?”

By itself, that question supposes a binary logic. The system IS in a failed state, OR the system IS NOT in a failed state. Dekker sees what anyone can see as a bit of a nonsense and pushes it downwards to our notion of blame. I prefer to look at it and push it up to a narrow conception of logic.

To explain what I mean, I need to make two diversions.

Diversion I: Fuzzy Logic

Here’s where fuzzy logic pops in (and also where, based on the title of this subsection, I lose both of the readers who got this far without giving up out of boredom).

The core insight of fuzzy logic is that we can think of things as belonging to “fuzzy sets”. In normal logic, sets are cut-and-dried. Remember Venn Diagrams? They all looked like this:

Look at all that sharp delineation! Any “thing” in that diagram is indisputably in one of five possible states:

  • Blue
  • Yellow
  • Blue AND Yellow
  • Blue OR Yellow
  • NEITHER Blue NOR Yellow

Traditionally we ignore that last condition — the neither/nor — because that way we get a neat formula for calculating the possible number of states for any number of exclusive sets or logical variables.

There’s a lot to like about conventional logic. It’s the granite foundations of the field I hold a degree in — Computer Science. Given ANDs, ORs, NOTs and some ones and zeros, one can build essentially infinitely complex systems (I’ll return to this point later on).

But it doesn’t actually describe a heap of common problems in the actual world.

And speaking of heaps — here’s a classic philosophy question: is this a pile of sand?

Well yes. And if I remove a grain? Still yes. In computer science terms I’ve performed an inductive step, and it’s now “turtles all the way down”. The pile of sand is always a pile of sand, perhaps until I remove the last grain.

But we know that’s not “true”, in the everyday sense. A few grains of sand does not a pile make. And it gets worse, for when does the pile become a dune? And when does the dune become a desert?

Fuzzy logic sidesteps the issue by saying that the pile of sand has a degree to which it is a pile of sand. This is expressed with a “membership function”. To what degree does this pile of sand belong to the set of all piles of sand? Well in this case, I think we can all agree that it’s a pile of sand, so we grant it a high membership and say it’s a member of that set to a degree of 0.9.

When it gets small, we lower its membership degree. A few handfuls of sand might only rate 0.05 in the membership function. And as it grows very large, its membership degree again shrinks to a low number, even as its membership of the set of dunes grows larger.

Hence in Venn diagram terms, fuzzy sets look a bit more like this:

It’s … fuzzy, as you’d expect. Membership in the blue and yellow sets is not a binary proposition; there are degrees of membership.
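For the programmers in the audience, a membership function is just a function that returns a number between 0 and 1. Here is a minimal Python sketch using a trapezoid shape, which is one common choice. The size scale (log10 of the number of grains) and every threshold below are invented for illustration; `fuzzy_and` and `fuzzy_or` use the standard min/max operators from Zadeh's fuzzy set theory.

```python
def trapezoid(x, a, b, c, d):
    """Membership rises from 0 at a to 1 at b, stays at 1 until c, falls to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


# Size is log10(number of grains). All thresholds are made-up assumptions.
def pile(size):
    return trapezoid(size, 2, 5, 8, 10)


def dune(size):
    return trapezoid(size, 8, 10, 14, 16)


def fuzzy_and(p, q):
    return min(p, q)  # standard fuzzy intersection


def fuzzy_or(p, q):
    return max(p, q)  # standard fuzzy union


handful = pile(1)                          # too small to count as a pile at all
proper_pile = pile(6)                      # unambiguously a pile
in_between = fuzzy_and(pile(9), dune(9))   # partly a pile AND partly a dune
```

At size 9 the sand is a pile to degree 0.5 and a dune to degree 0.5 at the same time, something a sharp-edged Venn diagram has no way to express.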

Diversion II: Phase Space

Why do we care about sets all of a sudden? Because sets are one way to represent systems. More accurately, any given system has many states, and states can be grouped in various ways as sets.

First, let’s look at a very simple system: a switch. It has two possible states, on and off. The system can be described with a graph, like so:

This is a phase space, a space of all possible states of the system. The phase space diagram here is simple. It has one axis — one dimension — because the system only has one controlling variable. It has two states — two coordinates in phase space — because it’s a discrete binary variable. The switch is on or off. That’s it.

Systems of interest are, unsurprisingly, more complex than that.

Suppose now we have a control panel with one dial. It controls a vent which emits cold air. Next to the dial is a temperature gauge. The dial and gauge are wired to a room which you cannot directly observe. Your job is to reach a certain temperature.

A phase space diagram here would have two axes: one for the dial and one for the temperature. You need both axes to fully describe the configuration of a system at any given point in time.

What does that look like? A bit like this (warning, unsexy diagram):

Now suppose you twiddle the dial. You have changed the configuration of the system — you’ve moved through phase space to a new set of coordinates. We draw a line on the diagram to represent that:

After a while, the temperature falls:

Not the most stunning of diagrams, I grant you. But this is broadly how phase diagrams work. The line is implied to be a span of time, the points are particular configurations of the system.
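The same dial-and-gauge system can be written down as data: a state is a pair of coordinates, and a trajectory is a list of states in time order. This is only a sketch in Python. The `State` type, the 0 to 10 dial range and the cooling rate of half a degree per dial notch are all assumptions made up for the example.

```python
from typing import NamedTuple


class State(NamedTuple):
    dial: int           # vent setting, 0 (closed) to 10 (fully open); assumed range
    temperature: float  # gauge reading in degrees Celsius


def turn_dial(state, setting):
    """Move along the dial axis only; the temperature has not reacted yet."""
    return State(setting, state.temperature)


def wait(state, cooling_per_notch=0.5):
    """Move along the temperature axis: the room cools in proportion to the dial."""
    return State(state.dial, state.temperature - cooling_per_notch * state.dial)


trajectory = [State(0, 25.0)]
trajectory.append(turn_dial(trajectory[-1], 4))  # twiddle the dial
for _ in range(3):
    trajectory.append(wait(trajectory[-1]))      # after a while, the temperature falls
```

Each element of `trajectory` is one point on the diagram, and the order of the list is the implied span of time.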

So when is a system in a state of failure?

Dekker says that systems drift and that from inside the system, such drift isn’t visible until the failure occurs. But we still try to backtrack to discover “causes”, even when it might make no sense to.

Suppose a 2-variable system drifts to failure:

Dekker posits that in the Newtonian-Cartesian paradigm, we aim to trace that line backwards in time to discover who and what failed. But this is insensible, says Dekker, because the causes can be diffused across the entire system rather than located in individuals or components.

The “Newtonian-Cartesian mindset”

Dekker decries the “Newtonian-Cartesian” mindset of trying to find discrete causes for failure. Instead, he argues, each step can be sensible in itself, or the causes can be too diffuse to tease out, or there can be insufficient information to work them out.

I don’t think that Dekker really refutes the N-C mindset at all. Just because a step was locally, but not globally, optimal doesn’t excuse it. If global reasoning was available, it should have been exercised. Causes that are diffuse are still causes. Causes that can’t be detected due to lack of evidence or lack of instruments can still be considered causes (“hidden variables”, in physics parlance).

But Dekker wants to excuse a lot of such cases because, he posits, the Newtonian-Cartesian paradigm is itself broken.

I don’t think he proves his case. Worse still, he handwaves a lot of the rough edges of his argument away. Complex systems are hard to govern, he says. Why are they hard to govern? Because they’re complex. It’s a circular logic.

Ultimately Dekker’s logic relies on the incomplete conception of logic I gave above. In Dekker’s conception, a system is or is not failed. The observable paradoxes of meaning that this generates are then resolved by slapping a “warning: complex system!” tag on it, plus a dose of postmodern voodoo.

What I propose as an alternative is that systems have degrees of failure. Even when, in the everyday sense, a system has not “failed”, the phase space still contains fuzzy sets of states that represent all possible failures. And every state in the phase space has some degree of membership in each of those failure sets. It might look like this:

(An alternative rendering would be to add the “disaster” membership degree as another axis, but my graphics skills extend only so far).

Going back to Alaska Airlines Flight 261: when the plane crashed, the aviation safety system obviously belonged to the “tragedy” failure set to a degree of 1.0. But before the crash, its degree of membership in that set grew steadily as the system drifted towards it.
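A rough Python sketch of that proposal: give every point in a two-variable phase space a degree of membership in a fuzzy failure set, then read the degree off along a drifting trajectory. The coordinates, the `REACH` distance and the linear fall-off are all hypothetical choices made for the example; none of it is real data about Flight 261.

```python
import math

FAILURE_POINT = (10.0, 10.0)  # hypothetical coordinates of the "tragedy" state
REACH = 10.0                  # hypothetical distance beyond which membership is 0


def failure_degree(state):
    """Degree (0.0 to 1.0) to which a state belongs to the fuzzy failure set."""
    distance = math.dist(state, FAILURE_POINT)
    return max(0.0, 1.0 - distance / REACH)


# A made-up drift: each step is small, and each moves closer to the failure point.
drift = [(2.0, 2.0), (4.0, 4.0), (4.0, 10.0), (7.0, 10.0), (10.0, 10.0)]
degrees = [failure_degree(state) for state in drift]
```

The degree rises at every step, from 0.0 at the start to 1.0 at the crash. Under this view the question is never only “has the system failed yet?” but also “how failed is it right now?”, and that second question can in principle be asked before the catastrophe.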

My formulation does not excuse actors and components in a complex system. They are, where any degree of global insight is possible, still on the hook.

Legalism and Realism

Dekker describes the hunt for a single person or component causing a failure as a “legal view” of things. Which is funny, because lawyers have been grappling with complex systems for thousands of years. They’ve got some tricks up their sleeves.

One debate amongst lawyers is over what role a judge should play. One classic doctrine is “Legalism”, most famously demanded by Sir Owen Dixon during his tenure as Chief Justice of the High Court of Australia:

Close adherence to legal reasoning is the only way to maintain the confidence of all parties in federal conflicts. It may be that the court is thought to be excessively legalistic. I should be sorry to think that it is anything else. There is no safer guide to judicial decisions in great conflict than strict and complete legalism.

Legalism meant that, in considering a case, judges should strive to ignore all considerations but the law. This is, in a strict sense, impossible. The world is too mixed up in the law, the law too mingled with the affairs of the world. Judges are mere humans; a sea of passions with a few stony outcrops of reason. Legalism is, like Newton’s laws stretched to their limits, strictly impossible.

That last argument leads us to Realism, which basically says: judges are biased. Judges make law, in practice. Get used to it.

But the funny thing is that, when we zoom out, which better serves society at large? I would personally argue legalism, imposing as it does much lower uncertainty costs and politicking costs on society at large. And that was Sir Owen Dixon’s point. The loosey-goosey “broadness” of Realism turns out, upon closer inspection, to be founded on a narrower view of society than Legalism. The Legalist embraces an important impossibility because it serves a higher good.

My analogy here is that Dekker is pooh-poohing the analogical Legalism of the Newtonian-Cartesian world view (the view that causes can ultimately be derived from computation and analysis) in favour of a kind of Realism. Systems are complex, he says. Get used to it.

But like the Realists, his analysis is too narrow. Even if he is right (and I think he is only half right, as I will go on to say below), his postmodern / complex system view nurses dangerous seeds. Embracing the concept that there is always a cause or a set of causes leads to better systems, even if it isn’t true.

What is a Complex System, anyhow?

Dekker never really makes this clear, perhaps because he lacks the fuzzy logic terminology to point out that it’s a matter of degree.

I suggest that a “complex” system is any system which successfully confounds human understanding. That’s a fuzzy statement already: which human? What counts as confounding? What counts as understanding? But if we accept the fuzzy logic worldview, it’s less of a problem. Systems will belong to the “complex systems” set to differing degrees.

But I suggest that there is no qualitative change. It’s just that some problems are too big for humans. Some problems are too big for any computational device, as computer science has discovered — some problems cannot be solved at all by a computing device; some can’t be solved before the heat death of the universe.

But suppose the availability of a sufficiently advanced hypercomputer (or more quaintly, a god). What could it predict? How deep a system? What level of complexity? Newtonian (really Einsteinian) physics breaks down at the limit because of the uncertainty principle. But supposing it could be done, would this universe be predictable?

I think so. And that’s the most complex conceivable system there is — ie, the System of Everything. No qualitative shift has occurred. It’s a matter of (very, very, very large) quantitative differences.

So in fact “complex” systems are a human phenomenon, a label given to things that exceed 1) our ability to observe and 2) our ability to compute.

Epistemological Confusion

Dekker’s contest between the Newtonian-Cartesian vs Complex-Postmodern worldviews is really akin to the debate of atheism vs agnosticism.

Newtonian-Cartesianism says “this is reality, this is what is objective” — it’s a statement of belief. Postmodernism/Complexitism is “it’s unknowable, it’s constructed between subjects, it can’t realistically be done that way”. That’s a statement of epistemology, about what is knowable.

But these are talking past each other. Reality is, in a sense, both. There’s an objective reality, broadly a Newtonian-Cartesian reality at the humanly experienceable macroscale. And there’s our understanding of that reality. In a sense complexity just means “intractably difficult to compute”. Dekker has confused a statement of fact (“the world is not Newtonian-Cartesian at the macroscale”) with a statement of epistemology (“the world is not truly knowable at a complex scale”).

To me, a mechanistic universe does not preclude complexity, it predicts it. I can only imagine that a non-mechanistic universe would have no emergent phenomena and would resemble mere randomness. A non-mechanistic universe is entropic in an information-theoretic sense. No information arises from it, and therefore any claims of complexity are meaningless in a postmodern sense.

For example, the mechanistic nature of computers (Turing machines) belies the experienced complexity of modern computer systems. Alan Turing wrote a paper to discuss an important mathematical question, and as a side-effect invented one pillar of the modern world. At the basic level Turing’s hypothetical machine is extremely simple: a tape, a tape reader, a pen and some agreed symbols that can be read or written on the tape. Modern computers, at their most basic and fundamental level, still resemble a pastiche of the Turing machine.

Yet from this very modest little well springs a fountain of complexity. Modern software systems are stupendously complex. Failure is their normal condition; trying to exhaustively test every combination of factors is so vast a task that it is laughed out of polite company. Yet we can test the common cases. Better yet, with some deft mathematical footwork we can simply eliminate whole swathes of phase space from consideration. This is the Newtonian-Cartesian paradigm at work, busily mending its own fences.
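The scale of the problem, and the payoff from eliminating parts of the phase space, can be put in back-of-the-envelope numbers. This Python sketch assumes the simplest possible case: n independent on/off factors and a hypothetical test rig that runs a million tests per second. Both figures are illustrative assumptions, not measurements of any real system.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def configurations(n_flags):
    """Number of distinct on/off combinations of n independent boolean factors."""
    return 2 ** n_flags


def years_to_test_all(n_flags, tests_per_second=1_000_000):
    """Years needed to try every combination at an assumed, made-up test rate."""
    return configurations(n_flags) / (tests_per_second * SECONDS_PER_YEAR)


small = configurations(10)          # still feasible to enumerate
hopeless = years_to_test_all(64)    # hundreds of thousands of years
# Proving that one factor cannot matter halves the space that is left to test.
halved = configurations(63) * 2 == configurations(64)
```

Ten factors give 1,024 combinations; sixty-four give more than half a million years of testing at that rate. Every factor that can be proven irrelevant halves what remains, which is why the mathematical footwork matters more than brute force.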

Should you read this book?

Yes, I think so. But critically. Dekker’s book makes fascinating reading and I greatly enjoyed it. I may have attacked it here, but that’s only because I think he fell short of elucidating and proving his case. A fine book can still be a fine book even if its contents or conclusions are, in one’s own opinion, wrong (cf. Plato’s Republic).

4 Responses to Drift into Failure

  1. Fascinating, Jacques.

    In software, at least, it’s my experience that most things can indeed be traced back to a root cause, or causes.

  2. It depends on where you draw the boundaries, Robert. Within the machine? Certainly, in theory. Within the larger system of people interacting with people and other machines? Maybe.

  3. Pingback: Review: The Essence of Hayek (Part 1) | Journal de Jacques

  4. For some time, foolishly, I have considered myself to be a very good troubleshooter of mechanical and electrical systems. As we have drifted from hardware to software and from Mom and Pop to global firms, things got much more complicated than my faithful linear cause and effect methods could support. Systems today are not only complicated, they are complex, tight and non-linear. Small (unnoticed) changes can make for major changes when compounded. The concept of drift, perhaps taken from controls, is a valuable addition to the needs of today’s troubleshooters and incident investigators.
