A not-so-brief aside on reining in chaos

Everything that humans touch eventually becomes complex, whether we like it or not.

(Blog posts too. This one started out as a comparison of three competing software system alternatives and subsequently bloated into a discussion of chaos in computer systems.)

The problem is that being humans, we design systems. Design is less about the desired purpose of a system and much more about the limits of the designers. We’re trying to optimise in two dimensions: the problem space itself, and then according to the limits of human cognition. Within the problem space we can make incremental progress along an ostensibly continuous axis. In terms of human understanding … well that gets tapped out much more quickly.

That’s why I say to people that computer science is about the limits of computation; software engineering is about the limits of the engineers.

What an engineer wants is a linear system. Something where all the relationships are mapped out. Where any scenario can be foreseen and tested with pen and paper. But the world is filled with problems that aren’t linear systems. They have loops, unknown inputs, sudden catastrophic tipping points, surprising sensitivity to initial conditions and so forth.

It gets worse: some of these systems are completely artificial. Take computers. The basis of computer science, and by extension all actual computing systems of any kind, is the Turing Machine. It’s a tremendously simple device: a tape marked with symbols. The tape can be advanced or rewound, and there is a head with a pen and an eraser that can mark or clear points on the tape.

You establish a few rules about what to do when a certain symbol is under the head — and voila, all the things that can be computed are now computable. Indeed, any Turing Machine can simulate any other Turing Machine. If you’re interested, the book to look at is Charles Petzold’s fascinating The Annotated Turing.
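Here’s a minimal sketch of such a machine in Python (my own toy, not Turing’s exact formulation): a rule table mapping (state, symbol) to (symbol to write, head move, next state), driving a head back and forth over a tape.

```python
# A minimal Turing machine sketch. `rules` maps
# (state, symbol) -> (symbol to write, head move, next state).
# This rule table is a made-up example: it flips a string of
# 0s and 1s, then halts at the first blank.
def run_tm(tape, rules, state="start", max_steps=1000):
    cells = dict(enumerate(tape))  # sparse tape; blank is "_"
    head = 0
    for _ in range(max_steps):
        symbol = cells.get(head, "_")
        if (state, symbol) not in rules:
            break  # no matching rule: the machine halts
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells))

rules = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
}

print(run_tm("0110", rules))  # -> 1001
```

A handful of rules like these, with enough states, is all it takes to compute anything computable.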


Deep down, modern CPUs still resemble Turing Machines, if you squint a bit. Here’s some very low-level code for adding 1 to 2:

        li    $t1,   1
        add   $t0,   $t1,   2

The first line says “load the number 1 into location t1”. The second line says “add 2 to the value in location t1. Store the result in location t0”.

You can see how this could be visualised in tape-and-head terms. Symbols or strings of symbols can encode the instructions, the addresses and the values being computed. Low-level code of this kind (called assembler) mostly concerns itself with shuffling values from memory into the chip, performing a simple operation, then shuffling it to another spot. Memory for a chip looks a lot like Turing’s tape.
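The tape-and-head view can be made concrete with a toy interpreter for those two instructions (a Python sketch; the list-of-tuples encoding is invented for illustration):

```python
# A toy interpreter for the two instructions above. Memory is
# just a dictionary of named cells -- read a cell, compute,
# write a cell -- which is the tape-and-head picture again.
# (The list-of-tuples encoding is invented for illustration.)
def interpret(program):
    regs = {}
    for op, *args in program:
        if op == "li":            # load immediate: dst <- value
            dst, value = args
            regs[dst] = value
        elif op == "add":         # dst <- src + immediate
            dst, src, value = args
            regs[dst] = regs[src] + value
    return regs

print(interpret([("li", "$t1", 1), ("add", "$t0", "$t1", 2)]))
# -> {'$t1': 1, '$t0': 3}
```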

Deterministic but Complex … even Chaotic

So computers are deterministic. At each step, you can always predict what the next step will be. And often you can see a few steps ahead (but there are certain things you can’t know in advance — read the Petzold book for why).

Yet computers, at the level that we actually use them, are wildly unpredictable. Things regularly go kerflooie. What’s worse: we don’t know why. We have a computer system that is faulting, but we don’t know what defect or complex of interacting defects has caused it. The process of diagnosis is usually called “debugging” and it is a storied art in its own right.

The root approach to debugging is basically a degenerate scientific method. Form a gut suspicion, change something, observe the results, repeat. And indeed some of the literature on debugging calls for this approach to be consciously practiced. And it’s a useful abstraction to borrow, because it adds a few little accoutrements to the standard frantic intuitive debugging that occurs.

The better-dressed process for debugging has two parts. First you must be able to recreate the faulty behaviour; second you must diagnose the root defect(s). I’ll worry about diagnosis of defects another time.

So first you take the bug report and try to see if you can recreate the defect. And this is usually where debugging falls flat on its face. Computer systems are, it seems on the face of it, chaotic: small differences in starting conditions can blow up into stunning differences in actual performance. The developer’s workstation environment is sufficiently different from the production environment that some common factor is missing. (This is why web developers always start by asking “which browser are you using?” in the faint hope that they can blame it on Internet Explorer and wash their hands of the matter).

Controlling Chaos

So one source of misbehaviour in complex computer systems is divergent configurations. Different operating systems, slightly different versions of the same operating system, slightly misaligned system clocks, differently configured services, different network interfaces and on and on ad infinitum. And human intervention can make things worse, not better. System administrators log into a misbehaving server, notice that a particular setting is wrong and fix it manually. Before they go on to fix the secondary server the phone rings and they’re distracted. This well-meaning system administrator has just made things worse. The level of entropy has increased. The system landscape has just become more fiddly and has more hidden gullies of failure. The odds of failure have increased.

Luckily, computers are not humans. They will put up with any amount of regimentation and strictness without complaint. And unlike human laws, where compliance can never be complete, computers will always faithfully obey commands. Even the defective ones.

Let’s play make-believe for a minute. Playing make-believe is uniquely important to software development because — in general, modulo mathematical impossibilities — if we can imagine a computer doing something, we can eventually make the computer do that something.

Imagine that we have a reference model of how the system should be configured. Periodically some software inspects the real system and compares it to the ideal model. If there’s a difference found, that software takes necessary actions to bring the real system into alignment with the model we established earlier.
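That loop is short enough to sketch in Python (the setting names and the observe/repair hooks are hypothetical stand-ins, not any real tool’s API):

```python
# Detect-and-repair control loop sketch. `model` is the
# reference state; `observe` reads the real system; `repair`
# brings one divergent item back into line. All names here
# are illustrative placeholders.
def converge(model, observe, repair):
    actual = observe()
    for item, desired in model.items():
        if actual.get(item) != desired:
            repair(item, desired)  # only touch what diverged

# Toy "system": a dict standing in for real configuration.
system = {"ntp.enabled": False, "ssh.port": 22}
model = {"ntp.enabled": True, "ssh.port": 22}

converge(model,
         observe=lambda: dict(system),
         repair=lambda k, v: system.__setitem__(k, v))
print(system)  # -> {'ntp.enabled': True, 'ssh.port': 22}
```

Run it periodically and drift gets pushed back down every cycle.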

The pair of you who read this blog might recognise what we have here. Yes: it’s another control system.

I love it when an analogy comes together

Luckily for me, it won’t be necessary to develop such a control system myself (and thereby pick up something useless like a Masters or PhD en route). This genre already exists, and there are three main pieces of software in it: CFEngine, Puppet and Chef.

CFEngine, Puppet and Chef compared

The first tool I looked at was Chef. Of the three, Chef is the youngest. Being a groovy, funky, channel 27 sort of fellow, I decided this meant that It Was The Best.

I persisted with Chef for about 6 or 7 weeks, including a long diversion into trying to write a new package provider for it. I threw in the towel for three reasons. Most important of these was the realisation that my vision of such a system, and Chef’s vision, were actually very different. Chef positions itself as an alternative to Puppet and CFEngine, but it’s not, really. It’s a remote execution engine with a great deal of architectural overhead.

Part of why it took me so long to realise this is that Chef falls prey to the Ruby community propensity to utterly unnecessary puns and wordplay. I’m now embarking on my fourth decade and the appeal of wrapping concepts in clever but non-descriptive names has well and truly worn the hell off. Chef takes the whole metaphor to the nth degree — recipes, knives and so on — until mysteriously it doesn’t (Ohai). This did not help me in coming to grips with how the thing actually worked.

And that’s the second problem: you need to know too much about how Chef works in order to get work done. Chef is less a tool and more a framework; and it’s not until you’ve spent weeks head-butting the documentation that it becomes apparent that this framework is about abstractly listing commands for remote execution.

If only I’d paid more attention to the surrounding literature. Chef’s documentation repeatedly mentions the concept of idempotency. The point is that Chef only seems to get as far as idempotency and no further. If your scripts are idempotent, goes the Chef reasoning, running them multiple times will always cause your system to converge to the correct state.

The problem is that this pushes all responsibility for ensuring idempotency back onto me. And I don’t bloody want it. I would rather have a detect-repair mechanism embedded in the tool than have to recreate that logic myself. So far as I can tell, Chef supports this only partially.
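To make the complaint concrete, here’s the kind of check that an idempotency-only model leaves me to write by hand for every resource (a Python sketch, not Chef code):

```python
import os
import tempfile

# Idempotent by my own effort: the existence check is mine to
# write (and to get right) for every resource like this one.
def ensure_line(path, line):
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = [l.rstrip("\n") for l in f]
    if line not in existing:
        with open(path, "a") as f:
            f.write(line + "\n")

path = os.path.join(tempfile.mkdtemp(), "hosts")
ensure_line(path, "127.0.0.1 localhost")
ensure_line(path, "127.0.0.1 localhost")  # second run changes nothing
with open(path) as f:
    print(len(f.readlines()))  # -> 1
```

A tool with detect-and-repair built in carries that check for me; an idempotency-only tool hands it back.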

Next I looked at CFEngine. This tool has been around in various forms since the early 1990s. Indeed there’s a certain amount of theory that underlies the tool. But I still found that CFEngine wasn’t for me, for two reasons.

First, the specification language is unnecessarily poor. There’s too much syntax and fiddliness that could have been done away with. In some places the chosen nomenclature is confusing. Why are “classes” not called “conditions”, for example? Especially since the term “class” means something entirely different in other languages — and the term “conditions” neatly describes what CFEngine means by “class”.


Second, and my main issue: CFEngine does not allow dependencies between elements of the model to be enforced. Instead CFEngine relies on multiple applications of policies to “gradually converge” to the desired state. In the book Learning CFEngine, an enormous amount of jiggery-pokery is devoted to recreating the concept of dependencies. “Baby”, goes the saying, “I don’t got time”.

Puppet is the middle child. Puppet inherits some of its thinking from CFEngine but, blessedly, directly allows modelling of dependencies between different parts of the model. This turns the configuration model from being either a collection of idempotent scripts (Chef) or a thin gas of atomic configuration items (CFEngine) into something useful: a directed acyclic graph of configuration items.
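The DAG view can be sketched with Python’s graphlib (the resources and edges here are invented for illustration):

```python
from graphlib import TopologicalSorter

# A toy model: each configuration item names its predecessors.
# The items and edges are invented for illustration.
deps = {
    "package:apache2": set(),
    "file:/etc/apache2/apache2.conf": {"package:apache2"},
    "service:apache2": {"file:/etc/apache2/apache2.conf"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
# The package comes before the file, the file before the service.
```

Walk the graph in that order, applying detect-and-repair at each node, and every item is only touched after the things it depends on are known-good.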

What Puppet appears to get right, in contrast to the other two, is that I want both: the declarative, detect-and-repair model for individual configuration items, and the ability to say that the relationships between those items matter.

In this respect Puppet better models how the real world of systems works. It also fits better into models such as ITIL, which are grounded in dependency models of configuration items and services.

And so it is that I will be using Puppet from here on out to configure my systems and to continuously rein in creeping entropy on them. One more pocket of the universe saved from the greedy clutches of chaos.

This entry was posted in Robojar, Software Engineering, Systems, Technical Notes.

27 Responses to A not-so-brief aside on reining in chaos

  1. David says:

    Give Salt a try. saltstack.org
    Configuration is easier.
    Salt scales.
    Salt is fast.
    #irc on irc.freenode.net

  2. Excellent article.

    CFEngine’s classes predate OOP classes.

    CFEngine is adding support for modeling and enforcing dependencies with the depends_on attribute.

    CFEngine is a small C binary which means it can run in a lot of environments and on a lot of devices.

    Thanks again for your comparative analysis.

    Aleksey
    CFEngine Trainer
    CM Enthusiast

    • CFEngine’s classes predate OOP classes.

      Given that CFEngine 1 was created in 1993 and Simula 67 was created in … 1967 … this is a silly claim to make.

I like the compiled nature of CFEngine and it’s good news that proper dependency modelling will be added to the language. For now I’ll stick with Puppet because it’s already there.

      • Right on. Thanks for the data re Simula 67, that’s good to know. I appreciate the correction.

        I should have said the CFEngine term “class” was picked before OOP got really popular – according to Wikipedia that was in early to mid 1990’s. But I don’t mean to belabor the point. Thanks again for the correction.

        Best,
        Aleksey

  3. Pingback: IT Automation Digest for November 2, 2012 | Puppet Labs

  4. Eystein Stenberg says:

    Thanks for writing the article.

    When it comes to dependencies, CFEngine deliberately avoids modeling them as a DAG because it is just not possible in the generic case (unfortunately). It seems to be a mixed blessing for the Puppet users also (try googling for ‘puppet could not find dependency’ for example).

The issue of ordering is certainly not easy. In Chef you implicitly specify an order since it is procedural, in Puppet you must explicitly specify an order for things that need it (otherwise they may be executed in any order). In CFEngine, you can specify the sequence for parts of the policy implicitly (through bundles) – so it is somewhere in the middle – and convergence also serves as an important concept.

    • In CFEngine, you can specify the sequence for parts of the policy implicitly (through bundles)

      And the CFEngine book spends about half its time doing exactly this. You wind up littering the conceptual universe with “class” after “class” in order to sort-of create an order.

The concept of convergence is nice, so far as it goes. But the way CFEngine achieves it is … to be run 3 times and assume that convergence will occur. While in practice the Beetlejuice method probably works a treat, I simply don’t like leaving it to chance. As I say in the review, I assumed for all three systems that the conceptual model is a DAG with a central detect-and-repair mechanism to explicitly keep the actual system aligned with the model. But only Puppet actually does that.

      • Mark Burgess says:

        I don’t think you really understand how CFEngine’s algorithm works. Three passes (by default) are used to resolve non-deterministic dependencies not preplanned ones. There is chance involved, because you cannot have complete information about the system at the start.

        Unfortunately, you are leaving it to chance with Puppet too. It is theoretically impossible to guarantee the correct outcome when the very system you are running can change the initial state. Puppet has to base its computation of a DAG on the state before the system is changed, but then it can change the system, so that invalidates the original assumption. In the worst case, there could be an infinite loop, which is why CFEngine allows an arbitrary cutoff at tree-depth of 3.

        The comparison in the article seems fairly biased in favour of Puppet, and the conclusions are not quite supported by evidence. But I enjoyed reading the introduction. Thank you for the nice article.

        • Mark Burgess says:

          Ah, I meant to add: loop or you can miss an item so that the DAG is incomplete. It’s a hard problem that needs formal methods rather than “opinions” to solve.

          • Let me see if I have you right.

            Your problem is not so much with having a DAG model, but with the repair phase.

            ie while trying to satisfy A, B becomes invalid; while satisfying B, A becomes invalid. So the repair mechanism flip-flops in a loop forever, unable to progress.

            Is that right?

        • There is chance involved, because you cannot have complete information about the system at the start.

          Interesting. You’re saying that a single-pass DAG approach can’t work because each step in the repair phase can invalidate the model.

          OK, I see that. One thing I didn’t address that’s been lingering at the back of my head is that these tools all tend to emphasise modelling the “system at rest”, rather than the “system in motion”. By that I mean, they excel at describing the contents of configuration files, what software should be installed and so on. Conceptually they are best at describing how the system should look at boot time. CFEngine had by far the richest capability for this.

Where it starts to come a bit unglued is in the transition to runtime. CFEngine has better mechanisms for periodically performing process-monitoring tasks compared to Chef or Puppet. But it irks me, for example, that the common case is to say “deploy this Upstart/SMF/monit/whatever script” rather than “this service relies on that service”. There’s a clear fault line between the bits on disk and the bits when they spring to life.

          In the worst case, there could be an infinite loop, which is why CFEngine allows an arbitrary cutoff at tree-depth of 3.

          Puppet dies when a circular dependency is created.

          The comparison in the article seems fairly biased in favour of Puppet, and the conclusions are not quite supported by evidence.

          Well it’s biased insofar as that’s how I found myself reacting to the experience. I didn’t have a horse in this race before I began.

          But in retrospect, I see that the DAG/detect-and-repair approach is how I’d mentally modelled the problem before I even started. So really it was my fault for not seeing that the three programs actually have very different conceptual approaches.

    • “in Puppet you must explicitly specify an order for things that need it (otherwise they may be executed in any order). ”

Actually that’s not accurate. Since Puppet 2.7.0, resources that are not specifically ordered now sort in the same order each run. They are executed in the same order each time, NOT in any order. This removes problems where the random order of resource execution could cause failures across runs.

      • Aleksey Tsalolikhin says:

        How is that order determined, please, James?

        Best,
        Aleksey

        • Order is based on the:

          1) the status of the resource (ready, done)
          2) the explicit & implied dependencies
          3) the salted SHA1 of the title (for stability)

          The last one, 3), is the change that ensures ordering is no longer top sorted and potentially random. I’m not an expert in the maths but it was described to me as frontier ordered by salted SHA1.
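One plausible reading of “frontier ordered by salted SHA1”, sketched in Python (this is illustrative only, not Puppet’s actual implementation):

```python
import hashlib

# Sketch of "frontier ordered by salted SHA1": among resources
# whose dependencies are all satisfied (the frontier), always
# pick the one with the smallest salted SHA1 of its title.
# Deterministic across runs, yet still respecting dependencies.
# (An illustrative reading of the description, not Puppet code.)
def stable_order(deps, salt=b"salt"):
    done, order = set(), []
    remaining = dict(deps)
    while remaining:
        frontier = [r for r, d in remaining.items() if d <= done]
        nxt = min(frontier,
                  key=lambda r: hashlib.sha1(salt + r.encode()).hexdigest())
        order.append(nxt)
        done.add(nxt)
        del remaining[nxt]
    return order

deps = {"a": set(), "b": set(), "c": {"a"}}
print(stable_order(deps))  # same order every run
```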

  5. Hi Jacques,

    First of all, great article – modeling, monitoring, configuring and fixing real systems is a hard task.

    Second, and disclaimer: I work at CFEngine and I’m also the author of Learning CFEngine 3, so I’m not completely unbiased 😉 but I try to be fair.

    You make some good points about all systems, but I’d like to comment on a few regarding CFEngine, since there have been a number of significant changes recently:

About dependencies: when needed, ordering can be expressed in CFEngine using classes (as you saw in my book). As of CFEngine 3.4.0 (now in public beta), the depends_on attribute (which used to be informative only) has become active, and can be used to vastly simplify the declaration and handling of these dependencies. Now you can just use depends_on => "promise_handle", and CFEngine will automatically do the rest. This is not mentioned in the book because it covers only CFEngine 3.3.0 – I have a new edition in the works that will cover all the new features in 3.4.0.

    About services and service dependencies: CFEngine 3.3.0 introduced “services:” promises for Unix (previously they existed only on the Enterprise version, for Windows), which allow you to express the desired state and dependencies of services at a high level, and have CFEngine handle the underlying details automatically. Again, something that will be better covered in the upcoming edition of the book.

About keeping up with chaos and modeling the system in motion: CFEngine runs every 5 minutes by default (and is able to keep up with this frequency even for thousands of hosts), so it is pretty good at keeping an eye on and fixing things that deviate. For a good explanation of the 5-minute cycle, see https://cfengine.com/blog/ten-reasons-for-5-minute-configuration-repair . A CFEngine promise will check every time (obeying also some locking and timeout rules to avoid overload) whether it is in the desired state, precisely because systems tend to change under our noses, so we can’t assume that what has been fixed will stay fixed.

    About idempotence vs convergence: as you rightly point out, they are not the same, and there are very important differences. Idempotence is an operation that leaves the system unmodified if it’s already in the desired state, whereas convergence is the property of a system of not making unnecessary changes (nor operations). The end result may be the same, but CFEngine emphasizes convergence over idempotence, and has A LOT of built-in logic to automatically avoid performing unnecessary operations.

    About the use of “classes” – we are slowly trying to refer to them as “contexts” instead of “classes”. As Aleksey mentioned, the use of “classes” in CFEngine predates the wide popularization of its use for OOP (although not its creation). Eventually the term should be replaced by “contexts”.

    Thanks again for the nice article, and all the best.

    • Great reply, thanks.

      By services I suppose it’s more like the concept of an init script? Ie, “WordPress depends on PHP, MySQL and Apache — are these running?”

      That’s part of the runtime model; but a true process manager inserts itself into the fork() hierarchy to monitor and manage services in real time.

      Architecturally that’d be a big step to take. But boy it would be nice from the sysadmin’s POV.

  6. Pingback: Idempotence vs Convergence in Configuration Management | Vertical Sysadmin

  7. Jeff says:

    Jacques, please turn on category RSS feeds for your blog so I can subscribe to “Systems” ? Enjoying the conversation here.

  8. Danny says:

    Just to pile on the CFEngine bandwagon here… :) I think of “class” as short for “classification”, as they are used to tag the local system. I think of a system as being a member of a specific class, which actually somewhat fits the OO usage of the term.

    But the real reason I posted was that I’m surprised to find that you found CFEngine’s syntax complicated, but not Puppet’s. CFEngine has one structure for everything:
    class:: promise key => value, key2 => value2;
When I was learning Puppet, I’d get really annoyed at the way sometimes you use upper case, sometimes lower case, sometimes quote things, other times no quotes, some file edits use this function, other file edits use some other module, etc. CFEngine’s simple rules made it much easier for me to pick up, and made it feel well-planned. Conversely, Puppet’s language inconsistency made me feel like there was a bunch of “whoops, we didn’t plan for this, but it’s too late now” in the design – not what I want in my management tools. :)

    The dependencies, however, you’re right-on about. I’m really looking forward to 3.4.0 where those will finally be available so I can stop with the hand waving of making a zillion classes just to sequence things.

    • I think of “class” as short for “classification”, as they are used to tag the local system.

I still think it’s a poor choice of keyword. There was plenty of time to see the writing on the wall that “class” was a poor choice. Version 3 was apparently a total rewrite, including a do-over of the language.

      It’ll be one of those things that always comes up in conversation, like 1-based indexing in Lua.

As for syntax, I found it unnecessarily fiddly; in fairness I find Puppet the same.

      IMNSHO it’d be easier to do something YAMLesque, eg

          class: promise
               key1: value1
               key2: value2
      

      There’s too much syntax, is my point. If doing key-value lookups was an occasional thing, then using the hashrocket syntax would be just fine. But for CFEngine (and Puppet) it is the most common operation.

      Much as C made = the assignment operator because assignment is more common than equality checking, I think CFEngine (and Puppet) could stand to remove unnecessary syntactic noise.

      I also noted that I felt it was impoverished — lacking in niceties from other languages. In this respect Chef, embedded as it is in Ruby, has the edge (but the model is, as I noted, not one that I like).

      Thanks for dropping in.

      • Interestingly, just a couple of weeks ago Andy Chase submitted a patch against CFEngine core to make it possible to write policy in YAML: https://groups.google.com/d/msg/help-cfengine/Kd2bHVjF-OU/ExauypMx5FwJ

        This has generated a lot of discussion, both internally and externally, and although the patch has not been incorporated into the code, it definitely is good food for thought.

        • Ah, now that’s interesting. I think it’d make CFEngine a lot more popular; especially since YAML is well-supported in other languages.

          If that patch gets into core (say in 3.5), look me up when it’s released, I’ll review CFEngine again.

      • Brian P O'Rourke says:

        I recently did a similar survey of these three, and found similar results – Puppet was close-enough to correct for most needs, and its overall user experience is far better than cfengine’s. That said, Puppet was too obscenely resource-intensive for my particular use-case, so I ended up on cfengine in the end.

        I certainly agree with you about the syntactical obtuseness of cfengine – it’s my biggest gripe.

Regarding the word ‘class’ – it never struck me as odd. If you’ve done HTML/CSS, you’re familiar with a class that applies a group of style rules to an element – similarly a cfengine class is used to select a group of promises about system behavior.

        Granted, one would expect more overlap between OOP and cfengine than web markup and cfengine…

  9. Sean OMeara says:

Hi Jacques,

Nice article, but you’re a bit off the mark. Chef is definitely not a remote execution framework. Hosts manage themselves autonomously by pulling policy, rather than by remote execution, which implies an external orchestrator telling everyone what to do. Resource statements in Chef are “convergent operators”, just like in CFEngine and Puppet. They always do a check to see if they need to take corrective action. That’s the whole point.

    Re: “declarative”, resources provide a declarative interface to the subject under management. Out of the three tools, Puppet “looks” the most declarative, due to the english language semantics of “ensure” and other keywords in its DSL. All three tools use the “detect and repair” model.

Just to clear some vocabulary misconception up really quick: “Convergent” is a stronger property than “Idempotent”. Idempotent is a word very much mis-used and abused in the CM space. Idempotent only means that you will have the same outcome after every application of a function. A template resource, for example, would be idempotent if it did the work of writing the same contents to a file during every trip through the control loop. It is convergent if it checks whether it needs fixing before taking action.
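That template distinction can be sketched in Python (hypothetical functions; both runs converge to the same file contents, but only the convergent version skips needless writes):

```python
import os
import tempfile

CONTENT = "127.0.0.1 localhost\n"
writes = {"idempotent": 0, "convergent": 0}

# Idempotent: same outcome every run, but it writes every time.
def template_idempotent(path):
    writes["idempotent"] += 1
    with open(path, "w") as f:
        f.write(CONTENT)

# Convergent: detect first, repair only if something diverged.
def template_convergent(path):
    current = None
    if os.path.exists(path):
        with open(path) as f:
            current = f.read()
    if current != CONTENT:
        writes["convergent"] += 1
        with open(path, "w") as f:
            f.write(CONTENT)

d = tempfile.mkdtemp()
for _ in range(3):
    template_idempotent(os.path.join(d, "a"))
    template_convergent(os.path.join(d, "b"))
print(writes)  # -> {'idempotent': 3, 'convergent': 1}
```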

    Chef recipes undergo a compilation phase that builds a “resource collection”, which is analogous to a Puppet manifest. Instead of sorting a DAG to determine the ordering, a set of Chef resources are added as the recipe is evaluated. The run_list is a set of these sets. This makes ordering easy to reason about.

    Finally, the “multiple runs” thing applies to all convergence based CM tools, and is a behavior of the operators cooperating over multiple runs, so long as they don’t conflict with each other. Chef actually chooses to not take advantage of this behavior by default, stopping the run when an error is encountered instead of moving on and letting subsequent runs straighten it out (CFEngine behavior), or by applying another independent tree (Puppet behavior). You do have the option to ignore_failure at a per-resource level and have it behave more like CFEngine if you want to.

    Things you’ll find interesting.
    http://static.usenix.org/event/lisa98/full_papers/burgess/burgess.pdf
    http://www.infrastructures.org/papers/turing/turing.html
    http://www.cs.tufts.edu/~couch/publications/aims-08-mael.pdf
    http://blog.afistfulofservers.net/post/2011/12/30/cfengine-puppet-and-chef-part-1/

    Hope that helps,

    -s

“Idempotent” appeared repeatedly in the Chef documentation, so I think I reasonably assumed that was the logical basis for the Chef approach.

      Instead of sorting a DAG to determine the ordering, a set of Chef resources are added as the recipe is evaluated.

This and the focus on idempotency is why I characterised Chef as an indirect remote execution framework. Chef throws a lot of the work back on me. I simply don’t want to do it. If I am going to have to worry about giving Chef idempotent components myself and if I have to schedule everything myself … then what exactly have I gained over a bog-ordinary REF? It seems like a lot of work for little advantage.

      • Ben Langfeld says:

        I think you’re still failing to understand Chef, Jacques. In terms of idempotency, Chef is very much the same as Puppet. One places a file on a system using Puppet like so:


        file { "/etc/hosts":
        contents => "127.0.0.1 localhost",
        ensure => present,
        }

        And with Chef, like so:


        file "/etc/hosts" do
        content "127.0.0.1 localhost"
        action :create
        end

        Actually, Chef is just pandering to people used to writing imperative Ruby here, and :create doesn’t really mean the imperative create, it means the same logically as Puppet. I believe this was poor choice of vocabulary on Chef’s part, but the function is the same.

        As far as scheduling goes, this is also very much the same between the two. With Puppet (discounting any recent announcements to copy what Chef does) one must specify dependencies explicitly between resources, eg:


package { 'apache2':
  ensure => present,
}

file { '/etc/apache2/conf/apache2.conf':
  ...
  require => Package['apache2'], # Because otherwise the containing directory wouldn't exist yet
  notify  => Service['apache2'], # To ensure the service gets restarted after placing the config file
}

service { 'apache2':
  ensure => running,
}

        With Chef:


package 'apache2' do
  action :install # Again, poor naming
end

file '/etc/apache2/conf/apache2.conf' do
  ... # No dependency on the package here, this is implicit because the package resource was defined before the file resource
  notifies :restart, 'service[apache2]'
end

service 'apache2' do
  action :start # More bad naming, functionally equivalent to Puppet
end

        With Chef you have not had to do any more work, you have not had to write any Ruby that is any more complex than the Puppet DSL, you have not had to consider idempotency in any way differently than with Puppet and you have done no scheduling, at least in as much as you did none with Puppet.

        Does that clear up that Chef and Puppet are really very similar and don’t have any productivity differences as a result of these core concepts?
