Working in IT Operations for nearly two decades, I haven't seen many transformational new practices. The way to improve is mostly just learning the same lessons people have been re-learning since computers were invented.
But there are one or two transformational ideas that have popped up during my time. One of those ideas is that Configuration Management is an anti-pattern. Its use is actively worse for the System as a whole, and there are better ways to manage Systems.
The following is my (admittedly weak) attempt to explain why you should not use Configuration Management (most of the time), and an alternative practice.
When somebody maintains a System, they notice over time that they need to change that System every so often. And every so often, the System fails.
Investigation into why it failed often shows the System wasn't running the way it was intended to. Somehow something changed (configuration, software versions, etc) or something else happened that differed from the last known good configuration.
In these cases, most people think something like "Hey, we should have some software that prevents this System's state from getting out of whack!"
Thus was born Configuration Management: tools to go around and fix System state.
Configuration Management systems have been around for years. I've used many of them. I started out using CFEngine and CFEngine 2, and later moved on to Puppet and Ansible. I've even used "Enterprise" CM systems from Hewlett Packard: giant monstrosities that were full of features, but clunky to use.
Modern CM systems work on the idea of a declared or desired state: you declare what you want, and the tool makes it happen. Oh, state isn't correct anymore? Don't worry: the tool will make it right, fixing the System to match what you declared you wanted. (Some CMs require a more explicit order of operations, but they're really not that different)
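To make that model concrete, here's a toy sketch of the declare-and-converge loop. It's not any particular CM tool's syntax, and the paths and modes are made-up examples:

```python
import os
import stat

# Toy model of "declare what you want; the tool fixes drift".
# The file paths and modes here are hypothetical examples.
desired_state = [
    {"path": "/etc/app.conf", "mode": 0o644},
    {"path": "/var/log/app", "mode": 0o755},
]

def converge(declarations):
    """Compare each declared resource to reality and correct any drift."""
    for decl in declarations:
        current_mode = stat.S_IMODE(os.stat(decl["path"]).st_mode)
        if current_mode != decl["mode"]:
            # State drifted; put it back the way it was declared.
            os.chmod(decl["path"], decl["mode"])

converge(desired_state)  # a real CM tool runs this sort of loop forever
```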
Seems logical, right? If the state's wrong, we need to fix it... And "declaring" System state seems simpler than describing every single operation needed, in order, to get back to the good System state.
Thing is, there are some fundamental problems with this model. Problems so fundamental that they sometimes negate the usefulness of the solution. But there are better ways to go about it, which I'll cover later.
If a System is operating nominally, and it never changes, it shouldn't break down, right? Nothing seems to be happening to cause any change in the System, so the System should just keep ticking away.
But every System does change. Time itself will cause your System to change; as a ticking digital clock increments the seconds since the Unix epoch, eventually the integer, double, float, long, long long, or whatever data type the number is stored in will overflow, resetting the time, and possibly wreaking havoc on something. Computer parts will degrade over time; hard drives die, fans get clogged, capacitors blow. Cosmic rays shooting through the universe flip bits in RAM. Eventually your System will change, and it will break. That's entropy for ya.
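As a worked example of the time problem: if that seconds-since-epoch counter is stored in a signed 32-bit integer, it runs out in 2038.

```python
from datetime import datetime, timezone

# The largest value a signed 32-bit counter of seconds-since-epoch can hold:
print(datetime.fromtimestamp(2**31 - 1, tz=timezone.utc))
# 2038-01-19 03:14:07+00:00 -- one second later, the counter overflows
```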
On a more practical scale, simply using a System often involves changes:
For an automobile, that might be the pistons pulsing up and down, sloshing in oil, being shoved back and forth in a metal cylinder while explosions ripple through its frame thousands of times a minute. Eventually that oil degrades and needs to be changed, or the whole machine will quickly break down.
For a piece of software, there are many different kinds of constant change: allocation and deallocation of memory, network connections starting/stopping/timing out, input being taken and processed, output being generated, disks being written to. The configuration of the System is also often changed due to maintenance, bug fixes, and feature adds.
So any System's state is always changing. That seems like a pretty good argument for CM, right? We need something to constantly fix all this mutating and drifting state. But it's not as cut and dried as "just fix the state".
One problem with trying to use CM to fix System state is that it's not very easy.
CM is basically a form of low-code development. You use the CM's library of functions and "configure" them to do what you want: "Make this file have X permissions", "Put this file over there", etc.
The more complex the System is, the more "configuration" you have to write, in increasingly complex ways. CM configuration is written in a Domain Specific Language, which is basically a tailor-made programming language. The "configuration" is literally source code for an interpreted language (the CM); even if it doesn't seem like "real programming" in a "real language", the result is very similar.
Because it's code, it eventually has bugs, which eventually leads to bugs in the System. So this "configuration" (source code) needs to be tested on "test" Systems in order to prevent accidentally messing up the production Systems. Inevitably you spend a good chunk of time working on this code. You need to hire specialists who are familiar with the "language" and "tooling" so they can maintain it, run it, troubleshoot it.
In the end, the CM hasn't bought you much over a regular programmer writing custom code. I suppose the difference is, "real" programming is more challenging, and the CM tool comes bundled with pre-written, tested, purpose-built, portable functions. If you're maintaining lots of different kinds of systems, this can be advantageous. But in many cases, the end results aren't much different than writing everything from scratch.
So far we've learned that the System's state is always changing, and CM is code we write to try to continuously fix the System state.
Wouldn't it be nice if we didn't have to write and maintain this code? If we didn't have to wrestle with fixing the System constantly?
What if there were a way to freeze the System in a good state, exactly how we want it to run? When the State becomes unruly, we can zap the good state back in place and have the System run normally again. Is there any way we could achieve this state nirvana?
The idea goes like this: capture the System in a known good state as a versioned artifact, and when the running System's state goes bad, replace the System with that artifact instead of trying to repair it in place.
This has a lot to do with the Cattle vs Pets analogy.
If you treat your Systems as Cattle, then it's very easy to quickly replace them, and use a versioned artifact to restore their state. (Pets vs Cattle analogy in short: servers treated like pets stick around for a long time, but require constant feeding, care, emergency visits to the vet... when what you want is cattle: interchangeable, nameless livestock that can be easily replaced on a whim. sorry, cattle... 😢)
What you gain from this method is an abundance of benefits.
This goes for containers, software packages, software configuration, etc. All of it can be saved and restored from a known good state. And the best part is, there's no coding a CM tool's configuration, no maintaining the CM software, etc. All you have to do to fix the System's state is restore the System to a known good artifact of its previous state.
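As a conceptual sketch of the difference between the two approaches (toy dictionaries standing in for real Systems and artifacts, nothing here is a real tool):

```python
# Toy contrast between the two approaches; dicts stand in for real Systems.

def fix_with_cm(system: dict, desired_state: dict) -> dict:
    # Mutate the long-lived System in place until it matches the declaration.
    for resource, value in desired_state.items():
        system[resource] = value
    return system                                # same System, patched up yet again

def fix_immutably(artifacts: dict, last_good_version: str) -> dict:
    # Don't repair anything; replace the System with the known good artifact.
    return dict(artifacts[last_good_version])    # a fresh copy of the snapshot
```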
With Immutable Infrastructure you can reduce your staff, reduce System complexity, increase System reliability, make changes more frequently (& with confidence), and reduce your overall cost.
This is all great in theory, but it's not always obvious how to implement it. Immutable Infrastructure is a philosophy more than a single practice. Different Systems need to be adapted to work Immutably.
Here are some guidelines for making it all work:
Build your artifacts, then test them, then version them, then deploy them, and roll back to the last known good artifact when something goes wrong. This is almost always the same pattern for all Immutable Infrastructure (a rough sketch of it follows these guidelines).
Don't persist changes. If there's a change made to a System during its operation, it must not persist into the next restart of the System. If the System depends on persisted changes, that needs to be done in a way that is also immutable. Changes that persist, but are not immutable artifacts, will carry bugs with them that you may not be able to fix.
The Environment is an artifact too. If you built your System to work in one region/zone, and snapshotted it, and try to deploy that in a different region/zone, it might not work! The Environment is everything that changes between deployments; environment variables, credentials, regions/zones, IP addresses/hostnames, cluster namespaces, etc. If the Environment changes, you're gonna need to test your System with those changes. So keep track of the Environment along with your artifact, so that if a failure occurs, but the artifact hasn't changed, you can check if it's due to the Environment changing.
Tailor the Deployment to the System. Different Systems require different Deployment Methods. Some might need persistent disks; some might need each copy of the System replaced one at a time; some might need a Blue/Green replacement; etc.
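Here's that rough sketch of build → test → version → deploy → roll back, with the Environment recorded alongside the artifact. The `make` targets and the environment fields are hypothetical placeholders for whatever your System actually needs:

```python
import subprocess

def build(version: str) -> str:
    artifact = f"app-image:{version}"
    subprocess.run(["make", "build", f"TAG={artifact}"], check=True)
    return artifact

def test(artifact: str) -> bool:
    return subprocess.run(["make", "test", f"TAG={artifact}"]).returncode == 0

def deploy(artifact: str, environment: dict) -> None:
    # The Environment travels with the artifact, so a failure can be traced to
    # either "the artifact changed" or "the Environment changed".
    subprocess.run(
        ["make", "deploy", f"TAG={artifact}", f"REGION={environment['region']}"],
        check=True,
    )

def release(version: str, last_known_good: str, environment: dict) -> None:
    artifact = build(version)
    if test(artifact):
        deploy(artifact, environment)
    else:
        deploy(last_known_good, environment)  # roll back to the last good artifact
```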
There's no one way to do it, and it's not even a new idea. Sure, the term is over 10 years old, but we were using Norton Ghost back in 2002 to do the same thing (imaging and restoring OS installs), and dd on Unix before that. Even without VMs, today a bunch of bare-metal servers can have identical configuration, restored from an artifact every time they reboot, with network configuration provided on the fly from BOOTP/DHCP, remote hands from IPMI, commands executed over SSH, and immutable artifacts installed via RPM.
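For the flavor of that older approach, restoring a machine's disk from a "golden" image really is a one-liner with dd. The paths here are hypothetical, and this overwrites the disk, so treat it purely as illustration:

```python
import subprocess

# Restore /dev/sda from a golden image -- destructive, illustration only.
subprocess.run(
    ["dd", "if=/images/golden.img", "of=/dev/sda", "bs=4M", "status=progress"],
    check=True,
)
```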
What are some concrete ways Immutable Infrastructure can replace CM?
VMs:
Using Configuration Management: you keep long-lived VMs around and repeatedly run a CM tool (Puppet, Ansible, etc.) against them, pushing their state back toward whatever you declared.
Using Immutable Infrastructure: you bake a versioned VM image with everything installed and configured, deploy fresh VMs from that image, and when a VM's state goes bad you replace it with a new one from the last known good image instead of repairing it.
Containers: the same pattern, with the container image as the artifact: build it, test it, version it, deploy it, and roll back to the previous image when something breaks, as sketched below.
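The sketch below assumes a local Docker daemon, a Dockerfile in the working directory, and made-up image names and tags:

```python
import subprocess

def build_image(tag: str) -> None:
    # The image is the artifact: build it once, version it with a tag.
    subprocess.run(["docker", "build", "-t", f"myapp:{tag}", "."], check=True)

def replace_container(tag: str) -> None:
    # "Fixing" the System means replacing it with the artifact,
    # never editing the running container in place.
    subprocess.run(["docker", "rm", "-f", "myapp"], check=False)
    subprocess.run(
        ["docker", "run", "-d", "--name", "myapp", f"myapp:{tag}"], check=True
    )

build_image("v42")
replace_container("v42")     # deploy
# Something broke? Roll back to the previous known good image:
# replace_container("v41")
```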
Immutable Infrastructure is a transformational way to think about the operation of Systems. It provides stability, security, and peace of mind; lowers cost; and makes automation simpler. But not everyone has fully adopted this practice. Cloud Computing can seem like a panacea for everyone's problems... but every single Cloud vendor has failed to adopt Immutable Infrastructure in their designs.
Take AWS for example. They developed a service called AWS S3, which purports to be a simple way to store and retrieve blobs of data using an API. It sounds like a great idea: unlimited storage, high performance, and a consistent control interface? What's not to love! I mean, APIs!!!
But the thing is, AWS S3 is not immutable. It may seem like it because you can control it using an API. But an API is just an interface, the way file I/O is an interface, or a network protocol, or a GUI. An API doesn't mean the System's state is immutable. You can change hundreds of different properties in an AWS S3 bucket, but there's no way to take an Immutable Artifact of a bucket, in a specific state, and re-apply it when its state changes.
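To see how much hand-rolling it takes just to approximate that, here's a hedged sketch using boto3 that snapshots and re-applies only two of a bucket's many properties. The bucket name is made up, and it assumes versioning and a public access block have already been configured on the bucket:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # hypothetical

def snapshot_bucket() -> dict:
    # Capture a couple of properties as a "known good artifact" of the bucket.
    return {
        "versioning": s3.get_bucket_versioning(Bucket=BUCKET).get("Status", "Suspended"),
        "public_access": s3.get_public_access_block(Bucket=BUCKET)[
            "PublicAccessBlockConfiguration"
        ],
    }

def restore_bucket(artifact: dict) -> None:
    # Re-apply the captured properties; every other property would need its
    # own pair of get/put calls, and some can't be re-applied at all.
    s3.put_bucket_versioning(
        Bucket=BUCKET, VersioningConfiguration={"Status": artifact["versioning"]}
    )
    s3.put_public_access_block(
        Bucket=BUCKET, PublicAccessBlockConfiguration=artifact["public_access"]
    )

known_good = snapshot_bucket()
# ...later, after something changes the bucket out from under you:
restore_bucket(known_good)
```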
In order to deal with this problem, tools like Terraform came about, and were called "Cloud Orchestrators" (because it sounds cooler than "Cloud Configuration Management Tools"). Terraform was built to create an AWS S3 bucket, check its state, and try to re-apply the old state if it changes.
But this introduces some familiar problems. We have to write code again, and maintain a tool. When we run the tool, sometimes it doesn't work or causes unknown bugs, and we have to fix it while the system is broken. And what's worse is, Terraform won't try to apply the correct System state if it's been changed by anything other than Terraform itself. So it's actually less useful than the older CM tools at doing what we want: fixing the System's state.
Most Cloud services today are not immutable, the tools we have to deal with this are worse than CM tools, and only the developers of Cloud systems can solve this dilemma. They must design their Systems to enable Immutable operations, as they do on VMs, Containers, etc. Until then, we will have the same problems we had with CM, just with different names.
In the meantime, we can attempt to move away from tools like Terraform. We can move back to libraries of common code that perform specific functions on multiple platforms, the way CM tools once did, and fix System state regardless of how it was changed.
The tools we need mostly don't exist yet. They require a lot of engineering power, and it's very hard to move an entire industry in the right direction. Hopefully the Cloud Computing industry can learn the right lessons and move away from stateful, difficult to manage services, for the benefit of its users and society in general.