CloudNative

GCP services DevOps engineers rely on

I have spent the better part of three years wrestling with Google Cloud Platform, and I am still not entirely convinced it wasn’t designed by a group of very clever people who occasionally enjoy a quiet laugh at the rest of us. The thing about GCP, you see, is that it works beautifully right up until the moment it doesn’t. Then it fails with such spectacular and Byzantine complexity that you find yourself questioning not just your career choices but the fundamental nature of causality itself.

My first encounter with Cloud Build was typical of this experience. I had been tasked with setting up a CI/CD pipeline for a microservices architecture, which is the modern equivalent of being told to build a Swiss watch while someone steadily drops marbles on your head. Jenkins had been our previous solution, a venerable old thing that huffed and puffed like a steam locomotive and required more maintenance than a Victorian greenhouse. Cloud Build promised to handle everything serverlessly, which is a word that sounds like it ought to mean something, but in practice simply indicates you won’t know where your code is running and you certainly won’t be able to SSH into it when things go wrong.

The miracle, when it arrived, was decidedly understated. I pushed some poorly written Go code to a repository and watched as Cloud Build sprang into life like a sleeper agent receiving instructions. It ran my tests, built a container, scanned it for vulnerabilities, and pushed it to storage. The whole process took four minutes and cost less than a cup of tea. I sat there in my home office, the triumph slowly dawning, feeling rather like a man who has accidentally trained his cat to make coffee. I had done almost nothing, yet everything had happened. This is the essential GCP magic, and it is deeply unnerving.
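
For the record, the pipeline behind this minor miracle was barely a dozen lines of YAML. A trimmed-down sketch of a cloudbuild.yaml, with the builder images and registry path standing in as placeholders rather than our actual configuration:

steps:
  - id: test
    name: golang:1.22
    entrypoint: go
    args: ["test", "./..."]
  - id: build
    name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "europe-west3-docker.pkg.dev/$PROJECT_ID/apps/my-service:$SHORT_SHA", "."]
images:
  - europe-west3-docker.pkg.dev/$PROJECT_ID/apps/my-service:$SHORT_SHA

The $SHORT_SHA substitution is filled in automatically for triggered builds, which is why pushing a commit was all it took.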

The vulnerability scanner is particularly wonderful in that quietly horrifying way. It examines your containers and produces a list of everything that could possibly go wrong, like a pilot’s pre-flight checklist written by a paranoid witchfinder general. On one memorable occasion, it flagged a critical vulnerability in a library I wasn’t even aware we were using. It turned out to be nested seven dependencies deep, like a Russian doll of potential misery. Fixing it required updating something else, which broke something else, which eventually led me to discover that our entire authentication layer was held together by a library last maintained in 2018 by someone who had subsequently moved to a commune in Oregon. The scanner was right, of course. It always is. It is the most anxious and accurate employee you will ever meet.

Google Kubernetes Engine, or how I learned to stop worrying and love the cluster

If Cloud Build is the efficient butler, GKE is the robot overlord you find yourself oddly grateful to. My initial experience with Kubernetes was self-managed, which taught me many things, primarily that I do not have the temperament to manage Kubernetes. I spent weeks tuning etcd, debugging network overlays, and developing what I can only describe as a personal relationship with a persistent volume that refused to mount. It was less a technical exercise and more a form of digitally enhanced psychotherapy.

GKE’s Autopilot mode sidesteps all this by simply making the nodes disappear. You do not manage nodes. You do not upgrade nodes. You do not even, strictly speaking, know where the nodes are. They exist in the same conceptual space as socks that vanish from laundry cycles. You request resources, and they materialise, like summoning a very specific and obliging genie. The first time I enabled Autopilot, I felt I was cheating somehow, as if I had been given the answers to an exam I had not revised for.

The real genius is Workload Identity, a feature that allows pods to access Google services without storing secrets. Before this, secret management was a dark art involving base64 encoding and whispered incantations. We kept our API keys in Kubernetes secrets, which is rather like keeping your house keys under the doormat and hoping burglars are too polite to look there. Workload Identity removes all this by using magic, or possibly certificates, which are essentially the same thing in cloud computing. I demonstrated it to our security team, and their reaction was instructive. They smiled, which security people never do, and then they asked me to prove it was actually secure, which took another three days and several diagrams involving stick figures.
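
The wiring, for what it is worth, is roughly two commands and an annotation, assuming the cluster already has Workload Identity enabled and both service accounts exist. Project, namespace, and account names here are invented:

# Allow the Kubernetes service account to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding app-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/app-ksa]"

# Tell GKE which Google identity the pod's service account maps to
kubectl annotate serviceaccount app-ksa --namespace my-namespace \
  iam.gke.io/gcp-service-account=app-gsa@my-project.iam.gserviceaccount.com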

Istio integration completes the picture, though calling it integration suggests a gentle handshake when it is more like being embraced by a very enthusiastic octopus. It gives you observability, security, and traffic management at the cost of considerable complexity and a mild feeling that you have lost control of your own architecture. Our first Istio deployment doubled our pod count and introduced latency that made our application feel like it was wading through treacle. Tuning it took weeks and required someone with a master’s degree in distributed systems and the patience of a saint. When it finally worked, it was magnificent. Requests flowed like water, security policies enforced themselves with silent efficiency, and I felt like a man who had tamed a tiger through sheer persistence and a lot of treats.

Cloud Deploy and the gentle art of not breaking everything

Progressive delivery sounds like something a management consultant would propose during a particularly expensive lunch, but Cloud Deploy makes it almost sensible. The service orchestrates rollouts across environments with strategies like canary and blue-green, which are named after birds and colours because naming things is hard, and DevOps engineers have a certain whimsical desperation about them.

My first successful canary deployment felt like performing surgery on a patient who was also the anaesthetist. We routed 5 percent of traffic to the new version and watched our metrics like nervous parents at a school play. When errors spiked, I expected a frantic rollback procedure involving SSH and tarballs. Instead, I clicked a button, and everything reverted in thirty seconds. The old version simply reappeared, fully formed, like a magic trick performed by someone who actually understands magic. I walked around the office for the rest of the day with what my colleagues described as a smug grin, though I prefer to think of it as the justified expression of someone who has witnessed a minor miracle.
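
For the curious, the canary is just pipeline configuration. A trimmed-down delivery pipeline, with invented target and service names and percentages that will certainly differ from yours, looks roughly like this:

apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: my-service-pipeline
serialPipeline:
  stages:
    - targetId: staging
    - targetId: production
      strategy:
        canary:
          runtimeConfig:
            kubernetes:
              serviceNetworking:
                service: my-service
                deployment: my-service
          canaryDeployment:
            percentages: [5, 25, 50]
            verify: false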

The integration with Cloud Build creates a pipeline so smooth it is almost suspicious. Code commits trigger builds, builds trigger deployments, deployments trigger monitoring alerts, and alerts trigger automated rollbacks. It is a closed loop, a perpetual motion machine of software delivery. I once watched this entire chain execute while I was making a sandwich. By the time I had finished my ham and pickle on rye, a critical bug had been introduced, detected, and removed from production without any human intervention. I was simultaneously impressed and vaguely concerned about my own obsolescence.

Artifact Registry, where containers go to mature

Storing artifacts used to involve a self-hosted Nexus repository that required weekly sacrifices of disk space and RAM. Artifact Registry is Google’s answer to this, a fully managed service that stores Docker images, Helm charts, and language packages with the solemnity of a wine cellar for code.

The vulnerability scanning here is particularly thorough, examining every layer of your container with the obsessive attention of someone who alphabetises their spice rack. It once flagged a high-severity issue in a base image we had been using for six months. The vulnerability allowed arbitrary code execution, which is the digital equivalent of leaving your front door open with a sign saying “Free laptops inside.” We had to rebuild and redeploy forty services in two days. The scanner, naturally, had known about this all along but had been politely waiting for us to notice.

Geo-replication is another feature that seems obvious until you need it. Our New Zealand team was pulling images from a European registry, which meant every deployment involved sending gigabytes of data halfway around the world. This worked about as well as shouting instructions across a rugby field during a storm. Moving to a regional registry in New Zealand cut our deployment times by half and our egress fees by a third. It also taught me that cloud networking operates on principles that are part physics, part economics, and part black magic.
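
Creating a repository closer to the team is a one-liner; the region and repository names below are placeholders rather than our actual layout:

gcloud artifacts repositories create team-images \
  --repository-format=docker \
  --location=australia-southeast1 \
  --description="Docker images served close to the Pacific"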

Cloud Operations Suite, or how I learned to love the machine that watches me

Observability in GCP is orchestrated by the Cloud Operations Suite, formerly known as Stackdriver. The rebranding was presumably because Stackdriver sounded too much like a dating app for developers, which is a missed opportunity if you ask me.

The suite unifies logs, metrics, traces, and dashboards into a single interface that is both comprehensive and bewildering. The first time I opened Cloud Monitoring, I was presented with more graphs than a hedge fund’s annual report. CPU, memory, network throughput, disk IOPS, custom metrics, uptime checks, and SLO burn rates. It was beautiful and terrifying, like watching the inner workings of a living organism that you have created but do not fully understand.

Setting up SLOs felt like writing a promise to my future self. “I, a DevOps engineer of sound mind, do hereby commit to maintaining 99.9 percent availability.” The system then watches your service like a particularly judgmental deity and alerts you the moment you transgress. I once received a burn rate alert at 2 AM because a pod had been slightly slow for ten minutes. I lay in bed, staring at my phone, wondering whether to fix it or simply accept that perfection was unattainable and go back to sleep. I fixed it, of course. We always do.

The integration with BigQuery for long-term analysis is where things get properly clever. We export all our logs and run SQL queries to find patterns. This is essentially data archaeology, sifting through digital sediment to understand why something broke three weeks ago. I discovered that our highest error rates always occurred on Tuesdays between 2 and 3 PM. Why? A scheduled job that had been deprecated but never removed, running on a server everyone had forgotten about. Finding it felt like discovering a Roman coin in your garden, exciting but also slightly embarrassing that you hadn’t noticed it before.
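
The export itself is a log sink. Something like the following, with a hypothetical project and dataset, routes matching entries into BigQuery for later archaeology:

gcloud logging sinks create error-archaeology \
  bigquery.googleapis.com/projects/my-project/datasets/ops_logs \
  --log-filter='severity>=ERROR'

The sink then hands you a writer identity that needs permission on the dataset, which is exactly the kind of detail you discover after the first export silently fails.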

Cloud Monitoring and Logging, the digital equivalent of a nervous system

Cloud Logging centralises petabytes of data from services that generate logs with the enthusiasm of a teenager documenting their lunch. Querying this data feels like using a search engine that actually works, which is disconcerting when you’re used to grep and prayer.
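
The disconcertingly effective search is also available from the terminal, for those of us who cannot let go of the command line. A sketch, with the filter as an example rather than a recommendation:

gcloud logging read 'resource.type="k8s_container" AND severity>=ERROR' \
  --freshness=1h \
  --limit=20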

I once spent an afternoon tracking down a memory leak using Cloud Profiler, a service that shows you exactly where your code is being wasteful with RAM. It highlighted a function that was allocating memory like a government department allocates paper clips, with cheerful abandon and no regard for consequences. The function turned out to be logging entire database responses for debugging purposes, in production, for six months. We had archived more debug data than actual business data. The developer responsible, when confronted, simply shrugged and said it had seemed like a good idea at the time. This is the eternal DevOps tragedy. Everything seems like a good idea at the time.

Uptime checks are another small miracle. We have probes hitting our endpoints from locations around the world, like a global network of extremely polite bouncers constantly asking, “Are you open?” When Mumbai couldn’t reach our service but London could, it led us to discover a regional DNS issue that would have taken days to diagnose otherwise. The probes had saved us, and they had done so without complaining once, which is more than can be said for the on-call engineer who had to explain it to management at 6 AM.

Cloud Functions and Cloud Run, where code goes to hide

Serverless computing in GCP comes in two flavours. Cloud Functions are for small, event-driven scripts, like having a very eager intern who only works when you clap. Cloud Run is for containerised applications that scale to zero, which is an economical way of saying they disappear when nobody needs them and materialise when they do, like an introverted ghost.

I use Cloud Functions for automation tasks that would otherwise require cron jobs on a VM that someone has to maintain. One function resizes GKE clusters based on Cloud Monitoring alerts. When CPU utilisation exceeds 80 percent for five minutes, the function spins up additional nodes. When it drops below 20 percent, it scales down. This is brilliant until you realise you’ve created a feedback loop and the cluster is now oscillating between one node and one hundred nodes every ten minutes. Tuning the thresholds took longer than writing the function, which is the serverless way.
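
The plumbing, roughly: the alerting policy notifies a Pub/Sub topic, and the function listens to that topic. Deploying such a function looks something like this, with names, runtime, and region as placeholders rather than our production values:

gcloud functions deploy scale-on-alert \
  --gen2 \
  --runtime=go122 \
  --region=europe-west3 \
  --entry-point=HandleAlert \
  --trigger-topic=monitoring-alerts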

Cloud Run hosts our internal tools, the dashboards, and debug interfaces that developers need but nobody wants to provision infrastructure for. Deploying is gloriously simple. You push a container, it runs. The cold start time is sub-second, which means Google has solved a problem that Lambda users have been complaining about for years, presumably by bargaining with physics itself. I once deployed a debugging tool during an incident response. It was live before the engineer who requested it had finished describing what they needed. Their expression was that of someone who had asked for a coffee and been given a flying saucer.
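
Deploying one of those internal tools really is a single command. Service and image names here are invented:

gcloud run deploy debug-dashboard \
  --image=europe-west3-docker.pkg.dev/my-project/apps/debug-dashboard:latest \
  --region=europe-west3 \
  --no-allow-unauthenticated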

Terraform and Cloud Deployment Manager, or arguing with machines about infrastructure

Infrastructure as Code is the principle that you should be able to rebuild your entire environment from a text file, which is lovely in theory and slightly terrifying in practice. Terraform, using the GCP provider, is the de facto standard. It is also a source of endless frustration and occasional joy.

The state file is the heart of the problem. It is a JSON representation of your infrastructure that Terraform keeps in Cloud Storage, and it is the single source of truth until someone deletes it by accident, at which point the truth becomes rather more philosophical. We lock the state during applies, which prevents conflicts but also means that if an apply hangs, everyone is blocked. I have spent afternoons staring at a terminal, watching Terraform ponder the nature of a load balancer, like a stoned philosophy student contemplating a spoon.
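
The state lives in a bucket, and the locking comes free with the backend. A minimal backend block, bucket name invented:

terraform {
  backend "gcs" {
    bucket = "my-org-terraform-state"   # a versioned bucket, one prefix per environment
    prefix = "prod"
  }
}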

Deployment Manager is Google’s native IaC tool, which uses YAML and is therefore slightly less powerful but considerably easier to read. I use it for simple projects where Terraform would be like using a sledgehammer to crack a nut, if the sledgehammer required you to understand graph theory. The two tools coexist uneasily, like cats who tolerate each other for the sake of the humans.

Drift detection is where things get properly philosophical. Terraform tells you when reality has diverged from your code, which happens more often than you’d think. Someone clicks something in the console, a service account is modified, a firewall rule is added for “just a quick test.” The plan output shows these changes like a disappointed teacher marking homework in red pen. You can either apply the correction or accept that your infrastructure has developed a life of its own and is now making decisions independently. Sometimes I let the drift stand, just to see what happens. This is how accidents become features.
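
We run a scheduled plan purely to catch drift; the exit code tells you whether reality has wandered off:

terraform plan -detailed-exitcode -out=drift.plan
# exit code 0: no changes, 1: error, 2: drift detected, somebody has been clicking in the console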

IAM and Cloud Asset Inventory, the endless game of who can do what

Identity and Access Management in GCP is both comprehensive and maddening. Every API call is authenticated and authorised, which is excellent for security but means you spend half your life granting permissions to service accounts. A service account, for the uninitiated, is a machine pretending to be a person so it can ask Google for things. They are like employees who never take a holiday but also never buy you a birthday card.

Workload Identity Federation allows these synthetic employees to impersonate each other across clouds, which is identity management crossed with method acting. We use it to let our AWS workloads access GCP resources, a process that feels rather like introducing two friends who are suspicious of each other and speak different languages. When it works, it is seamless. When it fails, the error messages are so cryptic they may as well be in Linear B.

Cloud Asset Inventory catalogs every resource in your organisation, which is invaluable for audits and deeply unsettling when you realise just how many things you’ve created and forgotten about. I once ran a report and discovered seventeen unused load balancers, three buckets full of logs from a project that ended in 2023, and a Cloud SQL instance that had been running for six months with no connections. The bill was modest, but the sense of waste was profound. I felt like a hoarder being confronted with their own clutter.
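
The report in question is a single command away, assuming you have the right scope. The organisation number and asset types below are illustrative:

gcloud asset search-all-resources \
  --scope=organizations/123456789012 \
  --asset-types=compute.googleapis.com/ForwardingRule,sqladmin.googleapis.com/Instance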

For European enterprises, the GDPR compliance features are critical. We export audit logs to BigQuery and run queries to prove data residency. The auditors, when they arrived, were suspicious of everything, which is their job. They asked for proof that data never left the europe-west3 region. I showed them VPC Service Controls, which are like digital border guards that shoot packets trying to cross geographical boundaries. They seemed satisfied, though one of them asked me to explain Kubernetes, and I saw his eyes glaze over in the first thirty seconds. Some concepts are simply too abstract for mortal minds.

Eventarc and Cloud Scheduler, the nervous system of the cloud

Eventarc routes events from over 100 sources to your serverless functions, creating event-driven architectures that are both elegant and impossible to debug. An event is a notification that something happened, somewhere, and now something else should happen somewhere else. It is causality at a distance, action at a remove.

I have an Eventarc trigger that fires when a vulnerability is found, sending a message to Pub/Sub, which fans out to multiple subscribers. One subscriber posts to Slack, another creates a ticket, and a third quarantines the image. It is a beautiful, asynchronous ballet that I cannot fully trace. When it fails, it fails silently, like a mime having a heart attack. The dead-letter queue catches the casualties, which I check weekly like a coroner reviewing unexplained deaths.

Cloud Scheduler handles cron jobs, which are the digital equivalent of remembering to take the bins out. We have schedules that scale down non-production environments at night, saving money and carbon. I once set the timezone incorrectly and scaled down the production cluster at midday. The outage lasted three minutes, but the shame lasted considerably longer. The team now calls me “the scheduler whisperer,” which is not the compliment it sounds like.
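
The fix, for what it is worth, is to always set the time zone explicitly. A hypothetical scale-down job, with the URL and service account invented:

gcloud scheduler jobs create http scale-down-nonprod \
  --schedule="0 20 * * 1-5" \
  --time-zone="Europe/Berlin" \
  --uri="https://europe-west3-my-project.cloudfunctions.net/scale-down" \
  --oidc-service-account-email=scheduler@my-project.iam.gserviceaccount.com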

The real power comes from chaining these services. A Monitoring alert publishes an event, Eventarc routes it to a Cloud Function, Cloud Scheduler runs a follow-up check a little later, and that check triggers another function to remediate. It is a Rube Goldberg machine built of code, more complex than it needs to be, but weirdly satisfying when it works. I have built systems that heal themselves, which is either the pinnacle of DevOps achievement or the first step towards Skynet. I prefer to think it is the former.

The map we all pretend to understand

Every DevOps journey, no matter how anecdotal, eventually requires what consultants call a “high-level architecture overview” and what I call a desperate attempt to comprehend the incomprehensible. During my second year on GCP, I created exactly such a diagram to explain to our CFO why we were spending $47,000 a month on something called “Cross-Regional Egress.” The CFO remained unmoved, but the diagram became my Rosetta Stone for navigating the platform’s ten core services.

I’ve reproduced it here partly because I spent three entire afternoons aligning boxes in Lucidchart, and partly because even the most narrative-driven among us occasionally needs to see the forest’s edge while wandering through the trees. Consider it the technical appendix you can safely ignore, unless you’re the poor soul actually implementing any of this.

There it is, in all its tabular glory. Five rows that represent roughly fifteen thousand hours of human effort, and at least three separate incidents involving accidentally deleted production namespaces. The arrows are neat and tidy, which is more than can be said for any actual implementation.

I keep a laminated copy taped to my monitor, not because I consult it (I have the contents memorised, along with the scars that accompany each service), but because it serves as a reminder that even the most chaotic systems can be reduced to something that looks orderly on PowerPoint. The real magic lives in the gaps between those tidy boxes, where service accounts mysteriously expire, where network policies behave like quantum particles, and where the monthly bill arrives with numbers that seem generated by a random number generator with a grudge.

A modest proposal for surviving GCP

That table represents the map. What follows is the territory, with all its muddy bootprints and unexpected cliffs.

After three years, I have learned that the best DevOps engineers are not the ones with the most certificates. They are the ones who have learned to read the runes, who know which logs matter and which can be ignored, who have developed an intuitive sense for when a deployment is about to fail and can smell a misconfigured IAM binding at fifty paces. They are part sysadmin, part detective, part wizard.

The platform makes many things possible, but it does not make them easy. It is infrastructure for grown-ups, which is to say it trusts you to make expensive mistakes and learn from them. My advice is to start small, automate everything, and keep a sense of humour. You will need it the first time you accidentally delete a production bucket and discover that the undo button is marked “open a support ticket and wait.”

Store your manifests in Git and let Cloud Deploy handle the applying. Define SLOs and let the machines judge you. Tag resources for cost allocation and prepare to be horrified by the results. Replicate artifacts across regions because the internet is not as reliable as we pretend. And above all, remember that the cloud is not magic. It is simply other people’s computers running other people’s code, orchestrated by APIs that are occasionally documented and frequently misunderstood.

We build on these foundations because they let us move faster, scale further, and sleep slightly better at night, knowing that somewhere in a data centre in Belgium, a robot is watching our servers and will wake us only if things get truly interesting.

That is the theory, anyway. In practice, I still keep my phone on loud, just in case.

Kubernetes leases, or the art of waiting for the bathroom

If you looked inside a running Kubernetes cluster with a microscope, you would not see a perfectly choreographed ballet of binary code. You would see a frantic, crowded open-plan office staffed by thousands of employees who have consumed dangerous amounts of espresso. You have schedulers, controllers, and kubelets all sprinting around, frantically trying to update databases and move containers without crashing into each other.

It is a miracle that the whole thing does not collapse into a pile of digital rubble within seconds. Most human organizations of this size descend into bureaucratic infighting before lunch. Yet, somehow, Kubernetes keeps this digital circus from turning into a riot.

You might assume that the mechanism preventing this chaos is a highly sophisticated, cryptographic algorithm forged in the fires of advanced mathematics. It is not. The thing that keeps your cluster from eating itself is the distributed systems equivalent of a sticky note on a door. It is called a Lease.

And without this primitive, slightly passive-aggressive little object, your entire cloud infrastructure would descend into anarchy faster than you can type kubectl delete namespace.

The sticky note of power

To understand why a Lease is necessary, we have to look at the psychology of a Kubernetes controller. These components are, by design, incredibly anxious. They want to ensure that the desired state of the world matches the actual state.

The problem arises when you want high availability. You cannot just have one controller running because if it dies, your cluster stops working. So you run three replicas. But now you have a new problem. If all three replicas try to update the same routing table or create the same pod at the exact same moment, you get a “split-brain” scenario. This is the technical term for a psychiatric emergency where the left hand deletes what the right hand just created.

Kubernetes solves this with the Lease object. Technically, it is an API resource in the coordination.k8s.io group. Spiritually, it is a “Do Not Disturb” sign hung on a doorknob.

If you look at the YAML definition of a Lease, it is almost insultingly simple. It does not ask for a security clearance or a biometric scan. It essentially asks three questions:

  1. HolderIdentity: Who are you?
  2. LeaseDurationSeconds: How long are you going to be in there?
  3. RenewTime: When was the last time you shouted that you are still alive?

Here is what one looks like in the wild:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cluster-coordination-lock
  namespace: kube-system
spec:
  holderIdentity: "controller-pod-beta-09"
  leaseDurationSeconds: 15
  renewTime: "2023-10-27T10:04:05.000000Z"

In plain English, this document says: “Controller Beta-09 is holding the steering wheel. It has fifteen seconds to prove it has not died of a heart attack. If it stays silent for sixteen seconds, we are legally allowed to pry the wheel from its cold, dead fingers.”

An awkward social experiment

To really grasp the beauty of this system, we need to leave the server room and enter a shared apartment with a terrible design flaw. There is only one bathroom, the lock is broken, and there are five roommates who all drank too much water.

The bathroom is the “critical resource.” In a computerized world without Leases, everyone would just barge in whenever they felt the urge. This leads to what engineers call a “race condition” and what normal people call “an extremely embarrassing encounter.”

Since we cannot fix the lock, we install a whiteboard on the door. This is the Lease.

The rules of this apartment are strict but effective. When you walk up to the door, you write your name and the current time on the board. You have now acquired the lock. As long as your name is there and the timestamp is fresh, the other roommates will stand in the hallway, crossing their legs and waiting politely.

But here is where it gets stressful. You cannot just write your name and fall asleep in the tub. The system requires constant anxiety. Every few seconds, you have to crack the door open, reach out with a marker, and update the timestamp. This is the “heartbeat.” It tells the people waiting outside that you are still conscious and haven’t slipped in the shower.

If you faint, or if the WiFi cuts out and you cannot reach the whiteboard, you stop updating the time. The roommates outside watch the clock. Ten seconds pass. Fifteen seconds. At sixteen seconds, they do not knock to see if you are okay. They assume you are gone forever, wipe your name off the board, write their own, and barge in.

It is ruthless, but it ensures that the bathroom is never left empty just because the previous occupant vanished into the void.

The paranoia of leader election

The most critical use of this bathroom logic is something called Leader Election. This is the mechanism that keeps your kube-controller-manager and kube-scheduler from turning into a bar fight.

You typically run multiple copies of these control plane components for redundancy. However, you absolutely cannot have five different schedulers trying to assign the same pod to five different nodes simultaneously. That would be like having five conductors trying to lead the same orchestra. You do not get music; you get noise and a lot of angry musicians.

So, the replicas hold an election. But it is not a democratic vote with speeches and ballots. It is a race to grab the marker.

The moment the controllers start up, they all rush toward the Lease object. The first one to write its name in the holderIdentity field becomes the Leader. The others, the candidates, do not go home. They stand in the corner, staring at the Lease, refreshing the page every two seconds, waiting for the Leader to fail.

There is something deeply human about this setup. The backup replicas are not “supporting” the leader. They are jealous understudies watching the lead actor, hoping he breaks a leg so they can take center stage.

If the Leader crashes or simply gets stuck in a network traffic jam, the renewTime stops updating. The lease expires. Immediately, the backups scramble to write their own name. The winner takes over the cluster duties instantly. It is seamless, automated, and driven entirely by the assumption that everyone else is unreliable.

Reducing the noise pollution

In the early days of Kubernetes, things were even messier. Nodes, the servers doing the actual work, had to prove they were alive by sending a massive status report to the API server every few seconds.

Imagine a receptionist who has to process a ten-page medical history form from every single employee every ten seconds, just to confirm they are at their desks. It was exhausting. The API server spent so much time reading these reports that it barely had time to do anything else.

Today, Kubernetes uses Leases for node heartbeats, too. Instead of the full medical report, the node just updates a Lease object. It is a quick, lightweight ping.

“I’m here.”

“Good.”

“Still here.”

“Great.”

This change reduced the computational cost of staying alive significantly. The API server no longer needs to know your blood pressure and cholesterol levels every ten seconds; it just needs to know you are breathing. It turns a bureaucratic nightmare into a simple check-in.
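
You can watch these check-ins happen. Node heartbeat leases live in their own namespace, one lease per node:

kubectl get leases -n kube-node-lease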

How to play with fire

The beauty of the Lease system is that it is just a standard Kubernetes object. You can see these invisible sticky notes right now. If you list the leases in the system namespace, you will see the invisible machinery that keeps the lights on:

kubectl get leases -n kube-system

You will see entries for the controller manager, the scheduler, and probably one for every node in your cluster. If you want to see who the current boss is, you can describe the lease:

kubectl describe lease kube-scheduler -n kube-system

You will see the holderIdentity. That is the name of the replica currently running the show.

Now, if you are feeling particularly chaotic, or if you just want to see the world burn, you can delete a Lease manually.

kubectl delete lease kube-scheduler -n kube-system

Please do not do this in production unless you enjoy panic attacks.

Deleting an active Lease is like ripping the “Occupied” sign off the bathroom door while someone is inside. You are effectively lying to the system. You are telling the backup controllers, “The leader is dead! Long live the new leader!”

The backups will rush in and elect a new leader. But the old leader, who was effectively just sitting there minding its own business, is still running. Suddenly, it realizes it has been fired without notice. Ideally, it steps down gracefully. But in the split second before it realizes what happened, you might have two controllers giving orders.

The system will heal itself, usually within seconds, but those few seconds are a period of profound confusion for everyone involved.

The survival of the loudest

Leases are the unsung heroes of the cloud native world. We like to talk about Service Meshes and eBPF and other shiny, complex technologies. But at the bottom of the stack, keeping the whole thing from exploding, is a mechanism as simple as a name on a whiteboard.

It works because it accepts a fundamental truth about distributed systems: nothing is reliable, everyone is going to crash eventually, and the only way to maintain order is to force components to shout “I am alive!” every few seconds.

Next time your cluster survives a node failure or a controller restart without you even noticing, spare a thought for the humble Lease. It is out there in the void, frantically renewing timestamps, protecting you from the chaos of a split-brain scenario. And that is frankly better than a lock on a bathroom door any day.

Docker didn’t die, it just moved to your laptop

Docker used to be the answer you gave when someone asked, “How do we ship this thing?” Now it’s more often the answer to a different question, “How do I run this thing locally without turning my laptop into a science fair project?”

That shift is not a tragedy. It’s not even a breakup. It’s more like Docker moved out of the busy downtown apartment called “production” and into a cozy suburb called “developer experience”, where the lawns are tidy, the tools are friendly, and nobody panics if you restart everything three times before lunch.

This article is about what changed, why it changed, and why Docker is still very much worth knowing, even if your production clusters rarely whisper its name anymore.

What we mean when we say Docker

One reason this topic gets messy is that “Docker” is a single word used to describe several different things, and those things have very different jobs.

  • Docker Desktop is the product that many developers actually interact with day to day, especially on macOS and Windows.
  • Docker Engine and the Docker daemon are the background machinery that runs containers on a host.
  • The Docker CLI and Dockerfile workflow are the human-friendly interface and the packaging format that people have built habits around.

When someone says “Docker is dying,” they usually mean “Docker Engine is no longer the default runtime in production platforms.” When someone says “Docker is everywhere,” they often mean “Docker Desktop and Dockerfile workflows are still the easiest way to get a containerized dev environment running quickly.”

Both statements can be true at the same time, which is annoying, because humans prefer their opinions to come in single-serving packages.

Docker’s rise and the good kind of magic

Docker didn’t become popular because it invented containers. Containers existed before Docker. Docker became popular because it made containers feel approachable.

It offered a developer experience that felt like a small miracle:

  • You could build images with a straightforward command.
  • You could run containers without a small dissertation on Linux namespaces.
  • You could push to registries and share a runnable artifact.
  • You could spin up multi-service environments with Docker Compose.

Docker took something that used to feel like “advanced systems programming” and turned it into “a thing you can demo on a Tuesday.”

If you were around for the era of XAMPP, WAMP, and “download this zip file, then pray,” Docker felt like a modern version of that, except it didn’t break as soon as you looked at it funny.

The plot twist in production

Here is the part where the story becomes less romantic.

Production infrastructure grew up.

Not emotionally, obviously. Infrastructure does not have feelings. It has outages. But it did mature in a very specific way: platforms started to standardize around container runtimes and interfaces that did not require Docker’s full bundled experience.

Docker was the friendly all-in-one kitchen appliance. Production systems wanted an industrial kitchen with separate appliances, separate controls, and fewer surprises.

Three forces accelerated the shift.

Licensing concerns changed the mood

Docker Desktop licensing changes made a lot of companies pause, not because engineers suddenly hated Docker, but because legal teams developed a new hobby.

The typical sequence went like this:

  1. Someone in finance asked, “How many Docker Desktop users do we have?”
  2. Someone in legal asked, “What exactly are we paying for?”
  3. Someone in infrastructure said, “We can probably do this with Podman or nerdctl.”

A tool can survive engineers complaining about it. Engineers complain about everything. The real danger is when procurement turns your favorite tool into a spreadsheet with a red cell.

The result was predictable: even developers who loved Docker started exploring alternatives, if only to reduce risk and friction.
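
To be fair to the engineers in step three, the switch is often undramatic for everyday local use, because Podman deliberately mirrors the Docker CLI. A two-line experiment, assuming Podman is installed and nothing depends on the Docker socket:

alias docker=podman
docker run --rm -it alpine:3.20 sh   # same muscle memory, different engine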

The runtime world standardized without Docker

Modern container platforms increasingly rely on runtimes like containerd and interfaces like the Container Runtime Interface (CRI).

Kubernetes is a key example. It removed the dockershim, the direct Docker integration path that many people depended on in earlier years, and the ecosystem moved toward CRI-native runtimes. The point was not to “ban Docker.” The point was to standardize around an interface designed specifically for orchestrators.

This is a subtle but important difference.

  • Docker is a complete experience: build, run, network, UX, opinions included.
  • Orchestrators prefer modular components, and they want to speak to a runtime through a stable interface.

The practical effect is what most teams feel today:

  • In many Kubernetes environments, the runtime is containerd, not Docker Engine.
  • Managed platforms such as ECS on Fargate and other orchestrated services often run containers without involving Docker at all.

Docker, the daemon, became optional.

Security teams like control, and they do not like surprises

Security teams do not wake up in the morning and ask, “How can I ruin a developer’s day?” They wake up and ask, “How can I make sure the host does not become a piñata full of root access?”

Docker can be perfectly secure when used well. The problem is that it can also be spectacularly insecure when used casually.

Two recurring issues show up in real organizations:

  • The Docker socket is powerful. Expose it carelessly, and you are effectively offering a fast lane to host-level control.
  • The classic pattern of “just give developers sudo docker” can become a horror story with a polite ticket number.

Tools and workflows that separate concerns tend to make security people calmer.

  • Build tools such as BuildKit and buildah isolate image creation.
  • Rootless approaches, where feasible, reduce blast radius.
  • Runtime components can be locked down and audited more granularly.

This is not about blaming Docker. It’s about organizations preferring a setup where the sharp knives are stored in a drawer, not taped to the ceiling.

What Docker is now

Docker’s new role is less “the thing that runs production” and more “the thing that makes local development less painful.”

And that role is huge.

Docker still shines in areas where convenience matters most:

  • Local development environments
  • Quick reproducible demos
  • Multi-service stacks on a laptop
  • Cross-platform consistency on macOS, Windows, and Linux
  • Teams that need a simple standard for “how do I run this?”

If your job is to onboard new engineers quickly, Docker is still one of the best ways to avoid the dreaded onboarding ritual where a senior engineer says, “It works on my machine,” and the junior engineer quietly wonders if their machine has offended someone.

A small example that still earns its keep

Here is a minimal Docker Compose stack that demonstrates why Docker remains lovable for local development.

services:
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://postgres:example@db:5432/app
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
      POSTGRES_DB: app
    ports:
      - "5432:5432"

This is not sophisticated. That is the point. It is the “plug it in and it works” power that made Docker famous.

Dockerfile is not the Docker daemon

This is where the confusion often peaks.

A Dockerfile is a packaging recipe. It is widely used. It remains a de facto standard, even when the runtime or build system is not Docker.

Many teams still write Dockerfiles, but build them using tooling that does not rely on the Docker daemon on the CI runner.

Here is a BuildKit example that builds and pushes an image without treating the Docker daemon as a requirement.

buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/app:latest,push=true

You can read this as “Dockerfile lives on, but Docker-as-a-daemon is no longer the main character.”

This separation matters because it changes how you design CI.

  • You can build images in environments where running a privileged Docker daemon is undesirable.
  • You can use builders that integrate better with Kubernetes or cloud-native pipelines.
  • You can reduce the amount of host-level power you hand out just to produce an artifact.

What replaced Docker in production pipelines

When teams say they are moving away from Docker in production, they rarely mean “we stopped using containers.” They mean the tooling around building and running containers is shifting.

Common patterns include:

  • containerd as the runtime in Kubernetes and other orchestrated environments
  • BuildKit for efficient builds and caching
  • kaniko for building images inside Kubernetes without a Docker daemon
  • ko for building and publishing Go applications as images without a Dockerfile
  • Buildpacks or Nixpacks for turning source code into runnable images using standardized build logic
  • Dagger and similar tools for defining CI pipelines that treat builds as portable graphs of steps

You do not need to use all of these. You just need to understand the trend.

Production platforms want:

  • Standard interfaces
  • Smaller, auditable components
  • Reduced privilege
  • Reproducible builds

Docker can participate in that world, but it no longer owns the whole stage.

A Kubernetes-friendly image build example

If you want a concrete example of the “no Docker daemon” approach, kaniko is a popular choice in cluster-native pipelines.

apiVersion: batch/v1
kind: Job
metadata:
  name: build-image-kaniko
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            - "--dockerfile=Dockerfile"
            - "--context=dir:///workspace"
            - "--destination=registry.example.com/app:latest"
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          emptyDir: {}

This is intentionally simplified. In a real setup, you would bring your own workspace, your own auth mechanism, and your own caching strategy. But even in this small example, the idea is visible: build the image where it makes sense, without turning every CI runner into a tiny Docker host.

The practical takeaway for architects and platform teams

If you are designing platforms, the question is not “Should we ban Docker?” The question is “Where does Docker add value, and where does it create unnecessary coupling?”

A simple mental model helps.

  • Developer laptops benefit from a friendly tool that makes local environments predictable.
  • CI systems benefit from builder choices that reduce privilege and improve caching.
  • Production runtimes benefit from standardized interfaces and minimal moving parts.

Docker tends to dominate the first category, participates in the second, and is increasingly optional in the third.

If your team still uses Docker Engine on production hosts, that is not automatically wrong. It might be perfectly fine. The important thing is that you are doing it intentionally, not because “that’s how we’ve always done it.”

Why this is actually a success story

There is a temptation in tech to treat every shift as a funeral.

But Docker moving toward local development is not a collapse. It is a sign that the ecosystem absorbed Docker’s best ideas and made them normal.

The standardization of OCI images, the popularity of Dockerfile workflows, and the expectations around reproducible environments: all of that is Docker’s legacy living in the walls.

Docker is still the tool you reach for when you want to:

  • start fast
  • teach someone new
  • run a realistic stack on a laptop
  • avoid spending your afternoon installing the same dependencies in three different ways

That is not “less important.” That is foundational.

If anything, Docker’s new role resembles a very specific kind of modern utility.

It is like Visual Studio Code.

Everyone uses it. Everyone argues about it. It is not what you deploy to production, but it is the thing that makes building and testing your work feel sane.

Docker didn’t die.

It just moved to your laptop, brought snacks, and quietly let production run the serious machinery without demanding to be invited to every meeting.

Your Multi-Region strategy is a fantasy

The recent failure showed us the truth: your data is stuck, and active-active failover is a fantasy for 99% of us. Here’s a pragmatic high-availability strategy that actually works.

Well, that was an intense week.

When the great AWS outage of October 2025 hit, I did what every senior IT person does: I grabbed my largest coffee mug, opened our monitoring dashboard, and settled in to watch the world burn. us-east-1, the internet’s stubbornly persistent center of gravity, was having what you’d call a very bad day.

And just like clockwork, as the post-mortems rolled in, the old, tired refrain started up on social media and in Slack: “This is why you must be multi-region.”

I’m going to tell you the truth that vendors, conference speakers, and that one overly enthusiastic junior dev on your team won’t. For 99% of companies, “multi-region” is a lie.

It’s an expensive, complex, and dangerous myth sold as a silver bullet. And the recent outage just proved it.

The “Just Be Multi-Region” fantasy

On paper, it sounds so simple. It’s a lullaby for VPs.

You just run your app in us-east-1 (Virginia) and us-west-2 (Oregon). You put a shiny global load balancer in front, and if Virginia decides to spontaneously become an underwater volcano, poof! All your traffic seamlessly fails over to Oregon. Zero downtime. The SREs are heroes. Champagne for everyone.

This is a fantasy.

It’s a fantasy that costs millions of dollars and lures development teams into a labyrinth of complexity they will never escape. I’ve spent my career building systems that need to stay online. I’ve sat in the planning meetings and priced out the “real” cost. Let me tell you, true active-active multi-region isn’t just “hard”; it’s a completely different class of engineering.

And it’s one that your company almost certainly doesn’t need.

The three killers of Multi-Region dreams

It’s not the application servers. Spinning up EC2 instances or containers in another region is the easy part. That’s what we have Infrastructure as Code for. Any intern can do that.

The problem isn’t the compute. The problem is, and always has been, the data.

Killer 1: Data has gravity, and it’s a jerk

This is the single most important concept in cloud architecture. Data has gravity.

Your application code is a PDF. It’s stateless and lightweight. You can email it, copy it, and run it anywhere. Your 10TB PostgreSQL database is not a PDF. It’s the 300-pound antique oak desk the computer is sitting on. You can’t just “seamlessly fail it over” to another continent.

To have a true seamless failover, your data must be available in the second region at the exact moment of the failure. This means you need synchronous, real-time replication across thousands of miles.

Guess what that does to your write performance? It’s like trying to have a conversation with someone on Mars. The latency of a round-trip from Virginia to Oregon adds hundreds of milliseconds to every single database write. The application becomes unusably slow. Every time a user clicks “save,” they have to wait for a photon to physically travel across the country and back. Your users will hate it.

“Okay,” you say, “we’ll use asynchronous replication!”

Great. Now when us-east-1 fails, you’ve lost the last 5 minutes of data. Every transaction, every new user sign-up, every shopping cart order. Vanished. You’ve traded a “Recovery Time” of zero for a “Data Loss” that is completely unacceptable. Go explain to the finance department that you purposefully designed a system that throws away the most recent customer orders. I’ll wait.

This is the trap. Your compute is portable; your data is anchored.

Killer 2: The astronomical cost

I was on a project once where the CTO, fresh from a vendor conference, wanted a full active-active multi-region setup. We scoped it.

Running 2x the servers was fine. The real cost was the inter-region data transfer.

AWS (and all cloud providers) charge an absolute fortune for data moving between their regions. It’s the “hotel minibar” of cloud services. Every single byte your database replicates, every log, every file transfer… cha-ching.

Our projected bill for the data replication and the specialized services (like Aurora Global Databases or DynamoDB Global Tables) was three times the cost of the entire rest of the infrastructure.

You are paying a massive premium for a fleet of servers, databases, and network gateways that are sitting idle 99.9% of the time. It’s like buying the world’s most expensive gym membership and only going once every five years to “test” it. It’s an insurance policy so expensive, you can’t afford the disaster it’s meant to protect you from.

Killer 3: The crushing complexity

A multi-region system isn’t just two copies of your app. It’s a brand new, highly complex, slightly psychotic distributed system that you now have to feed and care for.

You now have to solve problems you never even thought about:

  • Global DNS failover: How does Route 53 know a region is down? Health checks fail. But what if the health check itself fails? What if the health check thinks Virginia is fine, but it’s just hallucinating?
  • Data write conflicts: This is the fun part. What if a user in New York (writing to us-east-1) and a user in California (writing to us-west-2) update the same record at the same time? Welcome to the world of split-brain. Who wins? Nobody. You now have two “canonical” truths, and your database is having an existential crisis. Your job just went from “Cloud Architect” to “Data Therapist.”
  • Testing: How do you even test a full regional failover? Do you have a big red “Kill Virginia” button? Are you sure you know what will happen when you press it? On a Tuesday afternoon? I didn’t think so.

You haven’t just doubled your infrastructure; you’ve 10x’d your architectural complexity.

But we have Kubernetes because we are Cloud Native

This was my favorite part of the October 2025 outage.

I saw so many teams that thought Kubernetes would save them. They had their fancy federated K8s clusters spanning multiple regions, YAML files as far as the eye could see.

And they still went down.

Why? Because Kubernetes doesn’t solve data gravity!

Your K8s cluster in us-west-2 dutifully spun up all your application pods. They woke up, stretched, and immediately started screaming: “WHERE IS MY DISK?!”

Your persistent volumes (PVs) are backed by EBS or EFS. That ‘E’ stands for ‘Elastic,’ not ‘Extradimensional.’ That disk is physically, stubbornly, regionally attached to Virginia. Your pods in Oregon can’t mount a disk that lives 3,000 miles away.

Unless you’ve invested in another layer of incredibly complex, eye-wateringly expensive storage replication software, your “cloud-native” K8s cluster was just a collection of very expensive, very confused applications shouting into the void for a database that was currently offline.

A pragmatic high availability strategy that actually works

So if multi-region is a lie, what do we do? Just give up? Go home? Take up farming?

Yes. You accept some downtime.

You stop chasing the “five nines” (99.999%) myth and start being honest with the business. Your goal is not “zero downtime.” Your goal is a tested and predictable recovery.

Here is the sane strategy.

1. Embrace Multi-AZ (The real HA)

This is what AWS actually means by “high availability.” Run your application across multiple Availability Zones (AZs) within a single region. An AZ is a physically separate data center. us-east-1a and us-east-1b are miles apart, with different power and network.

This is like having a backup generator for your house. Multi-region is like building an identical, fully-furnished duplicate house in another city just in case a meteor hits your first one.

Use a Multi-AZ RDS instance. Use an Auto Scaling Group that spans AZs. This protects you from 99% of common failures: a server rack dying, a network switch failing, or a construction crew cutting a fiber line. This should be your default. It’s cheap, it’s easy, and it works.
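
In Terraform terms, the difference between a single, lonely database and a proper Multi-AZ one is usually a single argument. Names are invented, and credentials, networking, and backups are trimmed for brevity:

resource "aws_db_instance" "app_db" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.r5.large"
  allocated_storage = 100
  multi_az          = true   # synchronous standby in another AZ, automatic failover
  # credentials, networking, and backup settings omitted
}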

2. Focus on RTO and RPO

Stop talking about “nines” and start talking about two simple numbers:

  • RTO (Recovery Time Objective): How fast do we need to be back up?
  • RPO (Recovery Point Objective): How much data can we afford to lose?

Get a real answer from the business, not a fantasy. Is a 4-hour RTO and a 15-minute RPO acceptable? For almost everyone, the answer is yes.

3. Build a “Warm Standby” (The sane DR)

This is the strategy that actually works. It’s the “fire drill” plan, not the “build a duplicate city” plan.

  • Infrastructure: Your entire infrastructure is defined in Terraform or CloudFormation. You can rebuild it from scratch in any region with a single command.
  • Data: You take regular snapshots of your database (e.g., every 15 minutes) and automatically copy them to your disaster recovery region (us-west-2).
  • The plan: When us-east-1 dies, you declare a disaster. The on-call engineer runs the “Deploy-to-DR” script.

Here’s a taste of what that “sane” infrastructure-as-code looks like. You’re not paying for two of everything. You’re paying for a blueprint and a backup.

# main.tf (in your primary region module)
# This is just a normal server
resource "aws_instance" "app_server" {
  count         = 3 # Your normal production count
  ami           = "ami-0abcdef123456"
  instance_type = "t3.large"
  # ... other config
}

# dr.tf (in your DR region module)
# This server doesn't even exist... until you need it.
resource "aws_instance" "dr_app_server" {
  # This is the magic.
  # This resource is "off" by default (count = 0).
  # You flip one variable (is_disaster = true) to build it.
  count         = var.is_disaster ? 3 : 0
  provider      = aws.dr_region # Pointing to us-west-2
  ami           = "ami-0abcdef123456" # Same AMI
  instance_type = "t3.large"
  # ... other config
}

resource "aws_db_instance" "dr_database" {
  count                   = var.is_disaster ? 1 : 0
  provider                = aws.dr_region
  
  # Here it is: the new DB is restored from the
  # latest snapshot you've been copying over.
  snapshot_identifier     = var.latest_db_snapshot_id
  
  instance_class          = "db.r5.large"
  # ... other config
}

You flip a single DNS record in Route 53 to point all traffic to the new load balancer in us-west-2.

Yes, you have downtime (your RTO of 2–4 hours). Yes, you might lose 15 minutes of data (your RPO).

But here’s the beautiful part: it actually works, it’s testable, and it costs a tiny fraction of an active-active setup.

The AWS outage in October 2025 wasn’t a lesson in the need for multi-region. It was a global, public, costly lesson in humility. It was a reminder to stop chasing mythical architectures that look good on a conference whiteboard and focus on building resilient, recoverable systems.

So, stop feeling guilty because your setup doesn’t span three continents. You’re not lazy; you’re pragmatic. You’re the sane one in a room full of people passionately arguing about the best way to build a teleporter for that 300-pound antique oak desk.

Let them have their complex, split-brain, data-therapy sessions. You’ve chosen a boring, reliable, testable “warm standby.” You’ve chosen to get some sleep.

The slow, unceremonious death of EC2 Auto Scaling

Let’s pour one out for an old friend.

AWS recently announced a small, seemingly boring new feature for EC2 Auto Scaling: the ability to cancel a pending instance refresh. If you squinted, you might have missed it. It sounds like a minor quality-of-life update, something to make a sysadmin’s Tuesday slightly less terrible.

But this isn’t a feature. It’s a gold watch. It’s the pat on the back and the “thanks for your service” speech at the awkward retirement party.

The EC2 Auto Scaling Group (ASG), the bedrock of cloud elasticity, the one tool we all reflexively reached for, is being quietly put out to pasture.

No, AWS hasn’t officially killed it. You can still spin one up, just like you can still technically send a fax. AWS will happily support it. But its days as the default, go-to solution for modern workloads are decisively over. The battle for the future of scaling has ended, and the ASG wasn’t the winner. The new default is serverless containers, hyper-optimized Spot fleets, and platforms so abstract they’re practically invisible.

If you’re still building your infrastructure around the ASG, you’re building a brand-new house with plumbing from 1985. It’s time to talk about why our old friend is retiring and meet the eager new hires who are already measuring the drapes in its office.

So why is the ASG getting the boot?

We loved the ASG. It was a revolutionary idea. But like that one brilliant relative everyone dreads sitting next to at dinner, it was also exhausting. Its retirement was long overdue, and the reasons are the same frustrations we’ve all been quietly grumbling about into our coffee for years.

It promised automation but gave us chores

The ASG’s sales pitch was simple: “I’ll handle the scaling!” But that promise came with a three-page, fine-print addendum of chores.

It was the operational overhead that killed us. We were promised a self-driving car and ended up with a stick-shift that required constant, neurotic supervision. We became part-time Launch Template librarians, meticulously versioning every tiny change. We became health-check philosophers, endlessly debating the finer points of ELB vs. EC2 health checks.

And then… the Lifecycle Hooks.

A “Lifecycle Hook” is a polite, clinical term for a Rube Goldberg machine of desperation. It’s a panic button that triggers a Lambda, which calls a Systems Manager script, which sends a carrier pigeon to… maybe… drain a connection pool before the instance is ruthlessly terminated. Trying to debug one at 3 AM was a rite of passage, a surefire way to lose precious engineering time and a little bit of your soul.

It moves at a glacial pace

The second nail in the coffin was its speed. Or rather, the complete lack of it.

The ASG scales at the speed of a full VM boot. In our world of spiky, unpredictable traffic, that’s an eternity. It’s like pre-heating a giant, industrial pizza oven for 45 minutes just to toast a single slice of bread. By the time your new instance is booted, configured, service-discovered, and finally “InService,” the spike in traffic has already come and gone, leaving you with a bigger bill and a cohort of very annoyed users.

It’s an expensive insurance policy

The ASG model is fundamentally wasteful. You run a “warm” fleet, paying for idle capacity just in case you need it. It’s like paying rent on a 5-bedroom house for your family of three, just in case 30 cousins decide to visit unannounced.

This “scale-up” model was slow, and the “scale-down” was even worse, riddled with fears of terminating the wrong instance and triggering a cascading failure. We ended up over-provisioning to avoid the pain of scaling, which completely defeats the purpose of “auto-scaling.”

The eager interns taking over the desk

So, the ASG has cleared out its desk. Who’s moving in? It turns out there’s a whole line of replacements, each one leaner, faster, and blissfully unconcerned with managing a “fleet.”

1. The appliance: Fargate and Cloud Run

First up is the “serverless container”. This is the hyper-efficient new hire who just says, “Give me the Dockerfile. I’ll handle the rest.”

With AWS Fargate or Google’s Cloud Run, you don’t have a fleet. You don’t manage VMs. You don’t patch operating systems. You don’t even think about an instance. You just define a task, give it some CPU and memory, and tell it how many copies you want. It scales from zero to a thousand in seconds.

This is the appliance model. When you buy a toaster, you don’t worry about wiring the heating elements or managing its power supply. You just put in bread and get toast. Fargate is the toaster. The ASG was the “build-your-own-toaster” kit that came with a 200-page manual on electrical engineering.

Just look at the cognitive load. This is what it takes to get a basic ASG running via the CLI:

# The "Old Way": Just one of the many steps...
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name my-legacy-asg \
    --launch-template "LaunchTemplateName=my-launch-template,Version='1'" \
    --min-size 1 \
    --max-size 5 \
    --desired-capacity 2 \
    --vpc-zone-identifier "subnet-0571c54b67EXAMPLE,subnet-0c1f4e4776EXAMPLE" \
    --health-check-type ELB \
    --health-check-grace-period 300 \
    --tags "Key=Name,Value=My-ASG-Instance,PropagateAtLaunch=true"

You still need to define the launch template, the subnets, the load balancer, the health checks…

Now, here’s the core of a Fargate task definition. It’s just a simple JSON file:

// The "New Way": A snippet from a Fargate Task Definition
{
  "family": "my-modern-app",
  "containerDefinitions": [
    {
      "name": "my-container",
      "image": "nginx:latest",
      "cpu": 256,
      "memory": 512,
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ]
    }
  ],
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512"
}

You define what you need, and the platform handles everything else.

2. The extreme couponer: Spot fleets

For workloads that are less “instant spike” and more “giant batch job,” we have the “optimized fleet”. This is the high-stakes, high-reward world of Spot Instances.

Spot used to be terrifying. AWS could pull the plug with two minutes’ notice, and your entire workload would evaporate. But now, with Spot Fleets and diversification, it’s the smartest tool in the box. You can tell AWS, “I need 1,000 vCPUs, and I don’t care what instance types you give me, just find the cheapest ones.”

The platform then builds a diversified fleet for you across multiple instance types and Availability Zones, making it incredibly resilient to any single Spot pool termination. It’s perfect for data processing, CI/CD runners, and any batch job that can be interrupted and resumed. The ASG was always too rigid for this kind of dynamic, cost-driven scaling.
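
Here is roughly what that "just find me capacity" request looks like through the EC2 Fleet API. The launch template name and instance types are placeholders, and the target capacity is counted in instances here rather than vCPUs:

// fleet.json (submit with: aws ec2 create-fleet --cli-input-json file://fleet.json)
{
  "Type": "request",
  "SpotOptions": {
    "AllocationStrategy": "price-capacity-optimized"
  },
  "LaunchTemplateConfigs": [
    {
      "LaunchTemplateSpecification": {
        "LaunchTemplateName": "batch-workers",
        "Version": "$Latest"
      },
      "Overrides": [
        { "InstanceType": "c5.2xlarge" },
        { "InstanceType": "c5a.2xlarge" },
        { "InstanceType": "m5.2xlarge" },
        { "InstanceType": "m6i.2xlarge" }
      ]
    }
  ],
  "TargetCapacitySpecification": {
    "TotalTargetCapacity": 125,
    "DefaultTargetCapacityType": "spot"
  }
}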

3. The paranoid security guard: MicroVMs

Then there’s the truly weird stuff: Firecracker. This is the technology that powers AWS Lambda and Fargate. It’s a “MicroVM” that gives you the iron-clad security isolation of a full virtual machine but with the lightning-fast startup speed of a container.

We’re talking boot times of under 125 milliseconds. This is for when you need to run thousands of tiny, separate, untrusted workloads simultaneously without them ever being able to see each other. It’s the ultimate “multi-tenant” dream, giving every user their own tiny, disposable, fire-walled VM in the blink of an eye.

4. The invisible platform: Edge runtimes

Finally, we have the platforms that are so abstract they’re “scaled to invisibility”. This is the world of Edge. Think Lambda@Edge or CloudFront Functions.

With these, you’re not even scaling in a region anymore. Your logic, your code, is automatically replicated and executed at hundreds of Points of Presence around the globe, as close to the end-user as possible. The entire concept of a “fleet” or “instance” just… disappears. The logic scales with the request.
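
To give you a sense of how small these edge workloads are, here is a sketch of a CloudFront Function (the paths are invented). The entire "deployment" is one handler, replicated to every edge location for you:

// A CloudFront Function: no instance, no fleet, just a handler at the edge
function handler(event) {
    var request = event.request;
    // Example: redirect a legacy path without ever waking an origin server
    if (request.uri === '/old-pricing') {
        return {
            statusCode: 301,
            statusDescription: 'Moved Permanently',
            headers: { location: { value: '/pricing' } }
        };
    }
    return request;
}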

Life after the funeral. How to adapt

Okay, the eulogy is over. The ASG is in its rocking chair on the porch. What does this mean for us, the builders? It’s time to sort through the old belongings and modernize the house.

Go full Marie Kondo on your architecture

First, you need to re-evaluate. Open up your AWS console and take a hard look at every single ASG you’re running. Be honest. Ask the tough questions:

  • Does this workload really need to be stateful?
  • Do I really need VM-level control, or am I just clinging to it for comfort?
  • Is this a stateless web app that I’ve just been too lazy to containerize?

If it doesn’t spark joy (or isn’t a snowflake legacy app that’s impossible to change), thank it for its service and plan its migration.

Stop shopping for engines, start shopping for cars

The most important shift is this: Pick the runtime, not the infrastructure.

For too long, our first question was, “What EC2 instance type do I need?” That’s the wrong question. That’s like trying to build a new car by starting at the hardware store to buy pistons.

The right question is, “What’s the best runtime for my workload?”

  • Is it a simple, event-driven piece of logic? That’s a Function (Lambda).
  • Is it a stateless web app in a container? That’s a Serverless Container (Fargate).
  • Is it a massive, interruptible batch job? That’s an Optimized Fleet (Spot).
  • Is it a cranky, stateful monolith that needs a pet VM? Only then do you fall back to an Instance (EC2, maybe even with an ASG).

Automate logic, not instance counts

Your job is no longer to be a VM mechanic. Your team’s skills need to shift. Stop manually tuning desired_capacity and start designing event-driven systems.

Focus on scaling logic, not servers. Your scaling trigger shouldn’t be “CPU is at 80%.” It should be “The SQS queue depth is greater than 100” or “API latency just breached 200ms”. Let the platform, be it Lambda, Fargate, or a KEDA-powered Kubernetes cluster, figure out how to add more processing power.
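
As a concrete example of "scale on queue depth, not CPU", here is what that trigger can look like as a KEDA ScaledObject. The Deployment name, queue URL, and credentials reference are placeholders:

# scaledobject-sqs.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker              # the Deployment KEDA scales (placeholder)
  minReplicaCount: 0                # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/jobs
        queueLength: "100"          # target messages per replica
        awsRegion: us-east-1
      authenticationRef:
        name: keda-aws-credentials  # a TriggerAuthentication (placeholder)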

Was it really better in the old days?

Of course, this move to abstraction isn’t without trade-offs. We’re gaining a lot, but we’re also losing something.

The gain is obvious: We get our nights and weekends back. We get drastically reduced operational overhead, faster scaling, and for most stateless workloads, a much lower bill.

The loss is control. You can’t SSH into a Fargate container. You can’t run a custom kernel module on Lambda. For those few, truly special, high-customization legacy workloads, this is a dealbreaker. They will be the ASG’s loyal companions in the retirement home.

But for everything else? The ASG is a relic. It was a brilliant, necessary solution for the problems of 2010. But the problems of 2025 and beyond are different. The cloud has evolved to scale logic, functions, and containers, not just nodes.

The king isn’t just dead. The very concept of a throne has been replaced by a highly efficient, distributed, and slightly impersonal serverless committee. And frankly, it’s about time.

Why your Kubernetes pods and EBS volumes refuse to reconnect

You’re about to head out for lunch. One last, satisfying glance at the monitoring dashboard, all systems green. Perfect. You return an hour later, coffee in hand, to a cascade of alerts. Your application is down. At the heart of the chaos is a single, cryptic message from Kubernetes, and it’s in a mood.

Warning: 1 node(s) had volume node affinity conflict.

You stare at the message. “Volume node affinity conflict” sounds less like a server error and more like something a therapist would say about a couple that can’t agree on which city to live in. You grab your laptop. One of your critical application pods has been evicted from its node and now sits stubbornly in a Pending state, refusing to start anywhere else.

Welcome to the quiet, simmering nightmare of running stateful applications on a multi-availability zone Kubernetes cluster. Your pods and your storage are having a domestic dispute, and you’re the unlucky counselor who has to fix it before the morning stand-up.

Meet the unhappy couple

To understand why your infrastructure is suddenly giving you the silent treatment, you need to understand the two personalities at the heart of this conflict.

First, we have the Pod. Think of your Pod as a freewheeling digital nomad. It’s lightweight, agile, and loves to travel. If its current home (a Node) gets too crowded or suddenly vanishes in a puff of cloud provider maintenance, the Kubernetes scheduler happily finds it a new place to live on another node. The Pod packs its bags in a microsecond and moves on, no questions asked. It believes in flexibility and a minimalist lifestyle.

Then, there’s the EBS volume. If the Pod is a nomad, the Amazon EBS Volume is a resolute homebody. It’s a hefty, 20GB chunk of your application’s precious data. It’s incredibly reliable and fast, but it has one non-negotiable trait: it is physically, metaphorically, and spiritually attached to one single place. That place is an AWS Availability Zone (AZ), which is just a fancy term for a specific data center. An EBS volume created in us-west-2a lives in us-west-2a, and it would rather be deleted than move to us-west-2b. It finds the very idea of travel vulgar.

You can already see the potential for drama. The free-spirited Pod gets evicted and is ready to move to a lovely new node in us-west-2b. But its data, its entire life story, is sitting back in us-west-2a, refusing to budge. The Pod can’t function without its data, so it just sits there, Pending, forever waiting for a reunion that will never happen.

The brute force solution that creates new problems

When faced with this standoff, our first instinct is often to play the role of a strict parent. “You two will stay together, and that’s final!” In Kubernetes, this is called the nodeSelector.

You can edit your Deployment and tell the Pod, in no uncertain terms, that it is only allowed to live in the same neighborhood as its precious volume.

# deployment-with-nodeselector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-stateful-app
  template:
    metadata:
      labels:
        app: my-stateful-app
    spec:
      nodeSelector:
        # "You will ONLY live in this specific zone!"
        topology.kubernetes.io/zone: us-west-2b
      containers:
        - name: my-app-container
          image: nginx:1.25.3
          volumeMounts:
            - name: app-data
              mountPath: /var/www/html
      volumes:
        - name: app-data
          persistentVolumeClaim:
            claimName: my-app-pvc

This works. Kind of. The Pod is now shackled to the us-west-2b availability zone. If it gets rescheduled, the scheduler will only consider other nodes within that same AZ. The affinity conflict is solved.

But you’ve just traded one problem for a much scarier one. You’ve effectively disabled the “multi-AZ” resilience for this application. If us-west-2b experiences an outage or simply runs out of compute resources, your pod has nowhere to go. It will remain Pending, not because of a storage spat, but because you’ve locked it in a house that’s just run out of oxygen. This isn’t a solution; it’s just picking a different way to fail.

The elegant fix of intelligent patience

So, how do we get our couple to cooperate without resorting to digital handcuffs? The answer lies in changing not where they live, but how they decide to move in together.

The real hero of our story is a little-known StorageClass parameter: volumeBindingMode: WaitForFirstConsumer.

By default, when you ask for a PersistentVolumeClaim, Kubernetes provisions the EBS volume immediately. It’s like buying a heavy, immovable sofa before you’ve even chosen an apartment. The delivery truck drops it in us-west-2a, and now you’re forced to find an apartment in that specific neighborhood.

WaitForFirstConsumer flips the script entirely. It tells Kubernetes: “Hold on. Don’t buy the sofa yet. First, let the Pod (the ‘First Consumer’) find an apartment it likes.”

Here’s how this intelligent process unfolds:

  1. You request a volume with a PersistentVolumeClaim.
  2. The StorageClass, configured with WaitForFirstConsumer, does… nothing. It waits.
  3. The Kubernetes scheduler, now free from any storage constraints, analyzes all your nodes across all your availability zones. It finds the best possible node for your Pod based on resources and other policies. Let’s say it picks a node in us-west-2c.
  4. Only after the Pod has been assigned a home on that node does the StorageClass get the signal. It then dutifully provisions a brand-new EBS volume in that exact same zone, us-west-2c.

The Pod and its data are born together, in the same place, at the same time. No conflict. No drama. It’s a match made in cloud heaven.

Here is what this “patient” StorageClass looks like:

# storageclass-patient.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
# This is the magic line.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Your PersistentVolumeClaim simply needs to reference it:

# persistentvolumeclaim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-pvc
spec:
  # Reference the patient StorageClass
  storageClassName: ebs-sc-wait
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

And now, your Deployment can be blissfully unaware of zones, free to roam as a true digital nomad should.

# deployment-liberated.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-stateful-app
  template:
    metadata:
      labels:
        app: my-stateful-app
    spec:
      # No nodeSelector! The pod is free!
      containers:
        - name: my-app-container
          image: nginx:1.25.3
          volumeMounts:
            - name: app-data
              mountPath: /var/www/html
      volumes:
        - name: app-data
          persistentVolumeClaim:
            claimName: my-app-pvc

Let your infrastructure work for you

The moral of the story is simple. Don’t fight the brilliant, distributed nature of Kubernetes with rigid, zonal constraints. You chose a multi-AZ setup for resilience, so don’t let your storage configuration sabotage it.

By using WaitForFirstConsumer, which, thankfully, is the default in modern versions of the AWS EBS CSI Driver, you allow the scheduler to do its job properly. Your pods and volumes can finally have a healthy, lasting relationship, happily migrating together wherever the cloud winds take them.
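
Trust, but verify. A single command shows which binding mode your storage classes actually use (output trimmed and illustrative):

# Which binding mode are my storage classes actually using?
kubectl get storageclass

# NAME          PROVISIONER       RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
# ebs-sc-wait   ebs.csi.aws.com   Delete          WaitForFirstConsumer   true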

And you? You can go back to sleep.

Playing detective with dead Kubernetes nodes

It arrives without warning, a digital tap on the shoulder that quickly turns into a full-blown alarm. Maybe you’re mid-sentence in a meeting, or maybe you’re just enjoying a rare moment of quiet. Suddenly, a shriek from your phone cuts through everything. It’s the on-call alert, flashing a single, dreaded message: NodeNotReady.

Your beautifully orchestrated city of containers, a masterpiece of modern engineering, now has a major power outage in one of its districts. One of your worker nodes, a once-diligent and productive member of the cluster, has gone completely silent. It’s not responding to calls, it’s not picking up new work, and its existing jobs are in limbo. In the world of Kubernetes, this isn’t just a technical issue; it’s a ghosting of the highest order.

Before you start questioning your life choices or sacrificing a rubber chicken to the networking gods, take a deep breath. Put on your detective’s trench coat. We have a case to solve.

First on the scene, the initial triage

Every good investigation starts by surveying the crime scene and asking the most basic question: What the heck happened here? In our world, this means a quick and clean interrogation of the Kubernetes API server. It’s time for a roll call.

kubectl get nodes -o wide

This little command is your first clue. It lines up all your nodes and points a big, accusatory finger at the one in the Not Ready state.

NAME                    STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-master-1            Ready      master   90d   v1.28.2   10.128.0.2       34.67.123.1     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-7b5d    NotReady   <none>   45d   v1.28.2   10.128.0.5       35.190.45.6     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-fg9h    Ready      <none>   45d   v1.28.2   10.128.0.4       35.190.78.9     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9

There’s our problem child: k8s-worker-node-7b5d. Now that we’ve identified our silent suspect, it’s time to pull it into the interrogation room for a more personal chat.

kubectl describe node k8s-worker-node-7b5d

The output of describe is where the juicy gossip lives. You’re not just looking at specs; you’re looking for a story. Scroll down to the Conditions and, most importantly, the Events section at the bottom. This is where the node often leaves a trail of breadcrumbs explaining exactly why it decided to take an unscheduled vacation.

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:50:05 +0200   KubeletNotReady              container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Events:
  Type     Reason                   Age                  From                       Message
  ----     ------                   ----                 ----                       -------
  Normal   Starting                 25m                  kubelet                    Starting kubelet.
  Warning  ContainerRuntimeNotReady 5m12s (x120 over 25m) kubelet                    container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Aha! Look at that. The Events log is screaming for help. A repeating warning, ContainerRuntimeNotReady, points to a CNI (Container Network Interface) plugin having a full-blown tantrum. We’ve moved from a mystery to a specific lead.

The usual suspects, a rogues’ gallery

When a node goes quiet, the culprit is usually one of a few repeat offenders. Let’s line them up.

1. The silent saboteur: network issues

This is the most common villain. Your node might be perfectly healthy, but if it can’t talk to the control plane, it might as well be on a deserted island. Think of the control plane as the central office trying to call its remote employee (the node). If the phone line is cut, the office assumes the employee is gone. This can be caused by firewall rules blocking ports, misconfigured VPC routes, or a DNS server that’s decided to take the day off.

2. The overworked informant: the kubelet

The kubelet is the control plane’s informant on every node. It’s a tireless little agent that reports on the node’s health and carries out orders. But sometimes, this agent gets sick. It might have crashed, stalled, or is struggling with misconfigured credentials (like expired TLS certificates) and can’t authenticate with the mothership. If the informant goes silent, the node is immediately marked as a person of interest.

You can check on its health directly on the node:

# SSH into the problematic node
ssh user@<node-ip>

# Check the kubelet's vital signs
systemctl status kubelet

A healthy output should say active (running). Anything else, and you’ve found a key piece of evidence.
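
If the informant has crashed or wedged, restarting it is the cheapest experiment you can run, followed by checking whether it sticks:

# Still on the node: restart the informant, then check it stays up
sudo systemctl restart kubelet
systemctl status kubelet --no-pager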

3. The glutton: resource exhaustion

Your node has a finite amount of CPU, memory, and disk space. If a greedy application (or a swarm of them) consumes everything, the node itself can become starved. The kubelet and other critical system daemons need resources to breathe. Without them, they suffocate and stop reporting in. It’s like one person eating the entire buffet, leaving nothing for the hosts of the party.

A quick way to check for gluttons is with:

kubectl top node <your-problem-child-node-name>

If you see CPU or memory usage kissing 100%, you’ve likely found your culprit.

The forensic toolkit: digging deeper

If the initial triage and lineup didn’t reveal the killer, it’s time to break out the forensic tools and get our hands dirty.

Sifting through the diary with journalctl

The journalctl command is your window into the kubelet’s soul (or, more accurately, its log files). This is where it writes down its every thought, fear, and error.

# On the node, tail the kubelet's logs for clues
journalctl -u kubelet -f --since "10 minutes ago"

Look for recurring error messages, failed connection attempts, or anything that looks suspiciously out of place.

Quarantining the patient with drain

Before you start performing open-heart surgery on the node, it’s wise to evacuate the civilians. The kubectl drain command gracefully evicts all the pods from the node, allowing them to be rescheduled elsewhere.

kubectl drain k8s-worker-node-7b5d --ignore-daemonsets --delete-emptydir-data

This isolates the patient, letting you work without causing a city-wide service outage.

Confirming the phone lines with curl

Don’t just trust the error messages. Verify them. From the problematic node, try to contact the API server directly. This tells you if the fundamental network path is even open.

# From the problem node, try to reach the API server endpoint
curl -k https://<api-server-ip>:<port>/healthz

If you get ok, the basic connection is fine. If it times out or gets rejected, you’ve confirmed a networking black hole.

Crime prevention: keeping your nodes out of trouble

Solving the case is satisfying, but a true detective also works to prevent future crimes.

  • Set up a neighborhood watch: Implement robust monitoring with tools like Prometheus and Grafana. Set up alerts for high resource usage, disk pressure, and node status changes. It’s better to spot a prowler before they break in.
  • Install self-healing robots: Most cloud providers (GKE, EKS, AKS) offer node auto-repair features. If a node fails its health checks, the platform will automatically attempt to repair it or replace it. Turn this on. It’s your 24/7 robotic police force.
  • Enforce city zoning laws: Use resource requests and limits on your deployments (see the sketch just after this list). This prevents any single application from building a resource-hogging skyscraper that blocks the sun for everyone else.
  • Schedule regular health checkups: Keep your cluster components, operating systems, and container runtimes updated. Many Not Ready mysteries are caused by long-solved bugs that you could have avoided with a simple patch.
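
Those zoning laws are only a few lines per container. The numbers below are placeholders to tune for your own workloads:

# A well-behaved container: it reserves what it needs and has a hard ceiling
containers:
  - name: my-app-container
    image: nginx:1.25.3
    resources:
      requests:            # what the scheduler sets aside for this container
        cpu: "250m"
        memory: "256Mi"
      limits:              # the ceiling the kubelet enforces
        cpu: "500m"
        memory: "512Mi"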

The case is closed for now

So there you have it. The rogue node is back in line, the pods are humming along, and the city of containers is once again at peace. You can hang up your trench coat, put your feet up, and enjoy that lukewarm coffee you made three hours ago. The mystery is solved.

But let’s be honest. Debugging a Not Ready node is less like a thrilling Sherlock Holmes novel and more like trying to figure out why your toaster only toasts one side of the bread. It’s a methodical, often maddening, process of elimination. You start with grand theories of network conspiracies and end up discovering the culprit was a single, misplaced comma in a YAML file, the digital equivalent of the butler tripping over the rug.

So the next time an alert yanks you from your peaceful existence, don’t panic. Remember that you are a digital detective, a whisperer of broken machines. Your job is to patiently ask the right questions until the silent, uncooperative suspect finally confesses. After all, in the world of Kubernetes, a node is never truly dead. It’s just being dramatic and waiting for a good detective to find the clues, and maybe, just maybe, restart its kubelet. The city is safe… until the next time. And there is always a next time.

Your Kubernetes rollback is lying

The PagerDuty alert screams. The new release, born just minutes ago with such promising release notes, is coughing up blood in production. The team’s Slack channel is a frantic mess of flashing red emojis. Someone, summoning the voice of a panicked adult, yells the magic word: “ROLLBACK!”

And so, Helm, our trusty tow-truck operator, rides in with a smile, waving its friendly green check marks. The dashboards, those silent accomplices, beam with the serene glow of healthy metrics. Kubernetes probes, ever so polite, confirm that the resurrected pods are, in fact, “breathing.”

Then, production face-plants. Hard.

The feeling is like putting a cartoon-themed bandage on a burst water pipe and then wondering, with genuine surprise, why the living room has become a swimming pool. This article is the autopsy of those “perfect” rollbacks. We’re going to uncover why your monitoring is a pathological liar, how network traffic becomes a double agent, and what to do so that the next time Helm gives you a thumbs-up, you can actually believe it.

A state that refuses to time-travel

The first, most brutal lie a rollback tells you is that it can turn back time. A helm rollback is like the “rewind” button on an old VCR remote; it diligently rewinds the tape (your YAML manifests), but it has absolutely no power to make the actors on screen younger.

Your application’s state is one of those stubborn actors.

While your ConfigMaps and Secrets might dutifully revert to their previous versions, your data lives firmly in the present. If your new release included a database migration that added a column, rolling back the application code doesn’t magically make that column disappear. Now your old code is staring at a database schema from the future, utterly confused, like a medieval blacksmith being handed an iPad.

The same goes for PersistentVolumeClaims, external caches like Redis, or messages sitting in a Kafka queue. The rollback command whispers sweet nothings about returning to a “known good state,” but it’s only talking about itself. The rest of your universe has moved on, and it refuses to travel back with you.

The overly polite doorman

The second culprit in our investigation is the Kubernetes probe. Think of the readinessProbe as an overly polite doorman at a fancy party. Its job is to check if a guest (your pod) is ready to enter. But its definition of “ready” can be dangerously optimistic.

Many applications, especially those running on the JVM, have what we’ll call a “warming up” period. When a pod starts, the process is running, the HTTP port is open, and it will happily respond to a simple /health check. The doorman sees a guest in a tuxedo and says, “Looks good to me!” and opens the door.

What the doorman doesn’t see is that this guest is still stretching, yawning, and trying to remember where they are. The application’s caches are cold, its connection pools are empty, and its JIT compiler is just beginning to think about maybe, possibly, optimizing some code. The first few dozen requests it receives will be painfully slow or, worse, time out completely.

So while your readinessProbe is giving you a green light, your first wave of users is getting a face full of errors. For these sleepy applications, you need a more rigorous bouncer.

A startupProbe is that bouncer. It gives the app a generous amount of time to get its act together before even letting the doorman (readiness and liveness probes) start their shift.

# The startupProbe below gives our sleepy JVM app up to 5 minutes to wake up.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  # Kubelet will try 30 times with a 10-second interval (300 seconds).
  # If the app isn't ready by then, the pod will be restarted.
  failureThreshold: 30
  periodSeconds: 10

Without it, your rollback creates a fleet of pods that are technically alive but functionally useless, and Kubernetes happily sends them a flood of unsuspecting users.

Traffic, the double agent

And that brings us to our final suspect: the network traffic itself. In a modern setup using a service mesh like Istio or Linkerd, traffic routing is a sophisticated dance. But even the most graceful dancer can trip.

When you roll back, a new ReplicaSet is created with the old pod specification. The service mesh sees these new pods starting up, asks the doorman (readinessProbe) if they’re good to go, gets an enthusiastic “yes!”, and immediately starts sending them a percentage of live production traffic.

This is where all our problems converge. Your service mesh, in its infinite efficiency, has just routed 50% of your user traffic to a platoon of sleepy, confused pods that are trying to talk to a database from the future.

Let’s look at the evidence. This VirtualService, which we now call “The 50/50 Disaster Splitter,” was routing traffic with criminal optimism.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 50 # 50% to the (theoretically) working pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 50 # 50% to the pods we just dragged from the past

The service mesh isn’t malicious. It’s just an incredibly efficient tool that is very good at following bad instructions. It sees a green light and hits the accelerator.

A survival guide that won’t betray you

So, you’re in the middle of a fire, and the “break glass in case of emergency” button is a lie. What do you do? You need a playbook that acknowledges reality.

Step 0: Breathe and isolate the blast radius

Before you even think about rolling back, stop the bleeding. The fastest way to do that is often at the traffic level. Use your service mesh or ingress controller to immediately shift 100% of traffic back to the last known good version. Don’t wait for new pods to start. This is a surgical move that takes seconds and gives you breathing room.
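
With the VirtualService from earlier, that surgical move is nothing more exotic than a weight change:

# The emergency lever: same VirtualService, new weights
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 100 # everything goes back to the last known good pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 0   # the suspects get no traffic while you investigate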

Step 1: Declare an incident and gather the detectives

Get the right people on a call. Announce that this is not a “quick rollback” but an incident investigation. Your goal is to understand why the release failed, not just to hit the undo button.

Step 2: Perform the autopsy (while the system is stable)

With traffic safely routed away from the wreckage, you can now investigate. Check the logs of the failed pods. Look at the database. Is there a schema mismatch? A bad configuration? This is where you find the real killer.

Step 3: Plan the counter-offensive (which might not be a rollback)

Sometimes, the safest path forward is a roll forward. A small hotfix that corrects the issue might be faster and less risky than trying to force the old code to work with a new state. A rollback should be a deliberate, planned action, not a panic reflex. If you must roll back, do it with the knowledge you’ve gained from your investigation.

Step 4: The deliberate, cautious rollback

If you’ve determined a rollback is the only way, do it methodically.

  1. Scale down the broken deployment:
    kubectl scale deployment/checkout-api --replicas=0 -n prod-eu-central
  2. Execute the Helm rollback:
    helm rollback checkout-api 1 -n prod-eu-central
  3. Watch the new pods like a hawk: Monitor their logs and key metrics as they come up. Don’t trust the green check marks.
  4. Perform a Canary Release: Once the new pods look genuinely healthy, use your service mesh to send them 1% of the traffic. Then 10%. Then 50%. Then 100%. You are now in control, not the blind optimism of the automation.

The truth will set you free

A Kubernetes rollback isn’t a time machine. It’s a YAML editor with a fancy title. It doesn’t understand your data, it doesn’t appreciate your app’s need for a morning coffee, and it certainly doesn’t grasp the nuances of traffic routing under pressure.

Treating a rollback as a simple, safe undo button is the fastest way to turn a small incident into a full-blown outage. By understanding the lies it tells, you can build a process that trusts human investigation over deceptive green lights. So the next time a deployment goes sideways, don’t just reach for the rollback lever. Reach for your detective’s hat instead.

Confessions of a recovering GitOps addict

There’s a moment in every tech trend’s lifecycle when the magic starts to wear off. It’s like realizing the artisanal, organic, free-range coffee you’ve been paying eight dollars for just tastes like… coffee. For me, and many others in the DevOps trenches, that moment has arrived for GitOps.

We once hailed it as the silver bullet, the grand unifier, the one true way. Now, I’m here to tell you that the romance is over. And something much more practical is taking its place.

The alluring promise of a perfect world

Let’s be honest, we all fell hard for GitOps. The promise was intoxicating. A single source of truth for our entire infrastructure, nestled right in the warm, familiar embrace of Git. Pull Requests became the sacred gates through which all changes must pass. CI/CD pipelines were our holy scrolls, and tools like ArgoCD and Flux were the messiahs delivering us from the chaos of manual deployments.

It was a world of perfect order. Every change was audited, every state was declared, and every rollback was just a git revert away. It felt clean. It felt right. It felt… professional. For a while, it was the hero we desperately needed.

The tyranny of the pull request

But paradise had a dark side, and it was paved with endless YAML files. The first sign of trouble wasn’t a catastrophic failure, but a slow, creeping bureaucracy that we had built for ourselves.

Need to update a single, tiny secret? Prepare for the ritual. First, the offering: a Pull Request. Then, the prayer for the high priests (your colleagues) to grant their blessing (the approval). Then, the sacrifice (the merge). And finally, the tense vigil, watching ArgoCD’s sync status like it’s a heart monitor, praying it doesn’t flatline.

The lag became a running joke. Your change is merged… but has it landed in production? Who knows! The sync bot seems to be having a bad day. When everything is on fire at 2 AM, Git is like that friend who proudly tells you, “Well, according to my notes, the plan was for there not to be a fire.” Thanks, Git. Your record of intent is fascinating, but I need a fire hose, not a historian.

We hit our wall during what should have been a routine update.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: auth-service-container
        image: our-app:v1.12.4
        envFrom:
        - secretRef:
            name: production-credentials

A simple change to the production-credentials secret required updating an encrypted file, PR-ing it, and then explaining in the commit message something like, “bumping secret hash for reasons”. Nobody understood it. Infrastructure changes started to require therapy sessions just to get merged.

And then, the tools fought back

When a system creates more friction than it removes, a rebellion is inevitable. And the rebels have arrived, not with pitchforks, but with smarter, more flexible tools.

First, the idea that developers should be fluent in YAML began to die. Internal Developer Platforms (IDPs) like Backstage and Port started giving developers what they always wanted: self-service with guardrails. Instead of wrestling with YAML syntax, they click a button in a portal to provision a database or spin up a new environment. Git becomes a log of what happened, not a bottleneck to make things happen.

Second, we remembered that pushing things can be good. The pull-based model was trendy, but let’s face it: push is immediate. Push is observable. We’ve gone back to CI pipelines pushing manifests directly into clusters, but this time they’re wearing body armor.

# This isn't your old wild-west kubectl apply
# It's a command wrapped in an approval system, with observability baked in.
deploy-cli --service auth-service --env production --approve

The change is triggered precisely when we want it, not when a bot feels like syncing. Finally, we started asking a radical question: why are we describing infrastructure in a static markup language when we could be programming it? Tools like Pulumi and Crossplane entered the scene. Instead of hundreds of lines of YAML, we’re writing code that feels alive.

import * as aws from "@pulumi/aws";

// Create an S3 bucket with versioning enabled.
const bucket = new aws.s3.Bucket("user-uploads-bucket", {
    versioning: {
        enabled: true,
    },
    acl: "private",
});

Infrastructure can now react to events, be composed into reusable modules, and be written in a language with types and logic. YAML simply can’t compete with that.

A new role for the abdicated king

So, is GitOps dead? No, that’s just clickbait. But it has been demoted. It’s no longer the king ruling every action; it’s more like a constitutional monarch, a respected elder statesman.

It’s fantastic for auditing, for keeping a high-level record of intended state, and for infrastructure teams that thrive on rigid discipline. But for high-velocity product teams, it’s become a beautifully crafted anchor when what we need is a motor.

We’ve moved from “Let’s define everything in Git” to “Let’s ship faster, safer, and saner with the right tools for the job.”

Our current stack is a hybrid, a practical mix of the old and new:

  • Backstage to abstract away complexity for developers.
  • Push-based pipelines with strong guardrails for immediate, observable deployments.
  • Pulumi for typed, programmable, and composable infrastructure.
  • Minimal GitOps for what it does best: providing a clear, auditable trail of our intentions.

GitOps wasn’t a mistake; it was the strict but well-meaning grandparent of infrastructure management. It taught us discipline and the importance of getting approval before touching anything important. But now that we’re grown up, that level of supervision feels less like helpful guidance and more like having someone watch over your shoulder while you type, constantly asking, “Are you sure you want to save that file?” The world is moving on to flexibility, developer-first platforms, and code you can read without a decoder ring. If you’re still spending your nights appeasing the YAML gods with Pull Request sacrifices for trivial changes… you’re not just living in the past, you’re practically a fossil.

When docker compose stopped being magic

There was a time, not so long ago, when docker-compose up felt like performing a magic trick. You’d scribble a few arcane incantations into a YAML file and, poof, your entire development stack would spring to life. The database, the cache, your API, the frontend… all humming along obediently on localhost. Docker Compose wasn’t just a tool; it was the trusty Swiss Army knife in every developer’s pocket, the reliable friend who always had your back.

Until it didn’t.

Our breakup wasn’t a single, dramatic event. It was a slow fade, the kind of awkward drifting apart that happens when one friend grows and the other… well, the other is perfectly happy staying exactly where they are. It began with small annoyances, then grew into full-blown arguments. We eventually realized we were spending more time trying to fix our relationship with YAML than actually building things.

So, with a heavy heart and a sigh of relief, we finally said goodbye.

The cracks begin to show

As our team and infrastructure matured, our reliable friend started showing some deeply annoying habits. The magic tricks became frustratingly predictable failures.

  • Our services started giving each other the silent treatment. The networking between containers became as fragile and unpredictable as a Wi-Fi connection on a cross-country train. One moment they were chatting happily, the next they wouldn’t be caught dead in the same virtual network.
  • It was worse at keeping secrets than a gossip columnist. The lack of native, secure secret handling was, to put it mildly, a joke. We were practically writing passwords on sticky notes and hoping for the best.
  • It developed a severe case of multiple personality disorder. The same docker-compose.yml file would behave like a well-mannered gentleman on one developer’s machine, a rebellious teenager in staging, and a complete, raving lunatic in production. Consistency was not its strong suit.
  • The phrase “It works on my machine” became a ritualistic chant. We’d repeat it, hoping to appease the demo gods, but they are a fickle bunch and rarely listened. We needed reliability, not superstition.

We had to face the truth. Our old friend just couldn’t keep up.

Moving on to greener pastures

The final straw was the realization that we had become full-time YAML therapists. It was time to stop fixing and start building again. We didn’t just dump Compose; we replaced it, piece by piece, with tools that were actually designed for the world we live in now.

For real infrastructure, we chose real code

For our production and staging environments, we needed a serious, long-term commitment. We found it in the AWS Cloud Development Kit (CDK). Instead of vaguely describing our needs in YAML and hoping for the best, we started declaring our infrastructure with the full power and grace of TypeScript.

We went from a hopeful plea like this:

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - database
  database:
    image: "postgres:14-alpine"

To a confident, explicit declaration like this:

// lib/api-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

// ... inside your Stack class
// Look up an existing VPC (the default one here) rather than hand-wiring networking
const vpc = ec2.Vpc.fromLookup(this, 'ApiVpc', { isDefault: true });
const cluster = new ecs.Cluster(this, 'ApiCluster', { vpc });

// Create a load-balanced Fargate service and make it public
new ecs_patterns.ApplicationLoadBalancedFargateService(this, 'ApiService', {
  cluster: cluster,
  cpu: 256,
  memoryLimitMiB: 512,
  desiredCount: 2, // Let's have some redundancy
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry("your-org/your-awesome-api"),
    containerPort: 8080,
  },
  publicLoadBalancer: true,
});

It’s reusable, it’s testable, and it’s cloud-native by default. No more crossed fingers.

For local development, we found a better roommate

Onboarding new developers had become a nightmare of outdated README files and environment-specific quirks. For local development, we needed something that just worked, every time, on every machine. We found our perfect new roommate in Dev Containers.

Now, we ship a pre-configured development environment right inside the repository. A developer opens the project in VS Code, it spins up the container, and they’re ready to go.

Here’s the simple recipe in .devcontainer/devcontainer.json:

{
  "name": "Node.js & PostgreSQL",
  "dockerComposeFile": "docker-compose.yml", // Yes, we still use it here, but just for this!
  "service": "app",
  "workspaceFolder": "/workspace",

  // Forward the ports you need
  "forwardPorts": [3000, 5432],

  // Run commands after the container is created
  "postCreateCommand": "npm install",

  // Add VS Code extensions (nested under "customizations" in the current spec)
  "customizations": {
    "vscode": {
      "extensions": [
        "dbaeumer.vscode-eslint",
        "esbenp.prettier-vscode"
      ]
    }
  }
}

It’s fast, it’s reproducible, and our onboarding docs have been reduced to: “1. Install Docker. 2. Open in VS Code.”

To speak every Cloud language, we hired a translator

As our ambitions grew, we needed to manage resources across different cloud providers without learning a new dialect for each one. Crossplane became our universal translator. It lets us manage our infrastructure, whether it’s on AWS, GCP, or Azure, using the language we already speak fluently: the Kubernetes API.

Want a managed database in AWS? You don’t write Terraform. You write a Kubernetes manifest.

# rds-instance.yaml
# The exact API group and field names depend on which Crossplane AWS provider
# (classic or Upbound) you have installed; check the CRDs in your cluster.
apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: my-production-db
spec:
  forProvider:
    region: eu-west-1
    instanceClass: db.t3.small
    masterUsername: admin
    allocatedStorage: 20
    engine: postgres
    engineVersion: "14.5"
    skipFinalSnapshot: true
    # Reference to a secret for the password
    masterPasswordSecretRef:
      namespace: crossplane-system
      name: my-db-password
      key: password
  providerConfigRef:
    name: aws-provider-config

It’s declarative, auditable, and fits perfectly into a GitOps workflow.

For the creative grind, we got a better workflow

The constant cycle of code, build, push, deploy, test, repeat for our microservices was soul-crushing. Docker Compose never did this well. We needed something that could keep up with our creative flow. Skaffold gave us the instant gratification we craved.

One command, skaffold dev, and suddenly we had:

  • Live code syncing to our development cluster.
  • Automatic container rebuilds and redeployments when files change.
  • A unified configuration for both development and production pipelines.

No more editing three different files and praying. Just code.
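
For reference, the whole workflow hangs off one small config file. A minimal sketch, reusing the image name from the CDK example above; the manifest path is an assumption, and the schema version should match your installed Skaffold release:

# skaffold.yaml (minimal sketch)
apiVersion: skaffold/v2beta29   # match this to your installed Skaffold release
kind: Config
metadata:
  name: awesome-api
build:
  artifacts:
    - image: your-org/your-awesome-api
deploy:
  kubectl:
    manifests:
      - k8s/*.yaml              # wherever your Kubernetes manifests live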

The slow fade was inevitable

Docker Compose was a fantastic tool for a simpler time. It was perfect when our team was small, our application was a monolith, and “production” was just a slightly more powerful laptop.

But the world of software development has moved on. We now live in an era of distributed systems, cloud-native architecture, and relentless automation. We didn’t just stop using Docker Compose. We outgrew it. And we replaced it with tools that weren’t just built for the present, but are ready for the future.