Platform Engineering

GCP services DevOps engineers rely on

I have spent the better part of three years wrestling with Google Cloud Platform, and I am still not entirely convinced it wasn’t designed by a group of very clever people who occasionally enjoy a quiet laugh at the rest of us. The thing about GCP, you see, is that it works beautifully right up until the moment it doesn’t. Then it fails with such spectacular and Byzantine complexity that you find yourself questioning not just your career choices but the fundamental nature of causality itself.

My first encounter with Cloud Build was typical of this experience. I had been tasked with setting up a CI/CD pipeline for a microservices architecture, which is the modern equivalent of being told to build a Swiss watch while someone steadily drops marbles on your head. Jenkins had been our previous solution, a venerable old thing that huffed and puffed like a steam locomotive and required more maintenance than a Victorian greenhouse. Cloud Build promised to handle everything serverlessly, which is a word that sounds like it ought to mean something, but in practice simply indicates you won’t know where your code is running and you certainly won’t be able to SSH into it when things go wrong.

The miracle, when it arrived, was decidedly understated. I pushed some poorly written Go code to a repository and watched as Cloud Build sprang into life like a sleeper agent receiving instructions. It ran my tests, built a container, scanned it for vulnerabilities, and pushed it to storage. The whole process took four minutes and cost less than a cup of tea. I sat there in my home office, the triumph slowly dawning, feeling rather like a man who has accidentally trained his cat to make coffee. I had done almost nothing, yet everything had happened. This is the essential GCP magic, and it is deeply unnerving.
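
For the curious, the whole ceremony fits in a cloudbuild.yaml shorter than my weekly shopping list. The sketch below is illustrative rather than a transcript of our actual pipeline; the Go version, the service name, and the Artifact Registry path are all invented, but the shape is what Cloud Build expects, and the vulnerability scan happens on its own once the image lands in a registry with scanning enabled.

steps:
  # Run the tests first; if these fail, nothing is built and nothing is pushed.
  - name: golang:1.22
    entrypoint: go
    args: ["test", "./..."]

  # Build the container image, tagged with the commit SHA.
  - name: gcr.io/cloud-builders/docker
    args: ["build", "-t", "europe-west3-docker.pkg.dev/$PROJECT_ID/apps/my-service:$SHORT_SHA", "."]

# Images listed here are pushed to the registry once all steps succeed.
images:
  - europe-west3-docker.pkg.dev/$PROJECT_ID/apps/my-service:$SHORT_SHA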

The vulnerability scanner is particularly wonderful in that quietly horrifying way. It examines your containers and produces a list of everything that could possibly go wrong, like a pilot’s pre-flight checklist written by a paranoid witchfinder general. On one memorable occasion, it flagged a critical vulnerability in a library I wasn’t even aware we were using. It turned out to be nested seven dependencies deep, like a Russian doll of potential misery. Fixing it required updating something else, which broke something else, which eventually led me to discover that our entire authentication layer was held together by a library last maintained in 2018 by someone who had subsequently moved to a commune in Oregon. The scanner was right, of course. It always is. It is the most anxious and accurate employee you will ever meet.

Google Kubernetes Engine, or how I learned to stop worrying and love the cluster

If Cloud Build is the efficient butler, GKE is the robot overlord you find yourself oddly grateful to. My initial experience with Kubernetes was self-managed, which taught me many things, primarily that I do not have the temperament to manage Kubernetes. I spent weeks tuning etcd, debugging network overlays, and developing what I can only describe as a personal relationship with a persistent volume that refused to mount. It was less a technical exercise and more a form of digitally enhanced psychotherapy.

GKE’s Autopilot mode sidesteps all this by simply making the nodes disappear. You do not manage nodes. You do not upgrade nodes. You do not even, strictly speaking, know where the nodes are. They exist in the same conceptual space as socks that vanish from laundry cycles. You request resources, and they materialise, like summoning a very specific and obliging genie. The first time I enabled Autopilot, I felt I was cheating somehow, as if I had been given the answers to an exam I had not revised for.

The real genius is Workload Identity, a feature that allows pods to access Google services without storing secrets. Before this, secret management was a dark art involving base64 encoding and whispered incantations. We kept our API keys in Kubernetes secrets, which is rather like keeping your house keys under the doormat and hoping burglars are too polite to look there. Workload Identity removes all this by using magic, or possibly certificates, which are essentially the same thing in cloud computing. I demonstrated it to our security team, and their reaction was instructive. They smiled, which security people never do, and then they asked me to prove it was actually secure, which took another three days and several diagrams involving stick figures.
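
The mechanics behind the magic are almost insultingly small. Below is a sketch of the Kubernetes half of the arrangement, with invented names and a made-up project; the other half is an IAM binding granting the Google service account roles/iam.workloadIdentityUser for this Kubernetes service account, after which the pods simply have credentials.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api            # hypothetical workload
  namespace: prod
  annotations:
    # Pods running as this Kubernetes service account receive short-lived
    # credentials for the mapped Google service account. No secrets stored anywhere.
    iam.gke.io/gcp-service-account: payments-api@my-project.iam.gserviceaccount.com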

Istio integration completes the picture, though calling it integration suggests a gentle handshake when it is more like being embraced by a very enthusiastic octopus. It gives you observability, security, and traffic management at the cost of considerable complexity and a mild feeling that you have lost control of your own architecture. Our first Istio deployment doubled our pod count and introduced latency that made our application feel like it was wading through treacle. Tuning it took weeks and required someone with a master’s degree in distributed systems and the patience of a saint. When it finally worked, it was magnificent. Requests flowed like water, security policies enforced themselves with silent efficiency, and I felt like a man who had tamed a tiger through sheer persistence and a lot of treats.
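
For the record, the traffic management that eventually justified the octopus boils down to resources like the sketch below: a weighted split between two versions of a service. The names are invented, and a real setup also needs a DestinationRule defining the v1 and v2 subsets, but this is the shape of the thing.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-api
spec:
  hosts:
    - payments-api
  http:
    - route:
        # Send most traffic to the stable version and a trickle to the new one.
        - destination:
            host: payments-api
            subset: v1
          weight: 90
        - destination:
            host: payments-api
            subset: v2
          weight: 10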

Cloud Deploy and the gentle art of not breaking everything

Progressive delivery sounds like something a management consultant would propose during a particularly expensive lunch, but Cloud Deploy makes it almost sensible. The service orchestrates rollouts across environments with strategies like canary and blue-green, which are named after birds and colours because naming things is hard, and DevOps engineers have a certain whimsical desperation about them.

My first successful canary deployment felt like performing surgery on a patient who was also the anaesthetist. We routed 5 percent of traffic to the new version and watched our metrics like nervous parents at a school play. When errors spiked, I expected a frantic rollback procedure involving SSH and tarballs. Instead, I clicked a button, and everything reverted in thirty seconds. The old version simply reappeared, fully formed, like a magic trick performed by someone who actually understands magic. I walked around the office for the rest of the day with what my colleagues described as a smug grin, though I prefer to think of it as the justified expression of someone who has witnessed a minor miracle.
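
The configuration behind that button is a delivery pipeline definition. The sketch below is simplified and every name in it is invented; a real pipeline also needs matching Target resources, and the canary strategy has to be told how traffic is split, here via plain Kubernetes Service networking rather than a service mesh.

apiVersion: deploy.cloud.google.com/v1
kind: DeliveryPipeline
metadata:
  name: my-service-pipeline
serialPipeline:
  stages:
    - targetId: staging
    - targetId: production
      strategy:
        canary:
          runtimeConfig:
            kubernetes:
              serviceNetworking:
                service: my-service
                deployment: my-service
          canaryDeployment:
            # Start at 5 percent and widen only if the metrics behave themselves.
            percentages: [5, 25, 50]
            verify: false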

The integration with Cloud Build creates a pipeline so smooth it is almost suspicious. Code commits trigger builds, builds trigger deployments, deployments trigger monitoring alerts, and alerts trigger automated rollbacks. It is a closed loop, a perpetual motion machine of software delivery. I once watched this entire chain execute while I was making a sandwich. By the time I had finished my ham and pickle on rye, a critical bug had been introduced, detected, and removed from production without any human intervention. I was simultaneously impressed and vaguely concerned about my own obsolescence.

Artifact Registry, where containers go to mature

Storing artifacts used to involve a self-hosted Nexus repository that required weekly sacrifices of disk space and RAM. Artifact Registry is Google’s answer to this, a fully managed service that stores Docker images, Helm charts, and language packages with the solemnity of a wine cellar for code.

The vulnerability scanning here is particularly thorough, examining every layer of your container with the obsessive attention of someone who alphabetises their spice rack. It once flagged a high-severity issue in a base image we had been using for six months. The vulnerability allowed arbitrary code execution, which is the digital equivalent of leaving your front door open with a sign saying “Free laptops inside.” We had to rebuild and redeploy forty services in two days. The scanner, naturally, had known about this all along but had been politely waiting for us to notice.

Geo-replication is another feature that seems obvious until you need it. Our New Zealand team was pulling images from a European registry, which meant every deployment involved sending gigabytes of data halfway around the world. This worked about as well as shouting instructions across a rugby field during a storm. Moving to a regional registry in New Zealand cut our deployment times by half and our egress fees by a third. It also taught me that cloud networking operates on principles that are part physics, part economics, and part black magic.

Cloud Operations Suite, or how I learned to love the machine that watches me

Observability in GCP is orchestrated by the Cloud Operations Suite, formerly known as Stackdriver. The rebranding was presumably because Stackdriver sounded too much like a dating app for developers, which is a missed opportunity if you ask me.

The suite unifies logs, metrics, traces, and dashboards into a single interface that is both comprehensive and bewildering. The first time I opened Cloud Monitoring, I was presented with more graphs than a hedge fund’s annual report. CPU, memory, network throughput, disk IOPS, custom metrics, uptime checks, and SLO burn rates. It was beautiful and terrifying, like watching the inner workings of a living organism that you have created but do not fully understand.

Setting up SLOs felt like writing a promise to my future self. “I, a DevOps engineer of sound mind, do hereby commit to maintaining 99.9 percent availability.” The system then watches your service like a particularly judgmental deity and alerts you the moment you transgress. I once received a burn rate alert at 2 AM because a pod had been slightly slow for ten minutes. I lay in bed, staring at my phone, wondering whether to fix it or simply accept that perfection was unattainable and go back to sleep. I fixed it, of course. We always do.

The integration with BigQuery for long-term analysis is where things get properly clever. We export all our logs and run SQL queries to find patterns. This is essentially data archaeology, sifting through digital sediment to understand why something broke three weeks ago. I discovered that our highest error rates always occurred on Tuesdays between 2 and 3 PM. Why? A scheduled job that had been deprecated but never removed, running on a server everyone had forgotten about. Finding it felt like discovering a Roman coin in your garden, exciting but also slightly embarrassing that you hadn’t noticed it before.

Cloud Monitoring and Logging, the digital equivalent of a nervous system

Cloud Logging centralises petabytes of data from services that generate logs with the enthusiasm of a teenager documenting their lunch. Querying this data feels like using a search engine that actually works, which is disconcerting when you’re used to grep and prayer.

I once spent an afternoon tracking down a memory leak using Cloud Profiler, a service that shows you exactly where your code is being wasteful with RAM. It highlighted a function that was allocating memory like a government department allocates paper clips, with cheerful abandon and no regard for consequences. The function turned out to be logging entire database responses for debugging purposes, in production, for six months. We had archived more debug data than actual business data. The developer responsible, when confronted, simply shrugged and said it had seemed like a good idea at the time. This is the eternal DevOps tragedy. Everything seems like a good idea at the time.

Uptime checks are another small miracle. We have probes hitting our endpoints from locations around the world, like a global network of extremely polite bouncers constantly asking, “Are you open?” When Mumbai couldn’t reach our service but London could, it led us to discover a regional DNS issue that would have taken days to diagnose otherwise. The probes had saved us, and they had done so without complaining once, which is more than can be said for the on-call engineer who had to explain it to management at 6 AM.

Cloud Functions and Cloud Run, where code goes to hide

Serverless computing in GCP comes in two flavours. Cloud Functions are for small, event-driven scripts, like having a very eager intern who only works when you clap. Cloud Run is for containerised applications that scale to zero, which is an economical way of saying they disappear when nobody needs them and materialise when they do, like an introverted ghost.

I use Cloud Functions for automation tasks that would otherwise require cron jobs on a VM that someone has to maintain. One function resizes GKE clusters based on Cloud Monitoring alerts. When CPU utilisation exceeds 80 percent for five minutes, the function spins up additional nodes. When it drops below 20 percent, it scales down. This is brilliant until you realise you’ve created a feedback loop and the cluster is now oscillating between one node and one hundred nodes every ten minutes. Tuning the thresholds took longer than writing the function, which is the serverless way.

Cloud Run hosts our internal tools, the dashboards, and debug interfaces that developers need but nobody wants to provision infrastructure for. Deploying is gloriously simple. You push a container, it runs. The cold start time is sub-second, which means Google has solved a problem that Lambda users have been complaining about for years, presumably by bargaining with physics itself. I once deployed a debugging tool during an incident response. It was live before the engineer who requested it had finished describing what they needed. Their expression was that of someone who had asked for a coffee and been given a flying saucer.
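
Under the hood a Cloud Run service is just a small manifest, applied with gcloud run services replace or skipped entirely in favour of gcloud run deploy. The sketch below uses the Knative-style format Cloud Run accepts; the service name, image path, and scaling cap are all invented.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: debug-dashboard          # hypothetical internal tool
spec:
  template:
    metadata:
      annotations:
        # Cap the instance count so an internal tool cannot scale itself into a bill.
        autoscaling.knative.dev/maxScale: "3"
    spec:
      containers:
        - image: europe-west3-docker.pkg.dev/my-project/apps/debug-dashboard:latest
          resources:
            limits:
              memory: 256Mi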

Terraform and Cloud Deployment Manager, arguing with machines about infrastructure

Infrastructure as Code is the principle that you should be able to rebuild your entire environment from a text file, which is lovely in theory and slightly terrifying in practice. Terraform, using the GCP provider, is the de facto standard. It is also a source of endless frustration and occasional joy.

The state file is the heart of the problem. It is a JSON representation of your infrastructure that Terraform keeps in Cloud Storage, and it is the single source of truth until someone deletes it by accident, at which point the truth becomes rather more philosophical. We lock the state during applies, which prevents conflicts but also means that if an apply hangs, everyone is blocked. I have spent afternoons staring at a terminal, watching Terraform ponder the nature of a load balancer, like a stoned philosophy student contemplating a spoon.

Deployment Manager is Google’s native IaC tool, which uses YAML and is therefore slightly less powerful but considerably easier to read. I use it for simple projects where Terraform would be like using a sledgehammer to crack a nut, if the sledgehammer required you to understand graph theory. The two tools coexist uneasily, like cats who tolerate each other for the sake of the humans.

Drift detection is where things get properly philosophical. Terraform tells you when reality has diverged from your code, which happens more often than you’d think. Someone clicks something in the console, a service account is modified, a firewall rule is added for “just a quick test.” The plan output shows these changes like a disappointed teacher marking homework in red pen. You can either apply the correction or accept that your infrastructure has developed a life of its own and is now making decisions independently. Sometimes I let the drift stand, just to see what happens. This is how accidents become features.

IAM and Cloud Asset Inventory, the endless game of who can do what

Identity and Access Management in GCP is both comprehensive and maddening. Every API call is authenticated and authorised, which is excellent for security but means you spend half your life granting permissions to service accounts. A service account, for the uninitiated, is a machine pretending to be a person so it can ask Google for things. They are like employees who never take a holiday but also never buy you a birthday card.

Workload Identity Federation allows these synthetic employees to impersonate each other across clouds, which is identity management crossed with method acting. We use it to let our AWS workloads access GCP resources, a process that feels rather like introducing two friends who are suspicious of each other and speak different languages. When it works, it is seamless. When it fails, the error messages are so cryptic they may as well be in Linear A.

Cloud Asset Inventory catalogs every resource in your organisation, which is invaluable for audits and deeply unsettling when you realise just how many things you’ve created and forgotten about. I once ran a report and discovered seventeen unused load balancers, three buckets full of logs from a project that ended in 2023, and a Cloud SQL instance that had been running for six months with no connections. The bill was modest, but the sense of waste was profound. I felt like a hoarder being confronted with their own clutter.

For European enterprises, the GDPR compliance features are critical. We export audit logs to BigQuery and run queries to prove data residency. The auditors, when they arrived, were suspicious of everything, which is their job. They asked for proof that data never left the europe-west3 region. I showed them VPC Service Controls, which are like digital border guards that shoot packets trying to cross geographical boundaries. They seemed satisfied, though one of them asked me to explain Kubernetes, and I saw his eyes glaze over in the first thirty seconds. Some concepts are simply too abstract for mortal minds.

Eventarc and Cloud Scheduler, the nervous system of the cloud

Eventarc routes events from over 100 sources to your serverless functions, creating event-driven architectures that are both elegant and impossible to debug. An event is a notification that something happened, somewhere, and now something else should happen somewhere else. It is causality at a distance, action at a remove.

I have an Eventarc trigger that fires when a vulnerability is found, sending a message to Pub/Sub, which fans out to multiple subscribers. One subscriber posts to Slack, another creates a ticket, and a third quarantines the image. It is a beautiful, asynchronous ballet that I cannot fully trace. When it fails, it fails silently, like a mime having a heart attack. The dead-letter queue catches the casualties, which I check weekly like a coroner reviewing unexplained deaths.

Cloud Scheduler handles cron jobs, which are the digital equivalent of remembering to take the bins out. We have schedules that scale down non-production environments at night, saving money and carbon. I once set the timezone incorrectly and scaled down the production cluster at midday. The outage lasted three minutes, but the shame lasted considerably longer. The team now calls me “the scheduler whisperer,” which is not the compliment it sounds like.

The real power comes from chaining these services. A Monitoring alert triggers Eventarc, which invokes a Cloud Function, which checks something via Scheduler, which then triggers another function to remediate. It is a Rube Goldberg machine built of code, more complex than it needs to be, but weirdly satisfying when it works. I have built systems that heal themselves, which is either the pinnacle of DevOps achievement or the first step towards Skynet. I prefer to think it is the former.

The map we all pretend to understand

Every DevOps journey, no matter how anecdotal, eventually requires what consultants call a “high-level architecture overview” and what I call a desperate attempt to comprehend the incomprehensible. During my second year on GCP, I created exactly such a diagram to explain to our CFO why we were spending $47,000 a month on something called “Cross-Regional Egress.” The CFO remained unmoved, but the diagram became my Rosetta Stone for navigating the platform’s ten core services.

I’ve reproduced it here partly because I spent three entire afternoons aligning boxes in Lucidchart, and partly because even the most narrative-driven among us occasionally needs to see the forest’s edge while wandering through the trees. Consider it the technical appendix you can safely ignore, unless you’re the poor soul actually implementing any of this.

There it is, in all its tabular glory. Five rows that represent roughly fifteen thousand hours of human effort, and at least three separate incidents involving accidentally deleted production namespaces. The arrows are neat and tidy, which is more than can be said for any actual implementation.

I keep a laminated copy taped to my monitor, not because I consult it (I have the contents memorised, along with the scars that accompany each service), but because it serves as a reminder that even the most chaotic systems can be reduced to something that looks orderly on PowerPoint. The real magic lives in the gaps between those tidy boxes, where service accounts mysteriously expire, where network policies behave like quantum particles, and where the monthly bill arrives with numbers that seem generated by a random number generator with a grudge.

A modest proposal for surviving GCP

That table represents the map. What follows is the territory, with all its muddy bootprints and unexpected cliffs.

After three years, I have learned that the best DevOps engineers are not the ones with the most certificates. They are the ones who have learned to read the runes, who know which logs matter and which can be ignored, who have developed an intuitive sense for when a deployment is about to fail and can smell a misconfigured IAM binding at fifty paces. They are part sysadmin, part detective, part wizard.

The platform makes many things possible, but it does not make them easy. It is infrastructure for grown-ups, which is to say it trusts you to make expensive mistakes and learn from them. My advice is to start small, automate everything, and keep a sense of humour. You will need it the first time you accidentally delete a production bucket and discover that the undo button is marked “open a support ticket and wait.”

Store your manifests in Git and let Cloud Deploy handle the applying. Define SLOs and let the machines judge you. Tag resources for cost allocation and prepare to be horrified by the results. Replicate artifacts across regions because the internet is not as reliable as we pretend. And above all, remember that the cloud is not magic. It is simply other people’s computers running other people’s code, orchestrated by APIs that are occasionally documented and frequently misunderstood.

We build on these foundations because they let us move faster, scale further, and sleep slightly better at night, knowing that somewhere in a data centre in Belgium, a robot is watching our servers and will wake us only if things get truly interesting.

That is the theory, anyway. In practice, I still keep my phone on loud, just in case.

Kubernetes, the toxic coworker my team couldn’t fire

The Slack notification arrived with the heavy, damp enthusiasm of a wet dog jumping into your lap while you are wearing a tuxedo. It was late on a Thursday, the specific hour when ambitious caffeine consumption turns into existential regret, and the message was brief.

“I don’t think I can do this anymore. Not the coding. The infrastructure. I’m out.”

This wasn’t a junior developer overwhelmed by the concept of recursion. This was my lead backend engineer. A human Swiss Army knife who had spent nine years navigating the dark alleys of distributed systems and could stare down a production outage with the heart rate of a sleeping tortoise. He wasn’t leaving because of burnout from long hours, or an equity dispute, or even because someone microwaved fish in the breakroom.

He was leaving because of Kubernetes.

Specifically, he was leaving because the tool we had adopted to “simplify” our lives had slowly morphed into a second, unpaid job that required the patience of a saint and the forensic skills of a crime scene investigator. We had turned his daily routine of shipping features into a high-stakes game of Operation where touching the wrong YAML indentation caused the digital equivalent of a sewer backup.

It was a wake-up call that hit me harder than the realization that the Tupperware at the back of my fridge has evolved its own civilization. We treat Kubernetes like a badge of honor, a maturity medal we pin to our chests. But the dirty secret everyone is too polite to whisper at conferences is that we have invited a chaotic, high-maintenance tyrant into our homes and given it the master bedroom.

When the orchestrator becomes a lifestyle disease

We tend to talk about “cognitive load” in engineering with the same sterile detachment we use to discuss disk space or latency. It sounds clean. Manageable. But in practice, the cognitive load imposed by a raw, unabstracted Kubernetes setup is less like a hard drive filling up and more like trying to cook a five-course gourmet meal while a badger is gnawing on your ankle.

The promise was seductive. We were told that Kubernetes would be the universal adapter for the cloud. It would be the operating system of the internet. And in a way, it is. But it is an operating system that requires you to assemble the kernel by hand every morning before you can open your web browser.

My star engineer didn’t want to leave. He just wanted to write code that solved business problems. Instead, he found himself spending 40% of his week debugging ingress controllers that behaved like moody teenagers (silent, sullen, and refusing to do what they were told) and wrestling with pod eviction policies that seemed based on the whim of a vengeful god rather than logic.

We had fallen into the classic trap of Resume Driven Development. We handed application developers the keys to the cluster and told them they were now “DevOps empowered.” In reality, this is like handing a teenager the keys to a nuclear submarine because they once successfully drove a golf cart. It doesn’t empower them. It terrifies them.

(And let’s be honest, most backend developers look at a Kubernetes manifest with the same mix of confusion and horror that I feel when looking at my own tax returns.)

The archaeological dig of institutional knowledge

The problem with complexity is that it rarely announces itself with a marching band. It accumulates silently, like dust bunnies under a bed, or plaque in an artery.

When we audited our setup after the resignation, we found that our cluster had become a museum of good intentions gone wrong. We found Helm charts that were so customized they effectively constituted a new, undocumented programming language. We found sidecar containers attached to pods for reasons nobody could remember, sucking up resources like barnacles on the hull of a ship, serving no purpose other than to make the diagrams look impressive.

This is what I call “Institutional Knowledge Debt.” It represents the sort of fungal growth that occurs when you let complexity run wild. You know it is there, evolving its own ecosystem, but as long as you don’t look at it directly, you don’t have to acknowledge that it might be sentient.

The “Bus Factor” in our team (the number of people who can get hit by a bus before the project collapses) had reached a terrifying number: one. And that one person had just quit. We had built a system where deploying a hotfix required a level of tribal knowledge usually reserved for initiating members into a secret society.

YAML is just a ransom note with better indentation

If you want to understand why developers hate modern infrastructure, look no further than the file format we use to define it. YAML.

We found files in our repository that were less like configuration instructions and more like love letters written by a stalker: intense, repetitive, and terrifyingly vague about their actual intentions.

The fragility of it is almost impressive. A single misplaced space, a tab character where a space should be, or a dash that looked at you the wrong way, and the entire production environment simply decides to take the day off. It is absurd that in an era of AI assistants and quantum computing, our billion-dollar industries hinge on whether a human being pressed the spacebar two times or four times.

Debugging these files is not engineering. It is hermeneutics. It is reading tea leaves. You stare at the CrashLoopBackOff error message, which is the system’s way of saying “I am unhappy, but I will not tell you why,” and you start making sacrifices to the gods of indentation.

My engineer didn’t hate the logic. He hated the medium. He hated that his intellect was being wasted on the digital equivalent of untangling Christmas lights.

We built a platform to stop the bleeding

The solution to this mess was not to hire “better” engineers who memorized the entire Kubernetes API documentation. That is a strategy akin to buying larger pants instead of going on a diet. It accommodates the problem, but it doesn’t solve it.

We had to perform an exorcism. But not a dramatic one with spinning heads. A boring, bureaucratic one.

We embraced Platform Engineering. Now, that is a buzzword that usually makes my eyes roll back into my head so far I can see my own frontal lobe, but in this case, it was the only way out. We decided to treat the platform as a product and our developers as the customers, customers who are easily confused and frighten easily.

We took the sharp objects away.

We built “Golden Paths.” In plain English, this means we created templates that work. If a developer wants to deploy a microservice, they don’t need to write a 400-line YAML manifesto. They fill out a short form: What is it called? How much memory does it need? Who do we call if it breaks?
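
To make that concrete, here is a sketch of what one of those filled-out forms looks like once it lands in a repository. Every field name is hypothetical, invented for illustration; the point is what is absent. No manifesto, no affinity rules, no ingress annotations.

# Hypothetical golden-path input: the only file the product team ever touches.
service:
  name: checkout-api
  owner: team-payments
  oncall: "#payments-oncall"
resources:
  memory: 512Mi
  cpu: 250m
# Everything else (probes, networking, disruption budgets) comes from the platform template.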

We hid the Kubernetes API behind a curtain. We stopped asking application developers to care about PodDisruptionBudgets or AffinityRules. Asking a Java developer to configure node affinity is like asking a passenger on an airplane to help calibrate the landing gear. It is not their job, and if they are doing it, something has gone terribly wrong.

Boring is the only metric that matters

After three months of stripping away the complexity, something strange happened. The silence.

The Slack channel dedicated to deployment support, previously a scrolling wall of panic and “why is my pod pending?” screenshots, went quiet. Deployments became boring.

And let me tell you, in the world of infrastructure, boring is the new sexy. Boring means things work. Boring means I can sleep through the night without my phone buzzing across the nightstand like an angry hornet.

Kubernetes is a marvel of engineering. It is powerful, scalable, and robust. But it is also a dense, hostile environment for humans. It is an industrial-grade tool. You don’t put an industrial lathe in your home kitchen to slice carrots, and you shouldn’t force every developer to operate a raw Kubernetes cluster just to serve a web page.

If you are hiring brilliant engineers, you are paying for their ability to solve logic puzzles and build features. If you force them to spend half their week fighting with infrastructure, you are effectively paying a surgeon to mop the hospital floors.

So look at your team. Look at their eyes. If they look tired, not from the joy of creation but from the fatigue of fighting their own tools, you might have a problem. That star engineer isn’t planning their next feature. They are drafting their resignation letter, and it probably won’t be written in YAML.

Docker didn’t die, it just moved to your laptop

Docker used to be the answer you gave when someone asked, “How do we ship this thing?” Now it’s more often the answer to a different question, “How do I run this thing locally without turning my laptop into a science fair project?”

That shift is not a tragedy. It’s not even a breakup. It’s more like Docker moved out of the busy downtown apartment called “production” and into a cozy suburb called “developer experience”, where the lawns are tidy, the tools are friendly, and nobody panics if you restart everything three times before lunch.

This article is about what changed, why it changed, and why Docker is still very much worth knowing, even if your production clusters rarely whisper its name anymore.

What we mean when we say Docker

One reason this topic gets messy is that “Docker” is a single word used to describe several different things, and those things have very different jobs.

  • Docker Desktop is the product that many developers actually interact with day to day, especially on macOS and Windows.
  • Docker Engine and the Docker daemon are the background machinery that runs containers on a host.
  • The Docker CLI and Dockerfile workflow are the human-friendly interface and the packaging format that people have built habits around.

When someone says “Docker is dying,” they usually mean “Docker Engine is no longer the default runtime in production platforms.” When someone says “Docker is everywhere,” they often mean “Docker Desktop and Dockerfile workflows are still the easiest way to get a containerized dev environment running quickly.”

Both statements can be true at the same time, which is annoying, because humans prefer their opinions to come in single-serving packages.

Docker’s rise and the good kind of magic

Docker didn’t become popular because it invented containers. Containers existed before Docker. Docker became popular because it made containers feel approachable.

It offered a developer experience that felt like a small miracle:

  • You could build images with a straightforward command.
  • You could run containers without a small dissertation on Linux namespaces.
  • You could push to registries and share a runnable artifact.
  • You could spin up multi-service environments with Docker Compose.

Docker took something that used to feel like “advanced systems programming” and turned it into “a thing you can demo on a Tuesday.”

If you were around for the era of XAMPP, WAMP, and “download this zip file, then pray,” Docker felt like a modern version of that, except it didn’t break as soon as you looked at it funny.

The plot twist in production

Here is the part where the story becomes less romantic.

Production infrastructure grew up.

Not emotionally, obviously. Infrastructure does not have feelings. It has outages. But it did mature in a very specific way: platforms started to standardize around container runtimes and interfaces that did not require Docker’s full bundled experience.

Docker was the friendly all-in-one kitchen appliance. Production systems wanted an industrial kitchen with separate appliances, separate controls, and fewer surprises.

Three forces accelerated the shift.

Licensing concerns changed the mood

Docker Desktop licensing changes made a lot of companies pause, not because engineers suddenly hated Docker, but because legal teams developed a new hobby.

The typical sequence went like this:

  1. Someone in finance asked, “How many Docker Desktop users do we have?”
  2. Someone in legal asked, “What exactly are we paying for?”
  3. Someone in infrastructure said, “We can probably do this with Podman or nerdctl.”

A tool can survive engineers complaining about it. Engineers complain about everything. The real danger is when procurement turns your favorite tool into a spreadsheet with a red cell.

The result was predictable: even developers who loved Docker started exploring alternatives, if only to reduce risk and friction.

The runtime world standardized without Docker

Modern container platforms increasingly rely on runtimes like containerd and interfaces like the Container Runtime Interface (CRI).

Kubernetes is a key example. Kubernetes removed the direct Docker integration path that many people depended on in earlier years, and the ecosystem moved toward CRI-native runtimes. The point was not to “ban Docker.” The point was to standardize around an interface designed specifically for orchestrators.

This is a subtle but important difference.

  • Docker is a complete experience: build, run, network, UX, opinions included.
  • Orchestrators prefer modular components, and they want to speak to a runtime through a stable interface.

The practical effect is what most teams feel today:

  • In many Kubernetes environments, the runtime is containerd, not Docker Engine.
  • Managed platforms such as ECS Fargate and other orchestrated services often run containers without involving Docker at all.

Docker, the daemon, became optional.

Security teams like control, and they do not like surprises

Security teams do not wake up in the morning and ask, “How can I ruin a developer’s day?” They wake up and ask, “How can I make sure the host does not become a piñata full of root access?”

Docker can be perfectly secure when used well. The problem is that it can also be spectacularly insecure when used casually.

Two recurring issues show up in real organizations:

  • The Docker socket is powerful. Expose it carelessly, and you are effectively offering a fast lane to host-level control.
  • The classic pattern of “just give developers sudo docker” can become a horror story with a polite ticket number.

Tools and workflows that separate concerns tend to make security people calmer.

  • Build tools such as BuildKit and buildah isolate image creation.
  • Rootless approaches, where feasible, reduce blast radius.
  • Runtime components can be locked down and audited more granularly.

This is not about blaming Docker. It’s about organizations preferring a setup where the sharp knives are stored in a drawer, not taped to the ceiling.

What Docker is now

Docker’s new role is less “the thing that runs production” and more “the thing that makes local development less painful.”

And that role is huge.

Docker still shines in areas where convenience matters most:

  • Local development environments
  • Quick reproducible demos
  • Multi-service stacks on a laptop
  • Cross-platform consistency on macOS, Windows, and Linux
  • Teams that need a simple standard for “how do I run this?”

If your job is to onboard new engineers quickly, Docker is still one of the best ways to avoid the dreaded onboarding ritual where a senior engineer says, “It works on my machine,” and the junior engineer quietly wonders if their machine has offended someone.

A small example that still earns its keep

Here is a minimal Docker Compose stack that demonstrates why Docker remains lovable for local development.

services:
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://postgres:example@db:5432/app
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
      POSTGRES_DB: app
    ports:
      - "5432:5432"

This is not sophisticated. That is the point. It is the “plug it in and it works” power that made Docker famous.

Dockerfile is not the Docker daemon

This is where the confusion often peaks.

A Dockerfile is a packaging recipe. It is widely used. It remains a de facto standard, even when the runtime or build system is not Docker.

Many teams still write Dockerfiles, but build them using tooling that does not rely on the Docker daemon on the CI runner.

Here is a BuildKit example that builds and pushes an image without treating the Docker daemon as a requirement.

buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/app:latest,push=true

You can read this as “Dockerfile lives on, but Docker-as-a-daemon is no longer the main character.”

This separation matters because it changes how you design CI.

  • You can build images in environments where running a privileged Docker daemon is undesirable.
  • You can use builders that integrate better with Kubernetes or cloud-native pipelines.
  • You can reduce the amount of host-level power you hand out just to produce an artifact.

What replaced Docker in production pipelines

When teams say they are moving away from Docker in production, they rarely mean “we stopped using containers.” They mean the tooling around building and running containers is shifting.

Common patterns include:

  • containerd as the runtime in Kubernetes and other orchestrated environments
  • BuildKit for efficient builds and caching
  • kaniko for building images inside Kubernetes without a Docker daemon
  • ko for building and publishing Go applications as images without a Dockerfile
  • Buildpacks or Nixpacks for turning source code into runnable images using standardized build logic
  • Dagger and similar tools for defining CI pipelines that treat builds as portable graphs of steps

You do not need to use all of these. You just need to understand the trend.

Production platforms want:

  • Standard interfaces
  • Smaller, auditable components
  • Reduced privilege
  • Reproducible builds

Docker can participate in that world, but it no longer owns the whole stage.

A Kubernetes-friendly image build example

If you want a concrete example of the “no Docker daemon” approach, kaniko is a popular choice in cluster-native pipelines.

apiVersion: batch/v1
kind: Job
metadata:
  name: build-image-kaniko
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            - "--dockerfile=Dockerfile"
            - "--context=dir:///workspace"
            - "--destination=registry.example.com/app:latest"
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          emptyDir: {}

This is intentionally simplified. In a real setup, you would bring your own workspace, your own auth mechanism, and your own caching strategy. But even in this small example, the idea is visible: build the image where it makes sense, without turning every CI runner into a tiny Docker host.

The practical takeaway for architects and platform teams

If you are designing platforms, the question is not “Should we ban Docker?” The question is “Where does Docker add value, and where does it create unnecessary coupling?”

A simple mental model helps.

  • Developer laptops benefit from a friendly tool that makes local environments predictable.
  • CI systems benefit from builder choices that reduce privilege and improve caching.
  • Production runtimes benefit from standardized interfaces and minimal moving parts.

Docker tends to dominate the first category, participates in the second, and is increasingly optional in the third.

If your team still uses Docker Engine on production hosts, that is not automatically wrong. It might be perfectly fine. The important thing is that you are doing it intentionally, not because “that’s how we’ve always done it.”

Why this is actually a success story

There is a temptation in tech to treat every shift as a funeral.

But Docker moving toward local development is not a collapse. It is a sign that the ecosystem absorbed Docker’s best ideas and made them normal.

The standardization of OCI images, the popularity of Dockerfile workflows, and the expectations around reproducible environments, all of that is Docker’s legacy living in the walls.

Docker is still the tool you reach for when you want to:

  • start fast
  • teach someone new
  • run a realistic stack on a laptop
  • avoid spending your afternoon installing the same dependencies in three different ways

That is not “less important.” That is foundational.

If anything, Docker’s new role resembles a very specific kind of modern utility.

It is like Visual Studio Code.

Everyone uses it. Everyone argues about it. It is not what you deploy to production, but it is the thing that makes building and testing your work feel sane.

Docker didn’t die.

It just moved to your laptop, brought snacks, and quietly let production run the serious machinery without demanding to be invited to every meeting.

Let IAM handle the secrets you can avoid

There are two kinds of secrets in cloud security.

The first kind is the legitimate kind: a third-party API token, a password for something you do not control, a certificate you cannot simply wish into existence.

The second kind is the kind we invent because we are in a hurry: long-lived access keys, copied into a config file, then copied into a Docker image, then copied into a ticket, then copied into the attacker’s weekend plans.

This article is about refusing to participate in that second category.

Not because secrets are evil. Because static credentials are the “spare house key under the flowerpot” of AWS. Convenient, popular, and a little too generous with access for something that can be photographed.

The goal is not “no secrets exist.” The goal is no secrets live in code, in images, or in long-lived credentials.

If you do that, your security posture stops depending on perfect human behavior, which is great because humans are famously inconsistent. (We cannot all be trusted with a jar of cookies, and we definitely cannot all be trusted with production AWS keys.)

Why this works in real life

AWS already has a mechanism designed to prevent your applications from holding permanent credentials: IAM roles and temporary credentials (STS).

When your Lambda runs with an execution role, AWS hands it short-lived credentials automatically. They rotate on their own. There is nothing to copy, nothing to stash, nothing to rotate in a spreadsheet named FINAL-final-rotation-plan.xlsx.

What remains are the unavoidable secrets, usually tied to systems outside AWS. For those, you store them in AWS Secrets Manager and retrieve them at runtime. Not at build time. Not at deploy time. Not by pasting them into an environment variable and calling it “secure” because you used uppercase letters.

This gives you a practical split:

  • Avoidable secrets are replaced by IAM roles and temporary credentials
  • Unavoidable secrets go into Secrets Manager, encrypted and tightly scoped

The architecture in one picture

A simple flow to keep in mind:

  1. A Lambda function runs with an IAM execution role
  2. The function fetches one third-party API key from Secrets Manager at runtime
  3. The function calls the third-party API and writes results to DynamoDB
  4. Network access to Secrets Manager stays private through a VPC interface endpoint (when the Lambda runs in a VPC)

The best part is what you do not see.

No access keys. No “temporary” keys that have been temporary since 2021. No secrets baked into ZIPs or container layers.

What this protects you from

This pattern is not a magic spell. It is a seatbelt.

It helps reduce the chance of:

  • Credentials leaking through Git history, build logs, tickets, screenshots, or well-meaning copy-paste
  • Forgotten key rotation schedules that quietly become “never.”
  • Overpowered policies that turn a small bug into a full account cleanup
  • Unnecessary public internet paths for sensitive AWS API calls

Now let’s build it, step by step, with code snippets that are intentionally sanitized.

Step 1: Build an IAM execution role with tight policies

The execution role is the front door key your Lambda carries.

If you give it access to everything, it will eventually use that access, if only because your future self will forget why it was there and leave it in place “just in case.”

Keep it boring. Keep it small.

Here is an example IAM policy for a Lambda that only needs to:

  • write to one DynamoDB table
  • read one secret from Secrets Manager
  • decrypt using one KMS key (optional, depending on how you configure encryption)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteToOneTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:111122223333:table/app-results-prod"
    },
    {
      "Sid": "ReadOneSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:thirdparty/weather-api-key-*"
    },
    {
      "Sid": "DecryptOnlyThatKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:eu-west-1:111122223333:key/12345678-90ab-cdef-1234-567890abcdef",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "secretsmanager.eu-west-1.amazonaws.com"
        }
      }
    }
  ]
}

A few notes that save you from future regret:

  • The secret ARN ends with -* because Secrets Manager appends a random suffix.
  • The KMS condition helps ensure the key is used only through Secrets Manager, not as a general-purpose decryption service.
  • You can skip the explicit kms:Decrypt statement if you use the AWS-managed key and accept the default behavior, but customer-managed keys are common in regulated environments.

Step 2: Store the unavoidable secret properly

Secrets Manager is not a place to dump everything. It is a place to store what you truly cannot avoid.

A third-party API key is a perfect example because IAM cannot replace it. AWS cannot assume a role in someone else’s SaaS.

Use a JSON secret so you can extend it later without creating a new secret every time you add a field.

{
  "api_key": "REDACTED-EXAMPLE-TOKEN"
}

If you like the CLI (and I do, because buttons are too easy to misclick), create the secret like this:

aws secretsmanager create-secret \
  --name "thirdparty/weather-api-key" \
  --description "Token for the Weatherly API used by the ingestion Lambda" \
  --secret-string '{"api_key":"REDACTED-EXAMPLE-TOKEN"}' \
  --region eu-west-1

Then configure:

  • encryption with a customer-managed KMS key if required
  • rotation if the provider supports it (rotation is amazing when it is real, and decorative when the vendor does not allow it)

If the vendor does not support rotation, you still benefit from central storage, access control, audit logging, and removing the secret from code.

Step 3: Lock down secret access with a resource policy

Identity-based policies on the Lambda role are necessary, but resource policies are a nice extra lock.

Think of it like this: your role policy is the key. The resource policy is the bouncer who checks the wristband.

Here is a resource policy that allows only one role to read the secret.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyIngestionRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    },
    {
      "Sid": "DenyEverythingElse",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
        }
      }
    }
  ]
}

This is intentionally strict. Strict is good. Strict is how you avoid writing apology emails.

Step 4: Keep Secrets Manager traffic private with a VPC endpoint

If your Lambda runs inside a VPC, it will not automatically have internet access. That is often the point.

In that case, you do not want the function reaching Secrets Manager through a NAT gateway if you can avoid it. NAT works, but it is like walking your valuables through a crowded shopping mall because the back door is locked.

Use an interface VPC endpoint for Secrets Manager.

Here is a Terraform example (sanitized) that creates the endpoint and limits access using a dedicated security group.

resource "aws_security_group" "secrets_endpoint_sg" {
  name        = "secrets-endpoint-sg"
  description = "Allow HTTPS from Lambda to Secrets Manager endpoint"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.eu-west-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.secrets_endpoint_sg.id]
}

If your Lambda is not in a VPC, you do not need this step. The function will reach Secrets Manager over AWS’s managed network path by default.

If you want to go further, consider adding a DynamoDB gateway endpoint too, so your function can write to DynamoDB without touching the public internet.

Step 5: Retrieve the secret at runtime without turning logs into a confession

This is where many teams accidentally reinvent the problem.

They remove the secret from the code, then log it. Or they put it in an environment variable because “it is not in the repository,” which is a bit like saying “the spare key is not under the flowerpot, it is under the welcome mat.”

The clean approach is:

  • store only the secret name (not the secret value) as configuration
  • retrieve the value at runtime
  • cache it briefly to reduce calls and latency
  • never print it, even when debugging, especially when debugging

Here is a Python example for AWS Lambda with a tiny TTL cache.

import json
import os
import time
import boto3

_secrets_client = boto3.client("secretsmanager")
_cached_value = None
_cached_until = 0

SECRET_ID = os.getenv("THIRDPARTY_SECRET_ID", "thirdparty/weather-api-key")
CACHE_TTL_SECONDS = int(os.getenv("SECRET_CACHE_TTL_SECONDS", "300"))


def _get_api_key() -> str:
    global _cached_value, _cached_until

    now = int(time.time())
    if _cached_value and now < _cached_until:
        return _cached_value

    resp = _secrets_client.get_secret_value(SecretId=SECRET_ID)
    payload = json.loads(resp["SecretString"])

    api_key = payload["api_key"]
    _cached_value = api_key
    _cached_until = now + CACHE_TTL_SECONDS
    return api_key


def lambda_handler(event, context):
    api_key = _get_api_key()

    # Use the key without ever logging it
    results = call_weatherly_api(api_key=api_key, city=event.get("city", "Seville"))

    write_to_dynamodb(results)

    return {
        "status": "ok",
        "items": len(results) if hasattr(results, "__len__") else 1
    }

This snippet is intentionally short. The important part is the pattern:

  • minimal secret access
  • controlled cache
  • zero secret output

If you prefer a library, AWS provides a Secrets Manager caching client for some runtimes, and AWS Lambda Powertools can help with structured logging. Use them if they fit your stack.

Step 6: Make security noisy with logs and alarms

Security without visibility is just hope with a nicer font.

At a minimum:

  • enable CloudTrail in the account
  • ensure Secrets Manager events are captured
  • alert on unusual secret access patterns

A simple and practical approach is a CloudWatch metric filter for GetSecretValue events coming from unexpected principals. Another is to build a dashboard showing:

  • Lambda errors
  • Secrets Manager throttles
  • sudden spikes in secret reads

Here is a tiny Terraform example that keeps your Lambda logs from living forever (because storage is forever, but your attention span is not).

resource "aws_cloudwatch_log_group" "lambda_logs" {
  name              = "/aws/lambda/lambda-ingestion-prod"
  retention_in_days = 14
}

Also consider:

  • IAM Access Analyzer to spot risky resource policies
  • AWS Config rules or guardrails if your organization uses them
  • an alarm on unexpected NAT data processing if you intended to keep traffic private

Common mistakes I have made, so you do not have to

I am listing these because I have either done them personally or watched them happen in slow motion.

  1. Using a wildcard secret policy
    secretsmanager:GetSecretValue on * feels convenient until it is a breach multiplier.
  2. Putting secret values into environment variables
    Environment variables are not evil, but they are easy to leak through debugging, dumps, tooling, or careless logging. Store secret names there, not secret contents.
  3. Retrieving secrets at build time
    Build logs live forever in the places you forget to clean. Runtime retrieval keeps secrets out of build systems.
  4. Logging too much while debugging
    The fastest way to leak a secret is to print it “just once.” It will not be just once.
  5. Skipping the endpoint and relying on NAT by accident
    The NAT gateway is not evil either. It is just an expensive and unnecessary hallway if a private door exists.

A two minute checklist you can steal

  • Your Lambda uses an IAM execution role, not access keys
  • The role policy scopes Secrets Manager access to one secret ARN pattern
  • The secret has a resource policy that only allows the expected role
  • Secrets are encrypted with KMS when required
  • The secret value is never stored in code, images, build logs, or environment variables
  • If Lambda runs in a VPC, you use an interface VPC endpoint for Secrets Manager
  • You have CloudTrail enabled and you can answer “who accessed this secret” without guessing

Extra thoughts

If you remove long-lived credentials from your applications, you remove an entire class of problems.

You stop rotating keys that should never have existed in the first place.

You stop pretending that “we will remember to clean it up later” is a security strategy.

And you get a calmer life, which is underrated in engineering.

Let IAM handle the secrets you can avoid.

Then let Secrets Manager handle the secrets you cannot.

And let your code do what it was meant to do: process data, not babysit keys like they are a toddler holding a permanent marker.

Ingress and egress on EKS made understandable

Getting traffic in and out of a Kubernetes cluster isn’t a magic trick. It’s more like running the city’s most exclusive nightclub. It’s a world of logistics, velvet ropes, bouncers, and a few bureaucratic tollbooths on the way out. Once you figure out who’s working the front door and who’s stamping passports at the exit, the rest is just good manners.

Let’s take a quick tour of the establishment.

A ninety-second tour of the premises

There are really only two journeys you need to worry about in this club.

Getting In: A hopeful guest (the client) looks up the address (DNS), arrives at the front door, and is greeted by the head bouncer (Load Balancer). The bouncer checks the guest list and directs them to the right party room (Service), where they can finally meet up with their friend (the Pod).

Getting Out: One of our Pods needs to step out for some fresh air. It gets an escort from the building’s internal security (the Node’s ENI), follows the designated hallways (VPC routing), and is shown to the correct exit—be it the public taxi stand (NAT Gateway), a private car service (VPC Endpoint), or a connecting tunnel to another venue (Transit Gateway).

The secret sauce in EKS is that our Pods aren’t just faceless guests; the AWS VPC CNI gives them real VPC IP addresses. This means the building’s security rules (Security Groups, route tables, and NACLs) aren’t just theoretical policies. They are the very real guards and locked doors that decide whether a packet’s journey ends in success or a silent, unceremonious death.

Getting past the velvet rope

In Kubernetes, Ingress is the set of rules that governs the front door. But rules on paper are useless without someone to enforce them. That someone is a controller, a piece of software that translates your guest list into actual, physical bouncers in AWS.

The head of security for EKS is the AWS Load Balancer Controller. You hand it an Ingress manifest, and it sets up the door staff.

  • For your standard HTTP web traffic, it deploys an Application Load Balancer (ALB). Think of the ALB as a meticulous, sharp-dressed bouncer who doesn’t just check your name. It inspects your entire invitation (the HTTP request), looks at the specific event you’re trying to attend (/login or /api/v1), and only then directs you to the right room.
  • For less chatty protocols like raw TCP, UDP, or when you need sheer, brute-force throughput, it calls in a Network Load Balancer (NLB). The NLB is the big, silent type. It checks that you have a ticket and shoves you toward the main hall. It’s incredibly fast but doesn’t get involved in the details.

This whole operation can be made public or private. For internal-only events, the controller sets up an internal ALB or NLB and uses a private Route 53 zone, hiding the party from the public internet entirely.
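
Keeping the party private is mostly a one-annotation change. A fragment of the same kind of Ingress you will see in Recipe 1 below, with the bouncer facing inward:

# Fragment: identical Ingress, but the ALB stays off the public internet
metadata:
  annotations:
    alb.ingress.kubernetes.io/scheme: internal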

The modern VIP system

The classic Ingress system works, but it can feel a bit like managing your guest list with a stack of sticky notes. The rules for routing, TLS, and load balancer behavior are all crammed into a single resource, creating a glorious mess of annotations.

This is where the Gateway API comes in. It’s the successor to Ingress, designed by people who clearly got tired of deciphering annotation soup. Its genius lies in separating responsibilities.

  • The Platform team (the club owners) manages the Gateway. They decide where the entrances are, what protocols are allowed (HTTP, TCP), and handle the big-picture infrastructure like TLS certificates.
  • The Application teams (the party hosts) manage Routes (HTTPRoute, TCPRoute, etc.). They just point to an existing Gateway and define the rules for their specific application, like “send traffic for app.example.com/promo to my service.”

This creates a clean separation of duties, offers richer features for traffic management without resorting to custom annotations, and makes your setup far more portable across different environments.
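
To make the split concrete, here is a minimal HTTPRoute sketch. The Gateway name, its namespace, and the backing service are placeholders for whatever your platform team actually exposes.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: promo-route
spec:
  # Attach to a Gateway owned by the platform team (placeholder names)
  parentRefs:
    - name: shared-gateway
      namespace: gateway-infra
  hostnames:
    - "app.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /promo
      backendRefs:
        - name: promo-service
          port: 8080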

The art of the graceful exit

So, your Pods are happily running inside the club. But what happens when they need to call an external API, pull an image, or talk to a database? They need to get out. This is egress, and it’s mostly about navigating the building’s corridors and exits.

  • The public taxi stand: For general internet access from private subnets, Pods are sent to a NAT Gateway. It works, but it’s like a single, expensive taxi stand for the whole neighborhood. Every trip costs money, and if it gets too busy, you’ll see it on your bill. Pro tip: Put one NAT in each Availability Zone to avoid paying extra for your Pods to take a cross-town cab just to get to the taxi stand.
  • The private car service: When your Pods need to talk to other AWS services (like S3, ECR, or Secrets Manager), sending them through the public internet is a waste of time and money. Use VPC endpoints instead, as sketched just after this list. Think of this as a pre-booked black car service. It creates a private, secure tunnel directly from your VPC to the AWS service. It’s faster, cheaper, and the traffic never has to brave the public internet.
  • The diplomatic passport: The worst way to let Pods talk to AWS APIs is by attaching credentials to the node itself. That’s like giving every guest in the club a master key. Instead, we use IRSA (IAM Roles for Service Accounts). This elegantly binds an IAM role directly to a Pod’s service account. It’s the equivalent of issuing your Pod a diplomatic passport. It can present its credentials to AWS services with full authority, no shared keys required.
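
Booking the private car service is one Terraform resource per AWS service. A sketch for an ECR API interface endpoint, where the VPC, subnet, and security group references are placeholders for your own resources:

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id                   # placeholder reference
  service_name        = "com.amazonaws.eu-west-1.ecr.api" # adjust the region
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id          # placeholder reference
  security_group_ids  = [aws_security_group.endpoints.id] # placeholder reference
  private_dns_enabled = true
}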

Setting the house rules

By default, Kubernetes networking operates with the cheerful, chaotic optimism of a free-for-all music festival. Every Pod can talk to every other Pod. In production, this is not a feature; it’s a liability. You need to establish some house rules.

Your two main tools for this are Security Groups and NetworkPolicy.

Security Groups are your Pod’s personal bodyguards. They are stateful and wrap around the Pod’s network interface, meticulously checking every incoming and outgoing connection against a list you define. They are an AWS-native tool and very precise.

NetworkPolicy, on the other hand, is the club’s internal security team. The rules do nothing until something enforces them: in EKS that can be the VPC CNI’s built-in network policy support or a third-party firm like Calico or Cilium. Once enforcement is in place, you can create powerful rules like “Pods in the ‘database’ room can only accept connections from Pods in the ‘backend’ room on port 5432.”

The sanest approach is to start with a default deny policy. This is the bouncer’s universal motto: “If your name’s not on the list, you’re not getting in.” Block all egress by default, then explicitly allow only the connections your application truly needs.
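
In manifest form, the bouncer’s motto is short. A minimal default-deny sketch for a single namespace (the namespace name is a placeholder); everything after this is explicit allow rules:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: prod-apps   # placeholder namespace
spec:
  podSelector: {}        # selects every Pod in the namespace
  policyTypes:
    - Ingress
    - Egress             # no rules listed, so all ingress and egress is denied

In practice, the very next policy you write is usually “allow DNS egress,” or nothing in the namespace will be able to resolve a single name.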

A few recipes from the bartender

Full configurations are best kept in a Git repository, but here are a few cocktail recipes to show the key ingredients.

Recipe 1: Public HTTPS with a custom domain. This Ingress manifest tells the AWS Load Balancer Controller to set up a public-facing ALB, listen on port 443, use a specific TLS certificate from ACM, and route traffic for app.yourdomain.com to the webapp service.

# A modern Ingress for your web application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    # Set the bouncer to be public
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Talk to Pods directly for better performance
    alb.ingress.kubernetes.io/target-type: ip
    # Listen for secure traffic
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Here's the TLS certificate to wear
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id
spec:
  ingressClassName: alb
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 8080

Recipe 2: A diplomatic passport for S3 access. This gives our Pod a ServiceAccount annotated with an IAM role ARN. Any Pod that uses this service account can now talk to AWS APIs (like S3) with the permissions granted by that role, thanks to IRSA.

# The ServiceAccount with its IAM credentials
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  annotations:
    # This is the diplomatic passport: the ARN of the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EKS-S3-Reader-Role
---
# The Deployment that uses the passport
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
spec:
  replicas: 1
  selector:
    matchLabels: { app: reporter }
  template:
    metadata:
      labels: { app: reporter }
    spec:
      # Use the service account we defined above
      serviceAccountName: s3-reader-sa
      containers:
        - name: processor
          image: your-repo/report-generator:v1.5.2
          ports:
            - containerPort: 8080

A short closing worth remembering

When you boil it all down, Ingress is just the etiquette you enforce at the front door. Egress is the paperwork required for a clean exit. In EKS, the etiquette is defined by Kubernetes resources, while the paperwork is pure AWS networking. Neither one cares about your intentions unless you write them down clearly.

So, draw the path for traffic both ways, pick the right doors for the job, give your Pods a proper identity, and set the tolls where they make sense. If you do, the cluster will behave, the bill will behave, and your on-call shifts might just start tasting a lot more like sleep.

Stop building cathedrals in Terraform

It’s 9 AM on a Tuesday. You, a reasonably caffeinated engineer, open a pull request to add a single tag to an S3 bucket. A one-line change. You run terraform plan and watch in horror as your screen scrolls with a novel’s worth of green, yellow, and red text. Two hundred and seventeen resources want to be updated.

Welcome to a special kind of archaeological dig. Somewhere, buried three folders deep, a “reusable” module you haven’t touched in six months has decided to redecorate your entire production environment. The brochure promised elegance and standards. The reality is a Tuesday spent debugging, doing cardio, and praying to the Git gods.

Small teams, in particular, fall into this trap. You don’t need to build a glorious cathedral of abstractions just to hang a picture on the wall. You need a hammer, a nail, and enough daylight to see what you’re doing.

The allure of the perfect system

Let’s be honest, custom Terraform modules are seductive. They whisper sweet nothings in your ear about the gospel of DRY (Don’t Repeat Yourself). They promise a future where every resource is a perfect, standardized snowflake, lovingly stamped out from a single, blessed template. It’s the engineering equivalent of having a perfectly organized spice rack where all the labels face forward.

In theory, it’s beautiful. In practice, for a small, fast-moving team, it’s a tax. A heavy one. An indirection tax.

What starts as a neat wrapper today becomes a Matryoshka doll of complexity by next quarter. Inputs multiply. Defaults are buried deeper than state secrets. Soon, flipping a single boolean in a variables.tf file feels like rewiring a nuclear submarine with the lights off. The module is no longer serving you; you are now its humble servant.

It’s like buying one of those hyper-specific kitchen gadgets, like a banana slicer. Yes, it slices bananas. Perfectly. But now you own a piece of plastic whose only job is to do something a knife you already owned could do just fine. That universal S3 module you built is the junk drawer of your infrastructure. Sure, it holds everything, but now you have to rummage past a broken can opener and three instruction manuals just to find a spoon.

A heuristic for staying sane

So, what’s the alternative? Anarchy? Copy-pasting HCL like a digital barbarian? Yes. Sort of.

Here’s a simple, sanity-preserving heuristic:

Duplicate once without shame. Duplicate twice with comments. On the third time, and only then, consider extracting a module.

Until you hit that third, clear, undeniable repetition of a pattern, plain HCL is your best friend. It wins on speed, clarity, and keeping the blast radius of any change predictably small. You avoid abstracting a solution before you even fully understand the problem.

Let’s see it in action. You need a simple, private S3 bucket for your new service.

The cathedral-builder’s approach might look like this:

# service-alpha/main.tf

module "service_alpha_bucket" {
  source = "git::ssh://git@github.com/your-org/terraform-modules.git//s3/private-bucket?ref=v1.4.2"

  bucket_name       = "service-alpha-data-logs-2025"
  enable_versioning = true
  force_destroy     = false # Safety first!
  lifecycle_days    = 90
  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

It looks clean, but what happens when you need to add a specific replication rule? Or a weird CORS policy for a one-off integration? You’re off to another repository to wage war with the module’s maintainer (who is probably you, from six months ago).

Now, the boring, sane, ship-it-today approach:

# service-alpha/main.tf

resource "aws_s3_bucket" "data_bucket" {
  bucket = "service-alpha-data-logs-2025"

  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

resource "aws_s3_bucket_versioning" "data_bucket_versioning" {
  bucket = aws_s3_bucket.data_bucket.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_bucket_lifecycle" {
  bucket = aws_s3_bucket.data_bucket.id

  rule {
    id     = "log-expiration"
    status = "Enabled"

    # Apply the rule to every object in the bucket
    filter {}

    expiration {
      days = 90
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_bucket_access" {
  bucket                  = aws_s3_bucket.data_bucket.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Is it more lines of code? Yes. Is it gloriously, beautifully, and unapologetically obvious? Absolutely. You can read it, understand it, and change it in thirty seconds. No context switching. No spelunking through another codebase. Just a bucket, doing bucket things.

Where a module is not a swear word

Okay, I’m not a total monster. Modules have their place. They are the right tool when you are building the foundations, not the furniture.

A module earns its keep when it defines a stable, slow-moving, and genuinely complex pattern that you truly want to be identical everywhere. Think of it like the plumbing and electrical wiring of a house. You don’t reinvent it for every room.

Good candidates for a module include:

  • VPC and core networking: The highway system of your cloud. Build it once, build it well, and then leave it alone.
  • Kubernetes cluster baselines: The core EKS/GKE/AKS setup, IAM roles, and node group configurations.
  • Security and telemetry agents: The non-negotiable stuff that absolutely must run on every single instance.
  • IAM roles for CI/CD: A standardized way for your deployment pipeline to get the permissions it needs.

The key difference? These things change on a scale of months or years, not days or weeks.

Your escape plan from module purgatory

What if you’re reading this and nodding along in despair, already trapped in a gilded cage of your own abstractions? Don’t panic. There’s a way out, and it doesn’t require a six-month migration project.

  • Freeze everything: First, go to every service that uses the problematic module and pin the version number (ref=v1.4.2). No more floating on main. You’ve just stopped the bleeding.
  • Take inventory: In one service, run terraform state list to see the exact resources managed by the module.
  • Perform the adoption: This is the magic trick. Write the plain HCL code for those resources directly in your service’s configuration. Then, tell Terraform that the old resource (inside the module) and your new resource (the plain HCL) are actually the same thing. You do this with a moved block or the terraform state mv command, as shown below.

Let’s say your module created a bucket. The state address is module.service_alpha_bucket.aws_s3_bucket.this[0]. Your new plain resource is aws_s3_bucket.data_bucket.

You would run:

terraform state mv 'module.service_alpha_bucket.aws_s3_bucket.this[0]' aws_s3_bucket.data_bucket
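
If you would rather record the move in code than mutate state by hand, a moved block (Terraform 1.1+) does the same thing declaratively, using the same example addresses:

moved {
  from = module.service_alpha_bucket.aws_s3_bucket.this[0]
  to   = aws_s3_bucket.data_bucket
}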

  • Verify and obliterate: Delete the module block, then run terraform plan. It should come back with “No changes. Your infrastructure matches the configuration.” The plan is clean. You are now free. Pop the champagne and submit your PR. Repeat for other services, one at a time. No heroics.

Fielding objections from the back row

When you propose this radical act of simplicity, someone will inevitably raise their hand.

  • “But we need standards!” You absolutely do. Standardize on things that matter: tags, naming conventions, and security policies. Enforce them with tools like tflint, checkov, and OPA/Gatekeeper. A linter yelling at you in a PR is infinitely better than a module silently deploying the wrong thing everywhere.
  • “What about junior developers? They need a paved road!” They do. A haunted mega-module with 50 input variables is not a paved road; it’s a labyrinth with a minotaur. A better “paved road” is a folder of well-documented, copy-pasteable examples of plain HCL for common tasks.
  • “Compliance will have questions!” Good. Let them. A tiny, focused, version-pinned module for your IAM boundary policy is a fantastic answer. A sprawling, do-everything wrapper module that changes every week is a compliance nightmare waiting to happen.

The gospel of ‘Good Enough’ for now

Stop trying to solve tomorrow’s problems today. That perfect, infinitely configurable abstraction you’re dreaming of is a solution in search of a problem you don’t have yet.

Don’t optimize for DRY. Optimize for change.

Small teams don’t need fewer lines of HCL; they need fewer places to look when something breaks at 3 PM on a Friday. They need clarity, not cleverness. Keep your power tools for the heavy-duty work. Save the cathedral for when you’ve actually founded a religion.

For now, ship the bucket, and go get lunch.

Your Kubernetes rollback is lying

The PagerDuty alert screams. The new release, born just minutes ago with such promising release notes, is coughing up blood in production. The team’s Slack channel is a frantic mess of flashing red emojis. Someone, summoning the voice of a panicked adult, yells the magic word: “ROLLBACK!”

And so, Helm, our trusty tow-truck operator, rides in with a smile, waving its friendly green check marks. The dashboards, those silent accomplices, beam with the serene glow of healthy metrics. Kubernetes probes, ever so polite, confirm that the resurrected pods are, in fact, “breathing.”

Then, production face-plants. Hard.

The feeling is like putting a cartoon-themed bandage on a burst water pipe and then wondering, with genuine surprise, why the living room has become a swimming pool. This article is the autopsy of those “perfect” rollbacks. We’re going to uncover why your monitoring is a pathological liar, how network traffic becomes a double agent, and what to do so that the next time Helm gives you a thumbs-up, you can actually believe it.

A state that refuses to time-travel

The first, most brutal lie a rollback tells you is that it can turn back time. A helm rollback is like the “rewind” button on an old VCR remote; it diligently rewinds the tape (your YAML manifests), but it has absolutely no power to make the actors on screen younger.

Your application’s state is one of those stubborn actors.

While your ConfigMaps and Secrets might dutifully revert to their previous versions, your data lives firmly in the present. If your new release included a database migration that added a column, rolling back the application code doesn’t magically make that column disappear. Now your old code is staring at a database schema from the future, utterly confused, like a medieval blacksmith being handed an iPad.

The same goes for PersistentVolumeClaims, external caches like Redis, or messages sitting in a Kafka queue. The rollback command whispers sweet nothings about returning to a “known good state,” but it’s only talking about itself. The rest of your universe has moved on, and it refuses to travel back with you.

The overly polite doorman

The second culprit in our investigation is the Kubernetes probe. Think of the readinessProbe as an overly polite doorman at a fancy party. Its job is to check if a guest (your pod) is ready to enter. But its definition of “ready” can be dangerously optimistic.

Many applications, especially those running on the JVM, have what we’ll call a “warming up” period. When a pod starts, the process is running, the HTTP port is open, and it will happily respond to a simple /health check. The doorman sees a guest in a tuxedo and says, “Looks good to me!” and opens the door.

What the doorman doesn’t see is that this guest is still stretching, yawning, and trying to remember where they are. The application’s caches are cold, its connection pools are empty, and its JIT compiler is just beginning to think about maybe, possibly, optimizing some code. The first few dozen requests it receives will be painfully slow or, worse, time out completely.

So while your readinessProbe is giving you a green light, your first wave of users is getting a face full of errors. For these sleepy applications, you need a more rigorous bouncer.

A startupProbe is that bouncer. It gives the app a generous amount of time to get its act together before even letting the doorman (readiness and liveness probes) start their shift.

# The startupProbe below gives our sleepy JVM app up to 5 minutes to wake up
# before the liveness and readiness probes start their shift.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  # Kubelet will try 30 times with a 10-second interval (300 seconds).
  # If the app isn't ready by then, the pod will be restarted.
  failureThreshold: 30
  periodSeconds: 10

Without it, your rollback creates a fleet of pods that are technically alive but functionally useless, and Kubernetes happily sends them a flood of unsuspecting users.

Traffic, the double agent

And that brings us to our final suspect: the network traffic itself. In a modern setup using a service mesh like Istio or Linkerd, traffic routing is a sophisticated dance. But even the most graceful dancer can trip.

When you roll back, a new ReplicaSet is created with the old pod specification. The service mesh sees these new pods starting up, asks the doorman (readinessProbe) if they’re good to go, gets an enthusiastic “yes!”, and immediately starts sending them a percentage of live production traffic.

This is where all our problems converge. Your service mesh, in its infinite efficiency, has just routed 50% of your user traffic to a platoon of sleepy, confused pods that are trying to talk to a database from the future.

Let’s look at the evidence. This VirtualService, which we now call “The 50/50 Disaster Splitter,” was routing traffic with criminal optimism.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 50 # 50% to the (theoretically) working pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 50 # 50% to the pods we just dragged from the past

The service mesh isn’t malicious. It’s just an incredibly efficient tool that is very good at following bad instructions. It sees a green light and hits the accelerator.

A survival guide that won’t betray you

So, you’re in the middle of a fire, and the “break glass in case of emergency” button is a lie. What do you do? You need a playbook that acknowledges reality.

Step 0: Breathe and isolate the blast radius

Before you even think about rolling back, stop the bleeding. The fastest way to do that is often at the traffic level. Use your service mesh or ingress controller to immediately shift 100% of traffic back to the last known good version. Don’t wait for new pods to start. This is a surgical move that takes seconds and gives you breathing room.
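
With the Istio VirtualService from earlier, that emergency shift is nothing more than rewriting the weights. A sketch of the relevant fragment:

# Fragment of the VirtualService spec: send everything to the last known good subset
http:
  - route:
      - destination:
          host: checkout-api-svc
          subset: v1-stable
        weight: 100
      - destination:
          host: checkout-api-svc
          subset: v1-rollback
        weight: 0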

Step 1: Declare an incident and gather the detectives

Get the right people on a call. Announce that this is not a “quick rollback” but an incident investigation. Your goal is to understand why the release failed, not just to hit the undo button.

Step 2: Perform the autopsy (while the system is stable)

With traffic safely routed away from the wreckage, you can now investigate. Check the logs of the failed pods. Look at the database. Is there a schema mismatch? A bad configuration? This is where you find the real killer.

Step 3: Plan the counter-offensive (which might not be a rollback)

Sometimes, the safest path forward is a roll forward. A small hotfix that corrects the issue might be faster and less risky than trying to force the old code to work with a new state. A rollback should be a deliberate, planned action, not a panic reflex. If you must roll back, do it with the knowledge you’ve gained from your investigation.

Step 4: The deliberate, cautious rollback

If you’ve determined a rollback is the only way, do it methodically.

  1. Scale down the broken deployment:
    kubectl scale deployment/checkout-api --replicas=0 -n prod-eu-central
  2. Execute the Helm rollback:
    helm rollback checkout-api 1 -n prod-eu-central
  3. Watch the new pods like a hawk: Monitor their logs and key metrics as they come up. Don’t trust the green check marks.
  4. Perform a Canary Release: Once the new pods look genuinely healthy, use your service mesh to send them 1% of the traffic. Then 10%. Then 50%. Then 100%. You are now in control, not the blind optimism of the automation.

The truth will set you free

A Kubernetes rollback isn’t a time machine. It’s a YAML editor with a fancy title. It doesn’t understand your data, it doesn’t appreciate your app’s need for a morning coffee, and it certainly doesn’t grasp the nuances of traffic routing under pressure.

Treating a rollback as a simple, safe undo button is the fastest way to turn a small incident into a full-blown outage. By understanding the lies it tells, you can build a process that trusts human investigation over deceptive green lights. So the next time a deployment goes sideways, don’t just reach for the rollback lever. Reach for your detective’s hat instead.

Terraform scales better without a centralized remote state

It’s 4:53 PM on a Friday. You’re pushing a one-line change to an IAM policy. A change so trivial, so utterly benign, that you barely give it a second thought. You run terraform apply, lean back in your chair, and dream of the weekend. Then, your terminal returns a greeting from the abyss: Error acquiring state lock.

Somewhere across the office, or perhaps across the country, a teammate has just started a plan for their own, seemingly innocuous change. You are now locked in a digital standoff. The weekend is officially on hold. Your shared Terraform state file, once a symbol of collaboration and a single source of truth, has become a temperamental roommate who insists on using the kitchen right when you need to make dinner. And they’re a very, very slow cook.

Our Terraform honeymoon phase

It wasn’t always like this. Most of us start our Terraform journey in a state of blissful simplicity. Remember those early days? A single, elegant main.tf file, a tidy remote backend in an S3 bucket, and a DynamoDB table to handle the locking. It was the infrastructure equivalent of a brand-new, minimalist apartment. Everything had its place. Deployments were clean, predictable, and frankly, a little bit boring.

Our setup looked something like this, a testament to a simpler time:

# in main.tf
terraform {
  backend "s3" {
    bucket         = "our-glorious-infra-state-prod"
    key            = "global/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock-prod"
    encrypt        = true
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  # ... and so on
}

It worked beautifully. Until it didn’t. The problem with minimalist apartments is that they don’t stay that way. You add a person, then another. You buy more furniture. Soon, you’re tripping over things, and that one clean kitchen becomes a chaotic battlefield of conflicting needs.

The kitchen gets crowded

As our team and infrastructure grew, our once-pristine state file started to resemble a chaotic shared kitchen during rush hour. The initial design, meant for a single chef, was now buckling under the pressure of a full restaurant staff.

The state lock standoff

The first and most obvious symptom was the state lock. It’s less of a technical “race condition” and more of a passive-aggressive duel between two colleagues who both need the only good frying pan at the exact same time. The result? Burnt food, frayed nerves, and a CI/CD pipeline that spends most of its time waiting in line.

The mystery of the shared spice rack

With everyone working out of the same state file, we lost any sense of ownership. It became a communal spice rack where anyone could move, borrow, or spill things. You’d reach for the salt (a production security group) only to find someone had replaced it with sugar (a temporary rule for a dev environment). Every Terraform apply felt like a gamble. You weren’t just deploying your change; you were implicitly signing off on the current, often mysterious, state of the entire kitchen.

The pre-apply prayer

This led to a pervasive culture of fear. Before running an apply, engineers would perform a ritualistic dance of checks, double-checks, and frantic Slack messages: “Hey, is anyone else touching prod right now?” The Terraform plan output would scroll for pages, a cryptic epic poem of changes, 95% of which had nothing to do with you. You’d squint at the screen, whispering a little prayer to the DevOps gods that you wouldn’t accidentally tear down the customer database because of a subtle dependency you missed.

The domino effect of a single spilled drink

Worst of all was the tight coupling. Our infrastructure became a house of cards. A team modifying a network ACL for their new microservice could unintentionally sever connectivity for a legacy monolith nobody had touched in years. It was the architectural equivalent of trying to change a lightbulb and accidentally causing the entire building’s plumbing to back up.

An uncomfortable truth appears

For a while, we blamed Terraform. We complained about its limitations, its verbosity, and its sharp edges. But eventually, we had to face an uncomfortable truth: the tool wasn’t the problem. We were. Our devotion to the cult of the single centralized state—the idea that one file to rule them all was the pinnacle of infrastructure management—had turned our single source of truth into a single point of failure.

The great state breakup

The solution was as terrifying as it was liberating: we had to break up with our monolithic state. It was time to move out of the chaotic shared house and give every team their own well-equipped studio apartment.

Giving everyone their own kitchenette

First, we dismantled the monolith. We broke our single Terraform configuration into dozens of smaller, isolated stacks. Each stack managed a specific component or application, like a VPC, a Kubernetes cluster, or a single microservice’s infrastructure. Each had its own state file.

Our directory structure transformed from a single folder into a federation of independent projects:

infra/
├── networking/
│   ├── vpc.tf
│   └── backend.tf      # Manages its own state for the VPC
├── databases/
│   ├── rds-main.tf
│   └── backend.tf      # Manages its own state for the primary RDS
└── services/
    ├── billing-api/
    │   ├── ecs-service.tf
    │   └── backend.tf  # Manages state for just the billing API
    └── auth-service/
        ├── iam-roles.tf
        └── backend.tf  # Manages state for just the auth service
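
Each backend.tf is just the familiar S3 backend pointing at its own key. A sketch for the billing API stack, reusing the bucket and lock table from before:

# services/billing-api/backend.tf
terraform {
  backend "s3" {
    bucket         = "our-glorious-infra-state-prod"
    key            = "services/billing-api/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock-prod"
    encrypt        = true
  }
}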

The state lock standoffs vanished overnight. Teams could work in parallel without tripping over each other. The blast radius of any change was now beautifully, reassuringly small.

Letting infrastructure live with its application

Next, we embraced GitOps patterns. Instead of a central infrastructure repository, we decided that infrastructure code should live with the application it supports. It just makes sense. The code for an API and the infrastructure it runs on are a tightly coupled couple; they should live in the same house. This meant code reviews for application features and infrastructure changes happened in the same pull request, by the same team.

Tasting the soup before serving it

Finally, we made surprises a thing of the past by validating plans before they ever reached the main branch. We set up simple CI workflows that would run a Terraform plan on every pull request. No more mystery meat deployments. The plan became a clear, concise contract of what was about to happen, reviewed and approved before merge.

A snippet from our GitHub Actions workflow looked like this:

name: 'Terraform Plan Validation'
on:
  pull_request:
    paths:
      - 'infra/**'
      - '.github/workflows/terraform-plan.yml'

jobs:
  plan:
    name: 'Terraform Plan'
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v3
      with:
        terraform_version: 1.5.0

    - name: Terraform Init
      # The runner needs credentials for the S3 state backend (for example via OIDC
      # or repository secrets) so the plan can compare against real state.
      run: terraform init -input=false

    - name: Terraform Plan
      run: terraform plan -no-color -input=false

Stories from the other side

This wasn’t just a theoretical exercise. A fintech firm we know split its monolithic repo into 47 micro-stacks. Their deployment speed shot up by 70%, not because they wrote code faster, but because they spent less time waiting and untangling conflicts. Another startup moved from a central Terraform setup to the AWS CDK (TypeScript), embedding infra in their app repos. They cut their time-to-deploy in half, freeing their SRE team from being gatekeepers and allowing them to become enablers.

Guardrails not gates

Terraform is still a phenomenally powerful tool. But the way we use it has to evolve. A centralized remote state, when not designed for scale, becomes a source of fragility, not strength. Just because you can put all your eggs in one basket doesn’t mean you should, especially when everyone on the team needs to carry that basket around.

The most scalable thing you can do is let teams build independently. Give them ownership, clear boundaries, and the tools to validate their work. Build guardrails to keep them safe, not gates to slow them down. Your Friday evenings will thank you for it.

Confessions of a recovering GitOps addict

There’s a moment in every tech trend’s lifecycle when the magic starts to wear off. It’s like realizing the artisanal, organic, free-range coffee you’ve been paying eight dollars for just tastes like… coffee. For me, and many others in the DevOps trenches, that moment has arrived for GitOps.

We once hailed it as the silver bullet, the grand unifier, the one true way. Now, I’m here to tell you that the romance is over. And something much more practical is taking its place.

The alluring promise of a perfect world

Let’s be honest, we all fell hard for GitOps. The promise was intoxicating. A single source of truth for our entire infrastructure, nestled right in the warm, familiar embrace of Git. Pull Requests became the sacred gates through which all changes must pass. CI/CD pipelines were our holy scrolls, and tools like ArgoCD and Flux were the messiahs delivering us from the chaos of manual deployments.

It was a world of perfect order. Every change was audited, every state was declared, and every rollback was just a git revert away. It felt clean. It felt right. It felt… professional. For a while, it was the hero we desperately needed.

The tyranny of the pull request

But paradise had a dark side, and it was paved with endless YAML files. The first sign of trouble wasn’t a catastrophic failure, but a slow, creeping bureaucracy that we had built for ourselves.

Need to update a single, tiny secret? Prepare for the ritual. First, the offering: a Pull Request. Then, the prayer for the high priests (your colleagues) to grant their blessing (the approval). Then, the sacrifice (the merge). And finally, the tense vigil, watching ArgoCD’s sync status like it’s a heart monitor, praying it doesn’t flatline.

The lag became a running joke. Your change is merged… but has it landed in production? Who knows! The sync bot seems to be having a bad day. When everything is on fire at 2 AM, Git is like that friend who proudly tells you, “Well, according to my notes, the plan was for there not to be a fire.” Thanks, Git. Your record of intent is fascinating, but I need a fire hose, not a historian.

We hit our wall during what should have been a routine update.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: auth-service
  template:
    metadata:
      labels:
        app: auth-service
    spec:
      containers:
      - name: auth-service-container
        image: our-app:v1.12.4
        envFrom:
        - secretRef:
            name: production-credentials

A simple change to the production-credentials secret required updating an encrypted file, PR-ing it, and then explaining in the commit message something like, “bumping secret hash for reasons”. Nobody understood it. Infrastructure changes started to require therapy sessions just to get merged.

And then, the tools fought back

When a system creates more friction than it removes, a rebellion is inevitable. And the rebels have arrived, not with pitchforks, but with smarter, more flexible tools.

First, the idea that developers should be fluent in YAML began to die. Internal Developer Platforms (IDPs) like Backstage and Port started giving developers what they always wanted: self-service with guardrails. Instead of wrestling with YAML syntax, they click a button in a portal to provision a database or spin up a new environment. Git becomes a log of what happened, not a bottleneck to make things happen.

Second, we remembered that pushing things can be good. The pull-based model was trendy, but let’s face it: push is immediate. Push is observable. We’ve gone back to CI pipelines pushing manifests directly into clusters, but this time they’re wearing body armor.

# This isn't your old wild-west kubectl apply
# It's a command wrapped in an approval system, with observability baked in.
deploy-cli --service auth-service --env production --approve

The change is triggered precisely when we want it, not when a bot feels like syncing.

Finally, we started asking a radical question: why are we describing infrastructure in a static markup language when we could be programming it? Tools like Pulumi and Crossplane entered the scene. Instead of hundreds of lines of YAML, we’re writing code that feels alive.

import * as aws from "@pulumi/aws";

// Create an S3 bucket with versioning enabled.
const bucket = new aws.s3.Bucket("user-uploads-bucket", {
    versioning: {
        enabled: true,
    },
    acl: "private",
});

Infrastructure can now react to events, be composed into reusable modules, and be written in a language with types and logic. YAML simply can’t compete with that.

A new role for the abdicated king

So, is GitOps dead? No, that’s just clickbait. But it has been demoted. It’s no longer the king ruling every action; it’s more like a constitutional monarch, a respected elder statesman.

It’s fantastic for auditing, for keeping a high-level record of intended state, and for infrastructure teams that thrive on rigid discipline. But for high-velocity product teams, it’s become a beautifully crafted anchor when what we need is a motor.

We’ve moved from “Let’s define everything in Git” to “Let’s ship faster, safer, and saner with the right tools for the job.”

Our current stack is a hybrid, a practical mix of the old and new:

  • Backstage to abstract away complexity for developers.
  • Push-based pipelines with strong guardrails for immediate, observable deployments.
  • Pulumi for typed, programmable, and composable infrastructure.
  • Minimal GitOps for what it does best: providing a clear, auditable trail of our intentions.

GitOps wasn’t a mistake; it was the strict but well-meaning grandparent of infrastructure management. It taught us discipline and the importance of getting approval before touching anything important. But now that we’re grown up, that level of supervision feels less like helpful guidance and more like having someone watch over your shoulder while you type, constantly asking, “Are you sure you want to save that file?” The world is moving on to flexibility, developer-first platforms, and code you can read without a decoder ring. If you’re still spending your nights appeasing the YAML gods with Pull Request sacrifices for trivial changes… you’re not just living in the past, you’re practically a fossil.

When docker compose stopped being magic

There was a time, not so long ago, when docker-compose up felt like performing a magic trick. You’d scribble a few arcane incantations into a YAML file and, poof, your entire development stack would spring to life. The database, the cache, your API, the frontend… all humming along obediently on localhost. Docker Compose wasn’t just a tool; it was the trusty Swiss Army knife in every developer’s pocket, the reliable friend who always had your back.

Until it didn’t.

Our breakup wasn’t a single, dramatic event. It was a slow fade, the kind of awkward drifting apart that happens when one friend grows and the other… well, the other is perfectly happy staying exactly where they are. It began with small annoyances, then grew into full-blown arguments. We eventually realized we were spending more time trying to fix our relationship with YAML than actually building things.

So, with a heavy heart and a sigh of relief, we finally said goodbye.

The cracks begin to show

As our team and infrastructure matured, our reliable friend started showing some deeply annoying habits. The magic tricks became frustratingly predictable failures.

  • Our services started giving each other the silent treatment. The networking between containers became as fragile and unpredictable as a Wi-Fi connection on a cross-country train. One moment they were chatting happily, the next they wouldn’t be caught dead in the same virtual network.
  • It was worse at keeping secrets than a gossip columnist. The lack of native, secure secret handling was, to put it mildly, a joke. We were practically writing passwords on sticky notes and hoping for the best.
  • It developed a severe case of multiple personality disorder. The same docker-compose.yml file would behave like a well-mannered gentleman on one developer’s machine, a rebellious teenager in staging, and a complete, raving lunatic in production. Consistency was not its strong suit.
  • The phrase “It works on my machine” became a ritualistic chant. We’d repeat it, hoping to appease the demo gods, but they are a fickle bunch and rarely listened. We needed reliability, not superstition.

We had to face the truth. Our old friend just couldn’t keep up.

Moving on to greener pastures

The final straw was the realization that we had become full-time YAML therapists. It was time to stop fixing and start building again. We didn’t just dump Compose; we replaced it, piece by piece, with tools that were actually designed for the world we live in now.

For real infrastructure, we chose real code

For our production and staging environments, we needed a serious, long-term commitment. We found it in the AWS Cloud Development Kit (CDK). Instead of vaguely describing our needs in YAML and hoping for the best, we started declaring our infrastructure with the full power and grace of TypeScript.

We went from a hopeful plea like this:

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - database
  database:
    image: "postgres:14-alpine"

To a confident, explicit declaration like this:

// lib/api-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

// ... inside your Stack class
// Look up an existing VPC (the default VPC here is just a stand-in for yours)
const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true });
const cluster = new ecs.Cluster(this, 'ApiCluster', { vpc });

// Create a load-balanced Fargate service and make it public
new ecs_patterns.ApplicationLoadBalancedFargateService(this, 'ApiService', {
  cluster: cluster,
  cpu: 256,
  memoryLimitMiB: 512,
  desiredCount: 2, // Let's have some redundancy
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry("your-org/your-awesome-api"),
    containerPort: 8080,
  },
  publicLoadBalancer: true,
});

It’s reusable, it’s testable, and it’s cloud-native by default. No more crossed fingers.

For local development, we found a better roommate

Onboarding new developers had become a nightmare of outdated README files and environment-specific quirks. For local development, we needed something that just worked, every time, on every machine. We found our perfect new roommate in Dev Containers.

Now, we ship a pre-configured development environment right inside the repository. A developer opens the project in VS Code, it spins up the container, and they’re ready to go.

Here’s the simple recipe in .devcontainer/devcontainer.json:

{
  "name": "Node.js & PostgreSQL",
  "dockerComposeFile": "docker-compose.yml", // Yes, we still use it here, but just for this!
  "service": "app",
  "workspaceFolder": "/workspace",

  // Forward the ports you need
  "forwardPorts": [3000, 5432],

  // Run commands after the container is created
  "postCreateCommand": "npm install",

  // Add VS Code extensions (nested under "customizations" in the current Dev Containers spec)
  "customizations": {
    "vscode": {
      "extensions": [
        "dbaeumer.vscode-eslint",
        "esbenp.prettier-vscode"
      ]
    }
  }
}

It’s fast, it’s reproducible, and our onboarding docs have been reduced to: “1. Install Docker. 2. Open in VS Code.”

To speak every Cloud language, we hired a translator

As our ambitions grew, we needed to manage resources across different cloud providers without learning a new dialect for each one. Crossplane became our universal translator. It lets us manage our infrastructure, whether it’s on AWS, GCP, or Azure, using the language we already speak fluently: the Kubernetes API.

Want a managed database in AWS? You don’t write Terraform. You write a Kubernetes manifest.

# rds-instance.yaml
apiVersion: database.aws.upbound.io/v1beta1
kind: RDSInstance
metadata:
  name: my-production-db
spec:
  forProvider:
    region: eu-west-1
    instanceClass: db.t3.small
    masterUsername: admin
    allocatedStorage: 20
    engine: postgres
    engineVersion: "14.5"
    skipFinalSnapshot: true
    # Reference to a secret for the password
    masterPasswordSecretRef:
      namespace: crossplane-system
      name: my-db-password
      key: password
  providerConfigRef:
    name: aws-provider-config

It’s declarative, auditable, and fits perfectly into a GitOps workflow.

For the creative grind, we got a better workflow

The constant cycle of code, build, push, deploy, test, repeat for our microservices was soul-crushing. Docker Compose never did this well. We needed something that could keep up with our creative flow. Skaffold gave us the instant gratification we craved.

One command, skaffold dev, and suddenly we had:

  • Live code syncing to our development cluster.
  • Automatic container rebuilds and redeployments when files change.
  • A unified configuration for both development and production pipelines.

No more editing three different files and praying. Just code.
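
The configuration behind that one command is small. A minimal skaffold.yaml sketch, where the image name, sync pattern, and manifest path are placeholders:

apiVersion: skaffold/v2beta29
kind: Config
build:
  artifacts:
    - image: your-org/your-awesome-api
      sync:
        manual:
          # Sync source changes straight into the running container
          - src: "src/**/*.ts"
            dest: .
deploy:
  kubectl:
    manifests:
      - k8s/*.yaml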

The slow fade was inevitable

Docker Compose was a fantastic tool for a simpler time. It was perfect when our team was small, our application was a monolith, and “production” was just a slightly more powerful laptop.

But the world of software development has moved on. We now live in an era of distributed systems, cloud-native architecture, and relentless automation. We didn’t just stop using Docker Compose. We outgrew it. And we replaced it with tools that weren’t just built for the present, but are ready for the future.