CloudArchitecture

What it really takes to run AI workloads on AWS

A surprising number of AI platforms begin life with a question that sounds reasonable in a standup and catastrophic in a postmortem, something along the lines of “Can we just stick a GPU behind an API?” You can. You probably shouldn’t. AI workloads are not ordinary web services wearing a thicker coat. They behave differently, fail differently, scale differently, and cost differently, and an architecture that ignores those differences will eventually let you know, usually on a Sunday.

This article is not about how to train a model. It is about building an AWS architecture that can host AI workloads safely, scale them reliably, and keep the monthly bill within shouting distance of the original estimate.

Why AI workloads change the architecture conversation

Treating an AI workload as “the same thing, but with bigger instances” is a classic and very expensive mistake. Inference latency matters in milliseconds. Accelerator choice (GPU, Trainium, Inferentia) affects both performance and invoice. Traffic spikes are unpredictable because humans, not schedulers, trigger them. Model lifecycle and data lineage become first-class design concerns. Governance stops being a compliance checkbox and becomes the seatbelt that keeps sensitive information from ending up inside a prompt log.

Put differently, AI adds several new axes of failure to the usual cloud architecture, and pretending otherwise is how teams rediscover the limits of their CloudWatch alerting at 3 am.

Start with the use case, not the model

Before anyone opens the Bedrock console, the first design decision should be the business problem. A chatbot for internal knowledge, a document summarization pipeline, a fraud detection scorer, and an image generation service have almost nothing in common architecturally, even if they all happen to involve transformer models.

From the use case, derive the architectural drivers (latency budget, throughput, data sensitivity, availability target, model accuracy requirements, cost ceiling). These drivers decide almost everything else. The opposite workflow, picking the infrastructure first and then seeing what it can do, is how you end up with a beautifully optimized cluster solving a problem nobody asked about.

Choosing your AI path on AWS

AWS offers several paths, and they are not interchangeable. A rough guide.

Amazon Bedrock is the right choice when you want managed foundation models, guardrails, agents, and knowledge bases without running the model infrastructure yourself. Good for teams that want to ship features, not operate GPUs.

Amazon SageMaker AI is the right choice when you need more control over training, deployment, pipelines, and MLOps. Good for teams with ML engineers who enjoy that sort of thing. Yes, they exist.

AWS accelerator-based infrastructure (Trainium, Inferentia2, SageMaker HyperPod) is the right choice when cost efficiency or raw performance at scale becomes the dominant constraint, typically for custom training or large-scale inference.

The common mistake here is picking the most powerful option by default. Bedrock with a sensible model is usually cheaper to operate than a custom SageMaker endpoint you forgot to scale down over Christmas.

The data foundation comes first

AI systems are a thin layer of cleverness on top of data. If the data layer is broken, the AI will be confidently wrong, which is worse than being uselessly wrong because people tend to believe it.

Answer the unglamorous questions first. Where does the data live? Who owns it? How fresh does it need to be? Who can see which parts of it? For generative AI workloads that use retrieval, add more questions. How are documents chunked? What embedding model is used? Which vector store? What metadata accompanies each chunk? How is the index refreshed when the source changes?
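Those chunking questions have concrete, testable answers. As an illustration only (the chunk size, overlap, and metadata fields below are arbitrary assumptions, and production systems usually count tokens rather than characters), a minimal fixed-size chunker with overlap might look like:

```python
def chunk_document(text, chunk_size=500, overlap=50):
    """Split text into overlapping fixed-size chunks with positional metadata.

    Sizes are in characters for simplicity; real pipelines count tokens.
    The 'source' field is a hypothetical identifier used for index refresh.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({
                "text": piece,
                "start": start,            # position, so changed sources can be re-synced
                "source": "policy-2019",   # hypothetical source document identifier
            })
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap exists so that a sentence straddling a chunk boundary still appears whole in at least one chunk; the positional metadata is what lets you re-index only the parts of a document that changed.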

A poor data foundation produces a poor AI experience, even when the underlying model is state of the art. Think of the model as a very articulate intern; it will faithfully report whatever you put in front of it, including the typo in the policy document from 2019.

Designing compute for reality, not for demos

Training and inference are not the same workload and should rarely share the same architecture. Training is bursty, expensive, and tolerant of scheduling. Inference is always-on, latency-sensitive, and intolerant of downtime. A single “AI cluster” that tries to do both tends to be bad at each.

For inference, focus on right-sizing, dynamic scaling, and high availability across AZs. For training, focus on ephemeral capacity, checkpointing, and data pipeline throughput. For serving large models, consider whether Bedrock’s managed endpoints remove enough operational burden to justify their pricing compared to self-hosted inference on EC2 or EKS with Inferentia2.

And please, autoscale. A fixed-size fleet of GPU instances running at 3% utilization is a monument to optimism.

Treating inference as a production workload

Many AI articles spend chapters on models and a paragraph on serving them, which is roughly the opposite of how the effort is distributed in real projects. Inference is where the workload meets reality, and reality brings concurrency, timeouts, thundering herds, and users who click the retry button like they are trying to start a stubborn lawnmower.

Plan for all of it. Set timeouts. Configure throttling and quotas. Add rate limiting at the edge. Use exponential backoff. Put circuit breakers between your application tier and your AI tier so a slow model does not take the whole product down. AWS explicitly recommends rate limiting and throttling as part of protecting generative AI systems from overload, and they recommend it because they have seen what happens without it.

Protecting inference is not mainly about safety. It is about surviving the traffic spike after your launch gets a mention somewhere popular.
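The protective layers above can be sketched in a few lines. This is a minimal illustration of the circuit-breaker and backoff ideas, not a production implementation; the thresholds and parameter names are arbitrary assumptions:

```python
import random
import time

class CircuitBreaker:
    """Open the circuit after consecutive failures; half-open after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Half-open: let a probe request through once the cooldown has passed.
        return now - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic() if now is None else now

def backoff_delays(base=0.5, cap=8.0, attempts=5):
    """Exponential backoff with full jitter, capped so retries never stampede."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

The application tier checks `allow()` before calling the AI tier; when the breaker is open, it fails fast or serves a fallback instead of queueing behind a slow model.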

Separating application, AI, and data responsibilities

A quietly important architectural point is that the AI tier should not share an account, an IAM boundary, or a blast radius with the application that calls it. AWS security guidance increasingly points toward separating the application account from the generative AI account. The reasoning is simple: the consequences of a mistake in prompt construction, data retrieval, or model output are different from the consequences of a mistake in, say, a shopping cart service, and they deserve different controls.

Think of it as the organizational version of not keeping your passport in the same drawer as your house keys. If one goes missing, the other is still where it should be.

Security and guardrails from day one

AI-specific controls sit on top of the usual cloud security hygiene (IAM least privilege, encryption at rest and in transit, VPC endpoints, logging, data classification). On top of that, you need approved model catalogues, so teams cannot quietly wire up any foundation model they saw on Hacker News; prompt governance, with templates, input validation, and logging policies that do not accidentally store sensitive data forever; output filtering, for harmful content, PII leakage, and jailbreak attempts; and clear data classification policies that decide which data is allowed to reach which model.

For Bedrock-based systems, Amazon Bedrock Guardrails offer configurable safeguards for harmful content and sensitive information. They are not magic, but they save a surprising amount of custom work, and custom work in this area tends to age badly.

Governance is not bureaucracy. Governance is what lets your AI feature get through a security review without being rewritten twice.

Protecting the retrieval layer when you use RAG

Retrieval-augmented generation is often described as “LLM plus documents”, which is technically true and practically misleading. A production RAG system involves ingestion pipelines, embedding generation, a vector store, metadata design, and ongoing synchronization with source systems. Each of those is a place where things can quietly go wrong.

One specific point is worth emphasizing. User identity must propagate to the retrieval layer. If Alice asks a question, the knowledge base should only return chunks Alice is allowed to see. AWS guidance recommends enforcing authorization through metadata filtering so users only get results they have access to. Without this, your RAG system will happily summarize the CFO’s compensation memo for the summer intern, which is the sort of thing that gets architectures shut down by email.
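At its core, the metadata filter is set intersection at query time. Here is a sketch under assumed field names (an allowed_groups list written onto each chunk at ingestion, and the caller's group memberships from the identity provider); in practice you push this filter into the vector store query itself rather than post-filtering results:

```python
def authorized_chunks(candidates, user_groups):
    """Keep only retrieval results the caller is entitled to see.

    Each candidate is assumed to carry an 'allowed_groups' metadata list
    written at ingestion time; user_groups comes from the caller's identity.
    """
    user_groups = set(user_groups)
    return [
        chunk for chunk in candidates
        if user_groups & set(chunk.get("allowed_groups", []))
    ]
```

The important property is where the identity comes from: it must be the authenticated caller's, propagated end to end, never a value the prompt itself can influence.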

Observability goes beyond CPU and memory

Traditional observability (CPU, memory, latency, error rates) is necessary but insufficient for AI workloads. For these systems, you also want to track model quality and drift over time; retrieval quality (are the right chunks being returned?); prompt behavior and common failure modes; token usage per request, per tenant, and per feature; latency per model, not just per service; and user feedback signals, with thumbs-up and thumbs-down being the cheapest useful telemetry ever invented.

Amazon Bedrock provides evaluation capabilities, and SageMaker Model Monitor covers drift and model quality in production. Use them. If you run your own inference, budget time for custom metrics, because the default dashboards will tell you the endpoint is healthy right up until users stop trusting its answers.
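If you do build custom metrics, token accounting is a good place to start. A hypothetical in-process tally is sketched below; the field names and per-1K-token prices are invented for illustration, not real Bedrock rates:

```python
from collections import defaultdict

class TokenMeter:
    """Tally token usage per (tenant, feature) so cost can be attributed later."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"input": 0, "output": 0, "requests": 0})

    def record(self, tenant, feature, input_tokens, output_tokens):
        bucket = self.usage[(tenant, feature)]
        bucket["input"] += input_tokens
        bucket["output"] += output_tokens
        bucket["requests"] += 1

    def top_spenders(self, n=3, input_price=0.003, output_price=0.015):
        """Rank (tenant, feature) pairs by blended cost; prices per 1K tokens are made up."""
        def cost(b):
            return (b["input"] * input_price + b["output"] * output_price) / 1000
        ranked = sorted(self.usage.items(), key=lambda kv: cost(kv[1]), reverse=True)
        return [(key, round(cost(b), 4)) for key, b in ranked[:n]]
```

In a real system these tallies would be emitted as CloudWatch metrics with tenant and feature dimensions, but the attribution logic is the same.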

AI operations is not a different discipline. It is mature operations thinking applied to a stack where “the service works” and “the service is useful” are two different statements.

Cost optimization belongs in the first draft

Cost should be a design constraint, not a debugging session six weeks after launch. The biggest levers, roughly in order of impact.

Model choice. Smaller models are cheaper and often good enough. Not every feature needs the largest frontier model in the catalogue.

Inference mode. Real-time endpoints, batch inference, serverless inference, and on-demand Bedrock invocations have wildly different cost profiles. Match the mode to the traffic pattern, not the other way around.

Autoscaling policy. Scale to zero where possible. Keep the minimum capacity honest.

Hardware choice. Inferentia2 and Trainium are positioned specifically for cost-effective ML deployment, and they often deliver on that positioning.

Batching. Batching inference requests can dramatically improve throughput per dollar for workloads that tolerate small latency increases.
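The batching trade-off is easy to sketch. The flush thresholds below are arbitrary assumptions; real serving stacks tune them per model, trading a small bounded wait for better throughput per dollar:

```python
class MicroBatcher:
    """Group incoming requests into batches, flushing on size or deadline."""

    def __init__(self, max_batch=8, max_wait=0.05):
        self.max_batch = max_batch      # flush when this many requests are queued
        self.max_wait = max_wait        # ...or when the oldest has waited this long
        self.pending = []
        self.oldest_arrival = None

    def add(self, request, now):
        """Enqueue a request; return a batch if one is ready to run, else None."""
        if self.oldest_arrival is None:
            self.oldest_arrival = now
        self.pending.append(request)
        return self._maybe_flush(now)

    def _maybe_flush(self, now):
        full = len(self.pending) >= self.max_batch
        stale = now - self.oldest_arrival >= self.max_wait
        if full or stale:
            batch, self.pending = self.pending, []
            self.oldest_arrival = None
            return batch
        return None
```

The `max_wait` parameter is exactly the "small latency increase" mentioned above: every request pays at most that much extra delay in exchange for the accelerator running fuller batches.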

A common failure mode is the impressive prototype with the terrifying monthly bill. Put cost targets in the design document next to the latency targets, and revisit both before go-live.

Close with an operating model, not just a diagram

An architecture diagram is the opening paragraph of the story, not the whole book. What makes an AI platform sustainable is the operating model around it (versioning, CI/CD or MLOps/LLMOps pipelines, evaluation suites, rollback strategy, incident response, and clear ownership between platform, data, security, and application teams).

AWS guidance for enterprise-ready generative AI consistently stresses repeatable patterns and standardized approaches, because that is what turns successful experiments into durable platforms rather than fragile demos held together by one engineer’s tribal knowledge.

What separates a platform from a demo

Preparing a cloud architecture for AI on AWS is not mainly about buying GPU capacity. It is about designing a platform where data, models, security, inference, observability, and cost controls work together from the start. The teams that do well with AI are not the ones with the biggest clusters; they are the ones who took the boring parts seriously before the interesting parts broke.

If your AI architecture is running quietly, scaling predictably, and costing roughly what you expected, congratulations, you have done something genuinely difficult, and nobody will notice. That is always how it goes.

Anatomy of an overworked Kubernetes operator called CoreDNS

Your newly spun-up frontend pod wakes up in the cluster with total amnesia. It has a job to do, but it has no idea where it is, who its neighbors are, or how to find the database it desperately needs to query. IP addresses in a Kubernetes cluster change as casually as socks in a gym locker room. Keeping track of them requires a level of bureaucratic masochism that no sane developer wants to deal with.

Someone has to manage this phonebook. That someone lives in a windowless sub-basement namespace called kube-system. It sits there, answering the same questions thousands of times a second, routing traffic, and receiving absolutely zero credit for any of it until something breaks.

That entity is CoreDNS. And this is an exploration of the thankless, absurd, and occasionally heroic life it leads inside your infrastructure.

The temperamental filing cabinet that came before

Before CoreDNS was given the keys to the kingdom in Kubernetes 1.13, the job belonged to kube-dns. It technically worked, much like a rusty fax machine transmits documents. But nobody was happy to see it. kube-dns was not a single, elegant program. It was a chaotic trench coat containing three different containers stacked on top of each other, all whispering to each other to resolve a single address.

CoreDNS replaced it because it is written in Go as a single, compiled binary. It is lighter, faster, and built around a modular plugin architecture. You can bolt on new behaviors to it, much like adding attachments to a vacuum cleaner. It is efficient, utterly devoid of joy, and built to survive the hostile environment of a modern microservices architecture.

Inside the passport control of resolv.conf

When a pod is born, the container runtime shoves a folded piece of paper into its pocket. This piece of paper is the /etc/resolv.conf file. It is the internal passport and instruction manual for how the pod should talk to the outside world.

If you were to exec into a standard web application pod and look at that slip of paper, you would see something resembling this:

search default.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5

At first glance, it looks harmless. The nameserver line simply tells the pod exactly where the CoreDNS operator is sitting. But the rest of this file is a recipe for a spectacular amount of wasted effort.

Look at the search directive. This is a list of default neighborhoods. In human terms, if you tell a courier to deliver a package to “Smith”, the courier does not just give up. The courier checks “Smith in the default namespace”, then “Smith in the general services area”, then “Smith in the local cluster”. It is a highly structured, incredibly repetitive guessing game. Every time your application tries to look up a short name like “inventory-api”, the pod’s resolver appends these suffixes one at a time, sending CoreDNS a separate query for each guess until one finds a match.

But the true villain of this document, the source of immense invisible suffering, is located at the very bottom.

The loud waiter and the tragedy of ndots

Let us talk about options ndots:5. This single line of configuration is responsible for more wasted network traffic than a teenager downloading 4K video over cellular data.

The ndots value tells your pod how many dots a domain name must have before it is considered an absolute, fully qualified domain name. If a domain has fewer than five dots, the pod assumes it is a local nickname and starts appending the search domains to it.

Imagine a waiter in a crowded, high-end restaurant. You ask this waiter to bring a message to a guest named “google.com”.

Because “google.com” only has one dot, the waiter refuses to believe this is the person’s full legal name. Instead of looking at the master reservation book, the waiter walks into the center of the dining room and screams at the top of his lungs, “IS THERE A GOOGLE.COM IN THE DEFAULT.SVC.CLUSTER.LOCAL NAMESPACE?”

The room goes dead silent. Nobody answers.

Undeterred, the waiter moves to the next search domain and screams, “IS THERE A GOOGLE.COM IN THE SVC.CLUSTER.LOCAL NAMESPACE?”

Again, nothing. The waiter does this a third time for cluster.local. Finally, sweating and out of breath, having annoyed every single patron in the establishment, the waiter says, “Fine. Let me check the outside world for just plain google.com.”

This happens for every single external API call your application makes. Three useless, doomed-to-fail DNS queries hit CoreDNS before your pod even attempts the correct external address. CoreDNS processes these garbage queries with the dead-eyed stare of a DMV clerk stamping “DENIED” on improperly filled forms. If you ever wonder why your cluster DNS latency is slightly terrible, the screaming waiter is usually to blame.
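The waiter's itinerary is entirely determined by that resolv.conf. As a sketch of the stub resolver's expansion rule applied to the file shown earlier:

```python
def query_order(name, search_domains, ndots=5):
    """Return the sequence of names the stub resolver will try, in order."""
    if name.endswith("."):
        return [name]                      # trailing dot: already absolute, no guessing
    if name.count(".") >= ndots:
        # "Absolute enough": try the name as-is first, search list as fallback.
        return [name] + [f"{name}.{d}" for d in search_domains]
    # Fewer dots than ndots: scream through the search list first.
    return [f"{name}.{d}" for d in search_domains] + [name]

# The search list from the resolv.conf above.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
```

Appending a trailing dot to external names in your configuration (`google.com.`) is the classic fix: it tells the resolver the name is already fully qualified, and the three doomed queries never happen.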

The Corefile and other documents of self-inflicted pain

The rules governing CoreDNS are dictated by a configuration file known as the Corefile. It is essentially the Human Resources policy manual for the DNS server. It defines which plugins are active, who is allowed to ask questions, and where to forward queries it does not understand.

A simplified corporate policy might look like this:

.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
}

Most of this is standard bureaucratic routing. The kubernetes plugin tells CoreDNS how to watch the Kubernetes API to find out where services and pods actually live. The cache plugin lets CoreDNS memorize answers for 30 seconds, so repeated questions are answered from memory instead of being resolved all over again.

But the most fascinating part of this document is the loop plugin.

In complex networks, it is very easy to accidentally configure DNS servers to point at each other. Server A asks Server B, and Server B asks Server A. In a normal corporate environment, two middle managers delegating the same task back and forth will do so indefinitely, drawing salaries for years until retirement.

Software does not have a retirement plan. Left unchecked, a DNS forwarding loop will exponentially consume memory and CPU until the entire node catches fire and dies.

The loop plugin exists solely to detect this exact scenario. It sends a uniquely tagged query out into the world. If that same query comes back to it, CoreDNS realizes it is trapped in a futile, infinite cycle of middle-management delegation.

And what does it do? It refuses to participate. It halts. CoreDNS will intentionally shut itself down rather than perpetuate a stupid system. There is a profound life lesson hiding in that logic. It shows a level of self-awareness and boundary-setting that most human workers never achieve.
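The detection logic can be sketched abstractly. The real loop plugin sends a random probe query upstream and watches for it to come back; here a hypothetical forwarding map stands in for the network:

```python
import uuid

def detect_forwarding_loop(server, forward_to, max_hops=16):
    """Follow a forwarding chain; report a loop if the tagged probe revisits a server.

    forward_to maps each server to its upstream (None means the chain terminates).
    The unique probe_id mimics the tagged query the loop plugin sends out.
    """
    probe_id = uuid.uuid4().hex          # unique tag, like the plugin's random query name
    seen = {server: probe_id}
    hops = 0
    while hops < max_hops:
        upstream = forward_to.get(server)
        if upstream is None:
            return False                 # chain terminates upstream: no loop
        if seen.get(upstream) == probe_id:
            return True                  # our own tagged query came back: loop!
        seen[upstream] = probe_id
        server = upstream
        hops += 1
    return True                          # suspiciously long chain: treat it as a loop
```

The classic real-world trigger is `forward . /etc/resolv.conf` on a node whose own resolv.conf points back at the cluster DNS, which is precisely the CoreDNS-asks-CoreDNS cycle this detects.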

Headless services or giving out direct phone numbers

Most of the time, when you ask CoreDNS for a service, it gives you the IP address of a load balancer. You call the front desk, and the receptionist routes your call to an available agent. You do not know who you are talking to, and you do not care.

But some applications are needy. Databases in a cluster, like a Cassandra ring or a MongoDB replica set, cannot just talk to a random receptionist. They need to replicate data. They need to know exactly who they are talking to. They need direct home phone numbers.

This is where Kubernetes offers a feature known as a “headless service”, and CoreDNS changes what it hands out accordingly.

When you create a service in Kubernetes, it usually looks like a standard networking request. But when you explicitly add one specific line to the YAML definition, you are effectively firing the receptionist:

apiVersion: v1
kind: Service
metadata:
  name: inventory-db
spec:
  clusterIP: None # <-- The receptionist's termination letter
  selector:
    app: database

By setting “clusterIP: None”, you are telling CoreDNS that this department has no front desk.

Now, when a pod asks for “inventory-db”, CoreDNS does not hand out a single routing IP. Instead, it dumps a raw list of the individual IP addresses of every single pod backing that service. Furthermore, it creates a custom, highly specific DNS entry for every individual pod in the StatefulSet.

It assigns them names like “inventory-db-0.inventory-db.production.svc.cluster.local”.

Suddenly, your database node has a personal identity. It can be addressed directly. It is a minor administrative miracle, allowing complex, stateful applications to map out their own internal corporate structure without relying on the cluster’s default routing mechanics.
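The difference between the two answer styles can be sketched as a record table. The pod names and IPs below are invented:

```python
def service_records(name, namespace, cluster_ip, endpoints):
    """Build the DNS answers for a service: one VIP, or every pod (headless).

    endpoints is a list of (pod_name, pod_ip) pairs backing the service.
    """
    fqdn = f"{name}.{namespace}.svc.cluster.local"
    if cluster_ip is not None:
        # Normal service: the receptionist hands out one routing address.
        return {fqdn: [cluster_ip]}
    # Headless service (clusterIP: None): the raw list of pod IPs...
    records = {fqdn: [ip for _, ip in endpoints]}
    # ...plus a personal entry for each pod, as in a StatefulSet.
    for pod_name, ip in endpoints:
        records[f"{pod_name}.{fqdn}"] = [ip]
    return records
```
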

The unsung hero of the sub-basement

CoreDNS is rarely the subject of glowing keynotes at tech conferences. It does not have the flashy appeal of a service mesh, nor does it generate the architectural excitement of an advanced deployment strategy. It is plumbing. It is bureaucracy. It is the guy in the basement checking names off a clipboard.

But the next time you type a simple, human-readable name into an application configuration file and it flawlessly connects to a database across the cluster, think of CoreDNS. Think of the thousands of fake ndots queries it cheerfully absorbed. Think of its rigid adherence to the Corefile.

And most importantly, respect a piece of software that is smart enough to know when it is stuck in a loop, and brave enough to quit on the spot.

They left AWS to save money. Coming back cost even more

Not long ago, a partner I work with told me about a company that decided it had finally had enough of AWS.

The monthly bill had become the sort of document people opened with the facial expression usually reserved for dental estimates. Consultants were invited in. Spreadsheets were produced. Serious people said serious things about control, efficiency, and the wisdom of getting off the cloud treadmill.

The conclusion sounded almost virtuous. Leave AWS, move the workloads to a colocation facility, buy the hardware, and stop renting what could surely be owned more cheaply.

It was neat. It was rational. It was, for a while, deeply satisfying.

And then reality arrived, carrying invoices.

The company spent a substantial sum getting out of AWS. Servers were bought. Contracts were signed. Staff had to be hired to manage all the things cloud providers manage quietly in the background while everyone else gets on with their jobs. Not long after, the economics began to fray. Reversing course would cost even more than leaving had in the first place.

That is the part worth paying attention to.

Not because it makes for a dramatic story, though it does. And not because it is especially rare, because it is not. It matters because it exposes one of the oldest tricks in infrastructure decision-making: companies compare a visible bill with an invisible burden, decide the bill is the scandal, and only later discover that the burden was doing quite a lot of useful work.

The spreadsheet seduction

On paper, the move away from AWS looked wonderfully sensible.

The cloud bill was obvious, monthly, and impolite enough to keep turning up. On-premises looked calmer. Hardware could be amortized. Rack space, power, and bandwidth could be priced. With a bit of care, the whole thing could be made to resemble prudence.

This is where many repatriation plans become dangerously persuasive. The cloud is cast as an extravagant landlord. On-premises is presented as the mature decision to stop renting and finally buy the house.

Unfortunately, a data center is not a house. It is closer to owning a very large hotel whose plumbing, wiring, keys, security, fire precautions, laundry, and unexpected midnight incidents are all your responsibility, except the guests are servers and none of them leave a tip.

The spreadsheet had done a decent job of pricing the obvious things. Hardware. Colocation space. Power. Connectivity.

What was priced badly were all the dull, expensive capabilities that public cloud tends to bundle into the bill. Managed failover. Backup automation. Key rotation. Elastic capacity. Security controls. Compliance support. Monitoring that does not depend on a specific engineer being awake, available, and emotionally prepared.

What looked like cloud excess turned out to include a great deal of cloud competence.

That distinction matters.

A large cloud bill is easy to resent because it is visible. Operational competence is harder to resent because it tends to be hidden in the walls.

What the cloud had been doing all along

One of the costliest mistakes in infrastructure is confusing convenience with fluff.

A managed database can look expensive right up to the moment you have to build and test failover yourself, define recovery objectives, handle maintenance windows, rotate credentials, validate backups, and explain to auditors why one awkward part of the process still depends on a human remembering to do something after lunch.

A content delivery network may seem like a luxury until you try to reproduce low-latency delivery, edge caching, certificate handling, resilience, and attack mitigation with a mixture of hardware, internal effort, procurement delays, and hope.

The company, in this case, had not really been paying AWS only for compute and storage. It had been paying AWS to absorb a long list of repetitive operational chores, specialized platform decisions, and uncomfortable edge cases.

Once those chores came back in-house, they did not return politely.

Redundancy stopped being a feature and became a budget line, followed by an implementation plan, followed by a maintenance burden. Security controls that had once been inherited now had to be selected, deployed, documented, checked, and defended. Compliance work that had once been partly automated became a steady stream of evidence gathering, procedural discipline, and administrative repetition.

Cloud bills can look high. So can plumbing. You only discover its emotional value when it stops working.

The talent tax

The easiest part of moving on premises is buying equipment.

The harder part is finding enough people who know how to run the surrounding world properly.

Cloud expertise is now common enough that many companies can hire engineers comfortable with infrastructure as code, IAM, managed services, container platforms, observability, autoscaling, and cost controls. Strong cloud engineers are not cheap, but they are at least visible in the market.

Deep on-premises expertise is another matter. People who are strong in storage, backup infrastructure, virtualization, physical networking, hardware lifecycle, and operational recovery still exist, but they are not standing about in large numbers waiting to be discovered. They are experienced, expensive, and often well aware of their market value.

There is also a cultural issue that rarely appears in repatriation slide decks. A great many engineers would rather write Terraform than troubleshoot a hardware issue under unflattering lighting at two in the morning. This is not a moral failure. It is simple market gravity. The industry has spent years abstracting away routine infrastructure pain because abstraction is usually a better use of skilled human attention.

The partner who told me this story was particularly clear on this point. The staffing line looked manageable in planning. In practice, it turned into one of the most stubborn and underestimated parts of the whole effort.

Cloud is not cheap because expertise is cheap. Cloud is often cheaper because rebuilding enough expertise inside one company is very expensive.

Why utilization lies so beautifully

Projected utilization is one of those numbers that becomes more charming the less time it spends near reality.

Many repatriation models assume that servers will be well used, capacity will be planned sensibly, and waste will be modest. It sounds disciplined. Responsible, even.

Real workloads behave less like equations and more like kitchens during a family gathering. There are quiet periods, sudden rushes, abandoned experiments, quarter-end panics, new projects that arrive with urgency and no warning, and services no one remembers until they break.

Elasticity is not a decorative feature added by cloud providers to justify themselves. It is one of the main ways organizations avoid buying for peak demand and then spending the rest of the year paying for machinery to sit about waiting.

Without elasticity, you provision for the busiest day and fund the silence in between.

Silence, in infrastructure, is expensive.

A half-used on-premises platform still consumes power, occupies space, demands maintenance, requires patching, and waits patiently for a workload spike that visits only now and then. Spare capacity has excellent manners. It makes no fuss. It simply eats money quietly and on schedule.

This was one of the turning points in the story I heard. Forecast utilization turned out to be far more flattering than actual utilization. Once that happened, the economics began to sag under their own good intentions.
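The arithmetic behind that sag is simple enough to sketch. Every number here is invented for illustration:

```python
def compare_provisioning(hourly_demand, peak_capacity, hourly_rate):
    """Compare buying for peak vs paying per unit-hour actually used."""
    hours = len(hourly_demand)
    fixed = peak_capacity * hours * hourly_rate     # fleet sized for the busiest hour
    elastic = sum(hourly_demand) * hourly_rate      # pay only for consumed unit-hours
    utilization = sum(hourly_demand) / (peak_capacity * hours)
    return fixed, elastic, utilization

# A hypothetical 720-hour month: a quiet baseline with a few spikes to 100 units.
demand = [10] * 700 + [100] * 20
fixed, elastic, util = compare_provisioning(demand, peak_capacity=100, hourly_rate=3.0)
```

With those made-up numbers, the peak-sized fleet runs at 12.5% utilization and costs eight times what the elastic shape would. Real ratios vary, but the direction rarely does: the spikier the demand, the more the silence costs.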

The cost of becoming slower

Traditional total-cost comparisons handle direct spending reasonably well. They are much worse at pricing lost momentum.

When a company runs on a large cloud platform, it does not merely rent infrastructure. It also gains access to a constant flow of improvements and options. Better analytics tools. New security integrations. Managed AI services. Identity features. Database capabilities. Deployment patterns. Networking enhancements. Observability tooling.

No single addition changes everything overnight. The effect is cumulative. It is a thousand small conveniences arriving over time and sparing teams from having to rebuild ordinary civilization every quarter.

An on-premises platform can be stable and well run. For the right workloads, that may be perfectly acceptable. But it does not evolve at the pace of a hyperscaler. Upgrades become projects. New capabilities require procurement, testing, staffing, and patience. The platform becomes more careful and, usually, slower.

That slower pace does not always show up neatly in a spreadsheet, but engineers feel it almost immediately.

While competitors are experimenting with new managed services or shipping new capabilities faster, the repatriated organization may be spending its time improving backup procedures, standardizing tools, negotiating maintenance arrangements, or replacing hardware that has chosen an inconvenient moment to become philosophical.

There is nothing glamorous about that. There is also nothing free about it.

Who should actually consider on-premises

None of this means on-premises is foolish.

That would be a lazy conclusion, and lazy conclusions are where expensive architecture plans begin.

For some organizations, on-premises remains entirely reasonable. It makes sense for highly predictable workloads with very little variability. It can make sense in tightly regulated environments where legal, sovereignty, or operational constraints sharply limit the use of public cloud. And at a very large scale, some organizations genuinely can justify building substantial parts of their own platform.

But most companies tempted by repatriation are not in that category.

They are not hyperscalers. They are not all running flat, perfectly predictable workloads. They are not all boxed in by constraints that make public cloud impossible. More often, they are reacting to a painful cloud bill caused by weak cost governance, poor workload fit, loose architecture discipline, or a lack of serious FinOps.

That is a very different problem.

Leaving AWS because you are using AWS badly is a bit like selling your refrigerator because the groceries keep going off while the door is open. The appliance may not be the heart of the matter.

The middle ground companies skip past

One of the stranger features of cloud debates is how quickly they become binary.

Either remain in public cloud forever, or march solemnly back to racks and cages as if returning to a lost ancestral craft.

There is, of course, a middle ground.

Some workloads do benefit from local placement because of latency, residency, plant integration, or operational constraints. But needing hardware closer to the ground does not automatically mean rebuilding the entire service model from scratch. The more useful question is often not whether the hardware should be local, but whether the control plane, automation model, and day-to-day operations should still feel cloud-like.

That is a much more practical conversation.

A company may need some infrastructure nearby while still gaining enormous value from managed identity, familiar APIs, consistent automation, and operational patterns learned in the cloud. This tends to sound less heroic than a full repatriation story, but heroism is not a particularly reliable basis for infrastructure strategy.

The partner who described this case said as much. If they had explored the middle road earlier, they might have kept the local advantages they wanted without assuming quite so much of the surrounding operational burden.

What a real repatriation audit should include

Any company seriously considering a move off AWS should pause long enough to perform an audit that is a little less enchanted by ownership.

Start with the full cloud picture, not just the line items everyone enjoys complaining about. Include engineering effort, compliance automation, security services, platform speed, operational overhead, and the cost of scaling quickly when demand changes.

Then build the on-premises model with uncommon honesty. Price round-the-clock operations. Price redundancy properly. Price backup and recovery as if they matter, because they do. Price refresh cycles, maintenance contracts, spare capacity, patching, testing, physical security, audit evidence, and the awkward certainty that hardware fails when it is least convenient.

Then ask a cultural question, not just a financial one. How many of your engineers actually want to spend more of their time dealing with the physical stack and the operational plumbing that comes with it?

That answer matters more than many executives would like.

A strategy that looks cheaper on paper but nudges your best engineers toward the door is not, in any meaningful sense, cheaper.

Finally, compare repatriation not only against your current cloud bill, but against what a disciplined cloud optimization program could achieve. Rightsizing, storage improvements, better instance strategy, autoscaling discipline, reserved capacity planning, architecture cleanup, and proper FinOps can all change the economics without requiring anyone to rediscover the intimate emotional texture of broken hardware.

The bill behind the bill

What has stayed with me about this story is that it was never really a story about AWS.

It was a story about accounting for the wrong thing.

The visible bill was treated as the entire problem. The hidden work behind the bill was treated as background scenery. Once the company moved off AWS, the scenery walked to the front of the stage and began sending invoices.

That is the trap.

Cloud can absolutely be expensive. Plenty of organizations run it badly and pay for the privilege. But on-premises is not automatically the sober adult in the room. Quite often, it is simply a different payment model, one that hides more of the cost in staffing, slower delivery, operational fragility, maintenance overhead, and all the unlovely little chores that cloud platforms had been taking care of out of sight.

The lesson from this case was not that every workload belongs in AWS forever. It was that infrastructure decisions become dangerous when they are made in reaction to irritation rather than in response to a full economic picture.

Leaving the cloud may still be the right answer for some organizations. For many others, the more useful answer is much less theatrical. Use the cloud better. Govern it better. Design it properly. Understand what you are paying for before deciding you would prefer to rebuild it yourself.

A large monthly cloud bill can be offensive to look at.

The bill that arrives after a bad attempt to escape it is usually less offensive than heartbreaking.

And heartbreak, unlike EC2, rarely comes with autoscaling.

Surviving the Ingress NGINX apocalypse without breaking a sweat

Look at the calendar. It is March 2026. The deadline we have been hearing about for months has officially arrived, and across the globe, engineers are clutching their coffee mugs, staring at their terminals, and waiting for their Kubernetes clusters to spontaneously combust. There is a palpable panic in the air. Tech forums are overflowing with dramatic declarations that the internet is broken, all because a specific piece of software is officially retiring.

Take a deep breath. Your servers are not going to melt. Traffic is not going to suddenly hit a brick wall. But you do need to pack up your things and move, because the building you are living in just fired its maintenance staff.

To understand how we got here and how to get out alive, we need to stop treating this retirement like a digital Greek tragedy and start looking at it like a mundane eviction notice. We are going to peel back the layers of this particular onion, dry our eyes, and figure out how to migrate our traffic routing without breaking a sweat.

The great misunderstanding of what is actually dying

Before we start packing boxes, we need to address the rampant identity confusion that has turned a routine software lifecycle event into a source of mass hysteria. A lot of online discussion has mixed up three entirely different things, treating them like a single, multi-headed beast. Let us separate them.

First, there is NGINX. This is the web server and reverse proxy that has been moving packets around the internet since you were still excited about flip phones. NGINX is fine. Nobody is retiring NGINX. It is healthy, wealthy, and continues to route a massive chunk of the global internet.

Second, there is the Ingress API. This is the Kubernetes object you use to describe your HTTP and HTTPS routing rules. It is just a set of instructions. The Ingress API is not being removed. The Kubernetes maintainers are not going to sneak into your cluster at night and delete your YAML files.

Finally, there is the Ingress NGINX controller. This is the community-maintained piece of software that reads your Ingress API instructions and configures NGINX to actually execute them. This specific controller, maintained by a group of incredibly exhausted volunteers, is the thing that is retiring. As of right now, March 2026, it is no longer receiving updates, bug fixes, or security patches.

That distinction clears up most of the confusion. The bouncer at the door of your nightclub is retiring, but the nightclub itself is still open, and the rules about who gets in remain the same. You just need to hire a new bouncer.

Why the bouncer finally walked off the job

To understand why the community Ingress NGINX controller is packing its bags, you have to look at what we forced it to do. For years, this controller has been the stoic bouncer at the entrance of your Kubernetes cluster. It stood in the rain, checked the TLS certificates, and decided which request got into the VIP pod and which one got thrown out into the alley.

But the Ingress API itself was fundamentally limited. It only understood the basics. It knew about hostnames and paths, but it had no idea how to handle anything complex, like weighted canary deployments, custom header manipulation, or rate limiting.

Because we developers are needy creatures who demand complex routing, we found a workaround. We started using annotations. We slapped sticky notes all over the bouncer’s forehead. We wrote cryptic instructions on these notes, telling the controller to inject custom configuration snippets directly into the underlying NGINX engine.

Eventually, the bouncer was walking around completely blinded by thousands of contradictory sticky notes. Maintaining this chaotic system became a nightmare for the open-source volunteers. They were basically performing amateur dental surgery in the dark, trying to patch security holes in a system entirely built out of user-injected string workarounds. The technical debt became a mountain, and the maintainers rightly decided they had had enough.

The terrifying reality of unpatched edge components

If the controller is not going to suddenly stop working today, you might be tempted to just leave it running. This is a terrible idea.

Leaving an obsolete, unmaintained Ingress controller facing the public internet is exactly like leaving the front door of your house under the strict protection of a scarecrow. The crows might stay away for the first week. But eventually, the local burglars will realize your security system is made of straw and old clothes.

Edge proxies are the absolute favorite targets for attackers. They sit right on the boundary between the wild, unfiltered internet and your soft, vulnerable application data. When a new vulnerability is discovered next month, there will be no patch for your retired Ingress NGINX controller. Attackers will scan the internet for that specific outdated signature, and they will walk right past your scarecrow. Do not be the person explaining to your boss that the company data was stolen because you did not want to write a few new YAML files.

Meet the new security firm known as Gateway API

If Ingress was a single bouncer overwhelmed by sticky notes, the new standard, known as Gateway API, is a professional security firm with distinct departments.

The core problem with Ingress was that it forced the infrastructure team and the application developers to fight over the same file. The platform engineer wanted to manage the TLS certificates, while the developer just wanted to route traffic to their new shopping cart service.

Gateway API fixes this by splitting the responsibilities into different objects. You have a GatewayClass (the type of security firm), a Gateway (the physical building entrance, managed by the platform team), and an HTTPRoute (the VIP list for a specific room, managed by the developers). It is structured, it is typed, and most importantly, it drastically reduces the need for those horrible annotation sticky notes.
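
To make the split concrete, here is what the platform-team half might look like: a Gateway that the HTTPRoute example below can attach to via parentRefs. The namespace, GatewayClass name, and certificate name are illustrative placeholders, not values from any particular installation.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: main-company-gateway
  namespace: infra          # hypothetical platform-team namespace
spec:
  gatewayClassName: example-gateway-class   # provided by your chosen controller
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: company-wildcard-cert          # hypothetical TLS secret
    allowedRoutes:
      namespaces:
        from: All   # let developers attach HTTPRoutes from their own namespaces
```

The platform team owns this object and its certificates; developers never touch it. They only create HTTPRoutes that reference it.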

You do not have to migrate to the Gateway API. You can simply switch to a different, commercially supported Ingress controller that still reads your old files. But if you are going to rip off the bandage and change your routing infrastructure, you might as well upgrade to the modern standard.

A before-and-after infomercial for your YAML files

Let us look at a practical example. Has this ever happened to you? Are your YAML files bloated, confusing, and causing you physical pain to read? Look at this disastrous piece of legacy Ingress configuration.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: desperate-cries-for-help
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/server-snippet: |
      location ~* ^/really-bad-regex/ {
        return 418;
      }
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "15"
spec:
  ingressClassName: nginx
  rules:
  - host: chaotic-store.example.local
    http:
      paths:
      - path: /catalog(/|$)(.*)
        pathType: ImplementationSpecific
        backend:
          service:
            name: catalog-service-v2
            port:
              number: 8080

This is not a configuration. This is a hostage note. You are begging the controller to understand regex rewrites and canary deployments by passing simple strings through annotations.

Now, wipe away those tears and look at the clean, structured beauty of an HTTPRoute in the Gateway API world.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: calm-and-collected-routing
spec:
  parentRefs:
  - name: main-company-gateway
  hostnames:
  - "smooth-store.example.local"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /catalog
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /
    backendRefs:
    - name: catalog-service-v1
      port: 8080
      weight: 85
    - name: catalog-service-v2
      port: 8080
      weight: 15

Look at that. No sticky notes. No injected server snippets. The routing weights and the URL rewrites are native, structured fields. Your linter can actually read this and tell you if you made a typo before you deploy it and take down the entire production environment.
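
Those weight fields mean exactly what the numbers suggest: each request goes to a backend with probability proportional to its weight, so the route above sends roughly 15 percent of traffic to the canary. A toy illustration of the arithmetic in Python (how any given controller actually implements it is its own business):

```python
import random

def pick_backend(backends, rng=random):
    """Choose a backend with probability proportional to its weight,
    mirroring the 85/15 split in the HTTPRoute above."""
    names = [b["name"] for b in backends]
    weights = [b["weight"] for b in backends]
    return rng.choices(names, weights=weights, k=1)[0]

backends = [
    {"name": "catalog-service-v1", "weight": 85},
    {"name": "catalog-service-v2", "weight": 15},
]

# Over many simulated requests, roughly 15% land on the canary.
hits = sum(pick_backend(backends) == "catalog-service-v2" for _ in range(100_000))
print(f"canary share: {hits / 100_000:.2%}")
```

Bumping the canary to full traffic is then a one-line diff on the weights, reviewable in a pull request, rather than a new set of annotations.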

A five-phase rehabilitation program for your cluster

You cannot just delete the old controller on a Friday afternoon and hope for the best. You need a controlled rehabilitation program for your cluster. Treat this as a serious infrastructure project.

Phase 1: The honest inventory

You need to look at yourself in the mirror and figure out exactly what you have deployed. Find every single Ingress object in your cluster. Document every bizarre annotation your developers have added over the years. You will likely find routing rules for services that were decommissioned three years ago.

Phase 2: Choosing your new guardian

Evaluate the replacements. If you want to stick with NGINX, look at the official F5 NGINX Ingress Controller. If you want something modern, look at Envoy-based solutions like Gateway API implementations from Cilium, Istio, or Contour. Deploy your choice into a sandbox environment.

Phase 3: The great translation

Start converting those sticky notes. Take your legacy Ingress objects and translate them into Gateway API routes, or at least clean them up for your new controller. This is the hardest part. You will have to decipher what nginx.ingress.kubernetes.io/configuration-snippet actually does in your specific context.

Phase 4: The side-by-side test

Run the new controller alongside the retiring community one. Use a test domain. Throw traffic at it. Watch the metrics. Ensure that your monitoring dashboards and alerting rules still work, because the new controller will expose entirely different metric formats.

Phase 5: The DNS switch

Slowly move your DNS records from the old load balancer to the new one. Do this during business hours when everyone is awake and heavily caffeinated, not at 2 AM on a Sunday.

The final word on not panicking

If you need a message to send to your management team today, keep it simple. Tell them the community ingress-nginx controller is now officially unmaintained. Assure them the website is not down, but inform them that staying on this software is a ticking security time bomb. You need time and resources to move to a modern implementation.

The real lesson here is not that Kubernetes is unstable. It is that the software world relies heavily on the unpaid labor of open-source maintainers. When a critical project no longer has enough volunteers to hold back the tide of technical debt, responsible engineering teams do not sit around complaining on internet forums. They say thank you for the years of free service, they roll up their sleeves, and they migrate before the lack of maintenance becomes an incident report.

AWSMap for smarter AWS migrations

Most AWS migrations begin with a noble ambition and a faintly ridiculous problem.

The ambition is to modernise an estate, reduce risk, tidy the architecture, and perhaps, if fortune smiles, stop paying for three things nobody remembers creating.

The ridiculous problem is that before you can migrate anything, you must first work out what is actually there.

That sounds straightforward until you inherit an AWS account with the accumulated habits of several teams, three naming conventions, resources scattered across regions, and the sort of IAM sprawl that suggests people were granting permissions with the calm restraint of a man feeding pigeons. At that point, architecture gives way to archaeology.

I do not work for AWS, and this is not a sponsored love letter to a shiny console feature. I am an AWS and GCP architect working in the industry, and I have used AWSMap when assessing environments ahead of migration work. The reason I am writing about it is simple enough. It is one of those practical tools that solves a very real problem, and somehow remains less widely known than it deserves.

AWSMap is a third-party command-line utility that inventories AWS resources across regions and services, then lets you explore the result through HTML reports, SQL queries, and plain-English questions. In other words, it turns the early phase of a migration from endless clicking into something closer to a repeatable assessment process.

That does not make it perfect, and it certainly does not replace native AWS services. But in the awkward first hours of understanding an inherited environment, it can be remarkably useful.

The migration problem before the migration

A cloud migration plan usually looks sensible on paper. There will be discovery, analysis, target architecture, dependency mapping, sequencing, testing, cutover, and the usual brave optimism seen in project plans everywhere.

In reality, the first task is often much humbler. You are trying to answer questions that should be easy and rarely are.

What is running in this account?

Which regions are actually in use?

Are there old snapshots, orphaned EIPs, forgotten load balancers, or buckets with names that sound important enough to frighten everyone into leaving them alone?

Which workloads are genuinely active, and which are just historical luggage with a monthly invoice attached?

You can answer those questions from the AWS Management Console, of course. Given enough tabs, enough patience, and a willingness to spend part of your afternoon wandering through services you had not planned to visit, you will eventually get there. But that is not a particularly elegant way to begin a migration.

This is where AWSMap becomes handy. Instead of treating discovery as a long guided tour of the console, it treats it as a data collection exercise.

What AWSMap does well

At its core, AWSMap scans an AWS environment and produces an inventory of resources. The current public package description on PyPI describes it as covering more than 150 AWS services, while the v1.5.0 release notes mention 140-plus services, which is a good reminder that the coverage evolves. The important point is not the exact number on a given Tuesday morning, but that it covers a broad enough slice of the estate to be genuinely useful in early assessments.

What makes the tool more interesting is what it does after the scan.

It can generate a standalone HTML report, store results locally in SQLite, let you query the inventory with SQL, run named audit queries, and translate plain-English prompts into database queries without sending your infrastructure metadata off to an LLM service. The release notes for v1.5.0 describe local SQLite storage, raw SQL querying, named queries, typo-tolerant natural language questions, tag filtering, scoped account views, and browsable examples.

That combination matters because migrations are rarely single, clean events. They are usually a series of discoveries, corrections, and mildly awkward conversations. Having the inventory preserved locally means the account does not need to be rediscovered from scratch every time someone asks a new question two days later.

The report you can actually hand to people

One of the surprisingly practical parts of AWSMap is the report output.

The tool can generate a self-contained HTML report that opens locally in a browser. That sounds almost suspiciously modest, but it is useful precisely because it is modest. You can attach it to a ticket, share it with a teammate, or open it during a workshop without building a whole reporting pipeline first. The v1.5.0 release notes describe the report as a single, standalone HTML file with filtering, search, charts, and export options.

That makes it suitable for the sort of migration meeting where someone says, “Can we quickly check whether eu-west-1 is really the only active region?” and you would rather not spend the next ten minutes performing a slow ritual through five console pages.

A simple scan might look like this:

awsmap -p client-prod

If you want to narrow the blast radius a bit and focus on a few services that often matter early in migration discovery, you could do this:

awsmap -p client-prod -s ec2,rds,elb,lambda,iam

And if the account is a thicket of shared infrastructure, tags can help reduce the noise:

awsmap -p client-prod -t Environment=Production -t Owner=platform-team

That kind of filtering is helpful when the account contains equal parts business workload and historical clutter, which is to say, most real accounts.

Why SQLite is more important than it sounds

The feature I like most is not the report. It is the local SQLite database.

Every scan can be stored locally, so the inventory becomes queryable over time instead of vanishing the moment the terminal output scrolls away. The default local database path is ‘~/.awsmap/inventory.db’, and the scan results from different runs can accumulate there for later analysis.

This changes the character of the tool quite a bit. It stops being a disposable scanner and becomes something closer to a field notebook.

Suppose you scan a client account today, then return to the same work three days later, after someone mentions an old DR region nobody had documented. Without persistence, you start from scratch. With persistence, you ask the database.

That is a much more civilised way to work.

A query for the busiest services in the collected inventory might look like this:

awsmap query "SELECT service, COUNT(*) AS total
FROM resources
GROUP BY service
ORDER BY total DESC
LIMIT 12"

And a more migration-focused query might be something like:

awsmap query "SELECT account, region, service, name
FROM resources
WHERE service IN ('ec2', 'rds', 'elb', 'lambda')
ORDER BY account, region, service, name"

Neither query is glamorous, but migrations are not built on glamour. They are built on being able to answer dull, important questions reliably.
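
Because the inventory is an ordinary SQLite file, you can also open it with any standard client, which is handy when you want the data in a notebook or a script. A sketch in Python's built-in sqlite3 module, assuming the `resources` table and columns used in the queries above (the exact schema belongs to AWSMap, so verify it against your own database before relying on this):

```python
import sqlite3
from pathlib import Path

# Default AWSMap inventory location, per the v1.5.0 release notes.
DB_PATH = Path.home() / ".awsmap" / "inventory.db"

def busiest_services(db, limit=12):
    """Return (service, resource_count) pairs from the inventory, busiest first."""
    return db.execute(
        "SELECT service, COUNT(*) AS total "
        "FROM resources GROUP BY service "
        "ORDER BY total DESC LIMIT ?",
        (limit,),
    ).fetchall()

if DB_PATH.exists():
    with sqlite3.connect(DB_PATH) as db:
        for service, total in busiest_services(db):
            print(f"{service:20} {total}")
```

Nothing here that the awsmap CLI cannot already do, but it means the inventory plugs straight into whatever analysis tooling your team already uses.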

Security and hygiene checks without the acrobatics

AWSMap also includes named queries for common audit scenarios, which is useful for two reasons.

First, most people do not wake up eager to write SQL joins against IAM relationships. Second, migration assessments almost always drift into security checks sooner or later.

The public release notes describe named queries for scenarios such as admin users, public S3 buckets, unencrypted EBS volumes, unused Elastic IPs, and secrets without rotation.

That means you can move from “What exists?” to “What looks questionable?” without much ceremony.

For example:

awsmap query -n admin-users
awsmap query -n public-s3-buckets
awsmap query -n ebs-unencrypted
awsmap query -n unused-eips

Those are not, strictly speaking, migration-only questions. But they are precisely the kind of questions that surface during migration planning, especially when the destination design is meant to improve governance rather than merely relocate the furniture.

Asking questions in plain English

One of the nicer additions in the newer version is the ability to ask plain-English questions.

That is the sort of feature that normally causes one to brace for disappointment. But here the approach is intentionally local and deterministic. This functionality is a built-in parser rather than an LLM-based service, which means no API keys, no network calls to an external model, and no need to ship resource metadata somewhere mysterious.

That matters in enterprise environments where the phrase “just send the metadata to a third-party AI service” tends to receive the warm reception usually reserved for wasps.

Some examples:

awsmap ask show me lambda functions by region
awsmap ask list databases older than 180 days
awsmap ask find ec2 instances without Owner tag

Even when the exact wording varies, the basic idea is appealing. Team members who do not want to write SQL can still interrogate the inventory. That lowers the barrier for using the tool during workshops, handovers, and review sessions.
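
AWSMap's actual parser is its own affair, but the general idea, a deterministic mapping from recognised keywords to SQL with no model in the loop, is simple enough to sketch. A toy illustration only, not AWSMap's implementation:

```python
def question_to_sql(question: str) -> str:
    """Map a tiny set of English patterns onto SQL for an inventory table.
    Purely illustrative; AWSMap's real parser is richer and typo-tolerant."""
    words = set(question.lower().split())
    known_services = {"lambda", "ec2", "rds", "s3"}
    service = next(iter(words & known_services), None)

    select, group = "SELECT * FROM resources", ""
    if {"by", "region"} <= words:
        select = "SELECT region, COUNT(*) AS total FROM resources"
        group = "GROUP BY region ORDER BY total DESC"

    where = f"WHERE service = '{service}'" if service else ""
    return " ".join(part for part in (select, where, group) if part)

print(question_to_sql("show me lambda functions by region"))
```

The appeal of this approach is precisely that it is boring: the same question always produces the same query, and nothing leaves the machine.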

Where AWSMap fits next to AWS native services

This is the part worth stating clearly.

AWSMap is useful, but it is not a replacement for AWS Resource Explorer, AWS Config, or every other native mechanism you might use for discovery, governance, and inventory.

AWS Resource Explorer can search across supported resource types and, since 2024, can also discover all tagged AWS resources using the ‘tag:all’ operator. AWS documentation also notes an important limitation for IAM tags in Resource Explorer search.

AWS Config, meanwhile, continues to expand the resource types it can record, assess, and aggregate. AWS has announced multiple additions in 2025 and 2026 alone, which underlines that the native inventory and compliance story is still moving quickly.

So why use AWSMap at all?

Because its strengths are slightly different.

It is local.

It is quick to run.

It gives you a portable HTML report.

It stores results in SQLite for later interrogation.

It lets you query the inventory directly without setting up a broader governance platform first.

That makes it particularly handy in the early assessment phase, in consultancy-style discovery work, or in those awkward inherited environments where you need a fast baseline before deciding what the more permanent controls should be.

The weak points worth admitting

No serious article about a tool should pretend the tool has descended from heaven in perfect condition, so here are the caveats.

First, coverage breadth is not the same thing as universal depth. A tool can support a large number of services and still provide uneven detail between them. That is true of almost every inventory tool ever made.

Second, the quality of the result still depends on the credentials and permissions you use. If your access is partial, your inventory will be partial, and no amount of cheerful HTML will alter that fact.

Third, local storage is convenient, but it also means you should be disciplined about how scan outputs are handled on your machine, especially if you are working with client environments. Convenience and hygiene should remain on speaking terms.

Fourth, for organisation-wide governance, compliance history, managed rules, and native integrations, AWS services such as Config still have an obvious place. AWSMap is best seen as a sharp assessment tool, not a universal control plane.

That is not a criticism so much as a matter of proper expectations.

A practical workflow for migration discovery

If I were using AWSMap at the start of a migration assessment, the workflow would be something like this.

First, run a broad scan of the account or profile you care about.

awsmap -p client-prod

Then, if the account is noisy, refine the scope.

awsmap -p client-prod -s ec2,rds,elb,iam,route53
awsmap -p client-prod --exclude-defaults

Next, use a few named queries to surface obvious issues.

awsmap query -n public-s3-buckets
awsmap query -n secrets-no-rotation
awsmap query -n admin-users

After that, ask targeted questions in either SQL or plain English.

awsmap ask list load balancers by region
awsmap ask show databases with no backup tag
awsmap query "SELECT region, COUNT(*) AS total
FROM resources
WHERE service='ec2'
GROUP BY region
ORDER BY total DESC"

And finally, keep the HTML report and local inventory as a baseline for later design discussions.

That is where the tool earns its keep. It gives you a reasonably fast, reasonably structured picture of an estate before the migration plan turns into a debate based on memory, folklore, and screenshots in old slide decks.

When the guessing stops

There is a particular kind of misery in cloud work that comes from being asked to improve an environment before anyone has properly described it.

Tools do not eliminate that misery, but some of them reduce it to a more manageable size.

AWSMap is one of those.

It is not the only way to inventory AWS resources. It is not a substitute for native governance services. It is not magic. But it is practical, fast to understand, and surprisingly helpful when the first job in a migration is simply to stop guessing.

That alone makes it worth knowing about.

And in cloud migrations, a tool that helps replace guessing with evidence is already doing better than half the room.

How a Kubernetes Pod comes to life

Run ‘kubectl apply -f pod.yaml’ and Kubernetes has the good manners to make it look simple. You hand over a neat little YAML file, press Enter, and for a brief moment, it feels as if you have politely asked the cluster to start a container.

That is not what happened.

What you actually did was file a request with a distributed bureaucracy. Several components now need to validate your paperwork, record your wishes for posterity, decide where your Pod should live, prepare networking and storage, ask a container runtime to do the heavy lifting, and keep watching the whole arrangement in case it misbehaves. Kubernetes is extremely good at hiding all this. It has the same talent as a hotel lobby. Everything looks calm and polished, while somewhere behind the walls, people are hauling luggage, changing sheets, arguing about room allocation, and trying not to let anything catch fire.

This article follows that process from the moment you submit a manifest to the moment the Pod disappears again. To keep the story tidy, I will use a standalone Pod. In real production environments, Pods are usually created by higher-level controllers such as Deployments, Jobs, or StatefulSets. The Pod is still the thing that ultimately gets scheduled and runs, so it remains the most useful unit to study when you want to understand what Kubernetes is really doing.

The YAML lands on the front desk

Let us start with a very small Pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  labels:
    app: demo
spec:
  containers:
    - name: web
      image: nginx:1.27
      ports:
        - containerPort: 80
      resources:
        requests:
          cpu: "100m"
          memory: "128Mi"
        limits:
          cpu: "250m"
          memory: "256Mi"

When you apply this file, the request goes to the Kubernetes API server. That is the front door of the cluster. Nothing important happens without passing through it first.

The API server does more than nod politely and stamp the form. It checks authentication and authorization, validates the object schema, and sends the request through admission control. Admission controllers can modify or reject the request based on policies, quotas, defaults, or security rules. Only when that process is complete does the API server persist the desired state in etcd, the key-value store Kubernetes uses as its source of truth.

At that point, the Pod officially exists as an object in the cluster.

That does not mean it is running.

It means Kubernetes has written down your intentions in a very serious ledger and is now obliged to make reality catch up.
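
Admission control is easiest to picture as a function over the submitted object: inspect it, then allow, mutate, or reject. In reality a validating webhook speaks the AdmissionReview protocol over HTTPS, but the shape of the decision is the same. A toy check in Python, using a policy invented for illustration (every container must declare requests and limits, as demo-pod above does):

```python
def validate_pod(pod: dict) -> tuple[bool, str]:
    """Toy validating admission check: reject any Pod whose containers
    lack resource requests and limits. Illustrative policy only."""
    for container in pod.get("spec", {}).get("containers", []):
        resources = container.get("resources", {})
        if "requests" not in resources or "limits" not in resources:
            return False, f"container {container.get('name')!r} has no requests/limits"
    return True, "admitted"

pod = {
    "kind": "Pod",
    "metadata": {"name": "demo-pod"},
    "spec": {"containers": [{"name": "web", "image": "nginx:1.27",
                             "resources": {"requests": {"cpu": "100m"},
                                           "limits": {"cpu": "250m"}}}]},
}
print(validate_pod(pod))  # (True, 'admitted')
```

Only after every admission controller has had its say does the object reach etcd, which is why a misbehaving webhook can block Pod creation cluster-wide.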

The scheduler looks for a home

Once the Pod exists but has no node assigned, the scheduler takes interest. Its job is not to run the Pod. Its job is to decide where the Pod should run.

This is less mystical than it sounds and more like trying to seat one extra party in a crowded restaurant without blocking the fire exit.

The scheduler first filters out nodes that cannot host the Pod. A node may be ruled out because it lacks CPU or memory, does not match nodeSelector labels, has taints the Pod does not tolerate, violates affinity or anti-affinity rules, or fails other placement constraints.

The scheduler then scores the nodes that survived this round of rejection and picks one. Different scoring plugins influence the choice, including resource balance and topology preferences. Kubernetes is not asking, “Which node feels lucky today?” It is performing a structured selection process, even if the result arrives so quickly that it looks like instinct.

When the decision is made, the scheduler updates the Pod object with the chosen node.

That is all.

It does not pull images, start containers, mount storage, or wave a wand. It points at a node and says, in effect, “This one. Good luck to everyone involved.”

The kubelet picks up the job

Each node runs an agent called the kubelet. The kubelet watches the API server and notices when a Pod has been assigned to its node.

This is where the abstract promise turns into physical work.

The kubelet reads the Pod specification and starts coordinating with the local container runtime, such as ‘containerd’, to make the Pod real. If there are volumes to mount, secrets to project, environment variables to inject, or images to fetch, the kubelet is the one making sure those steps happen in the correct order.

The kubelet is not glamorous. It is the floor manager. It does not write the policies, it does not choose the table, and it does not get invited to keynote conferences. It simply has to make the plan work on an actual machine with actual limits. That makes it one of the most important components in the whole affair.

The sandbox appears before the containers do

Before your application container starts, Kubernetes prepares a Pod sandbox.

This is one of those wonderfully unglamorous details that turns out to matter a great deal. A Pod is not just “a container.” It is a small execution environment that may contain one or more containers sharing networking and, often, storage.

To build that environment, several things need to happen.

First, the container runtime may need to pull the image from a registry if it is not already cached on the node. This step alone can keep a Pod waiting for longer than people expect, especially when the image is huge, the registry is slow, or somebody has built an image as if hard disk space were a personal insult.

Second, networking must be prepared. Kubernetes relies on a CNI plugin to create the Pod’s network namespace and assign an IP address. All containers in the same Pod share that network namespace, which is why they can communicate over ‘localhost’. This is convenient and occasionally dangerous, much like sharing a flat with someone who assumes every shelf in the fridge belongs to them.

Third, volumes are mounted. If the Pod references ‘emptyDir’, ‘configMap’, ‘secret’, or persistent volumes, those mounts have to be prepared before the containers can use them.

There is also a small infrastructure container, commonly called the ‘pause’ container, whose job is to hold the Pod’s shared namespaces in place. It is not famous, but it is essential. The ‘pause’ container is a bit like the quiet relative at a family gathering who does no storytelling, makes no dramatic entrance, and is nevertheless the reason the chairs are still standing.

Only after this setup is complete can the application containers begin.

Watching the lifecycle from the outside

You can observe part of this process with a few simple commands:

kubectl apply -f pod.yaml
kubectl get pod demo-pod -w
kubectl describe pod demo-pod

The watch output often gives the first visible clue that the cluster is busy doing considerably more than the neatness of YAML would suggest.

A Pod typically moves through a small set of phases:

  • ‘Pending’ means the Pod has been accepted but is still waiting for scheduling, image pulls, volume setup, or other preparation.
  • ‘Running’ means the Pod has been bound to a node and at least one container is running or starting.
  • ‘Succeeded’ means all containers completed successfully and will not be restarted.
  • ‘Failed’ means all containers have terminated and at least one exited with an error.
  • ‘Unknown’ means the control plane cannot reliably determine the Pod state, usually because communication with the node has gone sideways.

These phases are useful, but they do not tell the whole story. One of the more common sources of confusion is ‘CrashLoopBackOff’. That is not a Pod phase. It is a container state pattern shown in ‘kubectl get pods’ output when a container keeps crashing, and Kubernetes backs off before trying again.
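The back-off itself is simple: the kubelet waits a short interval before the first restart, roughly doubles it after each subsequent crash, and caps the wait at five minutes. A sketch of that growth curve (the ten-second base and the doubling are the commonly documented behavior; treat the exact constants as an assumption):

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Sketch of the kubelet's exponential restart back-off: the wait
    doubles after every crash until it reaches the five-minute cap."""
    delay, delays = base, []
    for _ in range(restarts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays
```

This is why a crashing container looks increasingly lazy over time: after a handful of failures, Kubernetes is deliberately waiting five minutes between attempts.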

This matters because people often stare at ‘Running’ and assume everything is fine. Kubernetes, meanwhile, is quietly muttering, “Technically yes, but only in the way a car is technically functional while smoke comes out of the bonnet.”

Running is not the same as ready

Another detail worth understanding is that a Pod can be running without being ready to receive traffic.

This distinction matters in real systems because applications often need a few moments to warm up, load configuration, establish database connections, or otherwise stop acting like startled wildlife.

A readiness probe tells Kubernetes when the container is actually prepared to serve requests. Until that probe succeeds, the Pod should not be considered a healthy backend for a Service.

Here is a minimal example:

readinessProbe:
  httpGet:
    path: /
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10

With this in place, the container may be running, but Kubernetes will wait before routing traffic to it. This is one of those details that prevents very expensive forms of optimism.
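The application's half of this contract is a readiness endpoint that answers honestly. A minimal sketch of the idea, with a time-based warm-up standing in for real startup work (loading config, opening connection pools, and so on):

```python
import time

class App:
    """Toy app that mimics warm-up: the readiness handler reports 503
    until startup work is done, then 200."""

    def __init__(self, warmup_seconds):
        self.started_at = time.monotonic()
        self.warmup_seconds = warmup_seconds

    def ready_status(self):
        # The HTTP status an /ready handler would return to the kubelet's probe.
        elapsed = time.monotonic() - self.started_at
        return 200 if elapsed >= self.warmup_seconds else 503
```

While `ready_status` returns 503, the probe fails, the Pod stays out of the Service's endpoints, and no traffic arrives. That is the whole trick.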

Deletion is a polite process until it is not

Now, let us look at the other end of the Pod’s life.

When you run the following command, the Pod does not vanish in a puff of administrative smoke:

kubectl delete pod demo-pod

Instead, the API server marks the Pod for deletion and sets a grace period. The Pod enters a terminating state. The kubelet on the node sees that instruction and begins shutdown.

The normal sequence looks like this:

  1. Kubernetes may first stop sending new traffic to the Pod if it is behind a Service and no longer considered ready.
  2. A ‘preStop’ hook runs if one has been defined.
  3. The kubelet asks the runtime to send ‘SIGTERM’ to the container’s main process.
  4. Kubernetes waits for the grace period, which is 30 seconds by default and controlled by ‘terminationGracePeriodSeconds’.
  5. If the process still refuses to exit, Kubernetes sends ‘SIGKILL’ and ends the discussion.

That grace period exists for good reasons. Applications may need time to flush logs, finish requests, close connections, write buffers, or otherwise clean up after themselves. Production systems tend to appreciate this courtesy.

Here is a small example of a graceful shutdown configuration:

terminationGracePeriodSeconds: 30
containers:
  - name: web
    image: nginx:1.27
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 5"]

Once the containers stop, Kubernetes cleans up the sandbox, releases network resources, unmounts volumes as needed, and frees the node’s CPU and memory.

If the Pod was managed by a Deployment, a replacement Pod will usually be created to maintain the desired replica count. This is an important point. In Kubernetes, individual Pods are disposable. The desired state is what matters. Pods come and go. The controller remains stubborn.

Why this matters in the real world

Understanding this lifecycle is not trivia for people who enjoy suffering through conference diagrams. It is practical.

If a Pod is stuck in ‘Pending’, you need to know whether the issue is scheduling, image pulling, volume attachment, or policy rejection.

If a container is ‘CrashLoopBackOff’, you need to know that the Pod object exists, has probably been scheduled, and that the failure is happening later in the chain.

If traffic is not reaching the application, you need to remember that ‘Running’ and ‘Ready’ are not the same thing.

If shutdowns are ugly, logs are truncated, or users get errors during rollout, you need to inspect readiness probes, ‘preStop’ hooks, and grace periods rather than blaming Kubernetes in the abstract, which it will survive, but your incident report may not.

This is also where commands like these become genuinely useful:

kubectl get pod demo-pod -o wide
kubectl describe pod demo-pod
kubectl logs demo-pod
kubectl get events --sort-by=.metadata.creationTimestamp

Those commands let you inspect node placement, container events, log output, and recent cluster activity. Most Kubernetes troubleshooting starts by figuring out which stage of the Pod lifecycle has gone wrong, then narrowing the problem from there.

The quiet machinery behind a simple command

The next time you type ‘kubectl apply -f pod.yaml’, it is worth remembering that you are not merely starting a container. You are triggering a chain of decisions and side effects across the control plane and a worker node.

The API server validates and records the request. The scheduler finds a suitable home. The kubelet coordinates the local work. The runtime pulls images and starts containers. The CNI plugin wires up networking. Volumes are mounted. Probes decide whether the Pod is truly ready. And when the time comes, Kubernetes tears the whole thing down with the brisk professionalism of hotel staff clearing a room before the next guest arrives.

Which is impressive, really.

Particularly when you consider that from your side of the terminal, it still looks as though you only asked for one modest little Pod.

Why generic auto scaling is terrible for healthcare pipelines

Let us talk about healthcare data pipelines. Running high volume payer processing pipelines is a lot like hosting a mandatory potluck dinner for a group of deeply eccentric people with severe and conflicting dietary restrictions. Each payer behaves with maddening uniqueness. One payer bursts through the door, demanding an entire roasted pig, which they intend to consume in three minutes flat. This requires massive, short-lived computational horsepower. Another payer arrives with a single boiled pea and proceeds to chew it methodically for the next five hours, requiring a small but agonizingly persistent trickle of processing power.

On top of this culinary nightmare, there are strict rules of etiquette. You absolutely must digest the member data before you even look at the claims data. Eligibility files must be validated before anyone is allowed to touch the dessert tray of downstream jobs. The workload is not just heavy. It is incredibly uneven and delightfully complicated.

Buying folding chairs for a banquet

On paper, AWS managed auto scaling should fix this problem. It is designed to look at a growing pile of work and automatically hire more help. But applying generic auto scaling to healthcare pipelines is like a restaurant manager seeing a line out the door and solving the problem by buying fifty identical plastic folding chairs.

The manager does not care that one guest needs a high chair and another requires a reinforced steel bench. Auto scaling reacts to the generic brute force of the system load. It cannot look at a specific payer and tailor the compute shape to fit their weird eating habits. It cannot enforce the strict social hierarchy of job priorities. It scales the infrastructure, but it completely fails to scale the intention.

This is why we abandoned the generic approach and built our own dynamic EC2 provisioning system. Instead of maintaining a herd of generic servers waiting around for something to do, we create bespoke servers on demand based on a central configuration table.

The ruthless nightclub bouncer of job scheduling

Let us look at how this actually works regarding prioritization. Our system relies on that central configuration table to dictate order. Think of this table as the guest list at an obnoxiously exclusive nightclub. Our scheduler acts as the ruthless bouncer.

When jobs arrive at the queue, the bouncer checks the list. Member data? Right this way to the VIP lounge, sir. Claims data? Stand on the curb behind the velvet rope until the members are comfortably seated. Generic auto scaling has no native concept of this social hierarchy. It just sees a mob outside the club and opens the front doors wide. Our dynamic approach gives us perfect, tyrannical control over who gets processed first, ensuring our pipelines execute in a beautifully deterministic way. We spin up exactly the compute we specify, exactly when we want it.
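The bouncer logic is, at its core, a priority queue keyed off the configuration table. A sketch of the idea in Python (the job types and priority numbers here are illustrative, not our real config table):

```python
import heapq
import itertools

# Central configuration table: lower number = closer to the VIP lounge.
# Job types and priorities are illustrative placeholders.
PRIORITY = {"member": 0, "eligibility": 1, "claims": 2}

class Bouncer:
    """Priority queue acting as the scheduler's guest list."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tiebreaker keeps FIFO within a tier

    def arrive(self, job_type, job_id):
        heapq.heappush(self._heap, (PRIORITY[job_type], next(self._order), job_id))

    def admit(self):
        """Let in the highest-priority job that has been waiting longest."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

However many claims files pile up at the curb, member and eligibility work always walks in first, and within a tier the queue stays first-come, first-served.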

Leaving your car running in the garage

Then there is the financial absurdity of warm pools. Standard auto scaling often relies on keeping a baseline of idle instances warm and ready, just in case a payer decides to drop a massive batch of files at two in the morning.

Keeping idle servers running is the technological equivalent of leaving your car engine idling in the closed garage all night just in case you get a sudden craving for a carton of milk at dawn. It is expensive, it is wasteful, and it makes you look a bit foolish when the AWS bill arrives.

Our dynamic system operates with a baseline of zero idle instances. Every dollar of compute maps to a job that is actually running, because we only pay for the exact capacity we use, precisely when we use it. Cost savings happen naturally when you refuse to pay for things that are sitting around doing nothing.
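To put rough numbers on the garage metaphor: every figure below is an illustrative assumption (a generic on-demand hourly rate, a 730-hour month, a made-up workload), not our actual bill.

```python
def monthly_cost(hourly_rate, hours):
    """Simple linear cost model: rate times hours."""
    return hourly_rate * hours

# Illustrative assumptions only.
RATE = 0.384            # USD/hour for a mid-size instance (assumed price)
HOURS_PER_MONTH = 730

# Warm pool: three idle instances running around the clock, just in case.
warm_pool = monthly_cost(RATE, HOURS_PER_MONTH) * 3

# Baseline of zero: sixty burst jobs a month, two hours of compute each.
on_demand = monthly_cost(RATE, 2) * 60
```

Under these toy numbers the warm pool costs roughly eighteen times more for the same amount of useful work, which is the polite way of saying the garage is full of exhaust fumes.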

A delightfully brutal server lifecycle

The operational model we ended up with is almost comically simple compared to traditional methods. A generic scaling group requires complex scaling policies, tricky cooldown periods, and endless tweaking of CloudWatch alarms. It is like managing a highly sensitive, moody teenager.

Our dynamic EC2 model is wonderfully ruthless. We create the instance and inject it with a single, highly specific purpose via a startup script. The instance wakes up, processes the healthcare data with absolute precision, and then politely self-destructs so it stops billing us. They are the mayflies of the cloud computing world. They live just long enough to do their job, and then they vanish. There are no orphaned instances wandering the cloud.
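A sketch of what that self-destructing launch looks like: a helper that builds the ‘run_instances’ request, where the AMI ID, instance type, and job command are placeholders. The trick is ‘InstanceInitiatedShutdownBehavior’, a real EC2 parameter that makes an in-instance shutdown terminate the VM instead of merely stopping it.

```python
def ephemeral_instance_request(ami_id, instance_type, job_command):
    """Build kwargs for boto3's ec2.run_instances() describing a
    single-purpose worker that terminates itself when the job ends.
    All argument values are illustrative placeholders."""
    user_data = "#!/bin/bash\n" + job_command + "\nshutdown -h now\n"
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # With this set, the shutdown at the end of the script terminates
        # the instance, so billing stops the moment the work is done.
        "InstanceInitiatedShutdownBehavior": "terminate",
        # boto3 base64-encodes UserData for you; pass the plain script.
        "UserData": user_data,
    }
```

The caller would hand this straight to ‘ec2_client.run_instances(**request)’. No reaper process, no orphan hunt: the mayfly carries its own expiry date.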

This dynamic provisioning model has fundamentally altered how we digest payer workloads. We have somehow achieved a weird but perfect holy grail of cloud architecture. We get the granular flexibility of serverless functions, the raw, unadulterated horsepower of dedicated EC2 instances, and the stingy cost efficiency of a pure event-driven design.

If your processing jobs vary wildly from payer to payer, and if you care deeply about enforcing priorities without burning money on idle metal, building a disposable compute army might be exactly what your architecture is missing. We said goodbye to our idle servers, and honestly, we do not miss them at all.

The lazy cloud architect guide to AWS automation

The shortcuts I use on every project now, after learning that scale mostly changes the bill, not the mistakes.

Let me tell you how this started. I used to measure my productivity by how many AWS services I could haphazardly stitch together in a single afternoon. Big mistake.

One night, I was deploying what should have been a boring, routine feature. Nothing fancy. Just basic plumbing. Six hours later, I was still babysitting the deployment, clicking through the AWS console like a caffeinated lab rat, re-running scripts, and manually patching up tiny human errors.

That is when the epiphany hit me like a rogue server rack. I was not slow because AWS is a labyrinth of complexity. I was slow because I was doing things manually that AWS already knows how to do in its sleep.

The patterns below did not come from sanitized tutorials. They were forged in the fires of shipping systems under immense pressure and desperately wanting my weekends back.

Event-driven everything and absolutely no polling

If you are polling, you are essentially paying Jeff Bezos for the privilege of wasting your own time. Polling is the digital equivalent of sitting in the backseat of a car and constantly asking, “Are we there yet?” every five seconds.

AWS is an event machine. Treat it like one. Instead of writing cron jobs that anxiously ask the database if something changed, just let AWS tap you on the shoulder when it actually happens.

Where this shines:

  • File uploads
  • Database updates
  • Infrastructure state changes
  • Cross-account automation

Example of reacting to an S3 upload instantly:

def lambda_handler(event, context):
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']

        # Stop asking if the file is there. AWS just handed it to you.
        trigger_completely_automated_workflow(bucket_name, object_key)

No loops. No waiting. Just action.

Pro tip: Event-driven systems fail less frequently simply because they do less work. They are the lazy geniuses of the cloud world.

Immutable deployments or nothing

SSH is not a deployment strategy. It is a desperate cry for help.

If your deployment plan involves SSH, SCP, or uttering the cursed phrase “just this one quick change in production”, you do not have a system. You have a fragile ecosystem built on hope and duct tape. I stopped “fixing” servers years ago. Now, I just murder them and replace them with fresh clones.

The pattern is brutally simple:

  1. Build once
  2. Deploy new
  3. Destroy old

Example of launching a new EC2 version programmatically:

import boto3

ec2_client = boto3.client('ec2', region_name='eu-west-1')
response = ec2_client.run_instances(
    ImageId='ami-0123456789abcdef0', # Totally fake AMI
    InstanceType='t3a.nano',
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Purpose', 'Value': 'EphemeralClone'}]
    }]
)

It is like doing open-heart surgery. Instead of trying to fix the heart while the patient is running a marathon, just build a new patient with a healthy heart and disintegrate the old one. When something breaks, I do not debug the server. I debug the build process. That is where the real parasites live.

Infrastructure as code for the forgettable things

Most teams only use IaC for the big, glamorous stuff. VPCs. Kubernetes clusters. Massive databases.

This is completely backwards. It is like wearing a bespoke tuxedo but forgetting your underwear. The small, forgettable resources are the ones that will inevitably bite you when you least expect it.

What I automate with religious fervor:

  • IAM roles
  • Alarms
  • Schedules
  • Policies
  • Log retention

Example of creating a CloudWatch alarm in code:

import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="QueueIsExploding",
    MetricName="ApproximateNumberOfMessagesVisible",
    Namespace="AWS/SQS",
    # Dimensions pin the alarm to one queue (name is illustrative)
    Dimensions=[{"Name": "QueueName", "Value": "my-work-queue"}],
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=1,
    Period=300,
    Statistic="Sum"
)

If it matters in production, it lives in code. No exceptions.

Let Step Functions own the flow

Early in my career, I crammed all my business logic into Lambdas. Retries, branching, timeouts, bizarre edge cases. I treated them like a digital junk drawer.

I do not do that anymore. Lambdas should be as dumb and fast as a golden retriever chasing a tennis ball.

The new rule: One Lambda equals one job. If you need a workflow, use Step Functions. They are the micromanaging middle managers your architecture desperately needs.

Example of a simple workflow state:

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:DoOneThingWell",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 3,
      "MaxAttempts": 2
    }
  ],
  "Next": "CelebrateSuccess"
}

This separation makes debugging highly visual, makes retries explicit, and makes onboarding the new guy infinitely less painful. Your future self will thank you.

Kill cron jobs and use managed schedulers

Cron jobs are perfectly fine until they suddenly are not.

They are the ghosts of your infrastructure. They are completely invisible until they fail, and when they do fail, they die in absolute silence like a ninja with a sudden heart condition. AWS gives you managed scheduling. Just use it.

Why this is fundamentally faster:

  • Central visibility
  • Built-in retries
  • IAM-native permissions

Example of creating a scheduled rule:

import boto3

eventbridge = boto3.client("events")
eventbridge.put_rule(
    Name="TriggerNightlyChaos",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
    Description="Wakes up the system when nobody is looking"
)

Automation should be highly observable. Cron jobs are just waiting in the dark to ruin your Tuesday.

Bake cost controls into automation

Speed without cost awareness is just a highly efficient way to bankrupt your employer. The fastest teams I have ever worked with were not just shipping fast. They were failing cheaply.

What I automate now with the ruthlessness of a debt collector:

  • Budget alerts
  • Resource TTLs
  • Auto-shutdowns for non-production environments

Example of tagging resources with an expiration date:

import boto3

ec2 = boto3.client("ec2")
ec2.create_tags(
    Resources=['i-0deadbeef12345678'],
    Tags=[
        {"Key": "TerminateAfter", "Value": "2026-12-31"},
        {"Key": "Owner", "Value": "TheVoid"}
    ]
)

Leaving resources without an owner or an expiration date is like leaving the stove on, except this stove bills you by the millisecond. Anything without a TTL is just technical debt waiting to invoice you.
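Tags only bite if something enforces them. A sketch of the reaper side, working over data shaped like the tag lists that ‘describe_instances’ returns (the tag key matches the example above; the fleet data is hypothetical):

```python
from datetime import date

def expired_instances(instances, today=None):
    """Return IDs of instances whose TerminateAfter tag date has passed.
    `instances` mirrors the Tags shape from EC2 describe_instances."""
    today = today or date.today()
    doomed = []
    for inst in instances:
        tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
        ttl = tags.get("TerminateAfter")
        if ttl and date.fromisoformat(ttl) < today:
            doomed.append(inst["InstanceId"])
    return doomed
```

Run this on a schedule, feed the result to ‘terminate_instances’, and the stove switches itself off. Untagged resources surviving the reaper is a policy decision; mine get a very short appeal process.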

A quote I live by: “Automation does not cut costs by magic. It cuts costs by quietly preventing the expensive little mistakes humans call normal.”

The death of the cloud hero

These patterns did not make me faster because they are particularly clever. They made me faster because they completely eliminated the need to make decisions.

Less clicking. Less remembering. Absolutely zero heroics.

If you want to move ten times faster on AWS, stop asking what to build next and start asking what you can stop doing by hand. Once automation is in charge, real speed arrives as work you no longer have to remember.

The profitable art of being difficult to replace

I once held the charmingly idiotic belief that net worth was directly correlated to calorie expenditure. As a younger man staring up at the financial stratosphere where the ultra-high earners floated, I assumed their lives were a relentless marathon of physiological exertion. I pictured CEOs and Senior Architects sweating through their Italian suits, solving quadratic equations while running on treadmills, their cortisol levels permanently redlining as they suffered for every single cent.

It was a comforting delusion because it implied the universe was a meritocracy based on thermodynamics. It suggested that if I just gritted my teeth hard enough and pushed until my vision blurred, the universe would eventually hand me a corner office and a watch that cost more than my first car.

Then I entered the actual workforce and realized that the universe is not fair. Worse than that, it is not even logical. The market does not care about your lactic acid buildup. In fact, there seems to be an inverse relationship between how much your back hurts at the end of the day and how many zeros are on your paycheck.

The thermodynamic lie of manual labor

Consider the holiday season retail worker. If you have ever worked in a shop during December, you know it is less of a job and more of a biological stress test designed by a sadist. You are on your feet for eight hours. You are smiling at people who are actively trying to return a toaster they clearly dropped in a bathtub. You are lifting boxes, dodging frantic shoppers, and absorbing the collective anxiety of a population that forgot to buy gifts until Christmas Eve.

It is physically draining, emotionally taxing, and mentally numbing. By any objective measure of human suffering, it is “hard work.”

And yet the compensation for this marathon of patience is often a number that barely covers the cost of the therapeutic insoles you need to survive the shift. If hard work were the currency of wealth, the person stacking shelves at 2 AM would be buying the yacht. Instead, they are usually the ones waiting for the night bus while the mall owner sleeps soundly in a bed that probably costs more than the worker’s annual rent.

This is the brutal reality of the labor market. We are not paid for the calories we burn. We are not paid for the “effort” in the strict physics sense of work equals force times distance. We are paid based on a much colder, less human metric. We are paid based on how annoying it would be to find someone else to do it.

The lucrative business of sitting very still

Let us look at my current reality as a DevOps engineer and Cloud Architect. My daily caloric burn is roughly equivalent to a hibernating sloth. While a construction worker is dissolving their kneecaps on concrete, I am sitting in an ergonomic chair designed by NASA, getting irrationally upset because my coffee is slightly below optimal temperature.

To an outside observer, my job looks like a scam. I type a few lines of YAML. I stare at a progress bar. I frown at a dashboard. Occasionally, I sigh dramatically to signal to my colleagues that I am doing something very complex with Kubernetes.

And yet the market values this sedentary behavior at a premium. Why?

It is certainly not because typing is difficult. Most people can type. It is not because I am working “harder” than the retail employee. I am definitely not. The reason is fear. Specifically, the fear of what happens when the progress bar turns red.

We are not paid for the typing. We are paid because we are the only ones willing to perform open-heart surgery on a zombie platform while the CEO watches. The ability to stare into the abyss of a crashing production database without vomiting is a rare and expensive evolutionary trait.

Companies do not pay us for the hours when everything is working. They pay us a retainer fee for the fifteen minutes a year when the entire digital infrastructure threatens to evaporate. We are basically insurance policies that drink too much caffeine.

The panic tax

This brings us to the core of the salary misunderstanding. Most technical professionals think they are paid to build things. This is only partially true. We are largely paid to absorb panic.

When a server farm goes dark, the average business manager experiences a visceral fight-or-flight response. They see revenue dropping to zero. They see lawsuits. They see their bonus fluttering away like a moth. The person who can walk into that room, look at the chaos, and say “I know which wire to wiggle” is not charging for the wire-wiggling. They are charging a “Panic Tax.”

The harder the problem is to understand, and the fewer people there are who can stomach the risk of solving it, the higher the tax you can levy.

If your job can be explained to a five-year-old in a single sentence, you are likely underpaid. If your job involves acronyms that sound like a robotic sneeze and requires you to understand why a specific version of a library hates a specific version of an operating system, you are in the money.

You are being paid for the obscurity of your suffering, not the intensity of it.

The golden retriever replacement theory

To understand your true value, you have to look at yourself with the cold, unfeeling eyes of a hiring manager. You have to ask yourself how easy it would be to replace you.

If you are a generalist who works very hard, follows all the rules, and does exactly what is asked, you are a wonderful employee. You are also doomed. To the algorithm of capitalism, a generalist worker is essentially a standard spare part. If you vanish, the organization simply scoops another warm body from the LinkedIn gene pool and plugs it into the socket before the seat gets cold.

However, consider the engineer who manages the legacy authentication system. You know the one. The system was written ten years ago by a guy named Dave who didn’t believe in documentation and is now living in a yurt in Montana. The code is a terrifying plate of spaghetti that somehow processes payments.

The engineer who knows how to keep Dave’s ghost alive is not working “hard.” They might spend four hours a day reading Reddit. But if they leave, the company stops making money. That engineer is difficult to replace.

This is the goal. You do not want to be the shiny new cog that fits perfectly in the machine. You want to be the weird, knobby, custom-forged piece of metal that holds the entire transmission together. You want to be the structural integrity of the department.

This does not mean you should hoard knowledge or refuse to document your work. That makes you a villain, not an asset. It means you should tackle the problems that are so messy, so risky, and so complex that other people are afraid to touch them.

The art of being a delightful bottleneck

There is a nuance here that is often missed. Being difficult to replace does not mean being difficult to work with. There is a specific type of IT professional who tries to create job security by being the “Guru on the Mountain.” They are grumpy, they refuse to explain anything, and they treat every question as a personal insult.

Do not be that person. Companies will tolerate that person for a while, but they will actively plot to replace them. It is a resentment-based retention strategy.

The profitable approach is to be the “Delightful Bottleneck.” You are the only one who can solve the problem, but you are also happy to help. You become the wizard who saves the day, not the troll under the bridge who demands a toll.

When you position yourself as the only person who can navigate the complexity of the cloud architecture, and you do it with a smile, you create a dependency that feels like a partnership. Management stops looking for your replacement and starts looking for ways to keep you happy. That is when the salary negotiations stop being a battle and start being a formality.

Navigating the scarcity market

If you want to increase your salary, stop trying to increase your effort. You cannot physically work harder than a script. You cannot out-process a serverless function. You will lose that battle every time because biology is inefficient.

Instead, focus on lowering your replaceability.

Niche down until it hurts. Find a corner of the cloud ecosystem that makes other developers wince. Learn the tools that are high in demand but low in experts because the documentation is written in riddles. It is not about working harder. It is about positioning yourself in the market where the supply line is thin and the desperation is high.

Look for the “unsexy” problems. Everyone wants to work on the new AI features. It is shiny. It is fun. It is great for dinner party conversation. But because everyone wants to do it, the supply of labor is high.

Fewer people want to work on compliance automation, security governance, or mainframe migration. These tasks are the digital equivalent of plumbing. They are not glamorous. They involve dealing with sludge. But when the toilet backs up, the plumber can charge whatever they want because nobody else wants to touch it.

Final thoughts on leverage

We often confuse motion with progress. We confuse exhaustion with value. We have been trained since school to believe that the student who studies the longest gets the best grade.

The market does not care about your exhaustion. It cares about your leverage.

Leverage comes from specific knowledge. It comes from owning a problem set that scares other people. It comes from being the person who can walk into a room where everyone is panicking and lower the collective blood pressure by simply existing.

Do not grind yourself into dust trying to be the hardest worker in the room. Be the most difficult one to replace. It pays better, and your lower back will thank you for it.

How we ditched AWS ELB and accidentally built a time machine

I was staring at our AWS bill at two in the morning, nursing my third cup of coffee, when I realized something that should have been obvious months earlier. We were paying more to distribute our traffic than to process it. Our Application Load Balancer, that innocent-looking service that simply forwards packets from point A to point B, was consuming $3,900 every month. That is $46,800 a year. For a traffic cop. A very expensive traffic cop that could not even handle our peak loads without breaking into a sweat.

The particularly galling part was that we had accepted this as normal. Everyone uses AWS load balancers, right? They are the standard, the default, the path of least resistance. It is like paying rent for an apartment you only use to store your shoes. Technically functional, financially absurd.

So we did what any reasonable engineering team would do at that hour. We started googling. And that is how we discovered IPVS, a technology so old that half our engineering team had not been born when it was first released. IPVS stands for IP Virtual Server, which sounds like something from a 1990s hacker movie, and honestly, that is not far off. It was written in 1998 by a fellow named Wensong Zhang, who presumably had no idea that twenty-eight years later, a group of bleary-eyed engineers would be using his code to save more than forty-six thousand dollars a year.

The expensive traffic cop

To understand why we were so eager to jettison our load balancer, you need to understand how AWS pricing works. Or rather, how it accumulates like barnacles on the hull of a ship, slowly dragging you down until you wonder why you are moving so slowly.

An Application Load Balancer costs $0.0225 per hour. That sounds reasonable, about sixteen dollars a month. But then there are LCUs, or Load Balancer Capacity Units, which charge you for every new connection, every rule evaluation, every processed byte. It is like buying a car and then discovering you have to pay extra every time you turn the steering wheel.
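To make the accumulation concrete, here is a back-of-envelope sketch of where a $3,900 ALB month comes from. The $0.0225 hourly rate is from the article; the $0.008 per LCU-hour figure is our assumption based on published us-east-1 pricing, so check your own region's rate card:

```python
# Back-of-envelope ALB cost model. Base rate from the article;
# LCU rate assumed from published us-east-1 pricing.
HOURS_PER_MONTH = 730
ALB_HOURLY = 0.0225        # base charge per ALB-hour
LCU_HOURLY = 0.008         # assumed charge per LCU-hour

base_monthly = ALB_HOURLY * HOURS_PER_MONTH    # the "reasonable" part
lcu_monthly = 3900 - base_monthly              # everything else is LCUs
avg_lcus = lcu_monthly / (LCU_HOURLY * HOURS_PER_MONTH)

print(f"Base charge:  ${base_monthly:.2f}/month")
print(f"LCU charge:   ${lcu_monthly:.2f}/month")
print(f"Implied load: ~{avg_lcus:.0f} LCUs sustained")
```

The base charge really is only about sixteen dollars; the other 99.6 percent of the bill is LCUs, which is exactly the steering-wheel surcharge described above.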

In practice, this meant our ALB was consuming fifteen to twenty percent of our entire infrastructure budget. Not for compute, not for storage, not for anything that actually creates value. Just for forwarding packets. It was the technological equivalent of paying a butler to hand you the remote control.

The ALB also had some architectural quirks that made us scratch our heads. It terminated TLS, which sounds helpful until you realize we were already terminating TLS at our ingress. So we were decrypting traffic, then re-encrypting it, then decrypting it again. It was like putting on a coat to go outside, then taking it off and putting on another identical coat, then finally going outside. The security theater was strong with this one.

A trip to 1999

I should confess that when we started this project, I had no idea what IPVS even stood for. I had heard it mentioned in passing by a colleague who used to work at a large Chinese tech company, where apparently everyone uses it. He described it with the kind of reverence usually reserved for vintage wine or classic cars. “It just works,” he said, which in engineering terms is the highest possible praise.

IPVS, I learned, lives inside the Linux kernel itself. Not in a container, not in a microservice, not in some cloud-managed abstraction. In the actual kernel. This means when a packet arrives at your server, the kernel looks at it, consults its internal routing table, and forwards it directly. No context switches, no user-space handoffs, no “let me ask my manager” delays. Just pure, elegant packet forwarding.

The first time I saw it in action, I felt something I had not felt in years of cloud engineering. I felt wonder. Here was code written when Bill Clinton was president, when the iPod was still three years away, when people used modems to connect to the internet. And it was outperforming a service that AWS charges thousands of dollars for. It was like discovering that your grandfather’s pocket watch keeps better time than your smartwatch.

How the magic happens

Our setup is almost embarrassingly simple. We run a DaemonSet called ipvs-router on dedicated, tiny nodes in each Availability Zone. Each pod does four things, and it does them with the kind of efficiency that makes you question everything else in your stack.

First, it claims an Elastic IP using kube-vip, a CNCF project that lets Kubernetes pods take ownership of spare EIPs. No AWS load balancer required. The pod simply announces “this IP is mine now”, and the network obliges. It feels almost rude how straightforward it is.

Second, it programs IPVS in the kernel. IPVS builds an L4 load-balancing table that forwards packets at line rate. No proxies, no user-space hops. The kernel becomes your load balancer, which is a bit like discovering your car engine can also make excellent toast. Unexpected, but delightful.
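For the curious, the table our router pods program can also be built by hand with `ipvsadm`, which is a good way to get a feel for what the kernel is actually doing. A minimal sketch, with an illustrative VIP and pod addresses:

```shell
# Create a virtual service on the VIP (round-robin scheduling)
ipvsadm -A -t 203.0.113.10:443 -s rr

# Register two backend pods; -g selects gatewaying mode,
# i.e. Direct Server Return
ipvsadm -a -t 203.0.113.10:443 -r 10.0.1.23:443 -g
ipvsadm -a -t 203.0.113.10:443 -r 10.0.2.41:443 -g

# Inspect the resulting kernel forwarding table
ipvsadm -L -n
```

Three commands and the kernel is a load balancer. This requires root and the `ip_vs` kernel module, which is why the pods need the capabilities shown further down.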

Third, it syncs with Kubernetes endpoints. A lightweight controller watches for new pods, and when one appears, IPVS adds it to the rotation in less than a hundred milliseconds. Scaling feels instantaneous because, well, it basically is.

But the real trick is the fourth thing. We use something called Direct Server Return, or DSR. Here is how it works. When a request comes in, it travels from the client to IPVS to the pod. But the response goes directly from the pod back to the client, bypassing the load balancer entirely. The load balancer never sees response traffic. That is how we get ten times the throughput. It is like having a traffic cop who only directs cars into the city but does not care how they leave.
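One prerequisite the traffic-cop analogy hides: with DSR, each backend must quietly accept packets addressed to the VIP without answering ARP for it, or responses never leave. The standard Linux recipe is to put the VIP on the loopback interface and suppress ARP for it; a sketch, with an illustrative VIP:

```shell
# On each backend: hold the VIP on loopback so the kernel
# accepts packets addressed to it
ip addr add 203.0.113.10/32 dev lo

# Suppress ARP for the VIP so only the router answers for it
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
```

Skip this and IPVS will happily forward requests into backends that silently drop them, which is a fun afternoon of debugging we would not wish on anyone.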

The code that makes it work

Here is what our DaemonSet looks like. I have simplified it slightly for readability, but this is essentially what runs in our production cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ipvs-router
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ipvs-router
  template:
    metadata:
      labels:
        app: ipvs-router
    spec:
      hostNetwork: true
      containers:
      - name: ipvs-router
        image: ghcr.io/kube-vip/kube-vip:v0.8.0
        args:
        - manager
        env:
        - name: vip_arp
          value: "true"
        - name: port
          value: "443"
        - name: vip_interface
          value: eth0
        - name: vip_cidr
          value: "32"
        - name: cp_enable
          value: "true"
        - name: cp_namespace
          value: kube-system
        - name: svc_enable
          value: "true"
        - name: vip_leaderelection
          value: "true"
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW

The key here is hostNetwork: true, which gives the pod direct access to the host’s network stack. Combined with the NET_ADMIN capability, this allows IPVS to manipulate the kernel’s routing tables directly. It requires a certain level of trust in your containers, but then again, so does running a load balancer in the first place.

We also use a custom controller to sync Kubernetes endpoints with IPVS. Here is the core logic:

# Simplified endpoint sync logic. Assumes a configured Kubernetes
# client (k8s_client) and a VIP constant defined elsewhere in the
# controller; a production version would also prune stale backends.
import subprocess

def sync_endpoints(service_name, namespace):
    # Get current endpoints from Kubernetes
    endpoints = k8s_client.list_namespaced_endpoints(
        namespace=namespace,
        field_selector=f"metadata.name={service_name}"
    )

    # Extract pod IPs
    pod_ips = []
    for subset in endpoints.items[0].subsets:
        for address in subset.addresses:
            pod_ips.append(address.ip)

    # Ensure the virtual service exists before adding real servers
    # (check=False: ipvsadm exits non-zero if it is already defined)
    subprocess.run(
        ["ipvsadm", "-A", "-t", f"{VIP}:443", "-s", "rr"],
        check=False
    )

    # Register each pod as a real server; the -g flag selects
    # gatewaying mode, which enables Direct Server Return (DSR)
    for ip in pod_ips:
        subprocess.run([
            "ipvsadm", "-a", "-t",
            f"{VIP}:443", "-r", f"{ip}:443", "-g"
        ])

    return len(pod_ips)

The numbers that matter

Let me tell you about the math, because the math is almost embarrassing for AWS. Our old ALB took about five milliseconds to set up a new connection. IPVS takes less than half a millisecond. That is not an improvement. That is a different category of existence. It is the difference between walking to the shops and being teleported there.

While our ALB would start getting nervous around one hundred thousand concurrent connections, IPVS just does not. It could handle millions. The only limit is how much memory your kernel has, which in our case meant we could have hosted the entire internet circa 2003 without breaking a sweat.

In terms of throughput, our ALB topped out around 2.5 gigabits per second. IPVS saturates the 25-gigabit NIC on our c7g.medium instances. That is ten times the throughput, for those keeping score at home. The load balancer stopped being the bottleneck, which was refreshing because previously it had been like trying to fill a swimming pool through a drinking straw.

But the real kicker is the cost. Here is the breakdown. We run one c7g.medium spot instance per availability zone, three zones total. Each costs about $0.017 per hour. That is $0.051 per hour for compute. We also have three Elastic IPs at $0.005 per hour each, which is $0.015 per hour. With Direct Server Return, outbound transfer costs are effectively zero because responses bypass the load balancer entirely.

The total? A mere $0.066 per hour. Divide that among three availability zones, and you're looking at roughly $0.022 per hour per zone. That's a little over two cents per hour. Let's not call it optimization, let's call it a financial exorcism. We went from shelling out $3,900 a month to a modest $48. The savings alone could probably fund a very capable engineer's caffeine habit.
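For anyone checking our arithmetic, the whole cost model fits in a few lines of Python (prices as quoted above; spot prices fluctuate, so treat these as a snapshot):

```python
# Reproducing the IPVS setup cost math with the prices quoted
# in the article (spot and EIP rates vary by region and over time)
HOURS_PER_MONTH = 730
ZONES = 3
SPOT_HOURLY = 0.017   # c7g.medium spot, per instance
EIP_HOURLY = 0.005    # per Elastic IP

hourly = ZONES * (SPOT_HOURLY + EIP_HOURLY)   # $0.066/hour total
monthly = hourly * HOURS_PER_MONTH            # roughly $48/month
savings = 3900 - monthly                      # versus the old ALB bill

print(f"Hourly:  ${hourly:.3f}")
print(f"Monthly: ${monthly:.2f}")
print(f"Saved:   ${savings:.2f}/month")
```

With DSR keeping response traffic off the routers, there is no per-byte line item to add, which is what makes the comparison so lopsided.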

But what about L7 routing

At this point, you might be raising a valid objection. IPVS is dumb L4. It does not inspect HTTP headers, it does not route based on gRPC metadata, and it does not care about your carefully crafted REST API conventions. It just forwards packets based on IP and port. It is the postal worker of the networking world. Reliable, fast, and utterly indifferent to what is in the envelope.

This is where we layer in Envoy, because intelligence should live where it makes sense. Here is how the request flow works. A client connects to one of our Elastic IPs. IPVS forwards that connection to a random healthy pod. Inside that pod, an Envoy sidecar inspects the HTTP/2 headers or gRPC metadata and routes to the correct internal service.

The result is L4 performance at the edge and L7 intelligence at the pod. We get the speed of kernel-level packet forwarding combined with the flexibility of modern service mesh routing. It is like having a Formula 1 engine in a car that also has comfortable seats and a good sound system. Best of both worlds. Our Envoy configuration looks something like this:

static_resources:
  listeners:
  - name: ingress_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 443
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress
          route_config:
            name: local_route
            virtual_hosts:
            - name: api
              domains:
              - "api.ourcompany.com"
              routes:
              - match:
                  prefix: "/v1/users"
                route:
                  cluster: user_service
              - match:
                  prefix: "/v1/orders"
                route:
                  cluster: order_service
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

The afternoon we broke everything

I should mention that our first attempt did not go smoothly. In fact, it went so poorly that we briefly considered pretending the whole thing had never happened and going back to our expensive ALBs.

The problem was DNS. We pointed our api.ourcompany.com domain at the new Elastic IPs, and then we waited. And waited. And nothing happened. Traffic was still going to the old ALB. It turned out that our DNS provider had a TTL of one hour, which meant that even after we updated the record, most clients were still using the old IP address for, well, an hour.

But that was not the real problem. The real problem was that we had forgotten to update our health checks. Our monitoring system was still pinging the old ALB’s health endpoint, which was now returning 404s because we had deleted the target group. So our alerts were going off, our pagers were buzzing, and our on-call engineer was having what I can only describe as a difficult afternoon.

We fixed it, of course. Updated the health checks, waited for DNS to propagate, and watched as traffic slowly shifted to the new setup. But for about thirty minutes, we were flying blind, which is not a feeling I recommend to anyone who values their peace of mind.

Deploying this yourself

If you are thinking about trying this yourself, the good news is that it is surprisingly straightforward. The bad news is that you will need to know your way around Kubernetes and be comfortable with the idea of pods manipulating kernel networking tables. If that sounds terrifying, perhaps stick with your ALB. It is expensive, but it is someone else’s problem.

Here is the deployment process in a nutshell. First, deploy the DaemonSet. Then allocate some spare Elastic IPs in your subnet; the pods will auto-claim them using kube-vip. Also, ensure your worker node IAM roles have permission to reassociate Elastic IPs, or your pods will shout into the void without anyone listening. Update your DNS to point at the new IPs, using latency-based routing if you want to be fancy. Then watch as your ALB target group drains, and delete the ALB next week after you are confident everything is working.

One quirk in AWS networking can ruin your afternoon along the way: the source/destination check. By default, EC2 instances reject traffic that does not match their assigned IP address. Since this setup explicitly relies on handling traffic for IP addresses the instance does not technically own (the Virtual IPs), AWS treats it as suspicious activity and drops the packets. You must disable the source/destination check on any instance running these router pods. It is a simple checkbox in the console, but forgetting it is the difference between a working load balancer and a black hole.
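That checkbox can also be flipped from the command line, which is easier to bake into provisioning scripts. A sketch with the AWS CLI (the instance ID is a placeholder):

```shell
# Disable the source/destination check so the instance may
# forward traffic for IPs it does not own (required for the VIPs)
aws ec2 modify-instance-attribute \
    --instance-id i-0123456789abcdef0 \
    --no-source-dest-check
```

If you manage the nodes with infrastructure-as-code, set the equivalent attribute there so replacement instances come up correctly without anyone remembering the checkbox.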

The whole setup takes about three hours the first time, and maybe thirty minutes if you do it again. Three hours of work for $46,000 per year in savings. That is $15,000 per hour, which is not a bad rate by anyone’s standards.

What we learned about cloud computing

Three months after we made the switch, I found myself at an AWS conference, listening to a presentation about their newest managed load balancing service. It was impressive, all machine learning and auto-scaling and intelligent routing. It was also, I calculated quietly, about four hundred times more expensive than our little IPVS setup.

I did not say anything. Some lessons are better learned the hard way. And as I sat there, sipping my overpriced conference coffee, I could not help but smile.

AWS managed services are built for speed of adoption and lowest-common-denominator use cases. They are not built for peak efficiency, extreme performance, or cost discipline. For foundational infrastructure like load balancing, a little DIY unlocks exponential gains.

The embarrassing truth is that we should have done this years ago. We were so accustomed to reaching for managed services that we never stopped to ask whether we actually needed them. It took a 2 AM coffee-fueled bill review to make us question the assumptions we had been carrying around.

Sometimes the future of cloud computing looks a lot like 1999. And honestly, that is exactly what makes it beautiful. There is something deeply satisfying about discovering that the solution to your expensive modern problem was solved decades ago by someone working on a much simpler internet, with much simpler tools, and probably much more sleep.

Wensong Zhang, wherever you are, thank you. Your code from 1998 is still making engineers happy in 2026. That is not a bad legacy for any piece of software.

The author would like to thank his patient colleagues who did not complain (much) during the DNS propagation incident, and the kube-vip maintainers who answered his increasingly desperate questions on Slack.