June 2026

Why multi cloud architecture is overrated

Somewhere along the way, multi cloud stopped being an architecture decision and became a personality trait. It shows up on slide decks the way kale shows up at weddings: nobody is quite sure who asked for it, but removing it now would feel like a confession. Say the words out loud in a meeting and watch the room nod with the solemn confidence of people who have read the headline but not the footnotes.

The pitch is seductive. Spread your workloads across two or three providers and you get freedom, resilience, and a vague aura of technical adulthood. What you actually get, in a surprising number of cases, is a second pager rotation and a networking bill that reads like a ransom note.

This is not an argument that multi cloud is always wrong. It is an argument that it should rarely be the default, least of all as the answer to a question nobody bothered to ask. A lot of cloud decisions get made the way people buy treadmills, for the version of themselves they intend to become, not the one currently standing in the room.

What the multi cloud pitch is really selling

Strip away the architecture diagrams, and the multi cloud sales pitch is mostly emotional. It promises escape from vendor lock-in, that primal fear of being trapped by a provider who knows it. It offers the warm feeling of resilience, the sense that if one cloud falls over, your business will calmly stroll to the next one. And it projects maturity, the impression that your team has graduated from amateur hour to serious infrastructure people.

These are real desires. The problem is that multi cloud answers them roughly the way buying a second house answers your fear of a leaky roof. Technically, you now have options. You also have two roofs.

Where the trouble starts

The fantasy version of multi cloud assumes a clean layer of abstraction sitting neatly on top of every provider, so your team writes once and runs anywhere. The practical version is that each cloud is its own country, with its own language, customs, and bewildering opinions about what a load balancer should be called.

Total abstraction almost never survives contact with production. Your engineers do not get to learn one platform deeply. They get to learn two platforms simultaneously, plus the seam between them, which is where the genuinely interesting bugs live. The architecture becomes harder to document, harder to operate, and harder to evolve, which is a polite way of saying nobody on the team fully understands all of it anymore, including the person who built it.

The costs nobody put on the slide

Here is where the war story usually turns grim. The expensive part of multi cloud is rarely the compute. It is everything wrapped around it.

You duplicate your observability stack, because the dashboards that work beautifully in one cloud are politely useless in the other. Your IAM and governance model doubles in surface area, since every permission, role, and policy now needs a twin that behaves identically and never quite does. Network egress, cross-cloud replication, and data transfer charges accumulate quietly in the background like a subscription you forgot to cancel.

And then there is the human cost, which never appears in the cost calculator. Multi cloud raises the seniority floor of your entire team. The junior engineer who could safely ship in a single, well-governed environment now needs to understand two of everything before they can be trusted near production. You do not just pay for more infrastructure. You pay for more expertise to keep that infrastructure from quietly drifting apart.

Security and compliance, now in stereo

Security people have a particular look they get when you mention multi cloud, the expression of someone who has been asked to childproof two houses for the price of one.

Every additional cloud is another attack surface, another set of policies to keep synchronized, and another generous opportunity for a misconfiguration to go unnoticed. Auditing becomes an exercise in translation, because a control that means one thing in the first provider means something subtly different in the second. The realistic end state is not airtight redundancy. It is an uneven security posture, strong where your team has invested attention and quietly soft where they ran out of hours.

Operations and the art of the unsolvable incident

Incidents are where multi cloud presents its invoice. At three in the morning, when something is on fire, you do not want your logs, metrics, and traces scattered across two providers like a crime scene split between jurisdictions.

Correlating an incident across clouds is slow, and slow is expensive when customers are watching. Mean time to detection creeps up. Mean time to resolution creeps up with it. Postmortems acquire an extra paragraph that always begins with some variation of “we lost time understanding which cloud was actually responsible.” To stitch the picture back together, you end up leaning on yet another external platform whose only job is to make your two clouds look like one, which, if you squint, is a strange amount of effort to undo a decision you made on purpose.

Portability you have versus portability you imagine

The intellectual cornerstone of multi cloud is portability, the comforting belief that you could pack up and move providers whenever you liked. In practice, portability tends to be the gym membership of architecture, fully paid for, rarely used, and quietly aspirational.

Most systems end up leaning on provider-specific services anyway, because those services are good and saying no to them on principle is a luxury few teams can afford. Kubernetes and containers genuinely help; they smooth the edges, but they do not magically erase the dependency underneath. Real migrations are still slow, still costly, and still the kind of project that gets proposed with enthusiasm and abandoned with relief.

The honest goal is not total independence. It is a reasonable exit, an architecture where leaving would be painful but possible, rather than one where leaving is theoretically free and practically unthinkable. Designing for a sensible escape route beats promising a freedom you will never exercise.

When multi cloud actually earns its keep

None of this means multi cloud is a mistake. It means it is a tool with a narrow, legitimate set of jobs, and it deserves to be used for those rather than worn as a badge.

It earns its place when regulation or data sovereignty leaves you no choice, when the data legally must live in particular places. It makes sense when your organization is large enough that real negotiating leverage over providers translates into serious money. It is justified when the business risk is so high that genuine cross-cloud redundancy is worth its considerable price. It is reasonable when a specific service on another cloud offers an advantage you cannot replicate elsewhere. And it works, crucially, only when the organization is mature enough to operate it, with the teams, processes, and discipline already in place rather than hopefully on order.

Notice that every one of these starts with a concrete business reason, not a vibe.

What I would do instead

If you handed me a blank slate and a reasonable budget, I would pick one primary cloud and govern it properly, because a single environment run well beats two environments run anxiously almost every time.

I would design for enough portability to sleep at night, not for the fantasy of frictionless migration. I would automate deployments and infrastructure from day one, before the shortcuts calcify into tradition. I would standardize observability, security, and cost management so the whole thing stays legible to the people who did not build it. And I would revisit the single cloud decision on a schedule, honestly, looking for a real reason to expand rather than a fashionable one.

That last part matters. The goal is not loyalty to one provider. It is refusing to add a second one until something other than anxiety is asking for it.

The question worth asking

Multi cloud can be genuinely useful. It just should not be the reflex, the thing you reach for because the alternative feels insufficiently ambitious. Far too often, what gets sold as strategy is simply complexity wearing a nicer outfit.

Cloud maturity is not measured in the number of providers on your invoice. It is measured by how well you use the one you actually need. The most sophisticated architecture in the room is frequently the one that resisted the urge to be impressive.

So the question is not how many clouds you run. It is whether each one earns its place.

Your Ingress resource is living on borrowed time

There is a special kind of grief reserved for infrastructure that works fine. Nobody writes eulogies for the broken stuff; that gets deleted with enthusiasm. The painful goodbyes are for the things that still do their job every day, quietly, while the rest of the industry has already decided they belong in a museum. Your Ingress resources are in that category now. They route traffic, they terminate TLS, and they have not paged you in months. And they are, officially and by design, a dead end.

The Kubernetes project has been remarkably polite about this. Ingress is “frozen”, which is the standards body equivalent of moving someone to a nice farm upstate. No new features, no spec evolution, no fixes for the design decisions everyone now regrets. The replacement is called Gateway API, it reached general availability back in 2023, and it is one of those rare cases where the new thing is not just the old thing with more YAML. It actually fixes the organizational problem that made Ingress miserable, which, as we will see, was never really a technical problem at all.

The Ingress spec was always a rough draft

Here is the part of the story that usually gets left out. When Ingress shipped in 2015, the Kubernetes maintainers did not believe they had solved HTTP routing. They believed, correctly, that they had no idea what HTTP routing should look like, and they shipped a minimal spec on purpose. Host, path, backend service. That was essentially it. Everything else, the maintainers figured, could be handled by annotations until the community figured out what it actually wanted.

The community figured out what it wanted, all right. It wanted everything, and it wanted it via annotations.

If you have ever operated an nginx ingress controller in production, you know the genre. nginx.ingress.kubernetes.io/rewrite-target. nginx.ingress.kubernetes.io/canary-weight. nginx.ingress.kubernetes.io/configuration-snippet, which is the annotation equivalent of a hole in the wall that you push raw nginx config through and hope for the best. Traefik grew its own dialect. HAProxy grew another. At some point, the nginx controller alone supported well over a hundred proprietary annotations, each one a small confession that the spec underneath could not do the job.

The practical consequence is one that every platform engineer has lived. Your routing configuration is portable in theory and welded to your controller in practice. Migrating from nginx to anything else means translating a folklore of annotations by hand, and some of them have no translation, because they were never features of Kubernetes. They were features of one specific reverse proxy, smuggled in through a string field.

None of this makes Ingress bad design. It makes Ingress an honest admission, in 2015, that nobody agreed on what routing should look like. Gateway API is what happened after roughly eight years of arguing, when the community finally agreed.

Three resources instead of one, and that is the whole upgrade

Gateway API replaces the single Ingress object with three, and before your YAML fatigue kicks in, stay with me, because the count is not the point. The ownership is.

GatewayClass is the template. It declares what kind of gateway infrastructure your cluster offers (Envoy, Cilium, or a cloud load balancer), and it gets written approximately once, by whoever runs the platform, and then mostly forgotten.

Gateway is a running instance of that template. It is the actual listener, the thing with an IP address and open ports, and it lives in an infrastructure namespace where application developers cannot poke it.

HTTPRoute is the routing rule. It says “traffic for this hostname and this path goes to this service”, and it lives in the application’s own namespace, right next to the Deployment it serves, owned by the team that owns the app.

That is the entire model. Three objects, three different owners, three different namespaces if you want them. Every interesting thing about Gateway API follows from that separation, which brings us to the actual argument.
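For readers who think in manifests, here is a minimal sketch of the three objects side by side. Every name in it (example-class, the controllerName value, the infra and team-a namespaces, the shop Service) is a placeholder made up for illustration; only the field names come from the Gateway API v1 schema. The listener is deliberately open to all namespaces to keep the sketch short.

```yaml
# GatewayClass: written roughly once by the platform team. Cluster-scoped.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: example-class                              # hypothetical name
spec:
  controllerName: example.com/gateway-controller   # hypothetical; defined by your implementation
---
# Gateway: a running listener, owned by the platform team.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-web
  namespace: infra                                 # hypothetical infrastructure namespace
spec:
  gatewayClassName: example-class
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All                                  # permissive on purpose; tighten in a real cluster
---
# HTTPRoute: the routing rule, owned by the application team.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop
  namespace: team-a                                # hypothetical application namespace
spec:
  parentRefs:
  - name: external-web
    namespace: infra
  hostnames:
  - shop.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: shop                                   # hypothetical Service in team-a
      port: 8080
```

Three documents, three plausible owners. Nothing in the HTTPRoute mentions ports, certificates, or the controller, which is the point.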

The hallway belongs to the platform team, and the door belongs to the app team

Think about what an Ingress object actually is, organizationally. It is one resource that contains both infrastructure concerns (TLS certificates, load balancer behavior, controller tuning) and application concerns (which path goes to which service). One object, two very different audiences, and Kubernetes RBAC can only draw permission lines around whole objects.

So every organization running Ingress at scale ends up choosing between two bad options. Option one: the platform team owns all Ingress resources, and application teams file tickets to change a path rule, which is a magnificent way to turn a thirty-second change into a three-day wait. Option two: application teams own their Ingress resources, which means application teams can now set controller-level annotations, and somewhere in your cluster, there is a configuration snippet written by an intern in 2022 that nobody dares to remove. Both options are workarounds for the same flaw. The spec crammed two jobs into one object, and org charts do not bend that way.

Gateway API splits the object along exactly the line where your teams already split. The platform engineer provisions the Gateway in the infra namespace. They decide which ports are open, which TLS policy applies, and, crucially, which namespaces are allowed to attach routes to it. The application developer writes an HTTPRoute in their own namespace that says, in effect, “attach me to the gateway named external-web”. The route references the gateway by name and namespace; the gateway grants permission by policy. Cross-namespace routing is not a hack here; it is the core mechanic of the spec, with an explicit handshake on both sides.
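The handshake can be sketched in two halves. The names here (example-class, the gateway-access label, the wildcard-example-com Secret, the shop Service) are assumptions for illustration, and a label selector is only one of the policies a listener can use.

```yaml
# Platform side: the Gateway decides who may attach.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: external-web
  namespace: infra
spec:
  gatewayClassName: example-class          # hypothetical class name
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      certificateRefs:
      - name: wildcard-example-com         # hypothetical TLS Secret in infra
    allowedRoutes:
      namespaces:
        from: Selector
        selector:
          matchLabels:
            gateway-access: external-web   # hypothetical label the platform team controls
---
# Platform side again: labelling a namespace is what grants it access.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
  labels:
    gateway-access: external-web
---
# Application side: the route asks to attach by naming the Gateway.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop
  namespace: team-a
spec:
  parentRefs:
  - name: external-web
    namespace: infra
    sectionName: https                     # optional: attach to this listener only
  rules:
  - backendRefs:
    - name: shop                           # hypothetical Service
      port: 8080
```

If the namespace label is missing, the route is simply not accepted and its status should say so, which is the failure mode you want from a permission system.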

If you read my past RBAC article, this will feel familiar, because it is the same principle wearing a different hat. Least privilege stopped being just about who can “kubectl delete” things and started applying to the network path itself. App teams get exactly the surface they need (their routes, their namespace) and nothing else. The platform team stops being a ticket-processing bottleneck and goes back to doing platform work. Nobody negotiates over annotations in a Slack thread at 6 p.m. on a Friday, which I am told does wonders for retention.

There is also a quieter benefit that only shows up in the postmortem. When routing rules live next to the application, the blast radius of a bad change is the application. When everything lives in one shared Ingress layer, a typo in one team’s path rule can take an unrelated team’s traffic with it. Separation of concerns is usually sold as elegance. In production, it is mostly sold as smaller incidents.

What Ingress made you beg your controller to do

Now for the features, briefly, because the features are genuinely less interesting than the reframe behind them.

Take canary deployments. With Ingress on nginx, weight-based traffic splitting means creating a second Ingress object, blessing it with ‘canary: “true”’ and ‘canary-weight: “10”’ annotations, and trusting that the controller interprets your strings correctly. With Gateway API, an HTTPRoute simply lists two backends with weights, 90 and 10, as ordinary structured fields. The API server validates them. Your canary rollout is now plain YAML instead of an incantation, and you did not have to install a service mesh to get it.
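As a sketch, the 90/10 split described above looks roughly like this. The Service names, namespaces, and port are hypothetical; the weight field is part of the HTTPRoute schema.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop
  namespace: team-a          # hypothetical application namespace
spec:
  parentRefs:
  - name: external-web
    namespace: infra         # hypothetical infrastructure namespace
  rules:
  - backendRefs:
    - name: shop-stable      # hypothetical Service for the current version
      port: 8080
      weight: 90
    - name: shop-canary      # hypothetical Service for the new version
      port: 8080
      weight: 10
```

Weights are proportions rather than percentages; 90 and 10 read as percentages only because they happen to add up to 100.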

Header-based routing gets the same treatment. Routing requests with ‘x-beta-user: true’ to a different backend is a match condition in the spec, not a regex pasted into a controller-specific snippet. URL rewriting is a filter. Request mirroring, the trick where you copy live traffic to a new version without affecting real responses, is a filter too. Timeouts, header manipulation, traffic redirection, all first-class citizens with schemas.
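A rough sketch of how those pieces look in one route, with hypothetical Service names and paths throughout. URL rewriting and request mirroring sit in the extended part of the spec, so check that your implementation supports them before relying on this.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop
  namespace: team-a
spec:
  parentRefs:
  - name: external-web
    namespace: infra
  rules:
  # Rule 1: requests under /shop that carry the beta header go to the beta backend.
  - matches:
    - path:
        type: PathPrefix
        value: /shop
      headers:
      - name: x-beta-user
        value: "true"                  # exact match is the default
    backendRefs:
    - name: shop-beta                  # hypothetical Service
      port: 8080
  # Rule 2: other requests under /shop, rewritten and mirrored.
  - matches:
    - path:
        type: PathPrefix
        value: /shop
    filters:
    - type: URLRewrite
      urlRewrite:
        path:
          type: ReplacePrefixMatch
          replacePrefixMatch: /        # e.g. /shop/cart becomes /cart
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: shop-next              # hypothetical Service receiving a copy of the traffic
          port: 8080
    backendRefs:
    - name: shop                       # hypothetical Service that answers the client
      port: 8080
```

Both rules share the same path prefix, so the extra header match is what makes the first rule the more specific one.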

Here is the reframe. None of these capabilities are new. Your reverse proxy could do all of this in 2016; reverse proxies are old and wise. What was missing was a portable way to ask for it. Under Ingress, every feature beyond host-and-path routing required learning the proprietary annotation dialect of whichever controller you happened to inherit, and your hard-won fluency in nginx annotations was worth exactly nothing the day someone migrated to Traefik. Gateway API moves those features into the spec itself, where they are typed, validated, and identical across implementations. The knowledge finally transfers. So do the manifests.

GatewayClass is the new vendor coupling point, and that is a better deal

Time for the honest section, because every article praising a new standard owes you one.

Gateway API does not eliminate vendor lock-in, and anyone telling you otherwise is selling a controller. The GatewayClass is where you commit. You pick Cilium, or Envoy Gateway, or Istio, or nginx-gateway-fabric, and from that moment your gateways run on that implementation’s machinery, with that implementation’s performance profile and that implementation’s extension features. Conformance across implementations is real but not absolute; the spec has core features everyone must support and extended ones they may.

What changed is the geometry of the coupling. With Ingress, the vendor dependency was smeared across your entire estate, hiding inside opaque annotation strings on every single routing object. You could not see it, measure it, or contain it; you discovered its true size on migration day, which is the worst possible day to discover anything. With Gateway API, the coupling is compressed into one object type. Everything above the GatewayClass (your routes, your matches, your filters, your weights) is portable standard YAML. Everything below it is the vendor’s problem. Swapping implementations becomes “change the GatewayClass and re-test”, not “translate three hundred annotations from one dialect to another and pray”.

The ecosystem, for the record, is not a science fair. Cilium ships a Gateway implementation on eBPF. Envoy Gateway is the CNCF’s straightforward Envoy packaging. Istio treats Gateway API as its preferred configuration surface these days. nginx-gateway-fabric exists for the sizable demographic that would like to keep nginx but lose the annotations. All of these run in production at companies whose outages would make the news.

You do not need to migrate everything to start

The best property of Gateway API for anyone with an existing cluster is that it demands nothing of your existing cluster. Gateway API and Ingress run side by side indefinitely. The controllers do not fight, the resources do not overlap, and your hundred working Ingress objects can keep working while you experiment two namespaces away.

The sensible entry point is not a migration project (migration projects are where enthusiasm goes to file status reports). It is one new service, or one feature branch, routed through an HTTPRoute while everything else stays put. You get a feel for the model, your platform team writes its first Gateway, and the canary feature gets a real audition on something low-stakes.

Whether your cluster is already prepared takes one command to find out.

kubectl get crds | grep gateway.networking.k8s.io

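If that returns a list of CRDs, the welcome mat is already out; managed offerings like GKE ship them preinstalled. If it returns nothing, the installation is a single manifest from the Gateway API releases page, and then the welcome mat is out. Either way, the CRDs only define the API; you still need a controller for your chosen GatewayClass before a Gateway does anything.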

Ingress will keep working for years. Frozen APIs in Kubernetes enjoy long, comfortable retirements, and nobody is coming to delete your manifests. But every new routing feature, every new controller capability, and increasingly every new piece of documentation is being written for the other API now. Borrowed time is still time. It is just no longer the kind you should be building on.

RBAC is not least privilege, and your cluster is the proof

Your security scanner ran last night. It came back green. RBAC is configured, there are no critical findings, and you closed the tab with the quiet satisfaction of someone who has done the responsible thing. The cluster is locked down. You can go to lunch.

Here is the uncomfortable part. A green scanner answers the question “Is access controlled?” It does not answer the question “Is access minimal?” Those are different questions, and most teams conflate them because the first one is easy to check and the second one requires reading things nobody wants to read on a Tuesday.

RBAC answers the first. Least privilege requires answering both. And a perfectly valid RBAC configuration can be, at the very same time, a perfectly generous one. The scanner has no opinion about generosity.

The ClusterRole you inherited from a Helm chart in March

Kubernetes ships three aggregated ClusterRoles out of the box (admin, edit, view), and they have a quietly alarming property. They absorb permissions. Any ClusterRole carrying the label ‘rbac.authorization.k8s.io/aggregate-to-edit: “true”’ gets automatically folded into ‘edit’, with no human in the loop and no diff to review.

This is convenient right up until it is not. When you installed that operator back in March, its Helm chart shipped a CRD and a ClusterRole with the aggregation label attached, because that is the polite, idiomatic way to do it. From the moment ‘helm install’ finished, every subject bound to ‘edit’ in your cluster silently gained permissions over a brand new resource type. Nobody approved it. Nobody saw it. The controller did exactly what it was designed to do, which is the part that should worry you.
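For illustration, the ClusterRole such a chart might ship looks something like the sketch below. The role name, API group, and resource are invented; the label key is the real aggregation label.

```yaml
# What a chart might install alongside its CRD (all names hypothetical).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: widgets-editor
  labels:
    # This label is the whole mechanism: the built-in "edit" role selects on it.
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
rules:
- apiGroups: ["widgets.example.com"]
  resources: ["widgets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```

Nothing in this manifest mentions a subject or a binding. It widens an existing role, so every binding that already points at that role widens with it.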

So the RoleBinding still says ‘edit’. The word has not changed. What it grants has, several times, across several chart upgrades, and the only record of the expansion is scattered across ClusterRole objects nobody has opened since they were applied.

The takeaway is small and annoying: every time you install a chart, check what it aggregated. ‘kubectl get clusterrole -l rbac.authorization.k8s.io/aggregate-to-edit=true’ is two minutes of your life and occasionally a genuine surprise.

That ServiceAccount reads secrets, all of them, probably

Consider a ServiceAccount with ‘get’ on secrets in a single namespace. On paper, this looks narrow and tidy. The reviewer who approved it was right to approve it. The problem is that RBAC grants do not live in isolation; they live next to whatever else is running in that namespace.

If that namespace also hosts External Secrets Operator, a Vault Agent sidecar, or a CSI secrets driver, the secrets sitting there are not application trivia. They are the synced, materialized credentials that those tools pulled from somewhere more important. A grant that reads “can view secrets in ‘team-a’” can, depending on the architecture around it, mean “can read the cloud provider credentials that External Secrets faithfully copied into ‘team-a’ thirty seconds ago.”

Nothing here is broken. Every component is behaving as documented. That is exactly why it slips past review: each piece is reasonable, and the risk only exists in the seam between them, where no single Role definition is looking.

So when you audit a secrets grant, do not read the Role. Read the room. Ask what else lives in that namespace and what those neighbors keep in their pockets.
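For reference, the grant under discussion looks something like the first Role below. The second is one possible way to narrow it, assuming the workload only ever needs a single, known Secret. All names are hypothetical.

```yaml
# The grant that passes review: "get" on any Secret in team-a.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-reader
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
---
# A narrower variant: the same verb, limited to one named Secret.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-config-reader
  namespace: team-a
rules:
- apiGroups: [""]
  resources: ["secrets"]
  resourceNames: ["app-config"]   # hypothetical Secret name
  verbs: ["get"]
```

The second form keeps synced credentials in the same namespace out of reach, provided nothing else grants the same subject list on secrets.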

Creating a Pod sometimes creates a root shell on the node

This is the one people refuse to believe until you show them.

If Pod Security Admission is not enforcing at least the ‘baseline’ level (and preferably ‘restricted’), a subject with ‘create’ on pods is, functionally, a subject with a path to the node. They can define a pod that mounts the host root filesystem as a volume, sets ‘hostPID: true’, runs ‘privileged: true’, or maps a host port to quietly intercept traffic. From inside that pod, the node is no longer a node; it is a directory.
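To make that concrete, here is a sketch of the kind of Pod the paragraph describes, with all four knobs turned at once. The names, namespace, and image are placeholders, and Pod Security Admission at the baseline or restricted level would reject it.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-access-demo        # hypothetical
  namespace: sandbox            # hypothetical namespace without PSA enforcement
spec:
  hostPID: true                 # share the node's process namespace
  containers:
  - name: shell
    image: busybox              # any image with a shell would do
    command: ["sleep", "3600"]
    securityContext:
      privileged: true          # no container isolation to speak of
    ports:
    - containerPort: 8080
      hostPort: 8080            # bind a port on the node itself
    volumeMounts:
    - name: host-root
      mountPath: /host          # the node's filesystem appears here
  volumes:
  - name: host-root
    hostPath:
      path: /
```

Every field here is ordinary, documented Pod spec. The only permission needed to submit it is create on pods.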

None of this is a vulnerability. There is no CVE to patch, because Kubernetes is doing precisely what the spec permits. The escalation lives in the gap between two true statements: “we have RBAC” and “nobody can reach the node.” Both can be accurate. Together, they can still be a hole you could drive a cluster through.

The fix is not more RBAC. It is admission control. Enforce PSA ‘restricted’ as the namespace default, and treat every exception as a decision someone wrote down and owns, rather than a default nobody chose.
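A minimal sketch of what that looks like, assuming a namespace called team-a and a hypothetical image that runs as a non-root user. The label keys are the standard Pod Security Admission ones.

```yaml
# Namespace labels switch on enforcement for everything created inside it.
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                                        # hypothetical
  labels:
    pod-security.kubernetes.io/enforce: restricted    # reject non-compliant pods
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted       # warn the client at apply time
    pod-security.kubernetes.io/audit: restricted      # annotate audit log entries
---
# A Pod shaped to pass the restricted level.
apiVersion: v1
kind: Pod
metadata:
  name: shop
  namespace: team-a
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: registry.example.com/shop:1.0   # hypothetical image
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

On an existing namespace it is usually gentler to start with the warn and audit labels alone, see what would break, and add enforce afterwards.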

Three commands that will ruin your afternoon

Theory is comfortable. Here is the part where you actually look.

‘kubectl-who-can’ answers the blunt question: who can perform this verb on this resource, right now. ‘kubectl who-can create pods -n production’ is a fast way to find out that the list is longer than you remembered.

‘rakkess’ produces a full access matrix for a given subject, so you can stare at an entire grid of green checkmarks belonging to a ServiceAccount that, in principle, only needed to read a ConfigMap.

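‘rbac-tool lookup’ lists every Role and ClusterRole bound to a specific subject across the whole cluster, and its sibling ‘rbac-tool policy-rules’ expands those bindings into the actual rules, which makes it the tool you run when you have a name and a bad feeling.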

I will set an honest expectation. The first time you run any of these against a cluster older than a year, you will find at least one thing nobody intended, and there is a decent chance it will be something you granted. This is not a moral failing. It is entropy. Permissions accrete the same way junk drawers do, one reasonable decision at a time.

The scanner will still be green, and that is no longer the point

Here is where I am supposed to hand you a fix that makes the scary parts go away. I cannot, because least privilege in Kubernetes is not a configuration state you reach and then defend. It is a process you keep doing, slightly grudgingly, forever.

Start subjects at zero and grant only what the audit log proves they actually use. Tools like ‘audit2rbac’ can generate tight RBAC from real API server audit events, which is to say from evidence rather than from optimism. Enforce PSA ‘restricted’ by default. Audit aggregated ClusterRoles every time you install a chart. Rotate ServiceAccount tokens, because a credential that never expires is just a future incident with good patience.
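On the token point, one possible shape is a ServiceAccount that mounts nothing by default plus a Pod that asks for a short-lived, projected token explicitly. This is a sketch under assumptions: the names, namespace, image, and mount path are all hypothetical, and the application has to be told to read its token from that path.

```yaml
# Start at zero: no bindings, and no token mounted unless a Pod asks for one.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: report-reader           # hypothetical
  namespace: reports            # hypothetical
automountServiceAccountToken: false
---
apiVersion: v1
kind: Pod
metadata:
  name: report-reader
  namespace: reports
spec:
  serviceAccountName: report-reader
  containers:
  - name: app
    image: registry.example.com/report-reader:1.0   # hypothetical image
    volumeMounts:
    - name: api-token
      mountPath: /var/run/secrets/tokens             # hypothetical path the app reads from
      readOnly: true
  volumes:
  - name: api-token
    projected:
      sources:
      - serviceAccountToken:
          path: api-token
          expirationSeconds: 3600   # the kubelet refreshes the file before it expires
```

The token in that volume is bound to the Pod and has a finite lifetime, so a leaked copy ages out on its own rather than waiting patiently for someone to find it.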

Do all of that, and run the scanner again. It will still be green. It was always going to be green. The result has not changed at all. The only thing that has changed is the question you now know to ask, and that, inconveniently, was the whole job.

There is no universal answer here, only better-informed trade-offs, and the faint suspicion that your next audit will find something too. It usually does.