Kubernetes

Random comments about Kubernetes

Why your Kubernetes pods and EBS volumes refuse to reconnect

You’re about to head out for lunch. One last, satisfying glance at the monitoring dashboard, all systems green. Perfect. You return an hour later, coffee in hand, to a cascade of alerts. Your application is down. At the heart of the chaos is a single, cryptic message from Kubernetes, and it’s in a mood.

Warning: 1 node(s) had volume node affinity conflict.

You stare at the message. “Volume node affinity conflict” sounds less like a server error and more like something a therapist would say about a couple that can’t agree on which city to live in. You grab your laptop. One of your critical application pods has been evicted from its node and now sits stubbornly in a Pending state, refusing to start anywhere else.

Welcome to the quiet, simmering nightmare of running stateful applications on a multi-availability zone Kubernetes cluster. Your pods and your storage are having a domestic dispute, and you’re the unlucky counselor who has to fix it before the morning stand-up.

Meet the unhappy couple

To understand why your infrastructure is suddenly giving you the silent treatment, you need to understand the two personalities at the heart of this conflict.

First, we have the Pod. Think of your Pod as a freewheeling digital nomad. It’s lightweight, agile, and loves to travel. If its current home (a Node) gets too crowded or suddenly vanishes in a puff of cloud provider maintenance, the Kubernetes scheduler happily finds it a new place to live on another node. The Pod packs its bags in a microsecond and moves on, no questions asked. It believes in flexibility and a minimalist lifestyle.

Then, there’s the EBS volume. If the Pod is a nomad, the Amazon EBS Volume is a resolute homebody. It’s a hefty, 20GB chunk of your application’s precious data. It’s incredibly reliable and fast, but it has one non-negotiable trait: it is physically, metaphorically, and spiritually attached to one single place. That place is an AWS Availability Zone (AZ), which is just a fancy term for a specific data center. An EBS volume created in us-west-2a lives in us-west-2a, and it would rather be deleted than move to us-west-2b. It finds the very idea of travel vulgar.

You can already see the potential for drama. The free-spirited Pod gets evicted and is ready to move to a lovely new node in us-west-2b. But its data, its entire life story, is sitting back in us-west-2a, refusing to budge. The Pod can’t function without its data, so it just sits there, Pending, forever waiting for a reunion that will never happen.

The brute force solution that creates new problems

When faced with this standoff, our first instinct is often to play the role of a strict parent. “You two will stay together, and that’s final!” In Kubernetes, this is called the nodeSelector.

You can edit your Deployment and tell the Pod, in no uncertain terms, that it is only allowed to live in the same neighborhood as its precious volume.

# deployment-with-nodeslector.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-stateful-app
  template:
    metadata:
      labels:
        app: my-stateful-app
    spec:
      nodeSelector:
        # "You will ONLY live in this specific zone!"
        topology.kubernetes.io/zone: us-west-2b
      containers:
        - name: my-app-container
          image: nginx:1.25.3
          volumeMounts:
            - name: app-data
              mountPath: /var/www/html
      volumes:
        - name: app-data
          persistentVolumeClaim:
            claimName: my-app-pvc

This works. Kind of. The Pod is now shackled to the us-west-2b availability zone. If it gets rescheduled, the scheduler will only consider other nodes within that same AZ. The affinity conflict is solved.

But you’ve just traded one problem for a much scarier one. You’ve effectively disabled the “multi-AZ” resilience for this application. If us-west-2b experiences an outage or simply runs out of compute resources, your pod has nowhere to go. It will remain Pending, not because of a storage spat, but because you’ve locked it in a house that’s just run out of oxygen. This isn’t a solution; it’s just picking a different way to fail.

The elegant fix of intelligent patience

So, how do we get our couple to cooperate without resorting to digital handcuffs? The answer lies in changing not where they live, but how they decide to move in together.

The real hero of our story is a little-known StorageClass parameter: volumeBindingMode: WaitForFirstConsumer.

By default, when you ask for a PersistentVolumeClaim, Kubernetes provisions the EBS volume immediately. It’s like buying a heavy, immovable sofa before you’ve even chosen an apartment. The delivery truck drops it in us-west-2a, and now you’re forced to find an apartment in that specific neighborhood.

WaitForFirstConsumer flips the script entirely. It tells Kubernetes: “Hold on. Don’t buy the sofa yet. First, let the Pod (the ‘First Consumer’) find an apartment it likes.”

Here’s how this intelligent process unfolds:

You request a volume with a PersistentVolumeClaim.
The StorageClass, configured with WaitForFirstConsumer, does… nothing. It waits.
The Kubernetes scheduler, now free from any storage constraints, analyzes all your nodes across all your availability zones. It finds the best possible node for your Pod based on resources and other policies. Let’s say it picks a node in us-west-2c.
Only after the Pod has been assigned a home on that node does the StorageClass get the signal. It then dutifully provisions a brand-new EBS volume in that exact same zone, us-west-2c.

The Pod and its data are born together, in the same place, at the same time. No conflict. No drama. It’s a match made in cloud heaven.

Here is what this “patient” StorageClass looks like:

# storageclass-patient.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc-wait
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
# This is the magic line.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Your PersistentVolumeClaim simply needs to reference it:

# persistentvolumeclaim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-pvc
spec:
  # Reference the patient StorageClass
  storageClassName: ebs-sc-wait
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

And now, your Deployment can be blissfully unaware of zones, free to roam as a true digital nomad should.

# deployment-liberated.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateful-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-stateful-app
  template:
    metadata:
      labels:
        app: my-stateful-app
    spec:
      # No nodeSelector! The pod is free!
      containers:
        - name: my-app-container
          image: nginx:1.25.3
          volumeMounts:
            - name: app-data
              mountPath: /var/www/html
      volumes:
        - name: app-data
          persistentVolumeClaim:
            claimName: my-app-pvc

Let your infrastructure work for you

The moral of the story is simple. Don’t fight the brilliant, distributed nature of Kubernetes with rigid, zonal constraints. You chose a multi-AZ setup for resilience, so don’t let your storage configuration sabotage it.

By using WaitForFirstConsumer, which, thankfully, is the default in modern versions of the AWS EBS CSI Driver, you allow the scheduler to do its job properly. Your pods and volumes can finally have a healthy, lasting relationship, happily migrating together wherever the cloud winds take them.

And you? You can go back to sleep.

Playing detective with dead Kubernetes nodes

It arrives without warning, a digital tap on the shoulder that quickly turns into a full-blown alarm. Maybe you’re mid-sentence in a meeting, or maybe you’re just enjoying a rare moment of quiet. Suddenly, a shriek from your phone cuts through everything. It’s the on-call alert, flashing a single, dreaded message: NodeNotReady.

Your beautifully orchestrated city of containers, a masterpiece of modern engineering, now has a major power outage in one of its districts. One of your worker nodes, a once-diligent and productive member of the cluster, has gone completely silent. It’s not responding to calls, it’s not picking up new work, and its existing jobs are in limbo. In the world of Kubernetes, this isn’t just a technical issue; it’s a ghosting of the highest order.

Before you start questioning your life choices or sacrificing a rubber chicken to the networking gods, take a deep breath. Put on your detective’s trench coat. We have a case to solve.

First on the scene, the initial triage

Every good investigation starts by surveying the crime scene and asking the most basic question: What the heck happened here? In our world, this means a quick and clean interrogation of the Kubernetes API server. It’s time for a roll call.

kubectl get nodes -o wide

This little command is your first clue. It lines up all your nodes and points a big, accusatory finger at the one in the Not Ready state.

NAME                    STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-master-1            Ready      master   90d   v1.28.2   10.128.0.2       34.67.123.1     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-7b5d    NotReady   <none>   45d   v1.28.2   10.128.0.5       35.190.45.6     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-fg9h    Ready      <none>   45d   v1.28.2   10.128.0.4       35.190.78.9     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9

There’s our problem child: k8s-worker-node-7b5d. Now that we’ve identified our silent suspect, it’s time to pull it into the interrogation room for a more personal chat.

kubectl describe node k8s-worker-node-7b5d

The output of describe is where the juicy gossip lives. You’re not just looking at specs; you’re looking for a story. Scroll down to the Conditions and, most importantly, the Events section at the bottom. This is where the node often leaves a trail of breadcrumbs explaining exactly why it decided to take an unscheduled vacation.

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:50:05 +0200   KubeletNotReady              container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Events:
  Type     Reason                   Age                  From                       Message
  ----     ------                   ----                 ----                       -------
  Normal   Starting                 25m                  kubelet                    Starting kubelet.
  Warning  ContainerRuntimeNotReady 5m12s (x120 over 25m) kubelet                    container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Aha! Look at that. The Events log is screaming for help. A repeating warning, ContainerRuntimeNotReady, points to a CNI (Container Network Interface) plugin having a full-blown tantrum. We’ve moved from a mystery to a specific lead.

The usual suspects, a rogues’ gallery

When a node goes quiet, the culprit is usually one of a few repeat offenders. Let’s line them up.

1. The silent saboteur network issues

This is the most common villain. Your node might be perfectly healthy, but if it can’t talk to the control plane, it might as well be on a deserted island. Think of the control plane as the central office trying to call its remote employee (the node). If the phone line is cut, the office assumes the employee is gone. This can be caused by firewall rules blocking ports, misconfigured VPC routes, or a DNS server that’s decided to take the day off.

2. The overworked informant, the kubelet

The kubelet is the control plane’s informant on every node. It’s a tireless little agent that reports on the node’s health and carries out orders. But sometimes, this agent gets sick. It might have crashed, stalled, or is struggling with misconfigured credentials (like expired TLS certificates) and can’t authenticate with the mothership. If the informant goes silent, the node is immediately marked as a person of interest.

You can check on its health directly on the node:

# SSH into the problematic node
ssh user@<node-ip>

# Check the kubelet's vital signs
systemctl status kubelet

A healthy output should say active (running). Anything else, and you’ve found a key piece of evidence.

3. The glutton resource exhaustion

Your node has a finite amount of CPU, memory, and disk space. If a greedy application (or a swarm of them) consumes everything, the node itself can become starved. The kubelet and other critical system daemons need resources to breathe. Without them, they suffocate and stop reporting in. It’s like one person eating the entire buffet, leaving nothing for the hosts of the party.

A quick way to check for gluttons is with:

kubectl top node <your-problem-child-node-name>

If you see CPU or memory usage kissing 100%, you’ve likely found your culprit.

The forensic toolkit: digging deeper

If the initial triage and lineup didn’t reveal the killer, it’s time to break out the forensic tools and get our hands dirty.

Sifting Through the Diary with journalctl

The journalctl command is your window into the kubelet’s soul (or, more accurately, its log files). This is where it writes down its every thought, fear, and error.

# On the node, tail the kubelet's logs for clues
journalctl -u kubelet -f --since "10 minutes ago"

Look for recurring error messages, failed connection attempts, or anything that looks suspiciously out of place.

Quarantining the patient with drain

Before you start performing open-heart surgery on the node, it’s wise to evacuate the civilians. The kubectl drain command gracefully evicts all the pods from the node, allowing them to be rescheduled elsewhere.

kubectl drain k8s-worker-node-7b5d --ignore-daemonsets --delete-local-data

This isolates the patient, letting you work without causing a city-wide service outage.

Confirming the phone lines with curl

Don’t just trust the error messages. Verify them. From the problematic node, try to contact the API server directly. This tells you if the fundamental network path is even open.

# From the problem node, try to reach the API server endpoint
curl -k https://<api-server-ip>:<port>/healthz

If you get ok, the basic connection is fine. If it times out or gets rejected, you’ve confirmed a networking black hole.

Crime prevention: keeping your nodes out of trouble

Solving the case is satisfying, but a true detective also works to prevent future crimes.

Set up a neighborhood watch: Implement robust monitoring with tools like Prometheus and Grafana. Set up alerts for high resource usage, disk pressure, and node status changes. It’s better to spot a prowler before they break in.
Install self-healing robots: Most cloud providers (GKE, EKS, AKS) offer node auto-repair features. If a node fails its health checks, the platform will automatically attempt to repair it or replace it. Turn this on. It’s your 24/7 robotic police force.
Enforce city zoning laws: Use resource requests and limits on your deployments. This prevents any single application from building a resource-hogging skyscraper that blocks the sun for everyone else.
Schedule regular health checkups: Keep your cluster components, operating systems, and container runtimes updated. Many Not Ready mysteries are caused by long-solved bugs that you could have avoided with a simple patch.

The case is closed for now

So there you have it. The rogue node is back in line, the pods are humming along, and the city of containers is once again at peace. You can hang up your trench coat, put your feet up, and enjoy that lukewarm coffee you made three hours ago. The mystery is solved.

But let’s be honest. Debugging a Not Ready node is less like a thrilling Sherlock Holmes novel and more like trying to figure out why your toaster only toasts one side of the bread. It’s a methodical, often maddening, process of elimination. You start with grand theories of network conspiracies and end up discovering the culprit was a single, misplaced comma in a YAML file, the digital equivalent of the butler tripping over the rug.

So the next time an alert yanks you from your peaceful existence, don’t panic. Remember that you are a digital detective, a whisperer of broken machines. Your job is to patiently ask the right questions until the silent, uncooperative suspect finally confesses. After all, in the world of Kubernetes, a node is never truly dead. It’s just being dramatic and waiting for a good detective to find the clues, and maybe, just maybe, restart its kubelet. The city is safe… until the next time. And there is always a next time.

October 13, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

Parenting your Kubernetes using hierarchical namespaces

Let’s be honest. Your Kubernetes cluster, on its bad days, feels less like a sleek, futuristic platform and more like a chaotic shared apartment right after college. The frontend team is “borrowing” CPU from the backend team, the analytics project left its sensitive data lying around in a public bucket, and nobody knows who finished the last of the memory reserves.

You tried to bring order. You dutifully handed out digital rooms to each team using namespaces. For a while, there was peace. But then those teams had their own little sub-projects, staging, testing, that weird experimental feature no one talks about, and your once-flat world devolved into a sprawling city with no zoning laws. The shenanigans continued, just inside slightly smaller boxes.

What you need isn’t more rules scribbled on a whiteboard. You need a family tree. It’s time to introduce some much-needed parental supervision into your cluster. It’s time for Hierarchical Namespaces.

The origin of the namespace rebellion

In the beginning, Kubernetes gave us namespaces, and they were good. The goal was simple: create virtual walls to stop teams from stealing each other’s lunch (metaphorically speaking, of course). Each namespace was its own isolated island, a sovereign nation with its own rules. This “flat earth” model worked beautifully… until it didn’t.

As organizations scaled, their clusters turned into bustling archipelagos of hundreds of namespaces. Managing them felt like being an air traffic controller for a fleet of paper airplanes in a hurricane. Teams realized that a flat structure was basically a free-for-all party where every guest could raid the fridge, as long as they stayed in their designated room. There was no easy way to apply a single rule, like a network policy or a set of permissions, to a group of related namespaces. The result was a maddening copy-paste-a-thon of YAML files, a breeding ground for configuration drift and human error.

The community needed a way to group these islands, to draw continents. And so, the Hierarchical Namespace Controller (HNC) was born, bringing a simple, powerful concept to the table: namespaces can have parents.

What this parenting gig gets you

Adopting a hierarchical structure isn’t just about satisfying your inner control freak. It comes with some genuinely fantastic perks that make cluster management feel less like herding cats.

The “Because I said so” principle: This is the magic of policy inheritance. Any Role, RoleBinding, or NetworkPolicy you apply to a parent namespace automatically cascades down to all its children and their children, and so on. It’s the parenting dream: set a rule once, and watch it magically apply to everyone. No more duplicating RBAC roles for the dev, staging, and testing environments of the same application.
The family budget: You can set a resource quota on a parent namespace, and it becomes the total budget for that entire branch of the family tree. For instance, team-alpha gets 100 CPU cores in total. Their dev and qa children can squabble over that allowance, but together, they can’t exceed it. It’s like giving your kids a shared credit card instead of a blank check.
Delegated authority: You can make a developer an admin of a “team” namespace. Thanks to inheritance, they automatically become an admin of all the sub-namespaces under it. They get the freedom to manage their own little kingdoms (staging, testing, feature-x) without needing to ping a cluster-admin for every little thing. You’re teaching them responsibility (while keeping the master keys to the kingdom, of course).

Let’s wrangle some namespaces

Convinced? I thought so. The good news is that bringing this parental authority to your cluster isn’t just a fantasy. Let’s roll up our sleeves and see how it works.

Step 0: Install the enforcer

Before we can start laying down the law, we need to invite the enforcer. The Hierarchical Namespace Controller (HNC) doesn’t come built-in with Kubernetes. You have to install it first.

You can typically install the latest version with a single kubectl command:

kubectl apply -f [https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/hnc-manager.yaml](https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/hnc-manager.yaml)

Wait a minute for the controller to be up and running in its own hnc-system namespace. Once it’s ready, you’ll have a new superpower: the kubectl hns plugin.

Step 1: Create the parent namespace

First, let’s create a top-level namespace for a project. We’ll call it project-phoenix. This will be our proud parent.

kubectl create namespace project-phoenix

Step 2: Create some children

Now, let’s give project-phoenix a couple of children: staging and testing. Wait, what’s that hns command? That’s not your standard kubectl. That’s the magic wand the HNC just gave you. You’re telling it to create a staging namespace and neatly tuck it under its parent.

kubectl hns create staging -n project-phoenix
kubectl hns create testing -n project-phoenix

Step 3: Admire your family tree

To see your beautiful new hierarchy in all its glory, you can ask HNC to draw you a picture.

kubectl hns tree project-phoenix

You’ll get a satisfyingly clean ASCII art diagram of your new family structure:

You can even create grandchildren. Let’s give the staging namespace its own child for a specific feature branch.

kubectl hns create feature-login-v2 -n staging
kubectl hns tree project-phoenix

And now your tree looks even more impressive:

Step 4 Witness the magic of inheritance

Let’s prove that this isn’t all smoke and mirrors. We’ll create a Role in the parent namespace that allows viewing Pods.

# viewer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-viewer
  namespace: project-phoenix
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Apply it:

kubectl apply -f viewer-role.yaml

Now, let’s give a user, let’s call her jane.doe, that role in the parent namespace.

kubectl create rolebinding jane-viewer --role=pod-viewer --user=jane.doe -n project-phoenix

Here’s the kicker. Even though we only granted Jane permission in project-phoenix, she can now magically view pods in the staging and feature-login-v2 namespaces as well.

# This command would work for Jane!
kubectl auth can-i get pods -n staging --as=jane.doe
# YES

# And even in the grandchild namespace!
kubectl auth can-i get pods -n feature-login-v2 --as=jane.doe
# YES

No copy-pasting required. The HNC saw the binding in the parent and automatically propagated it down the entire tree. That’s the power of parenting.

A word of caution from a fellow parent

As with real parenting, this new power comes with its own set of challenges. It’s not a silver bullet, and you should be aware of a few things before you go building a ten-level deep namespace dynasty.

Complexity can creep in: A deep, sprawling tree of namespaces can become its own kind of nightmare to debug. Who has access to what? Which quota is affecting this pod? Keep your hierarchy logical and as flat as you can get away with. Just because you can create a great-great-great-grandchild namespace doesn’t mean you should.
Performance is not free: The HNC is incredibly efficient, but propagating policies across thousands of namespaces does have a cost. For most clusters, it’s negligible. For mega-clusters, it’s something to monitor.
Not everyone obeys the parents: Most core Kubernetes resources (RBAC, Network Policies, Resource Quotas) play nicely with HNC. But not all third-party tools or custom controllers are hierarchy-aware. They might only see the flat world, so always test your specific tools.

Go forth and organize

Hierarchical Namespaces are the organizational equivalent of finally buying drawer dividers for that one kitchen drawer, you know the one. The one where the whisk is tangled with the batteries and a single, mysterious key. They transform your cluster from a chaotic free-for-all into a structured, manageable hierarchy that actually reflects how your organization works. It’s about letting you set rules with confidence and delegate with ease.

So go ahead, embrace your inner cluster parent. Bring some order to the digital chaos. Your future self, the one who isn’t spending a Friday night debugging a rogue pod in the wrong environment, will thank you. Just don’t be surprised when your newly organized child namespaces start acting like teenagers, asking for the production Wi-Fi password or, heaven forbid, the keys to the cluster-admin car.After all, with great power comes great responsibility… and a much, much cleaner kubectl get ns output.

October 3, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

Ingress and egress on EKS made understandable

Getting traffic in and out of a Kubernetes cluster isn’t a magic trick. It’s more like running the city’s most exclusive nightclub. It’s a world of logistics, velvet ropes, bouncers, and a few bureaucratic tollbooths on the way out. Once you figure out who’s working the front door and who’s stamping passports at the exit, the rest is just good manners.

Let’s take a quick tour of the establishment.

A ninety-second tour of the premises

There are really only two journeys you need to worry about in this club.

Getting In: A hopeful guest (the client) looks up the address (DNS), arrives at the front door, and is greeted by the head bouncer (Load Balancer). The bouncer checks the guest list and directs them to the right party room (Service), where they can finally meet up with their friend (the Pod).

Getting Out: One of our Pods needs to step out for some fresh air. It gets an escort from the building’s internal security (the Node’s ENI), follows the designated hallways (VPC routing), and is shown to the correct exit—be it the public taxi stand (NAT Gateway), a private car service (VPC Endpoint), or a connecting tunnel to another venue (Transit Gateway).

The secret sauce in EKS is that our Pods aren’t just faceless guests; the AWS VPC CNI gives them real VPC IP addresses. This means the building’s security rules, Security Groups, route tables, and NACLs aren’t just theoretical policies. They are the very real guards and locked doors that decide whether a packet’s journey ends in success or a silent, unceremonious death.

Getting past the velvet rope

In Kubernetes, Ingress is the set of rules that governs the front door. But rules on paper are useless without someone to enforce them. That someone is a controller, a piece of software that translates your guest list into actual, physical bouncers in AWS.

The head of security for EKS is the AWS Load Balancer Controller. You hand it an Ingress manifest, and it sets up the door staff.

For your standard HTTP web traffic, it deploys an Application Load Balancer (ALB). Think of the ALB as a meticulous, sharp-dressed bouncer who doesn’t just check your name. It inspects your entire invitation (the HTTP request), looks at the specific event you’re trying to attend (/login or /api/v1), and only then directs you to the right room.
For less chatty protocols like raw TCP, UDP, or when you need sheer, brute-force throughput, it calls in a Network Load Balancer (NLB). The NLB is the big, silent type. It checks that you have a ticket and shoves you toward the main hall. It’s incredibly fast but doesn’t get involved in the details.

This whole operation can be made public or private. For internal-only events, the controller sets up an internal ALB or NLB and uses a private Route 53 zone, hiding the party from the public internet entirely.

The modern VIP system

The classic Ingress system works, but it can feel a bit like managing your guest list with a stack of sticky notes. The rules for routing, TLS, and load balancer behavior are all crammed into a single resource, creating a glorious mess of annotations.

This is where the Gateway API comes in. It’s the successor to Ingress, designed by people who clearly got tired of deciphering annotation soup. Its genius lies in separating responsibilities.

The Platform team (the club owners) manages the Gateway. They decide where the entrances are, what protocols are allowed (HTTP, TCP), and handle the big-picture infrastructure like TLS certificates.
The Application teams (the party hosts) manage Routes (HTTPRoute, TCPRoute, etc.). They just point to an existing Gateway and define the rules for their specific application, like “send traffic for app.example.com/promo to my service.”

This creates a clean separation of duties, offers richer features for traffic management without resorting to custom annotations, and makes your setup far more portable across different environments.

The art of the graceful exit

So, your Pods are happily running inside the club. But what happens when they need to call an external API, pull an image, or talk to a database? They need to get out. This is egress, and it’s mostly about navigating the building’s corridors and exits.

The public taxi stand: For general internet access from private subnets, Pods are sent to a NAT Gateway. It works, but it’s like a single, expensive taxi stand for the whole neighborhood. Every trip costs money, and if it gets too busy, you’ll see it on your bill. Pro tip: Put one NAT in each Availability Zone to avoid paying extra for your Pods to take a cross-town cab just to get to the taxi stand.
The private car service: When your Pods need to talk to other AWS services (like S3, ECR, or Secrets Manager), sending them through the public internet is a waste of time and money. Use
VPC endpoints instead. Think of this as a pre-booked black car service. It creates a private, secure tunnel directly from your VPC to the AWS service. It’s faster, cheaper, and the traffic never has to brave the public internet.
The diplomatic passport: The worst way to let Pods talk to AWS APIs is by attaching credentials to the node itself. That’s like giving every guest in the club a master key. Instead, we use
IRSA (IAM Roles for Service Accounts). This elegantly binds an IAM role directly to a Pod’s service account. It’s the equivalent of issuing your Pod a diplomatic passport. It can present its credentials to AWS services with full authority, no shared keys required.

Setting the house rules

By default, Kubernetes networking operates with the cheerful, chaotic optimism of a free-for-all music festival. Every Pod can talk to every other Pod. In production, this is not a feature; it’s a liability. You need to establish some house rules.

Your two main tools for this are Security Groups and NetworkPolicy.

Security Groups are your Pod’s personal bodyguards. They are stateful and wrap around the Pod’s network interface, meticulously checking every incoming and outgoing connection against a list you define. They are an AWS-native tool and very precise.

NetworkPolicy, on the other hand, is the club’s internal security team. You need to hire a third-party firm like Calico or Cilium to enforce these rules in EKS, but once you do, you can create powerful rules like “Pods in the ‘database’ room can only accept connections from Pods in the ‘backend’ room on port 5432.”

The most sane approach is to start with a default deny policy. This is the bouncer’s universal motto: “If your name’s not on the list, you’re not getting in.” Block all egress by default, then explicitly allow only the connections your application truly needs.

A few recipes from the bartender

Full configurations are best kept in a Git repository, but here are a few cocktail recipes to show the key ingredients.

Recipe 1: Public HTTPS with a custom domain. This Ingress manifest tells the AWS Load Balancer Controller to set up a public-facing ALB, listen on port 443, use a specific TLS certificate from ACM, and route traffic for app.yourdomain.com to the webapp service.

# A modern Ingress for your web application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    # Set the bouncer to be public
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Talk to Pods directly for better performance
    alb.ingress.kubernetes.io/target-type: ip
    # Listen for secure traffic
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Here's the TLS certificate to wear
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id
spec:
  ingressClassName: alb
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 8080

Recipe 2: A diplomatic passport for S3 access. This gives our Pod a ServiceAccount annotated with an IAM role ARN. Any Pod that uses this service account can now talk to AWS APIs (like S3) with the permissions granted by that role, thanks to IRSA.

# The ServiceAccount with its IAM credentials
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  annotations:
    # This is the diplomatic passport: the ARN of the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EKS-S3-Reader-Role
---
# The Deployment that uses the passport
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
spec:
  replicas: 1
  selector:
    matchLabels: { app: reporter }
  template:
    metadata:
      labels: { app: reporter }
    spec:
      # Use the service account we defined above
      serviceAccountName: s3-reader-sa
      containers:
        - name: processor
          image: your-repo/report-generator:v1.5.2
          ports:
            - containerPort: 8080

A short closing worth remembering

When you boil it all down, Ingress is just the etiquette you enforce at the front door. Egress is the paperwork required for a clean exit. In EKS, the etiquette is defined by Kubernetes resources, while the paperwork is pure AWS networking. Neither one cares about your intentions unless you write them down clearly.

So, draw the path for traffic both ways, pick the right doors for the job, give your Pods a proper identity, and set the tolls where they make sense. If you do, the cluster will behave, the bill will behave, and your on-call shifts might just start tasting a lot more like sleep.

September 23, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

Your Kubernetes rollback is lying

The PagerDuty alert screams. The new release, born just minutes ago with such promising release notes, is coughing up blood in production. The team’s Slack channel is a frantic mess of flashing red emojis. Someone, summoning the voice of a panicked adult, yells the magic word: “ROLLBACK!”

And so, Helm, our trusty tow-truck operator, rides in with a smile, waving its friendly green check marks. The dashboards, those silent accomplices, beam with the serene glow of healthy metrics. Kubernetes probes, ever so polite, confirm that the resurrected pods are, in fact, “breathing.”

Then, production face-plants. Hard.

The feeling is like putting a cartoon-themed bandage on a burst water pipe and then wondering, with genuine surprise, why the living room has become a swimming pool. This article is the autopsy of those “perfect” rollbacks. We’re going to uncover why your monitoring is a pathological liar, how network traffic becomes a double agent, and what to do so that the next time Helm gives you a thumbs-up, you can actually believe it.

A state that refuses to time-travel

The first, most brutal lie a rollback tells you is that it can turn back time. A helm rollback is like the “rewind” button on an old VCR remote; it diligently rewinds the tape (your YAML manifests), but it has absolutely no power to make the actors on screen younger.

Your application’s state is one of those stubborn actors.

While your ConfigMaps and Secrets might dutifully revert to their previous versions, your data lives firmly in the present. If your new release included a database migration that added a column, rolling back the application code doesn’t magically make that column disappear. Now your old code is staring at a database schema from the future, utterly confused, like a medieval blacksmith being handed an iPad.

The same goes for PersistentVolumeClaims, external caches like Redis, or messages sitting in a Kafka queue. The rollback command whispers sweet nothings about returning to a “known good state,” but it’s only talking about itself. The rest of your universe has moved on, and it refuses to travel back with you.

The overly polite doorman

The second culprit in our investigation is the Kubernetes probe. Think of the readinessProbe as an overly polite doorman at a fancy party. Its job is to check if a guest (your pod) is ready to enter. But its definition of “ready” can be dangerously optimistic.

Many applications, especially those running on the JVM, have what we’ll call a “warming up” period. When a pod starts, the process is running, the HTTP port is open, and it will happily respond to a simple /health check. The doorman sees a guest in a tuxedo and says, “Looks good to me!” and opens the door.

What the doorman doesn’t see is that this guest is still stretching, yawning, and trying to remember where they are. The application’s caches are cold, its connection pools are empty, and its JIT compiler is just beginning to think about maybe, possibly, optimizing some code. The first few dozen requests it receives will be painfully slow or, worse, time out completely.

So while your readinessProbe is giving you a green light, your first wave of users is getting a face full of errors. For these sleepy applications, you need a more rigorous bouncer.

A startupProbe is that bouncer. It gives the app a generous amount of time to get its act together before even letting the doorman (readiness and liveness probes) start their shift.

# This probe gives our sleepy JVM app up to 5 minutes to wake up.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  # Kubelet will try 30 times with a 10-second interval (300 seconds).
  # If the app isn't ready by then, the pod will be restarted.
  failureThreshold: 30
  periodSeconds: 10

Without it, your rollback creates a fleet of pods that are technically alive but functionally useless, and Kubernetes happily sends them a flood of unsuspecting users.

Traffic, the double agent

And that brings us to our final suspect: the network traffic itself. In a modern setup using a service mesh like Istio or Linkerd, traffic routing is a sophisticated dance. But even the most graceful dancer can trip.

When you roll back, a new ReplicaSet is created with the old pod specification. The service mesh sees these new pods starting up, asks the doorman (readinessProbe) if they’re good to go, gets an enthusiastic “yes!”, and immediately starts sending them a percentage of live production traffic.

This is where all our problems converge. Your service mesh, in its infinite efficiency, has just routed 50% of your user traffic to a platoon of sleepy, confused pods that are trying to talk to a database from the future.

Let’s look at the evidence. This VirtualService, which we now call “The 50/50 Disaster Splitter,” was routing traffic with criminal optimism.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 50 # 50% to the (theoretically) working pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 50 # 50% to the pods we just dragged from the past

The service mesh isn’t malicious. It’s just an incredibly efficient tool that is very good at following bad instructions. It sees a green light and hits the accelerator.

A survival guide that won’t betray you

So, you’re in the middle of a fire, and the “break glass in case of emergency” button is a lie. What do you do? You need a playbook that acknowledges reality.

Step 0: Breathe and isolate the blast radius

Before you even think about rolling back, stop the bleeding. The fastest way to do that is often at the traffic level. Use your service mesh or ingress controller to immediately shift 100% of traffic back to the last known good version. Don’t wait for new pods to start. This is a surgical move that takes seconds and gives you breathing room.

Step 1: Declare an incident and gather the detectives

Get the right people on a call. Announce that this is not a “quick rollback” but an incident investigation. Your goal is to understand why the release failed, not just to hit the undo button.

Step 2: Perform the autopsy (while the system is stable)

With traffic safely routed away from the wreckage, you can now investigate. Check the logs of the failed pods. Look at the database. Is there a schema mismatch? A bad configuration? This is where you find the real killer.

Step 3: Plan the counter-offensive (which might not be a rollback)

Sometimes, the safest path forward is a roll forward. A small hotfix that corrects the issue might be faster and less risky than trying to force the old code to work with a new state. A rollback should be a deliberate, planned action, not a panic reflex. If you must roll back, do it with the knowledge you’ve gained from your investigation.

Step 4: The deliberate, cautious rollback

If you’ve determined a rollback is the only way, do it methodically.

Scale down the broken deployment:
kubectl scale deployment/checkout-api –replicas=0
Execute the Helm rollback:
helm rollback checkout-api 1 -n prod-eu-central
Watch the new pods like a hawk: Monitor their logs and key metrics as they come up. Don’t trust the green check marks.
Perform a Canary Release: Once the new pods look genuinely healthy, use your service mesh to send them 1% of the traffic. Then 10%. Then 50%. Then 100%. You are now in control, not the blind optimism of the automation.

The truth will set you free

A Kubernetes rollback isn’t a time machine. It’s a YAML editor with a fancy title. It doesn’t understand your data, it doesn’t appreciate your app’s need for a morning coffee, and it certainly doesn’t grasp the nuances of traffic routing under pressure.

Treating a rollback as a simple, safe undo button is the fastest way to turn a small incident into a full-blown outage. By understanding the lies it tells, you can build a process that trusts human investigation over deceptive green lights. So the next time a deployment goes sideways, don’t just reach for the rollback lever. Reach for your detective’s hat instead.

September 11, 2025 by Fernando SRE DevOps stuff Kubernetes SRE stuff

If your Kubernetes YAML looks Like hieroglyphics, this post is for you

It all started, as most tech disasters do, with a seductive whisper. “Just describe your infrastructure with YAML,” Kubernetes cooed. “It’ll be easy,” it said. And we, like fools in a love story, believed it.

At first, it was a beautiful romance. A few files, a handful of lines. It was elegant. It was declarative. It was… manageable. But entropy, the nosy neighbor of every DevOps team, had other plans. Our neat little garden of YAML files soon mutated into a sprawling, untamed jungle of configuration.

We had 12 microservices jostling for position, spread across 4 distinct environments, each with its own personality quirks and dark secrets. Before we knew it, we weren’t writing infrastructure anymore; we were co-authoring a Byzantine epic in a language seemingly designed by bureaucrats with a fetish for whitespace.

The question that broke the camel’s back

The day of reckoning didn’t arrive with a server explosion or a database crash. It came with a question. A question that landed in our team’s Slack channel with the subtlety of a dropped anvil, courtesy of a junior engineer who hadn’t yet learned to fear the YAML gods.

“Hey, why does our staging pod have a different CPU limit than prod?”

Silence. A deep, heavy, digital silence. The kind of silence that screams, “Nobody has a clue.”

What followed was an archaeological dig into the fossil record of our own repository. We unearthed layers of abstractions we had so cleverly built, peeling them back one by one. The trail led us through a hellish labyrinth:

We started at deployment.yaml, the supposed source of all truth.
That led us to values.yaml, the theoretical source of all truth.
From there, we spelunked into values.staging.yaml, where truth began to feel… relative.
We stumbled upon a dusty patch-cpu-emergency.yaml, a fossil from a long-forgotten crisis.
Then we navigated the dark forest of custom/kustomize/base/deployment-overlay.yaml.
And finally, we reached the Rosetta Stone of our chaos: an argocd-app-of-apps.yaml.

The revelation was as horrifying as finding a pineapple on a pizza: we had declared the same damn value six times, in three different formats, using two tools that secretly despised each other. We weren’t managing the configuration. We were performing a strange, elaborate ritual and hoping the servers would be pleased.

That’s when we knew. This wasn’t a configuration problem. It was an existential crisis. We were, without a doubt, deep in YAML Hell.

The tools that promised heaven and delivered purgatory

Let’s talk about the “friends” who were supposed to help. These tools promised to be our saviors, but without discipline, they just dug our hole deeper.

Helm, the chaotic magician

Helm is like a powerful but slightly drunk magician. When it works, it pulls a rabbit out of a hat. When it doesn’t, it sets the hat on fire, and the rabbit runs off with your wallet.

The Promise: Templating! Variables! A whole ecosystem of charts!

The Reality: Debugging becomes a form of self-torment that involves piping helm template into grep and praying. You end up with conditionals inside your templates that look like this:

image:
  repository: {{ .Values.image.repository | quote }}
  tag: {{ .Values.image.tag | default .Chart.AppVersion }}
  pullPolicy: {{ .Values.image.pullPolicy | default "IfNotPresent" }}

This looks innocent enough. But then someone forgets to pass image.tag for a specific environment, and you silently deploy :latest to production on a Friday afternoon. Beautiful.

Kustomize the master of patches

Kustomize is the “sensible” one. It’s built into kubectl. It promises clean, layered configurations. It’s like organizing your Tupperware drawer with labels.

The Promise: A clean base and tidy overlays for each environment.

The Reality: Your patch files quickly become a mystery box. You see this in your kustomization.yaml:

patchesStrategicMerge:
  - increase-replica-count.yaml
  - add-resource-limits.yaml
  - disable-service-monitor.yaml

Where are these files? What do they change? Why does disable-service-monitor.yaml only apply to the dev environment? Good luck, detective. You’ll need it.

ArgoCD, the all-seeing eye (that sometimes blinks)

GitOps is the dream. Your Git repo is the single source of truth. No more clicking around in a UI. ArgoCD or Flux will make it so.

The Promise: Declarative, automated sync from Git to cluster. Rollbacks are just a git revert away.

The Reality: If your Git repo is a dumpster fire of conflicting YAML, ArgoCD will happily, dutifully, and relentlessly sync that dumpster fire to production. It won’t stop you. One bad merge, and you’ve automated a catastrophe.

Our escape from YAML hell was a five-step sanity plan

We knew we couldn’t burn it all down. We had to tame the beast. So, we gathered the team, drew a line in the sand, and created five commandments for configuration sanity.

1. We built a sane repo structure

The first step was to stop the guesswork. We enforced a simple, predictable layout for every single service.

├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
└── overlays/
    ├── dev/
    │   ├── kustomization.yaml
    │   └── values.yaml
    ├── staging/
    │   ├── kustomization.yaml
    │   └── values.yaml
    └── prod/
        ├── kustomization.yaml
        └── values.yaml

This simple change eliminated 80% of the “wait, which file do I edit?” conversations.

2. One source of truth for values

This was a sacred vow. Each environment gets one values.yaml file. That’s it. We purged the heretics:

values-prod-final.v2.override.yaml
backup-of-values.yaml
donotdelete-temp-config.yaml

If a value wasn’t in the designated values.yaml for that environment, it didn’t exist. Period.

3. We stopped mixing Helm and Kustomize

You have to pick a side. We made a rule: if a service requires complex templating logic, use Helm. If it primarily needs simple overlays (like changing replica counts or image tags per environment), use Kustomize. Using both on the same service is like trying to write a sentence in two languages at once. It’s a recipe for suffering.

4. We render everything before deploying

Trust, but verify. We added a mandatory step in our CI pipeline to render the final YAML before it ever touches the cluster.

# For Helm + Kustomize setups
helm template . --values overlays/prod/values.yaml | \
kustomize build | \
kubeval -

This simple script does three magical things:

It validates that the output is syntactically correct YAML.
It lets us see exactly what is about to be applied.
It has completely eliminated the “well, that’s not what I expected” class of production incidents.

5. We built a simple config CLI

To make the right way the easy way, we built a small internal CLI tool. Now, instead of navigating the YAML jungle, an engineer simply runs:

$ ops-cli config generate --app=user-service --env=prod

This tool:

Pulls the correct base templates and overlay values.
Renders the final, glorious YAML.
Validates it against our policies.
Shows the developer a diff of what will change in the cluster.
Saves lives and prevents hair loss.

YAML is now a tool again, not a trap.

The afterlife is peaceful

YAML didn’t ruin our lives. We did, by refusing to treat it with the respect it demands. Templating gives you incredible power, but with great power comes great responsibility… and redundancy, and confusion, and pull requests with 10,000 lines of whitespace changes. Now, we treat our YAML like we treat our application code. We lint it. We test it. We render it. And most importantly, we’ve built a system that makes it difficult to do the wrong thing. It’s the institutional equivalent of putting childproof locks on the kitchen cabinets. A determined toddler could probably still get to the cleaning supplies, but it would require a conscious, frustrating effort. Our system doesn’t make us smarter; it just makes our inevitable moments of human fallibility less catastrophic. It’s the guardrail on the scenic mountain road of configuration. You can still drive off the cliff, but you have to really mean it. Our infrastructure is no longer a hieroglyphic. It’s just… configuration. And the resulting boredom is a beautiful thing.

August 3, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

What if Kubernetes was the wrong tool for almost everyone?

It was 11 PM on a Tuesday, and for the third time that week, we were on an emergency call. Not because our product had a critical bug, but because a routine deployment had, once again, mysteriously broken the internal DNS. Three of our sharpest engineers weren’t creating value; they were offering sacrifices to the capricious god of Istio, hoping it might bless our pods with connectivity.

As I stared at the sprawling diagram we’d made just to add a single, simple microservice, a heretical thought wormed its way into my brain: When did our job stop being about building software and become about… well, serving Kubernetes?

A rumor we couldn’t ignore

The next morning, amidst our collective YAML-induced hangover, someone dropped a screenshot into our team’s Slack channel. It was from some tiny, no-name startup, claiming they were getting 8x the performance of a standard Kubernetes stack at one-tenth of the cost.

The team’s reaction was a collective, cynical laugh. “Sure,” our lead SRE typed, “and my home server can out-render Pixar.” It was obviously marketing fluff. Propaganda. The post itself was deleted within an hour, but the screenshot had already gone viral. It was absurd, unbelievable, and we all knew it was nonsense.

But the idea, like a catchy, terrible pop song, got stuck in our heads.

Let’s just prove it’s impossible

That Friday, we decided to do it. We’d run a small, contained experiment, mostly for the bragging rights of publicly debunking the ridiculous claim. The plan was simple: take one of our standard, moderately complex services and see what it would take to run it on a simpler stack.

Our service, an image processor, wasn’t a behemoth, but its YAML file had grown… organically. Like a colony of particularly stubborn mold. It had sidecars, persistent volume claims, readiness probes, and enough annotations to qualify as a short novel.

Here’s a sanitized glimpse of the beast we were trying to tame:

# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: image-processor-svc
#   labels:
#     app: image-processor
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: image-processor
#   template:
#     metadata:
#       labels:
#         app: image-processor
#     spec:
#       containers:
#       - name: processor
#         image: our-repo/image-processor:v1.2.4
#         ports:
#         - containerPort: 8080
#         resources:
#           requests:
#             memory: "512Mi"
#             cpu: "250m"
#           limits:
#             memory: "1024Mi"
#             cpu: "500m"
#         readinessProbe:
#           httpGet:
#             path: /healthz
#             port: 8080
#           initialDelaySeconds: 5
#           periodSeconds: 10
#       - name: metrics-sidecar
#         image: prom/statsd-exporter:v0.22.0
#         args:
#         - "--statsd.mapping-config=/etc/statsd/mapper.yml"
#         ports:
#         - name: metrics
#           containerPort: 9102
#         volumeMounts:
#         - name: config-volume
#           mountPath: /etc/statsd
#   # ...and so on, for another 150 lines.

We figured it would take us all afternoon just to untangle it.

The uncomfortable silence of success

We set up a competing stack using Firecracker MicroVMs, essentially tiny, lightning-fast virtual machines. The goal was to run the same container, but without the entire Kubernetes universe orbiting around it.

By 3 PM, we had our first results. And that’s when the room went quiet. It was the kind of uncomfortable silence you get when you realize the joke you’ve been telling for months is actually on you.

The numbers weren’t just holding up; they were embarrassing us. We stared at the Grafana dashboard, waiting for the figures to make sense. They didn’t. Our projected monthly cloud bill for this single service didn’t just shrink; it plummeted.

We had spent years building a complex, expensive, and fragile machine, all to solve a problem that, it turned out, could be handled with a much, much simpler approach.

Our quest for sane infrastructure

That weekend experiment turned into a full-blown obsession. Inspired, we started building our own internal escape hatch. We cobbled together a tool that we jokingly called the “SQL-ifier.” The premise was simple and, to a Kubernetes purist, utterly profane: what if you could manage infrastructure with simple, readable rules instead of YAML incantations?

Instead of a 200-line YAML file for an autoscaler, what if you could just write this?

-- This is our internal, SQL-like syntax for managing MicroVMs.
-- It's not a public standard (yet!).

-- Rule: If CPU usage on the frontend service is over 80% for 3 minutes,
-- add two more instances, but never exceed 20 total.
ON high_cpu(>80%) FOR 3m
IF service.name = 'image-processor-svc'
DO SCALE service.name TO instances + 2
LIMIT 20;

-- Rule: If we see more than 10 critical payment errors in 1 minute,
-- immediately revert to the last stable version.
ON log_error(level='critical', service='payment-gateway') > 10 FOR 1m
DO ROLLBACK service 'payment-gateway' TO previous_stable;

It was declarative, readable, and, most importantly, it could be understood by a human being without needing a certification.

How did we all end up here?

This journey forced us to ask a bigger question. Why did we, and thousands of other smart teams, willingly chain ourselves to this complexity?

The answer is surprisingly human. We bought an 18-wheeler truck to do our weekly grocery shopping.

Sure, a giant truck can carry milk and eggs. But you spend most of your time finding a place to park it, paying for diesel, getting a special license to drive it, and explaining to your neighbors why you just flattened their mailbox. Kubernetes was built for Google-scale problems. Most of us run businesses that need the reliability of a Toyota Camry, not a fleet of space shuttles. We adopted the tool because everyone else did, mistaking its complexity for sophistication.

Is your infrastructure serving you?

This whole experience led us to create a simple “mirror test.” If you’re wondering if you’re in the same boat, ask your team these questions:

Do you dread upgrading your cluster more than you dread a root canal?
Is a significant portion of your engineering time spent on “infra-babysitting” instead of building product features?
Could you confidently explain your service mesh configuration to a new hire without a whiteboard and a two-hour meeting?

If you answered “yes” to two or more, you might not have an infrastructure problem. You might have a Kubernetes problem.

This isn’t a manifesto to uninstall kubectl tomorrow. Some organizations genuinely operate at a scale where Kubernetes is not just useful, but necessary. This is just a friendly nudge. A reminder to look up from your YAML files once in a while and ask: is this tool still serving me, or have I started serving it?

July 30, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

How Headless services and StatefulSets work together in Kubernetes

Kubernetes is an open-source platform designed to seamlessly manage containerized applications. Imagine the manager at your favorite café coordinating baristas, chefs, and servers effortlessly, ensuring a smooth customer experience every single time. Kubernetes automates deployments, scaling, and operations, making it indispensable for today’s complex digital landscape.

Understanding headless services

At first glance, Headless Services might seem unusual, yet they’re essential Kubernetes components. Regular Kubernetes Services act as receptionists routing your calls; Headless Services, however, skip the receptionist altogether and connect you directly to individual pods via their unique IP addresses.

Consider them as a neighborhood directory listing direct phone numbers, eliminating the central switchboard. This direct approach is particularly beneficial when individual pod identity and communication are critical, such as with database clusters.

Example YAML for a headless service:

apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None
  selector:
    app: my-app
  ports:
  - port: 80

Demystifying StatefulSets

StatefulSets uniquely manage stateful applications by assigning each pod a stable identity and persistent storage. Imagine a classroom where each student (pod) has an assigned desk (storage) that remains consistent, no matter how often they come and go.

Comparing StatefulSets and deployments

Deployments are ideal for stateless applications, where each instance is interchangeable and can be replaced without affecting the overall system. StatefulSets, however, excel with stateful applications, ensuring pods have stable identities and persistent storage, perfect for databases and message queues.

Example YAML for a StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-app
spec:
  serviceName: my-headless-service
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app-container
        image: my-app-image
        ports:
        - containerPort: 80
        volumeMounts:
        - name: my-volume
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: my-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

The strength of pairing headless services with StatefulSets

Headless Services and StatefulSets each have significant strengths independently, but they truly shine when combined. Headless Services provide stable network identities for StatefulSet pods, akin to each member of a specialized team having their direct communication line for efficient collaboration.

Picture an emergency medical team; direct lines enable doctors and nurses to coordinate rapidly and precisely during critical situations. Similarly, distributed databases such as Cassandra or MongoDB rely heavily on this direct communication model to maintain data consistency and reliability.

Practical use-case

Consider a Cassandra database running on Kubernetes. StatefulSets ensure each Cassandra node has dedicated data storage and a unique identity. With Headless Services, these nodes communicate directly, consistently synchronizing data and ensuring seamless accessibility, irrespective of which node handles the incoming requests.

Concluding insights

Headless Services combined with StatefulSets form a powerful solution for managing stateful applications within Kubernetes. They address distinct challenges in state management and network stability, ensuring reliability and scalability for your applications.

Leveraging these Kubernetes capabilities equips your infrastructure for success, akin to empowering each team member with the necessary tools for clear communication and consistent performance. Embrace this dynamic duo for a more robust and efficient Kubernetes environment.

July 9, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

Why Kubernetes Ingress feels outdated and Gateway API is stepping up

Kubernetes has transformed container orchestration, rapidly pushing the boundaries of scalability and flexibility. Yet some core components haven’t evolved as gracefully. Kubernetes Ingress is a prime example; it’s beginning to feel like using an old flip phone when everyone else has moved on to smartphones.

What’s driving this shift away from the once-reliable Ingress, and why are more Kubernetes professionals turning to Gateway API?

The rise and limits of Kubernetes Ingress

When Kubernetes introduced Ingress, its appeal lay in its simplicity. Its job was straightforward: route HTTP and HTTPS traffic into Kubernetes clusters predictably. Like traffic lights at a busy intersection, it provided clear and reliable outcomes: set paths and hostnames, and your Ingress controller (NGINX, Traefik, or others) took care of the rest.

However, as Kubernetes workloads grew more complex, this simplicity became restrictive. Teams began seeking advanced capabilities such as canary deployments, complex traffic management, support for additional protocols, and finer control. Unfortunately, Ingress remained static, forcing teams to rely on cumbersome vendor-specific customizations.

Why Ingress now feels outdated

Ingress still performs adequately, but managing it becomes increasingly cumbersome as complexity rises. It’s comparable to owning a reliable but outdated vehicle; it gets you to your destination but lacks modern conveniences. Here’s why Ingress feels out-of-date:

Limited protocol support – Only HTTP and HTTPS are supported natively. If your applications require gRPC, TCP, or UDP, you’re out of luck.
Vendor lock-in with annotations – Advanced routing features and authentication mechanisms often require vendor-specific annotations, locking you into particular solutions.
Rigid permission models – Managing shared control across multiple teams is complicated and inefficient, similar to having a single TV remote for an entire household.
No evolutionary path – Ingress will remain stable but static, unable to evolve as the Kubernetes ecosystem demands greater flexibility.

Gateway API offers a modern alternative

Gateway API isn’t merely an upgraded Ingress; it’s a fundamental rethink of how Kubernetes handles external traffic. It cleanly separates roles and responsibilities, streamlining interactions between network administrators, platform teams, and developers. Think of it as a well-run restaurant: chefs, managers, and servers each have clear roles, ensuring smooth and efficient operation.

Additionally, Gateway API supports multiple protocols, including gRPC, TCP, and UDP, natively. This eliminates reliance on awkward annotations and vendor lock-in, resembling an upgrade from single-purpose appliances to versatile multi-function tools that adapt smoothly to emerging needs.

When Gateway API becomes essential

Gateway API won’t suit every situation, but specific scenarios benefit from its use. Consider these questions:

Do your applications require sophisticated traffic handling, like canary deployments or traffic mirroring?
Are your services utilizing protocols beyond HTTP and HTTPS?
Is your Kubernetes cluster shared among multiple teams, each needing distinct control?
Do you seek portability across cloud providers and wish to avoid vendor lock-in?
Do you often desire modern features that are unavailable through traditional Ingress?

Answering “yes” to any of these indicates that Gateway API isn’t just helpful, it’s essential.

Deciding to move forward

Ingress isn’t entirely obsolete. For straightforward HTTP/HTTPS routing for smaller services, it remains effective. But as soon as your needs scale up, involve complex traffic management, or require clear team boundaries, Gateway API becomes the superior choice.

Technology continuously advances, and your infrastructure must evolve with it. Gateway API isn’t a futuristic solution; it’s already here, enhancing your Kubernetes deployments with greater intelligence, flexibility, and manageability.

When better tools appear, upgrading isn’t merely sensible, it’s crucial. Gateway API represents precisely this meaningful advancement, ensuring your Kubernetes environment remains robust, adaptable, and ready for whatever comes next.

June 26, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

Achieving perfect elasticity in Kubernetes with multidimensional autoscaling

Running a Kubernetes environment can feel like a high-stakes game of guesswork. We estimate our application’s needs, define our resource requests, and hope we’ve struck the right balance. Too generous, and we’re paying for cloud resources that sit idle. Too conservative, and we risk sluggish performance or critical outages when real-world demand spikes. It’s a constant, stressful effort to manually tune a system that is inherently dynamic.

There is, however, a more elegant path. It involves moving away from this static guesswork and towards building a truly adaptive infrastructure. This is not about simply adding more tools; it’s about creating a self-regulating system that breathes with the rhythm of your workload. This is the core promise of a well-orchestrated Kubernetes autoscaling strategy. Let’s explore how to build it, piece by piece.

The three pillars of autoscaling

To build our adaptive system, we need to understand its three fundamental components. Think of them as the different ways a professional restaurant kitchen responds to a dinner rush.

The Horizontal Pod Autoscaler HPA

When a flood of orders hits the kitchen, the head chef doesn’t ask each cook to work twice as fast. The first, most logical step is to bring more cooks to the line. This is precisely what the Horizontal Pod Autoscaler does. It acts as the kitchen’s manager, watching the incoming demand (typically CPU or memory usage). As orders pile up, it adds more identical pod replicas, more “cooks”, to handle the load. When the rush subsides, it sends some cooks home, ensuring you’re only paying for the staff you need. It’s the frontline response to fluctuating demand.

The Vertical Pod Autoscaler VPA

Now, consider a specialized station, like the grill. What if the single grill cook is overwhelmed, not by the number of orders, but because their workspace is too small and inefficient? Simply adding another grill cook might just create more chaos in a cramped space. The better solution is to give the specialist a bigger, better grill station. This is the domain of the Vertical Pod Autoscaler. The VPA doesn’t change the number of pods. Instead, it meticulously observes the real-world resource consumption of a single pod over time and adjusts its allocated CPU and memory, its “workspace”, to be the perfect size. It answers the question, “How much power does this one cook need to do their job perfectly?”

The Cluster Autoscaler CA

What happens if the kitchen runs out of physical space? You can’t add more cooks or bigger grills if there’s no room for them. This is where the Cluster Autoscaler comes in. It is the architect of the kitchen itself. The CA doesn’t pay attention to individual orders or cooks. Its sole focus is space. When it sees pods that can’t be scheduled because no node has enough capacity, our “cooks without a counter”, it expands the kitchen by adding new nodes to the cluster. Conversely, when it sees entire sections of the kitchen sitting empty for too long, it smartly downsizes the space to keep operational costs low.

From static blueprints to dynamic reality

When we first deploy an application on Kubernetes, we manually define its resources.requests, and resources.limits. This is like creating a static architectural blueprint for our kitchen. We draw the lines based on our best assumptions.

But a blueprint doesn’t capture the chaotic, dynamic flow of a real dinner service. An application’s actual needs are rarely static. This is where the VPA transforms our approach. It moves us from relying on a fixed blueprint to observing the kitchen’s real-time workflow. It provides the data-driven intelligence to continuously refine and optimize our initial design, shifting us from a world of reactive fixes to one of proactive optimization.

How a great platform elevates the craft

Anyone can assemble a kitchen, but the difference between a home setup and a Michelin-star facility lies in the integration, quality, and advanced tooling. In the Kubernetes world, this is the value a managed platform like Google Kubernetes Engine (GKE) provides.

While HPA, VPA, and CA are open-source concepts, managing them yourself is like building and maintaining that professional kitchen from scratch. GKE offers them as fully managed, seamlessly integrated services.

Effortless setup. Enabling these autoscalers in GKE is a simple, declarative action, removing significant operational overhead.
An expert consultant, the VPA’s “recommendation-only” mode is a game-changer. It’s like having a master chef observe your kitchen and leave detailed notes on how to improve efficiency, all without interrupting service. This free, built-in guidance is invaluable for right-sizing your workloads.

However, GKE’s most significant innovation is a technique that solves a classic Kubernetes puzzle: The Multidimensional Pod Autoscaler (MPA).

Historically, trying to use HPA (more cooks) and VPA (better workspaces) on the same workload was a recipe for conflict. The two would issue contradictory signals, leading to instability. GKE’s MPA acts as the master head chef, intelligently coordinating both actions. It allows you to scale horizontally and vertically at the same time, ensuring your kitchen can both add more cooks and give them better equipment in one fluid motion. This is the ultimate expression of elasticity.

A practical blueprint for your strategy

With this understanding, you can now design a robust autoscaling strategy:

For Your Stateless Dishes (e.g., web frontends, APIs)
Start with the HPA to handle variable traffic. As you mature, graduate to the MPA to achieve a superior level of efficiency by scaling in both dimensions.
For Your Stateful Specialties (e.g., databases, message queues)
Rely on the VPA to meticulously right-size these critical components, ensuring they always have the exact resources needed for stable and reliable performance.
For the Entire Kitchen
Let the Cluster Autoscaler work in the background as your ever-vigilant architect, always ensuring there is enough underlying infrastructure for your applications to thrive.

An autonomous future awaits

We started with a stressful guessing game and have arrived at the blueprint for an intelligent, self-regulating infrastructure. By thoughtfully combining HPA, VPA, and CA, we evolve from being reactive system administrators to proactive cloud architects.

This journey culminates with tools like GKE’s Multidimensional Pod Autoscaler. The MPA is more than just another feature; it represents a paradigm shift. It solves the fundamental conflict between scaling out and scaling up, allowing our applications to adapt with a new level of intelligence. With MPA, workloads can simultaneously handle sudden traffic surges by adding replicas, while continuously right-sizing the resource footprint of each instance. This dual-axis scaling eliminates the trade-offs we once had to make, unlocking a state of true, cost-effective elasticity.

The path to this autonomous state is an incremental one. The best first step is to harness the power of observation. Start today by enabling VPA in recommendation-only mode on a non-production workload. Listen to its insights, understand your application’s real needs, and use that data to transform your static blueprints. This is the foundational skill that will empower you to confidently adopt multidimensional scaling, creating a dynamic, living system ready to meet any challenge that comes its way.

June 21, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes