PlatformEngineering

Ingress and egress on EKS made understandable

Getting traffic in and out of a Kubernetes cluster isn’t a magic trick. It’s more like running the city’s most exclusive nightclub. It’s a world of logistics, velvet ropes, bouncers, and a few bureaucratic tollbooths on the way out. Once you figure out who’s working the front door and who’s stamping passports at the exit, the rest is just good manners.

Let’s take a quick tour of the establishment.

A ninety-second tour of the premises

There are really only two journeys you need to worry about in this club.

Getting In: A hopeful guest (the client) looks up the address (DNS), arrives at the front door, and is greeted by the head bouncer (Load Balancer). The bouncer checks the guest list and directs them to the right party room (Service), where they can finally meet up with their friend (the Pod).

Getting Out: One of our Pods needs to step out for some fresh air. It gets an escort from the building’s internal security (the Node’s ENI), follows the designated hallways (VPC routing), and is shown to the correct exit—be it the public taxi stand (NAT Gateway), a private car service (VPC Endpoint), or a connecting tunnel to another venue (Transit Gateway).

The secret sauce in EKS is that our Pods aren’t just faceless guests; the AWS VPC CNI gives them real VPC IP addresses. This means the building’s security rules, Security Groups, route tables, and NACLs aren’t just theoretical policies. They are the very real guards and locked doors that decide whether a packet’s journey ends in success or a silent, unceremonious death.

Getting past the velvet rope

In Kubernetes, Ingress is the set of rules that governs the front door. But rules on paper are useless without someone to enforce them. That someone is a controller, a piece of software that translates your guest list into actual, physical bouncers in AWS.

The head of security for EKS is the AWS Load Balancer Controller. You hand it an Ingress manifest, and it sets up the door staff.

For your standard HTTP web traffic, it deploys an Application Load Balancer (ALB). Think of the ALB as a meticulous, sharp-dressed bouncer who doesn’t just check your name. It inspects your entire invitation (the HTTP request), looks at the specific event you’re trying to attend (/login or /api/v1), and only then directs you to the right room.
For less chatty protocols like raw TCP, UDP, or when you need sheer, brute-force throughput, it calls in a Network Load Balancer (NLB). The NLB is the big, silent type. It checks that you have a ticket and shoves you toward the main hall. It’s incredibly fast but doesn’t get involved in the details.

This whole operation can be made public or private. For internal-only events, the controller sets up an internal ALB or NLB and uses a private Route 53 zone, hiding the party from the public internet entirely.

The modern VIP system

The classic Ingress system works, but it can feel a bit like managing your guest list with a stack of sticky notes. The rules for routing, TLS, and load balancer behavior are all crammed into a single resource, creating a glorious mess of annotations.

This is where the Gateway API comes in. It’s the successor to Ingress, designed by people who clearly got tired of deciphering annotation soup. Its genius lies in separating responsibilities.

The Platform team (the club owners) manages the Gateway. They decide where the entrances are, what protocols are allowed (HTTP, TCP), and handle the big-picture infrastructure like TLS certificates.
The Application teams (the party hosts) manage Routes (HTTPRoute, TCPRoute, etc.). They just point to an existing Gateway and define the rules for their specific application, like “send traffic for app.example.com/promo to my service.”

This creates a clean separation of duties, offers richer features for traffic management without resorting to custom annotations, and makes your setup far more portable across different environments.

The art of the graceful exit

So, your Pods are happily running inside the club. But what happens when they need to call an external API, pull an image, or talk to a database? They need to get out. This is egress, and it’s mostly about navigating the building’s corridors and exits.

The public taxi stand: For general internet access from private subnets, Pods are sent to a NAT Gateway. It works, but it’s like a single, expensive taxi stand for the whole neighborhood. Every trip costs money, and if it gets too busy, you’ll see it on your bill. Pro tip: Put one NAT in each Availability Zone to avoid paying extra for your Pods to take a cross-town cab just to get to the taxi stand.
The private car service: When your Pods need to talk to other AWS services (like S3, ECR, or Secrets Manager), sending them through the public internet is a waste of time and money. Use
VPC endpoints instead. Think of this as a pre-booked black car service. It creates a private, secure tunnel directly from your VPC to the AWS service. It’s faster, cheaper, and the traffic never has to brave the public internet.
The diplomatic passport: The worst way to let Pods talk to AWS APIs is by attaching credentials to the node itself. That’s like giving every guest in the club a master key. Instead, we use
IRSA (IAM Roles for Service Accounts). This elegantly binds an IAM role directly to a Pod’s service account. It’s the equivalent of issuing your Pod a diplomatic passport. It can present its credentials to AWS services with full authority, no shared keys required.

Setting the house rules

By default, Kubernetes networking operates with the cheerful, chaotic optimism of a free-for-all music festival. Every Pod can talk to every other Pod. In production, this is not a feature; it’s a liability. You need to establish some house rules.

Your two main tools for this are Security Groups and NetworkPolicy.

Security Groups are your Pod’s personal bodyguards. They are stateful and wrap around the Pod’s network interface, meticulously checking every incoming and outgoing connection against a list you define. They are an AWS-native tool and very precise.

NetworkPolicy, on the other hand, is the club’s internal security team. You need to hire a third-party firm like Calico or Cilium to enforce these rules in EKS, but once you do, you can create powerful rules like “Pods in the ‘database’ room can only accept connections from Pods in the ‘backend’ room on port 5432.”

The most sane approach is to start with a default deny policy. This is the bouncer’s universal motto: “If your name’s not on the list, you’re not getting in.” Block all egress by default, then explicitly allow only the connections your application truly needs.

A few recipes from the bartender

Full configurations are best kept in a Git repository, but here are a few cocktail recipes to show the key ingredients.

Recipe 1: Public HTTPS with a custom domain. This Ingress manifest tells the AWS Load Balancer Controller to set up a public-facing ALB, listen on port 443, use a specific TLS certificate from ACM, and route traffic for app.yourdomain.com to the webapp service.

# A modern Ingress for your web application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    # Set the bouncer to be public
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Talk to Pods directly for better performance
    alb.ingress.kubernetes.io/target-type: ip
    # Listen for secure traffic
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Here's the TLS certificate to wear
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id
spec:
  ingressClassName: alb
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 8080

Recipe 2: A diplomatic passport for S3 access. This gives our Pod a ServiceAccount annotated with an IAM role ARN. Any Pod that uses this service account can now talk to AWS APIs (like S3) with the permissions granted by that role, thanks to IRSA.

# The ServiceAccount with its IAM credentials
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  annotations:
    # This is the diplomatic passport: the ARN of the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EKS-S3-Reader-Role
---
# The Deployment that uses the passport
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
spec:
  replicas: 1
  selector:
    matchLabels: { app: reporter }
  template:
    metadata:
      labels: { app: reporter }
    spec:
      # Use the service account we defined above
      serviceAccountName: s3-reader-sa
      containers:
        - name: processor
          image: your-repo/report-generator:v1.5.2
          ports:
            - containerPort: 8080

A short closing worth remembering

When you boil it all down, Ingress is just the etiquette you enforce at the front door. Egress is the paperwork required for a clean exit. In EKS, the etiquette is defined by Kubernetes resources, while the paperwork is pure AWS networking. Neither one cares about your intentions unless you write them down clearly.

So, draw the path for traffic both ways, pick the right doors for the job, give your Pods a proper identity, and set the tolls where they make sense. If you do, the cluster will behave, the bill will behave, and your on-call shifts might just start tasting a lot more like sleep.

Stop building cathedrals in Terraform

It’s 9 AM on a Tuesday. You, a reasonably caffeinated engineer, open a pull request to add a single tag to an S3 bucket. A one-line change. You run terraform plan and watch in horror as your screen scrolls with a novel’s worth of green, yellow, and red text. Two hundred and seventeen resources want to be updated.

Welcome to a special kind of archaeological dig. Somewhere, buried three folders deep, a “reusable” module you haven’t touched in six months has decided to redecorate your entire production environment. The brochure promised elegance and standards. The reality is a Tuesday spent doing debugging, cardio, and praying to the Git gods.

Small teams, in particular, fall into this trap. You don’t need to build a glorious cathedral of abstractions just to hang a picture on the wall. You need a hammer, a nail, and enough daylight to see what you’re doing.

The allure of the perfect system

Let’s be honest, custom Terraform modules are seductive. They whisper sweet nothings in your ear about the gospel of DRY (Don’t Repeat Yourself). They promise a future where every resource is a perfect, standardized snowflake, lovingly stamped out from a single, blessed template. It’s the engineering equivalent of having a perfectly organized spice rack where all the labels face forward.

In theory, it’s beautiful. In practice, for a small, fast-moving team, it’s a tax. A heavy one. An indirection tax.

What starts as a neat wrapper today becomes a Matryoshka doll of complexity by next quarter. Inputs multiply. Defaults are buried deeper than state secrets. Soon, flipping a single boolean in a variables.tf file feels like rewiring a nuclear submarine with the lights off. The module is no longer serving you; you are now its humble servant.

It’s like buying one of those hyper-specific kitchen gadgets, like a banana slicer. Yes, it slices bananas. Perfectly. But now you own a piece of plastic whose only job is to do something a knife you already owned could do just fine. That universal S3 module you built is the junk drawer of your infrastructure. Sure, it holds everything, but now you have to rummage past a broken can opener and three instruction manuals just to find a spoon.

A heuristic for staying sane

So, what’s the alternative? Anarchy? Copy-pasting HCL like a digital barbarian? Yes. Sort of.

Here’s a simple, sanity-preserving heuristic:

Duplicate once without shame. Duplicate twice with comments. On the third time, and only then, consider extracting a module.

Until you hit that third, clear, undeniable repetition of a pattern, plain HCL is your best friend. It wins on speed, clarity, and keeping the blast radius of any change predictably small. You avoid abstracting a solution before you even fully understand the problem.

Let’s see it in action. You need a simple, private S3 bucket for your new service.

The cathedral-builder’s approach might look like this:

# service-alpha/main.tf

module "service_alpha_bucket" {
  source = "git::ssh://git@github.com/your-org/terraform-modules.git//s3/private-bucket?ref=v1.4.2"

  bucket_name      = "service-alpha-data-logs-2025"
  enable_versioning = true
  force_destroy    = false # Safety first!
  lifecycle_days   = 90
  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

It looks clean, but what happens when you need to add a specific replication rule? Or a weird CORS policy for a one-off integration? You’re off to another repository to wage war with the module’s maintainer (who is probably you, from six months ago).

Now, the boring, sane, ship-it-today approach:

# service-alpha/main.tf

resource "aws_s3_bucket" "data_bucket" {
  bucket = "service-alpha-data-logs-2025"

  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

resource "aws_s3_bucket_versioning" "data_bucket_versioning" {
  bucket = aws_s3_bucket.data_bucket.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_bucket_lifecycle" {
  bucket = aws_s3_bucket.data_bucket.id

  rule {
    id     = "log-expiration"
    status = "Enabled"
    expiration {
      days = 90
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_bucket_access" {
  bucket                  = aws_s3_bucket.data_bucket.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Is it more lines of code? Yes. Is it gloriously, beautifully, and unapologetically obvious? Absolutely. You can read it, understand it, and change it in thirty seconds. No context switching. No spelunking through another codebase. Just a bucket, doing bucket things.

Where a module is not a swear word

Okay, I’m not a total monster. Modules have their place. They are the right tool when you are building the foundations, not the furniture.

A module earns its keep when it defines a stable, slow-moving, and genuinely complex pattern that you truly want to be identical everywhere. Think of it like the plumbing and electrical wiring of a house. You don’t reinvent it for every room.

Good candidates for a module include:

VPC and core networking: The highway system of your cloud. Build it once, build it well, and then leave it alone.
Kubernetes cluster baselines: The core EKS/GKE/AKS setup, IAM roles, and node group configurations.
Security and telemetry agents: The non-negotiable stuff that absolutely must run on every single instance.
IAM roles for CI/CD: A standardized way for your deployment pipeline to get the permissions it needs.

The key difference? These things change on a scale of months or years, not days or weeks.

Your escape plan from module purgatory

What if you’re reading this and nodding along in despair, already trapped in a gilded cage of your own abstractions? Don’t panic. There’s a way out, and it doesn’t require a six-month migration project.

Freeze everything: First, go to every service that uses the problematic module and pin the version number. ref=v1.4.2. No more floating on main. You’ve just stopped the bleeding.
Take inventory: In one service, run a Terraform state list to see the exact resources managed by the module.
Perform the adoption: This is the magic trick. Write the plain HCL code for those resources directly in your service’s configuration. Then, tell Terraform that the old resource (inside the module) and your new resource (the plain HCL) are actually the same thing. You do this with a moved block or the Terraform state mv command.

Let’s say your module created a bucket. The state address is module.service_alpha_bucket.aws_s3_bucket.this[0]. Your new plain resource is aws_s3_bucket.data_bucket.

You would run:

terraform state mv 'module.service_alpha_bucket.aws_s3_bucket.this[0]' aws_s3_bucket.data_bucket

Verify and obliterate: Run terraform plan. It should come back with “No changes. Your infrastructure matches the configuration.” The plan is clean. You are now free. Delete the module block, pop the champagne, and submit your PR. Repeat for other services, one at a time. No heroics.

Fielding objections from the back row

When you propose this radical act of simplicity, someone will inevitably raise their hand.

“But we need standards!” You absolutely do. Standardize on things that matter: tags, naming conventions, and security policies. Enforce them with tools like tflint, checkov, and OPA/Gatekeeper. A linter yelling at you in a PR is infinitely better than a module silently deploying the wrong thing everywhere.
“What about junior developers? They need a paved road!” They do. A haunted mega-module with 50 input variables is not a paved road; it’s a labyrinth with a minotaur. A better “paved road” is a folder of well-documented, copy-pasteable examples of plain HCL for common tasks.
“Compliance will have questions!” Good. Let them. A tiny, focused, version-pinned module for your IAM boundary policy is a fantastic answer. A sprawling, do-everything wrapper module that changes every week is a compliance nightmare waiting to happen.

The gospel of ‘Good Enough’ for now

Stop trying to solve tomorrow’s problems today. That perfect, infinitely configurable abstraction you’re dreaming of is a solution in search of a problem you don’t have yet.

Don’t optimize for DRY. Optimize for change.

Small teams don’t need fewer lines of HCL; they need fewer places to look when something breaks at 3 PM on a Friday. They need clarity, not cleverness. Keep your power tools for the heavy-duty work. Save the cathedral for when you’ve actually founded a religion.

For now, ship the bucket, and go get lunch.

September 14, 2025 by Fernando SRE Cloud stuff DevOps stuff

Your Kubernetes rollback is lying

The PagerDuty alert screams. The new release, born just minutes ago with such promising release notes, is coughing up blood in production. The team’s Slack channel is a frantic mess of flashing red emojis. Someone, summoning the voice of a panicked adult, yells the magic word: “ROLLBACK!”

And so, Helm, our trusty tow-truck operator, rides in with a smile, waving its friendly green check marks. The dashboards, those silent accomplices, beam with the serene glow of healthy metrics. Kubernetes probes, ever so polite, confirm that the resurrected pods are, in fact, “breathing.”

Then, production face-plants. Hard.

The feeling is like putting a cartoon-themed bandage on a burst water pipe and then wondering, with genuine surprise, why the living room has become a swimming pool. This article is the autopsy of those “perfect” rollbacks. We’re going to uncover why your monitoring is a pathological liar, how network traffic becomes a double agent, and what to do so that the next time Helm gives you a thumbs-up, you can actually believe it.

A state that refuses to time-travel

The first, most brutal lie a rollback tells you is that it can turn back time. A helm rollback is like the “rewind” button on an old VCR remote; it diligently rewinds the tape (your YAML manifests), but it has absolutely no power to make the actors on screen younger.

Your application’s state is one of those stubborn actors.

While your ConfigMaps and Secrets might dutifully revert to their previous versions, your data lives firmly in the present. If your new release included a database migration that added a column, rolling back the application code doesn’t magically make that column disappear. Now your old code is staring at a database schema from the future, utterly confused, like a medieval blacksmith being handed an iPad.

The same goes for PersistentVolumeClaims, external caches like Redis, or messages sitting in a Kafka queue. The rollback command whispers sweet nothings about returning to a “known good state,” but it’s only talking about itself. The rest of your universe has moved on, and it refuses to travel back with you.

The overly polite doorman

The second culprit in our investigation is the Kubernetes probe. Think of the readinessProbe as an overly polite doorman at a fancy party. Its job is to check if a guest (your pod) is ready to enter. But its definition of “ready” can be dangerously optimistic.

Many applications, especially those running on the JVM, have what we’ll call a “warming up” period. When a pod starts, the process is running, the HTTP port is open, and it will happily respond to a simple /health check. The doorman sees a guest in a tuxedo and says, “Looks good to me!” and opens the door.

What the doorman doesn’t see is that this guest is still stretching, yawning, and trying to remember where they are. The application’s caches are cold, its connection pools are empty, and its JIT compiler is just beginning to think about maybe, possibly, optimizing some code. The first few dozen requests it receives will be painfully slow or, worse, time out completely.

So while your readinessProbe is giving you a green light, your first wave of users is getting a face full of errors. For these sleepy applications, you need a more rigorous bouncer.

A startupProbe is that bouncer. It gives the app a generous amount of time to get its act together before even letting the doorman (readiness and liveness probes) start their shift.

# This probe gives our sleepy JVM app up to 5 minutes to wake up.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  # Kubelet will try 30 times with a 10-second interval (300 seconds).
  # If the app isn't ready by then, the pod will be restarted.
  failureThreshold: 30
  periodSeconds: 10

Without it, your rollback creates a fleet of pods that are technically alive but functionally useless, and Kubernetes happily sends them a flood of unsuspecting users.

Traffic, the double agent

And that brings us to our final suspect: the network traffic itself. In a modern setup using a service mesh like Istio or Linkerd, traffic routing is a sophisticated dance. But even the most graceful dancer can trip.

When you roll back, a new ReplicaSet is created with the old pod specification. The service mesh sees these new pods starting up, asks the doorman (readinessProbe) if they’re good to go, gets an enthusiastic “yes!”, and immediately starts sending them a percentage of live production traffic.

This is where all our problems converge. Your service mesh, in its infinite efficiency, has just routed 50% of your user traffic to a platoon of sleepy, confused pods that are trying to talk to a database from the future.

Let’s look at the evidence. This VirtualService, which we now call “The 50/50 Disaster Splitter,” was routing traffic with criminal optimism.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 50 # 50% to the (theoretically) working pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 50 # 50% to the pods we just dragged from the past

The service mesh isn’t malicious. It’s just an incredibly efficient tool that is very good at following bad instructions. It sees a green light and hits the accelerator.

A survival guide that won’t betray you

So, you’re in the middle of a fire, and the “break glass in case of emergency” button is a lie. What do you do? You need a playbook that acknowledges reality.

Step 0: Breathe and isolate the blast radius

Before you even think about rolling back, stop the bleeding. The fastest way to do that is often at the traffic level. Use your service mesh or ingress controller to immediately shift 100% of traffic back to the last known good version. Don’t wait for new pods to start. This is a surgical move that takes seconds and gives you breathing room.

Step 1: Declare an incident and gather the detectives

Get the right people on a call. Announce that this is not a “quick rollback” but an incident investigation. Your goal is to understand why the release failed, not just to hit the undo button.

Step 2: Perform the autopsy (while the system is stable)

With traffic safely routed away from the wreckage, you can now investigate. Check the logs of the failed pods. Look at the database. Is there a schema mismatch? A bad configuration? This is where you find the real killer.

Step 3: Plan the counter-offensive (which might not be a rollback)

Sometimes, the safest path forward is a roll forward. A small hotfix that corrects the issue might be faster and less risky than trying to force the old code to work with a new state. A rollback should be a deliberate, planned action, not a panic reflex. If you must roll back, do it with the knowledge you’ve gained from your investigation.

Step 4: The deliberate, cautious rollback

If you’ve determined a rollback is the only way, do it methodically.

Scale down the broken deployment:
kubectl scale deployment/checkout-api –replicas=0
Execute the Helm rollback:
helm rollback checkout-api 1 -n prod-eu-central
Watch the new pods like a hawk: Monitor their logs and key metrics as they come up. Don’t trust the green check marks.
Perform a Canary Release: Once the new pods look genuinely healthy, use your service mesh to send them 1% of the traffic. Then 10%. Then 50%. Then 100%. You are now in control, not the blind optimism of the automation.

The truth will set you free

A Kubernetes rollback isn’t a time machine. It’s a YAML editor with a fancy title. It doesn’t understand your data, it doesn’t appreciate your app’s need for a morning coffee, and it certainly doesn’t grasp the nuances of traffic routing under pressure.

Treating a rollback as a simple, safe undo button is the fastest way to turn a small incident into a full-blown outage. By understanding the lies it tells, you can build a process that trusts human investigation over deceptive green lights. So the next time a deployment goes sideways, don’t just reach for the rollback lever. Reach for your detective’s hat instead.

September 11, 2025 by Fernando SRE DevOps stuff Kubernetes SRE stuff

Terraform scales better without a centralized remote state

It’s 4:53 PM on a Friday. You’re pushing a one-line change to an IAM policy. A change so trivial, so utterly benign, that you barely give it a second thought. You run terraform apply, lean back in your chair, and dream of the weekend. Then, your terminal returns a greeting from the abyss: Error acquiring state lock.

Somewhere across the office, or perhaps across the country, a teammate has just started a plan on their own, seemingly innocuous change. You are now locked in a digital standoff. The weekend is officially on hold. Your shared Terraform state file, once a symbol of collaboration and a single source of truth, has become a temperamental roommate who insists on using the kitchen right when you need to make dinner. And they’re a very, very slow cook.

Our Terraform honeymoon phase

It wasn’t always like this. Most of us start our Terraform journey in a state of blissful simplicity. Remember those early days? A single, elegant main.tf file, a tidy remote backend in an S3 bucket, and a DynamoDB table to handle the locking. It was the infrastructure equivalent of a brand-new, minimalist apartment. Everything had its place. Deployments were clean, predictable, and frankly, a little bit boring.

Our setup looked something like this, a testament to a simpler time:

# in main.tf
terraform {
  backend "s3" {
    bucket         = "our-glorious-infra-state-prod"
    key            = "global/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock-prod"
    encrypt        = true
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  # ... and so on
}

It worked beautifully. Until it didn’t. The problem with minimalist apartments is that they don’t stay that way. You add a person, then another. You buy more furniture. Soon, you’re tripping over things, and that one clean kitchen becomes a chaotic battlefield of conflicting needs.

The kitchen gets crowded

As our team and infrastructure grew, our once-pristine state file started to resemble a chaotic shared kitchen during rush hour. The initial design, meant for a single chef, was now buckling under the pressure of a full restaurant staff.

The state lock standoff

The first and most obvious symptom was the state lock. It’s less of a technical “race condition” and more of a passive-aggressive duel between two colleagues who both need the only good frying pan at the exact same time. The result? Burnt food, frayed nerves, and a CI/CD pipeline that spends most of its time waiting in line.

The mystery of the shared spice rack

With everyone working out of the same state file, we lost any sense of ownership. It became a communal spice rack where anyone could move, borrow, or spill things. You’d reach for the salt (a production security group) only to find someone had replaced it with sugar (a temporary rule for a dev environment). Every Terraform apply felt like a gamble. You weren’t just deploying your change; you were implicitly signing off on the current, often mysterious, state of the entire kitchen.

The pre-apply prayer

This led to a pervasive culture of fear. Before running an apply, engineers would perform a ritualistic dance of checks, double-checks, and frantic Slack messages: “Hey, is anyone else touching prod right now?” The Terraform plan output would scroll for pages, a cryptic epic poem of changes, 95% of which had nothing to do with you. You’d squint at the screen, whispering a little prayer to the DevOps gods that you wouldn’t accidentally tear down the customer database because of a subtle dependency you missed.

The domino effect of a single spilled drink

Worst of all was the tight coupling. Our infrastructure became a house of cards. A team modifying a network ACL for their new microservice could unintentionally sever connectivity for a legacy monolith nobody had touched in years. It was the architectural equivalent of trying to change a lightbulb and accidentally causing the entire building’s plumbing to back up.

An uncomfortable truth appears

For a while, we blamed Terraform. We complained about its limitations, its verbosity, and its sharp edges. But eventually, we had to face an uncomfortable truth: the tool wasn’t the problem. We were. Our devotion to the cult of the single centralized state—the idea that one file to rule them all was the pinnacle of infrastructure management—had turned our single source of truth into a single point of failure.

The great state breakup

The solution was as terrifying as it was liberating: we had to break up with our monolithic state. It was time to move out of the chaotic shared house and give every team their own well-equipped studio apartment.

Giving everyone their own kitchenette

First, we dismantled the monolith. We broke our single Terraform configuration into dozens of smaller, isolated stacks. Each stack managed a specific component or application, like a VPC, a Kubernetes cluster, or a single microservice’s infrastructure. Each had its own state file.

Our directory structure transformed from a single folder into a federation of independent projects:

infra/
├── networking/
│   ├── vpc.tf
│   └── backend.tf      # Manages its own state for the VPC
├── databases/
│   ├── rds-main.tf
│   └── backend.tf      # Manages its own state for the primary RDS
└── services/
    ├── billing-api/
    │   ├── ecs-service.tf
    │   └── backend.tf  # Manages state for just the billing API
    └── auth-service/
        ├── iam-roles.tf
        └── backend.tf  # Manages state for just the auth service

The state lock standoffs vanished overnight. Teams could work in parallel without tripping over each other. The blast radius of any change was now beautifully, reassuringly small.

Letting infrastructure live with its application

Next, we embraced GitOps patterns. Instead of a central infrastructure repository, we decided that infrastructure code should live with the application it supports. It just makes sense. The code for an API and the infrastructure it runs on are a tightly coupled couple; they should live in the same house. This meant code reviews for application features and infrastructure changes happened in the same pull request, by the same team.

Tasting the soup before serving it

Finally, we made surprises a thing of the past by validating plans before they ever reached the main branch. We set up simple CI workflows that would run a Terraform plan on every pull request. No more mystery meat deployments. The plan became a clear, concise contract of what was about to happen, reviewed and approved before merge.

A snippet from our GitHub Actions workflow looked like this:

name: 'Terraform Plan Validation'
on:
  pull_request:
    paths:
      - 'infra/**'
      - '.github/workflows/terraform-plan.yml'

jobs:
  plan:
    name: 'Terraform Plan'
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v3
      with:
        terraform_version: 1.5.0

    - name: Terraform Init
      run: terraform init -backend=false

    - name: Terraform Plan
      run: terraform plan -no-color

Stories from the other side

This wasn’t just a theoretical exercise. A fintech firm we know split its monolithic repo into 47 micro-stacks. Their deployment speed shot up by 70%, not because they wrote code faster, but because they spent less time waiting and untangling conflicts. Another startup moved from a central Terraform setup to the AWS CDK (TypeScript), embedding infra in their app repos. They cut their time-to-deploy in half, freeing their SRE team from being gatekeepers and allowing them to become enablers.

Guardrails not gates

Terraform is still a phenomenally powerful tool. But the way we use it has to evolve. A centralized remote state, when not designed for scale, becomes a source of fragility, not strength. Just because you can put all your eggs in one basket doesn’t mean you should, especially when everyone on the team needs to carry that basket around.

The most scalable thing you can do is let teams build independently. Give them ownership, clear boundaries, and the tools to validate their work. Build guardrails to keep them safe, not gates to slow them down. Your Friday evenings will thank you for it.

August 23, 2025 by Fernando SRE Cloud stuff DevOps stuff Git SRE stuff

Confessions of a recovering GitOps addict

There’s a moment in every tech trend’s lifecycle when the magic starts to wear off. It’s like realizing the artisanal, organic, free-range coffee you’ve been paying eight dollars for just tastes like… coffee. For me, and many others in the DevOps trenches, that moment has arrived for GitOps.

We once hailed it as the silver bullet, the grand unifier, the one true way. Now, I’m here to tell you that the romance is over. And something much more practical is taking its place.

The alluring promise of a perfect world

Let’s be honest, we all fell hard for GitOps. The promise was intoxicating. A single source of truth for our entire infrastructure, nestled right in the warm, familiar embrace of Git. Pull Requests became the sacred gates through which all changes must pass. CI/CD pipelines were our holy scrolls, and tools like ArgoCD and Flux were the messiahs delivering us from the chaos of manual deployments.

It was a world of perfect order. Every change was audited, every state was declared, and every rollback was just a git revert away. It felt clean. It felt right. It felt… professional. For a while, it was the hero we desperately needed.

The tyranny of the pull request

But paradise had a dark side, and it was paved with endless YAML files. The first sign of trouble wasn’t a catastrophic failure, but a slow, creeping bureaucracy that we had built for ourselves.

Need to update a single, tiny secret? Prepare for the ritual. First, the offering: a Pull Request. Then, the prayer for the high priests (your colleagues) to grant their blessing (the approval). Then, the sacrifice (the merge). And finally, the tense vigil, watching ArgoCD’s sync status like it’s a heart monitor, praying it doesn’t flatline.

The lag became a running joke. Your change is merged… but has it landed in production? Who knows! The sync bot seems to be having a bad day. When everything is on fire at 2 AM, Git is like that friend who proudly tells you, “Well, according to my notes, the plan was for there not to be a fire.” Thanks, Git. Your record of intent is fascinating, but I need a fire hose, not a historian.

We hit our wall during what should have been a routine update.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: auth-service
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: auth-service-container
        image: our-app:v1.12.4
        envFrom:
        - secretRef:
            name: production-credentials

A simple change to the production-credentials secret required updating an encrypted file, PR-ing it, and then explaining in the commit message something like, “bumping secret hash for reasons”. Nobody understood it. Infrastructure changes started to require therapy sessions just to get merged.

And then, the tools fought back

When a system creates more friction than it removes, a rebellion is inevitable. And the rebels have arrived, not with pitchforks, but with smarter, more flexible tools.

First, the idea that developers should be fluent in YAML began to die. Internal Developer Platforms (IDPs) like Backstage and Port started giving developers what they always wanted: self-service with guardrails. Instead of wrestling with YAML syntax, they click a button in a portal to provision a database or spin up a new environment. Git becomes a log of what happened, not a bottleneck to make things happen.

Second, we remembered that pushing things can be good. The pull-based model was trendy, but let’s face it: push is immediate. Push is observable. We’ve gone back to CI pipelines pushing manifests directly into clusters, but this time they’re wearing body armor.

# This isn't your old wild-west kubectl apply
# It's a command wrapped in an approval system, with observability baked in.
deploy-cli --service auth-service --env production --approve

The change is triggered precisely when we want it, not when a bot feels like syncing. Finally, we started asking a radical question: why are we describing infrastructure in a static markup language when we could be programming it? Tools like Pulumi and Crossplane entered the scene. Instead of hundreds of lines of YAML, we’re writing code that feels alive.

import * as aws from "@pulumi/aws";

// Create an S3 bucket with versioning enabled.
const bucket = new aws.s3.Bucket("user-uploads-bucket", {
    versioning: {
        enabled: true,
    },
    acl: "private",
});

Infrastructure can now react to events, be composed into reusable modules, and be written in a language with types and logic. YAML simply can’t compete with that.

A new role for the abdicated king

So, is GitOps dead? No, that’s just clickbait. But it has been demoted. It’s no longer the king ruling every action; it’s more like a constitutional monarch, a respected elder statesman.

It’s fantastic for auditing, for keeping a high-level record of intended state, and for infrastructure teams that thrive on rigid discipline. But for high-velocity product teams, it’s become a beautifully crafted anchor when what we need is a motor.

We’ve moved from “Let’s define everything in Git” to “Let’s ship faster, safer, and saner with the right tools for the job.”

Our current stack is a hybrid, a practical mix of the old and new:

Backstage to abstract away complexity for developers.
Push-based pipelines with strong guardrails for immediate, observable deployments.
Pulumi for typed, programmable, and composable infrastructure.
Minimal GitOps for what it does best: providing a clear, auditable trail of our intentions.

GitOps wasn’t a mistake; it was the strict but well-meaning grandparent of infrastructure management. It taught us discipline and the importance of getting approval before touching anything important. But now that we’re grown up, that level of supervision feels less like helpful guidance and more like having someone watch over your shoulder while you type, constantly asking, “Are you sure you want to save that file?” The world is moving on to flexibility, developer-first platforms, and code you can read without a decoder ring. If you’re still spending your nights appeasing the YAML gods with Pull Request sacrifices for trivial changes… you’re not just living in the past, you’re practically a fossil.

August 17, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

When docker compose stopped being magic

There was a time, not so long ago, when docker-compose up felt like performing a magic trick. You’d scribble a few arcane incantations into a YAML file and, poof, your entire development stack would spring to life. The database, the cache, your API, the frontend… all humming along obediently on localhost. Docker Compose wasn’t just a tool; it was the trusty Swiss Army knife in every developer’s pocket, the reliable friend who always had your back.

Until it didn’t.

Our breakup wasn’t a single, dramatic event. It was a slow fade, the kind of awkward drifting apart that happens when one friend grows and the other… well, the other is perfectly happy staying exactly where they are. It began with small annoyances, then grew into full-blown arguments. We eventually realized we were spending more time trying to fix our relationship with YAML than actually building things.

So, with a heavy heart and a sigh of relief, we finally said goodbye.

The cracks begin to show

As our team and infrastructure matured, our reliable friend started showing some deeply annoying habits. The magic tricks became frustratingly predictable failures.

Our services started giving each other the silent treatment. The networking between containers became as fragile and unpredictable as a Wi-Fi connection on a cross-country train. One moment they were chatting happily, the next they wouldn’t be caught dead in the same virtual network.
It was worse at keeping secrets than a gossip columnist. The lack of native, secure secret handling was, to put it mildly, a joke. We were practically writing passwords on sticky notes and hoping for the best.
It developed a severe case of multiple personality disorder. The same docker-compose.yml file would behave like a well-mannered gentleman on one developer’s machine, a rebellious teenager in staging, and a complete, raving lunatic in production. Consistency was not its strong suit.
The phrase “It works on my machine” became a ritualistic chant. We’d repeat it, hoping to appease the demo gods, but they are a fickle bunch and rarely listened. We needed reliability, not superstition.

We had to face the truth. Our old friend just couldn’t keep up.

Moving on to greener pastures

The final straw was the realization that we had become full-time YAML therapists. It was time to stop fixing and start building again. We didn’t just dump Compose; we replaced it, piece by piece, with tools that were actually designed for the world we live in now.

For real infrastructure, we chose real code

For our production and staging environments, we needed a serious, long-term commitment. We found it in the AWS Cloud Development Kit (CDK). Instead of vaguely describing our needs in YAML and hoping for the best, we started declaring our infrastructure with the full power and grace of TypeScript.

We went from a hopeful plea like this:

# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8080:8080"
    depends_on:
      - database
  database:
    image: "postgres:14-alpine"

To a confident, explicit declaration like this:

// lib/api-stack.ts
import * as cdk from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecs_patterns from 'aws-cdk-lib/aws-ecs-patterns';

// ... inside your Stack class
const vpc = /* your existing VPC */;
const cluster = new ecs.Cluster(this, 'ApiCluster', { vpc });

// Create a load-balanced Fargate service and make it public
new ecs_patterns.ApplicationLoadBalancedFargateService(this, 'ApiService', {
  cluster: cluster,
  cpu: 256,
  memoryLimitMiB: 512,
  desiredCount: 2, // Let's have some redundancy
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry("your-org/your-awesome-api"),
    containerPort: 8080,
  },
  publicLoadBalancer: true,
});

It’s reusable, it’s testable, and it’s cloud-native by default. No more crossed fingers.

For local development, we found a better roommate

Onboarding new developers had become a nightmare of outdated README files and environment-specific quirks. For local development, we needed something that just worked, every time, on every machine. We found our perfect new roommate in Dev Containers.

Now, we ship a pre-configured development environment right inside the repository. A developer opens the project in VS Code, it spins up the container, and they’re ready to go.

Here’s the simple recipe in .devcontainer/devcontainer.json:

{
  "name": "Node.js & PostgreSQL",
  "dockerComposeFile": "docker-compose.yml", // Yes, we still use it here, but just for this!
  "service": "app",
  "workspaceFolder": "/workspace",

  // Forward the ports you need
  "forwardPorts": [3000, 5432],

  // Run commands after the container is created
  "postCreateCommand": "npm install",

  // Add VS Code extensions
  "extensions": [
    "dbaeumer.vscode-eslint",
    "esbenp.prettier-vscode"
  ]
}

It’s fast, it’s reproducible, and our onboarding docs have been reduced to: “1. Install Docker. 2. Open in VS Code.”

To speak every Cloud language, we hired a translator

As our ambitions grew, we needed to manage resources across different cloud providers without learning a new dialect for each one. Crossplane became our universal translator. It lets us manage our infrastructure, whether it’s on AWS, GCP, or Azure, using the language we already speak fluently: the Kubernetes API.

Want a managed database in AWS? You don’t write Terraform. You write a Kubernetes manifest.

# rds-instance.yaml
apiVersion: database.aws.upbound.io/v1beta1
kind: RDSInstance
metadata:
  name: my-production-db
spec:
  forProvider:
    region: eu-west-1
    instanceClass: db.t3.small
    masterUsername: admin
    allocatedStorage: 20
    engine: postgres
    engineVersion: "14.5"
    skipFinalSnapshot: true
    # Reference to a secret for the password
    masterPasswordSecretRef:
      namespace: crossplane-system
      name: my-db-password
      key: password
  providerConfigRef:
    name: aws-provider-config

It’s declarative, auditable, and fits perfectly into a GitOps workflow.

For the creative grind, we got a better workflow

The constant cycle of code, build, push, deploy, test, repeat for our microservices was soul-crushing. Docker Compose never did this well. We needed something that could keep up with our creative flow. Skaffold gave us the instant gratification we craved.

One command, skaffold dev, and suddenly we had:

Live code syncing to our development cluster.
Automatic container rebuilds and redeployments when files change.
A unified configuration for both development and production pipelines.

No more editing three different files and praying. Just code.

The slow fade was inevitable

Docker Compose was a fantastic tool for a simpler time. It was perfect when our team was small, our application was a monolith, and “production” was just a slightly more powerful laptop.

But the world of software development has moved on. We now live in an era of distributed systems, cloud-native architecture, and relentless automation. We didn’t just stop using Docker Compose. We outgrew it. And we replaced it with tools that weren’t just built for the present, but are ready for the future.

August 6, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

What if Kubernetes was the wrong tool for almost everyone?

It was 11 PM on a Tuesday, and for the third time that week, we were on an emergency call. Not because our product had a critical bug, but because a routine deployment had, once again, mysteriously broken the internal DNS. Three of our sharpest engineers weren’t creating value; they were offering sacrifices to the capricious god of Istio, hoping it might bless our pods with connectivity.

As I stared at the sprawling diagram we’d made just to add a single, simple microservice, a heretical thought wormed its way into my brain: When did our job stop being about building software and become about… well, serving Kubernetes?

A rumor we couldn’t ignore

The next morning, amidst our collective YAML-induced hangover, someone dropped a screenshot into our team’s Slack channel. It was from some tiny, no-name startup, claiming they were getting 8x the performance of a standard Kubernetes stack at one-tenth of the cost.

The team’s reaction was a collective, cynical laugh. “Sure,” our lead SRE typed, “and my home server can out-render Pixar.” It was obviously marketing fluff. Propaganda. The post itself was deleted within an hour, but the screenshot had already gone viral. It was absurd, unbelievable, and we all knew it was nonsense.

But the idea, like a catchy, terrible pop song, got stuck in our heads.

Let’s just prove it’s impossible

That Friday, we decided to do it. We’d run a small, contained experiment, mostly for the bragging rights of publicly debunking the ridiculous claim. The plan was simple: take one of our standard, moderately complex services and see what it would take to run it on a simpler stack.

Our service, an image processor, wasn’t a behemoth, but its YAML file had grown… organically. Like a colony of particularly stubborn mold. It had sidecars, persistent volume claims, readiness probes, and enough annotations to qualify as a short novel.

Here’s a sanitized glimpse of the beast we were trying to tame:

# apiVersion: apps/v1
# kind: Deployment
# metadata:
#   name: image-processor-svc
#   labels:
#     app: image-processor
# spec:
#   replicas: 3
#   selector:
#     matchLabels:
#       app: image-processor
#   template:
#     metadata:
#       labels:
#         app: image-processor
#     spec:
#       containers:
#       - name: processor
#         image: our-repo/image-processor:v1.2.4
#         ports:
#         - containerPort: 8080
#         resources:
#           requests:
#             memory: "512Mi"
#             cpu: "250m"
#           limits:
#             memory: "1024Mi"
#             cpu: "500m"
#         readinessProbe:
#           httpGet:
#             path: /healthz
#             port: 8080
#           initialDelaySeconds: 5
#           periodSeconds: 10
#       - name: metrics-sidecar
#         image: prom/statsd-exporter:v0.22.0
#         args:
#         - "--statsd.mapping-config=/etc/statsd/mapper.yml"
#         ports:
#         - name: metrics
#           containerPort: 9102
#         volumeMounts:
#         - name: config-volume
#           mountPath: /etc/statsd
#   # ...and so on, for another 150 lines.

We figured it would take us all afternoon just to untangle it.

The uncomfortable silence of success

We set up a competing stack using Firecracker MicroVMs, essentially tiny, lightning-fast virtual machines. The goal was to run the same container, but without the entire Kubernetes universe orbiting around it.

By 3 PM, we had our first results. And that’s when the room went quiet. It was the kind of uncomfortable silence you get when you realize the joke you’ve been telling for months is actually on you.

The numbers weren’t just holding up; they were embarrassing us. We stared at the Grafana dashboard, waiting for the figures to make sense. They didn’t. Our projected monthly cloud bill for this single service didn’t just shrink; it plummeted.

We had spent years building a complex, expensive, and fragile machine, all to solve a problem that, it turned out, could be handled with a much, much simpler approach.

Our quest for sane infrastructure

That weekend experiment turned into a full-blown obsession. Inspired, we started building our own internal escape hatch. We cobbled together a tool that we jokingly called the “SQL-ifier.” The premise was simple and, to a Kubernetes purist, utterly profane: what if you could manage infrastructure with simple, readable rules instead of YAML incantations?

Instead of a 200-line YAML file for an autoscaler, what if you could just write this?

-- This is our internal, SQL-like syntax for managing MicroVMs.
-- It's not a public standard (yet!).

-- Rule: If CPU usage on the frontend service is over 80% for 3 minutes,
-- add two more instances, but never exceed 20 total.
ON high_cpu(>80%) FOR 3m
IF service.name = 'image-processor-svc'
DO SCALE service.name TO instances + 2
LIMIT 20;

-- Rule: If we see more than 10 critical payment errors in 1 minute,
-- immediately revert to the last stable version.
ON log_error(level='critical', service='payment-gateway') > 10 FOR 1m
DO ROLLBACK service 'payment-gateway' TO previous_stable;

It was declarative, readable, and, most importantly, it could be understood by a human being without needing a certification.

How did we all end up here?

This journey forced us to ask a bigger question. Why did we, and thousands of other smart teams, willingly chain ourselves to this complexity?

The answer is surprisingly human. We bought an 18-wheeler truck to do our weekly grocery shopping.

Sure, a giant truck can carry milk and eggs. But you spend most of your time finding a place to park it, paying for diesel, getting a special license to drive it, and explaining to your neighbors why you just flattened their mailbox. Kubernetes was built for Google-scale problems. Most of us run businesses that need the reliability of a Toyota Camry, not a fleet of space shuttles. We adopted the tool because everyone else did, mistaking its complexity for sophistication.

Is your infrastructure serving you?

This whole experience led us to create a simple “mirror test.” If you’re wondering if you’re in the same boat, ask your team these questions:

Do you dread upgrading your cluster more than you dread a root canal?
Is a significant portion of your engineering time spent on “infra-babysitting” instead of building product features?
Could you confidently explain your service mesh configuration to a new hire without a whiteboard and a two-hour meeting?

If you answered “yes” to two or more, you might not have an infrastructure problem. You might have a Kubernetes problem.

This isn’t a manifesto to uninstall kubectl tomorrow. Some organizations genuinely operate at a scale where Kubernetes is not just useful, but necessary. This is just a friendly nudge. A reminder to look up from your YAML files once in a while and ask: is this tool still serving me, or have I started serving it?

July 30, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

How Headless services and StatefulSets work together in Kubernetes

Kubernetes is an open-source platform designed to seamlessly manage containerized applications. Imagine the manager at your favorite café coordinating baristas, chefs, and servers effortlessly, ensuring a smooth customer experience every single time. Kubernetes automates deployments, scaling, and operations, making it indispensable for today’s complex digital landscape.

Understanding headless services

At first glance, Headless Services might seem unusual, yet they’re essential Kubernetes components. Regular Kubernetes Services act as receptionists routing your calls; Headless Services, however, skip the receptionist altogether and connect you directly to individual pods via their unique IP addresses.

Consider them as a neighborhood directory listing direct phone numbers, eliminating the central switchboard. This direct approach is particularly beneficial when individual pod identity and communication are critical, such as with database clusters.

Example YAML for a headless service:

apiVersion: v1
kind: Service
metadata:
  name: my-headless-service
spec:
  clusterIP: None
  selector:
    app: my-app
  ports:
  - port: 80

Demystifying StatefulSets

StatefulSets uniquely manage stateful applications by assigning each pod a stable identity and persistent storage. Imagine a classroom where each student (pod) has an assigned desk (storage) that remains consistent, no matter how often they come and go.

Comparing StatefulSets and deployments

Deployments are ideal for stateless applications, where each instance is interchangeable and can be replaced without affecting the overall system. StatefulSets, however, excel with stateful applications, ensuring pods have stable identities and persistent storage, perfect for databases and message queues.

Example YAML for a StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-stateful-app
spec:
  serviceName: my-headless-service
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app-container
        image: my-app-image
        ports:
        - containerPort: 80
        volumeMounts:
        - name: my-volume
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: my-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

The strength of pairing headless services with StatefulSets

Headless Services and StatefulSets each have significant strengths independently, but they truly shine when combined. Headless Services provide stable network identities for StatefulSet pods, akin to each member of a specialized team having their direct communication line for efficient collaboration.

Picture an emergency medical team; direct lines enable doctors and nurses to coordinate rapidly and precisely during critical situations. Similarly, distributed databases such as Cassandra or MongoDB rely heavily on this direct communication model to maintain data consistency and reliability.

Practical use-case

Consider a Cassandra database running on Kubernetes. StatefulSets ensure each Cassandra node has dedicated data storage and a unique identity. With Headless Services, these nodes communicate directly, consistently synchronizing data and ensuring seamless accessibility, irrespective of which node handles the incoming requests.

Concluding insights

Headless Services combined with StatefulSets form a powerful solution for managing stateful applications within Kubernetes. They address distinct challenges in state management and network stability, ensuring reliability and scalability for your applications.

Leveraging these Kubernetes capabilities equips your infrastructure for success, akin to empowering each team member with the necessary tools for clear communication and consistent performance. Embrace this dynamic duo for a more robust and efficient Kubernetes environment.

July 9, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

How DevOps teams secure secrets and configurations

Setting up a new home isn’t merely about getting a set of keys. It’s about knowing the essentials: the location of the main water valve, the Wi-Fi password that connects you to the world, and the quirks of the thermostat that keeps you comfortable. You wouldn’t dream of scribbling your bank PIN on your debit card or leaving your front door keys conspicuously under the welcome mat. Yet, in the digital realm, many software development teams inadvertently adopt such precarious habits with their application’s critical information.

This oversight, the mismanagement of configurations and secrets, can unleash a torrent of problems: applications crashing due to incorrect settings, development cycles snarled by inconsistencies and gaping security vulnerabilities that invite disaster. But there’s a more enlightened path. Digital environments often feel like minefields; this piece explores practical strategies in DevOps for intelligent configuration and secret management, aiming to establish them as bastions of stability and security. This isn’t just about best practices; it’s about building a foundation for resilient, scalable, and secure software that lets you sleep better at night.

Configuration management, the blueprint for stability

What exactly is this “configuration” we speak of? Think of it as the unique set of instructions and adjustable parameters that dictate an application’s behavior. These are the database connection strings, the feature flags illuminating new functionalities, the API endpoints it communicates with, and the resource limits that keep it running smoothly.

Consider a chef crafting a signature dish. The core recipe remains constant, but slight adjustments to spices or ingredients can tailor it for different palates or dietary needs. Similarly, your application might run in various environments, development, testing, staging, and production. Each requires its nuanced settings. The art of configuration management is about managing these variations without rewriting the entire cookbook for every meal. It’s about having a master recipe (your codebase) and a well-organized spice rack (your externalized configurations).

The perils of digital disarray

Initially, embedding configuration settings directly into your application’s code might seem like a quick shortcut. However, this path is riddled with pitfalls that quickly escalate from minor annoyances to major operational headaches. Imagine the nightmare of deploying to production only to watch it crash and burn because a database URL was hardcoded for the staging environment. These aren’t just inconveniences; they’re potential disasters leading to:

Deployment debacles: Promoting code across environments becomes a high-stakes gamble.
Operational rigidity: Adapting to new requirements or scaling services turns into a monumental task.
Security nightmares: Sensitive information, even if not a “secret,” can be inadvertently exposed.
Consistency chaos: Different environments behave unpredictably due to divergent, hard-to-track settings.

Centralization, the tower of control

So, what’s the cornerstone of sanity in this domain? It’s an unwavering principle: separate configuration from code. But why is this separation so sacrosanct? Because it bestows upon us the power of flexibility, the gift of consistency, and a formidable shield against needless errors. By externalizing configurations, we gain:

Environmental harmony: Tailor settings for each environment without touching a single line of code.
Simplified updates: Modify configurations swiftly and safely.
Enhanced security: Reduce the attack surface by keeping settings out of the codebase.
Clear traceability: Understand what settings are active where, and when they were changed.

Meet the digital organizers, essential tools

Several powerful tools have emerged to help us master this discipline. Each offers a unique set of “superpowers”:

HashiCorp Consul: Think of it as your application ecosystem’s central nervous system, providing service discovery and a distributed key-value store. It knows where everything is and how it should behave.
AWS Systems Manager Parameter Store: A secure, hierarchical vault provided by AWS for your configuration data and secrets, like a meticulously organized digital filing cabinet.
etcd: A highly reliable, distributed key-value store that often serves as the memory bank for complex systems like Kubernetes.
Spring Cloud Config: Specifically for the Java and Spring ecosystems, it offers robust server and client-side support for externalized configuration in distributed systems, illustrating the core principles effectively.

Secrets management, guarding your digital crown jewels

Now, let’s talk about secrets. These are not just any configurations; they are the digital crown jewels of your applications. We’re referring to passwords that unlock databases, API keys that grant access to third-party services, cryptographic keys that encrypt and decrypt sensitive data, certificates that verify identity, and tokens that authorize actions.

Let’s be unequivocally clear: embedding these secrets directly into your code, even within the seemingly safe confines of a private version control repository, is akin to writing your bank account password on a postcard and mailing it. Sooner or later, unintended eyes will see it. The moment code containing a secret is cloned, branched, or backed up, that secret multiplies its chances of exposure.

The fortress approach, dedicated secret sanctuaries

Given their critical nature, secrets demand specialized handling. Generic configuration stores might not suffice. We need dedicated secret management tools, and digital fortresses designed with security as their paramount concern. These tools typically offer:

Ironclad encryption: Secrets are encrypted both at rest (when stored) and in transit (when accessed).
Granular access control: Precisely define who or what can access specific secrets.
Comprehensive audit trails: Log every access attempt, successful or not, providing invaluable forensic data.
Automated rotation: The ability to automatically change secrets regularly, minimizing the window of opportunity if a secret is compromised.

Champions of secret protection leading tools

HashiCorp Vault: Envision this as the Fort Knox for your digital secrets, built with layers of security and fine-grained access controls that would make a dragon proud of its hoard. It’s a comprehensive solution for managing secrets across diverse environments.
AWS Secrets Manager: Amazon’s dedicated secure vault, seamlessly integrated with other AWS services. It excels at managing, retrieving, and automatically rotating secrets like database credentials.
Azure Key Vault: Microsoft’s offering to safeguard cryptographic keys and other secrets used by cloud applications and services within the Azure ecosystem.
Google Cloud Secret Manager: Provides a secure and convenient way to store and manage API keys, passwords, certificates, and other sensitive data within the Google Cloud Platform.

Secure delivery, handing over the keys safely

Our configurations are neatly organized, and our secrets are locked down. But how do our applications, running in their various environments, get access to them when needed, without compromising all our hard work? This is the challenge of secure delivery. The goal is “just-in-time” access: the application receives the sensitive information precisely when it needs it, and not a moment sooner or later, and only the authorized application entity gets it.

Think of it as a highly secure courier service. The package (your secret or configuration) is only handed over to the verified recipient (your application) at the exact moment of need, and the courier (the injection mechanism) ensures no one else can peek inside or snatch it.

Common methods for this secure handover include:

Environment variables: A widespread method where configurations and secrets are passed as variables to the application’s runtime environment. Simple, but be cautious: like a quick note passed to the application upon startup, ensure it’s not inadvertently logged or exposed in process listings.
Volume mounts: Secrets or configuration files are securely mounted as a volume into a containerized application. The application reads them as if they were local files, but they are managed externally.
Sidecar or Init containers (in Kubernetes/Container orchestration): Specialized helper containers run alongside your main application container. The init container might fetch secrets before the main app starts, or a sidecar might refresh them periodically, making them available through a shared local volume or network interface.
Direct API calls: The application itself, equipped with proper credentials (like an IAM role on AWS), directly queries the configuration or secret management tool at runtime. This is a dynamic approach, ensuring the latest values are always fetched.

Wisdom in action with some practical examples

Theory is vital, but seeing these principles in action solidifies understanding. Let’s step into the shoes of a DevOps engineer for a moment. Our mission, should we choose to accept it, involves enabling our applications to securely access the information they need.

Example 1 Fetching secrets from AWS Secrets Manager with Python

Our Python application needs a database password, which is securely stored in AWS Secrets Manager. How do we achieve this feat without shouting the password across the digital rooftops?

# This Python snippet demonstrates fetching a secret from AWS Secrets Manager.
# Ensure your AWS SDK (Boto3) is configured with appropriate permissions.
import boto3
import json

# Define the secret name and AWS region
SECRET_NAME = "your_app/database_credentials" # Example secret name
REGION_NAME = "your-aws-region" # e.g., "us-east-1"

# Create a Secrets Manager client
client = boto3.client(service_name='secretsmanager', region_name=REGION_NAME)

try:
    # Retrieve the secret value
    get_secret_value_response = client.get_secret_value(SecretId=SECRET_NAME)
    
    # Secrets can be stored as a string or binary.
    # For a string, it's often JSON, so parse it.
    if 'SecretString' in get_secret_value_response:
        secret_string = get_secret_value_response['SecretString']
        secret_data = json.loads(secret_string) # Assuming the secret is stored as a JSON string
        db_password = secret_data.get('password') # Example key within the JSON
        print("Successfully retrieved and parsed the database password.")
        # Now you can use db_password to connect to your database
    else:
        # Handle binary secrets if necessary (less common for passwords)
        # decoded_binary_secret = base64.b64decode(get_secret_value_response['SecretBinary'])
        print("Secret is binary, not string. Further processing needed.")

except Exception as e:
    # Robust error handling is crucial.
    print(f"Error retrieving secret: {e}")
    # In a real application, you'd log this and potentially have retry logic or fail gracefully.

Notice how our digital courier (the code) not only delivers the package but also reports back if there is a snag. Robust error handling isn’t just good practice; it’s essential for troubleshooting in a complex world.

Example 2 GitHub Actions tapping into HashiCorp Vault

A GitHub Actions workflow needs an API key from HashiCorp Vault to deploy an application.

# This illustrative GitHub Actions workflow snippet shows how to fetch a secret from HashiCorp Vault.
# jobs:
#   deploy:
#     runs-on: ubuntu-latest
#     permissions: # Necessary for OIDC authentication with Vault
#       id-token: write
#       contents: read
#     steps:
#       - name: Checkout code
#         uses: actions/checkout@v3

#       - name: Import Secrets from HashiCorp Vault
#         uses: hashicorp/vault-action@v2.7.3 # Use a specific version
#         with:
#           url: ${{ secrets.VAULT_ADDR }} # URL of your Vault instance, stored as a GitHub secret
#           method: 'jwt' # Using JWT/OIDC authentication, common for CI/CD
#           role: 'your-github-actions-role' # The role configured in Vault for GitHub Actions
#           # For JWT auth, the token is automatically handled by the action using OIDC
#           secrets: |
#             secret/data/your_app/api_credentials api_key | MY_APP_API_KEY; # Path to secret, key in secret, desired Env Var name
#             secret/data/another_service service_url | SERVICE_ENDPOINT;

#       - name: Use the Secret in a deployment script
#         run: |
#           echo "The API key has been injected into the environment."
#           # Example: ./deploy.sh --api-key "${MY_APP_API_KEY}" --service-url "${SERVICE_ENDPOINT}"
#           # Or simply use the environment variable MY_APP_API_KEY directly in your script if it expects it
#           if [ -z "${MY_APP_API_KEY}" ]; then
#             echo "Error: API Key was not loaded!"
#             exit 1
#           fi
#           echo "API Key is available (first 5 chars): ${MY_APP_API_KEY:0:5}..."
#           echo "Service endpoint: ${SERVICE_ENDPOINT}"
#           # Proceed with deployment steps that use these secrets

Here, GitHub Actions securely authenticates to Vault (perhaps using OIDC for a tokenless approach) and injects the API key as an environment variable for subsequent steps.

Example 3 Reading database URL From AWS Parameter Store with Python

An application needs its database connection URL, which is stored, perhaps as a SecureString, in the AWS Systems Manager Parameter Store.

# This Python snippet demonstrates fetching a parameter from AWS Systems Manager Parameter Store.
import boto3

# Define the parameter name and AWS region
PARAMETER_NAME = "/config/your_app/database_url" # Example parameter name
REGION_NAME = "your-aws-region" # e.g., "eu-west-1"

# Create an SSM client
client = boto3.client(service_name='ssm', region_name=REGION_NAME)

try:
    # Retrieve the parameter value
    # WithDecryption=True is necessary if the parameter is a SecureString
    response = client.get_parameter(Name=PARAMETER_NAME, WithDecryption=True)
    
    db_url = response['Parameter']['Value']
    print(f"Successfully retrieved database URL: {db_url}")
    # Now you can use db_url to configure your database connection

except Exception as e:
    print(f"Error retrieving parameter: {e}")
    # Implement proper logging and error handling for your application

These snippets are windows into a world of secure and automated access, drastically reducing risk.

The gold standard, essential best practices

Adopting tools is only part of the equation. True mastery comes from embracing sound principles:

The golden rule of least privilege: Grant only the bare minimum permissions required for a task, and no more. Think of it as giving out keys that only open specific doors, not the master key to the entire digital kingdom. If an application only needs to read a specific secret, don’t give it write access or access to other secrets.
Embrace regular secret rotation: Why this constant churning? Because even the strongest locks can be picked given enough time, or keys can be inadvertently misplaced. Regular rotation is like changing the locks periodically, ensuring that even if an old key falls into the wrong hands, it no longer opens any doors. Many secret management tools can automate this.
Audit and monitor relentlessly: Keep meticulous records of who (or what) accessed which secrets or configurations, and when. These audit trails are invaluable for security analysis and troubleshooting.
Maintain strict environment separation: Configurations and secrets for development, staging, and production environments must be entirely separate and distinct. Never let a development secret grant access to production resources.
Automate with Infrastructure As Code (IaC): Define and manage your configuration stores and secret management infrastructure using code (e.g., Terraform, CloudFormation). This ensures consistency, repeatability, and version control for your security posture.
Secure your local development loop: Developers need access to some secrets too. Don’t let this be the weak link. Use local instances of tools like Vault, or employ .env files (which are never committed to version control) managed by tools like direnv to load them into the shell.

Just as your diligent house cleaner is given keys only to the areas they need to access and not the combination to your personal safe, applications and users should operate with the minimum necessary permissions.

Forging your secure DevOps future

The journey towards robust configuration and secret management might seem daunting, but its rewards are immense. It’s the bedrock upon which secure, reliable, and efficient DevOps practices are built. This isn’t just about ticking security boxes; it’s about fostering a culture of proactive defense, operational excellence, and ultimately, developer peace of mind. Think of it like consistent maintenance for a complex machine; a little diligence upfront prevents catastrophic failures down the line.

So, this digital universe, much like that forgotten corner of your fridge, just keeps spawning new and exciting forms of… stuff. By actually mastering these fundamental principles of configuration and secret hygiene, you’re not just building less-likely-to-explode applications; you’re doing future-you a massive favor. Think of it as pre-emptive aspirin for tomorrow’s inevitable headache. Go on, take a peek at your current setup. It might feel like volunteering for digital dental work, but that sweet, sweet relief when things don’t go catastrophically wrong? Priceless. Your users will probably just keep clicking away, blissfully unaware of the chaos you’ve heroically averted. And honestly, isn’t that the quiet victory we all crave?

May 7, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Why enterprise DevOps initiatives fail and how to fix them

Getting DevOps right in large companies is tricky. It’s been around for nearly two decades, from developers wanting deployment control. It gained traction around 2011-2015, boosted by Gartner, SAFe, and AWS’s rise, pushing CIOs to learn from agile startups.

Despite this history, many DevOps initiatives stumble. Why? Often, the approach misses fundamental truths about making DevOps work in complex enterprises with multi-cloud setups, legacy systems, and pressure for faster results. Let’s explore common pitfalls and how to get back on track.

Thinking DevOps is just another IT project

This is crucial. DevOps isn’t just new tools or org charts; it’s a cultural shift. It’s about Dev, Ops, Sec, and the business working together smoothly, focused on customer value, agility, and stability.

Treating it like a typical project is like fixing a building’s crumbling foundation by painting the walls, you ignore the deep, structural changes needed. CIOs might focus narrowly on IT implementation, missing the vital cultural shift. Overlooking connections to customer value, security, scaling, and governance is easy but detrimental. Siloing DevOps leads to slower cycles and business disconnects.

How to Fix It: Ensure shared understanding of DevOps/Agile principles. Run workshops for Dev and Ops to map the value stream and find bottlenecks. Forge a shared vision balancing innovation speed and operational stability, the core DevOps tension.

Rushing continuous delivery without solid operations

The allure of CI/CD is strong, but pushing continuous deployment everywhere without robust operations is like building a race car without good brakes or steering, you might crash.

Not every app needs constant updates, nor do users always want them. Does the business grasp the cost of rigorous automated testing required for safe, frequent deployments? Do teams have the operational muscle: solid security, deep observability, mature AIOps, reliable rollbacks? Too often, we see teams compromise quality for speed.

The massive CrowdStrike outage is a stark reminder: pushing changes fast without sufficient safeguards is risky. To keep evolving… without breaking things, we need to test everything. Remember benchmarks: only 18% achieve elite performance (on-demand deploys, <5% failure, <1hr recovery); high performers deploy daily/weekly (<10% failure, <1 day recovery).

How to Fix It: Use a risk-based approach per application. For frequent deployments, demand rigorous testing, deep observability (using SRE principles like SLOs), canary releases, and clear Error Budgets.

Neglecting user and developer experiences

Focusing solely on automation pipelines forgets the humans involved: end-users and developers.

Feature flags, for instance, are often just used as on/off switches. They’re versatile tools for safer rollouts, A/B testing, and resilience, missing this potential is a loss.

Another pitfall: overloading developers by shifting too much infrastructure, testing, and security work “left” without proper support. This creates cognitive overload and kills productivity, imposing a “developer tax”, it’s unrealistic to expect developers to master everything.

How to Fix It: Discuss how DevOps practices impact people. Is the user experience good? Is the developer experience smooth, or are engineers drowning? Define clear roles. Consider a Platform Engineering team to provide self-service tools that reduce developer burden.

Letting tool choices run wild without standards

Empowering teams to choose tools is good, but complete freedom leads to chaos, like builders using incompatible materials. It creates technical debt and fragility.

Platform Engineering helps by providing reusable, self-service components (CI/CD, observability, etc.), creating “paved roads” with embedded standards. Most orgs now have platform teams, boosting productivity and quality. Focusing only on tools without solid architecture causes issues. “Automation can show quick wins… but poor architecture leads to operational headaches”.

How to Fix It: Balance team autonomy with clear standards via Platform Engineering or strong architectural guidance. Define tool adoption processes. Foster collaboration between DevOps, platform, architecture, and delivery teams on shared capabilities.

Expecting teams to magically handle risk

Shifting security “left” doesn’t automatically mean risks are managed effectively. Do teams have the time, expertise, and tools for proactive mitigation? Many orgs lack sufficient security support for all teams.

Thinking security is just managing vulnerability lists is reactive. True DevSecOps builds security in. Data security is also often overlooked, with severe consequences. AI code generation adds another layer requiring rigorous testing.

How to Fix It: Don’t just assume teams handle risk. Require risk mitigation and tech debt on roadmaps. Implement automated security testing, regular security reviews, and threat modeling. Define release management with risk checkpoints. Leverage SRE practices like production readiness reviews (PRRs).

The CIO staying Hands-Off until there’s a crisis

A fundamental mistake CIOs make is fully delegating DevOps and only getting involved during crises. Because DevOps often feels “in the weeds,” it tends to be pushed down the hierarchy. But DevOps is strategic, it’s about delivering value faster and more reliably.

Given DevOps’ evolution, expect varied interpretations. As a CIO, be proactively involved. Shape the culture, engage regularly (not just during crises), champion investments (platforms, training, SRE), and ensure alignment with business needs and risk tolerance.

How to Fix It: Engage early and consistently. Champion the culture shift. Ask about value delivery, risk management, and developer productivity. Sponsor platform/SRE teams. Ensure business alignment. Your active leadership is crucial.

Avoiding these pitfalls isn’t magic, DevOps is a continuous journey. But understanding these traps and focusing on culture, solid operations, user/developer experience, sensible standards, proactive risk management, and engaged leadership significantly boosts your chances of building a DevOps capability that delivers real business value.

April 19, 2025 by Fernando SRE DevOps stuff SRE stuff