Cloud Architecture

How To Design a Real-Time Big Data Solution on AWS

In the era of data-driven decision-making, organizations must efficiently handle and analyze immense volumes of data in real-time to maintain a competitive edge. As an AWS Solutions Architect, one of the critical tasks you may encounter is designing an architecture that can efficiently handle the ingestion, processing, and analysis of large datasets as they stream in from various sources. The goal is to ensure that the solution is scalable and capable of delivering high performance consistently, regardless of the data volume.

Building the Foundation. Real-Time Data Ingestion

The journey begins with the ingestion of data. When data streams continuously from multiple sources, such as application logs, user interactions, and IoT devices, it’s essential to use a service that can handle this flow with minimal latency. Amazon Kinesis Data Streams is the ideal choice here. Kinesis is engineered to handle real-time data ingestion at scale, allowing you to capture and process data as it arrives, with low latency. Its ability to scale dynamically ensures that your system remains robust no matter the surge in data volume.
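
To make this first stage concrete, here is a minimal Python sketch using boto3 that a producer might use to push events into a Kinesis data stream. The stream name, partition key, and event fields are illustrative assumptions, not part of any specific design.

import json
import boto3

kinesis = boto3.client("kinesis")

def send_clickstream_event(event: dict) -> None:
    # Kinesis uses the partition key to decide which shard receives the record.
    kinesis.put_record(
        StreamName="clickstream-events",            # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],              # assumed field on the event
    )

send_clickstream_event({"user_id": "42", "action": "page_view", "page": "/home"})

For higher throughput, put_records can batch up to 500 records per call, which reduces per-request overhead.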

Processing Data in Real-Time. The Power of Serverless

Once the data is ingested, the next step is real-time processing. This is where AWS Lambda shines. Lambda allows you to run code in response to events without provisioning or managing servers. As data flows through Kinesis, Lambda can be triggered to process each chunk of data, applying necessary transformations, filtering, and even enriching the data on the fly. The serverless nature of Lambda means it automatically scales with your data, processing millions of records without any manual intervention, which is crucial for maintaining a seamless and responsive architecture.
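
A minimal sketch of such a handler, assuming the function is wired to the stream as a Kinesis trigger: Kinesis delivers record payloads base64-encoded, so the handler decodes each one before applying a placeholder transformation.

import base64
import json

def lambda_handler(event, context):
    transformed = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Example transformation: drop heartbeat events and tag the rest.
        if payload.get("action") == "heartbeat":
            continue
        payload["processed"] = True
        transformed.append(payload)
    # In a full pipeline, the transformed records would be written onward (e.g., to S3).
    return {"processed_count": len(transformed)}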

Storing Processed Data. Durability Meets Scalability

After processing, the transformed data needs to be stored in a way that it is both durable and easily accessible for future analysis. Amazon S3 is the backbone of storage in this architecture. With its virtually unlimited storage capacity and high durability, S3 ensures that your data is safe and readily available. For those more complex analytical queries, Amazon Redshift serves as a powerful data warehouse. Redshift allows for efficient querying of large datasets, enabling quick insights from your processed data. By separating storage (S3) and compute (Redshift), the architecture leverages the best of both worlds: cost-effective storage and powerful analytics.
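
As a sketch of this storage and analytics hand-off, the snippet below writes a batch of processed records to S3 and then asks Redshift to load them with a COPY statement issued through the Redshift Data API. The bucket, cluster, database, table, and IAM role names are all placeholders.

import json
import boto3

s3 = boto3.client("s3")
redshift_data = boto3.client("redshift-data")

transformed_records = [{"user_id": "42", "action": "page_view", "processed": True}]

# 1. Persist the processed batch to the data lake.
s3.put_object(
    Bucket="analytics-data-lake",                    # placeholder bucket
    Key="processed/2024/08/01/batch-0001.json",
    Body="\n".join(json.dumps(r) for r in transformed_records).encode("utf-8"),
)

# 2. Load the batch into Redshift for analytical queries.
redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",           # placeholder cluster
    Database="analytics",
    DbUser="etl_user",
    Sql=(
        "COPY events "
        "FROM 's3://analytics-data-lake/processed/2024/08/01/' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
        "FORMAT AS JSON 'auto';"
    ),
)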

Visualizing Data. Turning Insights into Action

Data, no matter how well processed, is only valuable when it can be turned into actionable insights. Amazon QuickSight provides an intuitive platform for stakeholders to interact with the data through dashboards and visualizations. QuickSight seamlessly integrates with Redshift and S3, making it easy to visualize data in real-time. This empowers decision-makers to monitor key metrics, observe trends, and respond to changes with agility.

Optimizing for Scalability and Cost-Efficiency

Scalability is a cornerstone of this architecture. By leveraging AWS’s built-in scaling features, services like Amazon Kinesis and Redshift can automatically adjust to fluctuations in data volume. For Amazon Kinesis, enabling Kinesis Data Streams On-Demand ensures that the architecture scales out to handle higher loads during peak times and scales in during quieter periods, optimizing costs without manual intervention. Similarly, Amazon Redshift uses Concurrency Scaling to handle spikes in query load by adding additional compute resources as needed, and Elastic Resize allows the infrastructure to dynamically adjust storage and compute capacity. These auto-scaling mechanisms ensure that the infrastructure remains both cost-effective and high-performing, regardless of the data throughput.
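
If a stream was originally created in provisioned mode, switching it to on-demand capacity is a single API call. A sketch with a placeholder stream ARN:

import boto3

kinesis = boto3.client("kinesis")

# Move an existing stream from provisioned shards to on-demand scaling.
kinesis.update_stream_mode(
    StreamARN="arn:aws:kinesis:eu-west-1:123456789012:stream/clickstream-events",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)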

How the Services Work Together

The true strength of this architecture lies in the seamless integration of AWS services, each contributing to a robust, scalable, and efficient big data solution. The journey begins with Amazon Kinesis Data Streams, which captures and ingests data in real-time from various sources. This real-time ingestion ensures that data flows into the system with minimal latency, ready for immediate processing.

AWS Lambda steps in next, automatically processing this data as it arrives. Lambda’s serverless nature allows it to scale dynamically with the incoming data, applying necessary transformations, filtering, and enrichment. This immediate processing ensures that the data is in the right format and enriched with relevant information before moving on to the next stage.

The processed data is then stored in Amazon S3, which serves not only as a scalable and durable storage solution but also as the foundation of a Data Lake. In a big data architecture, a Data Lake on S3 acts as a centralized repository where both raw and processed data can be stored, regardless of format or structure. This flexibility allows for diverse datasets to be ingested, stored, and analyzed over time. By leveraging S3 as a Data Lake, the architecture supports long-term storage and future-proofing, enabling advanced analytics and machine learning applications on historical data.

Amazon Redshift integrates seamlessly with this Data Lake, pulling in the processed data from S3 for complex analytical queries. The synergy between S3 and Redshift ensures that data can be accessed and analyzed efficiently, with Redshift providing the computational power needed for deep dives into large datasets. This capability allows organizations to derive meaningful insights from their data, turning raw information into actionable business intelligence.

Finally, Amazon QuickSight adds a layer of accessibility to this architecture. By connecting directly to both S3 and Redshift, QuickSight enables real-time data visualization, allowing stakeholders to interact with the data through intuitive dashboards. This visualization is not just the final step in the data pipeline but a crucial component that transforms data into strategic insights, driving informed decision-making across the organization.

Basically

The architecture designed here showcases the power and flexibility of AWS in handling big data challenges. By utilizing services like Kinesis, Lambda, S3, Redshift, and QuickSight, you can build a solution that not only processes and analyzes data in real-time but also scales automatically to meet the demands of any situation. This design empowers organizations to make data-driven decisions faster, providing a competitive edge in today’s fast-paced environment. With AWS, the possibilities for innovation in big data are endless.

An Easy Introduction to Route 53 Routing Policies

When you think about the cloud, it’s easy to get lost in the vastness of it all: servers, data centers, networks, and more. But at the core of it, there’s a simple idea: making sure that when someone types a website name into their browser, they get where they need to go as quickly and reliably as possible. That’s where AWS Route 53 comes into play. Route 53 is a powerful tool that Amazon Web Services provides to help manage how internet traffic gets directed to your online resources, like web servers or applications.

Now, one of the things that makes Route 53 special is its range of Routing Policies. These policies let you control how traffic is distributed to your resources based on different criteria. Let’s break these down in a way that’s easy to understand, and along the way, I’ll show you how each can be useful in real-life situations.

Simple Routing Policy

Let’s start with the Simple Routing Policy. This one lives up to its name: it routes traffic to a single resource. Imagine you’ve got a website, and it’s running on a single server. You don’t need anything fancy here; you want all the traffic to your domain, say www.mysimplewebsite.com, to go straight to that server. Simple Routing is your go-to. It’s like directing all the cars on a road to a single destination without any detours.

Failover Routing Policy

But what happens when things don’t go as planned? Servers can go down; there’s no way around it. This is where the Failover Routing Policy shines. Picture this: you’ve got a primary server that handles all your traffic. But, just in case that server fails, you’ve set up a backup server in another location. Failover Routing is like having a backup route on your GPS; if the main road is blocked, it automatically takes you down the secondary road. Your users won’t even notice the switch; they’ll just keep on going as if nothing happened.

Geolocation Routing Policy

Next up is the Geolocation Routing Policy. This one’s pretty cool because it lets you route traffic based on where your users are physically located. Say you run a global business and you want users in Japan to access your website in Japanese and users in Germany to get the content in German. With Geolocation Routing, Route 53 checks where the DNS query is coming from and sends users to the server that best fits their location. It’s like having custom-tailored suits for your website visitors, giving them exactly what they need based on where they are.

Geoproximity Routing Policy

Now, if Geolocation is like tailoring content to where users are, Geoproximity Routing Policy takes it a step further by letting you fine-tune things even more. This policy allows you to route traffic not just based on location, but also based on the physical distance between the user and your resources. Plus, you can introduce a bias; maybe you want to favor one location over another for strategic reasons. Imagine you’re running servers in New York and London, but you want to make sure that even though a user in Paris is closer to London, they still get routed to New York because you have more resources available there. Geoproximity Routing lets you do just that, like tweaking the dials on a soundboard to get the perfect mix.

Latency-Based Routing Policy

Ever notice how some websites just load faster than others? A lot of that has to do with latency, the time it takes for data to travel between the server and your device. With the Latency-Based Routing Policy, Route 53 directs users to the resource that will respond the quickest. This is especially useful if you’ve got servers spread out across the globe. If a user in Sydney accesses your site, Latency-Based Routing will send them to the nearest server in, say, Singapore, rather than making them wait for a response from a server in the United States. It’s like choosing the shortest line at the grocery store to get your shopping done faster.

Multivalue Answer Routing Policy

The Multivalue Answer Routing Policy is where things get interesting. It’s kind of like a basic load balancer. Route 53 can return several IP addresses (up to eight to be exact) in response to a single DNS query, distributing traffic among multiple resources. If one of those resources fails, it gets removed from the list, so your users only get directed to healthy resources. Think of it as having multiple checkout lines open at a store; if one line gets too long or closes down, customers are directed to the next available line.

Weighted Routing Policy

Finally, there’s the Weighted Routing Policy, which is all about control. Imagine you’re testing a new feature on your website. You don’t want to send all your users to the new version right away; instead, you want to direct a small percentage of traffic to it while the rest still go to the old version. With Weighted Routing, you assign a “weight” to each version, controlling how much traffic goes where. It’s like controlling the flow of water with a series of valves; you can adjust them to let more or less water (or in this case, traffic) flow through each pipe.
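
As an illustration, weighted records are ordinary record sets that share a name but carry different SetIdentifier and Weight values. The boto3 sketch below sends roughly 90% of DNS answers to the current version and 10% to the new one; the hosted zone ID and IP addresses are placeholders.

import boto3

route53 = boto3.client("route53")

def weighted_record(identifier: str, ip: str, weight: int) -> dict:
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "www.mysimplewebsite.com",
            "Type": "A",
            "SetIdentifier": identifier,   # distinguishes records that share a name
            "Weight": weight,              # relative share of DNS responses
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",        # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Canary release: 90/10 split between current and new versions",
        "Changes": [
            weighted_record("current-version", "203.0.113.10", 90),
            weighted_record("new-version", "203.0.113.20", 10),
        ],
    },
)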

Wrapping It All Up

So there you have it, AWS Route 53’s Routing Policies in a nutshell. Whether you’re running a simple blog or a complex global application, these policies give you the tools to manage how your users connect to your resources. They help you make sure that traffic gets where it needs to go, efficiently and reliably. And the best part? You don’t need to be a DNS expert to start using them. Just think about what you need: reliability, speed, localized content, or a mix of everything, and there’s a routing policy that can make it happen.

In the end, understanding these policies isn’t just about learning some technical details; it’s about gaining the power to shape how your online presence performs in the real world.

Designing a GDPR-Compliant Solution on AWS

Today, we’re taking a look into the world of data protection and compliance in the AWS cloud. If you’re handling personal data, you know how crucial it is to meet the stringent requirements of the General Data Protection Regulation (GDPR). Let’s explore how we can architect a robust solution on AWS that keeps your data safe and sound while ensuring you stay on the right side of the law.

The Challenge: Protecting Personal Data in the Cloud

Imagine this: you’re building an application or service on AWS that collects and processes personal data. This could be anything from names and email addresses to sensitive financial information or health records. GDPR mandates that you implement appropriate technical and organizational measures to protect this data from unauthorized access, disclosure, alteration, or loss. But where do you start?

Key Components of a GDPR-Compliant AWS Architecture

Let’s break down the essential building blocks of our GDPR-compliant architecture:

  1. Encryption in Transit and at Rest: Think of this as the digital equivalent of a locked safe. We’ll use SSL/TLS to encrypt data as it travels over the network, ensuring that prying eyes can’t intercept it. For data stored in Amazon S3 (Simple Storage Service) and Amazon RDS (Relational Database Service), we’ll enable encryption at rest, scrambling the data so that even if someone gains access to the storage, they can’t decipher it without the correct key (a configuration sketch follows this list).
  2. AWS Key Management Service (KMS): This is our keymaster, holding the keys to the kingdom (or rather, the encrypted data). We’ll use KMS to create and manage cryptographic keys, ensuring that only authorized personnel can access them. We’ll also set up fine-grained policies to control who can use which keys for what purpose.
  3. IAM Roles and Policies: IAM (Identity and Access Management) is like the bouncer at the club, deciding who gets in and what they can do once they’re inside. We’ll create roles and policies that adhere to the principle of least privilege, granting users and services only the permissions they need. Plus, we’ll enable logging and monitoring to keep an eye on who’s doing what.
  4. Protection Against Threats: It’s not enough to just lock the doors; we need to guard against intruders. AWS Shield Advanced will act as our first line of defense, protecting our infrastructure from distributed denial-of-service (DDoS) attacks that could disrupt our services. AWS WAF (Web Application Firewall) will stand guard at the application level, filtering out malicious traffic and preventing common web attacks like SQL injection and cross-site scripting.
  5. Monitoring and Auditing: Think of this as our security camera system. AWS CloudTrail will record every API call and activity in our AWS account, creating a detailed audit trail. Amazon CloudWatch will monitor key security metrics, alerting us to any suspicious activity so we can respond quickly.
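
To ground the encryption-at-rest idea from the first item, here is a minimal boto3 sketch that makes SSE-KMS the default for every object written to an S3 bucket. The bucket name and key ARN are assumptions for illustration only.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="customer-records",               # placeholder bucket name
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:eu-west-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
                },
                "BucketKeyEnabled": True,    # reduces the number of KMS requests
            }
        ]
    },
)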

The Symphony of GDPR Compliance on AWS

Let’s explore how these components work together to create a harmonious and secure environment for personal data in the AWS cloud:

  • Data Flow: The Encrypted Journey
    • When a user interacts with your application (e.g., submits a form, or makes a purchase), their data is encrypted in transit using SSL/TLS. This ensures that the data is scrambled during its journey over the network, making it unreadable to anyone who might intercept it.
  • Data Storage: The Fort Knox of Data
    • Once the encrypted data reaches your AWS environment, it’s stored in services like Amazon S3 for objects (files) or Amazon RDS for structured data (databases). These services provide encryption at rest, adding an extra layer of protection. Even if someone gains unauthorized access to the storage itself, they won’t be able to decipher the data without the encryption keys.
    • KMS Integration: Here’s where AWS KMS comes into play. It acts as the vault for your encryption keys. When you store data in S3 or RDS, you can choose to have them encrypted using KMS keys. This tight integration ensures that your data is protected with strong encryption and that only authorized entities (users or services with the right permissions) can access the keys needed to decrypt it.
  • Key Management: The Guardian of Secrets
    • KMS not only stores your keys but also allows you to manage them through a centralized interface. You can rotate keys, define who can use them (through IAM policies), and even create audit trails to track key usage. This level of control is crucial for GDPR compliance, as it ensures that you have a clear record of who has accessed your data and when.
  • Access Control: The Gatekeeper
    • IAM acts as the gatekeeper to your AWS resources. It allows you to define roles (collections of permissions) and policies (rules that determine who can access what). By adhering to the principle of least privilege, you grant users and services only the minimum permissions necessary to do their jobs. This minimizes the risk of unauthorized access or accidental data breaches.
    • IAM and KMS: IAM and KMS work hand-in-hand. You can use IAM policies to specify who can manage KMS keys, who can use them to encrypt/decrypt data, and even which specific resources (e.g., S3 buckets or RDS databases) each key can be used for. A policy sketch follows this list.
  • Threat Protection: The Shield and the Firewall
    • AWS Shield: Think of Shield as your frontline defense against DDoS attacks. These attacks aim to overwhelm your application with traffic, making it unavailable to legitimate users. Shield absorbs and mitigates this traffic, keeping your services up and running.
    • AWS WAF: While Shield protects your infrastructure, WAF guards your application layer. It acts as a filter, analyzing web traffic for signs of malicious activity like SQL injection attempts or cross-site scripting. WAF can block this traffic before it reaches your application, preventing potential data breaches.
  • Monitoring and Auditing: The Watchful Eyes
    • AWS CloudTrail: This service records API calls made within your AWS account. This means every action taken on your resources (e.g., someone accessing an S3 bucket, or modifying a database) is logged. This audit trail is invaluable for investigating security incidents, demonstrating compliance to auditors, and ensuring accountability.
    • Amazon CloudWatch: This is your real-time monitoring service. It collects logs and metrics from various AWS services, allowing you to set up alarms for unusual activity. For example, you could create an alarm that triggers if there’s a sudden spike in failed login attempts or if someone tries to access a sensitive resource from an unusual location.
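
To make the IAM-and-KMS point tangible, the sketch below creates a least-privilege IAM policy that permits decryption and data-key generation with one specific KMS key and nothing else. The policy name and key ARN are placeholders.

import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UseCustomerDataKeyOnly",
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws:kms:eu-west-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
        }
    ],
}

iam.create_policy(
    PolicyName="customer-data-key-use-only",   # placeholder policy name
    PolicyDocument=json.dumps(policy_document),
)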

A Secure Foundation for GDPR Compliance

By implementing this architecture, we’ve built a solid foundation for GDPR compliance in the AWS cloud. Our data is protected at every stage, from transit to storage, and access is tightly controlled. We’ve also implemented robust measures to defend against threats and monitor for suspicious activity. This not only helps us avoid costly fines and legal issues but also builds trust with our users, who can rest assured that their data is in safe hands.

Remember, GDPR compliance is an ongoing process. It’s essential to regularly review and update your security measures to keep pace with evolving threats and regulations. But with a well-designed architecture like the one we’ve outlined here, you’ll be well on your way to protecting personal data and ensuring your business thrives in the cloud.

Scaling for Success. Cost-Effective Cloud Architectures on AWS

One of the most exciting aspects of cloud computing is the promise of scalability, the ability to expand or contract resources to meet demand. But how do you design an architecture that can handle unexpected traffic spikes without breaking the bank during quieter periods? This question often comes up in AWS Solution Architect interviews, and for good reason. It’s a core challenge that many businesses face when moving to the cloud. Let’s explore some AWS services and strategies that can help you achieve both scalability and cost efficiency.

Building a Dynamic and Cost-Aware AWS Architecture

Imagine your application is like a bustling restaurant. During peak hours, you need a full staff and all tables ready. But during off-peak times, you don’t want to be paying for idle resources. Here’s how we can translate this concept into a scalable AWS architecture:

  1. Auto Scaling Groups (ASGs): Think of ASGs as your restaurant’s staffing manager. They automatically adjust the number of EC2 instances (your servers) based on predefined rules. If your website traffic suddenly spikes, ASGs will spin up additional instances to handle the load. When traffic dies down, they’ll scale back, saving you money. You can even combine ASGs with Spot Instances for even greater cost savings.
  2. Amazon EC2 Spot Instances: These are like the temporary staff you might hire during a particularly busy event. Spot Instances let you take advantage of unused EC2 capacity at a much lower cost. If your demand is unpredictable, Spot Instances can be a great way to save money while ensuring you have enough resources to handle peak loads.
  3. AWS Lambda: Lambda is your kitchen staff that only gets paid when they’re cooking, and they’re really good at their job; they can whip up a dish in under 15 minutes (the maximum Lambda execution time)! It’s a serverless compute service that runs your code in response to events (like a new file being uploaded or a database change). You only pay for the compute time you actually use, making it ideal for sporadic or unpredictable workloads.
  4. AWS Fargate: Fargate is like having a catering service handle your entire kitchen operation. It’s a serverless compute engine for containers, meaning you don’t have to worry about managing the underlying servers. Fargate automatically scales your containerized applications based on demand, and you only pay for the resources your containers consume.

How the Pieces Fit Together

Now, let’s see how these services can work together in harmony:

  • Core Application on EC2 with Auto Scaling: Your main application might run on EC2 instances within an Auto Scaling Group. You can configure this group to monitor the CPU utilization of your servers and automatically launch new instances if the average CPU usage reaches a threshold, such as 75% (this is known as a Target Tracking Scaling Policy; a sketch follows this list). This ensures you always have enough servers running to handle the current load, even during unexpected traffic spikes.
  • Spot Instances for Cost Optimization: To save costs, you could configure your Auto Scaling Group to use Spot Instances whenever possible. This allows you to take advantage of lower prices while still scaling up when needed. Importantly, you’ll also want to configure a mixed instances policy for the group so that a baseline of On-Demand Instances is always part of the mix. That way, if Spot capacity is reclaimed or unavailable (due to high demand or price fluctuations), your application still has On-Demand capacity to fall back on, and you can reliably meet its resource needs even when Spot Instances are scarce.
  • Lambda for Event-Driven Tasks: Lambda functions excel at handling event-driven tasks that don’t require a constantly running server. For example, when a new image is uploaded to your S3 bucket, you can trigger a Lambda function to automatically resize it or convert it to a different format. Similarly, Lambda can be used to send notifications to users when certain events occur in your application, such as a new order being placed or a payment being processed. Since Lambda functions are only active when triggered, they can significantly reduce your costs compared to running dedicated EC2 instances for these tasks.
  • Fargate for Containerized Microservices:  If your application is built using microservices, you can run them in containers on Fargate. This eliminates the need to manage servers and allows you to scale each microservice independently. By decoupling your microservices and using Amazon Simple Queue Service (SQS) queues for communication, you can ensure that even under heavy load, all requests will be handled and none will be lost. For applications where the order of operations is critical, such as financial transactions or order processing, you can use FIFO (First-In-First-Out) SQS queues to maintain the exact order of messages.
  • Monitoring and Optimization: Imagine having a restaurant manager who constantly monitors how busy the restaurant is, how much food is being wasted, and how satisfied the customers are. This is what Amazon CloudWatch does for your AWS environment. It provides detailed metrics and alarms, allowing you to fine-tune your scaling policies and optimize your resource usage. With CloudWatch, you can visualize the health and performance of your entire AWS infrastructure at a glance through intuitive dashboards and graphs. These visualizations make it easy to identify trends, spot potential issues, and make informed decisions about resource allocation and optimization.
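
As a sketch of the target tracking policy mentioned in the first bullet, the boto3 call below keeps the group’s average CPU utilization near 75%; the Auto Scaling Group name is a placeholder.

import boto3

autoscaling = boto3.client("autoscaling")

# Track a 75% average CPU target; the group adds or removes instances
# automatically to stay close to it.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-app-asg",          # placeholder ASG name
    PolicyName="cpu-target-75",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 75.0,
    },
)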

The Outcome. A Satisfied Customer and a Healthy Bottom Line

By combining these AWS services and strategies, you can build a cloud architecture that is both scalable and cost-effective. This means your application can gracefully handle unexpected traffic spikes, ensuring a smooth user experience even during peak demand. At the same time, you won’t be paying for idle resources during quieter periods, keeping your cloud costs under control.

Final Analysis

Designing for scalability and cost efficiency is a fundamental aspect of cloud architecture. By leveraging AWS services like Auto Scaling, EC2 Spot Instances, Lambda, and Fargate, you can create a dynamic and responsive environment that adapts to your application’s needs. Remember, the key is to understand your workload patterns and choose the right tools for the job. With careful planning and the right AWS services, you can build a cloud architecture that is both powerful and cost-effective, setting your business up for success in the cloud and in the restaurant. 😉

Essential Steps for Configuring AWS Elastic Load Balancer

In today’s cloud-centric world, efficiently managing traffic to your applications is crucial for ensuring optimal performance and high availability. Amazon Web Services (AWS) offers a powerful solution for this purpose: the Elastic Load Balancer (ELB). As a Cloud Architect and DevOps Engineer, understanding how to configure an ELB properly is fundamental to creating robust and scalable architectures. Let’s look into the key parameters and steps involved in setting up an AWS ELB.

ELB

The AWS Elastic Load Balancer acts as a traffic cop for your application, intelligently distributing incoming requests across multiple targets, such as EC2 instances, containers, or IP addresses. A well-configured ELB not only improves the responsiveness of your application but also enhances its fault tolerance. Let’s explore the essential parameters you need to consider when setting up an ELB, providing you with a solid foundation for optimizing your AWS infrastructure.


Key Parameters for ELB Configuration


1. Name

The name of your ELB is more than just a label. It’s an identifier that helps you quickly recognize and manage your load balancer within the AWS ecosystem. Choose a descriptive name that aligns with your naming conventions, making it easier for your team to identify its purpose and associated application.

2. VPC (Virtual Private Cloud)

Selecting the appropriate VPC for your ELB is crucial. The VPC defines the network environment in which your load balancer will operate. It determines the IP address range available to your ELB and the network rules that will apply. Ensure that the chosen VPC aligns with your application’s network requirements and security policies.

3. Subnet

Subnets are subdivisions of your VPC that allow you to group your resources based on security or operational needs. When configuring your ELB, you’ll need to select at least two subnets in different Availability Zones. This choice is critical for high availability, as it allows your ELB to route traffic to healthy instances even if one zone experiences issues.

4. Security Group

The security group acts as a virtual firewall for your ELB, controlling inbound and outbound traffic. When configuring your ELB, you’ll need to either create a new security group or select an existing one. Ensure that the security group rules allow traffic on the ports your application uses and restrict access to trusted sources only.

5. DNS Name and Route 53 Registration

Upon creation, your ELB is assigned a DNS name. This name is crucial for routing traffic to your load balancer. For easier management and improved user experience, it’s recommended to register this DNS name with Amazon Route 53, AWS’s scalable domain name system (DNS) web service. This step allows you to use a custom domain name that points to your ELB.

6. Zone ID

The Zone ID identifies the Route 53 hosted zone associated with your load balancer. Every ELB exposes a canonical hosted zone ID alongside its DNS name, and you supply this value when creating Route 53 alias records that point a custom domain at the ELB. Getting it right ensures that DNS queries for your load balancer resolve consistently and accurately.

7. Ports – ELB Port & Target Port

Configuring the ports is a critical step in setting up your ELB. The ELB port is where the load balancer listens for incoming traffic, while the target port is where your application instances are listening. For example, you might configure your ELB to listen on port 80 (HTTP) or 443 (HTTPS) and forward traffic to your instances on port 8080.

8. Health Checks

Health checks are the ELB’s way of ensuring that traffic is only routed to healthy instances. When configuring health checks, you’ll specify the protocol, port, and path that the ELB should use to check the health of your instances. You’ll also set the frequency of these checks and the number of successive failures that should occur before an instance is considered unhealthy.

9. SSL Certificate

An SSL certificate is used to encrypt traffic between your clients and the ELB, ensuring secure data transmission. Configuring an SSL certificate is crucial for applications that handle sensitive data or require compliance with security standards. Don’t forget that AWS provides options for uploading your certificate or using AWS Certificate Manager to manage certificates.

10. Protocol

The protocol parameter defines the communication protocols for both front-end (client to ELB) and back-end (ELB to target) traffic. Common protocols include HTTP, HTTPS, TCP, and UDP. Choosing the right protocol based on your application’s requirements is critical for ensuring efficient and secure data transmission.
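
To tie these parameters together, here is a hedged boto3 sketch that creates an Application Load Balancer, a target group with a health check, and an HTTPS listener forwarding to the targets on port 8080. Every identifier (subnets, security group, VPC, certificate ARN) is a placeholder you would replace with your own.

import boto3

elbv2 = boto3.client("elbv2")

# 1. Name, subnets in two Availability Zones, and a security group.
lb = elbv2.create_load_balancer(
    Name="web-app-alb",
    Subnets=["subnet-0aaa1111", "subnet-0bbb2222"],
    SecurityGroups=["sg-0ccc3333"],
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]

# 2. Target group: the target port plus the health check settings.
tg = elbv2.create_target_group(
    Name="web-app-targets",
    Protocol="HTTP",
    Port=8080,
    VpcId="vpc-0ddd4444",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/health",
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=3,
)["TargetGroups"][0]

# 3. Listener: the ELB port, protocol, and SSL certificate.
elbv2.create_listener(
    LoadBalancerArn=lb["LoadBalancerArn"],
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:eu-west-1:123456789012:certificate/example"}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)

# The DNS name and canonical hosted zone ID used for Route 53 alias records.
print(lb["DNSName"], lb["CanonicalHostedZoneId"])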

In a few words

Configuring an AWS Elastic Load Balancer is a critical step in building a resilient and high-performance application infrastructure. Each parameter we’ve discussed plays a vital role in ensuring that your ELB effectively distributes traffic, maintains high availability, and secures your application.

Remember, the art of configuring an ELB lies not just in setting these parameters correctly, but in aligning them with your specific application needs and architectural goals. As you play with its configuration, you’ll develop an intuition for fine-tuning these settings to optimize performance and cost-efficiency.

In the field of cloud computing, staying informed about best practices and new features in AWS ELB configuration is crucial. Regularly revisiting and refining your ELB setup will ensure that your application continues to deliver the best possible experience to your users while maintaining the scalability and reliability that modern cloud architectures demand.

By mastering the configuration of AWS ELB, you’re not just setting up a load balancer; you’re laying the foundation for a robust, scalable, and efficient cloud infrastructure that can adapt to the changing needs of your application and user base.

Beyond 404. Exploring the Universe of Elastic Load Balancer Errors

In the world of cloud computing, Elastic Load Balancers (ELBs) play a crucial role in distributing incoming application traffic across multiple targets, such as EC2 instances, containers, and IP addresses. As a Cloud Architect or DevOps engineer, understanding the error messages associated with ELBs is essential for maintaining robust and reliable systems. This article aims to demystify the most common ELB error messages, providing you with the knowledge to quickly identify and resolve issues.

The Power of Load Balancers

Before we explore the error messages, let’s briefly recap the main features of Load Balancers:

  1. Traffic Distribution: ELBs efficiently distribute incoming application traffic across multiple targets.
  2. High Availability: They improve application fault tolerance by automatically routing traffic away from unhealthy targets.
  3. Auto Scaling: ELBs work seamlessly with Auto Scaling groups to handle varying loads.
  4. Security: They can offload SSL/TLS decryption, reducing the computational burden on your application servers.
  5. Health Checks: Regular health checks ensure that traffic is only routed to healthy targets.

Now, let’s explore the error messages you might encounter when working with ELBs.

Decoding ELB Error Messages

When troubleshooting issues with your ELB, you’ll often encounter HTTP status codes. These codes are divided into two main categories:

  1. 4xx errors: Client-side errors
  2. 5xx errors: Server-side errors

Understanding this distinction is crucial for pinpointing the source of the problem and implementing the appropriate solution.

Client-Side Errors (4xx)

These errors indicate that the issue originates from the client’s request. Some common 4xx errors include:

  • 400 Bad Request: The request was malformed or invalid.
  • 401 Unauthorized: The request lacks valid authentication credentials.
  • 403 Forbidden: The client cannot access the requested resource.
  • 404 Not Found: The requested resource doesn’t exist on the server.

Server-Side Errors (5xx)

These errors suggest that the problem lies with the server. Common 5xx errors include:

  • 500 Internal Server Error: A generic error message when the server encounters an unexpected condition.
  • 502 Bad Gateway: The server received an invalid response from an upstream server.
  • 503 Service Unavailable: The server is temporarily unable to handle the request.
  • 504 Gateway Timeout: The server didn’t receive a timely response from an upstream server.

The Frustrating HTTP 504: Gateway Timeout Error

The 504 Gateway Timeout error deserves special attention due to its frequency and the frustration it can cause. This error occurs when the ELB doesn’t receive a response from the target within the configured timeout period.

Common causes of 504 errors include:

  1. Overloaded backend servers
  2. Network connectivity issues
  3. Misconfigured timeout settings
  4. Database query timeouts

To resolve 504 errors, you may need to:

  • Increase the timeout settings on your ELB (a sketch follows this list)
  • Optimize your application’s performance
  • Scale your backend resources
  • Check for and resolve any network issues
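
For the first of these steps, the idle timeout of an Application Load Balancer is an attribute you can raise with a single call; a sketch with a placeholder ARN:

import boto3

elbv2 = boto3.client("elbv2")

# Raise the idle timeout from the default 60 seconds to 120 seconds.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/app/web-app-alb/0123456789abcdef",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "120"}],
)

Keep in mind that a longer timeout only buys time; a consistently slow backend still needs the optimization and scaling steps above.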

List of Common Error Messages

Here’s a more comprehensive list of error messages you might encounter:

  1. 400 Bad Request
  2. 401 Unauthorized
  3. 403 Forbidden
  4. 404 Not Found
  5. 408 Request Timeout
  6. 413 Payload Too Large
  7. 500 Internal Server Error
  8. 501 Not Implemented
  9. 502 Bad Gateway
  10. 503 Service Unavailable
  11. 504 Gateway Timeout
  12. 505 HTTP Version Not Supported

Tips to Avoid Errors and Quickly Identify Problems

  1. Implement robust logging and monitoring: Use tools like CloudWatch to track ELB metrics and set up alarms for quick notification of issues (a sketch follows this list).
  2. Regularly review and optimize your application: Conduct performance testing to identify bottlenecks before they cause problems in production.
  3. Use health checks effectively: Configure appropriate health check settings to ensure traffic is only routed to healthy targets.
  4. Implement circuit breakers: Use circuit breakers in your application to prevent cascading failures.
  5. Practice proper error handling: Ensure your application handles errors gracefully and provides meaningful error messages.
  6. Keep your infrastructure up-to-date: Regularly update your ELB and target instances to benefit from the latest improvements and security patches.
  7. Use AWS X-Ray: Implement AWS X-Ray to gain insights into request flows and quickly identify the root cause of errors.
  8. Implement proper security measures: Use security groups, network ACLs, and SSL/TLS to secure your ELB and prevent unauthorized access.
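
As an example of the first tip, a CloudWatch alarm on the load balancer’s 5xx count gives you an early warning before users start complaining. The values below are a sketch; the load balancer dimension and SNS topic ARN are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-spike",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/web-app-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,                        # one-minute windows
    EvaluationPeriods=5,              # five consecutive breaches trigger the alarm
    Threshold=10,                     # more than ten 5xx responses per minute
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:ops-alerts"],   # placeholder topic
)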

In a few words

Understanding Elastic Load Balancer error messages is crucial for maintaining a robust and reliable cloud infrastructure. By familiarizing yourself with common error codes, their causes, and potential solutions, you’ll be better equipped to troubleshoot issues quickly and effectively.

Remember, the key to managing ELB errors lies in proactive monitoring, regular optimization, and a deep understanding of your application’s architecture. By following the tips provided and continuously improving your knowledge, you’ll be well-prepared to handle any ELB-related challenges that come your way.

As cloud architectures continue to evolve, staying informed about the latest best practices and error-handling techniques will be essential for success in your role as a Cloud Architect or DevOps engineer.

Understanding AWS VPC Lattice

Amazon Web Services (AWS) constantly innovates to make cloud computing more efficient and user-friendly. One of their newer services, AWS VPC Lattice, is designed to simplify networking in the cloud. But what exactly is AWS VPC Lattice, and how can it benefit you?

What is AWS VPC Lattice?

AWS VPC Lattice is a service that helps you manage the communication between different parts of your applications. Think of it as a traffic controller for your cloud infrastructure. It ensures that data moves smoothly and securely between various services and resources in your Virtual Private Cloud (VPC).

Key Features of AWS VPC Lattice

  1. Simplified Networking: AWS VPC Lattice makes it easier to connect different parts of your application without needing complex network configurations. You can manage communication between microservices, serverless functions, and traditional applications all in one place.
  2. Security: It provides built-in security features like encryption and access control. This means that data transfers are secure, and you can easily control who can access specific resources.
  3. Scalability: As your application grows, AWS VPC Lattice scales with it. It can handle increasing traffic and ensure your application remains fast and responsive.
  4. Visibility and Monitoring: The service offers detailed monitoring and logging, so you can monitor your network traffic and quickly identify any issues.

Benefits of AWS VPC Lattice

  • Ease of Use: By simplifying the process of connecting different parts of your application, AWS VPC Lattice reduces the time and effort needed to manage your cloud infrastructure.
  • Improved Security: With robust security features, you can be confident that your data is protected.
  • Cost-Effective: By streamlining network management, you can potentially reduce costs associated with maintaining complex network setups.
  • Enhanced Performance: Optimized communication paths lead to better performance and a smoother user experience.

VPC Lattice in the real world

Imagine you have an e-commerce platform with multiple microservices: one for user authentication, one for product catalog, one for payment processing, and another for order management. Traditionally, connecting these services securely and efficiently within a VPC can be complex and time-consuming. You’d need to configure multiple security groups, manage network access control lists (ACLs), and set up inter-service communication rules manually.

With AWS VPC Lattice, you can set up secure, reliable connections between these microservices with just a few clicks, even if these services are spread across different AWS accounts. For example, when a user logs in (user authentication service), their request can be securely passed to the product catalog service to display products. When they make a purchase, the payment processing service and order management service can communicate seamlessly to complete the transaction.

Using a standard VPC setup for this scenario would require extensive manual configuration and constant management of network policies to ensure security and efficiency. AWS VPC Lattice simplifies this by automatically handling the networking configurations and providing a centralized way to manage and secure inter-service communications. This not only saves time but also reduces the risk of misconfigurations that could lead to security vulnerabilities or performance issues.

In summary, AWS VPC Lattice offers a streamlined approach to managing complex network communications across multiple AWS accounts, making it significantly easier to scale and secure your applications.

In a few words

AWS VPC Lattice is a powerful tool that simplifies cloud networking, making it easier for developers and businesses to manage their applications. Whether you’re running a small app or a large-scale enterprise solution, AWS VPC Lattice can help you ensure secure, efficient, and scalable communication between your services. Embrace this new service to streamline your cloud operations and focus more on what matters most, building great applications.

Mastering Pod Deployment in Kubernetes. Understanding Taint and Toleration

Kubernetes has become a cornerstone in modern cloud architecture, providing the tools to manage containerized applications at scale. One of the more advanced yet essential features of Kubernetes is the use of Taint and Toleration. These features help control where pods are scheduled, ensuring that workloads are deployed precisely where they are needed. In this article, we will explore Taint and Toleration, making them easy to understand, regardless of your experience level. Let’s take a look!

What are Taint and Toleration?

Understanding Taint

In Kubernetes, a Taint is a property you can add to a node that prevents certain pods from being scheduled on it. Think of it as a way to mark a node as “unsuitable” for certain types of workloads. This helps in managing nodes with specific roles or constraints, ensuring that only the appropriate pods are scheduled on them.

Understanding Toleration

Tolerations are the counterpart to taints. They are applied to pods, allowing them to “tolerate” a node’s taint and be scheduled on it despite the taint. Without a matching toleration, a pod will not be scheduled on a tainted node. This mechanism gives you fine-grained control over where pods are deployed in your cluster.

Why Use Taint and Toleration?

Using Taint and Toleration helps in:

  1. Node Specialization: Assign specific workloads to specific nodes. For example, you might have nodes with high memory for memory-intensive applications and use taints to ensure only those applications are scheduled on these nodes.
  2. Node Isolation: Prevent certain workloads from being scheduled on particular nodes, such as preventing non-production workloads from running on production nodes.
  3. Resource Management: Ensure critical workloads have dedicated resources and are not impacted by other less critical pods.

How to Apply Taint and Toleration

Applying a Taint to a Node

To add a taint to a node, you use the kubectl taint command. Here is an example:

kubectl taint nodes <node-name> key=value:NoSchedule

In this command:

  • <node-name> is the name of the node you are tainting.
  • key=value is a key-value pair that identifies the taint.
  • NoSchedule is the effect of the taint, meaning no pods will be scheduled on this node unless they tolerate the taint.

Applying Toleration to a Pod

To allow a pod to tolerate a taint, you add a toleration to its manifest file. Here is an example of a pod manifest with a toleration:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx
  tolerations:
  - key: "key"
    operator: "Equal"
    value: "value"
    effect: "NoSchedule"

In this YAML:

  • key, value, and effect must match the taint applied to the node.
  • operator: “Equal” specifies that the toleration matches a taint with the same key and value.

Practical Example

Let’s go through a practical example to reinforce our understanding. Suppose we have a node dedicated to GPU workloads. We can taint the node as follows:

kubectl taint nodes gpu-node gpu=true:NoSchedule

This command taints the node gpu-node with the key gpu and value true, and the effect is NoSchedule.

Now, let’s create a pod that can tolerate this taint:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:latest
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

This pod has a toleration that matches the taint on the node, allowing it to be scheduled on gpu-node.

In Summary

Taint and Toleration are powerful tools in Kubernetes, providing precise control over pod scheduling. By understanding and using these features, you can optimize your cluster’s performance and reliability. Whether you’re a beginner or an experienced Kubernetes user, mastering Taint and Toleration will help you deploy your applications more effectively.

Feel free to experiment with different taint and toleration configurations to see how they can best serve your deployment strategies.

The underutilized AWS Lambda Function URLs

In the fast-moving world of the cloud, AWS Lambda has rapidly become a game-changer, enabling developers to run their code without having to provision or manage servers. The “Function URL for a Lambda function” feature is like offering your Lambda function its own phone line. In the simple explanation below, I will try to demonstrate the essence of this underutilized tool, describe its tremendous utility, and give an illustration of when it is put into operation.

The Essence of Function URLs

Imagine you’ve written a brilliant piece of code that performs a specific task, like resizing images or processing data. In the past, to trigger this code, you’d typically need to set up additional services like API Gateway, which acts as a middleman to handle requests and responses. This setup can be complex and sometimes more than you need for simple tasks.

Enter Function URLs: a straightforward way to call your Lambda function directly using a simple web address (URL). It’s like giving your function its own doorbell that anyone with the URL can ring to wake it up and get it working.

Advantages of Function URLs

The introduction of Function URLs simplifies the process of invoking Lambda functions. Here are some of the key advantages:

  • Ease of Use: Setting up a Function URL is a breeze. You can do it right from the AWS console without the need for additional services or complex configurations.
  • Cost-Effective: Since you’re bypassing additional services like API Gateway, you’re also bypassing their costs. You only pay for the actual execution time of your Lambda function.
  • Direct Access: Third parties can trigger your Lambda function directly using the Function URL. This is particularly useful for webhooks, where an external service needs to notify your application of an event, like a new payment or a form submission.

Key Characteristics

Function URLs come with a set of characteristics that make them versatile:

  • Security: You can choose to protect your Function URL with AWS Identity and Access Management (IAM) or leave it open for public access, depending on your needs.
  • HTTP Methods: You can configure which HTTP methods (like GET or POST) are allowed, giving you control over how your function can be invoked.
  • CORS Support: Cross-Origin Resource Sharing (CORS) settings can be configured, allowing you to specify which domains can call your function, essential for web applications.

Webhooks Made Easy

Let’s consider a real-world scenario where a company uses a third-party service for payment processing. Every time a customer makes a payment, the service needs to notify the company’s application. This is a perfect job for a webhook.

Before Function URLs, the company would need to set up an API Gateway, configure the routes, and handle the security to receive these notifications. Now, with Function URLs, they can simply provide the payment service with the Function URL dedicated to their Lambda function. The payment service calls this URL whenever a payment is processed, triggering the Lambda function to update the application’s database and perhaps even send a confirmation email to the customer.
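
A minimal sketch of that setup, assuming the Lambda function already exists: the first call creates the Function URL with CORS limited to POST, and the second grants the resource-based permission a public, unauthenticated caller needs. The function name is a placeholder.

import boto3

lambda_client = boto3.client("lambda")

# 1. Give the function its own URL; AuthType NONE makes it publicly callable.
url_config = lambda_client.create_function_url_config(
    FunctionName="payment-webhook-handler",      # placeholder function name
    AuthType="NONE",
    Cors={"AllowOrigins": ["*"], "AllowMethods": ["POST"]},
)

# 2. Public Function URLs also need an explicit invoke permission.
lambda_client.add_permission(
    FunctionName="payment-webhook-handler",
    StatementId="AllowPublicFunctionUrlInvoke",
    Action="lambda:InvokeFunctionUrl",
    Principal="*",
    FunctionUrlAuthType="NONE",
)

# This is the endpoint you hand to the payment provider as the webhook target.
print(url_config["FunctionUrl"])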

This direct approach with Function URLs not only simplifies the entire process but also speeds it up and reduces costs, making it an attractive option for both developers and businesses.

Another scenario where Lambda Function URLs shine is in the development of single-function microservices. If you have a small, focused service that consists of a single Lambda function, using a Function URL can be a more lightweight and cost-effective approach compared to deploying a full-fledged API Gateway. This is especially true for internal services or utilities that don’t require the advanced features and customization options provided by API Gateway.

To sum up, AWS Lambda Function URLs are a major stride toward making serverless development less complicated. Whether you are handling webhooks, building single-function microservices, or just want to simplify your serverless architecture, Function URLs make it simple to expose your Lambda functions over HTTP. In many ways, this makes serverless applications even easier to build and more cost-effective.

Simplifying AWS Lambda. Understanding Reserved vs. Provisioned Concurrency

Let’s look at the world of AWS Lambda, a fantastic service from Amazon Web Services (AWS) that lets you run code without provisioning or managing servers. It’s like having a magic box where you put in your code, and AWS takes care of the rest. But, as with all magic boxes, understanding how to best use them can sometimes be a bit of a head-scratcher. Specifically, we’re going to unravel the mystery of Reserved Concurrency versus Provisioned Concurrency in AWS Lambda. Let’s break it down in simple terms.

What is AWS Lambda Concurrency?

Before we explore the differences, let’s understand what concurrency means in the context of AWS Lambda. Imagine you have a function that’s like a clerk at a store. When a customer (or in our case, a request) comes in, the clerk handles it. Concurrency in AWS Lambda is the number of clerks you have available to handle requests. If you have 100 requests and 100 clerks, each request gets its own clerk. If you have more requests than clerks, some requests must wait in line. AWS Lambda automatically scales the number of clerks (or instances of your function) based on the incoming request load, but there are ways to manage this scaling, which is where Reserved and Provisioned Concurrency come into play.

Reserved Concurrency

Reserved Concurrency is like reserving a certain number of clerks exclusively for your store. No matter how busy the mall gets, you are guaranteed that number of clerks. In AWS Lambda terms, it means setting aside a specific number of execution environments for your Lambda function. This ensures that your function has the necessary resources to run whenever it is triggered.

Pros:

  • Guaranteed Availability: Your function is always ready to run up to the reserved limit.
  • Control over Resource Allocation: It helps manage the distribution of concurrency across multiple functions in your account, preventing one function from hogging all the resources.

Cons:

  • Can Limit Scaling: If the demand exceeds the reserved concurrency, additional invocations are throttled.
  • Requires Planning: You need to estimate and set the right amount of reserved concurrency based on your application’s needs.

Provisioned Concurrency

Provisioned Concurrency goes a step further. It’s like not only having a certain number of clerks reserved for your store but also having them come in before the store opens, ready to greet the first customer the moment they walk in. This means that AWS Lambda prepares a specified number of execution environments for your function in advance, so they are ready to immediately respond to invocations. This is effectively putting your Lambda functions in “pre-warm” mode, significantly reducing the cold start latency and ensuring that your functions are ready to execute with minimal delay.

Pros:

  • Instant Scaling: Prepared execution environments mean your function can handle spikes in traffic from the get-go, without the cold start latency.
  • Predictable Performance: Ideal for applications requiring consistent response times, thanks to the “pre-warm” mode.
  • No Cold Start Latency: Functions are always ready to respond quickly, making this ideal for time-sensitive applications.

Cons:

  • Cost: You pay for the provisioned execution environments, whether they are used or not.
  • Management Overhead: Requires tuning and management to ensure cost-effectiveness and optimal performance.

E-Commerce Site During Black Friday Sales

Let’s put this into a real-world context. Imagine you run an e-commerce website that experiences a significant spike in traffic during Black Friday sales. To prepare for this, you might use Provisioned Concurrency for critical functions like checkout, ensuring they have zero cold start latency and can handle the surge in traffic. For less critical functions, like product recommendations, you might set a Reserved Concurrency limit to ensure they always have some capacity to run without affecting the critical checkout function.

This approach ensures that your website can handle the spike in traffic efficiently, providing a smooth experience for your customers and maximizing sales during the critical holiday period.
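
As a sketch of how those two settings might be applied with boto3 (the function names and the prod alias are placeholders):

import boto3

lambda_client = boto3.client("lambda")

# Critical path: keep 100 execution environments pre-warmed for checkout.
# Provisioned concurrency is set on a published version or alias, not $LATEST.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="checkout",
    Qualifier="prod",
    ProvisionedConcurrentExecutions=100,
)

# Less critical path: reserve (and cap) 50 concurrent executions for
# recommendations so it always has capacity but can never starve checkout.
lambda_client.put_function_concurrency(
    FunctionName="product-recommendations",
    ReservedConcurrentExecutions=50,
)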

Key Takeaways

Understanding and managing concurrency in AWS Lambda is crucial for optimizing performance and cost. Reserved Concurrency is about guaranteeing availability, while Provisioned Concurrency, with its “pre-warm” mode, is about ensuring immediate, predictable performance, eliminating cold start latency. Both have their place in a well-architected cloud environment. The key is to use them wisely, balancing cost against performance based on the specific needs of your application.

So, the next time you’re planning how to manage your AWS Lambda functions, think about what’s most important for your application and your users. The goal is to provide a seamless experience, whether you’re running an online store during the busiest shopping day of the year or simply keeping your blog’s contact form running smoothly.