
BigQuery learns to read between the lines

Keyword search is the friend who hears every word and misses the point. Vector search is the friend who nods, squints a little, and says, “You want a safe family SUV that will not make your wallet cry.” This story is about teaching BigQuery to be the second friend.

I wanted semantic search without renting another database, shipping nightly exports, or maintaining yet another dashboard only I remember to feed. The goal was simple and a little cheeky: keep the data in BigQuery, add embeddings with Vertex AI, create a vector index, and still use boring old SQL to filter by price and mileage. Results should read like good advice, not a word-count contest.

Below is a practical pattern that works well for catalogs, internal knowledge bases, and “please find me the thing I mean” situations. It is light on ceremony, honest about trade‑offs, and opinionated where it needs to be.

Why keyword search keeps missing the point

  • Humans ask for meanings, not tokens. “Family SUV that does not guzzle” is intent, not keywords.
  • Catalogs are messy. Price, mileage, features, and descriptions live in different columns and dialects.
  • Traditional search treats text like a bag of Scrabble tiles. Embeddings turn it into geometry where similar meanings sit near each other.

If you have ever typed “cheap laptop with decent battery” and received a gaming brick with neon lighting, you know the problem.

Keep data where it already lives

No new database. BigQuery already stores your rows, talks SQL, and now speaks vectors. The plan:

  1. Build a clean content string per row so the model has a story to understand.
  2. Generate embeddings in BigQuery via a remote Vertex AI model.
  3. Store those vectors in a table and, when it makes sense, add a vector index.
  4. Search with a natural‑language query embedding and filter with plain SQL.


Prepare a clean narrative for each row

Your model will eat whatever you feed it. Feed it something tidy. The goal is a single content field with labeled parts, so the embedding has clues.

-- Demo names and values are fictitious
CREATE OR REPLACE TABLE demo_cars.search_base AS
SELECT
  listing_id,
  make,
  model,
  year,
  price_usd,
  mileage_km,
  body_type,
  fuel,
  features,
  CONCAT(
    'make=', make, ' | ',
    'model=', model, ' | ',
    'year=', CAST(year AS STRING), ' | ',
    'price_usd=', CAST(price_usd AS STRING), ' | ',
    'mileage_km=', CAST(mileage_km AS STRING), ' | ',
    'body=', body_type, ' | ',
    'fuel=', fuel, ' | ',
    'features=', ARRAY_TO_STRING(features, ', ')
  ) AS content
FROM demo_cars.listings
WHERE status = 'active';

Housekeeping that pays off

  • Normalize units and spellings early. “20k km” is cute; 20000 is useful.
  • Keep labels short and consistent. Your future self will thank you.
  • Avoid stuffing everything. Noise in, noise out.

Turn text into vectors without hand waving

We will assume you have a BigQuery remote model that points to your Vertex AI text‑embedding endpoint. Choose a modern embedding model and be explicit about the task type: use RETRIEVAL_DOCUMENT for rows and RETRIEVAL_QUERY for user queries. That hint matters.

Embed the documents

-- Store document embeddings alongside your base table
CREATE OR REPLACE TABLE demo_cars.search_with_vec AS
SELECT
  listing_id,
  make, model, year, price_usd, mileage_km, body_type, fuel, features,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `demo.embed_text`,
  (SELECT * FROM demo_cars.search_base),
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);

ML.GENERATE_EMBEDDING reads the content column from the input query and returns one vector per row alongside the original columns, so a single statement embeds the whole table. If you prefer, materialize embeddings in a separate table and JOIN on listing_id to minimize churn.

Build an index when it helps and skip it when it does not

BigQuery can scan vectors without an index, which is fine for small tables and prototypes. For larger tables, add an IVF index with cosine distance.

-- Optional but recommended beyond a few thousand rows
CREATE VECTOR INDEX search_with_vec_idx
ON demo_cars.search_with_vec(embedding)
OPTIONS(
  distance_type = 'COSINE',
  index_type = 'IVF',
  ivf_options = '{"num_lists": 128}'
);

Rules of thumb

  • Start without an index for quick experiments. Add the index when latency or cost asks for it.
  • Tune num_lists only after measuring. Guessing is cardio for your CPU.

Ask in plain English, filter in plain SQL

Here is the heart of it. One short block that embeds the query, runs vector search, then applies filters your finance team actually understands.

-- Natural language wish
DECLARE user_query STRING DEFAULT 'family SUV with lane assist under 18000 USD';

WITH q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (SELECT user_query AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT
  s.base.listing_id, s.base.make, s.base.model, s.base.year,
  s.base.price_usd, s.base.mileage_km, s.base.body_type,
  s.distance
FROM VECTOR_SEARCH(
  TABLE demo_cars.search_with_vec, 'embedding',
  (SELECT qvec FROM q), query_column_to_search => 'qvec',
  top_k => 20, distance_type => 'COSINE'
) AS s
WHERE s.base.price_usd <= 18000
  AND s.base.body_type = 'SUV'
ORDER BY s.base.price_usd ASC;

This is the “hybrid search” pattern working shoulder to shoulder: semantics finds plausible candidates, SQL draws the hard lines. You get relevance and guardrails.

Measure quality and cost without a research grant

You do not need a PhD rubric, just a habit.

Relevance sanity check

  • Write five real queries from your users. Note how many good hits appear in the top ten. If it is fewer than six, look at your content field. It is almost always the content.

Latency

  • Time the query with and without the vector index. Keep an eye on top‑k and filters. If you filter out 90% of candidates, you can often keep top‑k low.

Cost

  • Avoid regenerating embeddings. Upserts should only touch changed rows (a sketch follows below). Schedule small nightly or hourly batches, not heroic full refreshes.
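
A minimal sketch of that incremental upsert, assuming the base table carries an updated_at timestamp (a column invented here for illustration; brand-new listings still arrive through the full load):

-- Hypothetical incremental refresh: only rows touched in the last day get fresh vectors
MERGE demo_cars.search_with_vec AS t
USING (
  SELECT listing_id, ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (
      SELECT listing_id, content
      FROM demo_cars.search_base
      WHERE updated_at > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)  -- assumed column
    ),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
  )
) AS src
ON t.listing_id = src.listing_id
WHEN MATCHED THEN
  UPDATE SET embedding = src.embedding;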

Where things wobble and how to steady them

Vague user queries

  • Add example phrasing in your product UI. Even two placeholders nudge users into better intent.

Sparse or noisy text

  • Enrich your content with compact labels and the two or three features people actually ask for. Resist the urge to dump raw logs.

Synonyms of the trade

  • Lightweight mapping helps. If your users say “lane keeping” and your data says “lane assist,” consider normalizing in content.

Region mismatches

  • Keep your dataset, remote connection, and model in compatible regions. Latency enjoys proximity. Downtime enjoys misconfigurations.

Run it day after day without drama

A few operational notes that keep the lights on

  • Track changes by listing_id and only re‑embed those rows.
  • Rebuild or refresh the index on a schedule that fits your churn. Weekly is plenty for most catalogs.
  • Keep one “golden query set” around for spot checks after schema or model changes.

Takeaways you can tape to your monitor

  • Keep data in BigQuery and add meaning with embeddings.
  • Build one tidy content string per row. Labels beat prose.
  • Use RETRIEVAL_DOCUMENT for rows and RETRIEVAL_QUERY for the user’s text.
  • Start without an index; add IVF with cosine when volume demands it.
  • Let vectors shortlist and let SQL make the final call.

Tiny bits you might want later

An alternative query that biases toward newer listings

DECLARE user_query STRING DEFAULT 'compact hybrid with good safety under 15000 USD';
WITH q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (SELECT user_query AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT
  s.base.listing_id, s.base.make, s.base.model, s.base.year, s.base.price_usd
FROM VECTOR_SEARCH(
  TABLE demo_cars.search_with_vec, 'embedding',
  (SELECT qvec FROM q), query_column_to_search => 'qvec',
  top_k => 15, distance_type => 'COSINE'
) AS s
WHERE s.base.price_usd <= 15000
ORDER BY s.base.year DESC, s.base.price_usd ASC
LIMIT 10;

Quick checklist before you ship

  • The remote model exists and is reachable from BigQuery.
  • Dataset and connection share a region you actually meant to use.
  • content strings are consistent and free of junk units.
  • Embeddings updated only for changed rows.
  • Vector index present on tables that need it and not on those that do not.

If keyword search is literal‑minded, this setup is the polite interpreter who knows what you meant, forgives your typos, and still respects the house rules. You keep your data in one place, you use one language to query it, and you get answers that feel like common sense rather than a thesaurus attack. That is the job.

Ingress and egress on EKS made understandable

Getting traffic in and out of a Kubernetes cluster isn’t a magic trick. It’s more like running the city’s most exclusive nightclub. It’s a world of logistics, velvet ropes, bouncers, and a few bureaucratic tollbooths on the way out. Once you figure out who’s working the front door and who’s stamping passports at the exit, the rest is just good manners.

Let’s take a quick tour of the establishment.

A ninety-second tour of the premises

There are really only two journeys you need to worry about in this club.

Getting In: A hopeful guest (the client) looks up the address (DNS), arrives at the front door, and is greeted by the head bouncer (Load Balancer). The bouncer checks the guest list and directs them to the right party room (Service), where they can finally meet up with their friend (the Pod).

Getting Out: One of our Pods needs to step out for some fresh air. It gets an escort from the building’s internal security (the Node’s ENI), follows the designated hallways (VPC routing), and is shown to the correct exit—be it the public taxi stand (NAT Gateway), a private car service (VPC Endpoint), or a connecting tunnel to another venue (Transit Gateway).

The secret sauce in EKS is that our Pods aren’t just faceless guests; the AWS VPC CNI gives them real VPC IP addresses. This means the building’s security rules, Security Groups, route tables, and NACLs aren’t just theoretical policies. They are the very real guards and locked doors that decide whether a packet’s journey ends in success or a silent, unceremonious death.

Getting past the velvet rope

In Kubernetes, Ingress is the set of rules that governs the front door. But rules on paper are useless without someone to enforce them. That someone is a controller, a piece of software that translates your guest list into actual, physical bouncers in AWS.

The head of security for EKS is the AWS Load Balancer Controller. You hand it an Ingress manifest, and it sets up the door staff.

  • For your standard HTTP web traffic, it deploys an Application Load Balancer (ALB). Think of the ALB as a meticulous, sharp-dressed bouncer who doesn’t just check your name. It inspects your entire invitation (the HTTP request), looks at the specific event you’re trying to attend (/login or /api/v1), and only then directs you to the right room.
  • For less chatty protocols like raw TCP, UDP, or when you need sheer, brute-force throughput, it calls in a Network Load Balancer (NLB). The NLB is the big, silent type. It checks that you have a ticket and shoves you toward the main hall. It’s incredibly fast but doesn’t get involved in the details.

This whole operation can be made public or private. For internal-only events, the controller sets up an internal ALB or NLB and uses a private Route 53 zone, hiding the party from the public internet entirely.

The modern VIP system

The classic Ingress system works, but it can feel a bit like managing your guest list with a stack of sticky notes. The rules for routing, TLS, and load balancer behavior are all crammed into a single resource, creating a glorious mess of annotations.

This is where the Gateway API comes in. It’s the successor to Ingress, designed by people who clearly got tired of deciphering annotation soup. Its genius lies in separating responsibilities.

  • The Platform team (the club owners) manages the Gateway. They decide where the entrances are, what protocols are allowed (HTTP, TCP), and handle the big-picture infrastructure like TLS certificates.
  • The Application teams (the party hosts) manage Routes (HTTPRoute, TCPRoute, etc.). They just point to an existing Gateway and define the rules for their specific application, like “send traffic for app.example.com/promo to my service.”

This creates a clean separation of duties, offers richer features for traffic management without resorting to custom annotations, and makes your setup far more portable across different environments.
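
Here’s a rough sketch of that split in YAML. The gatewayClassName, certificate secret, and service names are placeholders, and the exact class depends on which Gateway controller you run:

# Owned by the platform team: one shared front door
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public-web
spec:
  gatewayClassName: example-gateway-class   # depends on your controller
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: example-com-tls            # TLS secret the platform team manages
---
# Owned by an application team: routing rules for one app
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: promo-route
spec:
  parentRefs:
    - name: public-web
  hostnames:
    - app.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /promo
      backendRefs:
        - name: promo-service
          port: 8080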

The art of the graceful exit

So, your Pods are happily running inside the club. But what happens when they need to call an external API, pull an image, or talk to a database? They need to get out. This is egress, and it’s mostly about navigating the building’s corridors and exits.

  • The public taxi stand: For general internet access from private subnets, Pods are sent to a NAT Gateway. It works, but it’s like a single, expensive taxi stand for the whole neighborhood. Every trip costs money, and if it gets too busy, you’ll see it on your bill. Pro tip: Put one NAT in each Availability Zone to avoid paying extra for your Pods to take a cross-town cab just to get to the taxi stand.
  • The private car service: When your Pods need to talk to other AWS services (like S3, ECR, or Secrets Manager), sending them through the public internet is a waste of time and money. Use VPC endpoints instead. Think of this as a pre-booked black car service. It creates a private, secure tunnel directly from your VPC to the AWS service. It’s faster, cheaper, and the traffic never has to brave the public internet.
  • The diplomatic passport: The worst way to let Pods talk to AWS APIs is by attaching credentials to the node itself. That’s like giving every guest in the club a master key. Instead, we use IRSA (IAM Roles for Service Accounts). This elegantly binds an IAM role directly to a Pod’s service account. It’s the equivalent of issuing your Pod a diplomatic passport. It can present its credentials to AWS services with full authority, no shared keys required.

Setting the house rules

By default, Kubernetes networking operates with the cheerful, chaotic optimism of a free-for-all music festival. Every Pod can talk to every other Pod. In production, this is not a feature; it’s a liability. You need to establish some house rules.

Your two main tools for this are Security Groups and NetworkPolicy.

Security Groups are your Pod’s personal bodyguards. They are stateful and wrap around the Pod’s network interface, meticulously checking every incoming and outgoing connection against a list you define. They are an AWS-native tool and very precise.

NetworkPolicy, on the other hand, is the club’s internal security team. You need to hire a third-party firm like Calico or Cilium to enforce these rules in EKS, but once you do, you can create powerful rules like “Pods in the ‘database’ room can only accept connections from Pods in the ‘backend’ room on port 5432.”

The most sane approach is to start with a default deny policy. This is the bouncer’s universal motto: “If your name’s not on the list, you’re not getting in.” Block all egress by default, then explicitly allow only the connections your application truly needs.
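
Here’s what a default-deny starting point might look like, assuming a policy engine is in place; the namespace and policy names are illustrative, and you would layer explicit allow rules on top (including one for DNS, or nothing will resolve):

# House rule number one: nobody leaves unless a later policy says so
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: orders
spec:
  podSelector: {}      # applies to every Pod in the namespace
  policyTypes:
    - Egress
---
# Carve-out: Pods still need DNS to find anything at all
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: orders
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53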

A few recipes from the bartender

Full configurations are best kept in a Git repository, but here are a few cocktail recipes to show the key ingredients.

Recipe 1: Public HTTPS with a custom domain. This Ingress manifest tells the AWS Load Balancer Controller to set up a public-facing ALB, listen on port 443, use a specific TLS certificate from ACM, and route traffic for app.yourdomain.com to the webapp service.

# A modern Ingress for your web application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    # Set the bouncer to be public
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Talk to Pods directly for better performance
    alb.ingress.kubernetes.io/target-type: ip
    # Listen for secure traffic
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Here's the TLS certificate to wear
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id
spec:
  ingressClassName: alb
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 8080

Recipe 2: A diplomatic passport for S3 access. This gives our Pod a ServiceAccount annotated with an IAM role ARN. Any Pod that uses this service account can now talk to AWS APIs (like S3) with the permissions granted by that role, thanks to IRSA.

# The ServiceAccount with its IAM credentials
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  annotations:
    # This is the diplomatic passport: the ARN of the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EKS-S3-Reader-Role
---
# The Deployment that uses the passport
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
spec:
  replicas: 1
  selector:
    matchLabels: { app: reporter }
  template:
    metadata:
      labels: { app: reporter }
    spec:
      # Use the service account we defined above
      serviceAccountName: s3-reader-sa
      containers:
        - name: processor
          image: your-repo/report-generator:v1.5.2
          ports:
            - containerPort: 8080

A short closing worth remembering

When you boil it all down, Ingress is just the etiquette you enforce at the front door. Egress is the paperwork required for a clean exit. In EKS, the etiquette is defined by Kubernetes resources, while the paperwork is pure AWS networking. Neither one cares about your intentions unless you write them down clearly.

So, draw the path for traffic both ways, pick the right doors for the job, give your Pods a proper identity, and set the tolls where they make sense. If you do, the cluster will behave, the bill will behave, and your on-call shifts might just start tasting a lot more like sleep.

Your metrics are lying

It’s 3 AM. The pager screams, a digital banshee heralding doom. You stumble to your desk, eyes blurry, to find a Slack channel ablaze with panicked messages. The checkout service is broken. Customers are furious. Revenue is dropping.

You pull up the dashboards, your sacred scrolls of system health. Everything is… fine. P95 latency is a flat line of angelic calm. CPU usage is so low it might as well be on a tropical vacation. The error count is zero. According to your telemetry, the system is a picture of perfect health.

And yet, the world is on fire.

Welcome to the great lie of modern observability. We’ve become masters at measuring signals while remaining utterly clueless about the story they’re supposed to tell. This isn’t a guide about adding more charts to your dashboard collection. It’s about teaching your system to stop mumbling in arcane metrics and start speaking human. It’s about making it tell you the truth.

The seductive lie of the green dashboard

We were told to worship the “golden signals”: latency, traffic, errors, and saturation. They’re like a hospital patient’s vital signs. They can tell you if the patient is alive, but they can’t tell you why they’re miserable, what they argued about at dinner, or if they’re having an existential crisis.

Our systems are having existential crises all the time.

  • Latency lies when the real work is secretly handed off to a background queue. The user gets a quick “OK!” while their request languishes in a forgotten digital purgatory.
  • Traffic lies when a buggy client gets stuck in a retry loop, making it look like you’re suddenly the most popular app on the internet.
  • Errors lie when you only count the exceptions you had the foresight to catch, ignoring the vast, silent sea of things failing in ways you never imagined.

Golden signals are fine for checking if a server has a pulse. But they are completely useless for answering the questions that actually keep you up at night, like, “Why did the CEO’s demo fail five minutes before the big meeting?”

The truth serum: Semantic Observability

The antidote to this mess is what we’ll call semantic observability. It’s a fancy term for a simple idea: instrumenting the meaning of what your system is doing. It’s about capturing the plot, not just the setting.

Instead of just logging Request received, we record the business-meaningful story:

  • Domain events: The big plot points. UserSignedUp, CartAbandoned, InvoiceSettled, FeatureFlagEvaluated. These are the chapters of your user’s journey.
  • Intent assertions: What the system swore it would do. “I will try this payment gateway up to 3 times,” or “I promise to send this notification to the user’s phone.”
  • Outcome checks: The dramatic conclusion. Did the money actually move? Was the email really delivered? This is the difference between “I tried” and “I did.”

Let’s revisit our broken checkout service. Imagine a user is buying a book right after you’ve flipped on a new feature flag for a “revolutionary” payment path.

With classic observability, you see nothing. With semantic observability, you can ask your system questions like a detective interrogating a witness:

  • “Show me all the customers who tried to check out in the last 30 minutes but didn’t end up with a successful order.”
  • “Of those failures, how many had the new shiny-payment-path feature flag enabled?”
  • “Follow the trail for one of those failed orders. What was the last thing they intended to do, and what was the actual, tragic outcome?”

Notice we haven’t mentioned CPU once. We’re asking about plot, motive, and consequence.

Your detective’s toolkit (Minimal OTel patterns)

You don’t need a fancy new vendor to do this. You just need to use your existing OpenTelemetry tools with a bit more narrative flair.

  1. Teach your spans to gossip: Don’t just create a span; stuff its pockets with juicy details. Use span attributes to carry the context. Instead of just a request_id, add feature.flag.variant, customer.tier, and order.value. Make it tell you if this is a VIP customer buying a thousand-dollar item or a tire-kicker with a free-tier coupon.
  2. Mark the scene of the crime: Use events on spans to log key transitions. FraudCheckPassed, PaymentAuthorized, EnteringRetryLoop. These are the chalk outlines of your system’s behavior.
  3. Connect the dots: For asynchronous workflows (like that queue we mentioned), use span links to connect the cause to the effect. This builds a causal chain so you can see how a decision made seconds ago in one service led to a dumpster fire in another.

Rule of thumb: If a human is asking the question during an incident, a machine should be able to answer it with a single query.
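
Here’s roughly what those three moves look like with the plain OpenTelemetry JavaScript API; the attribute names and the producerContext handoff are illustrative, not a prescribed schema:

// A sketch with @opentelemetry/api: gossipy attributes, chalk-outline events, and a link
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service");

export async function processCheckout(order, producerContext) {
    // Link this span back to the producer that queued the work (async causality)
    const span = tracer.startSpan("process-checkout", {
        links: producerContext ? [{ context: producerContext }] : [],
    });

    // 1. Teach the span to gossip
    span.setAttribute("feature.flag.variant", order.paymentPathVariant);
    span.setAttribute("customer.tier", order.customerTier);
    span.setAttribute("order.value", order.totalUsd);

    // 2. Mark the scene of the crime
    span.addEvent("FraudCheckPassed");
    span.addEvent("PaymentAuthorized", { "payment.attempt": 1 });

    span.end();
}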

The case of intent vs. outcome

This is the most powerful trick in the book. Separate what your system meant to do from what actually happened.

  • The intent: At the start of a process, emit an event: NotificationIntent with details like target: email and deadline: t+5s.
  • The outcome: When (or if) it finishes, emit another: NotificationDelivered with latency: 2.5s and channel: email.

Now, your master query isn’t about averages. It’s about broken promises: “Show me all intents that don’t have a matching successful outcome within their SLA.”

Suddenly, your SLOs aren’t some abstract percentage. They are a direct measure of your system’s integrity: its intent satisfied rate.
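
If your intent and outcome events land somewhere queryable (a warehouse export of span events, say; the table and column names below are made up for the sketch), the broken-promise query is a plain anti-join:

-- Intents that never got a matching successful outcome within their deadline
SELECT i.intent_id, i.target, i.deadline
FROM telemetry.notification_intents AS i
LEFT JOIN telemetry.notification_outcomes AS o
  ON o.intent_id = i.intent_id
  AND o.status = 'DELIVERED'
  AND o.delivered_at <= i.deadline
WHERE o.intent_id IS NULL;

Divide the matched intents by the total and you have the intent satisfied rate from the paragraph above.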

Your first 30 days as a telemetry detective

Week 1: Pick a single case. Don’t boil the ocean. Focus on one critical user journey, like “User adds to cart -> Pays -> Order created.” List the 5-10 key “plot points” (domain events) and 3 “promises” (intent assertions) in that story.

Week 2: Plant the evidence. Go into your code and start enriching your existing traces. Add those gossipy attributes about feature flags and customer tiers. Add events. Link your queues.

Week 3: Build your “Why” query. Create the one query that would have saved you during the last outage. Something like, “Show me degraded checkouts, grouped by feature flag and customer cohort.” Put a link to it at the top of your on-call runbook.

Week 4: Close the loop. Define an SLO on your new “intent satisfied rate.” Watch it like a hawk. Review your storage costs and turn on tail-based sampling to keep the interesting stories (the errors, the weird edge cases) without paying to record every boring success story.
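
If you run an OpenTelemetry Collector, tail-based sampling is the tail_sampling processor from the contrib distribution; a minimal sketch, with policy names and percentages as placeholders:

# Collector config fragment: keep every error trace, sample the boring successes
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5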

Anti-Patterns to gently escort out the door

  • Dashboard worship: If your incident update includes a screenshot of a CPU graph, you owe everyone an apology. Show them the business impact, the cohort of affected users, the broken promise.
  • Logorrhea: The art of producing millions of lines of logs that say absolutely nothing. One good semantic event is worth a thousand INFO: process running logs.
  • Tag confetti: Using unbounded tags like user_id for everything, turning your observability bill into a piece of abstract art that costs more than a car.
  • Schrödinger’s feature flag: Shipping a new feature behind a flag but forgetting to record the flag’s decision in your telemetry. The flag is simultaneously the cause of and solution to all your problems, and you have no way of knowing which.

The moral of the story

Observability isn’t about flying blind without metrics. It’s about refusing to outsource your understanding of the system to a pile of meaningless averages.

Instrument intent. Record outcomes. Connect causes.

When your system can clearly explain what it tried to do and what actually happened, on-call stops feeling like hunting for ghosts in a haunted house and starts feeling like science. And you might even get a full night’s sleep.

Stop building cathedrals in Terraform

It’s 9 AM on a Tuesday. You, a reasonably caffeinated engineer, open a pull request to add a single tag to an S3 bucket. A one-line change. You run terraform plan and watch in horror as your screen scrolls with a novel’s worth of green, yellow, and red text. Two hundred and seventeen resources want to be updated.

Welcome to a special kind of archaeological dig. Somewhere, buried three folders deep, a “reusable” module you haven’t touched in six months has decided to redecorate your entire production environment. The brochure promised elegance and standards. The reality is a Tuesday spent doing debugging, cardio, and praying to the Git gods.

Small teams, in particular, fall into this trap. You don’t need to build a glorious cathedral of abstractions just to hang a picture on the wall. You need a hammer, a nail, and enough daylight to see what you’re doing.

The allure of the perfect system

Let’s be honest, custom Terraform modules are seductive. They whisper sweet nothings in your ear about the gospel of DRY (Don’t Repeat Yourself). They promise a future where every resource is a perfect, standardized snowflake, lovingly stamped out from a single, blessed template. It’s the engineering equivalent of having a perfectly organized spice rack where all the labels face forward.

In theory, it’s beautiful. In practice, for a small, fast-moving team, it’s a tax. A heavy one. An indirection tax.

What starts as a neat wrapper today becomes a Matryoshka doll of complexity by next quarter. Inputs multiply. Defaults are buried deeper than state secrets. Soon, flipping a single boolean in a variables.tf file feels like rewiring a nuclear submarine with the lights off. The module is no longer serving you; you are now its humble servant.

It’s like buying one of those hyper-specific kitchen gadgets, like a banana slicer. Yes, it slices bananas. Perfectly. But now you own a piece of plastic whose only job is to do something a knife you already owned could do just fine. That universal S3 module you built is the junk drawer of your infrastructure. Sure, it holds everything, but now you have to rummage past a broken can opener and three instruction manuals just to find a spoon.

A heuristic for staying sane

So, what’s the alternative? Anarchy? Copy-pasting HCL like a digital barbarian? Yes. Sort of.

Here’s a simple, sanity-preserving heuristic:

Duplicate once without shame. Duplicate twice with comments. On the third time, and only then, consider extracting a module.

Until you hit that third, clear, undeniable repetition of a pattern, plain HCL is your best friend. It wins on speed, clarity, and keeping the blast radius of any change predictably small. You avoid abstracting a solution before you even fully understand the problem.

Let’s see it in action. You need a simple, private S3 bucket for your new service.

The cathedral-builder’s approach might look like this:

# service-alpha/main.tf

module "service_alpha_bucket" {
  source = "git::ssh://git@github.com/your-org/terraform-modules.git//s3/private-bucket?ref=v1.4.2"

  bucket_name      = "service-alpha-data-logs-2025"
  enable_versioning = true
  force_destroy    = false # Safety first!
  lifecycle_days   = 90
  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

It looks clean, but what happens when you need to add a specific replication rule? Or a weird CORS policy for a one-off integration? You’re off to another repository to wage war with the module’s maintainer (who is probably you, from six months ago).

Now, the boring, sane, ship-it-today approach:

# service-alpha/main.tf

resource "aws_s3_bucket" "data_bucket" {
  bucket = "service-alpha-data-logs-2025"

  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

resource "aws_s3_bucket_versioning" "data_bucket_versioning" {
  bucket = aws_s3_bucket.data_bucket.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_bucket_lifecycle" {
  bucket = aws_s3_bucket.data_bucket.id

  rule {
    id     = "log-expiration"
    status = "Enabled"
    expiration {
      days = 90
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_bucket_access" {
  bucket                  = aws_s3_bucket.data_bucket.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Is it more lines of code? Yes. Is it gloriously, beautifully, and unapologetically obvious? Absolutely. You can read it, understand it, and change it in thirty seconds. No context switching. No spelunking through another codebase. Just a bucket, doing bucket things.

Where a module is not a swear word

Okay, I’m not a total monster. Modules have their place. They are the right tool when you are building the foundations, not the furniture.

A module earns its keep when it defines a stable, slow-moving, and genuinely complex pattern that you truly want to be identical everywhere. Think of it like the plumbing and electrical wiring of a house. You don’t reinvent it for every room.

Good candidates for a module include:

  • VPC and core networking: The highway system of your cloud. Build it once, build it well, and then leave it alone.
  • Kubernetes cluster baselines: The core EKS/GKE/AKS setup, IAM roles, and node group configurations.
  • Security and telemetry agents: The non-negotiable stuff that absolutely must run on every single instance.
  • IAM roles for CI/CD: A standardized way for your deployment pipeline to get the permissions it needs.

The key difference? These things change on a scale of months or years, not days or weeks.

Your escape plan from module purgatory

What if you’re reading this and nodding along in despair, already trapped in a gilded cage of your own abstractions? Don’t panic. There’s a way out, and it doesn’t require a six-month migration project.

  • Freeze everything: First, go to every service that uses the problematic module and pin the version number. ref=v1.4.2. No more floating on main. You’ve just stopped the bleeding.
  • Take inventory: In one service, run terraform state list to see the exact resources managed by the module.
  • Perform the adoption: This is the magic trick. Write the plain HCL code for those resources directly in your service’s configuration. Then, tell Terraform that the old resource (inside the module) and your new resource (the plain HCL) are actually the same thing. You do this with a moved block or the terraform state mv command (there’s a moved sketch after this list).

Let’s say your module created a bucket. The state address is module.service_alpha_bucket.aws_s3_bucket.this[0]. Your new plain resource is aws_s3_bucket.data_bucket.

You would run:

terraform state mv 'module.service_alpha_bucket.aws_s3_bucket.this[0]' aws_s3_bucket.data_bucket

  • Verify and obliterate: Run terraform plan. It should come back with “No changes. Your infrastructure matches the configuration.” The plan is clean. You are now free. Delete the module block, pop the champagne, and submit your PR. Repeat for other services, one at a time. No heroics.
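
If you’d rather keep the rename in code than run a one-off CLI command, a moved block (Terraform 1.1+) records the same mapping and gets reviewed with the PR:

# Tell Terraform the module-managed bucket and the new plain resource are the same object
moved {
  from = module.service_alpha_bucket.aws_s3_bucket.this[0]
  to   = aws_s3_bucket.data_bucket
}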

Fielding objections from the back row

When you propose this radical act of simplicity, someone will inevitably raise their hand.

  • “But we need standards!” You absolutely do. Standardize on things that matter: tags, naming conventions, and security policies. Enforce them with tools like tflint, checkov, and OPA/Gatekeeper. A linter yelling at you in a PR is infinitely better than a module silently deploying the wrong thing everywhere.
  • “What about junior developers? They need a paved road!” They do. A haunted mega-module with 50 input variables is not a paved road; it’s a labyrinth with a minotaur. A better “paved road” is a folder of well-documented, copy-pasteable examples of plain HCL for common tasks.
  • “Compliance will have questions!” Good. Let them. A tiny, focused, version-pinned module for your IAM boundary policy is a fantastic answer. A sprawling, do-everything wrapper module that changes every week is a compliance nightmare waiting to happen.

The gospel of ‘Good Enough’ for now

Stop trying to solve tomorrow’s problems today. That perfect, infinitely configurable abstraction you’re dreaming of is a solution in search of a problem you don’t have yet.

Don’t optimize for DRY. Optimize for change.

Small teams don’t need fewer lines of HCL; they need fewer places to look when something breaks at 3 PM on a Friday. They need clarity, not cleverness. Keep your power tools for the heavy-duty work. Save the cathedral for when you’ve actually founded a religion.

For now, ship the bucket, and go get lunch.

Avoiding serverless chaos with 3 essential Lambda patterns

Your first Lambda function was a thing of beauty. Simple, elegant, it did one job and did it well. Then came the second. And the tenth. Before you knew it, you weren’t running an application; you were presiding over a digital ant colony, with functions scurrying in every direction without a shred of supervision.

AWS Lambda, the magical service that lets us run code without thinking about servers, can quickly devolve into a chaotic mess of serverless spaghetti. Each function lives happily in its own isolated bubble, and when demand spikes, AWS kindly hands out more and more bubbles. The result? An anarchic party of concurrent executions.

But don’t despair. Before you consider a career change to alpaca farming, let’s introduce three seasoned wranglers who will bring order to your serverless circus. These are the architectural patterns that separate the rookies from the maestros in the art of building resilient, scalable systems.

Meet the micromanager boss

First up is a Lambda with a clipboard and very little patience. This is the Command Pattern function. Its job isn’t to do the heavy lifting—that’s what the interns are for. Its sole purpose is to act as the gatekeeper, the central brain that receives an order, scrutinizes it (request validation), consults its dusty rulebook (business logic), and then barks commands at its underlings to do the actual work.

It’s the perfect choice for workflows where bringing in AWS Step Functions would be like using a sledgehammer to crack a nut. It centralizes decision-making and maintains a crystal-clear separation between those who think and those who do.

When to hire this boss

  • For small to medium workflows that need a clear, single point of control.
  • When you need a bouncer at the door to enforce rules before letting anyone in.
  • If you appreciate a clean hierarchy: one boss, many workers.

A real-world scenario

An OrderProcessor Lambda receives a new order via API Gateway. It doesn’t trust anyone. It first validates the payload, saves a record to DynamoDB so it can’t get lost, and only then does it invoke other Lambdas: one to handle the payment, another to send a confirmation email, and a third to notify the shipping department. The boss orchestrates; the workers execute. Clean and effective.

Visually, it looks like a central hub directing traffic.

Here’s how that boss might delegate a task to the notifications worker:

// The Command Lambda (e.g., process-order-command)
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambdaClient = new LambdaClient({ region: "us-east-1" });

export const handler = async (event) => {
    const orderDetails = JSON.parse(event.body);

    // 1. Validate and save the order (your business logic here)
    console.log(`Processing order ${orderDetails.orderId}...`);
    // ... logic to save to DynamoDB ...

    // 2. Delegate to the notification worker
    const invokeParams = {
        FunctionName: 'arn:aws:lambda:us-east-1:123456789012:function:send-confirmation-email',
        InvocationType: 'Event', // Fire-and-forget
        Payload: JSON.stringify({
            orderId: orderDetails.orderId,
            customerEmail: orderDetails.customerEmail,
        }),
    };

    await lambdaClient.send(new InvokeCommand(invokeParams));

    return {
        statusCode: 202, // Accepted
        body: JSON.stringify({ message: "Order received and is being processed." }),
    };
};

The dark side of micromanagement

Be warned. This boss can become a bottleneck. If all decisions flow through one function, it can get overwhelmed. It also risks becoming a “God Object,” a monstrous function that knows too much and does too much, making it a nightmare to maintain and a single, terrifying point of failure.

Enter the patient courier

So, what happens when the micromanager gets ten thousand requests in one second? It chokes, your system grinds to a halt, and you get a frantic call from your boss. The Command Pattern’s weakness is its synchronous nature. We need a buffer. We need an intermediary.

This is where the Messaging Pattern comes in, embodying the art of asynchronous patience. Here, instead of talking directly, services drop messages into a queue or stream (like SQS, SNS, or Kinesis). A consumer Lambda then picks them up whenever it’s ready. This builds healthy boundaries between your services, absorbs sudden traffic bursts like a sponge, and ensures that if something goes wrong, the message can be retried.

When to call the courier

  • For bursty or unpredictable workloads that would otherwise overwhelm your system.
  • To isolate slow or unreliable third-party services from your main request path.
  • When you need to offload heavy tasks to be processed in the background.
  • If you need a guarantee that a task will be executed at least once, with a safety net (a Dead-Letter Queue) for messages that repeatedly fail.

A real-world scenario

A user clicks “Checkout.” Instead of processing everything right away, the API Lambda simply drops an OrderPlaced event into an SQS queue and immediately returns a success message to the user. On the other side, a ProcessOrderQueue Lambda consumes events from the queue at its own pace. It reserves inventory, charges the credit card, and sends notifications. If the payment service is down, SQS holds the message, and the Lambda tries again later. No lost orders, no frustrated users.

The flow decouples the producer from the consumer.

The producer just needs to drop the message and walk away:

// The Producer Lambda (e.g., checkout-api)
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqsClient = new SQSClient({ region: "us-east-1" });

export const handler = async (event) => {
    const orderDetails = JSON.parse(event.body);

    const command = new SendMessageCommand({
        QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/OrderProcessingQueue.fifo",
        MessageBody: JSON.stringify(orderDetails),
        MessageGroupId: orderDetails.orderId // Required for FIFO queues
    });

    await sqsClient.send(command);

    return {
        statusCode: 200,
        body: JSON.stringify({ message: "Your order is confirmed!" }),
    };
};
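
On the other side of the queue, the consumer can report partial batch failures so SQS retries only the messages that actually broke. The handler below is a hedged sketch of that idea, not the full ProcessOrderQueue, and it assumes the event source mapping has ReportBatchItemFailures enabled:

// The Consumer Lambda (e.g., process-order-queue), triggered by SQS
export const handler = async (event) => {
    const batchItemFailures = [];

    for (const record of event.Records) {
        try {
            const order = JSON.parse(record.body);
            console.log(`Fulfilling order ${order.orderId}...`);
            // ... reserve inventory, charge the card, send notifications ...
        } catch (err) {
            // Flag only this message; the rest of the batch is deleted as usual
            batchItemFailures.push({ itemIdentifier: record.messageId });
        }
    }

    return { batchItemFailures };
};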

The price of patience

This resilience isn’t free. The biggest trade-off is added latency; you’re introducing an extra step. It also makes end-to-end tracing more complex. Debugging a journey that spans across a queue can feel like trying to track a package with no tracking number.

Unleash the town crier

Sometimes, one piece of news needs to be told to everyone, all at once, without waiting for them to ask. You don’t want a single boss delegating one by one, nor a courier delivering individual letters. You need a proclamation.

The Fan-Out Pattern is your digital town crier. A single event is published to a central hub (typically an SNS topic or EventBridge), which then broadcasts it to any services that have subscribed. Each subscriber is a Lambda function that kicks into action in parallel, completely unaware of the others.

When to shout from the rooftops

  • When a single event needs to trigger multiple, independent downstream processes.
  • For building real-time, event-driven architectures where services react to changes.
  • In high-scale systems where parallel processing is a must.

A real-world scenario

An OrderPlaced event is published to an SNS topic. Instantly, this triggers multiple Lambdas in parallel: one to update inventory, another to send a confirmation email, and a third for the analytics pipeline. The beauty is that the publisher doesn’t know or care who is listening. You can add a fifth or sixth subscriber later without ever touching the original publishing code.

One event triggers many parallel actions.

The publisher’s job is delightfully simple:

// The Publisher Lambda (e.g., reservation-service)
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const snsClient = new SNSClient({ region: "us-east-1" });

export const handler = async (event) => {
    // ... logic to create a reservation ...
    const reservationDetails = {
        reservationId: "res-xyz-123",
        customerEmail: "jane.doe@example.com",
    };

    const command = new PublishCommand({
        TopicArn: "arn:aws:sns:us-east-1:123456789012:NewReservationsTopic",
        Message: JSON.stringify(reservationDetails),
        MessageAttributes: {
            'eventType': {
                DataType: 'String',
                StringValue: 'RESERVATION_CONFIRMED'
            }
        }
    });

    await snsClient.send(command);

    return { status: "SUCCESS", reservationId: reservationDetails.reservationId };
};
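
Because the publisher stamps an eventType attribute on every message, each subscriber can opt into only the proclamations it cares about by attaching a subscription filter policy, something like:

{
  "eventType": ["RESERVATION_CONFIRMED"]
}

The email Lambda subscribes with that policy, other subscribers filter on different values, and the publisher never has to change.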

The dangers of a loud voice

With great power comes a great potential for a massive, distributed failure. A single poison-pill event could trigger dozens of Lambdas, each failing and retrying, leading to an invocation storm and a bill that will make your eyes water. Careful monitoring and robust error handling in each subscriber are non-negotiable.

Choosing your champions

There you have it: the Micromanager, the Courier, and the Town Crier. Three patterns that form the bedrock of almost any serverless architecture worth its salt.

  • Use the Command Pattern when you need a firm hand on the tiller.
  • Adopt the Messaging Pattern to give your services breathing room and resilience.
  • Leverage the Fan-Out Pattern when one event needs to efficiently kickstart a flurry of activity.

The real magic begins when you combine them. But for now, start seeing your Lambdas not as a chaotic mob of individual functions, but as a team of specialists. With a little architectural guidance, they can build systems that are complex, resilient, and, best of all, cause you far fewer operational headaches.

Serverless without the wait

I once bought a five-minute rice cooker that spent four of those minutes warming up with a pathetic hum. It delivered the goods, eventually, but the promise felt… deceptive. For years, AWS Lambda felt like that gadget. It was the perfect kitchen tool for the odd jobs: a bit of glue code here, a light API there. It was the brilliant, quick-fire microwave of our architecture.

Then our little kitchen grew into a full-blown restaurant. Our “hot path”, the user checkout process, became the star dish on our menu. And our diners, quite rightly, expected it to be served hot and fast every time, not after a polite pause while the oven preheated. That polite pause was our cold start, and it was starting to leave a bad taste.

This isn’t a story about how we fell out of love with Lambda. We still adore it. This is the story of how we moved our main course to an industrial-grade, always-on stove. It’s about what we learned by obsessively timing every step of the process and why we still keep that trusty microwave around for the side dishes it cooks so perfectly. Because when your p95 latency needs to be boringly predictable, keeping the kitchen warm isn’t a preference; it’s a law of physics.

What forced us to remodel the kitchen

No single event pushed us over the edge. It was more of a slow-boiling frog situation, a gradual realization that our ambitions were outgrowing our tools. Three culprits conspired against our sub-300ms dream.

First, our traffic got moody. What used to be a predictable tide of requests evolved into sudden, sharp tsunamis during business hours. We needed a sea wall, not a bucket.

Second, our user expectations tightened. We set a rather tyrannical goal of a sub-300ms p95 for our checkout and search paths. Suddenly, the hundreds of milliseconds Lambda spent stretching and yawning before its first cup of coffee became a debt we couldn’t afford.

Finally, our engineers were getting tired. We found ourselves spending more time performing sacred rituals to appease the cold start gods (fiddling with layers, juggling provisioned concurrency) than we did shipping features our users actually cared about. When your mechanics spend more time warming up the engine than driving the car, you know something’s wrong.

The punchline isn’t that Lambda is “bad.” It’s that our requirements changed. When your performance target drops below the cost of a cold start plus dependency initialization, physics sends you a sternly worded letter.

Numbers don’t lie, but anecdotes do

We don’t ask you to trust our feelings. We ask you to trust the stopwatch. Replicate this experiment, adjust it for your own tech stack, and let the data do the talking. The setup below is what we used to get our own facts straight. All results are our measurements as of September 2025.

The test shape

  • Endpoint: Returns a simple 1 KB JSON payload.
  • Comparable Compute: Lambda set to 512 MB vs. an ECS Fargate container task with 0.5 vCPU and 1 GB of memory.
  • Load Profile: A steady, closed-loop 100 requests per second (RPS) for 10 minutes.
  • Metrics Reported: p50, p90, p95, p99 latency, and the dreaded error rate.

Our trusty tools

  • Load Generator: The ever-reliable k6.
  • Metrics: A cocktail of CloudWatch and Prometheus.
  • Dashboards: Grafana, to make the pretty charts that managers love.

Your numbers will be different. That’s the entire point. Run the tests, get your own data, and then make a decision based on evidence, not a blog post (not even this one).

Where our favorite gadget struggled

Under the harsh lights of our benchmark, Lambda’s quirks on our hot path became impossible to ignore.

  • Cold start spikes: Provisioned Concurrency can tame these, but it’s like hiring a full-time chauffeur to avoid a random 10-minute wait for a taxi. It costs you a constant fee, and during a real rush hour, you might still get stuck in traffic.
  • The startup toll: Initializing SDKs and warming up connections added tens to hundreds of milliseconds. This “entry fee” was simply too high to hide under our 300ms p95 goal.
  • The debugging labyrinth: Iterating was slow. Local emulators helped, but parity was a myth that occasionally bit us. Debugging felt like detective work with half the clues missing.

Lambda continues to be a genius for event glue, sporadic jobs, and edge logic. It just stopped being the right tool to serve our restaurant’s most popular dish at rush hour.

Calling in the heavy artillery

We moved our high-traffic endpoints to container-native services. For us, that meant ECS on Fargate fronted by an Application Load Balancer (ALB). The core idea is simple: keep a few processes warm and ready at all times.

Here’s why it immediately helped:

  • Warm processes: No more cold start roulette. Our application was always awake, connection pools were alive, and everything was ready to go instantly.
  • Standardized packaging: We traded ZIP files for standard Docker images. What we built and tested on our laptops was, byte for byte, what we shipped to production.
  • Civilized debugging: We could run the exact same image locally and attach a real debugger. It was like going from candlelight to a floodlight.
  • Smarter scaling: We could maintain a small cadre of warm tasks as a baseline and then scale out aggressively during peaks.

A quick tale of the tape

Here’s a simplified look at how the two approaches stacked up for our specific needs.

Our surprisingly fast migration plan

We did this in days, not weeks. The key was to be pragmatic, not perfect.

1. Pick your battles: We chose our top three most impactful endpoints with the worst p95 latency.

2. Put it in a box: We converted the function handler into a tiny web service. It’s less dramatic than it sounds.

# Dockerfile (Node.js example)
FROM node:22-slim
WORKDIR /usr/src/app

COPY package*.json ./
RUN npm ci --only=production

COPY . .

ENV NODE_ENV=production PORT=3000
EXPOSE 3000
CMD [ "node", "server.js" ]
// server.js
const http = require('http');
const port = process.env.PORT || 3000;

const server = http.createServer((req, res) => {
  if (req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    return res.end('ok');
  }

  // Your actual business logic would live here
  const body = JSON.stringify({ success: true, timestamp: Date.now() });
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(body);
});

server.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});

3. Set up the traffic cop: We wrote the Fargate task definition below, created a new target group for the service, and pointed a rule on our Application Load Balancer at it.

{
  "family": "payment-api",
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "requiresCompatibilities": ["FARGATE"],
  "executionRoleArn": "arn:aws:iam::987654321098:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::987654321098:role/paymentTaskRole",
  "containerDefinitions": [
    {
      "name": "app-container",
      "image": "[987654321098.dkr.ecr.us-east-1.amazonaws.com/payment-api:2.1.0](https://987654321098.dkr.ecr.us-east-1.amazonaws.com/payment-api:2.1.0)",
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "environment": [{ "name": "NODE_ENV", "value": "production" }]
    }
  ]
}
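
The target group and listener rule from step 3 looked roughly like this in Terraform; the names, VPC reference, listener ARN, and path pattern are placeholders for your own setup:

resource "aws_lb_target_group" "payment_api" {
  name        = "payment-api-tg"
  port        = 3000
  protocol    = "HTTP"
  target_type = "ip"              # Fargate tasks register by IP
  vpc_id      = var.vpc_id

  health_check {
    path    = "/health"           # matches the /health route in server.js
    matcher = "200"
  }
}

resource "aws_lb_listener_rule" "payment_api" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 10

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.payment_api.arn
  }

  condition {
    path_pattern {
      values = ["/checkout/*"]
    }
  }
}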

4. The canary in the coal mine: We used weighted routing to dip our toes in the water. We started by sending just 5% of traffic to the new container service.

# Terraform Route 53 weighted canary
resource "aws_route53_record" "api_primary_lambda" {
  zone_id = var.zone_id
  name    = "api.yourapp.com"
  type    = "A"

  alias {
    name                   = aws_api_gateway_domain_name.main.cloudfront_domain_name
    zone_id                = aws_api_gateway_domain_name.main.cloudfront_zone_id
    evaluate_target_health = true
  }

  set_identifier = "primary-lambda-path"
  weight         = 95
}

resource "aws_route53_record" "api_canary_container" {
  zone_id = var.zone_id
  name    = "api.yourapp.com"
  type    = "A"

  alias {
    name                   = aws_lb.main_alb.dns_name
    zone_id                = aws_lb.main_alb.zone_id
    evaluate_target_health = true
  }

  set_identifier = "canary-container-path"
  weight         = 5
}

5. Stare at the graphs: For one hour, we watched four numbers like hawks: p95 latency, error rates, CPU/memory headroom on the new service, and our estimated cost per million requests.

6. Go all in (or run away): The graphs stayed beautifully, boringly flat. So we shifted to 50%, then 100%. The whole affair was done in an afternoon.

The benchmark kit you can steal

Don’t just read about it. Run a quick test yourself.

// k6 script (save as test.js)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 100,
  duration: '5m',
  thresholds: {
    'http_req_duration': ['p(95)<250'], // Aim for a 250ms p95
    'checks': ['rate>0.999'],
  },
};

export default function () {
  const url = __ENV.TARGET_URL || 'https://api.yourapp.com/checkout/v2/quote';
  const res = http.get(url);
  check(res, { 'status is 200': r => r.status === 200 });
  sleep(0.2); // Small pause between requests
}

Run it from your terminal like this:

k6 run -e TARGET_URL=https://your-canary-endpoint.com test.js

Our results for context

These aren’t universal truths; they are snapshots of our world. Your mileage will vary.

The latency and reliability numbers are what kept us up at night and what finally let us sleep. For our steady traffic, the always-on container was not only faster and more reliable, but it was also shaping up to be cheaper.

Lambda is still in our toolbox

We didn’t throw the microwave out. We just stopped using it to cook the Thanksgiving turkey. Here’s where we still reach for Lambda without a second thought:

  • Sporadic or bursty workloads: Those once-a-day reports or rare event handlers are perfect for scale-to-zero.
  • Event glue: It’s the undisputed champion of transforming S3 puts, reacting to DynamoDB streams, and wiring up EventBridge.
  • Edge logic: For tiny header manipulations or rewrites, Lambda@Edge and CloudFront Functions are magnificent.

Lambda didn’t fail us. We outgrew its default behavior for a very specific, high-stakes workload. We cheated physics by keeping our processes warm, and in return, our p95 stopped stretching like hot taffy.

If your latency targets and traffic shape look anything like ours, please steal our tiny benchmark kit. Run a one-day canary. See what the numbers tell you. The goal isn’t to declare one tool a winner, but to spend less time arguing with physics and more time building things that people love.

The silent bill killers lurking in your Terraform state

The first time I heard the term “sustainability smell,” I rolled my eyes. It sounded like a fluffy marketing phrase dreamed up to make cloud infrastructure sound as wholesome as a farmers’ market. Eco-friendly Terraform? Right. Next, you’ll tell me my data center is powered by happy thoughts and unicorn tears.

But then it clicked. The term wasn’t about planting trees with every terraform apply. It was about that weird feeling you get when you open a legacy repository. It’s the code equivalent of opening a Tupperware container you found in the back of the fridge. You don’t know what’s inside, but you’re pretty sure it’s going to be unpleasant.

Turns out, I’d been smelling these things for years without knowing what to call them. According to HashiCorp’s 2024 survey, a staggering 70% of infrastructure teams admit to over-provisioning resources. It seems we’re all building mansions for guests who never arrive. That, my friend, is the smell. It’s the scent of money quietly burning in the background.

What exactly is that funny smell in my code

A “sustainability smell” isn’t a bug. It won’t trigger a PagerDuty alert at 3 AM. It’s far more insidious. It’s a bad habit baked into your Terraform configuration that silently drains your budget and makes future maintenance a soul-crushing exercise in digital archaeology.

The most common offender is the legendary main.tf file that looks more like an epic novel. You know the one. It’s a sprawling, thousand-line behemoth where VPCs, subnets, ECS clusters, IAM roles, and that one S3 bucket from a forgotten 2021 proof-of-concept all live together in chaotic harmony. Trying to change one small thing in that file is like playing Jenga with a live grenade. You pull out one block, and suddenly three unrelated services start weeping.

I’ve stumbled through enough of these digital haunted houses to recognize the usual ghosts:

  • The over-provisioned powerhouse: An RDS instance with enough horsepower to manage the entire New York Stock Exchange, currently tasked with serving a blog that gets about ten visits a month. Most of them are from the author’s mom.
  • The zombie load balancer: Left behind after a one-off traffic spike, it now spends its days blissfully idle, forwarding zero traffic but diligently charging your account for the privilege of existing.
  • Hardcoded horrors: Instance sizes and IP addresses sprinkled directly into the code like cheap confetti. Need to scale? Good luck. You’ll be hunting down those values for the rest of the week.
  • The phantom snapshot: That old EBS snapshot you swore you deleted. It’s still there, lurking in the dark corners of your AWS account, accumulating charges with the quiet persistence of a glacier.

The silent killers that sink your budget

Let’s be honest, no one’s idea of a perfect Friday afternoon involves becoming a private investigator whose only client is a rogue t3.2xlarge instance that went on a very expensive vacation without permission. It’s tempting to just ignore it. It’s just one instance, right?

Wrong. These smells are the termites of your cloud budget. You don’t notice them individually, but they are silently chewing through your financial foundations. That “tiny” overcharge joins forces with its zombie friends, and suddenly your bill isn’t just creeping up; it’s sprinting.

But the real horror is for the next person who inherits your repo. They were promised the Terraform dream: a predictable, elegant blueprint. Instead, they get a haunted house. Every terraform apply becomes a jump scare, a game of Russian roulette where they pray they don’t awaken some ancient, costly beast.

Becoming a cloud cost detective

So, how do you hunt these ghosts? Tools like Checkov, tfsec, and terrascan are your trusty guard dogs: they’ll bark if you leave the front door wide open, but they won’t notice that you’re paying the mortgage on a ten-bedroom mansion when you only live in the garage. For that, you need to do some old-fashioned detective work.

My ghost-hunting toolkit is simple:

  1. Cross-reference with reality: Check your declared instance sizes against their actual usage in CloudWatch. If your CPU utilization has been sitting at a Zen-like 2% for the past six months, you have a prime suspect (there’s an alarm sketch for exactly this right after the list).
  2. Befriend the terraform plan command: Run it often. Run it before you even think about changing code. Treat it like a paranoid glance over your shoulder. It’s your best defense against unintended consequences.
  3. Dig for treasure in AWS Cost Explorer: This is where the bodies are buried. Filter by service, by tag (you are tagging everything, right?), and look for the quiet, consistent charges. That weird $30 “other” charge that shows up every month? I’ve been ambushed by forgotten Route 53 hosted zones more times than I care to admit.
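
Item one can even be automated so the suspect walks into your office. Here’s a minimal sketch of a CloudWatch alarm that pings the same SNS topic as the budget in the next section whenever an instance averages under 5% CPU for a full day; the instance ID is a placeholder, and the threshold is whatever “suspiciously idle” means in your shop.

resource "aws_cloudwatch_metric_alarm" "idle_suspect" {
  alarm_name          = "ec2-idle-suspect"
  alarm_description   = "This instance has barely broken a sweat all day."
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "LessThanThreshold"
  threshold           = 5
  period              = 3600 # one-hour datapoints
  evaluation_periods  = 24   # a full day of them
  alarm_actions       = [aws_sns_topic.budget_alerts.arn]

  dimensions = {
    InstanceId = "i-0123456789abcdef0" # placeholder: your prime suspect
  }
}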

Your detective gadgets

Putting your budget directly into your code is a power move. It’s like putting a security guard inside the bank vault.

Here’s an aws_budgets_budget resource that will scream at you via SNS if you start spending too frivolously on your EC2 instances.

resource "aws_budgets_budget" "ec2_spending_cap" {
  name         = "budget-ec2-monthly-limit"
  budget_type  = "COST"
  limit_amount = "250.0"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Compute Cloud - Compute"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
  }
}

resource "aws_sns_topic" "budget_alerts" {
  name = "budget-alert-topic"
}

And for storage that quietly piles up? Perform an exorcism with lifecycle rules. This little block of code tells S3 to act like a self-cleaning oven (for the phantom EBS snapshots, Amazon Data Lifecycle Manager plays the same role).

resource "aws_s3_bucket" "log_archive" {
  bucket = "my-app-log-archive-bucket"

  lifecycle_rule {
    id      = "log-retention-policy"
    enabled = true

    # Move older logs to a cheaper storage class
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    # And then get rid of them entirely after a year
    expiration {
      days = 365
    }
  }
}

An exorcist’s guide to cleaner code

You can’t eliminate smells forever, but you can definitely keep them from taking over your house. There’s no magic spell, just a few simple rituals:

  1. Embrace modularity: Stop building monoliths. Break your infrastructure into smaller, logical modules. It’s the difference between remodeling one room and having to rebuild the entire house just to change a light fixture.
  2. Variables are your friends: Hardcoding an instance size is a crime against your future self. Use variables. It’s a tiny effort now that saves you a world of pain later (there’s a short sketch of rituals one through three right after this list).
  3. Tag everything. No, really: Tagging feels like a chore, but it’s a lifesaver. When you’re hunting for the source of a mysterious charge, a good tagging strategy is your map and compass. Tag by project, by team, by owner, heck, tag it with your favorite sandwich. Just tag it.
  4. Schedule a cleanup day: If it’s not on the calendar, it doesn’t exist. Dedicate a few hours every quarter to go ghost-hunting. Review idle resources, question oversized instances, and delete anything that looks dusty.
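
To make rituals one through three concrete, here’s a minimal sketch: one focused module call, a variable instead of a hardcoded size, and provider-level default tags. The module path, variables, and tag values are hypothetical, so adapt the names to your own repo.

variable "app_instance_type" {
  description = "Instance size for the app tier, no hardcoding allowed"
  type        = string
  default     = "t3.small"
}

variable "app_ami_id" {
  description = "AMI for the app tier"
  type        = string
}

provider "aws" {
  region = "us-east-1"

  # Ritual three: every resource this provider creates gets tagged
  default_tags {
    tags = {
      project = "storefront"
      team    = "platform"
      owner   = "platform-team@example.com"
    }
  }
}

# Ritual one: a small, focused module instead of a thousand-line main.tf
module "networking" {
  source     = "./modules/networking"
  cidr_block = "10.20.0.0/16"
}

# Ritual two: the size comes from a variable, not a hardcoded string
resource "aws_instance" "app" {
  ami           = var.app_ami_id
  instance_type = var.app_instance_type
}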

Your Terraform code is the blueprint for your infrastructure. And just like a real blueprint, any coffee stains, scribbled-out notes, or vague “we’ll figure this out later” sections get built directly into the final structure. If the plan calls for gold-plated plumbing in a closet that will never be used, that’s exactly what you’ll get. And you’ll pay for it. Every single month.

These smells aren’t the spectacular, three-alarm fires that get everyone’s attention. They’re the slow, silent drips from a faucet in the basement. It’s just a dollar here for a phantom snapshot, five dollars there for an oversized instance. It’s nothing, right? But leave those drips unchecked long enough, and you don’t just get a high water bill. You come back to find you’ve cultivated a thriving mold colony and the floorboards are suspiciously soft.

Ultimately, a clean repository isn’t just about being tidy. It’s about financial hygiene. So go on, open up that old repo. Be brave. The initial smell might be unpleasant, but it’s far better than the stench of a budget that has mysteriously evaporated into thin air.

The ugly truth about SRE dashboards

Every engineer loves a good dashboard. The vibrant graphs, the neat panels, the comforting glow of a wall of green lights. It’s the digital equivalent of a clean garage; it feels productive, organized, and ready for anything.

But let’s be honest: your dashboards are probably lying to you. They’re like a well-intentioned friend who tells you everything’s fine when you’ve got a smudge of chocolate on your nose and a bird nesting in your hair. They show you the surface, but hide the messy, inconvenient truth.

I learned this the hard way, at 2 a.m., as all the best lessons are learned. We were on-call when production latency went absolutely bonkers. I stared at four massive dashboards, each with a dozen panels of metrics swirling on my screen: CPU, memory, queue depth, disk I/O, HPA stats, all the usual suspects. I was a detective with a thousand clues but no insights, scrolling through what felt like a colorful, confusing kaleidoscope.

An hour of this high-octane confusion later, we discovered the culprit: a single, rogue DNS misconfiguration in a downstream service. The dashboards, those beautiful, useless liars, had all been glowing green.

This isn’t just bad luck. It’s a design flaw.

Designed for reports, not for war

Most dashboards are built for managers who need to glance at high-level metrics during a meeting, not for engineers trying to solve a full-blown crisis. We obsess over shiny vanity metrics like request counts and 99th percentile latency, while the real demons, the retry storms and misbehaving clients, hide in the shadows.

Think of it like this: your dashboard is a doctor who only checks your height and weight. You might look great on paper, but your appendix could be about to explode. The surface looks fine, but the guts are in chaos.

The graveyard of abandoned dashboards

Have you ever wondered where old dashboards go to die? The answer is: nowhere. They simply get abandoned, like a pet you can no longer care for. Metrics get deprecated, panels start showing N/A, and alerts get muted permanently. They become relics of a bygone era, cluttering your screens with useless data and false promises. It’s the digital equivalent of that one junk drawer in your kitchen; it feels organized at a glance, but you know deep down it’s a monument to things you’ll never use again.

Too much signal, too much noise

Adding more panels doesn’t automatically give you better visibility. At scale, dashboards become a cacophony of white noise. You spend 30 minutes scanning, 5 minutes guessing, and 10 minutes restarting pods just to see if the blinking stops. That’s not observability; that’s panic dressed up as process.

Imagine trying to find your house key on a keychain with 500 different keys on it. You can see all of them, but you can’t find the one you need when you’re standing in the rain.

So, how do you fix it? You stop making art and start getting answers.

From metrics to methods

We stopped dumping metrics onto giant boards and created what we called “Runbooks with Graphs.” Instead of a hundred metrics per service, we had a handful per failure mode. It’s a fundamental shift in perspective.

Here’s an example of what that looked like:

failure_mode: API_response_slowdown
title: "API Latency Exceeding SLO"
hypothesis: "Is the database overloaded?"
metrics:
  - name: "database_connections_count"
    query: "sum(database_connections_total)"
  - name: "database_query_latency_p99"
    query: "histogram_quantile(0.99, rate(database_query_latency_seconds_bucket[5m]))"
runbook_link: "https://your-wiki.com/runbooks/api_latency_troubleshooting"

This simple shift grouped our metrics by the why, not just the what.

Slaying alert fatigue

We took a good, hard look at our alerts and deleted 40% of them. Then, we rebuilt them from the ground up, basing them on symptoms, not raw metrics. This meant getting rid of things like this:

# BEFORE: A useless alert
- alert: HighCPULoad
  expr: avg by (instance) (cpu_usage_rate) > 0.8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU on instance {{ $labels.instance }}"

And replacing it with something like this:

# AFTER: A meaningful, symptom-based alert
- alert: CustomerFacingSLOViolation
  expr: sum(rate(http_requests_total{status_code!~"2.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Too many failed API requests - SLO violated"
    description: "The percentage of failed requests is over 10%."

Suddenly, the team trusted the alerts again. When the pager went off, it actually meant something was wrong for the customers, not just a server having a bad day.

Blackhole checks and truth bombs

If dashboards can lie, you need tools that don’t. We added synthetic tests and end-to-end user simulations. These act like a secret shopper for your service, proving something is broken, whether your metrics look good or not.

Here’s a simple example of a synthetic check:

const axios = require('axios');
async function checkAPIMetrics() {
  try {
    const response = await axios.get('https://api.yourcompany.com/v1/health');
    if (response.status !== 200) {
      throw new Error(`Health check failed with status: ${response.status}`);
    }
    console.log('API is healthy.');
  } catch (error) {
    console.error('API health check failed:', error.message);
    // Send alert to PagerDuty or Slack
  }
}
checkAPIMetrics();

Your internal metrics may say “OK,” but a synthetic user never lies about the customer’s experience.

The hard truth

Dashboards don’t solve outages. People do. They’re useful, but only if they’re maintained, contextual, and grounded in real-world operations. If your dashboards don’t reflect how failures actually unfold, they’re not observability, they’re art. And in the middle of a P1 incident, you don’t need art. You need answers.

This is the part where I’m supposed to give you a tidy, inspirational conclusion. Something about how we can all be better, more vigilant SREs. But let’s be realistic. The truth is, the world is full of dashboards that are just digital wallpaper, beautiful to look at, utterly useless in a crisis. They’re a collective delusion that makes us feel like we have everything under control, when in reality, we’re just scrolling through colorful confusion, hoping something will catch our eye.

So, before you build another massive, 50-panel dashboard, stop and ask yourself: is this going to help me at 2 a.m., with my coffee pot empty and a panic-stricken developer on the other end of the line? Or is it just another pretty lie to add to the collection?

How many of your dashboards are truly battle-ready? And which ones are just decorative?

127.0.0.1 and its 16 million invisible roommates

Let’s be honest. You’ve typed 127.0.0.1 more times than you’ve called your own mother. We treat it like the sole, heroic occupant of the digital island we call localhost. It’s the only phone number we know by heart, the only doorbell we ever ring.

Well, brace yourself for a revelation that will fundamentally alter your relationship with your machine. 127.0.0.1 is not alone. In fact, it lives in a sprawling, chaotic metropolis with over 16 million other addresses, all of them squatting inside your computer, rent-free.

Ignoring these neighbors condemns you to a life of avoidable port conflicts and flimsy localhost tricks. But give them a chance, and you’ll unlock cleaner dev setups, safer tests, and fewer of those classic “Why is my test API saying hello to the entire office Wi-Fi?” moments of sheer panic.

So buckle up. We’re about to take the scenic tour of the neighborhood that the textbooks conveniently forgot to mention.

Your computer is secretly a megacity

The early architects of the internet, in their infinite wisdom, set aside the entire 127.0.0.0/8 block of addresses for this internal monologue. That’s 16,777,214 usable addresses, from 127.0.0.1 all the way to 127.255.255.254. Every single one of them is designed to do one thing: loop right back to your machine. It’s the ultimate homebody network.

Think of your computer not as a single-family home with one front door, but as a gigantic apartment building with millions of mailboxes. And for years, you’ve been stubbornly sending all your mail to apartment #1.

Most operating systems only bother to introduce you to 127.0.0.1, but the kernel knows the truth. It treats any address in the 127.x.y.z range as a VIP guest with an all-access pass back to itself. This gives you a private, internal playground for wiring up your applications.

A handy rule of thumb? Any address starting with 127 is your friend. 127.0.0.2, 127.10.20.30, even 127.1.1.1, they all lead home.

Everyday magic tricks with your newfound neighbors

Once you realize you have a whole city at your disposal, you can stop playing port Tetris. Here are a few party tricks your localhost never told you it could do.

The art of peaceful coexistence

We’ve all been there. It’s 2 AM, and two of your microservices are having a passive-aggressive standoff over port 8080. They both want it, and neither will budge. You could start juggling ports like a circus performer, or you could give them each their own house.

Assign each service its own loopback address. Now they can both listen on port 8080 without throwing a digital tantrum.

First, give your new addresses some memorable names in your /etc/hosts file (or C:\Windows\System32\drivers\etc\hosts on Windows).

# /etc/hosts

127.0.0.1       localhost
127.0.1.1       auth-service.local
127.0.1.2       inventory-service.local

Now, you can run both services simultaneously.

# Terminal 1: Start the auth service
$ go run auth/main.go --bind 127.0.1.1:8080

# Terminal 2: Start the inventory service
$ python inventory/app.py --host 127.0.1.2 --port 8080

Voilà. http://auth-service.local:8080 and http://inventory-service.local:8080 are now living in perfect harmony. No more port drama.

The safety of an invisible fence

Binding a service to 0.0.0.0 is the developer equivalent of leaving your front door wide open with a neon sign that says, “Come on in, check out my messy code, maybe rifle through my database.” It’s convenient, but it invites the entire network to your private party.

Binding to a 127.x.y.z address, however, is like building an invisible fence. The service is only accessible from within the machine itself. This is your insurance policy against accidentally exposing a development database full of ridiculous test data to the rest of the company.

Advanced sorcery for the brave

Ready to move beyond the basics? Treating the 127 block as a toolkit unlocks some truly powerful patterns.

Taming local TLS

Testing services that require TLS can be a nightmare. With your new loopback addresses, it becomes trivial. You can create a single local Certificate Authority (CA) and issue a certificate with Subject Alternative Names (SANs) for each of your local services.

# /etc/hosts again

127.0.2.1   api-gateway.secure.local
127.0.2.2   user-db.secure.local
127.0.2.3   billing-api.secure.local

Now, api-gateway.secure.local can talk to user-db.secure.local over HTTPS, with valid certificates, all without a single packet leaving your laptop. This is perfect for testing mTLS, SNI, and other scenarios where your client needs to be picky about its connections.

Concurrent tests without the chaos

Running automated acceptance tests that all expect to connect to a database on port 5432 can be a race condition nightmare. By pinning each test runner to its own unique 127 address, you can spin them all up in parallel. Each test gets its own isolated world, and your CI pipeline finishes in a fraction of the time.

The fine print and other oddities

This newfound power comes with a few quirks you should know about. This is the part of the tour where we point out the strange neighbor who mows his lawn at midnight.

  • The container dimension: Inside a Docker container, 127.0.0.1 refers to the container itself, not the host machine. It’s a whole different loopback universe in there. To reach the host from a container, you need to use the special gateway address provided by your platform (like host.docker.internal).
  • The IPv6 minimalist: IPv6 scoffs at IPv4’s 16 million addresses. For loopback, it gives you one: ::1. That’s it. This explains the classic mystery of “it works with 127.0.0.1 but fails with localhost.” Often, localhost resolves to ::1 first, and if your service is only listening on IPv4, it won’t answer the door. The lesson? Be explicit, or make sure your service listens on both.
  • The SSRF menace: If you’re building security filters to prevent Server-Side Request Forgery (SSRF), remember that blocking just 127.0.0.1 is like locking the front door but leaving all the windows open. You must block the entire 127.0.0.0/8 range and ::1.

Your quick start eviction notice for port conflicts

Ready to put this into practice? Here’s a little starter kit you can paste today.

First, add some friendly names to your hosts file.

# Add these to your /etc/hosts file
127.0.10.1  api.dev.local
127.0.10.2  db.dev.local
127.0.10.3  cache.dev.local

Next, on Linux or macOS, you can formally add these as aliases to your loopback interface. This isn’t always necessary for binding, but it’s tidy.

# For Linux
sudo ip addr add 127.0.10.1/8 dev lo
sudo ip addr add 127.0.10.2/8 dev lo
sudo ip addr add 127.0.10.3/8 dev lo

# For macOS
sudo ifconfig lo0 alias 127.0.10.1
sudo ifconfig lo0 alias 127.0.10.2
sudo ifconfig lo0 alias 127.0.10.3

Now, you can bind three different services, all to their standard ports, without a single collision.

# Run your API on its default port
api-server --bind api.dev.local:3000

# Run Postgres on its default port
postgres -D /path/to/data -c listen_addresses=db.dev.local

# Run Redis on its default port
redis-server --bind cache.dev.local

Check that everyone is home and listening.

# Check the API
curl http://api.dev.local:3000/health

# Check the database (requires psql client)
psql -h db.dev.local -U myuser -d mydb -c "SELECT 1"

# Check the cache
redis-cli -h cache.dev.local ping
# Expected output: PONG

Welcome to the neighborhood

Your laptop isn’t a one-address town; it’s a small city with streets you haven’t named and doors you haven’t opened. For too long, you’ve been forcing all your applications to live in a single, crowded, noisy studio apartment at 127.0.0.1. The database is sleeping on the couch, the API server is hogging the bathroom, and the caching service is eating everyone else’s food from the fridge. It’s digital chaos.

Giving each service its own loopback address is like finally moving them into their own apartments in the same building. It’s basic digital hygiene. Suddenly, there’s peace. There’s order. You can visit each one without tripping over the others. You stop being a slumlord for your own processes and become a proper city planner.

So go ahead, break the monogamous, and frankly codependent, relationship you’ve had with 127.0.0.1. Explore the neighborhood. Hand out a few addresses. Let your development environment behave like a well-run, civilized society instead of a digital mosh pit. Your sanity and your services will thank you for it. After all, good fences make good neighbors, even when they’re all living inside your head.

Terraform scales better without a centralized remote state

It’s 4:53 PM on a Friday. You’re pushing a one-line change to an IAM policy. A change so trivial, so utterly benign, that you barely give it a second thought. You run terraform apply, lean back in your chair, and dream of the weekend. Then, your terminal returns a greeting from the abyss: Error acquiring state lock.

Somewhere across the office, or perhaps across the country, a teammate has just started a plan for their own, seemingly innocuous change. You are now locked in a digital standoff. The weekend is officially on hold. Your shared Terraform state file, once a symbol of collaboration and a single source of truth, has become a temperamental roommate who insists on using the kitchen right when you need to make dinner. And they’re a very, very slow cook.

Our Terraform honeymoon phase

It wasn’t always like this. Most of us start our Terraform journey in a state of blissful simplicity. Remember those early days? A single, elegant main.tf file, a tidy remote backend in an S3 bucket, and a DynamoDB table to handle the locking. It was the infrastructure equivalent of a brand-new, minimalist apartment. Everything had its place. Deployments were clean, predictable, and frankly, a little bit boring.

Our setup looked something like this, a testament to a simpler time:

# in main.tf
terraform {
  backend "s3" {
    bucket         = "our-glorious-infra-state-prod"
    key            = "global/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock-prod"
    encrypt        = true
  }
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  # ... and so on
}

It worked beautifully. Until it didn’t. The problem with minimalist apartments is that they don’t stay that way. You add a person, then another. You buy more furniture. Soon, you’re tripping over things, and that one clean kitchen becomes a chaotic battlefield of conflicting needs.

The kitchen gets crowded

As our team and infrastructure grew, our once-pristine state file started to resemble a chaotic shared kitchen during rush hour. The initial design, meant for a single chef, was now buckling under the pressure of a full restaurant staff.

The state lock standoff

The first and most obvious symptom was the state lock. It’s less of a technical “race condition” and more of a passive-aggressive duel between two colleagues who both need the only good frying pan at the exact same time. The result? Burnt food, frayed nerves, and a CI/CD pipeline that spends most of its time waiting in line.

The mystery of the shared spice rack

With everyone working out of the same state file, we lost any sense of ownership. It became a communal spice rack where anyone could move, borrow, or spill things. You’d reach for the salt (a production security group) only to find someone had replaced it with sugar (a temporary rule for a dev environment). Every terraform apply felt like a gamble. You weren’t just deploying your change; you were implicitly signing off on the current, often mysterious, state of the entire kitchen.

The pre-apply prayer

This led to a pervasive culture of fear. Before running an apply, engineers would perform a ritualistic dance of checks, double-checks, and frantic Slack messages: “Hey, is anyone else touching prod right now?” The Terraform plan output would scroll for pages, a cryptic epic poem of changes, 95% of which had nothing to do with you. You’d squint at the screen, whispering a little prayer to the DevOps gods that you wouldn’t accidentally tear down the customer database because of a subtle dependency you missed.

The domino effect of a single spilled drink

Worst of all was the tight coupling. Our infrastructure became a house of cards. A team modifying a network ACL for their new microservice could unintentionally sever connectivity for a legacy monolith nobody had touched in years. It was the architectural equivalent of trying to change a lightbulb and accidentally causing the entire building’s plumbing to back up.

An uncomfortable truth appears

For a while, we blamed Terraform. We complained about its limitations, its verbosity, and its sharp edges. But eventually, we had to face an uncomfortable truth: the tool wasn’t the problem. We were. Our devotion to the cult of the single centralized state—the idea that one file to rule them all was the pinnacle of infrastructure management—had turned our single source of truth into a single point of failure.

The great state breakup

The solution was as terrifying as it was liberating: we had to break up with our monolithic state. It was time to move out of the chaotic shared house and give every team their own well-equipped studio apartment.

Giving everyone their own kitchenette

First, we dismantled the monolith. We broke our single Terraform configuration into dozens of smaller, isolated stacks. Each stack managed a specific component or application, like a VPC, a Kubernetes cluster, or a single microservice’s infrastructure. Each had its own state file.

Our directory structure transformed from a single folder into a federation of independent projects:

infra/
├── networking/
│   ├── vpc.tf
│   └── backend.tf      # Manages its own state for the VPC
├── databases/
│   ├── rds-main.tf
│   └── backend.tf      # Manages its own state for the primary RDS
└── services/
    ├── billing-api/
    │   ├── ecs-service.tf
    │   └── backend.tf  # Manages state for just the billing API
    └── auth-service/
        ├── iam-roles.tf
        └── backend.tf  # Manages state for just the auth service
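
Each of those backend.tf files is tiny, and the only thing that really changes from stack to stack is the key. A sketch for the billing API, reusing the same bucket and lock table as before:

# services/billing-api/backend.tf
terraform {
  backend "s3" {
    bucket         = "our-glorious-infra-state-prod"
    key            = "services/billing-api/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock-prod"
    encrypt        = true
  }
}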

The state lock standoffs vanished overnight. Teams could work in parallel without tripping over each other. The blast radius of any change was now beautifully, reassuringly small.

Letting infrastructure live with its application

Next, we embraced GitOps patterns. Instead of a central infrastructure repository, we decided that infrastructure code should live with the application it supports. It just makes sense. The code for an API and the infrastructure it runs on are a tightly coupled couple; they should live in the same house. This meant code reviews for application features and infrastructure changes happened in the same pull request, by the same team.
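
In practice, that meant each application repository grew a small infra folder right next to the code it deploys. The exact layout varies by team, but it looks roughly like this:

billing-api/
├── src/                  # application code
├── Dockerfile
└── infra/
    ├── ecs-service.tf
    ├── variables.tf
    └── backend.tf        # state for just this service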

Tasting the soup before serving it

Finally, we made surprises a thing of the past by validating plans before they ever reached the main branch. We set up simple CI workflows that would run a Terraform plan on every pull request. No more mystery meat deployments. The plan became a clear, concise contract of what was about to happen, reviewed and approved before merge.

A snippet from our GitHub Actions workflow looked like this:

name: 'Terraform Plan Validation'
on:
  pull_request:
    paths:
      - 'infra/**'
      - '.github/workflows/terraform-plan.yml'

jobs:
  plan:
    name: 'Terraform Plan'
    runs-on: ubuntu-latest
    steps:
    - name: Checkout
      uses: actions/checkout@v4

    - name: Setup Terraform
      uses: hashicorp/setup-terraform@v3
      with:
        terraform_version: 1.5.0

    - name: Terraform Init
      run: terraform init -input=false

    - name: Terraform Plan
      run: terraform plan -no-color -input=false

Stories from the other side

This wasn’t just a theoretical exercise. A fintech firm we know split its monolithic repo into 47 micro-stacks. Their deployment speed shot up by 70%, not because they wrote code faster, but because they spent less time waiting and untangling conflicts. Another startup moved from a central Terraform setup to the AWS CDK (TypeScript), embedding infra in their app repos. They cut their time-to-deploy in half, freeing their SRE team from being gatekeepers and allowing them to become enablers.

Guardrails not gates

Terraform is still a phenomenally powerful tool. But the way we use it has to evolve. A centralized remote state, when not designed for scale, becomes a source of fragility, not strength. Just because you can put all your eggs in one basket doesn’t mean you should, especially when everyone on the team needs to carry that basket around.

The most scalable thing you can do is let teams build independently. Give them ownership, clear boundaries, and the tools to validate their work. Build guardrails to keep them safe, not gates to slow them down. Your Friday evenings will thank you for it.