SRE

The mutability mirage in Cloud

We’ve all been there. A DevOps engineer squints at a script, muttering, “But I changed it, it has to be mutable.” Meanwhile, the cloud infrastructure blinks back, unimpressed, as if to say, “Sure, you swapped the sign. That doesn’t make the building mutable.”

This isn’t just a coding quirk. It’s a full-blown identity crisis in the world of cloud architecture and DevOps, where confusing reassignment with mutability can lead to anything from baffling bugs to midnight firefighting sessions. Let’s dissect why your variables are lying to you, and why it matters more than you think.

The myth of the mutable variable

Picture this: You’re editing a configuration file for a cloud service. You tweak a value, redeploy, and poof, it works. Naturally, you assume the system is mutable. But what if it isn’t? What if the platform quietly discarded your old configuration and spun up a new one, like a magician swapping a rabbit for a hat?

This is the heart of the confusion. In programming, mutability isn’t about whether something changes; it’s about how it changes. A mutable object alters its state in place, like a chameleon shifting colors. An immutable one? It’s a one-hit wonder: once created, it’s set in stone. Any “change” is just a new object in disguise.

What mutability really means

Let’s cut through the jargon. A mutable object, say, a Python list, lets you tweak its contents without breaking a sweat. Add an item, remove another, and it’s still the same list. Check its memory address with id(), and it stays consistent.

Now take a string. Try to “modify” it:

greeting = "Hello"  
greeting += " world"

Looks like a mutation, right? Wrong. The original greeting is gone, replaced by a new string. The memory address? Different. The variable name greeting is just a placeholder, now pointing to a new object, like a GPS rerouting you to a different street.

This isn’t pedantry. It’s the difference between adjusting the engine of a moving car and replacing the entire car because you wanted a different color.

The great swap

Why does this illusion persist? Because programming languages love to hide the smoke and mirrors. In functional programming, for instance, operations like map() or filter() return new collections, never altering the original. Yet the syntax, data = transform(data), feels like mutation.

Even cloud infrastructure plays this game. Consider immutable server deployments: you don’t “update” an AWS EC2 instance. You bake a new AMI and replace the old one. The outcome is change, but the mechanism is substitution. Confusing the two leads to chaos, like assuming you can repaint a house without leaving the living room.

The illusion of change

Here’s where things get sneaky. When you write:

counter = 5  
counter += 1

You’re not mutating the number 5. You’re discarding it for a shiny new 6. The variable counter is merely a label, not the object itself. It’s like renaming a book after you’ve already read it, The Great Gatsby didn’t change; you just called it The Even Greater Gatsby and handed it to someone else.

This trickery is baked into language design. Python’s tuples are immutable, but you can reassign the variable holding them. Java’s String class is famously unyielding, yet developers swear they “changed” it daily. The culprit? Syntax that masks object creation as modification.

Why cloud and DevOps care

In cloud architecture, this distinction is a big deal. Mutable infrastructure, like manually updating a server, invites inconsistency and “works on my machine” disasters. Immutable infrastructure, by contrast, treats servers as disposable artifacts. Changes mean new deployments, not tweaks.

This isn’t just trendy. It’s survival. Imagine two teams modifying a shared configuration. If the object is mutable, chaos ensues, race conditions, broken dependencies, the works. If it’s immutable, each change spawns a new, predictable version. No guessing. No debugging at 3 a.m.

Performance matters too. Creating new objects has overhead, yes, but in distributed systems, the trade-off for reliability is often worth it. As the old adage goes: “You can optimize for speed or sanity. Pick one.”

How not to fall for the trick

So how do you avoid this trap?

Check the documentation. Is the type labeled mutable? If it’s a string, tuple, or frozenset, assume it’s playing hard to get.
Test identity. In Python, use id(). In Java, compare references. If the address changes, you’ve been duped.
Prefer immutability for shared data. Your future self will thank you when the system doesn’t collapse under concurrent edits.

And if all else fails, ask: “Did I alter the object, or did I just point to a new one?” If the answer isn’t obvious, grab a coffee. You’ll need it.

The cloud doesn’t change, it blinks

Let’s be brutally honest: in the cloud, assuming something is mutable because it changes is like assuming your toaster is self-repairing because the bread pops up different shades of brown. You tweak a Kubernetes config, redeploy, and poof, it’s “updated.” But did you mutate the cluster or merely summon a new one from the void? In the world of DevOps, this confusion isn’t just a coding quirk; it’s the difference between a smooth midnight rollout and a 3 a.m. incident war room where your coffee tastes like regret.

Cloud infrastructure doesn’t change; it reincarnates. When you “modify” an AWS Lambda function, you’re not editing a living organism. You’re cremating the old version and baptizing a new one in S3. The same goes for Terraform state files or Docker images: what looks like a tweak is a full-scale resurrection. Mutable configurations? They’re the digital equivalent of duct-taping a rocket mid-flight. Immutable ones? They’re the reason your team isn’t debugging why the production database now speaks in hieroglyphics.

And let’s talk about the real villain: configuration drift. It’s the gremlin that creeps into mutable systems when no one’s looking. One engineer tweaks a server, another “fixes” a firewall rule, and suddenly your cloud environment has the personality of a broken vending machine. Immutable infrastructure laughs at this. It’s the no-nonsense librarian who will replace the entire catalog if you so much as sneeze near the Dewey Decimal System.

So the next time a colleague insists, “But I changed it!” with the fervor of a street magician, lean in and whisper: “Ah, yes. Just like how I ‘changed’ my car by replacing it with a new one. Did you mutate the object, or did you just sacrifice it to the cloud gods?” Then watch their face, the same bewildered blink as your AWS console when you accidentally set min_instances = 0 on a critical service.

The cloud doesn’t get frustrated. It doesn’t sigh. It blinks. Once. Slowly. And in that silent judgment, you’ll finally grasp the truth: change is inevitable. Mutability is a choice. Choose wisely, or spend eternity debugging the ghost of a server that thought it was mutable.

(And for the love of all things scalable: stop naming your variables temp.)

Parenting your Kubernetes using hierarchical namespaces

Let’s be honest. Your Kubernetes cluster, on its bad days, feels less like a sleek, futuristic platform and more like a chaotic shared apartment right after college. The frontend team is “borrowing” CPU from the backend team, the analytics project left its sensitive data lying around in a public bucket, and nobody knows who finished the last of the memory reserves.

You tried to bring order. You dutifully handed out digital rooms to each team using namespaces. For a while, there was peace. But then those teams had their own little sub-projects, staging, testing, that weird experimental feature no one talks about, and your once-flat world devolved into a sprawling city with no zoning laws. The shenanigans continued, just inside slightly smaller boxes.

What you need isn’t more rules scribbled on a whiteboard. You need a family tree. It’s time to introduce some much-needed parental supervision into your cluster. It’s time for Hierarchical Namespaces.

The origin of the namespace rebellion

In the beginning, Kubernetes gave us namespaces, and they were good. The goal was simple: create virtual walls to stop teams from stealing each other’s lunch (metaphorically speaking, of course). Each namespace was its own isolated island, a sovereign nation with its own rules. This “flat earth” model worked beautifully… until it didn’t.

As organizations scaled, their clusters turned into bustling archipelagos of hundreds of namespaces. Managing them felt like being an air traffic controller for a fleet of paper airplanes in a hurricane. Teams realized that a flat structure was basically a free-for-all party where every guest could raid the fridge, as long as they stayed in their designated room. There was no easy way to apply a single rule, like a network policy or a set of permissions, to a group of related namespaces. The result was a maddening copy-paste-a-thon of YAML files, a breeding ground for configuration drift and human error.

The community needed a way to group these islands, to draw continents. And so, the Hierarchical Namespace Controller (HNC) was born, bringing a simple, powerful concept to the table: namespaces can have parents.

What this parenting gig gets you

Adopting a hierarchical structure isn’t just about satisfying your inner control freak. It comes with some genuinely fantastic perks that make cluster management feel less like herding cats.

The “Because I said so” principle: This is the magic of policy inheritance. Any Role, RoleBinding, or NetworkPolicy you apply to a parent namespace automatically cascades down to all its children and their children, and so on. It’s the parenting dream: set a rule once, and watch it magically apply to everyone. No more duplicating RBAC roles for the dev, staging, and testing environments of the same application.
The family budget: You can set a resource quota on a parent namespace, and it becomes the total budget for that entire branch of the family tree. For instance, team-alpha gets 100 CPU cores in total. Their dev and qa children can squabble over that allowance, but together, they can’t exceed it. It’s like giving your kids a shared credit card instead of a blank check.
Delegated authority: You can make a developer an admin of a “team” namespace. Thanks to inheritance, they automatically become an admin of all the sub-namespaces under it. They get the freedom to manage their own little kingdoms (staging, testing, feature-x) without needing to ping a cluster-admin for every little thing. You’re teaching them responsibility (while keeping the master keys to the kingdom, of course).

Let’s wrangle some namespaces

Convinced? I thought so. The good news is that bringing this parental authority to your cluster isn’t just a fantasy. Let’s roll up our sleeves and see how it works.

Step 0: Install the enforcer

Before we can start laying down the law, we need to invite the enforcer. The Hierarchical Namespace Controller (HNC) doesn’t come built-in with Kubernetes. You have to install it first.

You can typically install the latest version with a single kubectl command:

kubectl apply -f [https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/hnc-manager.yaml](https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/hnc-manager.yaml)

Wait a minute for the controller to be up and running in its own hnc-system namespace. Once it’s ready, you’ll have a new superpower: the kubectl hns plugin.

Step 1: Create the parent namespace

First, let’s create a top-level namespace for a project. We’ll call it project-phoenix. This will be our proud parent.

kubectl create namespace project-phoenix

Step 2: Create some children

Now, let’s give project-phoenix a couple of children: staging and testing. Wait, what’s that hns command? That’s not your standard kubectl. That’s the magic wand the HNC just gave you. You’re telling it to create a staging namespace and neatly tuck it under its parent.

kubectl hns create staging -n project-phoenix
kubectl hns create testing -n project-phoenix

Step 3: Admire your family tree

To see your beautiful new hierarchy in all its glory, you can ask HNC to draw you a picture.

kubectl hns tree project-phoenix

You’ll get a satisfyingly clean ASCII art diagram of your new family structure:

You can even create grandchildren. Let’s give the staging namespace its own child for a specific feature branch.

kubectl hns create feature-login-v2 -n staging
kubectl hns tree project-phoenix

And now your tree looks even more impressive:

Step 4 Witness the magic of inheritance

Let’s prove that this isn’t all smoke and mirrors. We’ll create a Role in the parent namespace that allows viewing Pods.

# viewer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-viewer
  namespace: project-phoenix
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Apply it:

kubectl apply -f viewer-role.yaml

Now, let’s give a user, let’s call her jane.doe, that role in the parent namespace.

kubectl create rolebinding jane-viewer --role=pod-viewer --user=jane.doe -n project-phoenix

Here’s the kicker. Even though we only granted Jane permission in project-phoenix, she can now magically view pods in the staging and feature-login-v2 namespaces as well.

# This command would work for Jane!
kubectl auth can-i get pods -n staging --as=jane.doe
# YES

# And even in the grandchild namespace!
kubectl auth can-i get pods -n feature-login-v2 --as=jane.doe
# YES

No copy-pasting required. The HNC saw the binding in the parent and automatically propagated it down the entire tree. That’s the power of parenting.

A word of caution from a fellow parent

As with real parenting, this new power comes with its own set of challenges. It’s not a silver bullet, and you should be aware of a few things before you go building a ten-level deep namespace dynasty.

Complexity can creep in: A deep, sprawling tree of namespaces can become its own kind of nightmare to debug. Who has access to what? Which quota is affecting this pod? Keep your hierarchy logical and as flat as you can get away with. Just because you can create a great-great-great-grandchild namespace doesn’t mean you should.
Performance is not free: The HNC is incredibly efficient, but propagating policies across thousands of namespaces does have a cost. For most clusters, it’s negligible. For mega-clusters, it’s something to monitor.
Not everyone obeys the parents: Most core Kubernetes resources (RBAC, Network Policies, Resource Quotas) play nicely with HNC. But not all third-party tools or custom controllers are hierarchy-aware. They might only see the flat world, so always test your specific tools.

Go forth and organize

Hierarchical Namespaces are the organizational equivalent of finally buying drawer dividers for that one kitchen drawer, you know the one. The one where the whisk is tangled with the batteries and a single, mysterious key. They transform your cluster from a chaotic free-for-all into a structured, manageable hierarchy that actually reflects how your organization works. It’s about letting you set rules with confidence and delegate with ease.

So go ahead, embrace your inner cluster parent. Bring some order to the digital chaos. Your future self, the one who isn’t spending a Friday night debugging a rogue pod in the wrong environment, will thank you. Just don’t be surprised when your newly organized child namespaces start acting like teenagers, asking for the production Wi-Fi password or, heaven forbid, the keys to the cluster-admin car.After all, with great power comes great responsibility… and a much, much cleaner kubectl get ns output.

October 3, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

Building living systems with WebSockets

For the longest time, communication on the web has been a painfully formal affair. It’s like sending a letter. Your browser meticulously writes a request, sends it off via the postal service (HTTP), and then waits. And waits. Eventually, the server might write back with an answer. If you want to know if anything has changed five seconds later, you have to send another letter. It’s slow, it’s inefficient, and frankly, the postman is starting to give you funny looks.

This constant pestering, “Anything new? How about now? Now?”, is the digital equivalent of a child on a road trip asking, “Are we there yet?” It’s called polling, and it’s the clumsy foundation upon which much of the old web was built. For applications that need to feel alive, this just won’t do.

What if, instead of sending a flurry of letters, we could just open a phone line directly to the server? A dedicated, always-on connection where both sides can just shout information at each other the moment it happens. That, in a nutshell, is the beautiful, chaotic, and nonstop chatter of WebSockets. It’s the technology that finally gave our distributed systems a voice.

The secret handshake that starts the party

So how do you get access to this exclusive, real-time conversation? You can’t just barge in. You have to know the secret handshake.

The process starts innocently enough, with a standard HTTP request. It looks like any other request, but it carries a special, almost magical, header: Upgrade: websocket. This is the client subtly asking the server, “Hey, this letter-writing thing is a drag. Can we switch to a private line?”

If the server is cool, and equipped for a real conversation, it responds with a special status code, 101 Switching Protocols. This isn’t just an acknowledgment; it’s an agreement. The server is saying, “Heck yes. The formal dinner party is over. Welcome to the after-party.” At that moment, the clumsy, transactional nature of HTTP is shed like a heavy coat, and the connection transforms into a sleek, persistent, two-way WebSocket tunnel. The phone line is now open.

So what can we do with all this chatter?

Once you have this open line, the possibilities become far more interesting than just fetching web pages. You can build systems that breathe.

The art of financial eavesdropping

Think of a stock trading platform. With HTTP, you’d be that sweaty-palmed investor hitting refresh every two seconds, hoping to catch a price change. With WebSockets, the server just whispers the new prices in your ear the microsecond they change. It’s the difference between reading yesterday’s newspaper and having a live feed from the trading floor piped directly into your brain.

Keeping everyone on the same page literally

Remember the horror of emailing different versions of a document back and forth? “Report_Final_v2_Johns_Edits_Final_FINAL.docx”. Collaborative tools like Google Docs killed that nightmare, and WebSockets were the murder weapon. When you type, your keystrokes are streamed to everyone else in the document instantly. It’s a seamless, shared consciousness, not a series of disjointed monologues.

Where in the world is my taxi

Ride-sharing apps like Uber would be a farce without a live map. You don’t want a “snapshot” of where your driver was 30 seconds ago; you want to see that little car icon gliding smoothly toward you. WebSockets provide that constant stream of location data, turning a map from a static picture into a living, moving window.

When the conversation gets too loud

Of course, hosting a party where a million people are all talking at once isn’t exactly a walk in the park. This is where our brilliant WebSocket-powered dream can turn into a bit of a logistical headache.

A server that could happily handle thousands of brief HTTP requests might suddenly break into a cold sweat when asked to keep tens of thousands of WebSocket phone lines open simultaneously. Each connection consumes memory and resources. It’s like being a party host who promised to have a deep, meaningful conversation with every single guest, all at the same time. Eventually, you’re just going to collapse from exhaustion.

And what happens if the line goes dead? A phone can be hung up, but a digital connection can just… fade into the void. Is the client still there, listening quietly? Or did their Wi-Fi die mid-sentence? To avoid talking to a ghost, servers have to periodically poke the client with a ping message. If they get a pong back, all is well. If not, the server sadly hangs up, freeing the line for someone who actually wants to talk.

How to be a good conversation host

Taming this beast requires a bit of cleverness. You can’t just throw one server at the problem and hope for the best.

Load balancers become crucial, but they need to be smarter. A simple load balancer that just throws requests at any available server is a disaster for WebSockets. It’s like trying to continue a phone conversation while the operator keeps switching you to a different person who has no idea what you were talking about. You need “sticky sessions,” which is a fancy way of saying the load balancer is smart enough to remember which server you were talking to and keeps you connected to it.

Security also gets a fun new twist. An always-on connection is a wonderfully persistent doorway into your system. If you’re not careful about who you’re talking to and what they’re saying (WSS, the secure version, is non-negotiable), you might find you’ve invited a Trojan horse to your party.

A world that talks back

So, no, WebSockets aren’t just another tool in the shed. They represent a philosophical shift. It’s the moment we stopped treating the web like a library of dusty, static books and started treating it like a bustling, chaotic city square. We traded the polite, predictable, and frankly boring exchange of letters for the glorious, unpredictable mess of a real-time human conversation.

It means our applications can now have a pulse. They can be surprised, they can interrupt, and they can react with the immediacy of a startled cat. Building these living systems is certainly noisier and requires a different kind of host, one who’s part traffic cop and part group therapist. But by embracing the chaos, we create experiences that don’t just respond; they engage, they live. And isn’t building a world that actually talks back infinitely more fun?

October 1, 2025 by Fernando SRE Cloud stuff Computer Science stuff

BigQuery learns to read between the lines

Keyword search is the friend who hears every word and misses the point. Vector search is the friend who nods, squints a little, and says, “You want a safe family SUV that will not make your wallet cry.” This story is about teaching BigQuery to be the second friend.

I wanted semantic search without renting another database, shipping nightly exports, or maintaining yet another dashboard only I remember to feed. The goal was simple and a little cheeky: keep the data in BigQuery, add embeddings with Vertex AI, create a vector index, and still use boring old SQL to filter by price and mileage. Results should read like good advice, not a word-count contest.

Below is a practical pattern that works well for catalogs, internal knowledge bases, and “please find me the thing I mean” situations. It is light on ceremony, honest about trade‑offs, and opinionated where it needs to be.

Why keyword search keeps missing the point

Humans ask for meanings, not tokens. “Family SUV that does not guzzle” is intent, not keywords.
Catalogs are messy. Price, mileage, features, and descriptions live in different columns and dialects.
Traditional search treats text like a bag of Scrabble tiles. Embeddings turn it into geometry where similar meanings sit near each other.

If you have ever typed “cheap laptop with decent battery” and received a gaming brick with neon lighting, you know the problem.

Keep data where it already lives

No new database. BigQuery already stores your rows, talks SQL, and now speaks vectors. The plan

Build a clean content string per row so the model has a story to understand.
Generate embeddings in BigQuery via a remote Vertex AI model.
Store those vectors in a table and, when it makes sense, add a vector index.
Search with a natural‑language query embedding and filter with plain SQL.

Map of the idea:

Prepare a clean narrative for each row

Your model will eat whatever you feed it. Feed it something tidy. The goal is a single content field with labeled parts, so the embedding has clues.

-- Demo names and values are fictitious
CREATE OR REPLACE TABLE demo_cars.search_base AS
SELECT
  listing_id,
  make,
  model,
  year,
  price_usd,
  mileage_km,
  body_type,
  fuel,
  features,
  CONCAT(
    'make=', make, ' | ',
    'model=', model, ' | ',
    'year=', CAST(year AS STRING), ' | ',
    'price_usd=', CAST(price_usd AS STRING), ' | ',
    'mileage_km=', CAST(mileage_km AS STRING), ' | ',
    'body=', body_type, ' | ',
    'fuel=', fuel, ' | ',
    'features=', ARRAY_TO_STRING(features, ', ')
  ) AS content
FROM demo_cars.listings
WHERE status = 'active';

Housekeeping that pays off

Normalize units and spellings early. “20k km” is cute; 20000 is useful.
Keep labels short and consistent. Your future self will thank you.
Avoid stuffing everything. Noise in, noise out.

Turn text into vectors without hand waving

We will assume you have a BigQuery remote model that points to your Vertex AI text‑embedding endpoint. Choose a modern embedding model and be explicit about task type, use RETRIEVAL_DOCUMENT for rows and RETRIEVAL_QUERY for user queries. That hint matters.

Embed the documents

-- Store document embeddings alongside your base table
CREATE OR REPLACE TABLE demo_cars.search_with_vec AS
SELECT
  b.listing_id,
  b.make, b.model, b.year, b.price_usd, b.mileage_km, b.body_type, b.fuel, b.features,
  e.ml_generate_embedding_result AS embedding
FROM demo_cars.search_base AS b,
UNNEST([
  STRUCT(
    (SELECT ml_generate_embedding_result
     FROM ML.GENERATE_EMBEDDING(
       MODEL `demo.embed_text`,
       (SELECT b.content AS content),
       STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
     )) AS ml_generate_embedding_result
  )
]) AS e;

That cross join with a single STRUCT is a neat way to add one vector per row without creating a separate subquery table. If you prefer, materialize embeddings in a separate table and JOIN on listing_id to minimize churn.

Build an index when it helps and skip it when it does not

BigQuery can scan vectors without an index, which is fine for small tables and prototypes. For larger tables, add an IVF index with cosine distance.

-- Optional but recommended beyond a few thousand rows
CREATE VECTOR INDEX demo_cars.search_with_vec_idx
ON demo_cars.search_with_vec(embedding)
OPTIONS(
  distance_type = 'COSINE',
  index_type = 'IVF',
  ivf_options = '{"num_lists": 128}'
);

Rules of thumb

Start without an index for quick experiments. Add the index when latency or cost asks for it.
Tune num_lists only after measuring. Guessing is cardio for your CPU.

Ask in plain English, filter in plain SQL

Here is the heart of it. One short block that embeds the query, runs vector search, then applies filters your finance team actually understands.

-- Natural language wish
DECLARE user_query STRING DEFAULT 'family SUV with lane assist under 18000 USD';

WITH q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (SELECT user_query AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT s.listing_id, s.make, s.model, s.year, s.price_usd, s.mileage_km, s.body_type
FROM VECTOR_SEARCH(
  TABLE demo_cars.search_with_vec, 'embedding',
  TABLE q, query_column_to_search => 'qvec',
  top_k => 20, distance_type => 'COSINE'
) AS s
WHERE s.price_usd <= 18000
  AND s.body_type = 'SUV'
ORDER BY s.price_usd ASC;

This is the “hybrid search” pattern, shoulder to shoulder, semantics finds plausible candidates, SQL draws the hard lines. You get relevance and guardrails.

Measure quality and cost without a research grant

You do not need a PhD rubric, just a habit.

Relevance sanity check

Write five real queries from your users. Note how many good hits appear in the top ten. If it is fewer than six, look at your content field. It is almost always the content.

Latency

Time the query with and without the vector index. Keep an eye on top‑k and filters. If you filter out 90% of candidates, you can often keep top‑k low.

Cost

Avoid regenerating embeddings. Upserts should only touch changed rows. Schedule small nightly or hourly batches, not heroic full refreshes.

Where things wobble and how to steady them

Vague user queries

Add example phrasing in your product UI. Even two placeholders nudge users into better intent.

Sparse or noisy text

Enrich your content with compact labels and the two or three features people actually ask for. Resist the urge to dump raw logs.

Synonyms of the trade

Lightweight mapping helps. If your users say “lane keeping” and your data says “lane assist,” consider normalizing in content.

Region mismatches

Keep your dataset, remote connection, and model in compatible regions. Latency enjoys proximity. Downtime enjoys misconfigurations.

Run it day after day without drama

A few operational notes that keep the lights on

Track changes by listing_id and only re‑embed those rows.
Rebuild or refresh the index on a schedule that fits your churn. Weekly is plenty for most catalogs.
Keep one “golden query set” around for spot checks after schema or model changes.

Takeaways you can tape to your monitor

Keep data in BigQuery and add meaning with embeddings.
Build one tidy content per row. Labels beat prose.
Use RETRIEVAL_DOCUMENT for rows and RETRIEVAL_QUERY for the user’s text.
Start without an index; add IVF with cosine when volume demands it.
Let vectors shortlist and let SQL make the final call.

Tiny bits you might want later

An alternative query that biases toward newer listings

DECLARE user_query STRING DEFAULT 'compact hybrid with good safety under 15000 USD';
WITH q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (SELECT user_query AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT s.listing_id, s.make, s.model, s.year, s.price_usd
FROM VECTOR_SEARCH(
  TABLE demo_cars.search_with_vec, 'embedding',
  TABLE q, query_column_to_search => 'qvec',
  top_k => 15, distance_type => 'COSINE'
) AS s
WHERE s.price_usd <= 15000
ORDER BY s.year DESC, s.price_usd ASC
LIMIT 10;

Quick checklist before you ship

The remote model exists and is reachable from BigQuery.
Dataset and connection share a region you actually meant to use.
content strings are consistent and free of junk units.
Embeddings updated only for changed rows.
Vector index present on tables that need it and not on those that do not.

If keyword search is literal‑minded, this setup is the polite interpreter who knows what you meant, forgives your typos, and still respects the house rules. You keep your data in one place, you use one language to query it, and you get answers that feel like common sense rather than a thesaurus attack. That is the job.

September 27, 2025 by Fernando SRE Cloud stuff

Ingress and egress on EKS made understandable

Getting traffic in and out of a Kubernetes cluster isn’t a magic trick. It’s more like running the city’s most exclusive nightclub. It’s a world of logistics, velvet ropes, bouncers, and a few bureaucratic tollbooths on the way out. Once you figure out who’s working the front door and who’s stamping passports at the exit, the rest is just good manners.

Let’s take a quick tour of the establishment.

A ninety-second tour of the premises

There are really only two journeys you need to worry about in this club.

Getting In: A hopeful guest (the client) looks up the address (DNS), arrives at the front door, and is greeted by the head bouncer (Load Balancer). The bouncer checks the guest list and directs them to the right party room (Service), where they can finally meet up with their friend (the Pod).

Getting Out: One of our Pods needs to step out for some fresh air. It gets an escort from the building’s internal security (the Node’s ENI), follows the designated hallways (VPC routing), and is shown to the correct exit—be it the public taxi stand (NAT Gateway), a private car service (VPC Endpoint), or a connecting tunnel to another venue (Transit Gateway).

The secret sauce in EKS is that our Pods aren’t just faceless guests; the AWS VPC CNI gives them real VPC IP addresses. This means the building’s security rules, Security Groups, route tables, and NACLs aren’t just theoretical policies. They are the very real guards and locked doors that decide whether a packet’s journey ends in success or a silent, unceremonious death.

Getting past the velvet rope

In Kubernetes, Ingress is the set of rules that governs the front door. But rules on paper are useless without someone to enforce them. That someone is a controller, a piece of software that translates your guest list into actual, physical bouncers in AWS.

The head of security for EKS is the AWS Load Balancer Controller. You hand it an Ingress manifest, and it sets up the door staff.

For your standard HTTP web traffic, it deploys an Application Load Balancer (ALB). Think of the ALB as a meticulous, sharp-dressed bouncer who doesn’t just check your name. It inspects your entire invitation (the HTTP request), looks at the specific event you’re trying to attend (/login or /api/v1), and only then directs you to the right room.
For less chatty protocols like raw TCP, UDP, or when you need sheer, brute-force throughput, it calls in a Network Load Balancer (NLB). The NLB is the big, silent type. It checks that you have a ticket and shoves you toward the main hall. It’s incredibly fast but doesn’t get involved in the details.

This whole operation can be made public or private. For internal-only events, the controller sets up an internal ALB or NLB and uses a private Route 53 zone, hiding the party from the public internet entirely.

The modern VIP system

The classic Ingress system works, but it can feel a bit like managing your guest list with a stack of sticky notes. The rules for routing, TLS, and load balancer behavior are all crammed into a single resource, creating a glorious mess of annotations.

This is where the Gateway API comes in. It’s the successor to Ingress, designed by people who clearly got tired of deciphering annotation soup. Its genius lies in separating responsibilities.

The Platform team (the club owners) manages the Gateway. They decide where the entrances are, what protocols are allowed (HTTP, TCP), and handle the big-picture infrastructure like TLS certificates.
The Application teams (the party hosts) manage Routes (HTTPRoute, TCPRoute, etc.). They just point to an existing Gateway and define the rules for their specific application, like “send traffic for app.example.com/promo to my service.”

This creates a clean separation of duties, offers richer features for traffic management without resorting to custom annotations, and makes your setup far more portable across different environments.

The art of the graceful exit

So, your Pods are happily running inside the club. But what happens when they need to call an external API, pull an image, or talk to a database? They need to get out. This is egress, and it’s mostly about navigating the building’s corridors and exits.

The public taxi stand: For general internet access from private subnets, Pods are sent to a NAT Gateway. It works, but it’s like a single, expensive taxi stand for the whole neighborhood. Every trip costs money, and if it gets too busy, you’ll see it on your bill. Pro tip: Put one NAT in each Availability Zone to avoid paying extra for your Pods to take a cross-town cab just to get to the taxi stand.
The private car service: When your Pods need to talk to other AWS services (like S3, ECR, or Secrets Manager), sending them through the public internet is a waste of time and money. Use
VPC endpoints instead. Think of this as a pre-booked black car service. It creates a private, secure tunnel directly from your VPC to the AWS service. It’s faster, cheaper, and the traffic never has to brave the public internet.
The diplomatic passport: The worst way to let Pods talk to AWS APIs is by attaching credentials to the node itself. That’s like giving every guest in the club a master key. Instead, we use
IRSA (IAM Roles for Service Accounts). This elegantly binds an IAM role directly to a Pod’s service account. It’s the equivalent of issuing your Pod a diplomatic passport. It can present its credentials to AWS services with full authority, no shared keys required.

Setting the house rules

By default, Kubernetes networking operates with the cheerful, chaotic optimism of a free-for-all music festival. Every Pod can talk to every other Pod. In production, this is not a feature; it’s a liability. You need to establish some house rules.

Your two main tools for this are Security Groups and NetworkPolicy.

Security Groups are your Pod’s personal bodyguards. They are stateful and wrap around the Pod’s network interface, meticulously checking every incoming and outgoing connection against a list you define. They are an AWS-native tool and very precise.

NetworkPolicy, on the other hand, is the club’s internal security team. You need to hire a third-party firm like Calico or Cilium to enforce these rules in EKS, but once you do, you can create powerful rules like “Pods in the ‘database’ room can only accept connections from Pods in the ‘backend’ room on port 5432.”

The most sane approach is to start with a default deny policy. This is the bouncer’s universal motto: “If your name’s not on the list, you’re not getting in.” Block all egress by default, then explicitly allow only the connections your application truly needs.

A few recipes from the bartender

Full configurations are best kept in a Git repository, but here are a few cocktail recipes to show the key ingredients.

Recipe 1: Public HTTPS with a custom domain. This Ingress manifest tells the AWS Load Balancer Controller to set up a public-facing ALB, listen on port 443, use a specific TLS certificate from ACM, and route traffic for app.yourdomain.com to the webapp service.

# A modern Ingress for your web application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    # Set the bouncer to be public
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Talk to Pods directly for better performance
    alb.ingress.kubernetes.io/target-type: ip
    # Listen for secure traffic
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Here's the TLS certificate to wear
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id
spec:
  ingressClassName: alb
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 8080

Recipe 2: A diplomatic passport for S3 access. This gives our Pod a ServiceAccount annotated with an IAM role ARN. Any Pod that uses this service account can now talk to AWS APIs (like S3) with the permissions granted by that role, thanks to IRSA.

# The ServiceAccount with its IAM credentials
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  annotations:
    # This is the diplomatic passport: the ARN of the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EKS-S3-Reader-Role
---
# The Deployment that uses the passport
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
spec:
  replicas: 1
  selector:
    matchLabels: { app: reporter }
  template:
    metadata:
      labels: { app: reporter }
    spec:
      # Use the service account we defined above
      serviceAccountName: s3-reader-sa
      containers:
        - name: processor
          image: your-repo/report-generator:v1.5.2
          ports:
            - containerPort: 8080

A short closing worth remembering

When you boil it all down, Ingress is just the etiquette you enforce at the front door. Egress is the paperwork required for a clean exit. In EKS, the etiquette is defined by Kubernetes resources, while the paperwork is pure AWS networking. Neither one cares about your intentions unless you write them down clearly.

So, draw the path for traffic both ways, pick the right doors for the job, give your Pods a proper identity, and set the tolls where they make sense. If you do, the cluster will behave, the bill will behave, and your on-call shifts might just start tasting a lot more like sleep.

September 23, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

What your DNS logs are saying behind your back

There’s a dusty shelf in every network closet where good intentions go to die. Or worse, to gossip. You centralize DNS for simplicity. You enable logging for accountability. You peer VPCs for convenience. A few sprints later, your DNS logs have become that chatty neighbor who sees every car that comes and goes, remembers every visitor, and pieces together a startlingly accurate picture of your life.

They aren’t leaking passwords or secret keys. They’re leaking something just as valuable: the blueprints of your digital house.

This post walks through a common pattern that quietly spills sensitive clues through AWS Route 53 Resolver query logging. We’ll skip the dry jargon and focus on the story. You’ll leave with a clear understanding of the problem, a checklist to investigate your own setup, and a handful of small, boring changes that buy you a lot of peace.

The usual suspects are a disaster recipe in three easy steps

This problem rarely stems from one catastrophic mistake. It’s more like three perfectly reasonable decisions that meet for lunch and end up burning down the restaurant. Let’s meet the culprits.

1. The pragmatic architect

In a brilliant move of pure common sense, this hero centralizes DNS resolution into a single, shared network VPC. “One resolver to rule them all,” they think. It simplifies configuration, reduces operational overhead, and makes life easier for everyone. On paper, it’s a flawless idea.

2. The visibility aficionado

Driven by the noble quest for observability, this character enables Route 53 query logging on that shiny new central resolver. “What gets measured, gets managed,” they wisely quote. To be extra helpful, they associate this logging configuration with every single VPC that peers with the network VPC. After all, data is power. Another flawless idea.

3. The easy-going permissions manager

The logs have to land somewhere, usually a CloudWatch Log Group or an S3 bucket. Our third protagonist, needing to empower their SRE and Ops teams, grants them broad read access to this destination. “They need it to debug things,” is the rationale. “They’re the good guys.” A third, utterly flawless idea.

Separately, these are textbook examples of good cloud architecture. Together, they’ve just created the perfect surveillance machine: a centralized, all-seeing eye that diligently writes down every secret whisper and then leaves the diary on the coffee table for anyone to read.

So what is actually being spilled

The real damage comes from the metadata. DNS queries are the internal monologue of your applications, and your logs are capturing every single thought. A curious employee, a disgruntled contractor, or even an automated script can sift through these logs and learn things like:

Service Hostnames that tell a story: Names like billing-api.prod.internal or customer-data-primary-db.restricted.internal do more than just resolve to an IP. They reveal your service names, their environments, and even their importance.
Secret project names: That new initiative you haven’t announced yet? If its services are making DNS queries like project-phoenix-auth-service.dev.internal, the secret’s already out.
Architectural hints: Hostnames often contain roles like etl-worker-3.prod, admin-gateway.staging, or sre-jumpbox.ops.internal. These are the labels on your architectural diagrams, printed in plain text.
Cross-Environment chatter: The most dangerous leak of all. When a query from a dev VPC successfully resolves a hostname in the prod environment (e.g., prod-database.internal), you’ve just confirmed a path between them exists. That’s a security finding waiting to happen.

Individually, these are harmless breadcrumbs. But when you have millions of them, anyone can connect the dots and draw a complete, and frankly embarrassing, map of your entire infrastructure.

Put on your detective coat and investigate your own house

Feeling a little paranoid? Good. Let’s channel that energy into a quick investigation. You don’t need a magnifying glass, just your AWS command line.

Step 1 Find the secret diaries

First, we need to find out where these confessions are being stored. This command asks AWS to list all your Route 53 query logging configurations. It’s the equivalent of asking, “Where are all the diaries kept?”

aws route53resolver list-resolver-query-log-configs \
--query 'ResolverQueryLogConfigs[].{Name:Name, Id:Id, DestinationArn:DestinationArn, VpcCount:ResolverQueryLogConfigAssociationCount}'

Take note of the DestinationArn for any configs with a high VpcCount. Those are your prime suspects. That ARN is the scene of the crime.

Step 2 Check who has the keys

Now that you know where the logs are, the million-dollar question is: who can read them?

If the destination is a CloudWatch Log Group, examine its resource-based policy and also review the IAM policies associated with your user roles. Are there wildcard permissions like logs:Get* or logs:* attached to broad groups?

If it’s an S3 bucket, check the bucket policy. Does it look something like this?

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::central-network-dns-logs/*"
    }
  ]
}

This policy generously gives every single IAM user and role in the account access to read all the logs. It’s the digital equivalent of leaving your front door wide open.

Step 3 Listen for the juicy gossip

Finally, let’s peek inside the logs themselves. Using CloudWatch Log Insights, you can run a query to find out if your non-production environments are gossiping about your production environment.

fields @timestamp, @message
| filter @message like /\.prod\.internal/
| filter vpc.id not like /vpc-prod-environment-id/
| stats count(*) by vpc.id as sourceVpc
| sort by @timestamp desc

This query looks for any log entries that mention your production domain (.prod.internal) but did not originate from a production VPC. Any results here are a flashing red light, indicating that your environments are not as isolated as you thought.

The fix is housekeeping, not heroics

The good news is that you don’t need to re-architect your entire network. The solution isn’t some heroic, complex project. It’s just boring, sensible housekeeping.

Be granular with your logging: Don’t use a single, central log destination for every VPC. Create separate logging configurations for different environments (prod, staging, dev). Send production logs to a highly restricted location and development logs to a more accessible one.
Practice a little scrutiny: Just because a resolver is shared doesn’t mean its logs have to be. Associate your logging configurations only with the specific VPCs that absolutely need it.
Embrace the principle of least privilege: Your IAM and S3 bucket policies should be strict. Access to production DNS logs should be an exception, not the rule, requiring a specific IAM role that is audited and temporary.

That’s it. No drama, no massive refactor. Just a few small tweaks to turn your chatty neighbor back into a silent, useful tool. Because at the end of the day, the best secret-keeper is the one who never heard the secret in the first place.

September 18, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Stop building cathedrals in Terraform

It’s 9 AM on a Tuesday. You, a reasonably caffeinated engineer, open a pull request to add a single tag to an S3 bucket. A one-line change. You run terraform plan and watch in horror as your screen scrolls with a novel’s worth of green, yellow, and red text. Two hundred and seventeen resources want to be updated.

Welcome to a special kind of archaeological dig. Somewhere, buried three folders deep, a “reusable” module you haven’t touched in six months has decided to redecorate your entire production environment. The brochure promised elegance and standards. The reality is a Tuesday spent doing debugging, cardio, and praying to the Git gods.

Small teams, in particular, fall into this trap. You don’t need to build a glorious cathedral of abstractions just to hang a picture on the wall. You need a hammer, a nail, and enough daylight to see what you’re doing.

The allure of the perfect system

Let’s be honest, custom Terraform modules are seductive. They whisper sweet nothings in your ear about the gospel of DRY (Don’t Repeat Yourself). They promise a future where every resource is a perfect, standardized snowflake, lovingly stamped out from a single, blessed template. It’s the engineering equivalent of having a perfectly organized spice rack where all the labels face forward.

In theory, it’s beautiful. In practice, for a small, fast-moving team, it’s a tax. A heavy one. An indirection tax.

What starts as a neat wrapper today becomes a Matryoshka doll of complexity by next quarter. Inputs multiply. Defaults are buried deeper than state secrets. Soon, flipping a single boolean in a variables.tf file feels like rewiring a nuclear submarine with the lights off. The module is no longer serving you; you are now its humble servant.

It’s like buying one of those hyper-specific kitchen gadgets, like a banana slicer. Yes, it slices bananas. Perfectly. But now you own a piece of plastic whose only job is to do something a knife you already owned could do just fine. That universal S3 module you built is the junk drawer of your infrastructure. Sure, it holds everything, but now you have to rummage past a broken can opener and three instruction manuals just to find a spoon.

A heuristic for staying sane

So, what’s the alternative? Anarchy? Copy-pasting HCL like a digital barbarian? Yes. Sort of.

Here’s a simple, sanity-preserving heuristic:

Duplicate once without shame. Duplicate twice with comments. On the third time, and only then, consider extracting a module.

Until you hit that third, clear, undeniable repetition of a pattern, plain HCL is your best friend. It wins on speed, clarity, and keeping the blast radius of any change predictably small. You avoid abstracting a solution before you even fully understand the problem.

Let’s see it in action. You need a simple, private S3 bucket for your new service.

The cathedral-builder’s approach might look like this:

# service-alpha/main.tf

module "service_alpha_bucket" {
  source = "git::ssh://git@github.com/your-org/terraform-modules.git//s3/private-bucket?ref=v1.4.2"

  bucket_name      = "service-alpha-data-logs-2025"
  enable_versioning = true
  force_destroy    = false # Safety first!
  lifecycle_days   = 90
  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

It looks clean, but what happens when you need to add a specific replication rule? Or a weird CORS policy for a one-off integration? You’re off to another repository to wage war with the module’s maintainer (who is probably you, from six months ago).

Now, the boring, sane, ship-it-today approach:

# service-alpha/main.tf

resource "aws_s3_bucket" "data_bucket" {
  bucket = "service-alpha-data-logs-2025"

  tags = {
    Service   = "alpha"
    ManagedBy = "Terraform"
  }
}

resource "aws_s3_bucket_versioning" "data_bucket_versioning" {
  bucket = aws_s3_bucket.data_bucket.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data_bucket_lifecycle" {
  bucket = aws_s3_bucket.data_bucket.id

  rule {
    id     = "log-expiration"
    status = "Enabled"
    expiration {
      days = 90
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_bucket_access" {
  bucket                  = aws_s3_bucket.data_bucket.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

Is it more lines of code? Yes. Is it gloriously, beautifully, and unapologetically obvious? Absolutely. You can read it, understand it, and change it in thirty seconds. No context switching. No spelunking through another codebase. Just a bucket, doing bucket things.

Where a module is not a swear word

Okay, I’m not a total monster. Modules have their place. They are the right tool when you are building the foundations, not the furniture.

A module earns its keep when it defines a stable, slow-moving, and genuinely complex pattern that you truly want to be identical everywhere. Think of it like the plumbing and electrical wiring of a house. You don’t reinvent it for every room.

Good candidates for a module include:

VPC and core networking: The highway system of your cloud. Build it once, build it well, and then leave it alone.
Kubernetes cluster baselines: The core EKS/GKE/AKS setup, IAM roles, and node group configurations.
Security and telemetry agents: The non-negotiable stuff that absolutely must run on every single instance.
IAM roles for CI/CD: A standardized way for your deployment pipeline to get the permissions it needs.

The key difference? These things change on a scale of months or years, not days or weeks.

Your escape plan from module purgatory

What if you’re reading this and nodding along in despair, already trapped in a gilded cage of your own abstractions? Don’t panic. There’s a way out, and it doesn’t require a six-month migration project.

Freeze everything: First, go to every service that uses the problematic module and pin the version number. ref=v1.4.2. No more floating on main. You’ve just stopped the bleeding.
Take inventory: In one service, run a Terraform state list to see the exact resources managed by the module.
Perform the adoption: This is the magic trick. Write the plain HCL code for those resources directly in your service’s configuration. Then, tell Terraform that the old resource (inside the module) and your new resource (the plain HCL) are actually the same thing. You do this with a moved block or the Terraform state mv command.

Let’s say your module created a bucket. The state address is module.service_alpha_bucket.aws_s3_bucket.this[0]. Your new plain resource is aws_s3_bucket.data_bucket.

You would run:

terraform state mv 'module.service_alpha_bucket.aws_s3_bucket.this[0]' aws_s3_bucket.data_bucket

Verify and obliterate: Run terraform plan. It should come back with “No changes. Your infrastructure matches the configuration.” The plan is clean. You are now free. Delete the module block, pop the champagne, and submit your PR. Repeat for other services, one at a time. No heroics.

Fielding objections from the back row

When you propose this radical act of simplicity, someone will inevitably raise their hand.

“But we need standards!” You absolutely do. Standardize on things that matter: tags, naming conventions, and security policies. Enforce them with tools like tflint, checkov, and OPA/Gatekeeper. A linter yelling at you in a PR is infinitely better than a module silently deploying the wrong thing everywhere.
“What about junior developers? They need a paved road!” They do. A haunted mega-module with 50 input variables is not a paved road; it’s a labyrinth with a minotaur. A better “paved road” is a folder of well-documented, copy-pasteable examples of plain HCL for common tasks.
“Compliance will have questions!” Good. Let them. A tiny, focused, version-pinned module for your IAM boundary policy is a fantastic answer. A sprawling, do-everything wrapper module that changes every week is a compliance nightmare waiting to happen.

The gospel of ‘Good Enough’ for now

Stop trying to solve tomorrow’s problems today. That perfect, infinitely configurable abstraction you’re dreaming of is a solution in search of a problem you don’t have yet.

Don’t optimize for DRY. Optimize for change.

Small teams don’t need fewer lines of HCL; they need fewer places to look when something breaks at 3 PM on a Friday. They need clarity, not cleverness. Keep your power tools for the heavy-duty work. Save the cathedral for when you’ve actually founded a religion.

For now, ship the bucket, and go get lunch.

September 14, 2025 by Fernando SRE Cloud stuff DevOps stuff

Your Kubernetes rollback is lying

The PagerDuty alert screams. The new release, born just minutes ago with such promising release notes, is coughing up blood in production. The team’s Slack channel is a frantic mess of flashing red emojis. Someone, summoning the voice of a panicked adult, yells the magic word: “ROLLBACK!”

And so, Helm, our trusty tow-truck operator, rides in with a smile, waving its friendly green check marks. The dashboards, those silent accomplices, beam with the serene glow of healthy metrics. Kubernetes probes, ever so polite, confirm that the resurrected pods are, in fact, “breathing.”

Then, production face-plants. Hard.

The feeling is like putting a cartoon-themed bandage on a burst water pipe and then wondering, with genuine surprise, why the living room has become a swimming pool. This article is the autopsy of those “perfect” rollbacks. We’re going to uncover why your monitoring is a pathological liar, how network traffic becomes a double agent, and what to do so that the next time Helm gives you a thumbs-up, you can actually believe it.

A state that refuses to time-travel

The first, most brutal lie a rollback tells you is that it can turn back time. A helm rollback is like the “rewind” button on an old VCR remote; it diligently rewinds the tape (your YAML manifests), but it has absolutely no power to make the actors on screen younger.

Your application’s state is one of those stubborn actors.

While your ConfigMaps and Secrets might dutifully revert to their previous versions, your data lives firmly in the present. If your new release included a database migration that added a column, rolling back the application code doesn’t magically make that column disappear. Now your old code is staring at a database schema from the future, utterly confused, like a medieval blacksmith being handed an iPad.

The same goes for PersistentVolumeClaims, external caches like Redis, or messages sitting in a Kafka queue. The rollback command whispers sweet nothings about returning to a “known good state,” but it’s only talking about itself. The rest of your universe has moved on, and it refuses to travel back with you.

The overly polite doorman

The second culprit in our investigation is the Kubernetes probe. Think of the readinessProbe as an overly polite doorman at a fancy party. Its job is to check if a guest (your pod) is ready to enter. But its definition of “ready” can be dangerously optimistic.

Many applications, especially those running on the JVM, have what we’ll call a “warming up” period. When a pod starts, the process is running, the HTTP port is open, and it will happily respond to a simple /health check. The doorman sees a guest in a tuxedo and says, “Looks good to me!” and opens the door.

What the doorman doesn’t see is that this guest is still stretching, yawning, and trying to remember where they are. The application’s caches are cold, its connection pools are empty, and its JIT compiler is just beginning to think about maybe, possibly, optimizing some code. The first few dozen requests it receives will be painfully slow or, worse, time out completely.

So while your readinessProbe is giving you a green light, your first wave of users is getting a face full of errors. For these sleepy applications, you need a more rigorous bouncer.

A startupProbe is that bouncer. It gives the app a generous amount of time to get its act together before even letting the doorman (readiness and liveness probes) start their shift.

# This probe gives our sleepy JVM app up to 5 minutes to wake up.
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  # Kubelet will try 30 times with a 10-second interval (300 seconds).
  # If the app isn't ready by then, the pod will be restarted.
  failureThreshold: 30
  periodSeconds: 10

Without it, your rollback creates a fleet of pods that are technically alive but functionally useless, and Kubernetes happily sends them a flood of unsuspecting users.

Traffic, the double agent

And that brings us to our final suspect: the network traffic itself. In a modern setup using a service mesh like Istio or Linkerd, traffic routing is a sophisticated dance. But even the most graceful dancer can trip.

When you roll back, a new ReplicaSet is created with the old pod specification. The service mesh sees these new pods starting up, asks the doorman (readinessProbe) if they’re good to go, gets an enthusiastic “yes!”, and immediately starts sending them a percentage of live production traffic.

This is where all our problems converge. Your service mesh, in its infinite efficiency, has just routed 50% of your user traffic to a platoon of sleepy, confused pods that are trying to talk to a database from the future.

Let’s look at the evidence. This VirtualService, which we now call “The 50/50 Disaster Splitter,” was routing traffic with criminal optimism.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 50 # 50% to the (theoretically) working pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 50 # 50% to the pods we just dragged from the past

The service mesh isn’t malicious. It’s just an incredibly efficient tool that is very good at following bad instructions. It sees a green light and hits the accelerator.

A survival guide that won’t betray you

So, you’re in the middle of a fire, and the “break glass in case of emergency” button is a lie. What do you do? You need a playbook that acknowledges reality.

Step 0: Breathe and isolate the blast radius

Before you even think about rolling back, stop the bleeding. The fastest way to do that is often at the traffic level. Use your service mesh or ingress controller to immediately shift 100% of traffic back to the last known good version. Don’t wait for new pods to start. This is a surgical move that takes seconds and gives you breathing room.

Step 1: Declare an incident and gather the detectives

Get the right people on a call. Announce that this is not a “quick rollback” but an incident investigation. Your goal is to understand why the release failed, not just to hit the undo button.

Step 2: Perform the autopsy (while the system is stable)

With traffic safely routed away from the wreckage, you can now investigate. Check the logs of the failed pods. Look at the database. Is there a schema mismatch? A bad configuration? This is where you find the real killer.

Step 3: Plan the counter-offensive (which might not be a rollback)

Sometimes, the safest path forward is a roll forward. A small hotfix that corrects the issue might be faster and less risky than trying to force the old code to work with a new state. A rollback should be a deliberate, planned action, not a panic reflex. If you must roll back, do it with the knowledge you’ve gained from your investigation.

Step 4: The deliberate, cautious rollback

If you’ve determined a rollback is the only way, do it methodically.

Scale down the broken deployment:
kubectl scale deployment/checkout-api –replicas=0
Execute the Helm rollback:
helm rollback checkout-api 1 -n prod-eu-central
Watch the new pods like a hawk: Monitor their logs and key metrics as they come up. Don’t trust the green check marks.
Perform a Canary Release: Once the new pods look genuinely healthy, use your service mesh to send them 1% of the traffic. Then 10%. Then 50%. Then 100%. You are now in control, not the blind optimism of the automation.

The truth will set you free

A Kubernetes rollback isn’t a time machine. It’s a YAML editor with a fancy title. It doesn’t understand your data, it doesn’t appreciate your app’s need for a morning coffee, and it certainly doesn’t grasp the nuances of traffic routing under pressure.

Treating a rollback as a simple, safe undo button is the fastest way to turn a small incident into a full-blown outage. By understanding the lies it tells, you can build a process that trusts human investigation over deceptive green lights. So the next time a deployment goes sideways, don’t just reach for the rollback lever. Reach for your detective’s hat instead.

September 11, 2025 by Fernando SRE DevOps stuff Kubernetes SRE stuff

Avoiding serverless chaos with 3 essential Lambda patterns

Your first Lambda function was a thing of beauty. Simple, elegant, it did one job and did it well. Then came the second. And the tenth. Before you knew it, you weren’t running an application; you were presiding over a digital ant colony, with functions scurrying in every direction without a shred of supervision.

AWS Lambda, the magical service that lets us run code without thinking about servers, can quickly devolve into a chaotic mess of serverless spaghetti. Each function lives happily in its own isolated bubble, and when demand spikes, AWS kindly hands out more and more bubbles. The result? An anarchic party of concurrent executions.

But don’t despair. Before you consider a career change to alpaca farming, let’s introduce three seasoned wranglers who will bring order to your serverless circus. These are the architectural patterns that separate the rookies from the maestros in the art of building resilient, scalable systems.

Meet the micromanager boss

First up is a Lambda with a clipboard and very little patience. This is the Command Pattern function. Its job isn’t to do the heavy lifting—that’s what the interns are for. Its sole purpose is to act as the gatekeeper, the central brain that receives an order, scrutinizes it (request validation), consults its dusty rulebook (business logic), and then barks commands at its underlings to do the actual work.

It’s the perfect choice for workflows where bringing in AWS Step Functions would be like using a sledgehammer to crack a nut. It centralizes decision-making and maintains a crystal-clear separation between those who think and those who do.

When to hire this boss

For small to medium workflows that need a clear, single point of control.
When you need a bouncer at the door to enforce rules before letting anyone in.
If you appreciate a clean hierarchy: one boss, many workers.

A real-world scenario

An OrderProcessor Lambda receives a new order via API Gateway. It doesn’t trust anyone. It first validates the payload, saves a record to DynamoDB so it can’t get lost, and only then does it invoke other Lambdas: one to handle the payment, another to send a confirmation email, and a third to notify the shipping department. The boss orchestrates; the workers execute. Clean and effective.

Visually, it looks like a central hub directing traffic:

Here’s how that boss might delegate a task to the notifications worker:

// The Command Lambda (e.g., process-order-command)
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambdaClient = new LambdaClient({ region: "us-east-1" });

export const handler = async (event) => {
    const orderDetails = JSON.parse(event.body);

    // 1. Validate and save the order (your business logic here)
    console.log(`Processing order ${orderDetails.orderId}...`);
    // ... logic to save to DynamoDB ...

    // 2. Delegate to the notification worker
    const invokeParams = {
        FunctionName: 'arn:aws:lambda:us-east-1:123456789012:function:send-confirmation-email',
        InvocationType: 'Event', // Fire-and-forget
        Payload: JSON.stringify({
            orderId: orderDetails.orderId,
            customerEmail: orderDetails.customerEmail,
        }),
    };

    await lambdaClient.send(new InvokeCommand(invokeParams));

    return {
        statusCode: 202, // Accepted
        body: JSON.stringify({ message: "Order received and is being processed." }),
    };
};

The dark side of micromanagement

Be warned. This boss can become a bottleneck. If all decisions flow through one function, it can get overwhelmed. It also risks becoming a “God Object,” a monstrous function that knows too much and does too much, making it a nightmare to maintain and a single, terrifying point of failure.

Enter the patient courier

So, what happens when the micromanager gets ten thousand requests in one second? It chokes, your system grinds to a halt, and you get a frantic call from your boss. The Command Pattern’s weakness is its synchronous nature. We need a buffer. We need an intermediary.

This is where the Messaging Pattern comes in, embodying the art of asynchronous patience. Here, instead of talking directly, services drop messages into a queue or stream (like SQS, SNS, or Kinesis). A consumer Lambda then picks them up whenever it’s ready. This builds healthy boundaries between your services, absorbs sudden traffic bursts like a sponge, and ensures that if something goes wrong, the message can be retried.

When to Call the Courier

For bursty or unpredictable workloads that would otherwise overwhelm your system.
To isolate slow or unreliable third-party services from your main request path.
When you need to offload heavy tasks to be processed in the background.
If you need a guarantee that a task will be executed at least once, with a safety net (a Dead-Letter Queue) for messages that repeatedly fail.

A Real-World Scenario

A user clicks “Checkout.” Instead of processing everything right away, the API Lambda simply drops an OrderPlaced event into an SQS queue and immediately returns a success message to the user. On the other side, a ProcessOrderQueue Lambda consumes events from the queue at its own pace. It reserves inventory, charges the credit card, and sends notifications. If the payment service is down, SQS holds the message, and the Lambda tries again later. No lost orders, no frustrated users.

The flow decouples the producer from the consumer:

The producer just needs to drop the message and walk away:

// The Producer Lambda (e.g., checkout-api)
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqsClient = new SQSClient({ region: "us-east-1" });

export const handler = async (event) => {
    const orderDetails = JSON.parse(event.body);

    const command = new SendMessageCommand({
        QueueUrl: "[https://sqs.us-east-1.amazonaws.com/123456789012/OrderProcessingQueue](https://sqs.us-east-1.amazonaws.com/123456789012/OrderProcessingQueue)",
        MessageBody: JSON.stringify(orderDetails),
        MessageGroupId: orderDetails.orderId // For FIFO queues
    });

    await sqsClient.send(command);

    return {
        statusCode: 200,
        body: JSON.stringify({ message: "Your order is confirmed!" }),
    };
};

The price of patience

This resilience isn’t free. The biggest trade-off is added latency; you’re introducing an extra step. It also makes end-to-end tracing more complex. Debugging a journey that spans across a queue can feel like trying to track a package with no tracking number.

Unleash the Ttown crier

Sometimes, one piece of news needs to be told to everyone, all at once, without waiting for them to ask. You don’t want a single boss delegating one by one, nor a courier delivering individual letters. You need a proclamation.

The Fan-Out Pattern is your digital town crier. A single event is published to a central hub (typically an SNS topic or EventBridge), which then broadcasts it to any services that have subscribed. Each subscriber is a Lambda function that kicks into action in parallel, completely unaware of the others.

When to shout from the rooftops

When a single event needs to trigger multiple, independent downstream processes.
For building real-time, event-driven architectures where services react to changes.
In high-scale systems where parallel processing is a must.

A real-world scenario

An OrderPlaced event is published to an SNS topic. Instantly, this triggers multiple Lambdas in parallel: one to update inventory, another to send a confirmation email, and a third for the analytics pipeline. The beauty is that the publisher doesn’t know or care who is listening. You can add a fifth or sixth subscriber later without ever touching the original publishing code.

One event triggers many parallel actions:

The publisher’s job is delightfully simple:

// The Publisher Lambda (e.g., reservation-service)
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";

const snsClient = new SNSClient({ region: "us-east-1" });

export const handler = async (event) => {
    // ... logic to create a reservation ...
    const reservationDetails = {
        reservationId: "res-xyz-123",
        customerEmail: "jane.doe@example.com",
    };

    const command = new PublishCommand({
        TopicArn: "arn:aws:sns:us-east-1:123456789012:NewReservationsTopic",
        Message: JSON.stringify(reservationDetails),
        MessageAttributes: {
            'eventType': {
                DataType: 'String',
                StringValue: 'RESERVATION_CONFIRMED'
            }
        }
    });

    await snsClient.send(command);

    return { status: "SUCCESS", reservationId: reservationDetails.reservationId };
};

The dangers of a loud voice

With great power comes a great potential for a massive, distributed failure. A single poison-pill event could trigger dozens of Lambdas, each failing and retrying, leading to an invocation storm and a bill that will make your eyes water. Careful monitoring and robust error handling in each subscriber are non-negotiable.

Choosing your champions

There you have it: the Micromanager, the Courier, and the Town Crier. Three patterns that form the bedrock of almost any serverless architecture worth its salt.

Use the Command Pattern when you need a firm hand on the tiller.
Adopt the Messaging Pattern to give your services breathing room and resilience.
Leverage the Fan-Out Pattern when one event needs to efficiently kickstart a flurry of activity.

The real magic begins when you combine them. But for now, start seeing your Lambdas not as a chaotic mob of individual functions, but as a team of specialists. With a little architectural guidance, they can build systems that are complex, resilient, and, best of all, cause you far fewer operational headaches.

September 7, 2025 by Fernando SRE Cloud stuff SRE stuff

Serverless without the wait

I once bought a five-minute rice cooker that spent four of those minutes warming up with a pathetic hum. It delivered the goods, eventually, but the promise felt… deceptive. For years, AWS Lambda felt like that gadget. It was the perfect kitchen tool for the odd jobs: a bit of glue code here, a light API there. It was the brilliant, quick-fire microwave of our architecture.

Then our little kitchen grew into a full-blown restaurant. Our “hot path”, the user checkout process, became the star dish on our menu. And our diners, quite rightly, expected it to be served hot and fast every time, not after a polite pause while the oven preheated. That polite pause was our cold start, and it was starting to leave a bad taste.

This isn’t a story about how we fell out of love with Lambda. We still adore it. This is the story of how we moved our main course to an industrial-grade, always-on stove. It’s about what we learned by obsessively timing every step of the process and why we still keep that trusty microwave around for the side dishes it cooks so perfectly. Because when your p95 latency needs to be boringly predictable, keeping the kitchen warm isn’t a preference; it’s a law of physics.

What forced us to remodel the kitchen

No single event pushed us over the edge. It was more of a slow-boiling frog situation, a gradual realization that our ambitions were outgrowing our tools. Three culprits conspired against our sub-300ms dream.

First, our traffic got moody. What used to be a predictable tide of requests evolved into sudden, sharp tsunamis during business hours. We needed a sea wall, not a bucket.

Second, our user expectations tightened. We set a rather tyrannical goal of a sub-300ms p95 for our checkout and search paths. Suddenly, the hundreds of milliseconds Lambda spent stretching and yawning before its first cup of coffee became a debt we couldn’t afford.

Finally, our engineers were getting tired. We found ourselves spending more time performing sacred rituals to appease the cold start gods, fiddling with layers, juggling provisioned concurrency, than we did shipping features our users actually cared about. When your mechanics spend more time warming up the engine than driving the car, you know something’s wrong.

The punchline isn’t that Lambda is “bad.” It’s that our requirements changed. When your performance target drops below the cost of a cold start plus dependency initialization, physics sends you a sternly worded letter.

Numbers don’t lie, but anecdotes do

We don’t ask you to trust our feelings. We ask you to trust the stopwatch. Replicate this experiment, adjust it for your own tech stack, and let the data do the talking. The setup below is what we used to get our own facts straight. All results are our measurements as of September 2025.

The test shape

Endpoint: Returns a simple 1 KB JSON payload.
Comparable Compute: Lambda set to 512 MB vs. an ECS Fargate container task with 0.5 vCPU and 1 GB of memory.
Load Profile: A steady, closed-loop 100 requests per second (RPS) for 10 minutes.
Metrics Reported: p50, p90, p95, p99 latency, and the dreaded error rate.

Our trusty tools

Load Generator: The ever-reliable k6.
Metrics: A cocktail of CloudWatch and Prometheus.
Dashboards: Grafana, to make the pretty charts that managers love.

Your numbers will be different. That’s the entire point. Run the tests, get your own data, and then make a decision based on evidence, not a blog post (not even this one).

Where our favorite gadget struggled

Under the harsh lights of our benchmark, Lambda’s quirks on our hot path became impossible to ignore.

Cold start spikes: Provisioned Concurrency can tame these, but it’s like hiring a full-time chauffeur to avoid a random 10-minute wait for a taxi. It costs you a constant fee, and during a real rush hour, you might still get stuck in traffic.
The startup toll: Initializing SDKs and warming up connections added tens to hundreds of milliseconds. This “entry fee” was simply too high to hide under our 300ms p95 goal.
The debugging labyrinth: Iterating was slow. Local emulators helped, but parity was a myth that occasionally bit us. Debugging felt like detective work with half the clues missing.

Lambda continues to be a genius for event glue, sporadic jobs, and edge logic. It just stopped being the right tool to serve our restaurant’s most popular dish at rush hour.

Calling in the heavy artillery

We moved our high-traffic endpoints to container-native services. For us, that meant ECS on Fargate fronted by an Application Load Balancer (ALB). The core idea is simple: keep a few processes warm and ready at all times.

Here’s why it immediately helped:

Warm processes: No more cold start roulette. Our application was always awake, connection pools were alive, and everything was ready to go instantly.
Standardized packaging: We traded ZIP files for standard Docker images. What we built and tested on our laptops was, byte for byte, what we shipped to production.
Civilized debugging: We could run the exact same image locally and attach a real debugger. It was like going from candlelight to a floodlight.
Smarter scaling: We could maintain a small cadre of warm tasks as a baseline and then scale out aggressively during peaks.

A quick tale of the tape

Here’s a simplified look at how the two approaches stacked up for our specific needs.

Our surprisingly fast migration plan

We did this in days, not weeks. The key was to be pragmatic, not perfect.

1. Pick your battles: We chose our top three most impactful endpoints with the worst p95 latency.

2. Put it in a box: We converted the function handler into a tiny web service. It’s less dramatic than it sounds.

# Dockerfile (Node.js example)
FROM node:22-slim
WORKDIR /usr/src/app

COPY package*.json ./
RUN npm ci --only=production

COPY . .

ENV NODE_ENV=production PORT=3000
EXPOSE 3000
CMD [ "node", "server.js" ]

// server.js
const http = require('http');
const port = process.env.PORT || 3000;

const server = http.createServer((req, res) => {
  if (req.url === '/health') {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    return res.end('ok');
  }

  // Your actual business logic would live here
  const body = JSON.stringify({ success: true, timestamp: Date.now() });
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(body);
});

server.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});

3. Set up the traffic cop: We created a new target group for our service and pointed a rule on our Application Load Balancer to it.

{
  "family": "payment-api",
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "requiresCompatibilities": ["FARGATE"],
  "executionRoleArn": "arn:aws:iam::987654321098:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::987654321098:role/paymentTaskRole",
  "containerDefinitions": [
    {
      "name": "app-container",
      "image": "[987654321098.dkr.ecr.us-east-1.amazonaws.com/payment-api:2.1.0](https://987654321098.dkr.ecr.us-east-1.amazonaws.com/payment-api:2.1.0)",
      "portMappings": [{ "containerPort": 3000, "protocol": "tcp" }],
      "environment": [{ "name": "NODE_ENV", "value": "production" }]
    }
  ]
}

4. The canary in the coal mine: We used weighted routing to dip our toes in the water. We started by sending just 5% of traffic to the new container service.

# Terraform Route 53 weighted canary
resource "aws_route53_record" "api_primary_lambda" {
  zone_id = var.zone_id
  name    = "api.yourapp.com"
  type    = "A"

  alias {
    name                   = aws_api_gateway_domain_name.main.cloudfront_domain_name
    zone_id                = aws_api_gateway_domain_name.main.cloudfront_zone_id
    evaluate_target_health = true
  }

  set_identifier = "primary-lambda-path"
  weight         = 95
}

resource "aws_route53_record" "api_canary_container" {
  zone_id = var.zone_id
  name    = "api.yourapp.com"
  type    = "A"

  alias {
    name                   = aws_lb.main_alb.dns_name
    zone_id                = aws_lb.main_alb.zone_id
    evaluate_target_health = true
  }

  set_identifier = "canary-container-path"
  weight         = 5
}

5. Stare at the graphs: For one hour, we watched four numbers like hawks: p95 latency, error rates, CPU/memory headroom on the new service, and our estimated cost per million requests.

6. Go all in (or run away): The graphs stayed beautifully, boringly flat. So we shifted to 50%, then 100%. The whole affair was done in an afternoon.

The benchmark kit you can steal

Don’t just read about it. Run a quick test yourself.

// k6 script (save as test.js)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 100,
  duration: '5m',
  thresholds: {
    'http_req_duration': ['p(95)<250'], // Aim for a 250ms p95
    'checks': ['rate>0.999'],
  },
};

export default function () {
  const url = __ENV.TARGET_URL || '[https://api.yourapp.com/checkout/v2/quote](https://api.yourapp.com/checkout/v2/quote)';
  const res = http.get(url);
  check(res, { 'status is 200': r => r.status === 200 });
  sleep(0.2); // Small pause between requests
}

Run it from your terminal like this:

k6 run -e TARGET_URL=https://your-canary-endpoint.com test.js

Our results for context

These aren’t universal truths; they are snapshots of our world. Your mileage will vary.

The numbers in bold are what kept us up at night and what finally let us sleep. For our steady traffic, the always-on container was not only faster and more reliable, but it was also shaping up to be cheaper.

Lambda is still in our toolbox

We didn’t throw the microwave out. We just stopped using it to cook the Thanksgiving turkey. Here’s where we still reach for Lambda without a second thought:

Sporadic or bursty workloads: Those once-a-day reports or rare event handlers are perfect for scale-to-zero.
Event glue: It’s the undisputed champion of transforming S3 puts, reacting to DynamoDB streams, and wiring up EventBridge.
Edge logic: For tiny header manipulations or rewrites, Lambda@Edge and CloudFront Functions are magnificent.

Lambda didn’t fail us. We outgrew its default behavior for a very specific, high-stakes workload. We cheated physics by keeping our processes warm, and in return, our p95 stopped stretching like hot taffy.

If your latency targets and traffic shape look anything like ours, please steal our tiny benchmark kit. Run a one-day canary. See what the numbers tell you. The goal isn’t to declare one tool a winner, but to spend less time arguing with physics and more time building things that people love.

September 2, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff