Kubernetes leases or the art of waiting for the bathroom

If you looked inside a running Kubernetes cluster with a microscope, you would not see a perfectly choreographed ballet of binary code. You would see a frantic, crowded open-plan office staffed by thousands of employees who have consumed dangerous amounts of espresso. You have schedulers, controllers, and kubelets all sprinting around, frantically trying to update databases and move containers without crashing into each other.

It is a miracle that the whole thing does not collapse into a pile of digital rubble within seconds. Most human organizations of this size descend into bureaucratic infighting before lunch. Yet, somehow, Kubernetes keeps this digital circus from turning into a riot.

You might assume that the mechanism preventing this chaos is a highly sophisticated, cryptographic algorithm forged in the fires of advanced mathematics. It is not. The thing that keeps your cluster from eating itself is the distributed systems equivalent of a sticky note on a door. It is called a Lease.

And without this primitive, slightly passive-aggressive little object, your entire cloud infrastructure would descend into anarchy faster than you can type kubectl delete namespace.

The sticky note of power

To understand why a Lease is necessary, we have to look at the psychology of a Kubernetes controller. These components are, by design, incredibly anxious. They want to ensure that the desired state of the world matches the actual state.

The problem arises when you want high availability. You cannot just have one controller running because if it dies, your cluster stops working. So you run three replicas. But now you have a new problem. If all three replicas try to update the same routing table or create the same pod at the exact same moment, you get a “split-brain” scenario. This is the technical term for a psychiatric emergency where the left hand deletes what the right hand just created.

Kubernetes solves this with the Lease object. Technically, it is an API resource in the coordination.k8s.io group. Spiritually, it is a “Do Not Disturb” sign hung on a doorknob.

If you look at the YAML definition of a Lease, it is almost insultingly simple. It does not ask for a security clearance or a biometric scan. It essentially asks three questions:

holderIdentity: Who are you?
leaseDurationSeconds: How long are you going to be in there?
renewTime: When was the last time you shouted that you are still alive?

Here is what one looks like in the wild:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cluster-coordination-lock
  namespace: kube-system
spec:
  holderIdentity: "controller-pod-beta-09"
  leaseDurationSeconds: 15
  renewTime: "2023-10-27T10:04:05.000000Z"

In plain English, this document says: “Controller Beta-09 is holding the steering wheel. It has fifteen seconds to prove it has not died of a heart attack. If it stays silent for sixteen seconds, we are legally allowed to pry the wheel from its cold, dead fingers.”
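
The arithmetic behind that sentence fits in a few lines of Python. This is a minimal sketch, not the logic any real Kubernetes client runs: the function name lease_expired and the plain dict standing in for the Lease spec are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def lease_expired(spec: dict, now: datetime) -> bool:
    """Return True when the holder has been silent for longer than the lease duration."""
    # renewTime is an RFC 3339 timestamp; swap the trailing "Z" for an explicit UTC offset.
    renew_time = datetime.fromisoformat(spec["renewTime"].replace("Z", "+00:00"))
    deadline = renew_time + timedelta(seconds=spec["leaseDurationSeconds"])
    return now > deadline


spec = {
    "holderIdentity": "controller-pod-beta-09",
    "leaseDurationSeconds": 15,
    "renewTime": "2023-10-27T10:04:05.000000Z",
}

ten_seconds_later = datetime(2023, 10, 27, 10, 4, 15, tzinfo=timezone.utc)
sixteen_seconds_later = datetime(2023, 10, 27, 10, 4, 21, tzinfo=timezone.utc)

print(lease_expired(spec, ten_seconds_later))      # False
print(lease_expired(spec, sixteen_seconds_later))  # True
```

Comparing against the timestamp stored in the object keeps the example short. It quietly assumes every clock in the building agrees, which is exactly the kind of optimism distributed systems tend to punish.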

An awkward social experiment

To really grasp the beauty of this system, we need to leave the server room and enter a shared apartment with a terrible design flaw. There is only one bathroom, the lock is broken, and there are five roommates who all drank too much water.

The bathroom is the “critical resource.” In a computerized world without Leases, everyone would just barge in whenever they felt the urge. This leads to what engineers call a “race condition” and what normal people call “an extremely embarrassing encounter.”

Since we cannot fix the lock, we install a whiteboard on the door. This is the Lease.

The rules of this apartment are strict but effective. When you walk up to the door, you write your name and the current time on the board. You have now acquired the lock. As long as your name is there and the timestamp is fresh, the other roommates will stand in the hallway, crossing their legs and waiting politely.

But here is where it gets stressful. You cannot just write your name and fall asleep in the tub. The system requires constant anxiety. Every few seconds, you have to crack the door open, reach out with a marker, and update the timestamp. This is the “heartbeat.” It tells the people waiting outside that you are still conscious and haven’t slipped in the shower.

If you faint, or if the WiFi cuts out and you cannot reach the whiteboard, you stop updating the time. The roommates outside watch the clock. Ten seconds pass. Fifteen seconds. At sixteen seconds, they do not knock to see if you are okay. They assume you are gone forever, wipe your name off the board, write their own, and barge in.

It is ruthless, but it ensures that the bathroom is never left empty just because the previous occupant vanished into the void.

The paranoia of leader election

The most critical use of this bathroom logic is something called Leader Election. This is the mechanism that keeps your kube-controller-manager and kube-scheduler from turning into a bar fight.

You typically run multiple copies of these control plane components for redundancy. However, you absolutely cannot have five different schedulers trying to assign the same pod to five different nodes simultaneously. That would be like having five conductors trying to lead the same orchestra. You do not get music; you get noise and a lot of angry musicians.

So, the replicas hold an election. But it is not a democratic vote with speeches and ballots. It is a race to grab the marker.

The moment the controllers start up, they all rush toward the Lease object. The first one to write its name in the holderIdentity field becomes the Leader. The others, the candidates, do not go home. They stand in the corner, staring at the Lease, refreshing the page every two seconds, waiting for the Leader to fail.

There is something deeply human about this setup. The backup replicas are not “supporting” the leader. They are jealous understudies watching the lead actor, hoping he breaks a leg so they can take center stage.

If the Leader crashes or simply gets stuck in a network traffic jam, the renewTime stops updating. The lease expires. Immediately, the backups scramble to write their own name. The winner takes over the cluster duties instantly. It is seamless, automated, and driven entirely by the assumption that everyone else is unreliable.
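
The race for the marker can be sketched as a single-process simulation. Everything here is hypothetical: the Lease dataclass, the try_acquire_or_renew function, and the replica names are illustrative stand-ins, and time is passed in as a plain number so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration_seconds: float = 15.0


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """One election round for one candidate; True means it is the leader afterwards."""
    expired = now > lease.renew_time + lease.duration_seconds
    if lease.holder is None or lease.holder == candidate or expired:
        # Free, already ours, or stale: write our name and a fresh timestamp.
        lease.holder = candidate
        lease.renew_time = now
        return True
    # Someone else holds a fresh lease: stay in the corner and retry later.
    return False


lease = Lease()
print(try_acquire_or_renew(lease, "replica-a", now=0.0))   # True: first to grab the marker
print(try_acquire_or_renew(lease, "replica-b", now=2.0))   # False: the lease is fresh
print(try_acquire_or_renew(lease, "replica-a", now=10.0))  # True: the leader renews
print(try_acquire_or_renew(lease, "replica-b", now=24.0))  # False: only 14 seconds of silence
print(try_acquire_or_renew(lease, "replica-b", now=26.0))  # True: 16 seconds, lease expired
print(lease.holder)                                        # replica-b
```

In a real cluster the write goes through the API server, which rejects an update based on a stale copy of the object. That rejection is what makes "first to write" well defined when several replicas lunge at once. This sketch skips it because there is only one process in the room.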

Reducing the noise pollution

In the early days of Kubernetes, things were even messier. Nodes, the servers doing the actual work, had to prove they were alive by sending a massive status report to the API server every few seconds.

Imagine a receptionist who has to process a ten-page medical history form from every single employee every ten seconds, just to confirm they are at their desks. It was exhausting. The API server spent so much time reading these reports that it barely had time to do anything else.

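Today, Kubernetes uses Leases for node heartbeats, too. Instead of the full medical report, the kubelet on each node just renews a small Lease object, and sends the full status only when something changes or after a few minutes of silence. It is a quick, lightweight ping.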

“I’m here.”

“Good.”

“Still here.”

“Great.”

This change reduced the computational cost of staying alive significantly. The API server no longer needs to know your blood pressure and cholesterol levels every ten seconds; it just needs to know you are breathing. It turns a bureaucratic nightmare into a simple check-in.

How to play with fire

The beauty of the Lease system is that it is just a standard Kubernetes object. You can see these sticky notes right now. If you list the leases in the kube-system namespace, you will see the normally invisible machinery that keeps the lights on:

kubectl get leases -n kube-system

You will see entries for the controller manager and the scheduler. The per-node heartbeat Leases live next door in the kube-node-lease namespace, one for every node in your cluster. If you want to see who the current boss is, you can describe the lease:

kubectl describe lease kube-scheduler -n kube-system

You will see the holderIdentity. That is the name of the replica currently running the show.

Now, if you are feeling particularly chaotic, or if you just want to see the world burn, you can delete a Lease manually.

kubectl delete lease kube-scheduler -n kube-system

Please do not do this in production unless you enjoy panic attacks.

Deleting an active Lease is like ripping the “Occupied” sign off the bathroom door while someone is inside. You are effectively lying to the system. You are telling the backup controllers, “The leader is dead! Long live the new leader!”

The backups will rush in and elect a new leader. But the old leader, who was effectively just sitting there minding its own business, is still running. Suddenly, it realizes it has been fired without notice. Ideally, it steps down gracefully. But in the split second before it realizes what happened, you might have two controllers giving orders.

The system will heal itself, usually within seconds, but those few seconds are a period of profound confusion for everyone involved.

The survival of the loudest

Leases are the unsung heroes of the cloud native world. We like to talk about Service Meshes and eBPF and other shiny, complex technologies. But at the bottom of the stack, keeping the whole thing from exploding, is a mechanism as simple as a name on a whiteboard.

It works because it accepts a fundamental truth about distributed systems: nothing is reliable, everyone is going to crash eventually, and the only way to maintain order is to force components to shout “I am alive!” every few seconds.

Next time your cluster survives a node failure or a controller restart without you even noticing, spare a thought for the humble Lease. It is out there in the void, frantically renewing timestamps, protecting you from the chaos of a split-brain scenario. And that is frankly better than a lock on a bathroom door any day.

Managing the emotional stability of your Linux server

Thursday, 3:47 AM. Your server is named Nigel. You named him Nigel because deep down, despite the silicon and the circuitry, he feels like a man who organizes his spice rack alphabetically by the Latin name of the plant. But right now, Nigel is not organizing spices. Nigel has decided to stage a full-blown existential rebellion.

The screen is black. The network fan is humming with a tone of passive-aggressive silence. A cursor blinks in the upper-left corner with a rhythm that seems designed specifically to induce migraines. You reboot. Nigel reboots. Nothing changes. The machine is technically “on,” in the same way a teenager staring at the ceiling for six hours is technically “awake.”

At this moment, the question separating the seasoned DevOps engineer from the panicked googler is not “Why me?” but rather: Which personality did Nigel wake up with today?

This is not a technical question. It is a psychological one. Linux does not break at random; it merely changes moods. It has emotional states. And once you learn to read them, troubleshooting becomes less like exorcising a demon and more like coaxing a sulking relative out of the bathroom during Thanksgiving dinner.

The grumpy grandfather who started it all

We lived in a numeric purgatory for years. In an era when “multitasking” sounded like dangerous witchcraft and coffee came only in one flavor (scorched), Linux used a system called SysVinit to manage its temperaments. This system boiled the entire machine’s existence down to a handful of numbers, zero through six, called runlevels.

It was a rigid caste system. Each number was a dial you could turn to decide how much Nigel was willing to participate in society.

Runlevel 0 meant Nigel was checking out completely. Death. Runlevel 6 meant Nigel had decided to reincarnate. Runlevel 1 was Nigel as a hermit monk, holed up in a cave with no network, no friends, just a single shell and a vow of digital silence. Runlevel 5 was Nigel on espresso and antidepressants, graphical interface blazing, ready to party and consume RAM for no apparent reason.

This was functional, in the way a Soviet-era tractor is functional. It was also about as intuitive as a dishwasher manual written in cuneiform. You would tell a junior admin to “boot to runlevel 3,” and they would nod while internally screaming. What does three mean? Is it better than two? Is five twice as good as three? The numbers did not describe anything; they just were, like the arbitrary rules of a board game invented by someone who actively hated you.

And then there was runlevel 4. Runlevel 4 is the appendix of the Linux anatomy. It is vaguely present, historically relevant, but currently just taking up space. It was the “user-definable” switch in your childhood home that either did nothing or controlled the neighbor’s garage door. It sits there, unused, gathering digital dust.

Enter the overly organized therapist

Then came systemd. If SysVinit was a grumpy grandfather, systemd is the high-energy hospital administrator who carries a clipboard and yells at people for walking too slowly. Systemd took one look at those numbered mood dials and was appalled. “Numbers? Seriously? Even my router has a name.”

It replaced the cold digits with actual descriptive words: multi-user.target, graphical.target, rescue.target. It was as if Linux had finally gone to therapy and learned to use its words to express its feelings instead of grunting “runlevel 3” when it really meant “I need personal space, but WiFi would be nice.”

Targets are just runlevels with a humanities degree. They perform the exact same job, defining which services start, whether the GUI is invited to the party, whether networking gets a plus-one, but they do so with the kind of clarity that makes you wonder how we survived the numeric era without setting more server rooms on fire.

A Rosetta Stone for Nigel’s mood swings

Here is the translation guide that your cheat sheet wishes it had. Think of this as the DSM-5 for your server.

Runlevel 0 becomes poweroff.target
Nigel is taking a permanent nap. This is the Irish Goodbye of operating states.
Runlevel 1 becomes rescue.target
Nigel is in intensive care. Only family is allowed to visit (root user). The network is unplugged, the drives might be mounted read-only, and the atmosphere is grim. This is where you go when you have broken something fundamental and need to perform digital surgery.
Runlevel 3 becomes multi-user.target
Nigel is wearing sweatpants but answering emails. This is the gold standard for servers. Networking is up, multiple users can log in, cron jobs are running, but there is no graphical interface to distract anyone. It is a state of pure, joyless productivity. (Runlevels 2 and 4 land here as well by default.)
Runlevel 5 becomes graphical.target
Nigel is in full business casual with a screensaver. He has loaded the window manager, the display server, and probably a wallpaper of a cat. He is ready to interact with a mouse. He is also consuming an extra gigabyte of memory just to render window shadows.
Runlevel 6 becomes reboot.target
Nigel is hitting the reset button on his life.
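
For the script-minded, the same translation guide fits in a lookup table. This is a small sketch: the names RUNLEVEL_TO_TARGET and target_for_runlevel are made up for the example. The rows for runlevels 2 and 4 are not in the list above; they reflect systemd's default compatibility aliases, which point both at multi-user.target.

```python
RUNLEVEL_TO_TARGET = {
    0: "poweroff.target",
    1: "rescue.target",
    2: "multi-user.target",
    3: "multi-user.target",
    4: "multi-user.target",
    5: "graphical.target",
    6: "reboot.target",
}


def target_for_runlevel(runlevel: int) -> str:
    """Translate a SysVinit runlevel number into its systemd target name."""
    try:
        return RUNLEVEL_TO_TARGET[runlevel]
    except KeyError:
        raise ValueError(f"runlevel must be between 0 and 6, got {runlevel!r}") from None


print(target_for_runlevel(3))  # multi-user.target
print(target_for_runlevel(5))  # graphical.target
```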

The command line couch

Knowing Nigel’s mood is useless unless you can change it. You need tools to intervene. These are the therapy techniques you keep in your utility belt.

To eyeball Nigel’s default personality (the one he wakes up with every morning), you ask:

systemctl get-default

This might spit back graphical.target. This means Nigel is a morning person who greets the world with a smile and a heavy user interface. If it says multi-user.target, Nigel is the coffee-before-conversation type.

But sometimes, you need to force a mood change. Let’s say you want to switch Nigel from party mode (graphical) to hermit mode (text-only) without making it permanent. You are essentially putting an extrovert in a quiet room for a breather.

systemctl isolate multi-user.target

The word “isolate” here is perfect. It is not “disable” or “kill.” It is “isolate”. It sounds less like computer administration and more like what happens to the protagonist in the third act of a horror movie involving Antarctic research stations. It tells systemd to stop everything that doesn’t belong in the new target. The GUI vanishes. The silence returns.

To switch back, because sometimes you actually need the pretty buttons:

systemctl isolate graphical.target

And to permanently change Nigel’s baseline disposition, akin to telling a chronically late friend that dinner is at 6:30 when it is really at 7:00:

systemctl set-default multi-user.target

Now Nigel will always wake up in Command Line Interface mode, even after a reboot. You can practically hear the sigh of relief from your CPU as it realizes it no longer has to render pixels.
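
The difference between a temporary mood change and a permanent one is easy to blur in automation, so here is a hedged sketch that keeps them apart. The helper names mood_change_command and apply_mood_change and the KNOWN_TARGETS allow-list are assumptions for illustration; only the systemctl isolate and systemctl set-default subcommands come from the text above.

```python
import subprocess

KNOWN_TARGETS = {"multi-user.target", "graphical.target", "rescue.target"}


def mood_change_command(target: str, permanent: bool = False) -> list:
    """Build the systemctl argv: isolate switches now, set-default applies from the next boot."""
    if target not in KNOWN_TARGETS:
        raise ValueError(f"unexpected target: {target}")
    action = "set-default" if permanent else "isolate"
    return ["systemctl", action, target]


def apply_mood_change(target: str, permanent: bool = False) -> None:
    """Run the command. Needs root and a systemd host, so it is not called in this example."""
    subprocess.run(mood_change_command(target, permanent), check=True)


print(mood_change_command("multi-user.target"))
# ['systemctl', 'isolate', 'multi-user.target']
print(mood_change_command("multi-user.target", permanent=True))
# ['systemctl', 'set-default', 'multi-user.target']
```

Note that set-default on its own does not touch the running session. If you want Nigel to change both today and every day after, you need both commands.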

When Nigel has a real breakdown

Let’s walk through some actual disasters, because theory is just a hobby until production goes down and your boss starts hovering behind your chair breathing through his mouth.

Scenario one: The fugue state

Nigel updated his kernel and now boots to a black screen. He is not dead; he is just confused. You reboot, interrupt the boot loader, and add systemd.unit=rescue.target to the boot parameters.

Nigel wakes up in a safe room. It is a root shell. There is no networking. There is no drama. It is just you and the config files. It is intimate, in a disturbing way. You fix the offending setting, type systemctl reboot, and Nigel reboots into his normal self, slightly embarrassed about the whole episode.

Scenario two: The toddler on espresso

Nigel’s graphical interface has started crashing like a toddler after too much sugar. Every time you log in, the desktop environment panics and dies. Instead of fighting it, you switch to multi-user.target.

Nigel is now a happy, stable server with no interest in pretty icons. Your users can still SSH in. Your automated jobs still run. Nigel just doesn’t have to perform anymore. It is like taking the toddler out of the Chuck E. Cheese and putting him in a library. The screaming stops immediately.

Scenario three: The bloatware incident

Nigel is a production web server that has inexplicably slowed to a crawl. You dig through the logs and discover that an intern (let’s call him “Not-Fernando”) installed a full desktop environment six months ago because he liked the screensaver.

This is akin to buying a Ferrari to deliver pizza because you like the leather seats. The graphical target is eating resources that your database desperately needs. You set the default to multi-user.target and reboot. Nigel comes back lean, mean, and suddenly has five hundred extra megabytes of RAM to play with. It is like watching someone shed a winter coat in the middle of July.

The mindset shift

Beginners see a black screen and ask, “Why is Nigel broken?” Professionals see a black screen and ask, “Which target is Nigel in, and which services are active?”

This is not just semantics. It is the difference between treating a symptom and diagnosing a disease. When you understand that Linux doesn’t break so much as it changes states, you stop being a victim of circumstance and start being a negotiator. You are not praying to the machine gods; you are simply asking Nigel, “Hey buddy, what mood are you in?” and then coaxing him toward a more productive state.

The panic evaporates because you know the vocabulary. You know that rescue.target is a panic room, multi-user.target is a focused work session, and graphical.target is Nigel trying to impress someone at a party.

Linux targets are not arcane theory reserved for greybeards and certification exams. They are the foundational language of state management. They are how you tell Nigel, “It is okay to be a hermit today,” or “Time to socialize,” or “Let’s check you into therapy real quick.”

Once you internalize this, boot issues stop being terrifying mysteries. They become logical puzzles. Interviews stop being interrogations. They become conversations. You stop sounding like a generic admin reading a forum post and start sounding like someone who knows Nigel personally.

Because you do. Nigel is that fussy, brilliant, occasionally melodramatic friend who just needs the right kind of encouragement. And now you have the exact words to provide it.

December 22, 2025 by Fernando SRE Cloud stuff DevOps stuff Linux Stuff SRE stuff

Let IAM handle the secrets you can avoid

There are two kinds of secrets in cloud security.

The first kind is the legitimate kind: a third-party API token, a password for something you do not control, a certificate you cannot simply wish into existence.

The second kind is the kind we invent because we are in a hurry: long-lived access keys, copied into a config file, then copied into a Docker image, then copied into a ticket, then copied into the attacker’s weekend plans.

This article is about refusing to participate in that second category.

Not because secrets are evil. Because static credentials are the “spare house key under the flowerpot” of AWS. Convenient, popular, and a little too generous with access for something that can be photographed.

The goal is not “no secrets exist.” The goal is no secrets live in code, in images, or in long-lived credentials.

If you do that, your security posture stops depending on perfect human behavior, which is great because humans are famously inconsistent. (We cannot all be trusted with a jar of cookies, and we definitely cannot all be trusted with production AWS keys.)

Why this works in real life

AWS already has a mechanism designed to prevent your applications from holding permanent credentials: IAM roles and temporary credentials (STS).

When your Lambda runs with an execution role, AWS hands it short-lived credentials automatically. They rotate on their own. There is nothing to copy, nothing to stash, nothing to rotate in a spreadsheet named FINAL-final-rotation-plan.xlsx.

What remains are the unavoidable secrets, usually tied to systems outside AWS. For those, you store them in AWS Secrets Manager and retrieve them at runtime. Not at build time. Not at deploy time. Not by pasting them into an environment variable and calling it “secure” because you used uppercase letters.

This gives you a practical split:

Avoidable secrets are replaced by IAM roles and temporary credentials
Unavoidable secrets go into Secrets Manager, encrypted and tightly scoped

The architecture in one picture

A simple flow to keep in mind:

A Lambda function runs with an IAM execution role
The function fetches one third-party API key from Secrets Manager at runtime
The function calls the third-party API and writes results to DynamoDB
Network access to Secrets Manager stays private through a VPC interface endpoint (when the Lambda runs in a VPC)

The best part is what you do not see.

No access keys. No “temporary” keys that have been temporary since 2021. No secrets baked into ZIPs or container layers.

What this protects you from

This pattern is not a magic spell. It is a seatbelt.

It helps reduce the chance of:

Credentials leaking through Git history, build logs, tickets, screenshots, or well-meaning copy-paste
Forgotten key rotation schedules that quietly become “never”
Overpowered policies that turn a small bug into a full account cleanup
Unnecessary public internet paths for sensitive AWS API calls

Now let’s build it, step by step, with code snippets that are intentionally sanitized.

Step 1 build an IAM execution role with tight policies

The execution role is the front door key your Lambda carries.

If you give it access to everything, it will eventually use that access, if only because your future self will forget why it was there and leave it in place “just in case.”

Keep it boring. Keep it small.

Here is an example IAM policy for a Lambda that only needs to:

write to one DynamoDB table
read one secret from Secrets Manager
decrypt using one KMS key (optional, depending on how you configure encryption)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteToOneTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:111122223333:table/app-results-prod"
    },
    {
      "Sid": "ReadOneSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:thirdparty/weather-api-key-*"
    },
    {
      "Sid": "DecryptOnlyThatKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:eu-west-1:111122223333:key/12345678-90ab-cdef-1234-567890abcdef",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "secretsmanager.eu-west-1.amazonaws.com"
        }
      }
    }
  ]
}

A few notes that save you from future regret:

The secret ARN ends with -* because Secrets Manager appends a hyphen and six random characters to the secret name when it builds the ARN.
The KMS condition helps ensure the key is used only through Secrets Manager, not as a general-purpose decryption service.
You can skip the explicit kms:Decrypt statement if you use the AWS-managed key and accept the default behavior, but customer-managed keys are common in regulated environments.
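
If you maintain more than one function, assembling the ARNs by hand is where typos creep in. The sketch below rebuilds the same three statements from their moving parts. The function name build_execution_policy and its parameters are hypothetical; the values passed in are the sanitized ones from the example policy.

```python
import json


def build_execution_policy(region, account_id, table, secret_name, kms_key_id):
    """Assemble the three-statement policy shown above from its moving parts."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteToOneTable",
                "Effect": "Allow",
                "Action": ["dynamodb:PutItem", "dynamodb:UpdateItem"],
                "Resource": f"arn:aws:dynamodb:{region}:{account_id}:table/{table}",
            },
            {
                "Sid": "ReadOneSecret",
                "Effect": "Allow",
                "Action": ["secretsmanager:GetSecretValue"],
                # The trailing -* covers the random suffix on the secret ARN.
                "Resource": f"arn:aws:secretsmanager:{region}:{account_id}:secret:{secret_name}-*",
            },
            {
                "Sid": "DecryptOnlyThatKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt"],
                "Resource": f"arn:aws:kms:{region}:{account_id}:key/{kms_key_id}",
                "Condition": {
                    "StringEquals": {"kms:ViaService": f"secretsmanager.{region}.amazonaws.com"}
                },
            },
        ],
    }


policy = build_execution_policy(
    region="eu-west-1",
    account_id="111122223333",
    table="app-results-prod",
    secret_name="thirdparty/weather-api-key",
    kms_key_id="12345678-90ab-cdef-1234-567890abcdef",
)
print(json.dumps(policy, indent=2))
```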

Step 2 store the unavoidable secret properly

Secrets Manager is not a place to dump everything. It is a place to store what you truly cannot avoid.

A third-party API key is a perfect example because IAM cannot replace it. AWS cannot assume a role in someone else’s SaaS.

Use a JSON secret so you can extend it later without creating a new secret every time you add a field.

{
  "api_key": "REDACTED-EXAMPLE-TOKEN"
}

If you like the CLI (and I do, because buttons are too easy to misclick), create the secret like this:

aws secretsmanager create-secret \
  --name "thirdparty/weather-api-key" \
  --description "Token for the Weatherly API used by the ingestion Lambda" \
  --secret-string '{"api_key":"REDACTED-EXAMPLE-TOKEN"}' \
  --region eu-west-1

Then configure:

encryption with a customer-managed KMS key if required
rotation if the provider supports it (rotation is amazing when it is real, and decorative when the vendor does not allow it)

If the vendor does not support rotation, you still benefit from central storage, access control, audit logging, and removing the secret from code.

Step 3 lock down secret access with a resource policy

Identity-based policies on the Lambda role are necessary, but resource policies are a nice extra lock.

Think of it like this: your role policy is the key. The resource policy is the bouncer who checks the wristband.

Here is a resource policy that allows only one role to read the secret.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyIngestionRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    },
    {
      "Sid": "DenyEverythingElse",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
        }
      }
    }
  ]
}

This is intentionally strict. Strict is good. Strict is how you avoid writing apology emails.

Step 4 keep Secrets Manager traffic private with a VPC endpoint

If your Lambda runs inside a VPC, it will not automatically have internet access. That is often the point.

In that case, you do not want the function reaching Secrets Manager through a NAT gateway if you can avoid it. NAT works, but it is like walking your valuables through a crowded shopping mall because the back door is locked.

Use an interface VPC endpoint for Secrets Manager.

Here is a Terraform example (sanitized) that creates the endpoint and limits access using a dedicated security group.

resource "aws_security_group" "secrets_endpoint_sg" {
  name        = "secrets-endpoint-sg"
  description = "Allow HTTPS from Lambda to Secrets Manager endpoint"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.eu-west-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.secrets_endpoint_sg.id]
}

If your Lambda is not in a VPC, you do not need this step. The function runs on a network that AWS manages for Lambda and reaches the regional Secrets Manager endpoint over TLS by default.

If you want to go further, consider adding a DynamoDB gateway endpoint too, so your function can write to DynamoDB without touching the public internet.

Step 5 retrieve the secret at runtime without turning logs into a confession

This is where many teams accidentally reinvent the problem.

They remove the secret from the code, then log it. Or they put it in an environment variable because “it is not in the repository,” which is a bit like saying “the spare key is not under the flowerpot, it is under the welcome mat.”

The clean approach is:

store only the secret name (not the secret value) as configuration
retrieve the value at runtime
cache it briefly to reduce calls and latency
never print it, even when debugging, especially when debugging

Here is a Python example for AWS Lambda with a tiny TTL cache.

import json
import os
import time
import boto3

_secrets_client = boto3.client("secretsmanager")
_cached_value = None
_cached_until = 0

SECRET_ID = os.getenv("THIRDPARTY_SECRET_ID", "thirdparty/weather-api-key")
CACHE_TTL_SECONDS = int(os.getenv("SECRET_CACHE_TTL_SECONDS", "300"))


def _get_api_key() -> str:
    global _cached_value, _cached_until

    now = int(time.time())
    if _cached_value and now < _cached_until:
        return _cached_value

    resp = _secrets_client.get_secret_value(SecretId=SECRET_ID)
    payload = json.loads(resp["SecretString"])

    api_key = payload["api_key"]
    _cached_value = api_key
    _cached_until = now + CACHE_TTL_SECONDS
    return api_key


def lambda_handler(event, context):
    api_key = _get_api_key()

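    # Use the key without ever logging it.
    # call_weatherly_api and write_to_dynamodb are placeholders for your own
    # helpers; they are not defined in this snippet.
    results = call_weatherly_api(api_key=api_key, city=event.get("city", "Seville"))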

    write_to_dynamodb(results)

    return {
        "status": "ok",
        "items": len(results) if hasattr(results, "__len__") else 1
    }

This snippet is intentionally short. The important part is the pattern:

minimal secret access
controlled cache
zero secret output

If you prefer a library, AWS provides a Secrets Manager caching client for some runtimes, and AWS Lambda Powertools can help with structured logging. Use them if they fit your stack.
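
For the "never print it" rule, a safety net helps on the day someone forgets. Below is a minimal sketch of a logging filter that scrubs known secret values before a record is emitted. The class name RedactSecrets, the logger name, and the token value are all made up for the example, and the filter only catches exact matches of values you hand it, so treat it as a backstop rather than a licence to log freely.

```python
import logging


class RedactSecrets(logging.Filter):
    """Replace known secret values with a placeholder before a record is emitted."""

    def __init__(self, secrets):
        super().__init__()
        # Skip empty strings so an unset value never matches everywhere.
        self._secrets = [s for s in secrets if s]

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for secret in self._secrets:
            message = message.replace(secret, "[REDACTED]")
        # Store the scrubbed text and clear args so it is not formatted a second time.
        record.msg = message
        record.args = ()
        return True


logger = logging.getLogger("ingestion")
handler = logging.StreamHandler()
handler.addFilter(RedactSecrets(["not-a-real-token-123"]))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("calling Weatherly with key=%s", "not-a-real-token-123")
# emits: calling Weatherly with key=[REDACTED]
```

Attaching the filter to the handler rather than the logger means it also scrubs records that arrive from child loggers through that handler.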

Step 6 make security noisy with logs and alarms

Security without visibility is just hope with a nicer font.

At a minimum:

enable CloudTrail in the account
ensure Secrets Manager events are captured
alert on unusual secret access patterns

A simple and practical approach is to send CloudTrail events to CloudWatch Logs and add a metric filter for GetSecretValue events coming from unexpected principals. Another is to build a dashboard showing:

Lambda errors
Secrets Manager throttles
sudden spikes in secret reads
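
To make "unexpected principals" concrete, here is a small sketch of the same check applied to CloudTrail records that have already been parsed into dictionaries. The function names and the two sample events are fabricated for illustration, and the sample records carry only the handful of fields the check reads; real CloudTrail records contain many more.

```python
EXPECTED_ROLE_ARN = "arn:aws:iam::111122223333:role/lambda-ingestion-prod"


def principal_arn(event: dict) -> str:
    """Best-effort principal: the issuing role for assumed-role sessions, else the caller ARN."""
    identity = event.get("userIdentity", {})
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    return issuer.get("arn") or identity.get("arn", "unknown")


def unexpected_secret_reads(events, expected_arn=EXPECTED_ROLE_ARN):
    """Return GetSecretValue events whose principal is not the expected role."""
    return [
        e for e in events
        if e.get("eventSource") == "secretsmanager.amazonaws.com"
        and e.get("eventName") == "GetSecretValue"
        and principal_arn(e) != expected_arn
    ]


sample_events = [
    {   # the Lambda role doing its job
        "eventSource": "secretsmanager.amazonaws.com",
        "eventName": "GetSecretValue",
        "userIdentity": {
            "type": "AssumedRole",
            "arn": "arn:aws:sts::111122223333:assumed-role/lambda-ingestion-prod/ingest",
            "sessionContext": {"sessionIssuer": {"arn": EXPECTED_ROLE_ARN}},
        },
    },
    {   # a human poking around
        "eventSource": "secretsmanager.amazonaws.com",
        "eventName": "GetSecretValue",
        "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/curious-intern"},
    },
]

for event in unexpected_secret_reads(sample_events):
    print(principal_arn(event))
# arn:aws:iam::111122223333:user/curious-intern
```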

Here is a tiny Terraform example that keeps your Lambda logs from living forever (because storage is forever, but your attention span is not).

resource "aws_cloudwatch_log_group" "lambda_logs" {
  name              = "/aws/lambda/lambda-ingestion-prod"
  retention_in_days = 14
}

Also consider:

IAM Access Analyzer to spot risky resource policies
AWS Config rules or guardrails if your organization uses them
an alarm on unexpected NAT data processing if you intended to keep traffic private

Common mistakes I have made, so you do not have to

I am listing these because I have either done them personally or watched them happen in slow motion.

Using a wildcard secret policy
secretsmanager:GetSecretValue on * feels convenient until it is a breach multiplier.
Putting secret values into environment variables
Environment variables are not evil, but they are easy to leak through debugging, dumps, tooling, or careless logging. Store secret names there, not secret contents.
Retrieving secrets at build time
Build logs live forever in the places you forget to clean. Runtime retrieval keeps secrets out of build systems.
Logging too much while debugging
The fastest way to leak a secret is to print it “just once.” It will not be just once.
Skipping the endpoint and relying on NAT by accident
The NAT gateway is not evil either. It is just an expensive and unnecessary hallway if a private door exists.
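The first mistake on that list is easy to catch mechanically. Below is a rough Python sketch that scans an IAM policy document (as a plain dictionary) for Allow statements that grant secret reads on every resource. The function names and the sample ARN are made up for illustration, and the check deliberately ignores conditions and NotAction or NotResource, so treat it as a starting point, not an audit tool.

```python
def _as_list(value):
    """IAM accepts either a single string or a list for Action and Resource."""
    return value if isinstance(value, list) else [value]


def find_wildcard_secret_statements(policy):
    """Return Allow statements that grant GetSecretValue on all resources."""
    risky = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        grants_read = any(
            action in ("*", "secretsmanager:*", "secretsmanager:GetSecretValue")
            for action in actions
        )
        if grants_read and "*" in resources:
            risky.append(statement)
    return risky


too_broad = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "*",
        }
    ],
}

scoped = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            # Hypothetical ARN pattern for a single secret.
            "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:prod/weatherly/api-key-*",
        }
    ],
}

print(len(find_wildcard_secret_statements(too_broad)), "risky statement(s) in too_broad")
print(len(find_wildcard_secret_statements(scoped)), "risky statement(s) in scoped")
```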

A two-minute checklist you can steal

Your Lambda uses an IAM execution role, not access keys
The role policy scopes Secrets Manager access to one secret ARN pattern
The secret has a resource policy that only allows the expected role
Secrets are encrypted with a customer managed KMS key when required
The secret value is never stored in code, images, build logs, or environment variables
If Lambda runs in a VPC, you use an interface VPC endpoint for Secrets Manager
You have CloudTrail enabled and you can answer “who accessed this secret” without guessing

Extra thoughts

If you remove long-lived credentials from your applications, you remove an entire class of problems.

You stop rotating keys that should never have existed in the first place.

You stop pretending that “we will remember to clean it up later” is a security strategy.

And you get a calmer life, which is underrated in engineering.

Let IAM handle the secrets you can avoid.

Then let Secrets Manager handle the secrets you cannot.

And let your code do what it was meant to do: process data, not babysit keys like they are a toddler holding a permanent marker.

December 14, 2025 by Fernando SRE Cloud stuff DevOps stuff

How Dropbox saved millions by leaving AWS

Most of us treat cloud storage like a magical, bottomless attic. You throw your digital clutter into a folder: PDFs of tax returns from 2014, blurred photos of a cat that has long since passed away, unfinished drafts of novels, and you forget about them. It feels weightless. It feels ephemeral. But somewhere in a windowless concrete bunker in Virginia or Oregon, a spinning platter of rust is working very hard to keep those cat photos alive. And every time that platter spins, a meter is running.

For most of its first decade, Dropbox was essentially a very polished, user-friendly frontend for Amazon’s garage. When you saved a file to Dropbox, their servers handled the metadata (the index card that says where the file is), but the actual payload (the bytes themselves) was quietly ushered into Amazon S3. It was a brilliant arrangement. It allowed a small startup to scale without worrying about hard drives catching fire or power supplies exploding.

But then Dropbox grew up. And when you grow up, living in a hotel starts to get expensive.

By 2015, Dropbox was storing hundreds of petabytes of data. The problem wasn’t just the storage fee, which is akin to paying rent. The real killer was the “egress” and request fees. Amazon’s business model is brilliantly designed to function like the Hotel California: you can check out any time you like, but leaving with your luggage is going to cost you a fortune. Every time a user opened a file, edited a document, or synced a folder, a tiny cash register dinged in Jeff Bezos’s headquarters.

The bill was no longer just an operating expense. It was an existential threat. The unit economics were starting to look less like a software business and more like a philanthropy dedicated to funding Amazon’s R&D department.

So, they decided to do something that is generally considered suicidal in the modern software era. They decided to leave the cloud.

The audacity of building your own closet

In Silicon Valley, telling investors you plan to build your own data centers is like telling your spouse you plan to perform your own appendectomy using a steak knife and a YouTube tutorial. It is seen as messy, dangerous, and generally regressive. The prevailing wisdom is that hardware is a commodity, a utility like electricity or sewage, and you should let the professionals handle the sludge.

Dropbox ignored this. They launched a project with the internally ironic name “Magic Pocket.” The goal was to build a storage system from scratch that was cheaper than Amazon S3 but just as reliable.

To understand the scale of this bad idea, you have to understand that S3 is a miracle of engineering. It boasts “eleven nines” of durability (99.999999999%). That means if you store 10,000 files, you might lose one every 10 million years. Replicating that level of reliability requires an obsessive, almost pathological attention to detail.
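The arithmetic behind that claim is short enough to check by hand, or with a few lines of Python. This is a back-of-the-envelope sketch that treats "eleven nines" as an independent one in 10^11 annual loss chance per object, which is a simplification of how durability is actually modeled. The function name is made up for the example.

```python
from fractions import Fraction

# "Eleven nines" of annual durability leaves a 1 in 10^11 chance of losing
# any given object in a year (treated here as independent per object).
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_PROBABILITY
    return 1 / expected_losses_per_year


print(f"{int(years_per_lost_object(10_000)):,} years between losses for 10,000 files")
```

Exact fractions are used instead of floats so the result does not pick up rounding noise at eleven decimal places.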

Dropbox wasn’t just buying servers from Dell and plugging them in. They were designing their own chassis. They realized that standard storage servers were too generic. They needed density. They built a custom box nicknamed “Diskotech” (because engineers love puns almost as much as they love caffeine) that could cram up to a petabyte of storage into a single chassis that was barely deeper than a coffee table.

But hardware has a nasty habit of obeying the laws of physics, and physics is often annoying.

Good vibrations and bad hard drives

When you pack hundreds of spinning hard drives into a tight metal box, you encounter a phenomenon that sounds like a joke but is actually a nightmare: vibration.

Hard drives are mechanical divas. They consist of magnetic platters spinning at 7,200 revolutions per minute, with a read/write head hovering nanometers above the surface. If the drive vibrates too much, that head can’t find the track. It misses. It has to wait for the platter to spin around again. This introduces latency. If enough drives in a rack vibrate in harmony, the performance drops off a cliff.
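To put a number on "wait for the platter to spin around again", here is a tiny Python calculation. It only uses the 7,200 RPM figure from above; the helper name is invented for the example, and the "three missed passes" line is an arbitrary illustration, not a measured figure.

```python
def ms_per_revolution(rpm):
    """Time for one full platter rotation, in milliseconds."""
    return 60_000 / rpm


full_turn = ms_per_revolution(7200)
print(f"one revolution: {full_turn:.2f} ms")
print(f"average wait for a sector: {full_turn / 2:.2f} ms")
print(f"three missed passes: {3 * full_turn:.2f} ms")
```

A little over eight milliseconds per missed pass sounds harmless until you multiply it by every drive in the rack and every request in flight.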

The Dropbox team found that even the fans cooling the servers were causing acoustic vibrations that made the hard drives sulk. They had to become experts in firmware, damping materials, and the resonant frequencies of sheet metal. It is the kind of problem you simply do not have when you rent space in the cloud. In the cloud, a vibrating server is someone else’s ticket. When you own the metal, it’s your weekend.

Then there was the software. They couldn’t just use off-the-shelf Linux tools. They wrote the performance-critical parts of their storage software in Rust. At the time, Rust was the new kid on the block, a language that promised memory safety without the garbage collection pauses of Go or Java. Using a relatively new language to manage the world’s most precious data was a gamble, but it paid off. It allowed them to squeeze every ounce of efficiency out of the CPU, keeping the power bill (and the heat) down.

The great migration was a stealth mission

Building the “Magic Pocket” was only half the battle. The other half was moving 500 petabytes of data from Amazon to these new custom-built caverns without losing a single byte and without any user noticing.

They adopted a strategy that I like to call the “belt, suspenders, and duct tape” approach. For a long period, they used a technique called dual writing. Every time you uploaded a file, Dropbox would save a copy to Amazon S3 (the old reliable) and a copy to their new Magic Pocket (the risky experiment).

They then spent months just verifying the data. They would ask the Magic Pocket to retrieve a file, compare it to the S3 version, and check if they matched perfectly. It was a paranoia-fueled audit. Only when they were absolutely certain that the new system wasn’t eating homework did they start disconnecting the Amazon feed.
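The dual-write-then-verify idea can be sketched in a few lines of Python. This is an illustration of the pattern only, not Dropbox's code: the two stores are plain dictionaries standing in for S3 and Magic Pocket, and every class and method name is hypothetical.

```python
import hashlib


def _digest(data):
    """Checksum used to compare the two copies byte for byte."""
    return hashlib.sha256(data).hexdigest()


class DualWriteStore:
    def __init__(self, primary, candidate):
        self.primary = primary      # the old reliable store
        self.candidate = candidate  # the new store still on probation

    def put(self, key, data):
        # Every write lands in both stores.
        self.primary[key] = data
        self.candidate[key] = data

    def verify(self, key):
        """True only if both stores hold identical bytes for the key."""
        old = self.primary.get(key)
        new = self.candidate.get(key)
        if old is None or new is None:
            return False
        return _digest(old) == _digest(new)

    def mismatches(self):
        """Keys whose candidate copy is missing or different."""
        return sorted(key for key in self.primary if not self.verify(key))


old_store, new_store = {}, {}
store = DualWriteStore(old_store, new_store)
store.put("resume.pdf", b"ten years of experience")
store.put("cat.jpg", b"blurry but beloved")

# Simulate the new system eating someone's homework.
new_store["cat.jpg"] = b"blurry but belove"

print("keys that failed verification:", store.mismatches())
```

Only when a check like mismatches() stays empty for long enough does it become reasonable to stop writing to the old store.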

They treated the migration like a bomb disposal operation. They moved users over silently. One day, you were fetching your resume from an AWS server in Virginia; the next day, you were fetching it from a custom Dropbox server in Texas. The transfer speeds were often better, but nobody sent out a press release. The ultimate sign of success in infrastructure engineering is that nobody knows you did anything at all.

The savings were vulgar

The financial impact was immediate and staggering. Over the two years following the migration, Dropbox saved nearly $75 million in operating costs. Their gross margins, the holy grail of SaaS financials, jumped from a worrisome 33% to a healthy 67%.

By owning the hardware, they cut out the middleman’s profit margin. They also gained the ability to use “Shingled Magnetic Recording” (SMR) drives. These are cheaper, high-density drives that are notoriously slow at writing data because the data tracks overlap like roof shingles (hence the name). Standard databases hate them. But because Dropbox wrote their own software specifically for their own use case (write once, read many), they could use these cheap, slow drives without the performance penalty.

This is the hidden superpower of leaving the cloud: optimization. AWS has to build servers that work reasonably well for everyone, from Netflix to the CIA to a teenager running a Minecraft server. That means they are optimized for the average. Dropbox optimized for the specific. They built a suit that fit them perfectly, rather than buying a “one size fits all” poncho from the rack.

Why you should probably not do this

If you are reading this and thinking, “I should build my own data center,” please stop. Go for a walk. Drink some water.

Dropbox’s success is the exception that proves the rule. They had a very specific workload (huge files, rarely modified) and a scale (hundreds of petabytes and growing) that justified the massive R&D expense. They had the budget to hire world-class engineers who dream in Rust and understand the acoustic properties of cooling fans.

For 99% of companies, the cloud is still the right answer. The premium you pay to AWS or Google is not just for storage; it is an insurance policy against complexity. You are paying so that you never have to think about a failed power supply unit at 3:00 AM on a Sunday. You are paying so that you don’t have to negotiate contracts for fiber optic cables or worry about the price of real estate in Nevada.

However, Dropbox didn’t leave the cloud entirely. And this is the punchline.

Today, Dropbox is a hybrid. They store the files, the cold, heavy, static blocks of data, in their own Magic Pocket. But the metadata? The search functions? The flashy AI features that summarize your documents? That all still runs in the cloud.

They treat the public cloud like a utility kitchen. When they need to cook up something complex that requires thousands of CPUs for an hour, they rent them from Amazon or Google. When they just need to store the leftovers, they put them in their own fridge.

Adulthood is knowing when to rent

The story of Dropbox leaving the cloud is not really about leaving. It is about maturity.

In the early days of a startup, you prioritize speed. You pay the “cloud tax” because it allows you to move fast and break things. But there comes a point where the tax becomes a burden.

Dropbox realized that renting is great for flexibility, but ownership is the only way to build equity. They turned a variable cost (a bill that grows every time a user uploads a photo) into a fixed cost (a warehouse full of depreciating assets). It is less sexy. It requires more plumbing.

But there is a quiet dignity in owning your own mess. Dropbox looked at the cloud, with its infinite promise and infinite invoices, and decided that sometimes, the most radical innovation is simply buying a screwdriver, rolling up your sleeves, and building the shelf yourself. Just be prepared for the vibration.

December 7, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The paranoia that keeps Netflix online

On a particularly bleak Monday, October 20, the internet suffered a collective nervous breakdown. Amazon Web Services decided to take a spontaneous nap, and the digital world effectively dissolved. Slack turned into a $27 billion paperweight, leaving office workers forced to endure the horror of unfiltered face-to-face conversation. Disney+ went dark, stranding thousands of toddlers mid-episode of Bluey and forcing parents to confront the terrifying reality of their own unsupervised children. DoorDash robots sat frozen on sidewalks like confused Daleks, threatening the national supply of lukewarm tacos.

Yet, in a suburban basement somewhere in Ohio, a teenager named Tyler streamed all four seasons of Stranger Things in 4K resolution. He did not see a single buffering wheel. He had no idea the cloud was burning down around him.

This is the central paradox of Netflix. They have engineered a system so pathologically untrusting, so convinced that the world is out to get it, that actual infrastructure collapses register as nothing more than a mild inconvenience. I spent weeks digging through technical documentation and bothering former Netflix engineers to understand how they pulled this off. What I found was not just a story of brilliant code. It is a story of institutional paranoia so profound it borders on performance art.

The paranoid bouncer at the door

When you click play on The Crown, your request does not simply waltz into the Netflix servers. It first has to get past the digital equivalent of a nightclub bouncer who suspects everyone of trying to sneak in a weapon. This is Amazon’s Elastic Load Balancer, or ELB.

Most load balancers are polite traffic cops. They see a server and wave you through. Netflix’s ELB is different. It assumes that every server is about three seconds away from exploding.

Picture a nightclub with 47 identical dance floors. The bouncer’s job is to frisk you, judge your shoes, and shove you toward the floor least likely to collapse under the weight of too many people doing the Macarena. The ELB does this millions of times per second. It does not distribute traffic evenly because “even” implies trust. Instead, it routes you to the server with the least outstanding requests. It is constantly taking the blood pressure of the infrastructure.

If a server takes ten milliseconds too long to respond, the ELB treats it like a contagion. It cuts it off. It ghosts it. This is the first commandment of the Netflix religion. Trust nothing. Especially not the hardware you rent by the hour from a company that also sells lawnmowers and audiobooks.
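Routing by least outstanding requests is simple enough to sketch. The Python below is a toy model under stated assumptions, not how ELB is implemented: the class name, the server names, and the eject method are all invented for the example.

```python
class LeastOutstandingBalancer:
    def __init__(self, servers):
        # Number of in-flight requests per server.
        self.outstanding = {server: 0 for server in servers}

    def dispatch(self):
        """Send the next request to the server with the fewest in flight."""
        server = min(self.outstanding, key=self.outstanding.get)
        self.outstanding[server] += 1
        return server

    def finish(self, server):
        """Record that a request on this server has completed."""
        self.outstanding[server] -= 1

    def eject(self, server):
        """Stop routing to a server that is treated as a contagion."""
        self.outstanding.pop(server, None)


balancer = LeastOutstandingBalancer(["floor-a", "floor-b", "floor-c"])
first_three = [balancer.dispatch() for _ in range(3)]

balancer.finish("floor-b")          # floor-b frees up first
next_pick = balancer.dispatch()     # so it gets the next request

balancer.eject("floor-b")           # floor-b turns slow and gets ghosted
pick_after_eject = balancer.dispatch()

print(first_three, next_pick, pick_after_eject)
```

Ties go to whichever server was registered first, which is good enough for a sketch and not something to rely on in production.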

The traffic controller with a god complex

Once you make it past the bouncer, you meet Zuul.

Zuul is the API gateway, but that is a boring term for what is essentially a micromanager with a caffeine addiction. Zuul is the middle manager who insists on being copied on every single email and then rewrites them because he didn’t like your tone.

Its job is to route your request to the right backend service. But Zuul is neurotic. It operates through a series of filters that feel less like software engineering and more like airport security theater. There is an inbound filter that authenticates you (the TSA agent squinting at your passport), an endpoint filter that routes you (the air traffic controller), and an outbound filter that scrubs the response (the PR agent who makes sure the server didn’t say anything offensive).
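The three filter stages can be pictured as a small pipeline. The Python below is a loose sketch of the idea rather than Zuul's actual API (Zuul is a Java project): the function names, the request and response dictionaries, and the internal_debug field are all assumptions made for illustration.

```python
def inbound_filter(request):
    """Reject requests without a token (a stand-in for real authentication)."""
    if not request.get("token"):
        return {"status": 401, "body": "who are you?"}
    return None


def endpoint_filter(request, routes):
    """Pick the backend for the path and call it."""
    handler = routes.get(request["path"])
    if handler is None:
        return {"status": 404, "body": "no such service"}
    return handler(request)


def outbound_filter(response):
    """Scrub anything the backend should not have said out loud."""
    cleaned = dict(response)
    cleaned.pop("internal_debug", None)
    return cleaned


def handle(request, routes):
    rejection = inbound_filter(request)
    response = rejection if rejection else endpoint_filter(request, routes)
    return outbound_filter(response)


routes = {
    "/profiles": lambda request: {
        "status": 200,
        "body": "profile for viewer",
        "internal_debug": "served by host-17",
    }
}

print(handle({"path": "/profiles", "token": "abc"}, routes))
print(handle({"path": "/profiles"}, routes))
```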

All of this runs on the Netty server framework, which sounds cute but is actually a multi-threaded octopus capable of juggling tens of thousands of open connections without dropping a single packet. During the outage, while other companies’ gateways were choking on retries, Zuul continued to sort traffic with the cold detachment of a bureaucrat stamping forms during a fire drill.

A dysfunctional family of specialists

Inside the architecture, there is no single “Netflix” application. There is a squabbling family of thousands of microservices. These are tiny, specialized programs that refuse to speak to each other directly and communicate only through carefully negotiated contracts.

You have Uncle User Profiles, who sits in the corner nursing a grudge about that time you watched seventeen episodes of Is It Cake? at 3 AM. There is Aunt Recommendations, a know-it-all who keeps suggesting The Office because you watched five minutes of it in 2018. Then there is Cousin Billing, who only shows up when money is involved and otherwise sulks in the basement.

This family is held together by a concept called “circuit breaking.” In the old days, they used a library called Hystrix. Think of Hystrix as a court-ordered family therapist with a taser.

When a service fails, let’s say the subtitles database catches fire, most applications would keep trying to call it, waiting for a response that will never come, until the entire system locks up. Netflix does not have time for that. If the subtitle service fails, the circuit breaker pops. The therapist steps in and says, “Uncle Subtitles is having an episode and is not allowed to talk for the next thirty seconds.”

The system then serves a fallback. Maybe you don’t get subtitles for a minute. Maybe you don’t get your personalized list of “Top Picks for Fernando.” But the video plays. The application degrades gracefully rather than failing catastrophically. It is the digital equivalent of losing a limb but continuing to run the marathon because you have a really good playlist going.

Here is a simplified view of how this “fail fast” logic looks in the configuration. It is basically a list of rules for ignoring people who are slow to answer:

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 1000
      circuitBreaker:
        requestVolumeThreshold: 20
        sleepWindowInMilliseconds: 5000

Translated to human English, this configuration says: “If you take more than one second to answer me, you are dead to me. Once I have seen at least twenty requests in the current window and too many of them have failed (half, by Hystrix’s default), I am going to ignore you for five seconds until you get your act together.”
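For readers who think better in code than in YAML, here is a stripped-down circuit breaker in Python. It is a sketch of the pattern, not a port of Hystrix. In Hystrix the request volume threshold is the minimum traffic before the breaker is allowed to trip, and an error percentage (50% by default) decides whether it actually does. This version keeps simple cumulative counters instead of a rolling window, does not implement the one-second timeout, and uses invented names throughout.

```python
import time


class CircuitBreaker:
    def __init__(self, request_volume_threshold=20, error_threshold=0.5,
                 sleep_window_seconds=5.0, clock=time.monotonic):
        self.request_volume_threshold = request_volume_threshold
        self.error_threshold = error_threshold
        self.sleep_window_seconds = sleep_window_seconds
        self.clock = clock
        self.successes = 0
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the sleep window, let a trial request through.
        return self.clock() - self.opened_at >= self.sleep_window_seconds

    def record_success(self):
        if self.opened_at is not None:
            # The trial request worked: close the circuit and start fresh.
            self.opened_at = None
            self.successes = self.failures = 0
        self.successes += 1

    def record_failure(self):
        if self.opened_at is not None:
            # The trial request failed: restart the sleep window.
            self.opened_at = self.clock()
            return
        self.failures += 1
        total = self.successes + self.failures
        if (total >= self.request_volume_threshold
                and self.failures / total >= self.error_threshold):
            self.opened_at = self.clock()

    def call(self, func, fallback):
        if not self.allow_request():
            return fallback()
        try:
            result = func()
        except Exception:
            self.record_failure()
            return fallback()
        self.record_success()
        return result


def broken_subtitles():
    raise TimeoutError("the subtitles database is on fire")


def no_subtitles():
    return "video without subtitles"


fake_now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(clock=lambda: fake_now[0])

for _ in range(20):
    breaker.call(broken_subtitles, no_subtitles)

allowed_while_open = breaker.allow_request()
fake_now[0] += 5.0
allowed_after_sleep = breaker.allow_request()
print(allowed_while_open, allowed_after_sleep)
```

The fallback is the important part: every caller gets an answer immediately, even if the answer is "no subtitles for you right now".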

The digital hoarder pantry

At the scale Netflix operates, data storage is less about organization and more about controlled hoarding. They use a system that only makes sense if you have given up on the concept of minimalism.

They use Cassandra, a NoSQL database, to store user history. Cassandra is like a grandmother who saves every newspaper from 1952 because “you never know.” It is designed to be distributed. You can lose half your hard drives, and Cassandra will simply shrug and serve the data from a replica on another node.

But the real genius, and the reason they survived the apocalypse, is EVCache. This is their homemade caching system based on Memcached. It is a massive pantry where they store snacks they know you will want before you even ask for them.

Here is the kicker. They do not just cache movie data. They cache their own credentials.

When AWS went down, the specific service that failed was often IAM (Identity and Access Management). This is the service that checks if your computer is allowed to talk to the database. When IAM died, servers all over the world suddenly forgot who they were. They were having an identity crisis.

Netflix servers did not care. They had cached their credentials locally. They had pre-loaded the permissions. It is like filling your basement with canned goods, not because you anticipate a zombie apocalypse, but because you know the grocery store manager personally and you know he is unreliable. While other companies were frantically trying to call AWS to ask, “Who am I?”, Netflix’s servers were essentially lip-syncing their way through the performance using pre-recorded tapes.
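The general trick of "keep using what you already have when the identity service is unreachable" can be sketched as a cache that tolerates stale entries. To be clear, this is not Netflix's implementation or EVCache. It is a minimal Python illustration with hypothetical names, and it only helps if the cached credentials themselves are still accepted by whatever you present them to.

```python
import time


class StaleTolerantCache:
    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.value = None
        self.fetched_at = None

    def get(self):
        now = self.clock()
        fresh = (self.fetched_at is not None
                 and now - self.fetched_at < self.ttl_seconds)
        if fresh:
            return self.value
        try:
            self.value = self.fetch()
            self.fetched_at = now
        except Exception:
            if self.fetched_at is None:
                raise  # nothing cached yet, so there is nothing to fall back on
            # Upstream is down: fall through and serve the last known value.
        return self.value


identity_service_up = [True]


def fetch_credentials():
    """Stand-in for a call to an identity service."""
    if not identity_service_up[0]:
        raise ConnectionError("identity service is having an identity crisis")
    return "temporary-token"


fake_now = [0.0]  # fake clock so the example does not need to sleep
cache = StaleTolerantCache(fetch_credentials, ttl_seconds=300,
                           clock=lambda: fake_now[0])

first = cache.get()

identity_service_up[0] = False
fake_now[0] = 301.0           # the cached entry is now past its TTL
during_outage = cache.get()   # still answered from the stale copy

print(first, during_outage)
```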

Hiring a saboteur to guard the vault

This is where the engineering culture goes from sensible to beautifully unhinged. Netflix employs the Simian Army.

This is not a metaphor. It is a suite of software tools designed to break things. The most famous is Chaos Monkey. Its job is to randomly shut down live production servers during business hours. It just kills them. No warning. No mercy.
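The core loop of such a tool is almost embarrassingly small. The Python below is a toy sketch of the selection logic only, with no connection to the real Chaos Monkey code: the group names, instance IDs, business-hours window, and function names are all hypothetical, and nothing is actually terminated.

```python
import random
from datetime import datetime


def is_business_hours(moment):
    """Weekdays between 09:00 and 17:00, so humans are around to react."""
    return moment.weekday() < 5 and 9 <= moment.hour < 17


def pick_victims(groups, moment, rng):
    """Choose one random instance per group, but only during business hours."""
    if not is_business_hours(moment):
        return []
    return [rng.choice(instances) for instances in groups.values() if instances]


groups = {
    "recommendations": ["i-rec-1", "i-rec-2", "i-rec-3"],
    "billing": ["i-bill-1", "i-bill-2"],
    "empty-group": [],
}

tuesday_morning = datetime(2025, 10, 21, 10, 0)
saturday_morning = datetime(2025, 10, 25, 10, 0)
rng = random.Random(42)  # seeded so the example is repeatable

victims = pick_victims(groups, tuesday_morning, rng)
weekend_victims = pick_victims(groups, saturday_morning, rng)

for instance_id in victims:
    print("would terminate", instance_id)
```

Restricting the mayhem to business hours is the whole point: failures are cheaper when the people who can fix them are awake.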

Then there is Chaos Kong. Chaos Kong does not just kill a server. It simulates the destruction of an entire AWS region. It nukes the East Coast.

Let that sink in for a moment. Netflix pays engineers very high salaries to build software that attacks their own infrastructure. It is like hiring a pyromaniac to work as a fire inspector. Sure, he will find every flammable material in the building, but usually by setting it on fire first.

I spoke with a former engineer who described their “region evacuation” drills. “We basically declare war on ourselves,” she told me. “At 10 AM on a Tuesday, usually after the second coffee, we decide to kill us-east-1. The first time we did it, half the company needed therapy. Now? We can evacuate a region in six minutes. It’s boring.”

This is the secret sauce. The reason Netflix stayed up is that they have rehearsed the outage so many times that it feels like a chore. While other companies were discovering their disaster recovery plans were written in crayon, Netflix engineers were calmly executing a routine they practice more often than they practice dental hygiene.

Building your own highway system

There is a final plot twist. When you hit play, the video, strictly speaking, does not come from the cloud. It comes from Open Connect.

Netflix realized years ago that the public internet is a dirt road full of potholes. So they built their own private highway. They designed physical hardware, bright red boxes packed with hard drives, and shipped them to Internet Service Providers (ISPs) all over the world.

These boxes sit inside the data centers of your local internet provider. They are like mini-warehouses. When a new season of The Queen’s Gambit comes out, Netflix pre-loads it onto these boxes at 4 AM when nobody is using the internet.

So when you stream the show, the data is not traveling from an Amazon data center in Virginia. It is traveling from a box down the street. It might travel five miles instead of two thousand.

It is an invasive, brilliant strategy. It is like Netflix insisted on installing a mini-fridge in your neighbor’s garage just to ensure your beer is three degrees colder. During the cloud outage, even if the “brain” of Netflix (the control plane in AWS) was having a seizure, the “body” (the video files in Open Connect) was fine. The content was already local. The cloud could burn, but the movie was already in the house.

The beautiful absurdity of it all

The irony is delicious. Netflix is AWS’s biggest customer and its biggest success story. Yet they survive on AWS by fundamentally refusing to trust AWS. They cache credentials, they pre-pull images, they build their own delivery network, and they unleash monkeys to destroy their own servers just to prove they can survive the murder attempt.

They have weaponized Murphy’s Law. They built a company where the unofficial motto seems to be “Everything fails, all the time, so let’s get good at failing.”

So the next time the internet breaks and your Slack goes silent, do not panic. Just open Netflix. Somewhere in the dark, a Chaos Monkey is pulling a plug, a paranoid bouncer is shoving traffic away from a burning server, and your binge-watching will continue uninterrupted. The internet might be held together by duct tape and hubris, but Netflix has invested in really, really expensive duct tape.

November 28, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The secret and anxious life of a data packet inside AWS

You press a finger against the greasy glass of your smartphone. You are in a café in Melbourne, the coffee is lukewarm, and you have made the executive decision to watch a video of a cat falling off a Roomba. It feels like a trivial action.

But for the data packet birthed by that tap, this is D-Day.

It is a tiny, nervous backpacker being kicked out into the digital wilderness with nothing but a destination address and a crippling fear of latency. Its journey through Amazon’s cloud infrastructure is not the clean, sterile diagram your systems architect drew on a whiteboard. It is a micro drama of hope, bureaucratic routing, and existential dread that plays out in roughly 200 milliseconds.

We tend to think of the internet as a series of tubes, but it is more accurate to think of it as a series of highly opinionated bouncers and overworked bureaucrats. To understand how your cat video loads, we have to follow this anxious packet through the gauntlet of Amazon Web Services (AWS).

The initial panic and the mapmaker with a god complex

Our packet leaves your phone and hits the cellular network. It is screaming for directions. It needs to find the server hosting the video, but it only has a name (e.g., cats.example.com). Computers do not speak English; they speak IP addresses.

Enter Route 53.

Amazon calls Route 53 a Domain Name System (DNS) service. In practice, it acts like a travel agent with a philosophy degree and multiple personality disorder. It does not just look up addresses; it judges you based on where you are standing and how healthy the destination looks.

If Route 53 is configured with Geolocation Routing, it acts like a local snob. It looks at our packet’s passport, sees “Melbourne,” and sneers. “You are not going to the Oregon server. The Americans are asleep, and the latency would be dreadful. You are going to Sydney.”

However, Route 53 is also a hypochondriac. Through Health Checks, it constantly pokes the servers to see if they are alive. It is the digital equivalent of texting a friend, “Are you awake?” every ten seconds. If the Sydney server fails to respond three times in a row, Route 53 assumes the worst (death, fire, or a kernel panic) and promptly reroutes our packet to Singapore. This is Failover Routing, the prepared pessimist of the group.
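The "three strikes and you are rerouted" logic looks roughly like this in Python. It is a simplified model and not Route 53's implementation: the class and function names are invented, and a single passing check resets the counter here, whereas real health checkers usually want several passes in a row before trusting a server again.

```python
class HealthTrackedEndpoint:
    def __init__(self, name, failure_threshold=3):
        self.name = name
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record_check(self, passed):
        """Reset the counter on success, bump it on failure."""
        self.consecutive_failures = 0 if passed else self.consecutive_failures + 1

    @property
    def healthy(self):
        return self.consecutive_failures < self.failure_threshold


def resolve(primary, secondary):
    """Answer with the primary while it is healthy, otherwise fail over."""
    return primary.name if primary.healthy else secondary.name


sydney = HealthTrackedEndpoint("sydney")
singapore = HealthTrackedEndpoint("singapore")

sydney.record_check(False)
sydney.record_check(False)
after_two_failures = resolve(sydney, singapore)

sydney.record_check(False)
after_three_failures = resolve(sydney, singapore)

sydney.record_check(True)
after_recovery = resolve(sydney, singapore)

print(after_two_failures, after_three_failures, after_recovery)
```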

The packet doesn’t care about the logic. It just wants an address so it can stop hyperventilating in the void.

CloudFront is the desperate golden retriever of the internet

Armed with an IP address, our packet rushes toward the destination. But hopefully, it never actually reaches the main server. That would be inefficient. Instead, it runs into CloudFront.

CloudFront is a Content Delivery Network (CDN). Think of it as a network of convenience stores scattered all over the globe, so you don’t have to drive to the factory to buy milk. Or, more accurately, think of CloudFront as a Golden Retriever that wants to please you so badly it is vibrating.

Its job is caching. It memorizes content. When our packet arrives at the CloudFront “Edge Location” in Melbourne, the service frantically checks its pockets. “Do I have the cat video? I think I have the cat video. I fetched it for that guy in the corner five minutes ago!”

If it has the video (a Cache Hit), it hands it over immediately. The packet is relieved. The journey is over. Everyone goes home happy.

But if CloudFront cannot find the video (a Cache Miss), the mood turns sour. The Golden Retriever looks guilty. It now has to turn around and run all the way to the origin server to fetch the data fresh. This is the “Edge” of the network, a place that sounds like a U2 guitarist but is actually just a rack of humming metal in a secure facility near the airport.

The tragedy of CloudFront is the Time To Live (TTL). This is the expiration date on the data. If the TTL is set to 24 hours, CloudFront will proudly hand you a version of the website from yesterday, oblivious to the fact that you updated the spelling errors this morning. It is like a dog bringing you a dead bird it found last week, convinced it is still a great gift.
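The hit, miss, and stale-until-TTL behavior fits in a small Python sketch. This is a generic TTL cache under stated assumptions, not CloudFront's internals: the class name, the dictionary standing in for the origin, and the fake clock are all made up for the example.

```python
import time


class EdgeCache:
    def __init__(self, fetch_from_origin, ttl_seconds, clock=time.monotonic):
        self.fetch_from_origin = fetch_from_origin
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # path -> (content, stored_at)

    def get(self, path):
        """Return (content, 'hit' or 'miss') for the requested path."""
        now = self.clock()
        entry = self.entries.get(path)
        if entry is not None and now - entry[1] < self.ttl_seconds:
            return entry[0], "hit"
        # Cache miss or expired entry: run all the way back to the origin.
        content = self.fetch_from_origin(path)
        self.entries[path] = (content, now)
        return content, "miss"


origin = {"/index.html": "teh homepage"}
fake_now = [0.0]  # fake clock so the example does not need to wait a day
edge = EdgeCache(origin.__getitem__, ttl_seconds=24 * 60 * 60,
                 clock=lambda: fake_now[0])

first = edge.get("/index.html")

origin["/index.html"] = "the homepage"   # typo fixed at the origin
second = edge.get("/index.html")         # the edge still serves yesterday's copy

fake_now[0] = 24 * 60 * 60               # TTL has passed
third = edge.get("/index.html")

print(first, second, third)
```

This is why real deployments pair long TTLs with versioned file names or explicit invalidations: the dog needs to be told the bird is no longer fresh.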

The security guard who judges your shoes

If our packet suffers a Cache Miss, it must travel deeper into the data center. But first, it has to get past the Web Application Firewall (WAF).

The WAF is not a firewall in the traditional sense; it is a nightclub bouncer who has had a very long shift and hates everyone. It stands at the velvet rope, scrutinizing every packet for signs of “malicious intent.”

It checks for SQL injection, which is the digital equivalent of trying to sneak a knife into the club taped to your ankle. It checks for Cross-Site Scripting (XSS), which is essentially trying to trick the club into changing its name to “Free Drinks for Everyone.”

The WAF operates on a set of rules that range from reasonable to paranoid. Sometimes, it blocks a legitimate packet just because it looks suspicious: perhaps the packet is too large, or it came from a country the WAF has decided to distrust today. The packet pleads its innocence, but the WAF is a piece of software; it does not negotiate. It simply returns a 403 Forbidden error, which translates roughly to: “Your shoes are ugly. Get out.”
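A rule set of the "judge your shoes" variety can be caricatured in a few lines of Python. This is a deliberately naive sketch, nowhere near what AWS WAF managed rules actually do: the patterns, the size limit, and the request shape are assumptions for illustration, and real attackers walk straight past regexes like these.

```python
import re

MAX_BODY_BYTES = 8192  # hypothetical size limit for the example

# Crude signatures for SQL injection and cross-site scripting attempts.
SUSPICIOUS_PATTERNS = [
    re.compile(r"(?i)\bunion\b.+\bselect\b"),
    re.compile(r"(?i)'\s*or\s+'?1'?\s*=\s*'?1"),
    re.compile(r"(?i)<\s*script\b"),
]


def inspect(request):
    """Return 403 for requests that look suspicious, otherwise 200."""
    query = request.get("query", "")
    body = request.get("body", "")
    if len(body.encode("utf-8")) > MAX_BODY_BYTES:
        return 403
    if any(p.search(query) or p.search(body) for p in SUSPICIOUS_PATTERNS):
        return 403
    return 200


print(inspect({"query": "city=Seville"}))
print(inspect({"query": "id=1' OR '1'='1"}))
print(inspect({"body": "<script>alert('free drinks')</script>"}))
print(inspect({"body": "x" * 10_000}))
```

The last case is the false positive from the paragraph above: a perfectly innocent but oversized request gets the same treatment as an attack.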

The Application Load Balancer manages the VIP list

Having survived the bouncer, our weary packet arrives at the Application Load Balancer (ALB). If the WAF is the bouncer, the ALB is the maître d’ holding the clipboard.

The ALB is obsessed with fairness and health. It stands in front of a pool of identical servers (the Target Group) and decides who has to do the work. It is trying to prevent any single server from having a nervous breakdown due to overcrowding.

“Server A is busy processing a login request,” the ALB mutters. “Server B is currently restarting because it had a panic attack. You,” it points to our packet, “you go to Server C. It looks bored.”

The ALB’s relationship with the servers is codependent and toxic. It performs health checks on them relentlessly. It demands a 200 OK status code every thirty seconds. If a server takes too long to reply or replies with an error, the ALB declares it “Unhealthy” and stops sending it friends. It effectively ghosts the server until it gets its act together.

The Origin, where the magic (and heat) happens

Finally, the packet reaches the destination. The Origin.

We like to imagine the cloud as an ethereal, fluffy place. In reality, the Origin is likely an EC2 instance, a virtual slice of a computer sitting in a windowless room in Northern Virginia or Dublin. The room is deafeningly loud with the sound of cooling fans and smells of ozone and hot plastic.

Here, the application code actually runs. The request is processed, and the server realizes it needs the actual video file. It reaches out to Amazon S3 (Simple Storage Service), which is essentially a bottomless digital bucket where the internet hoards its data.

The EC2 instance grabs the video from the bucket, processes it, and prepares to send it back.

This is the most fragile part of the journey. If the code has a bug, the server might vomit a 500 Internal Server Error. This is the server saying, “I tried, but I broke something inside myself.” If the database is overwhelmed, the request might time out.

When this happens, the failure cascades back up the chain. The ALB shrugs and tells the user “502 Bad Gateway” (translation: “The guy in the back room isn’t talking to me”). The WAF doesn’t care. CloudFront may cache the error page, so now everyone sees the error until the cached copy expires.

And somewhere, a DevOps engineer’s phone starts buzzing at 3:00 AM.

The return trip

But today, the system works. The Origin retrieves the video bytes. It hands them to the ALB, which passes them to the WAF (who checks them one last time for contraband), which hands them to CloudFront, which hands them to the cellular network.

The packet returns to your phone. The screen flickers. The cat falls off the Roomba. You chuckle, swipe up, and request the next video.

You have no idea that you just forced a tiny, digital backpacker to navigate a global bureaucracy, evade a paranoid security guard, and wake up a server in a different hemisphere, all in less time than it takes you to blink. It is a modern marvel held together by fiber optics and anxiety.

So spare a thought for the data. It has seen things you wouldn’t believe.

November 23, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

AWS Lambda SQS provisioned mode is cheaper than therapy

There is a specific flavor of nausea reserved for serverless engineering teams. It usually strikes at 2 a.m., shortly after a major product launch, when someone posts a triumphant screenshot of user traffic in Slack. While the marketing team is virtually high-fiving, CloudWatch quietly begins to draw a perfect, vertical line that looks less like a growth chart and more like a cliff edge.

Your SQS queues swell. Lambda invocations crawl. Suddenly, the phrase “fully managed service” sounds less comforting and more like a cruel punchline delivered by a distant cloud provider.

For years, the relationship between Amazon SQS and AWS Lambda has been the backbone of event-driven architecture. You wire up an event source mapping, let Lambda poll the queue, and trust the system to scale as messages arrive. Most days, this works beautifully. On the wrong day, under the wrong kind of spike, it works “eventually.”

But in the world of high-frequency trading or flash sales, “eventually” is just a polite synonym for “too late.”

With the release of AWS Lambda SQS Provisioned Mode on November 14, Amazon is finally admitting that sometimes magic is too slow. It grants you explicit control over the invisible workers that poll SQS for your function. It ensures they are already awake, caffeinated, and standing in line before the mob shows up. It allows you to trade a bit of extra planning (and money) for the guarantee that your system won’t hit the snooze button while your backlog turns into a towering monument to failure.

The uncomfortable truth about standard SQS polling

To understand why we need Provisioned Mode, we have to look at the somewhat lazy nature of the standard behavior.

Out of the box, Lambda uses an event source mapping to poll SQS on your behalf. You give it a queue and some basic configuration, and Lambda spins up pollers to check for work. You never see these pollers. They are the ghosts in the machine.

The problem with ghosts is that they are not particularly urgent. When a massive spike hits your queue, Lambda realizes it needs more pollers and more concurrent function invocations. However, it does not do this instantly. It ramps up. It adds capacity in increments, like a cautious driver merging onto a freeway.

For a steady workload, you will never notice this ramp-up. But during a viral marketing campaign or a market crash, those minutes of warming up feel like an eternity. You are essentially watching a barista who refuses to start grinding coffee beans until the line of customers has already curled around the block.

Standard SQS polling gives you tools like batch size, but it denies you direct influence over the urgency of the consumption. You cannot tell the system, “I need ten workers ready right now.” You can only stand in line and hope the algorithm notices you are drowning.

This is acceptable for background jobs like resizing images or sending emails. It is decidedly less acceptable for payment processing or fraud detection. In those cases, watching twenty thousand messages pile up while your system “automatically scales” is not an architectural feature. It is a resume-generating event.

Paying for a standing army instead of volunteers

Provisioned Mode flips the script on this reactive behavior. Instead of letting Lambda decide how many pollers to use based purely on demand, you tell it the minimum and maximum number of event pollers you want reserved for that queue.

An event poller is a dedicated worker that reads from SQS and hands batches of messages to your function. In standard mode, these pollers are summoned from a shared pool when needed. In Provisioned Mode, you are paying to keep them on retainer.

Think of it as the difference between calling a ride-share service and hiring a private driver to sit in your driveway with the engine running. One is efficient for the general public; the other is necessary if you need to leave the house in exactly three seconds.

The benefits are stark when translated into human terms.

First, you get speed. AWS advertises significantly faster scaling for SQS event source mappings in Provisioned Mode. We are talking about adding up to one thousand new concurrent invocations per minute.

Second, you get capacity. AWS’s headline figure for Provisioned Mode is up to 20,000 concurrent invocations per SQS mapping, which it describes as 16 times the default ceiling.

Third, and perhaps most importantly, you get predictability. A single poller is not just a warm body. It is a unit of throughput (handling up to 1 MB per second or 10 concurrent invokes). By setting a minimum number of pollers, you are mathematically guaranteeing a baseline of throughput. You are no longer hoping the waiters show up; you have paid their salaries in advance.
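To make that “mathematical guarantee” concrete, here is a rough sizing sketch. It uses only the per-poller figures quoted above (1 MB per second, 10 concurrent invokes); the function name and every input (message rate, message size, batch size, invoke duration) are hypothetical values you would have to measure on your own workload.

```javascript
// Sketch: estimate a floor for Minimum Pollers from expected peak traffic.
// Per-poller limits are the figures quoted in the article, not measured values.
const POLLER_MB_PER_SECOND = 1;
const POLLER_CONCURRENT_INVOKES = 10;

// All parameters are hypothetical inputs for illustration.
function estimateMinimumPollers({ messagesPerSecond, avgMessageKb, batchSize, avgInvokeSeconds }) {
  // Bandwidth bound: total MB/s divided by what one poller can move.
  const mbPerSecond = (messagesPerSecond * avgMessageKb) / 1024;
  const byThroughput = Math.ceil(mbPerSecond / POLLER_MB_PER_SECOND);

  // Concurrency bound (Little's law): invokes in flight = invokes/s * duration.
  const invokesPerSecond = messagesPerSecond / batchSize;
  const invokesInFlight = invokesPerSecond * avgInvokeSeconds;
  const byConcurrency = Math.ceil(invokesInFlight / POLLER_CONCURRENT_INVOKES);

  // Whichever limit bites first decides; never report less than one poller.
  return Math.max(1, byThroughput, byConcurrency);
}

// 2,000 messages/s of 4 KB each, batches of 10, half a second per invoke.
console.log(estimateMinimumPollers({
  messagesPerSecond: 2000,
  avgMessageKb: 4,
  batchSize: 10,
  avgInvokeSeconds: 0.5,
})); // prints 10
```

In this made-up example, bandwidth alone would ask for 8 pollers, but concurrency asks for 10, so concurrency wins. Treat the result as a starting point for the metrics you watch afterwards, not as a promise.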

Configuring this without losing your mind

The good news is that Provisioned Mode is not a new service with its own terrifying learning curve. It is just a configuration toggle on the event source mapping you are already using. You can set it up in the AWS Console, the CLI, or your Infrastructure as Code tool of choice.

The interface asks for two numbers, and this is where the engineering art form comes in.

First, it asks for Minimum Pollers. This is the number of workers you always want ready.

Second, it asks for Maximum Pollers. This is the ceiling, the limit you set to ensure you do not accidentally DDoS your own database.

Choosing these numbers feels a bit like gambling, but there is a logic to it. For the minimum, pick a number that comfortably handles your typical traffic plus a standard spike. Start small. Setting this to 100 when you usually need 2 is the serverless equivalent of buying a school bus to commute to work alone.

For the maximum, look at your downstream systems. There is no point in setting a maximum that allows 5,000 concurrent Lambda functions if your relational database curls into a fetal position at 500 connections.
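One way to turn that advice into a number is to work backwards from the database. The sketch below assumes each concurrent invocation holds one database connection and reuses the “10 concurrent invokes per poller” figure from earlier. The helper names and the safety margin are mine, and the ProvisionedPollerConfig, MinimumPollers, and MaximumPollers field names reflect my understanding of the event source mapping API, so check them against the current Lambda documentation before copying anything.

```javascript
// Sketch: derive a Maximum Pollers ceiling from a downstream connection limit.
const INVOKES_PER_POLLER = 10; // per-poller concurrency figure quoted in the article

// dbConnectionLimit and safetyPercent are hypothetical inputs.
// Assumption: one concurrent Lambda invocation uses one database connection.
function maxPollersForDownstream(dbConnectionLimit, safetyPercent = 80) {
  const usableConnections = Math.floor((dbConnectionLimit * safetyPercent) / 100);
  return Math.max(1, Math.floor(usableConnections / INVOKES_PER_POLLER));
}

// Builds the fragment you would pass to your tooling of choice.
// Field names are an assumption about the API shape, not verified here.
function buildPollerConfig(minimumPollers, dbConnectionLimit) {
  const maximumPollers = maxPollersForDownstream(dbConnectionLimit);
  if (minimumPollers > maximumPollers) {
    throw new RangeError(
      `minimum ${minimumPollers} exceeds the downstream-safe maximum ${maximumPollers}`
    );
  }
  return {
    ProvisionedPollerConfig: {
      MinimumPollers: minimumPollers,
      MaximumPollers: maximumPollers,
    },
  };
}

// A database that gives up at 500 connections, with 20% headroom kept back.
console.log(JSON.stringify(buildPollerConfig(5, 500)));
```

With these numbers the ceiling lands at 40 pollers, which is roughly 400 concurrent invocations and leaves the database some room to breathe.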

Once you enable it, you need to watch your metrics. Keep an eye on “Queue Depth” and “Age of Oldest Message.” If the backlog clears too slowly, buy more pollers. If your database administrator starts sending you angry emails in all caps, reduce the maximum. The goal is not perfection on day one; it is to replace guesswork with a feedback loop.
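That feedback loop can be written down as a tiny decision rule. This is only a sketch of the reasoning in the paragraph above: the function, the thresholds, and the step sizes are all hypothetical, and the metric values are assumed to come from whatever you already use to read queue age (SQS publishes it to CloudWatch as ApproximateAgeOfOldestMessage) and database connection usage.

```javascript
// Sketch: one turn of the "watch metrics, then adjust" loop.
// current: { minimumPollers, maximumPollers }
// metrics: { ageOfOldestMessageSeconds, dbConnectionUtilization } (0..1)
// targets: { maxAgeSeconds, maxDbUtilization, step } (all hypothetical knobs)
function adjustPollers(current, metrics, targets) {
  const next = { ...current };

  if (metrics.dbConnectionUtilization > targets.maxDbUtilization) {
    // The angry-email case: protect the database first by lowering the ceiling
    // by 20%, but never below the configured minimum.
    const lowered = Math.floor((current.maximumPollers * 4) / 5);
    next.maximumPollers = Math.max(current.minimumPollers, lowered);
  } else if (metrics.ageOfOldestMessageSeconds > targets.maxAgeSeconds) {
    // The backlog is clearing too slowly: raise the floor, capped at the ceiling.
    next.minimumPollers = Math.min(
      current.maximumPollers,
      current.minimumPollers + targets.step
    );
  }
  return next;
}

const targets = { maxAgeSeconds: 120, maxDbUtilization: 0.9, step: 2 };
const current = { minimumPollers: 5, maximumPollers: 100 };

// Messages are ten minutes old and the database is relaxed: buy more pollers.
console.log(adjustPollers(current, { ageOfOldestMessageSeconds: 600, dbConnectionUtilization: 0.5 }, targets));
```

The database check comes first on purpose. A slow queue is embarrassing, but a database that has fallen over makes the queue slow too.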

The financial hangover

Nothing in life is free, and this applies doubly to AWS features that solve headaches.

When you enable Provisioned Mode, AWS begins charging you for “Event Poller Units.” You pay for the minimum pollers you configure, regardless of whether there are messages in the queue. You are paying for readiness.

This is a mental shift for serverless purists. The whole promise of serverless was “pay for what you use.” Provisioned Mode is “pay for what you might need.”

You are essentially renting a standing army. Most of the time, they will just stand there, playing cards and eating your budget. But when the enemy (traffic) attacks, they are already in position. Standard SQS polling is cheaper because it relies on volunteers. Volunteers are free, but they take a while to put on their boots.

From a FinOps perspective, or simply from the perspective of explaining the bill to your boss, the question is not “Is this expensive?” The question is “What is the cost of latency?”

For a background report generator, a five-minute delay costs nothing. For a high-frequency trading platform, a five-second delay costs everything. You should not enable Provisioned Mode on every queue in your account. That would be financial malpractice. You reserve it for the critical paths, the workflows where the price of slowness is measured in lost customers rather than just infrastructure dollars.

Why you should care about the fourth dial

Architecturally, Provisioned Mode gives us a new layer of control. Previously, we had three main dials in event-driven systems: how fast we write to the queue, how fast the consumers process messages, and how much concurrency Lambda is allowed.

Provisioned Mode adds a fourth dial: the aggression of the retrieval.

It allows you to reason about your system deterministically. If you know that one poller provides X amount of throughput, you can stack them to meet a specific Service Level Agreement. It turns a “best effort” system into a “calculated guarantee” system.

Serverless was sold to us as freedom from capacity planning. We were told we could just write code and let the cloud handle the undignified details of scaling. For many workloads, that promise holds true.

But as your workloads become more critical, you discover the uncomfortable corners where “just let it scale” is not enough. Latency budgets shrink. Compliance rules tighten. Customers grow less patient.

AWS Lambda SQS Provisioned Mode is a small, targeted answer to that discomfort. It allows you to say, “I want at least this much readiness,” and have the platform respect that wish, even when your traffic behaves like a toddler on a sugar high.

So, pick your most critical queue. The one that keeps you awake at night. Enable Provisioned Mode, set a modest minimum, and watch the metrics. Your future self, staring at a flat latency graph during the next Black Friday, will be grateful you decided to stop trusting in magic and started paying for physics.

November 20, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Escaping the AWS NAT Gateway toll booth

My coffee went cold. I was staring at my AWS bill, and one line item was staring back at me with a judgmental smirk: NAT Gateway: 33,01 €.

This wasn’t for compute. This wasn’t for storing terabytes of crucial data. This was for the simple, mundane privilege of letting my Lambda functions send emails and tell Stripe to charge a credit card.

Let’s talk about NAT Gateway pricing. It’s a special kind of pain.

$0.045 per hour (That’s roughly $33 a month, just for existing).
$0.045 per GB processed (You get charged for your own data).
…and that’s per Availability Zone. For High Availability, you multiply by two or three.
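If you want to see how those three lines add up, here is the arithmetic as a small sketch. The two rates are the ones quoted above; the 730-hour month and the function name are my own assumptions, and your region’s prices may differ.

```javascript
// Sketch: monthly NAT Gateway cost from the rates quoted in the article.
const HOURLY_RATE_USD = 0.045;
const PER_GB_RATE_USD = 0.045;
const HOURS_PER_MONTH = 730; // assumption: an average month

function natGatewayMonthlyCost(availabilityZones, gbProcessed) {
  // The standing charge applies to every gateway, one per Availability Zone.
  const standingCharge = HOURLY_RATE_USD * HOURS_PER_MONTH * availabilityZones;
  // The data charge applies to every GB that passes through.
  const dataCharge = PER_GB_RATE_USD * gbProcessed;
  // Round to cents.
  return Math.round((standingCharge + dataCharge) * 100) / 100;
}

console.log(natGatewayMonthlyCost(1, 0));  // prints 32.85
console.log(natGatewayMonthlyCost(3, 50)); // prints 100.8
```

One idle gateway costs about $33 a month before a single byte moves. Three of them with a modest 50 GB of traffic is how you arrive at a three-digit bill for a toll booth.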

I was suddenly paying more for a digital toll booth operator than I was for the actual application logic running my startup. That’s when I started asking questions. Did I really need this? What was I actually paying for? And more importantly, was there another way?

This is the story of how I hunted down that 33€ line item. By the end, you’ll know exactly if you need a NAT Gateway, or if you’re just burning money to keep the AWS machine fed.

The great NAT lie

Every AWS tutorial, every Stack Overflow answer, every “serverless best practice” blog post chants the same mantra: “If your Lambda needs to access the internet, and it’s in a VPC, you need a NAT Gateway.”

It’s presented as a law of physics. Like gravity, or the fact that DNS will always be the problem. And I, like a good, obedient engineer, followed the instructions. I clicked the button. I added the NAT. And then the bill came.

It turns out that obedience is expensive.

The gilded cage we call a VPC

Before we storm the castle, we have to understand why we built the castle in the first place. Why are our Lambdas in this mess? The answer is the Virtual Private Cloud (VPC).

By default, a Lambda function is a free spirit. It’s born with a magical, AWS-managed connection to the outside world. It can call any API it wants. It’s a social butterfly.

But then, security happens.

We have a managed database, like MongoDB Atlas. We absolutely, positively do not want this database exposed to the public internet. That’s like shouting your bank details across a crowded shopping mall. So, we rightly configure it to only accept private connections.

To let our Lambda talk to this database, we have to build a “gated community” for it. That’s our VPC. We move the Lambda inside this community and set up a “VPC Peering” connection, which is like a private, guarded footpath between our VPC and the MongoDB VPC.

Our Lambda can now securely whisper secrets to the database. The traffic never touches the public internet. We are secure. We are compliant. We are… trapped.

House arrest

We solved one problem but created a massive new one. In building this fortress to protect our database, we built it with no doors to the outside world.

Our Lambda is now on house arrest.

Sure, it can talk to the database in the adjoining room. But it can no longer call the Stripe API to process a payment. It can’t call an email service. It can’t even phone its own cousins in the AWS family, like AWS Secrets Manager or S3 (not without extra work, anyway). Any attempt to reach the internet just… times out. It’s the sound of silence.

This is the dilemma. To be secure, our Lambda must be in a VPC. But once in a VPC, it’s useless for half its job.

Enter the expensive chaperone

This is where the AWS Gospel presents its solution: the NAT Gateway.

The NAT (Network Address Translation) Gateway is, in our analogy, an extremely expensive, bonded chaperone.

You place this chaperone in a “public” part of your gated community (a public subnet). When your Lambda on house arrest needs to send a letter to the outside world (like an API call to Stripe), it gives the letter to the chaperone.

The chaperone (the NAT) takes the letter, walks it to the main gate, puts its own public return address on it, and sends it. When the reply comes back, the chaperone receives it, verifies it’s for the Lambda, and delivers it.

This works. It’s secure. The Lambda’s private address is never exposed.

But this chaperone charges you. It charges you by the hour just to be on call. It charges you for every letter it carries (data processed). And as we established, you need three of them if you want to be properly redundant.

This is a racket.

The “Split Personality” solution

I refused to pay the toll. There had to be another way. The solution came from realizing I was trying to make one Lambda do two completely opposite jobs.

What if, instead of one “do-it-all” Lambda, I created two specialists?

The hermit: This Lambda lives inside the VPC. Its one and only job is to talk to the database. It is antisocial, secure, and has no idea the internet exists.
The messenger: This Lambda lives outside the VPC. It’s a “free-range” Lambda. Because it’s not attached to any VPC, AWS magically gives it that default internet access. It cannot talk to the database (which is good!), but it can talk to Stripe all day long.

The plan is simple: when The hermit (VPC Lambda) needs something from the internet, it invokes The messenger (Proxy Lambda). It hands it a note: “Please tell Stripe to charge $25.00.” The messenger runs the errand, gets the receipt, and passes it back to The hermit, who then safely logs the result in the database.

It’s a “split personality” architecture.

But is it safe?

I can hear you asking: “Wait. A Lambda with internet access? Isn’t that like leaving your front door wide open for attackers?”

No. And this is the most beautiful part.

A Lambda function, whether in a VPC or not, never gets a public IP address. It can make outbound calls, but nothing from the public internet can initiate a call to it.

It’s like having a phone that can only make calls, not receive them. It’s unreachable. The “Messenger” Lambda is perfectly safe to live outside the VPC, ready to do our bidding.

The secret tunnel system

So, I built it. The hermit. The messenger. I was a genius. I hit “test.”

…timeout.

Of course. I forgot. The hermit is still on house arrest. “Invoking” another Lambda is, itself, an AWS API call. It’s a request that has to leave the VPC to reach the AWS Lambda service. My Lambda couldn’t even call its own lawyer.

This is where the real solution lies. Not in a gateway, but in a series of tunnels.

They’re called VPC Endpoints.

A VPC Endpoint is not a big, expensive, public chaperone. It’s a private, secret tunnel that you build directly from your VPC to a specific AWS service, all within the AWS network.

So, I built two tunnels:

A tunnel to AWS Secrets Manager: Now my hermit Lambda can get its API keys directly, without ever leaving the house.
A tunnel to AWS Lambda: Now my hermit Lambda can use its private phone to “invoke” The messenger.

These endpoints have a small hourly cost, but it’s a fraction of a NAT Gateway, and the data processing fee is either tiny or free, depending on the endpoint type. We’ve replaced a $100/mo toll road with a $5/mo private footpath.

(A grumpy side note: annoyingly, some AWS services like Cognito don’t support VPC Endpoints. For those, you still have to use the Messenger proxy pattern. But for most, the tunnels work.)

Our glorious new contraption

Let’s look at our payment handler again. This little function needed to:

Get API keys from AWS Secrets Manager.
Call Stripe’s API.
Write the transaction to MongoDB.

Here is how our new, glorious, Rube Goldberg machine works:

Step 1: The Payment Lambda (The hermit) gets a request.
Step 2: It needs keys. It pops over to AWS Secrets Manager through its private tunnel (the VPC Endpoint). No internet needed.
Step 3: It needs to charge a card. It calls the invoke command, which goes through its other private tunnel to the AWS Lambda service, triggering The messenger.
Step 4: The messenger (Proxy Lambda), living in the free-range world, makes the outbound call to Stripe. Stripe, delighted, processes the payment and sends a reply.
Step 5: The messenger passes the success (or failure) response back to The hermit.
Step 6: The hermit, now holding the result, calmly turns and writes the transaction record to MongoDB via its private VPC Peering connection.

Everything works. Nothing is exposed. And the NAT Gateway bill is 0€.

For those who speak in code

Here is a simplified look at what our two specialist Lambdas are doing.

Payment Lambda (The hermit – INSIDE VPC)

// This Lambda is attached to your VPC
// It needs VPC Endpoints for 'lambda' and 'secretsmanager'

import { InvokeCommand, LambdaClient } from "@aws-sdk/client-lambda";
// ... (imports for Secrets Manager and Mongo)

const lambda = new LambdaClient({});

export const handler = async (event) => {
  try {
    const amountToCharge = 2500; // 25.00

    // 1. Get secrets via VPC Endpoint
    // const apiKeys = await getSecretsFromManager();
    
    // 2. Prepare to invoke the proxy
    const command = new InvokeCommand({
      FunctionName: process.env.PAYMENT_PROXY_FUNCTION_NAME,
      InvocationType: "RequestResponse",
      Payload: JSON.stringify({
        chargeDetails: { amount: amountToCharge, currency: "usd" },
      }),
    });

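    // 3. Invoke the proxy Lambda via VPC Endpoint
    const response = await lambda.send(command);
    // An unhandled error inside the proxy is reported via FunctionError
    // rather than thrown here, so check for it explicitly.
    if (response.FunctionError) {
      throw new Error(`Proxy invocation failed: ${response.FunctionError}`);
    }
    const proxyResponse = JSON.parse(
      Buffer.from(response.Payload).toString()
    );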

    if (proxyResponse.status === "success") {
      // 4. Write to MongoDB via VPC Peering
      // await writePaymentRecordToMongo(proxyResponse.transactionId);
      
      return {
        statusCode: 200,
        body: `Payment succeeded! TxID: ${proxyResponse.transactionId}`,
      };
    } else {
      // Handle payment failure
      return { statusCode: 400, body: "Payment failed." };
    }
  } catch (error) {
    console.error(error);
    return { statusCode: 500, body: "Server error" };
  }
};

Proxy Lambda (The messenger – OUTSIDE VPC)

// This Lambda is NOT attached to a VPC
// It has default internet access

// ... (import for your Stripe client)
// const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);

export const handler = async (event) => {
  // 1. Extract the data sent by the invoking Hermit.
  // The Payload JSON arrives as the event itself, so there is no wrapper key.
  const { chargeDetails } = event;

  try {
    // 2. Call the external Stripe API
    // const stripeResponse = await stripe.charges.create({
    //   amount: chargeDetails.amount,
    //   currency: chargeDetails.currency,
    //   source: "tok_visa", // Example token
    // });
   
    // Mocking the Stripe call for this example
    const stripeResponse = {
      id: `txn_${Math.random().toString(36).substring(2, 15)}`,
      status: "succeeded",
    };

    if (stripeResponse.status === "succeeded") {
      // 3. Return the successful result
      return {
        status: "success",
        transactionId: stripeResponse.id,
      };
    } else {
      return { status: "failed", error: "Stripe decline" };
    }
  } catch (err) {
    // 4. Return any errors
    return {
      status: "failed",
      error: `Error contacting Stripe: ${err.message}`,
    };
  }
};

Was it worth it?

And there it is. A production-grade, secure, and resilient system. Our hermit Lambda is safe in its VPC, talking to the database. Our messenger Lambda is happily running errands on the internet. And our secret tunnels are connecting everything privately.

That said, figuring all this out and integrating it into a production system takes a significant amount of time. This… this contraption of proxies and endpoints is, frankly, a headache.

If you don’t want the headache, sometimes it’s easier to just pay that damn 30€ for a NAT Gateway and move on with your life.

The purpose of this article wasn’t just to save a few bucks. It was to pull back the curtain. To show that the “one true way” isn’t the only way, and to prove that with a little bit of architectural curiosity, you can, in fact, escape the AWS NAT Gateway toll booth.

November 15, 2025 by Fernando SRE Cloud stuff SRE stuff

Your Multi-Region strategy is a fantasy

The recent failure showed us the truth: your data is stuck, and active-active failover is a fantasy for 99% of us. Here’s a pragmatic high-availability strategy that actually works.

Well, that was an intense week.

When the great AWS outage of October 2025 hit, I did what every senior IT person does: I grabbed my largest coffee mug, opened our monitoring dashboard, and settled in to watch the world burn. us-east-1, the internet’s stubbornly persistent center of gravity, was having what you’d call a very bad day.

And just like clockwork, as the post-mortems rolled in, the old, tired refrain started up on social media and in Slack: “This is why you must be multi-region.”

I’m going to tell you the truth that vendors, conference speakers, and that one overly enthusiastic junior dev on your team won’t. For 99% of companies, “multi-region” is a lie.

It’s an expensive, complex, and dangerous myth sold as a silver bullet. And the recent outage just proved it.

The “Just Be Multi-Region” fantasy

On paper, it sounds so simple. It’s a lullaby for VPs.

You just run your app in us-east-1 (Virginia) and us-west-2 (Oregon). You put a shiny global load balancer in front, and if Virginia decides to spontaneously become an underwater volcano, poof! All your traffic seamlessly fails over to Oregon. Zero downtime. The SREs are heroes. Champagne for everyone.

This is a fantasy.

It’s a fantasy that costs millions of dollars and lures development teams into a labyrinth of complexity they will never escape. I’ve spent my career building systems that need to stay online. I’ve sat in the planning meetings and priced out the “real” cost. Let me tell you, true active-active multi-region isn’t just “hard”; it’s a completely different class of engineering.

And it’s one that your company almost certainly doesn’t need.

The three killers of Multi-Region dreams

It’s not the application servers. Spinning up EC2 instances or containers in another region is the easy part. That’s what we have Infrastructure as Code for. Any intern can do that.

The problem isn’t the compute. The problem is, and always has been, the data.

Killer 1: Data has gravity, and it’s a jerk

This is the single most important concept in cloud architecture. Data has gravity.

Your application code is a PDF. It’s stateless and lightweight. You can email it, copy it, and run it anywhere. Your 10TB PostgreSQL database is not a PDF. It’s the 300-pound antique oak desk the computer is sitting on. You can’t just “seamlessly fail it over” to another continent.

To have a true seamless failover, your data must be available in the second region at the exact moment of the failure. This means you need synchronous, real-time replication across thousands of miles.

Guess what that does to your write performance? It’s like trying to have a conversation with someone on Mars. A round trip from Virginia to Oregon costs tens of milliseconds (typically somewhere around 60 to 80 ms), and every single database write has to pay it at least once. A transaction that needs several round trips pays it several times over, and the application becomes painfully slow. Every time a user clicks “save,” they have to wait for a photon to physically travel across the country and back. Your users will hate it.

“Okay,” you say, “we’ll use asynchronous replication!”

Great. Now when us-east-1 fails, you’ve lost the last 5 minutes of data. Every transaction, every new user sign-up, every shopping cart order. Vanished. You’ve traded a “Recovery Time” of zero for a “Data Loss” that is completely unacceptable. Go explain to the finance department that you purposefully designed a system that throws away the most recent customer orders. I’ll wait.

This is the trap. Your compute is portable; your data is anchored.
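The trade-off is easier to argue about with numbers on the table. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the function names are mine, and the local write time, round-trip time, write rate, and replication lag are illustrative assumptions you should replace with your own measurements.

```javascript
// Sketch: what each replication mode costs you, in rough numbers.

// Synchronous: every commit waits for at least one cross-region round trip.
function syncWriteLatencyMs(localWriteMs, roundTripMs, roundTripsPerCommit = 1) {
  return localWriteMs + roundTripMs * roundTripsPerCommit;
}

// How many commits one connection can do back to back at that latency.
function maxSequentialWritesPerSecond(latencyMs) {
  return Math.floor(1000 / latencyMs);
}

// Asynchronous: anything written inside the lag window dies with the region.
function writesAtRisk(writesPerSecond, replicationLagSeconds) {
  return writesPerSecond * replicationLagSeconds;
}

const localOnly = syncWriteLatencyMs(2, 0);    // 2 ms, single region
const crossRegion = syncWriteLatencyMs(2, 70); // 72 ms with an assumed 70 ms round trip

console.log(maxSequentialWritesPerSecond(localOnly));   // prints 500
console.log(maxSequentialWritesPerSecond(crossRegion)); // prints 13
console.log(writesAtRisk(50, 300));                     // prints 15000
```

In this example, synchronous replication cuts a single connection from 500 sequential commits per second to 13, while asynchronous replication with five minutes of lag puts 15,000 writes on the line. Pick your poison.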

Killer 2: The astronomical cost

I was on a project once where the CTO, fresh from a vendor conference, wanted a full active-active multi-region setup. We scoped it.

Running 2x the servers was fine. The real cost was the inter-region data transfer.

AWS (and all cloud providers) charge an absolute fortune for data moving between their regions. It’s the “hotel minibar” of cloud services. Every single byte your database replicates, every log, every file transfer… cha-ching.

Our projected bill for the data replication and the specialized services (like Aurora Global Databases or DynamoDB Global Tables) was three times the cost of the entire rest of the infrastructure.

You are paying a massive premium for a fleet of servers, databases, and network gateways that are sitting idle 99.9% of the time. It’s like buying the world’s most expensive gym membership and only going once every five years to “test” it. It’s an insurance policy so expensive, you can’t afford the disaster it’s meant to protect you from.

Killer 3: The crushing complexity

A multi-region system isn’t just two copies of your app. It’s a brand new, highly complex, slightly psychotic distributed system that you now have to feed and care for.

You now have to solve problems you never even thought about:

Global DNS failover: How does Route 53 know a region is down? Health checks fail. But what if the health check itself fails? What if the health check thinks Virginia is fine, but it’s just hallucinating?
Data write conflicts: This is the fun part. What if a user in New York (writing to us-east-1) and a user in California (writing to us-west-2) update the same record at the same time? Welcome to the world of split-brain. Who wins? Nobody. You now have two “canonical” truths, and your database is having an existential crisis. Your job just went from “Cloud Architect” to “Data Therapist.”
Testing: How do you even test a full regional failover? Do you have a big red “Kill Virginia” button? Are you sure you know what will happen when you press it? On a Tuesday afternoon? I didn’t think so.

You haven’t just doubled your infrastructure; you’ve 10x’d your architectural complexity.
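The write-conflict problem in particular is worth seeing in miniature. This is a toy model with made-up record shapes and function names, not how any specific database does it: two regions each accept a write based on the same version of a record, and a naive “last writer wins” merge quietly throws one of them away.

```javascript
// Sketch: two regions update the same record before replication catches up.

// A local write remembers which version it was based on.
function applyLocalWrite(record, newValue, region, timestampMs) {
  return {
    value: newValue,
    baseVersion: record.version,
    version: record.version + 1,
    region,
    timestampMs,
  };
}

// Two writes from different regions built on the same base version are concurrent.
function isConflict(a, b) {
  return a.baseVersion === b.baseVersion && a.region !== b.region;
}

// Last-writer-wins: keep the later timestamp, silently drop the other write.
function mergeLastWriterWins(a, b) {
  return a.timestampMs >= b.timestampMs ? a : b;
}

const original = { value: "1 Main St", version: 1 };
const east = applyLocalWrite(original, "5 Fifth Ave", "us-east-1", 1000);
const west = applyLocalWrite(original, "9 Elm Rd", "us-west-2", 1001);

console.log(isConflict(east, west));                // prints true
console.log(mergeLastWriterWins(east, west).value); // prints 9 Elm Rd
```

The New York user got a “saved” confirmation and their address is gone anyway. Detecting the conflict is the easy part; deciding what to do about it is where the therapy bills start.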

But we have Kubernetes because we are Cloud Native

This was my favorite part of the October 2025 outage.

I saw so many teams that thought Kubernetes would save them. They had their fancy federated K8s clusters spanning multiple regions, YAML files as far as the eye could see.

And they still went down.

Why? Because Kubernetes doesn’t solve data gravity!

Your K8s cluster in us-west-2 dutifully spun up all your application pods. They woke up, stretched, and immediately started screaming: “WHERE IS MY DISK?!”

Your persistent volumes (PVs) are backed by EBS or EFS. That ‘E’ stands for ‘Elastic,’ not ‘Extradimensional.’ That disk is physically, stubbornly attached to Virginia: an EBS volume is pinned to a single Availability Zone, and an EFS file system to a single region. Your pods in Oregon can’t mount a disk that lives 3,000 miles away.

Unless you’ve invested in another layer of incredibly complex, eye-wateringly expensive storage replication software, your “cloud-native” K8s cluster was just a collection of very expensive, very confused applications shouting into the void for a database that was currently offline.

A pragmatic high availability strategy that actually works

So if multi-region is a lie, what do we do? Just give up? Go home? Take up farming?

Yes. You accept some downtime.

You stop chasing the “five nines” (99.999%) myth and start being honest with the business. Your goal is not “zero downtime.” Your goal is a tested and predictable recovery.
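It helps to translate the “nines” into minutes before anyone promises them in a meeting. The sketch below is plain arithmetic over a 365-day year; the function names are mine.

```javascript
// Sketch: how much downtime a given availability target actually allows.
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function allowedDowntimeMinutesPerYear(availabilityPercent) {
  return MINUTES_PER_YEAR * (1 - availabilityPercent / 100);
}

// The reverse question: what availability does a given outage leave you with?
function availabilityPercentFor(downtimeMinutesPerYear) {
  return 100 * (1 - downtimeMinutesPerYear / MINUTES_PER_YEAR);
}

for (const target of [99.9, 99.99, 99.999]) {
  const minutes = allowedDowntimeMinutesPerYear(target);
  console.log(`${target}% allows about ${minutes.toFixed(2)} minutes per year`);
}

// One four-hour outage in a year:
console.log(`${availabilityPercentFor(240).toFixed(3)}%`);
```

Five nines leaves you roughly five and a quarter minutes a year, which is less time than it takes most on-call engineers to find their laptop. A single four-hour outage still leaves you above 99.95%.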

Here is the sane strategy.

1. Embrace Multi-AZ (The real HA)

This is what AWS actually means by “high availability.” Run your application across multiple Availability Zones (AZs) within a single region. An AZ is one or more physically separate data centers. us-east-1a and us-east-1b are miles apart, with different power and network.

This is like having a backup generator for your house. Multi-region is like building an identical, fully-furnished duplicate house in another city just in case a meteor hits your first one.

Use a Multi-AZ RDS instance. Use an Auto Scaling Group that spans AZs. This protects you from 99% of common failures: a server rack dying, a network switch failing, or a construction crew cutting a fiber line. This should be your default. It’s cheap, it’s easy, and it works.

2. Focus on RTO and RPO

Stop talking about “nines” and start talking about two simple numbers:

RTO (Recovery Time Objective): How fast do we need to be back up?
RPO (Recovery Point Objective): How much data can we afford to lose?

Get a real answer from the business, not a fantasy. Is a 4-hour RTO and a 15-minute RPO acceptable? For almost everyone, the answer is yes.
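Once the business has given you those two numbers, you can check a recovery plan against them on paper. The sketch below assumes backups are periodic snapshots copied to another region and that recovery is a sequence of manual or scripted steps. The function names, step names, and durations are all hypothetical.

```javascript
// Sketch: does a recovery plan meet the agreed RTO and RPO?

// Worst case: disaster strikes just before the next snapshot finishes copying,
// so you lose a full interval plus the time the copy takes to land.
function worstCaseRpoMinutes(snapshotIntervalMinutes, copyDurationMinutes) {
  return snapshotIntervalMinutes + copyDurationMinutes;
}

// Recovery time is the sum of every step between "it's down" and "it's back".
function estimatedRtoMinutes(steps) {
  return steps.reduce((total, step) => total + step.minutes, 0);
}

function meetsObjectives(plan, objectives) {
  const rpo = worstCaseRpoMinutes(plan.snapshotIntervalMinutes, plan.copyDurationMinutes);
  const rto = estimatedRtoMinutes(plan.recoverySteps);
  return { rpo, rto, rpoOk: rpo <= objectives.rpoMinutes, rtoOk: rto <= objectives.rtoMinutes };
}

const plan = {
  snapshotIntervalMinutes: 15,
  copyDurationMinutes: 10,
  recoverySteps: [
    { name: "detect and declare disaster", minutes: 30 },
    { name: "deploy infrastructure in DR region", minutes: 30 },
    { name: "restore database from snapshot", minutes: 60 },
    { name: "switch DNS and verify", minutes: 10 },
  ],
};

console.log(meetsObjectives(plan, { rtoMinutes: 240, rpoMinutes: 15 }));
```

With these example numbers the plan clears a 4-hour RTO comfortably, but the worst-case data loss is 25 minutes, not 15, because the copy itself takes time. That is exactly the kind of gap you want to find in a spreadsheet rather than during an outage.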

3. Build a “Warm Standby” (The sane DR)

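This is the strategy that actually works. It’s the “fire drill” plan, not the “build a duplicate city” plan. (Strictly speaking, AWS’s own disaster recovery vocabulary would file what follows under “backup and restore” or “pilot light,” since a textbook warm standby keeps a scaled-down copy running. The name matters less than the plan.)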

Infrastructure: Your entire infrastructure is defined in Terraform or CloudFormation. You can rebuild it from scratch in any region with a single command.
Data: You take regular snapshots of your database (e.g., every 15 minutes) and automatically copy them to your disaster recovery region (us-west-2).
The plan: When us-east-1 dies, you declare a disaster. The on-call engineer runs the “Deploy-to-DR” script.

Here’s a taste of what that “sane” infrastructure-as-code looks like. You’re not paying for two of everything. You’re paying for a blueprint and a backup.

# main.tf (in your primary region module)
# This is just a normal server
resource "aws_instance" "app_server" {
  count         = 3 # Your normal production count
  ami           = "ami-0abcdef123456"
  instance_type = "t3.large"
  # ... other config
}

# dr.tf (in your DR region module)
# This server doesn't even exist... until you need it.
resource "aws_instance" "dr_app_server" {
  # This is the magic.
  # This resource is "off" by default (count = 0).
  # You flip one variable (is_disaster = true) to build it.
  count         = var.is_disaster ? 3 : 0
  provider      = aws.dr_region # Pointing to us-west-2
  ami           = "ami-0abcdef123456" # Same AMI
  instance_type = "t3.large"
  # ... other config
}

resource "aws_db_instance" "dr_database" {
  count                   = var.is_disaster ? 1 : 0
  provider                = aws.dr_region
  
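  # Here it is: You build the new DB from the
  # latest snapshot you've been copying over.
  # (snapshot_identifier restores from a snapshot;
  # replicate_source_db would create a read replica instead.)
  snapshot_identifier     = var.latest_db_snapshot_arn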
  
  instance_class          = "db.r5.large"
  # ... other config
}

Once the DR stack is up, you flip a single DNS record in Route 53 to point all traffic to the new load balancer in us-west-2.

Yes, you have downtime (your RTO of 2–4 hours). Yes, you might lose 15 minutes of data (your RPO).

But here’s the beautiful part: it actually works, it’s testable, and it costs a tiny fraction of an active-active setup.

The AWS outage in October 2025 wasn’t a lesson in the need for multi-region. It was a global, public, costly lesson in humility. It was a reminder to stop chasing mythical architectures that look good on a conference whiteboard and focus on building resilient, recoverable systems.

So, stop feeling guilty because your setup doesn’t span three continents. You’re not lazy; you’re pragmatic. You’re the sane one in a room full of people passionately arguing about the best way to build a teleporter for that 300-pound antique oak desk.

Let them have their complex, split-brain, data-therapy sessions. You’ve chosen a boring, reliable, testable “warm standby.” You’ve chosen to get some sleep.

November 7, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Burst traffic realities for AWS API Gateway Architects

Let’s be honest. Cloud architecture promises infinite scalability, but sometimes it feels like we’re herding cats wearing rocket boots. I learned this the hard way when my shiny serverless app, built with all the modern best practices, started hiccuping like a soda-drunk kangaroo during a Black Friday sale. The culprit? AWS API Gateway throttling under bursty traffic. And no, it wasn’t my coffee intake causing the chaos.

The token bucket, a simple idea with a sneaky side

AWS API Gateway uses a token bucket algorithm to manage traffic. Picture a literal bucket. Tokens drip into it at a steady rate, your rate limit. Each incoming request steals a token to pass through. If the bucket is empty? Requests get throttled. Simple, right? Like a bouncer checking IDs at a club.

But here’s the twist: This bouncer has a strict hourly wage. If 100 requests arrive in one second, they’ll drain the bucket faster than a toddler empties a juice box. Then, even if traffic calms down, the bucket refills slowly. Your API is stuck in timeout purgatory while tokens trickle back. AWS documents this, but it’s easy to miss until your users start tweeting about your “haunted API.”
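
If you prefer code to bouncers, here is a toy token bucket. It is a sketch of the general algorithm, not AWS's implementation. The class, the `rate` and `capacity` names, and the numbers are all invented for the demo. Time is passed in by hand so the example is deterministic.

```python
class TokenBucket:
    """Toy token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate, capacity):
        self.rate = rate            # refill speed, tokens per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # the bucket starts full
        self.last = 0.0             # time of the previous refill, in seconds

    def allow(self, now):
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # this is where a real gateway would answer 429


bucket = TokenBucket(rate=10, capacity=20)

# 100 requests land at the same instant: only the 20 stored tokens get through.
burst = [bucket.allow(now=0.0) for _ in range(100)]
print(sum(burst))  # 20

# Half a second later, only 5 tokens have dripped back in.
later = [bucket.allow(now=0.5) for _ in range(100)]
print(sum(later))  # 5
```

The second batch is the "timeout purgatory" part: traffic has calmed down, but the bucket only recovers at its refill rate.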

Bursty traffic is life’s unpredictable roommate

Bursty traffic isn’t a bug; it’s a feature of modern apps. Think flash sales, mobile app push notifications, or that TikTok dance challenge your marketing team insisted would go viral (bless their optimism). Traffic doesn’t flow like a zen garden stream. It arrives in tsunami waves.

I once watched a client’s analytics dashboard spike at 3 AM. Turns out, their smart fridge app pinged every device simultaneously after a firmware update. The bucket emptied. Alarms screamed. My weekend imploded. Bursty traffic doesn’t care about your sleep schedule.

When bursts meet buckets, the throttling tango

Here’s where things get spicy. API Gateway’s token bucket has a burst capacity: the maximum number of tokens it can hold. At the stage level, you set it alongside your rate limit. Say both are 100: a rate of 100 requests/second and a burst of 100. Your bucket holds 100 tokens. Send 150 requests in one burst? The first 100 sail through. The next 50 get throttled, even if the average traffic is below 100/second.

It’s like a theater with 100 seats. If 150 people rush the door at once, 50 get turned away, even if half the theater is empty later. AWS isn’t being petty. It’s protecting downstream services (like your database) from sudden stampedes. But when your app is the one getting trampled? Less poetic. More infuriating.
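
The "average is fine, burst is not" effect is easy to reproduce with a coarse per-second model. This is only an illustration: it assumes every request in a given second arrives together, and the function name and traffic numbers are invented.

```python
def simulate(arrivals_per_second, rate, burst):
    """Return a list of (accepted, throttled) pairs, one per second.

    Coarse model: a second's requests all arrive at the start of that second,
    then `rate` tokens are added, capped at `burst`.
    """
    tokens = burst  # the bucket starts full
    results = []
    for arrivals in arrivals_per_second:
        accepted = min(arrivals, tokens)
        results.append((accepted, arrivals - accepted))
        tokens = min(burst, tokens - accepted + rate)
    return results


traffic = [150, 0, 0, 50]  # one stampede, two quiet seconds, a small wave
outcome = simulate(traffic, rate=100, burst=100)
print(outcome)                      # [(100, 50), (0, 0), (0, 0), (50, 0)]
print(sum(traffic) / len(traffic))  # 50.0
```

The average is 50 requests per second, half the configured rate, and 50 requests were still turned away at the door.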

Does this haunt all throttling types?

Good news: This quirk primarily targets stage-level and account-level throttling. Usage Plans? They throttle per API key, with their own rate and burst settings plus longer-term quotas, so one noisy client drains its own bucket rather than everyone’s. But stage-level throttling? It’s the diva of the trio. Configure it carelessly, and it will sabotage your bursts like a jealous ex.

If you’ve layered all three throttling types (account, stage, usage plan), stage-level settings often dominate the drama. Check your stage settings first. Always.
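
One way to reason about layered limits: a request has to get past every layer, so the tightest one is the one you feel. A tiny sketch, with invented numbers and an invented helper name:

```python
def binding_layer(limits):
    """Return the (layer, limit) pair with the smallest limit."""
    return min(limits.items(), key=lambda item: item[1])


# Hypothetical burst limits for one API, one per throttling layer.
burst_limits = {"account": 5000, "stage": 100, "usage_plan": 500}

layer, limit = binding_layer(burst_limits)
print(layer, limit)  # stage 100
```

In this made-up setup, raising the usage plan limit would change nothing, because the stage setting is still the narrowest door.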

Taming the beast, practical fixes that work

After several caffeine-fueled debugging sessions, I’ve learned a few tricks to keep buckets full and bursts happy. None requires sacrificing a rubber chicken to the cloud gods.

1. Resize your bucket
Stage-level throttling lets you set a burst limit alongside your rate limit. Double it. Triple it. The ceiling is your account-level quota, which by default in most regions is a burst of 5,000 requests on top of a steady 10,000 requests per second. Calculate your peak bursts (use CloudWatch metrics!), then set burst capacity 20% higher. Safety margins are boring until they save your launch day.
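
The arithmetic is simple enough to sketch. This assumes you already have per-second request peaks from somewhere (access logs, for instance), and that your account burst quota is 5,000. Both the sample data and the helper are hypothetical.

```python
import math


def recommended_burst(peak_requests_per_second, headroom_percent=20,
                      account_burst_limit=5000):
    """Largest observed peak plus headroom, capped at the account-level limit."""
    peak = max(peak_requests_per_second)
    target = math.ceil(peak * (100 + headroom_percent) / 100)
    return min(target, account_burst_limit)


samples = [120, 340, 900, 410]  # made-up per-second peaks
print(recommended_burst(samples))  # 1080
print(recommended_burst([6000]))   # 5000, capped by the account limit
```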

2. Queue the chaos
Offload bursts to SQS or Kinesis. Front your API with a lightweight service that accepts requests instantly, dumps them into a queue, and processes them at a civilized pace. Users get a “we got this” response. Your bucket stays calm. Everyone wins. Except the throttling gremlins.
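
Here is the shape of that pattern in miniature. Python's in-process `queue.Queue` stands in for SQS or Kinesis, and the function names and pacing are invented for the demo, so treat it as a sketch of the idea rather than a deployment recipe.

```python
import queue
import threading
import time

work_queue = queue.Queue()  # stand-in for SQS or Kinesis
processed = []


def accept(request_id):
    """Enqueue the request and acknowledge at once, like an HTTP 202."""
    work_queue.put(request_id)
    return {"status": 202, "id": request_id}


def worker(pause_seconds):
    """Drain the queue one item at a time; the pause sets the civilized pace."""
    while True:
        item = work_queue.get()
        if item is None:  # sentinel value: time to stop
            break
        processed.append(item)
        time.sleep(pause_seconds)


consumer = threading.Thread(target=worker, args=(0.001,))
consumer.start()

# A burst of 50 requests is acknowledged immediately...
responses = [accept(i) for i in range(50)]

# ...and worked through in the background, in order.
work_queue.put(None)
consumer.join()
print(len(responses), len(processed))  # 50 50
```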

3. Smarter clients are your friends
Teach client apps to retry intelligently. Exponential backoff with jitter isn’t just jargon, it’s the art of politely asking “Can I try again later?” instead of spamming “HELLO?!” every millisecond. AWS SDKs bake this in. Use it.
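
If you are rolling your own client, a minimal version of exponential backoff with "full jitter" might look like this. The `Throttled` exception, the function names, and the timing values are assumptions for the sketch. The demo records delays instead of actually sleeping.

```python
import random
import time


class Throttled(Exception):
    """Stand-in for whatever your HTTP client raises on a 429."""


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep):
    """Retry `operation` on Throttled, waiting a random, growing delay each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller deal with it
            # Full jitter: anywhere between 0 and the capped exponential ceiling.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


calls = {"n": 0}


def flaky():
    """Fails twice with a throttle, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise Throttled("429 Too Many Requests")
    return "ok"


delays = []
result = call_with_retries(flaky, sleep=delays.append)
print(result, calls["n"], len(delays))  # ok 3 2
```

The randomness is the point: without it, every throttled client retries at the same moment and recreates the burst.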

4. Distribute the pain
Got multiple stages or APIs? Spread bursts across them. A load balancer or Route 53 weighted routing can turn one screaming bucket into several murmuring ones. It’s like splitting a rowdy party into smaller rooms.
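
With weighted routing, each target's long-run share of traffic is its weight divided by the sum of all weights. A back-of-the-envelope sketch with invented endpoint names (real DNS caching makes the split lumpier than this):

```python
def expected_share(burst_size, weights):
    """Expected number of requests per target for a weighted split."""
    total = sum(weights.values())
    return {name: burst_size * weight / total for name, weight in weights.items()}


shares = expected_share(300, {"api-a": 1, "api-b": 1, "api-c": 1})
print(shares)  # {'api-a': 100.0, 'api-b': 100.0, 'api-c': 100.0}
```

One caveat: stages and APIs in the same account and region still share the account-level limit, so this mostly helps when the separate buckets live in different regions or accounts.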

5. Monitor like a paranoid squirrel
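CloudWatch alarms for 429 Too Many Requests are non-negotiable. API Gateway does not publish a dedicated throttling metric for REST APIs; 429s are counted in 4XXError, so track 4XXError and Count per stage and use access logs to pick out the 429s. Set alerts at 70% of your burst limit. Because knowing your bucket is half-empty is far better than discovering it via customer complaints.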
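
As a starting point, here is a sketch that builds the parameters for a CloudWatch alarm on a stage's 4XXError metric, which is where API Gateway REST APIs count their 429s. The API name, stage name, threshold, and helper function are placeholders. The code only builds the payload and does not call AWS.

```python
def throttle_alarm_params(api_name, stage, threshold_per_minute):
    """Build put_metric_alarm parameters for a per-stage 4XXError alarm."""
    return {
        "AlarmName": f"{api_name}-{stage}-4xx-spike",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "4XXError",  # 429 responses are counted here
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "Statistic": "Sum",
        "Period": 60,              # seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold_per_minute,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # a quiet minute is not an incident
    }


# Placeholder values: tune the threshold to your own traffic.
params = throttle_alarm_params("orders-api", "prod", threshold_per_minute=50)
print(params["AlarmName"])  # orders-api-prod-4xx-spike

# With boto3 installed and credentials configured:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Since 4XXError also includes ordinary client mistakes such as 400s and 404s, an alarm like this is a smoke detector, not a diagnosis.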

The quiet triumph of preparedness

Cloud architecture is less about avoiding fires and more about not using gasoline as hand sanitizer. Bursty traffic will happen. Token buckets will empty. But with thoughtful configuration, you can transform throttling from a silent assassin into a predictable gatekeeper.

AWS gives you the tools. It’s up to us to wield them without setting the data center curtains ablaze. Start small. Test bursts in staging. And maybe keep that emergency coffee stash stocked. Just in case.

Your APIs deserve grace under pressure. Now go forth and throttle wisely. Or better yet, throttle less.

November 4, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff