
The secret and anxious life of a data packet inside AWS

You press a finger against the greasy glass of your smartphone. You are in a café in Melbourne, the coffee is lukewarm, and you have made the executive decision to watch a video of a cat falling off a Roomba. It feels like a trivial action.

But for the data packet birthed by that tap, this is D-Day.

It is a tiny, nervous backpacker being kicked out into the digital wilderness with nothing but a destination address and a crippling fear of latency. Its journey through Amazon’s cloud infrastructure is not the clean, sterile diagram your systems architect drew on a whiteboard. It is a micro-drama of hope, bureaucratic routing, and existential dread that plays out in roughly 200 milliseconds.

We tend to think of the internet as a series of tubes, but it is more accurate to think of it as a series of highly opinionated bouncers and overworked bureaucrats. To understand how your cat video loads, we have to follow this anxious packet through the gauntlet of Amazon Web Services (AWS).

The initial panic and the mapmaker with a god complex

Our packet leaves your phone and hits the cellular network. It is screaming for directions. It needs to find the server hosting the video, but it only has a name (e.g., cats.example.com). Computers do not speak English; they speak IP addresses.

Enter Route 53.

Amazon calls Route 53 a Domain Name System (DNS) service. In practice, it acts like a travel agent with a philosophy degree and multiple personality disorder. It does not just look up addresses; it judges you based on where you are standing and how healthy the destination looks.

If Route 53 is configured with Geolocation Routing, it acts like a local snob. It looks at our packet’s passport, sees “Melbourne,” and sneers. “You are not going to the Oregon server. The Americans are asleep, and the latency would be dreadful. You are going to Sydney.”

However, Route 53 is also a hypochondriac. Through Health Checks, it constantly pokes the servers to see if they are alive. It is the digital equivalent of texting a friend, “Are you awake?” every ten seconds. If the Sydney server fails to respond three times in a row, Route 53 assumes the worst (death, fire, or a kernel panic) and instantly reroutes our packet to Singapore. This is Failover Routing, the prepared pessimist of the group.
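For the curious, the “three strikes and you are dead” rule fits in a few lines. This is a toy model, not how Route 53 is actually implemented: the function name `createFailoverRouter`, the threshold of 3, and the endpoint names are all made up for illustration.

```javascript
// Toy model of failover routing: the primary is considered down only after
// `failureThreshold` consecutive failed health checks.
function createFailoverRouter(primary, secondary, failureThreshold = 3) {
  let consecutiveFailures = 0;

  return {
    // Record the result of one health check against the primary.
    recordCheck(healthy) {
      consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
    },
    // Answer a lookup: primary unless it has failed too many times in a row.
    resolve() {
      return consecutiveFailures >= failureThreshold ? secondary : primary;
    },
  };
}

const router = createFailoverRouter("sydney", "singapore");
router.recordCheck(false);
router.recordCheck(false);
const afterTwoFailures = router.resolve(); // two strikes: still the primary
router.recordCheck(false);
const afterThreeFailures = router.resolve(); // three strikes: fail over
router.recordCheck(true);
const afterRecovery = router.resolve(); // one good check resets the counter
```

A single successful check resets the counter here, which is the simplest possible recovery rule; a real health checker may want several good answers in a row before it trusts the patient again.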

The packet doesn’t care about the logic. It just wants an address so it can stop hyperventilating in the void.

CloudFront is the desperate golden retriever of the internet

Armed with an IP address, our packet rushes toward the destination. But hopefully, it never actually reaches the main server. That would be inefficient. Instead, it runs into CloudFront.

CloudFront is a Content Delivery Network (CDN). Think of it as a network of convenience stores scattered all over the globe, so you don’t have to drive to the factory to buy milk. Or, more accurately, think of CloudFront as a Golden Retriever that wants to please you so badly it is vibrating.

Its job is caching. It memorizes content. When our packet arrives at the CloudFront “Edge Location” in Melbourne, the service frantically checks its pockets. “Do I have the cat video? I think I have the cat video. I fetched it for that guy in the corner five minutes ago!”

If it has the video (a Cache Hit), it hands it over immediately. The packet is relieved. The journey is over. Everyone goes home happy.

But if CloudFront cannot find the video (a Cache Miss), the mood turns sour. The Golden Retriever looks guilty. It now has to turn around and run all the way to the origin server to fetch the data fresh. This is the “Edge” of the network, a place that sounds like a U2 guitarist but is actually just a rack of humming metal in a secure facility near the airport.

The tragedy of CloudFront is the Time To Live (TTL). This is the expiration date on the data. If the TTL is set to 24 hours, CloudFront will proudly hand you a version of the website from yesterday, oblivious to the fact that you fixed the spelling errors this morning. It is like a dog bringing you a dead bird it found last week, convinced it is still a great gift.
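Here is a rough sketch of the dead-bird problem. It is a toy cache, not CloudFront: `createEdgeCache`, the path, and the page contents are invented, and the clock is passed in by hand so the behaviour is easy to follow.

```javascript
// Toy edge cache: an entry is served until its TTL expires, even if the
// origin has changed in the meantime.
function createEdgeCache(fetchFromOrigin, ttlMs) {
  const entries = new Map(); // key -> { value, expiresAt }

  return function get(key, now) {
    const entry = entries.get(key);
    if (entry && now < entry.expiresAt) {
      return { value: entry.value, result: "hit" };
    }
    // Cache miss (or expired entry): run all the way back to the origin.
    const value = fetchFromOrigin(key);
    entries.set(key, { value, expiresAt: now + ttlMs });
    return { value, result: "miss" };
  };
}

let originVersion = "page with typos";
const DAY_MS = 24 * 60 * 60 * 1000;
const getFromEdge = createEdgeCache(() => originVersion, DAY_MS);

const first = getFromEdge("/index.html", 0); // miss: fetched from the origin
originVersion = "page with typos fixed"; // you deploy the fix
const second = getFromEdge("/index.html", DAY_MS / 2); // hit: yesterday's copy
const third = getFromEdge("/index.html", DAY_MS + 1); // miss: fresh copy at last
```

The middle request is the dead bird: a perfectly valid cache hit that happens to be wrong.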

The security guard who judges your shoes

If our packet suffers a Cache Miss, it must travel deeper into the data center. But first, it has to get past the Web Application Firewall (WAF).

The WAF is not a firewall in the traditional sense; it is a nightclub bouncer who has had a very long shift and hates everyone. It stands at the velvet rope, scrutinizing every packet for signs of “malicious intent.”

It checks for SQL injection, which is the digital equivalent of trying to sneak a knife into the club taped to your ankle. It checks for Cross-Site Scripting (XSS), which is essentially trying to trick the club into changing its name to “Free Drinks for Everyone.”

The WAF operates on a set of rules that range from reasonable to paranoid. Sometimes, it blocks a legitimate packet just because it looks suspicious: perhaps the packet is too large, or it came from a country the WAF has decided to distrust today. The packet pleads its innocence, but the WAF is a piece of software code; it does not negotiate. It simply returns a 403 Forbidden error, which translates roughly to: “Your shoes are ugly. Get out.”

The Application Load Balancer manages the VIP list

Having survived the bouncer, our weary packet arrives at the Application Load Balancer (ALB). If the WAF is the bouncer, the ALB is the maître d’ holding the clipboard.

The ALB is obsessed with fairness and health. It stands in front of a pool of identical servers (the Target Group) and decides who has to do the work. It is trying to prevent any single server from having a nervous breakdown due to overcrowding.

“Server A is busy processing a login request,” the ALB mutters. “Server B is currently restarting because it had a panic attack. You,” it points to our packet, “you go to Server C. It looks bored.”

The ALB’s relationship with the servers is codependent and toxic. It performs health checks on them relentlessly. It demands a 200 OK status code every thirty seconds. If a server takes too long to reply or replies with an error, the ALB declares it “Unhealthy” and stops sending it friends. It effectively ghosts the server until it gets its act together.
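The clipboard logic, reduced to a sketch. This is a simplified model rather than the ALB’s real algorithm: `createLoadBalancer` and the server names are hypothetical, and a single bad health check is enough to get a server ghosted here, whereas a real load balancer waits for several failures in a row.

```javascript
// Toy load balancer: round-robin across the targets currently marked healthy.
function createLoadBalancer(targetNames) {
  const healthy = new Map(targetNames.map((name) => [name, true]));
  let cursor = 0;

  return {
    // Anything other than a 200 marks the target unhealthy.
    reportHealthCheck(name, statusCode) {
      healthy.set(name, statusCode === 200);
    },
    // Pick the next healthy target in rotation, or null if none are left.
    route() {
      for (let i = 0; i < targetNames.length; i++) {
        const candidate = targetNames[(cursor + i) % targetNames.length];
        if (healthy.get(candidate)) {
          cursor = (cursor + i + 1) % targetNames.length;
          return candidate;
        }
      }
      return null;
    },
  };
}

const alb = createLoadBalancer(["A", "B", "C"]);
alb.reportHealthCheck("B", 503); // Server B is having its panic attack
const picks = [alb.route(), alb.route(), alb.route()]; // B never gets picked
```

When every target is unhealthy, `route()` returns `null`, which is roughly the moment the user gets to meet a 5xx error in person.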

The Origin, where the magic (and heat) happens

Finally, the packet reaches the destination. The Origin.

We like to imagine the cloud as an ethereal, fluffy place. In reality, the Origin is likely an EC2 instance, a virtual slice of a computer sitting in a windowless room in Northern Virginia or Dublin. The room is deafeningly loud with the sound of cooling fans and smells of ozone and hot plastic.

Here, the application code actually runs. The request is processed, and the server realizes it needs the actual video file. It reaches out to Amazon S3 (Simple Storage Service), which is essentially a bottomless digital bucket where the internet hoards its data.

The EC2 instance grabs the video from the bucket, processes it, and prepares to send it back.

This is the most fragile part of the journey. If the code has a bug, the server might vomit a 500 Internal Server Error. This is the server saying, “I tried, but I broke something inside myself.” If the database is overwhelmed, the request might time out.

When this happens, the failure cascades back up the chain. The ALB shrugs and tells the user “502 Bad Gateway” (translation: “The guy in the back room isn’t talking to me”). The WAF doesn’t care. CloudFront, if its error-caching TTL has been set generously, caches the error page, so now everyone sees the error for the next hour.

And somewhere, a DevOps engineer’s phone starts buzzing at 3:00 AM.

The return trip

But today, the system works. The Origin retrieves the video bytes. It hands them to the ALB, which passes them to the WAF (who checks them one last time for contraband), which hands them to CloudFront, which hands them to the cellular network.

The packet returns to your phone. The screen flickers. The cat falls off the Roomba. You chuckle, swipe up, and request the next video.

You have no idea that you just forced a tiny, digital backpacker to navigate a global bureaucracy, evade a paranoid security guard, and wake up a server in a different hemisphere, all in less time than it takes you to blink. It is a modern marvel held together by fiber optics and anxiety.

So spare a thought for the data. It has seen things you wouldn’t believe.

AWS Lambda SQS provisioned mode is cheaper than therapy

There is a specific flavor of nausea reserved for serverless engineering teams. It usually strikes at 2 a.m., shortly after a major product launch, when someone posts a triumphant screenshot of user traffic in Slack. While the marketing team is virtually high-fiving, CloudWatch quietly begins to draw a perfect, vertical line that looks less like a growth chart and more like a cliff edge.

Your SQS queues swell. Lambda invocations crawl. Suddenly, the phrase “fully managed service” sounds less comforting and more like a cruel punchline delivered by a distant cloud provider.

For years, the relationship between Amazon SQS and AWS Lambda has been the backbone of event-driven architecture. You wire up an event source mapping, let Lambda poll the queue, and trust the system to scale as messages arrive. Most days, this works beautifully. On the wrong day, under the wrong kind of spike, it works “eventually.”

But in the world of high-frequency trading or flash sales, “eventually” is just a polite synonym for “too late.”

With the release of AWS Lambda SQS Provisioned Mode on November 14, Amazon is finally admitting that sometimes magic is too slow. It grants you explicit control over the invisible workers that poll SQS for your function. It ensures they are already awake, caffeinated, and standing in line before the mob shows up. It allows you to trade a bit of extra planning (and money) for the guarantee that your system won’t hit the snooze button while your backlog turns into a towering monument to failure.

The uncomfortable truth about standard SQS polling

To understand why we need Provisioned Mode, we have to look at the somewhat lazy nature of the standard behavior.

Out of the box, Lambda uses an event source mapping to poll SQS on your behalf. You give it a queue and some basic configuration, and Lambda spins up pollers to check for work. You never see these pollers. They are the ghosts in the machine.

The problem with ghosts is that they are not particularly urgent. When a massive spike hits your queue, Lambda realizes it needs more pollers and more concurrent function invocations. However, it does not do this instantly. It ramps up. It adds capacity in increments, like a cautious driver merging onto a freeway.

For a steady workload, you will never notice this ramp-up. But during a viral marketing campaign or a market crash, those minutes of warming up feel like an eternity. You are essentially watching a barista who refuses to start grinding coffee beans until the line of customers has already curled around the block.

Standard SQS polling gives you tools like batch size, but it denies you direct influence over the urgency of the consumption. You cannot tell the system, “I need ten workers ready right now.” You can only stand in line and hope the algorithm notices you are drowning.

This is acceptable for background jobs like resizing images or sending emails. It is decidedly less acceptable for payment processing or fraud detection. In those cases, watching twenty thousand messages pile up while your system “automatically scales” is not an architectural feature. It is a resume-generating event.

Paying for a standing army instead of volunteers

Provisioned Mode flips the script on this reactive behavior. Instead of letting Lambda decide how many pollers to use based purely on demand, you tell it the minimum and maximum number of event pollers you want reserved for that queue.

An event poller is a dedicated worker that reads from SQS and hands batches of messages to your function. In standard mode, these pollers are summoned from a shared pool when needed. In Provisioned Mode, you are paying to keep them on retainer.

Think of it as the difference between calling a ride-share service and hiring a private driver to sit in your driveway with the engine running. One is efficient for the general public; the other is necessary if you need to leave the house in exactly three seconds.

The benefits are stark when translated into human terms.

First, you get speed. AWS advertises significantly faster scaling for SQS event source mappings in Provisioned Mode. We are talking about adding up to one thousand new concurrent invocations per minute.

Second, you get capacity. Provisioned Mode can support massive concurrency per SQS mapping, far higher than the default capabilities.

Third, and perhaps most importantly, you get predictability. A single poller is not just a warm body. It is a unit of throughput (handling up to 1 MB per second or 10 concurrent invokes). By setting a minimum number of pollers, you are mathematically guaranteeing a baseline of throughput. You are no longer hoping the waiters show up; you have paid their salaries in advance.
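The arithmetic behind that guarantee is pleasantly boring. The sketch below only uses the per-poller figures quoted above (1 MB per second, 10 concurrent invokes); the function names are invented for illustration, and whichever of the two limits you hit first is the one that matters.

```javascript
// Per-poller figures as quoted in the text above.
const MB_PER_SECOND_PER_POLLER = 1;
const CONCURRENT_INVOKES_PER_POLLER = 10;

// Baseline capacity that a given minimum poller count keeps on standby.
function baselineCapacity(minimumPollers) {
  return {
    throughputMBps: minimumPollers * MB_PER_SECOND_PER_POLLER,
    concurrentInvokes: minimumPollers * CONCURRENT_INVOKES_PER_POLLER,
  };
}

// Smallest poller count that covers a target number of concurrent invokes.
function pollersForConcurrency(targetConcurrentInvokes) {
  return Math.ceil(targetConcurrentInvokes / CONCURRENT_INVOKES_PER_POLLER);
}

const standby = baselineCapacity(5); // what 5 pollers keep warm
const needed = pollersForConcurrency(125); // pollers to cover 125 invokes
```

So a minimum of 5 pollers keeps 50 concurrent invokes (or 5 MB per second) on retainer, and covering 125 concurrent invokes takes 13 pollers, because you cannot hire a fraction of a waiter.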

Configuring this without losing your mind

The good news is that Provisioned Mode is not a new service with its own terrifying learning curve. It is just a configuration toggle on the event source mapping you are already using. You can set it up in the AWS Console, the CLI, or your Infrastructure as Code tool of choice.

The interface asks for two numbers, and this is where the engineering art form comes in.

First, it asks for Minimum Pollers. This is the number of workers you always want ready.

Second, it asks for Maximum Pollers. This is the ceiling, the limit you set to ensure you do not accidentally DDoS your own database.

Choosing these numbers feels a bit like gambling, but there is a logic to it. For the minimum, pick a number that comfortably handles your typical traffic plus a standard spike. Start small. Setting this to 100 when you usually need 2 is the serverless equivalent of buying a school bus to commute to work alone.

For the maximum, look at your downstream systems. There is no point in setting a maximum that allows 5,000 concurrent Lambda functions if your relational database curls into a fetal position at 500 connections.
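One way to turn that into a number is to work backwards from the database. Everything here is an assumption you should replace with your own figures: the helper name, the idea that each invocation holds one connection, the 80% headroom, and the 10 concurrent invokes per poller quoted earlier.

```javascript
// Work backwards from what the database can take to a poller ceiling.
function maxPollersForDownstream(
  connectionLimit,
  connectionsPerInvoke = 1, // assumption: one DB connection per invocation
  headroomPercent = 80, // assumption: leave 20% of connections for everyone else
  invokesPerPoller = 10 // per-poller concurrency quoted in the text
) {
  const usableConnections = Math.floor((connectionLimit * headroomPercent) / 100);
  const maxConcurrentInvokes = Math.floor(usableConnections / connectionsPerInvoke);
  // A result of 0 means even a single poller at full tilt could exceed the limit.
  return Math.floor(maxConcurrentInvokes / invokesPerPoller);
}

const ceilingFor500 = maxPollersForDownstream(500); // the fetal-position database
const ceilingForTiny = maxPollersForDownstream(5); // a very small database
```

For the 500-connection database this suggests a maximum of 40 pollers, a long way from the 5,000 concurrent functions that would flatten it.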

Once you enable it, you need to watch your metrics. Keep an eye on “Queue Depth” and “Age of Oldest Message” (in CloudWatch, the SQS metrics ApproximateNumberOfMessagesVisible and ApproximateAgeOfOldestMessage). If the backlog clears too slowly, buy more pollers. If your database administrator starts sending you angry emails in all caps, reduce the maximum. The goal is not perfection on day one; it is to replace guesswork with a feedback loop.

The financial hangover

Nothing in life is free, and this applies doubly to AWS features that solve headaches.

When you enable Provisioned Mode, AWS begins charging you for “Event Poller Units.” You pay for the minimum pollers you configure, regardless of whether there are messages in the queue. You are paying for readiness.

This is a mental shift for serverless purists. The whole promise of serverless was “pay for what you use.” Provisioned Mode is “pay for what you might need.”

You are essentially renting a standing army. Most of the time, they will just stand there, playing cards and eating your budget. But when the enemy (traffic) attacks, they are already in position. Standard SQS polling is cheaper because it relies on volunteers. Volunteers are free, but they take a while to put on their boots.

From a FinOps perspective, or simply from the perspective of explaining the bill to your boss, the question is not “Is this expensive?” The question is “What is the cost of latency?”

For a background report generator, a five-minute delay costs nothing. For a high-frequency trading platform, a five-second delay costs everything. You should not enable Provisioned Mode on every queue in your account. That would be financial malpractice. You reserve it for the critical paths, the workflows where the price of slowness is measured in lost customers rather than just infrastructure dollars.

Why you should care about the fourth dial

Architecturally, Provisioned Mode gives us a new layer of control. Previously, we had three main dials in event-driven systems: how fast we write to the queue, how fast the consumers process messages, and how much concurrency Lambda is allowed.

Provisioned Mode adds a fourth dial: the aggression of the retrieval.

It allows you to reason about your system deterministically. If you know that one poller provides X amount of throughput, you can stack them to meet a specific Service Level Agreement. It turns a “best effort” system into a “calculated guarantee” system.

Serverless was sold to us as freedom from capacity planning. We were told we could just write code and let the cloud handle the undignified details of scaling. For many workloads, that promise holds true.

But as your workloads become more critical, you discover the uncomfortable corners where “just let it scale” is not enough. Latency budgets shrink. Compliance rules tighten. Customers grow less patient.

AWS Lambda SQS Provisioned Mode is a small, targeted answer to that discomfort. It allows you to say, “I want at least this much readiness,” and have the platform respect that wish, even when your traffic behaves like a toddler on a sugar high.

So, pick your most critical queue. The one that keeps you awake at night. Enable Provisioned Mode, set a modest minimum, and watch the metrics. Your future self, staring at a flat latency graph during the next Black Friday, will be grateful you decided to stop trusting in magic and started paying for physics.

November 20, 2025 by Fernando

Escaping the AWS NAT Gateway toll booth

My coffee went cold. I was staring at my AWS bill, and one line item was staring back at me with a judgmental smirk: NAT Gateway: 33,01 €.

This wasn’t for compute. This wasn’t for storing terabytes of crucial data. This was for the simple, mundane privilege of letting my Lambda functions send emails and tell Stripe to charge a credit card.

Let’s talk about NAT Gateway pricing. It’s a special kind of pain.

$0.045 per hour (That’s roughly $33 a month, just for existing).
$0.045 per GB processed (You get charged for your own data).
…and that’s per Availability Zone. For High Availability, you multiply by two or three.
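If you want to feel the pain for your own setup, the math is short. The two rates are the ones listed above; the 730-hour month and the function name are assumptions for illustration, and real prices vary by region.

```javascript
// Rates as listed above; 730 hours is an assumed average month.
const HOURLY_RATE_USD = 0.045;
const PER_GB_RATE_USD = 0.045;
const HOURS_PER_MONTH = 730;

// Monthly NAT Gateway bill: a fixed fee per Availability Zone plus a fee
// for every GB that passes through.
function natGatewayMonthlyCost(availabilityZones, gbProcessed) {
  const fixed = HOURLY_RATE_USD * HOURS_PER_MONTH * availabilityZones;
  const variable = PER_GB_RATE_USD * gbProcessed;
  return Math.round((fixed + variable) * 100) / 100;
}

const singleAzIdle = natGatewayMonthlyCost(1, 0); // one gateway doing nothing
const threeAzBusy = natGatewayMonthlyCost(3, 100); // three gateways, 100 GB
```

One idle gateway comes to about $32.85 a month. Three of them pushing 100 GB come to about $103, before your application has done anything useful.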

I was suddenly paying more for a digital toll booth operator than I was for the actual application logic running my startup. That’s when I started asking questions. Did I really need this? What was I actually paying for? And more importantly, was there another way?

This is the story of how I hunted down that 33€ line item. By the end, you’ll know exactly if you need a NAT Gateway, or if you’re just burning money to keep the AWS machine fed.

The great NAT lie

Every AWS tutorial, every Stack Overflow answer, every “serverless best practice” blog post chants the same mantra: “If your Lambda needs to access the internet, and it’s in a VPC, you need a NAT Gateway.”

It’s presented as a law of physics. Like gravity, or the fact that DNS will always be the problem. And I, like a good, obedient engineer, followed the instructions. I clicked the button. I added the NAT. And then the bill came.

It turns out that obedience is expensive.

The gilded cage we call a VPC

Before we storm the castle, we have to understand why we built the castle in the first place. Why are our Lambdas in this mess? The answer is the Virtual Private Cloud (VPC).

By default, a Lambda function is a free spirit. It’s born with a magical, AWS-managed connection to the outside world. It can call any API it wants. It’s a social butterfly.

But then, security happens.

We have a managed database, like MongoDB Atlas. We absolutely, positively do not want this database exposed to the public internet. That’s like shouting your bank details across a crowded shopping mall. So, we rightly configure it to only accept private connections.

To let our Lambda talk to this database, we have to build a “gated community” for it. That’s our VPC. We move the Lambda inside this community and set up a “VPC Peering” connection, which is like a private, guarded footpath between our VPC and the MongoDB VPC.

Our Lambda can now securely whisper secrets to the database. The traffic never touches the public internet. We are secure. We are compliant. We are… trapped.

House arrest

We solved one problem but created a massive new one. In building this fortress to protect our database, we built it with no doors to the outside world.

Our Lambda is now on house arrest.

Sure, it can talk to the database in the adjoining room. But it can no longer call the Stripe API to process a payment. It can’t call an email service. It can’t even phone its own cousins in the AWS family, like AWS Secrets Manager or S3 (not without extra work, anyway). Any attempt to reach the internet just… times out. It’s the sound of silence.

This is the dilemma. To be secure, our Lambda must be in a VPC. But once in a VPC, it’s useless for half its job.

Enter the expensive chaperone

This is where the AWS Gospel presents its solution: the NAT Gateway.

The NAT (Network Address Translation) Gateway is, in our analogy, an extremely expensive, bonded chaperone.

You place this chaperone in a “public” part of your gated community (a public subnet). When your Lambda on house arrest needs to send a letter to the outside world (like an API call to Stripe), it gives the letter to the chaperone.

The chaperone (the NAT) takes the letter, walks it to the main gate, puts its own public return address on it, and sends it. When the reply comes back, the chaperone receives it, verifies it’s for the Lambda, and delivers it.

This works. It’s secure. The Lambda’s private address is never exposed.

But this chaperone charges you. It charges you by the hour just to be on call. It charges you for every letter it carries (data processed). And as we established, you need two or three of them if you want to be properly redundant.

This is a racket.

The “Split Personality” solution

I refused to pay the toll. There had to be another way. The solution came from realizing I was trying to make one Lambda do two completely opposite jobs.

What if, instead of one “do-it-all” Lambda, I created two specialists?

The hermit: This Lambda lives inside the VPC. Its one and only job is to talk to the database. It is antisocial, secure, and has no idea the internet exists.
The messenger: This Lambda lives outside the VPC. It’s a “free-range” Lambda. Because it’s not attached to any VPC, AWS magically gives it that default internet access. It cannot talk to the database (which is good!), but it can talk to Stripe all day long.

The plan is simple: when The hermit (VPC Lambda) needs something from the internet, it invokes The messenger (Proxy Lambda). It hands it a note: “Please tell Stripe to charge $25.00.” The messenger runs the errand, gets the receipt, and passes it back to The hermit, who then safely logs the result in the database.

It’s a “split personality” architecture.

But is it safe?

I can hear you asking: “Wait. A Lambda with internet access? Isn’t that like leaving your front door wide open for attackers?”

No. And this is the most beautiful part.

A Lambda function, whether in a VPC or not, never gets a public IP address. It can make outbound calls, but nothing from the public internet can initiate a call to it.

It’s like having a phone that can only make calls, not receive them. It’s unreachable. The “Messenger” Lambda is perfectly safe to live outside the VPC, ready to do our bidding.

The secret tunnel system

So, I built it. The hermit. The messenger. I was a genius. I hit “test.”

…timeout.

Of course. I forgot. The hermit is still on house arrest. “Invoking” another Lambda is, itself, an AWS API call. It’s a request that has to leave the VPC to reach the AWS Lambda service. My Lambda couldn’t even call its own lawyer.

This is where the real solution lies. Not in a gateway, but in a series of tunnels.

They’re called VPC Endpoints.

A VPC Endpoint is not a big, expensive, public chaperone. It’s a private, secret tunnel that you build directly from your VPC to a specific AWS service, all within the AWS network.

So, I built two tunnels:

A tunnel to AWS Secrets Manager: Now my hermit Lambda can get its API keys directly, without ever leaving the house.
A tunnel to AWS Lambda: Now my hermit Lambda can use its private phone to “invoke” The messenger.

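These endpoints have a small hourly cost (interface endpoints are billed per endpoint, per Availability Zone), but it’s a fraction of a NAT Gateway, and the data processing fee is either tiny or free, depending on the endpoint type. We’ve replaced a roughly $100/mo toll road (three NAT Gateways, one per Availability Zone) with a private footpath that costs a fraction of that.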

(A grumpy side note: annoyingly, some AWS services like Cognito don’t support VPC Endpoints. For those, you still have to use the Messenger proxy pattern. But for most, the tunnels work.)

Our glorious new contraption

Let’s look at our payment handler again. This little function needed to:

Get API keys from AWS Secrets Manager.
Call Stripe’s API.
Write the transaction to MongoDB.

Here is how our new, glorious, Rube Goldberg machine works:

Step 1: The Payment Lambda (The hermit) gets a request.
Step 2: It needs keys. It pops over to AWS Secrets Manager through its private tunnel (the VPC Endpoint). No internet needed.
Step 3: It needs to charge a card. It calls the invoke command, which goes through its other private tunnel to the AWS Lambda service, triggering The messenger.
Step 4: The messenger (Proxy Lambda), living in the free-range world, makes the outbound call to Stripe. Stripe, delighted, processes the payment and sends a reply.
Step 5: The messenger passes the success (or failure) response back to The hermit.
Step 6: The hermit, now holding the result, calmly turns and writes the transaction record to MongoDB via its private VPC Peering connection.

Everything works. Nothing is exposed. And the NAT Gateway bill is 0€.

For those who speak in code

Here is a simplified look at what our two specialist Lambdas are doing.

Payment Lambda (The hermit – INSIDE VPC)

// This Lambda is attached to your VPC
// It needs VPC Endpoints for 'lambda' and 'secretsmanager'

import { InvokeCommand, LambdaClient } from "@aws-sdk/client-lambda";
// ... (imports for Secrets Manager and Mongo)

const lambda = new LambdaClient({});

export const handler = async (event) => {
  try {
    const amountToCharge = 2500; // 25.00

    // 1. Get secrets via VPC Endpoint
    // const apiKeys = await getSecretsFromManager();
    
    // 2. Prepare to invoke the proxy
    const command = new InvokeCommand({
      FunctionName: process.env.PAYMENT_PROXY_FUNCTION_NAME,
      InvocationType: "RequestResponse",
      Payload: JSON.stringify({
        chargeDetails: { amount: amountToCharge, currency: "usd" },
      }),
    });

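    // 3. Invoke the proxy Lambda via VPC Endpoint
    const response = await lambda.send(command);
    // If the proxy itself threw, Lambda sets FunctionError and the payload
    // holds the error details instead of our { status, ... } object
    if (response.FunctionError) {
      throw new Error(`Proxy invocation failed: ${response.FunctionError}`);
    }
    const proxyResponse = JSON.parse(
      Buffer.from(response.Payload).toString()
    );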

    if (proxyResponse.status === "success") {
      // 4. Write to MongoDB via VPC Peering
      // await writePaymentRecordToMongo(proxyResponse.transactionId);
      
      return {
        statusCode: 200,
        body: `Payment succeeded! TxID: ${proxyResponse.transactionId}`,
      };
    } else {
      // Handle payment failure
      return { statusCode: 400, body: "Payment failed." };
    }
  } catch (error) {
    console.error(error);
    return { statusCode: 500, body: "Server error" };
  }
};

Proxy Lambda (The messenger – OUTSIDE VPC)

// This Lambda is NOT attached to a VPC
// It has default internet access

// ... (import for your Stripe client)
// const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);

export const handler = async (event) => {
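  // 1. Extract the data from the invoking Hermit.
  // Lambda hands the parsed invoke Payload to the handler as `event` itself,
  // so there is no `payload` wrapper to unwrap.
  const { chargeDetails } = event;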

  try {
    // 2. Call the external Stripe API
    // const stripeResponse = await stripe.charges.create({
    //   amount: chargeDetails.amount,
    //   currency: chargeDetails.currency,
    //   source: "tok_visa", // Example token
    // });
   
    // Mocking the Stripe call for this example
    const stripeResponse = {
        id: `txn_${Math.random().toString(36).substring(2, 15)}`,
        status: 'succeeded'
    };


    if (stripeResponse.status === 'succeeded') {
      // 3. Return the successful result
      return {
        status: "success",
        transactionId: stripeResponse.id,
      };
    } else {
      return { status: "failed", error: "Stripe decline" };
    }
  } catch (err) {
    // 4. Return any errors
    return {
      status: "failed",
      error: `Error contacting Stripe: ${err.message}`,
    };
  }
};

Was it worth it?

And there it is. A production-grade, secure, and resilient system. Our hermit Lambda is safe in its VPC, talking to the database, our Messenger Lambda is happily running errands on the internet, and our secret tunnels are connecting everything privately.

That said, figuring all this out and integrating it into a production system takes a significant amount of time. This… this contraption of proxies and endpoints is, frankly, a headache.

If you don’t want the headache, sometimes it’s easier to just pay that damn 30€ for a NAT Gateway and move on with your life.

The purpose of this article wasn’t just to save a few bucks. It was to pull back the curtain. To show that the “one true way” isn’t the only way, and to prove that with a little bit of architectural curiosity, you can, in fact, escape the AWS NAT Gateway toll booth.

November 15, 2025 by Fernando

Your Multi-Region strategy is a fantasy

The recent failure showed us the truth: your data is stuck, and active-active failover is a fantasy for 99% of us. Here’s a pragmatic high-availability strategy that actually works.

Well, that was an intense week.

When the great AWS outage of October 2025 hit, I did what every senior IT person does: I grabbed my largest coffee mug, opened our monitoring dashboard, and settled in to watch the world burn. us-east-1, the internet’s stubbornly persistent center of gravity, was having what you’d call a very bad day.

And just like clockwork, as the post-mortems rolled in, the old, tired refrain started up on social media and in Slack: “This is why you must be multi-region.”

I’m going to tell you the truth that vendors, conference speakers, and that one overly enthusiastic junior dev on your team won’t. For 99% of companies, “multi-region” is a lie.

It’s an expensive, complex, and dangerous myth sold as a silver bullet. And the recent outage just proved it.

The “Just Be Multi-Region” fantasy

On paper, it sounds so simple. It’s a lullaby for VPs.

You just run your app in us-east-1 (Virginia) and us-west-2 (Oregon). You put a shiny global load balancer in front, and if Virginia decides to spontaneously become an underwater volcano, poof! All your traffic seamlessly fails over to Oregon. Zero downtime. The SREs are heroes. Champagne for everyone.

This is a fantasy.

It’s a fantasy that costs millions of dollars and lures development teams into a labyrinth of complexity they will never escape. I’ve spent my career building systems that need to stay online. I’ve sat in the planning meetings and priced out the “real” cost. Let me tell you, true active-active multi-region isn’t just “hard”; it’s a completely different class of engineering.

And it’s one that your company almost certainly doesn’t need.

The three killers of Multi-Region dreams

It’s not the application servers. Spinning up EC2 instances or containers in another region is the easy part. That’s what we have Infrastructure as Code for. Any intern can do that.

The problem isn’t the compute. The problem is, and always has been, the data.

Killer 1: Data has gravity, and it’s a jerk

This is the single most important concept in cloud architecture. Data has gravity.

Your application code is a PDF. It’s stateless and lightweight. You can email it, copy it, and run it anywhere. Your 10TB PostgreSQL database is not a PDF. It’s the 300-pound antique oak desk the computer is sitting on. You can’t just “seamlessly fail it over” to another continent.

To have a true seamless failover, your data must be available in the second region at the exact moment of the failure. This means you need synchronous, real-time replication across thousands of miles.

Guess what that does to your write performance? It’s like trying to have a conversation with someone on Mars. A round trip from Virginia to Oregon costs tens of milliseconds (roughly 60 to 80 ms in practice), and every synchronous commit has to wait for at least one of them. A transaction that makes several round trips quickly runs into the hundreds of milliseconds, and the application becomes painfully slow. Every time a user clicks “save,” they have to wait for a photon to physically travel across the country and back. Your users will hate it.

“Okay,” you say, “we’ll use asynchronous replication!”

Great. Now when us-east-1 fails, you’ve lost whatever hadn’t replicated yet: a few seconds of writes on a good day, minutes if replication was lagging when the region went dark. Every transaction, every new user sign-up, every shopping cart order in that window. Vanished. You’ve traded a “Recovery Time” of zero for a “Data Loss” that is completely unacceptable. Go explain to the finance department that you purposefully designed a system that throws away the most recent customer orders. I’ll wait.

This is the trap. Your compute is portable; your data is anchored.
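To put a rough number on the synchronous-replication penalty, here is a back-of-the-envelope sketch in Python. The distance and the speed of light in fiber are assumptions chosen for illustration, and real networks add routing and queuing delay on top, so treat the result as a floor rather than a measurement.

```python
# Rough lower bound on cross-region round-trip time.
# Both constants are assumptions for illustration, not measured values.
DISTANCE_KM = 3_700              # assumed fiber-path distance, Virginia to Oregon
FIBER_SPEED_KM_PER_S = 200_000   # light in glass travels at roughly 2/3 of c


def min_round_trip_ms(distance_km: float,
                      speed_km_per_s: float = FIBER_SPEED_KM_PER_S) -> float:
    """Best-case round trip in milliseconds, ignoring routers and queues."""
    return 2 * distance_km / speed_km_per_s * 1_000


def added_latency_ms(round_trips: int, rtt_ms: float) -> float:
    """Extra wait for a transaction that needs N sequential round trips."""
    return round_trips * rtt_ms


rtt = min_round_trip_ms(DISTANCE_KM)
print(f"best-case RTT: {rtt:.0f} ms")
print(f"5 sequential round trips: {added_latency_ms(5, rtt):.0f} ms")
```

Even the physics-only floor is tens of milliseconds per round trip, and a chatty transaction pays it several times over.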

Killer 2: The astronomical cost

I was on a project once where the CTO, fresh from a vendor conference, wanted a full active-active multi-region setup. We scoped it.

Running 2x the servers was fine. The real cost was the inter-region data transfer.

AWS (and all cloud providers) charge an absolute fortune for data moving between their regions. It’s the “hotel minibar” of cloud services. Every single byte your database replicates, every log, every file transfer… cha-ching.

Our projected bill for the data replication and the specialized services (like Aurora Global Databases or DynamoDB Global Tables) was three times the cost of the entire rest of the infrastructure.

You are paying a massive premium for a fleet of servers, databases, and network gateways that are sitting idle 99.9% of the time. It’s like buying the world’s most expensive gym membership and only going once every five years to “test” it. It’s an insurance policy so expensive, you can’t afford the disaster it’s meant to protect you from.

Killer 3: The crushing complexity

A multi-region system isn’t just two copies of your app. It’s a brand new, highly complex, slightly psychotic distributed system that you now have to feed and care for.

You now have to solve problems you never even thought about:

Global DNS failover: How does Route 53 know a region is down? Health checks fail. But what if the health check itself fails? What if the health check thinks Virginia is fine, but it’s just hallucinating?
Data write conflicts: This is the fun part. What if a user in New York (writing to us-east-1) and a user in California (writing to us-west-2) update the same record at the same time? Welcome to the world of split-brain. Who wins? Nobody. You now have two “canonical” truths, and your database is having an existential crisis. Your job just went from “Cloud Architect” to “Data Therapist.”
Testing: How do you even test a full regional failover? Do you have a big red “Kill Virginia” button? Are you sure you know what will happen when you press it? On a Tuesday afternoon? I didn’t think so.
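The write-conflict problem is easier to feel with a toy example. The sketch below assumes a last-writer-wins rule, which is one common way to resolve conflicts, not necessarily what your database does. The `Write` class and the sample values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    region: str
    value: str
    timestamp_ms: int  # wall-clock time at the region that accepted the write


def last_write_wins(a: Write, b: Write) -> Write:
    """Resolve a conflict by keeping the write with the later timestamp.

    Ties are broken by region name so both sides pick the same winner.
    """
    return max(a, b, key=lambda w: (w.timestamp_ms, w.region))


# Two users update the same record within a few milliseconds of each other.
east = Write(region="us-east-1", value="address=12 Oak St", timestamp_ms=1_000)
west = Write(region="us-west-2", value="address=98 Pine Ave", timestamp_ms=1_003)

winner = last_write_wins(east, west)
loser = east if winner is west else west
print(f"kept:    {winner.value} ({winner.region})")
print(f"dropped: {loser.value} ({loser.region})")
```

Both regions end up agreeing, but only because one user’s update was silently thrown away. Nobody got an error message.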

You haven’t just doubled your infrastructure; you’ve 10x’d your architectural complexity.

But we have Kubernetes because we are Cloud Native

This was my favorite part of the October 2025 outage.

I saw so many teams that thought Kubernetes would save them. They had their fancy federated K8s clusters spanning multiple regions, YAML files as far as the eye could see.

And they still went down.

Why? Because Kubernetes doesn’t solve data gravity!

Your K8s cluster in us-west-2 dutifully spun up all your application pods. They woke up, stretched, and immediately started screaming: “WHERE IS MY DISK?!”

Your persistent volumes (PVs) are backed by EBS or EFS. That ‘E’ stands for ‘Elastic,’ not ‘Extradimensional.’ An EBS volume is pinned to a single Availability Zone and an EFS file system to a single region, and for your workload both of those live in Virginia. Your pods in Oregon can’t mount a disk that lives 3,000 miles away.

Unless you’ve invested in another layer of incredibly complex, eye-wateringly expensive storage replication software, your “cloud-native” K8s cluster was just a collection of very expensive, very confused applications shouting into the void for a database that was currently offline.

A pragmatic high availability strategy that actually works

So if multi-region is a lie, what do we do? Just give up? Go home? Take up farming?

Yes. You accept some downtime.

You stop chasing the “five nines” (99.999%) myth and start being honest with the business. Your goal is not “zero downtime.” Your goal is a tested and predictable recovery.

Here is the sane strategy.

1. Embrace Multi-AZ (The real HA)

This is what AWS actually means by “high availability.” Run your application across multiple Availability Zones (AZs) within a single region. An AZ is one or more physically separate data centers. The AZs in a region (us-east-1a and us-east-1b, for example) sit miles apart, with independent power and networking.

This is like having a backup generator for your house. Multi-region is like building an identical, fully-furnished duplicate house in another city just in case a meteor hits your first one.

Use a Multi-AZ RDS instance. Use an Auto Scaling Group that spans AZs. This protects you from 99% of common failures: a server rack dying, a network switch failing, or a construction crew cutting a fiber line. This should be your default. It’s cheap, it’s easy, and it works.

2. Focus on RTO and RPO

Stop talking about “nines” and start talking about two simple numbers:

RTO (Recovery Time Objective): How fast do we need to be back up?
RPO (Recovery Point Objective): How much data can we afford to lose?

Get a real answer from the business, not a fantasy. Is a 4-hour RTO and a 15-minute RPO acceptable? For almost everyone, the answer is yes.
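Once you have those two numbers, you can sanity-check a recovery plan against them before the disaster instead of during it. This is a minimal sketch: the `DrPlan` fields and every duration in it are hypothetical placeholders, so substitute timings from your own fire drills.

```python
from dataclasses import dataclass


@dataclass
class DrPlan:
    snapshot_interval_min: float   # how often a snapshot is taken
    snapshot_copy_min: float       # time to copy a snapshot to the DR region
    detect_and_decide_min: float   # time to notice the outage and declare a disaster
    rebuild_min: float             # time to build infrastructure and restore the database
    dns_cutover_min: float         # time for the DNS change to take effect


def worst_case_rpo_min(plan: DrPlan) -> float:
    """The newest usable snapshot can be one interval plus one copy old."""
    return plan.snapshot_interval_min + plan.snapshot_copy_min


def estimated_rto_min(plan: DrPlan) -> float:
    """Recovery steps run one after another, so their durations add up."""
    return plan.detect_and_decide_min + plan.rebuild_min + plan.dns_cutover_min


# Made-up timings, checked against a 4-hour RTO and a 15-minute RPO.
plan = DrPlan(snapshot_interval_min=15, snapshot_copy_min=10,
              detect_and_decide_min=30, rebuild_min=120, dns_cutover_min=5)
RTO_TARGET_MIN, RPO_TARGET_MIN = 240, 15

rto, rpo = estimated_rto_min(plan), worst_case_rpo_min(plan)
print(f"RTO {rto:.0f} min, meets target: {rto <= RTO_TARGET_MIN}")
print(f"RPO {rpo:.0f} min, meets target: {rpo <= RPO_TARGET_MIN}")
```

With these placeholder numbers the RTO fits comfortably, while the worst-case RPO overshoots, because the snapshot interval alone is not the RPO. The copy time counts too.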

3. Build a “Warm Standby” (The sane DR)

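This is the strategy that actually works. It’s the “fire drill” plan, not the “build a duplicate city” plan. Strictly speaking, AWS’s disaster recovery guidance would file this under “backup and restore” or “pilot light,” since nothing runs in the second region until you need it; a textbook warm standby keeps a scaled-down copy running. The label matters less than the plan.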

Infrastructure: Your entire infrastructure is defined in Terraform or CloudFormation. You can rebuild it from scratch in any region with a single command.
Data: You take regular snapshots of your database (e.g., every 15 minutes) and automatically copy them to your disaster recovery region (us-west-2).
The plan: When us-east-1 dies, you declare a disaster. The on-call engineer runs the “Deploy-to-DR” script.

Here’s a taste of what that “sane” infrastructure-as-code looks like. You’re not paying for two of everything. You’re paying for a blueprint and a backup.

# main.tf (in your primary region module)
# This is just a normal server
resource "aws_instance" "app_server" {
  count         = 3 # Your normal production count
  ami           = "ami-0abcdef123456"
  instance_type = "t3.large"
  # ... other config
}

# dr.tf (in your DR region module)
# This server doesn't even exist... until you need it.
resource "aws_instance" "dr_app_server" {
  # This is the magic.
  # This resource is "off" by default (count = 0).
  # You flip one variable (is_disaster = true) to build it.
  count         = var.is_disaster ? 3 : 0
  provider      = aws.dr_region # Pointing to us-west-2
  ami           = var.dr_ami_id # Hypothetical variable: AMI IDs are per-region, so use the copy in us-west-2
  instance_type = "t3.large"
  # ... other config
}

resource "aws_db_instance" "dr_database" {
  count                   = var.is_disaster ? 1 : 0
  provider                = aws.dr_region
  
  # Here it is: You build the new DB from the
  # latest snapshot you've been copying over.
  # (snapshot_identifier restores from a snapshot;
  # replicate_source_db is for read replicas.)
  snapshot_identifier     = var.latest_db_snapshot_arn
  
  instance_class          = "db.r5.large"
  # ... other config
}

You flip a single DNS record in Route 53 to point all traffic to the new load balancer in us-west-2.

Yes, you have downtime (your RTO of 2–4 hours). Yes, you might lose 15 minutes of data (your RPO).

But here’s the beautiful part: it actually works, it’s testable, and it costs a tiny fraction of an active-active setup.

The AWS outage in October 2025 wasn’t a lesson in the need for multi-region. It was a global, public, costly lesson in humility. It was a reminder to stop chasing mythical architectures that look good on a conference whiteboard and focus on building resilient, recoverable systems.

So, stop feeling guilty because your setup doesn’t span three continents. You’re not lazy; you’re pragmatic. You’re the sane one in a room full of people passionately arguing about the best way to build a teleporter for that 300-pound antique oak desk.

Let them have their complex, split-brain, data-therapy sessions. You’ve chosen a boring, reliable, testable “warm standby.” You’ve chosen to get some sleep.

November 7, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Burst traffic realities for AWS API Gateway Architects

Let’s be honest. Cloud architecture promises infinite scalability, but sometimes it feels like we’re herding cats wearing rocket boots. I learned this the hard way when my shiny serverless app, built with all the modern best practices, started hiccuping like a soda-drunk kangaroo during a Black Friday sale. The culprit? AWS API Gateway throttling under bursty traffic. And no, it wasn’t my coffee intake causing the chaos.

The token bucket, a simple idea with a sneaky side

AWS API Gateway uses a token bucket algorithm to manage traffic. Picture a literal bucket. Tokens drip into it at a steady rate, your rate limit. Each incoming request steals a token to pass through. If the bucket is empty? Requests get throttled. Simple, right? Like a bouncer checking IDs at a club.

But here’s the twist: This bouncer has a strict hourly wage. If 100 requests arrive in one second, they’ll drain the bucket faster than a toddler empties a juice box. Then, even if traffic calms down, the bucket refills slowly. Your API is stuck in timeout purgatory while tokens trickle back. AWS documents this, but it’s easy to miss until your users start tweeting about your “haunted API.”

Bursty traffic is life’s unpredictable roommate

Bursty traffic isn’t a bug; it’s a feature of modern apps. Think flash sales, mobile app push notifications, or that TikTok dance challenge your marketing team insisted would go viral (bless their optimism). Traffic doesn’t flow like a zen garden stream. It arrives in tsunami waves.

I once watched a client’s analytics dashboard spike at 3 AM. Turns out, their smart fridge app pinged every device simultaneously after a firmware update. The bucket emptied. Alarms screamed. My weekend imploded. Bursty traffic doesn’t care about your sleep schedule.

When bursts meet buckets, the throttling tango

Here’s where things get spicy. API Gateway’s token bucket has a burst capacity. For stage-level throttling, it’s tied to your rate limit. Set a rate of 100 requests/second? Your bucket holds 100 tokens. Send 150 requests in one burst? The first 100 sail through. The next 50 get throttled, even if the average traffic is below 100/second.

It’s like a theater with 100 seats. If 150 people rush the door at once, 50 get turned away, even if half the theater is empty later. AWS isn’t being petty. It’s protecting downstream services (like your database) from sudden stampedes. But when your app is the one getting trampled? Less poetic. More infuriating.
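The theater example can be reproduced with a few lines of Python. This is a generic token bucket written for illustration, not API Gateway’s actual implementation; the `TokenBucket` class and its injected clock are assumptions that keep the simulation deterministic.

```python
class TokenBucket:
    """Minimal token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then try to spend one token."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=100, capacity=100)

# 150 requests arrive at the same instant (t = 0).
burst = [bucket.allow(now=0.0) for _ in range(150)]
print(f"accepted: {sum(burst)}, throttled: {burst.count(False)}")

# Half a second later the bucket has refilled by rate * 0.5 = 50 tokens.
later = [bucket.allow(now=0.5) for _ in range(60)]
print(f"accepted after 0.5 s: {sum(later)}")
```

The burst of 150 sees 100 requests pass and 50 throttled, and a follow-up wave half a second later only finds 50 tokens waiting. Raising `capacity` without touching `rate` is the code-level version of resizing the bucket.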

Does this haunt all throttling types?

Good news: This quirk primarily targets stage-level and account-level throttling. Usage Plans? They play by different rules. Their buckets refill steadily, making them more burst-friendly. But stage-level throttling? It’s the diva of the trio. Configure it carelessly, and it will sabotage your bursts like a jealous ex.

If you’ve layered all three throttling types (account, stage, usage plan), stage-level settings often dominate the drama. Check your stage settings first. Always.

Taming the beast, practical fixes that work

After several caffeine-fueled debugging sessions, I’ve learned a few tricks to keep buckets full and bursts happy. None requires sacrificing a rubber chicken to the cloud gods.

1. Resize your bucket
Stage-level throttling lets you set a burst limit alongside your rate limit. Double it. Triple it. Just stay under the account-level ceiling, which in most regions defaults to 10,000 requests per second with a burst of 5,000 and only goes higher if AWS grants a quota increase. Calculate your peak bursts (use CloudWatch metrics!), then set burst capacity 20% higher. Safety margins are boring until they save your launch day.

2. Queue the chaos
Offload bursts to SQS or Kinesis. Front your API with a lightweight service that accepts requests instantly, dumps them into a queue, and processes them at a civilized pace. Users get a “we got this” response. Your bucket stays calm. Everyone wins. Except the throttling gremlins.

3. Smarter clients are your friends
Teach client apps to retry intelligently. Exponential backoff with jitter isn’t just jargon, it’s the art of politely asking “Can I try again later?” instead of spamming “HELLO?!” every millisecond. AWS SDKs bake this in. Use it.

4. Distribute the pain
Got multiple stages or APIs? Spread bursts across them. A load balancer or Route 53 weighted routing can turn one screaming bucket into several murmuring ones. It’s like splitting a rowdy party into smaller rooms.

5. Monitor like a paranoid squirrel
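CloudWatch alarms for 429 Too Many Requests are non-negotiable. API Gateway’s standard metrics don’t include a dedicated throttle counter; 429s are folded into 4XXError, so track 4XXError and Count per stage and use access logs to separate 429s from other client errors. Set alerts at 70% of your burst limit. Because knowing your bucket is half-empty is far better than discovering it via customer complaints.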
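Back to fix 3 for a moment: if your client is not an AWS SDK, you have to bring your own retry logic. Below is a minimal sketch of exponential backoff with “full jitter.” The `Throttled` exception, the `flaky` request function, and the default delays are all hypothetical; wire in whatever your HTTP client raises on a 429.

```python
import random
import time


class Throttled(Exception):
    """Raised by the caller's request function on an HTTP 429 (hypothetical)."""


def backoff_delays(retries: int, base: float = 0.1, cap: float = 5.0,
                   rng=random.random):
    """Yield full-jitter delays: random in [0, min(cap, base * 2**attempt))."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(request, retries: int = 5, sleep=time.sleep,
                      rng=random.random):
    """Call `request()`; on Throttled, wait a jittered delay and try again."""
    for delay in backoff_delays(retries, rng=rng):
        try:
            return request()
        except Throttled:
            sleep(delay)
    return request()  # final attempt; Throttled propagates if it fails again


# Demo: a request that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise Throttled()
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, sleep=waits.append)
print(result, "after", len(waits), "retries")
```

The jitter is the important part. Without it, every throttled client retries on the same schedule and the burst simply reassembles itself a moment later.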

The quiet triumph of preparedness

Cloud architecture is less about avoiding fires and more about not using gasoline as hand sanitizer. Bursty traffic will happen. Token buckets will empty. But with thoughtful configuration, you can transform throttling from a silent assassin into a predictable gatekeeper.

AWS gives you the tools. It’s up to us to wield them without setting the data center curtains ablaze. Start small. Test bursts in staging. And maybe keep that emergency coffee stash stocked. Just in case.

Your APIs deserve grace under pressure. Now go forth and throttle wisely. Or better yet, throttle less.

November 4, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The slow unceremonious death of EC2 Autoscaling

Let’s pour one out for an old friend.

AWS recently announced a small, seemingly boring new feature for EC2 Auto Scaling: the ability to cancel a pending instance refresh. If you squinted, you might have missed it. It sounds like a minor quality-of-life update, something to make a sysadmin’s Tuesday slightly less terrible.

But this isn’t a feature. It’s a gold watch. It’s the pat on the back and the “thanks for your service” speech at the awkward retirement party.

The EC2 Auto Scaling Group (ASG), the bedrock of cloud elasticity, the one tool we all reflexively reached for, is being quietly put out to pasture.

No, AWS hasn’t officially killed it. You can still spin one up, just like you can still technically send a fax. AWS will happily support it. But its days as the default, go-to solution for modern workloads are decisively over. The battle for the future of scaling has ended, and the ASG wasn’t the winner. The new default is serverless containers, hyper-optimized Spot fleets, and platforms so abstract they’re practically invisible.

If you’re still building your infrastructure around the ASG, you’re building a brand-new house with plumbing from 1985. It’s time to talk about why our old friend is retiring and meet the eager new hires who are already measuring the drapes in its office.

So why is the ASG getting the boot?

We loved the ASG. It was a revolutionary idea. But like that one brilliant relative everyone dreads sitting next to at dinner, it was also exhausting. Its retirement was long overdue, and the reasons are the same frustrations we’ve all been quietly grumbling about into our coffee for years.

It promised automation but gave us chores

The ASG’s sales pitch was simple: “I’ll handle the scaling!” But that promise came with a three-page, fine-print addendum of chores.

It was the operational overhead that killed us. We were promised a self-driving car and ended up with a stick-shift that required constant, neurotic supervision. We became part-time Launch Template librarians, meticulously versioning every tiny change. We became health-check philosophers, endlessly debating the finer points of ELB vs. EC2 health checks.

And then… the Lifecycle Hooks.

A “Lifecycle Hook” is a polite, clinical term for a Rube Goldberg machine of desperation. It’s a panic button that triggers a Lambda, which calls a Systems Manager script, which sends a carrier pigeon to… maybe… drain a connection pool before the instance is ruthlessly terminated. Trying to debug one at 3 AM was a rite of passage, a surefire way to lose precious engineering time and a little bit of your soul.

It moves at a glacial pace

The second nail in the coffin was its speed. Or rather, the complete lack of it.

The ASG scales at the speed of a full VM boot. In our world of spiky, unpredictable traffic, that’s an eternity. It’s like pre-heating a giant, industrial pizza oven for 45 minutes just to toast a single slice of bread. By the time your new instance is booted, configured, service-discovered, and finally “InService,” the spike in traffic has already come and gone, leaving you with a bigger bill and a cohort of very annoyed users.

It’s an expensive insurance policy

The ASG model is fundamentally wasteful. You run a “warm” fleet, paying for idle capacity just in case you need it. It’s like paying rent on a 5-bedroom house for your family of three, just in case 30 cousins decide to visit unannounced.

This “scale-up” model was slow, and the “scale-down” was even worse, riddled with fears of terminating the wrong instance and triggering a cascading failure. We ended up over-provisioning to avoid the pain of scaling, which completely defeats the purpose of “auto-scaling.”

The eager interns taking over the desk

So, the ASG has cleared out its desk. Who’s moving in? It turns out there’s a whole line of replacements, each one leaner, faster, and blissfully unconcerned with managing a “fleet.”

1. The appliance: Fargate and Cloud Run

First up is the “serverless container”. This is the hyper-efficient new hire who just says, “Give me the Dockerfile. I’ll handle the rest.”

With AWS Fargate or Google’s Cloud Run, you don’t have a fleet. You don’t manage VMs. You don’t patch operating systems. You don’t even think about an instance. You just define a task, give it some CPU and memory, and tell it how many copies you want. Cloud Run can go from zero to serving in seconds; Fargate tasks usually take tens of seconds to start, which is still far quicker than booting and configuring a full VM fleet.

This is the appliance model. When you buy a toaster, you don’t worry about wiring the heating elements or managing its power supply. You just put in bread and get toast. Fargate is the toaster. The ASG was the “build-your-own-toaster” kit that came with a 200-page manual on electrical engineering.

Just look at the cognitive load. This is what it takes to get a basic ASG running via the CLI:

# The "Old Way": Just one of the many steps...
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name my-legacy-asg \
    --launch-template "LaunchTemplateName=my-launch-template,Version='1'" \
    --min-size 1 \
    --max-size 5 \
    --desired-capacity 2 \
    --vpc-zone-identifier "subnet-0571c54b67EXAMPLE,subnet-0c1f4e4776EXAMPLE" \
    --health-check-type ELB \
    --health-check-grace-period 300 \
    --tags "Key=Name,Value=My-ASG-Instance,PropagateAtLaunch=true"

You still need to define the launch template, the subnets, the load balancer, the health checks…

Now, here’s the core of a Fargate task definition. It’s just a simple JSON file:

// The "New Way": A snippet from a Fargate Task Definition
{
  "family": "my-modern-app",
  "containerDefinitions": [
    {
      "name": "my-container",
      "image": "nginx:latest",
      "cpu": 256,
      "memory": 512,
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ]
    }
  ],
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "256",
  "memory": "512"
}

You define what you need, and the platform handles everything else.

2. The extreme couponer: Spot fleets

For workloads that are less “instant spike” and more “giant batch job,” we have the “optimized fleet”. This is the high-stakes, high-reward world of Spot Instances.

Spot used to be terrifying. AWS could pull the plug with two minutes’ notice, and your entire workload would evaporate. But now, with Spot Fleets and diversification, it’s the smartest tool in the box. You can tell AWS, “I need 1,000 vCPUs, and I don’t care what instance types you give me, just find the cheapest ones.”

The platform then builds a diversified fleet for you across multiple instance types and Availability Zones, making it incredibly resilient to any single Spot pool termination. It’s perfect for data processing, CI/CD runners, and any batch job that can be interrupted and resumed. The ASG was always too rigid for this kind of dynamic, cost-driven scaling.

3. The paranoid security guard: MicroVMs

Then there’s the truly weird stuff: Firecracker. This is the technology that powers AWS Lambda and Fargate. It’s a “MicroVM” that gives you the iron-clad security isolation of a full virtual machine but with the lightning-fast startup speed of a container.

We’re talking boot times of as little as 125 milliseconds. This is for when you need to run thousands of tiny, separate, untrusted workloads simultaneously without them ever being able to see each other. It’s the ultimate “multi-tenant” dream, giving every user their own tiny, disposable, fire-walled VM in the blink of an eye.

4. The invisible platform: Edge runtimes

Finally, we have the platforms that are so abstract they’re “scaled to invisibility”. This is the world of Edge. Think Lambda@Edge or CloudFront Functions.

With these, you’re not even scaling in a region anymore. Your logic, your code, is automatically replicated and executed at hundreds of Points of Presence around the globe, as close to the end-user as possible. The entire concept of a “fleet” or “instance” just… disappears. The logic scales with the request.

Life after the funeral. How to adapt

Okay, the eulogy is over. The ASG is in its rocking chair on the porch. What does this mean for us, the builders? It’s time to sort through the old belongings and modernize the house.

Go full Marie Kondo on your architecture

First, you need to re-evaluate. Open up your AWS console and take a hard look at every single ASG you’re running. Be honest. Ask the tough questions:

Does this workload really need to be stateful?
Do I really need VM-level control, or am I just clinging to it for comfort?
Is this a stateless web app that I’ve just been too lazy to containerize?

If it doesn’t spark joy (or isn’t a snowflake legacy app that’s impossible to change), thank it for its service and plan its migration.

Stop shopping for engines, start shopping for cars

The most important shift is this: Pick the runtime, not the infrastructure.

For too long, our first question was, “What EC2 instance type do I need?” That’s the wrong question. That’s like trying to build a new car by starting at the hardware store to buy pistons.

The right question is, “What’s the best runtime for my workload?”

Is it a simple, event-driven piece of logic? That’s a Function (Lambda).
Is it a stateless web app in a container? That’s a Serverless Container (Fargate).
Is it a massive, interruptible batch job? That’s an Optimized Fleet (Spot).
Is it a cranky, stateful monolith that needs a pet VM? Only then do you fall back to an Instance (EC2, maybe even with an ASG).

Automate logic, not instance counts

Your job is no longer to be a VM mechanic. Your team’s skills need to shift. Stop manually tuning desired_capacity and start designing event-driven systems.

Focus on scaling logic, not servers. Your scaling trigger shouldn’t be “CPU is at 80%.” It should be “The SQS queue depth is greater than 100” or “API latency just breached 200ms”. Let the platform, be it Lambda, Fargate, or a KEDA-powered Kubernetes cluster, figure out how to add more processing power.
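As a rough illustration of scaling on a business signal instead of CPU, here is a sketch of the arithmetic a queue-depth trigger performs. The function name, the per-worker target, and the limits are assumptions for the example; on a real platform you would express the same rule in its own scaling configuration.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_workers: int = 0, max_workers: int = 50) -> int:
    """Size the pool so each worker has about `target_per_worker` messages."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_depth / target_per_worker)
    # Clamp to the configured floor and ceiling.
    return max(min_workers, min(max_workers, wanted))


for depth in (0, 80, 250, 10_000):
    print(depth, "->", desired_workers(depth, target_per_worker=100))
```

An empty queue maps to zero workers and a flood is capped at `max_workers`, so the signal drives capacity while the limits protect the bill.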

Was it really better in the old days?

Of course, this move to abstraction isn’t without trade-offs. We’re gaining a lot, but we’re also losing something.

The gain is obvious: We get our nights and weekends back. We get drastically reduced operational overhead, faster scaling, and for most stateless workloads, a much lower bill.

The loss is control. You can’t SSH into a Fargate container. You can’t run a custom kernel module on Lambda. For those few, truly special, high-customization legacy workloads, this is a dealbreaker. They will be the ASG’s loyal companions in the retirement home.

But for everything else? The ASG is a relic. It was a brilliant, necessary solution for the problems of 2010. But the problems of 2025 and beyond are different. The cloud has evolved to scale logic, functions, and containers, not just nodes.

The king isn’t just dead. The very concept of a throne has been replaced by a highly efficient, distributed, and slightly impersonal serverless committee. And frankly, it’s about time.

October 24, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The great AWS Tag standoff

You tried to launch an EC2 instance. Simple task. Routine, even.

Instead, AWS handed you an AccessDenied error like a parking ticket you didn’t know you’d earned.

Nobody touched the IAM policy. At least, not that you can prove.

Yet here you are, staring at a red banner while your coffee goes cold and your standup meeting starts without you.

Turns out, AWS doesn’t just care what you do; it cares what you call it.

Welcome to the quiet civil war between two IAM condition keys that look alike, sound alike, and yet refuse to share the same room: ResourceTag and RequestTag.

The day my EC2 instance got grounded

It happened on a Tuesday. Not because Tuesdays are cursed, but because Tuesdays are when everyone tries to get ahead before the week collapses into chaos.

A developer on your team ran `aws ec2 run-instances` with all the right parameters and a hopeful heart. The response? A polite but firm refusal.

The policy hadn’t changed. The role hadn’t changed. The only thing that had changed was the expectation that tagging was optional.

In AWS, tags aren’t just metadata. They’re gatekeepers. And if your request doesn’t speak their language, the door stays shut.

Meet the two Tag twins nobody told you about

Think of aws:ResourceTag as the librarian who won’t let you check out a book unless it’s already labeled “Fiction” in neat, archival ink. It evaluates tags on existing resources. You’re not creating anything, you’re interacting with something that’s already there. Want to stop an EC2 instance? Fine, but only if it carries the tag `Environment = Production`. No tag? No dice.

Now meet aws:RequestTag, the nightclub bouncer who won’t let you in unless you show up wearing a wristband that says “VIP,” and you brought the wristband yourself. This condition checks the tags you’re trying to apply when you create a new resource. It’s not about what exists. It’s about what you promise to bring into the world.

One looks backward. The other looks forward. Confuse them, and your policy becomes a riddle with no answer.

Why your policy is lying to you

Here’s the uncomfortable truth: not all AWS services play nice with these conditions.

Lambda? Mostly shrugs. S3? Cooperates, but only if you ask nicely (and include `s3:PutBucketTagging`). EC2? Oh, EC2 loves a good trap.

When you call `ec2:RunInstances`, you’re not just creating an instance. You’re also (silently) creating volumes, network interfaces, and possibly a public IP. If the request tags any of them at launch, the caller also needs `ec2:CreateTags`. And if your policy allows `ec2:RunInstances` but forgets `ec2:CreateTags`? Denied. Again. (EC2 phrases it as UnauthorizedOperation rather than AccessDenied, but the feeling is the same.)

And don’t assume the AWS Console saves you. Clicking “Add tags” in the UI doesn’t magically bypass IAM. If your role lacks the right conditions, those tags vanish into the void before the resource is born.

CloudTrail won’t judge you, but it will show you exactly which tags your request claimed to send. Sometimes, the truth hurts less than the guesswork.

Building a Tag policy that doesn’t backfire

Let’s build something that works in 2025, not 2018.
Start with a simple rule: all new S3 buckets must carry `CostCenter` and `Owner`. Your policy might look like this:

{
  "Effect": "Allow",
  "Action": ["s3:CreateBucket", "s3:PutBucketTagging"],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:RequestTag/CostCenter": ["Marketing", "Engineering", "Finance"]
    },
    "Null": {
      "aws:RequestTag/CostCenter": "false",
      "aws:RequestTag/Owner": "false"
    }
  }
}

Notice the `Null` condition. It’s the unsung hero that blocks requests missing the required tags entirely. It is also the only check `Owner` needs: any value is fine as long as the tag is present. (Listing `"*"` under `StringEquals` would match only a literal asterisk, because `StringEquals` does not expand wildcards.)
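If it helps to reason about the rule before touching IAM, the intent of the policy above (CostCenter must be one of three values, Owner must simply be present) can be modeled in a few lines of Python. This is a toy model and not IAM’s evaluation engine; the function name and sample tags are invented for illustration.

```python
ALLOWED_COST_CENTERS = {"Marketing", "Engineering", "Finance"}
REQUIRED_TAGS = ("CostCenter", "Owner")


def request_tags_allowed(request_tags: dict) -> bool:
    """Toy check mirroring the policy's intent (not IAM's real engine).

    - every required tag key must be present (the Null check)
    - CostCenter must be one of the allowed values (the StringEquals check)
    """
    if any(key not in request_tags for key in REQUIRED_TAGS):
        return False
    return request_tags["CostCenter"] in ALLOWED_COST_CENTERS


cases = {
    "both tags, valid": {"CostCenter": "Finance", "Owner": "alice"},
    "missing Owner": {"CostCenter": "Finance"},
    "unknown cost center": {"CostCenter": "Sales", "Owner": "alice"},
    "no tags at all": {},
}
for name, tags in cases.items():
    verdict = "allow" if request_tags_allowed(tags) else "deny"
    print(f"{name:20} -> {verdict}")
```

Only the first case is allowed. The same table of cases makes a decent starting checklist for testing the real policy in a sandbox account.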

For extra credit, layer this with AWS Organizations Service Control Policies (SCPs) to enforce tagging at the account level, and pair it with Tag Policies (also an AWS Organizations feature) to standardize tag keys and values across your estate. Defense in depth isn’t paranoia, it’s peace of mind.

Testing your policy without breaking production

The IAM Policy Simulator is helpful, sure. But it won’t catch the subtle dance between `RunInstances` and `CreateTags`.

Better approach: spin up a sandbox account. Write a Terraform module or a Python script that tries to create resources with and without tags. Watch what succeeds, what fails, and, most importantly, why.

Automate these tests. Run them in CI. Treat IAM policies like code, because they are.

Remember: in IAM, hope is not a strategy, but a good test plan is.

The human side of tagging

Tags aren’t for machines. Machines don’t care.

Tags are for the human who inherits your account at 2 a.m. during an outage. For the finance team trying to allocate cloud spend. For the auditor who needs to prove compliance without summoning a séance.

A well-designed tagging policy isn’t about control. It’s about kindness, to your future self, your teammates, and the poor soul who has to clean up after you.

So next time you write a condition with `ResourceTag` or `RequestTag`, ask yourself: am I building a fence or a welcome mat?

Because in the cloud, even silence speaks, if you’re listening to the tags.

October 22, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Why your AWS bill secretly hates Graviton

The party always ends when the bill arrives.

Your team ships a brilliant release. The dashboards glow a satisfying, healthy green. The celebratory GIFs echo through the Slack channels. For a few glorious days, you are a master of the universe, a conductor of digital symphonies.

And then it shows up. The AWS invoice doesn’t knock. It just appears in your inbox with the silent, judgmental stare of a Victorian governess who caught you eating dessert before dinner. You shipped performance, yes. You also shipped a small fleet of x86 instances that are now burning actual, tangible money while you sleep.

Engineers live in a constant tug-of-war between making things faster and making them cheaper. We’re told the solution is another coupon code or just turning off a few replicas over the weekend. But real, lasting savings don’t come from tinkering at the edges. They show up when you change the underlying math. In the world of AWS, that often means changing the very silicon running the show.

Enter a family of servers that look unassuming on the console but quietly punch far above their weight. Migrate the right workloads, and they do the same work for less money. Welcome to AWS Graviton.

What is this Graviton thing anyway?

Let’s be honest. The first time someone says “ARM-based processor,” your brain conjures images of your phone, or maybe a high-end Raspberry Pi. The immediate, skeptical thought is, “Are we really going to run our production fleet on that?”

Well, yes. And it turns out that when you own the entire datacenter, you can design a chip that’s ridiculously good at cloud workloads, without the decades of baggage x86 has been carrying around. Switching to Graviton is like swapping that gas-guzzling ’70s muscle car for a sleek, silent electric skateboard that somehow still manages to tow your boat. It feels wrong… until you see your fuel bill. You’re swapping raw, hot, expensive grunt for cool, cheap efficiency.

Amazon designed these chips to optimize the whole stack, from the physical hardware to the hypervisor to the services you click on. This control means better performance-per-watt and, more importantly, a better price for every bit of work you do.

The lineup is simple:

Graviton2: The reliable workhorse. Great for general-purpose and memory-hungry tasks.
Graviton3: The souped-up model. Faster cores, better at cryptography, and sips memory bandwidth through a wider straw.
Graviton3E: The specialist. Tuned for high-performance computing (HPC) and anything that loves vector math.

This isn’t some lab experiment. Graviton is already powering massive production fleets. If your stack includes common tools like NGINX, Redis, Java, Go, Node.js, Python, or containers on ECS or EKS, you’re already walking on paved roads.

The real numbers behind the hype

The headline from AWS is tantalizing. “Up to 40 percent better price-performance.” “Up to,” of course, are marketing’s two favorite words. It’s the engineering equivalent of a dating profile saying they enjoy “adventures.” It could mean anything.

But even with a healthy dose of cynicism, the trend is hard to ignore. Your mileage will vary depending on your code and where your bottlenecks are, but the gains are real.

Here’s where teams often find the gold:

Web and API services: Handling the same requests per second at a lower instance cost.
CI/CD Pipelines: Faster compile times for languages like Go and Rust on cheaper build runners.
Data and Streaming: Popular engines like NGINX, Envoy, Redis, Memcached, and Kafka clients run beautifully on ARM.
Batch and HPC: Heavy computational jobs get a serious boost from the Graviton3E chips.

There’s also a footprint bonus. Better performance-per-watt means you can hit your ESG (Environmental, Social, and Governance) goals without ever having to create a single sustainability slide deck. A win for engineering, a win for the planet, and a win for dodging boring meetings.

But will my stuff actually run on it?

This is the moment every engineer flinches. The suggestion of “recompiling for ARM” triggers flashbacks to obscure linker errors and a trip down dependency hell.

The good news? The water’s fine. For most modern workloads, the transition is surprisingly anticlimactic. Here’s a quick compatibility scan:

You compile from source or use open-source software? Very likely portable.
Using closed-source agents or vendor libraries? Time to do some testing and maybe send a polite-but-firm support ticket.
Running containers? Fantastic. Multi-architecture images are your new best friend.
What about languages? Java, Go, Node.js, .NET 6+, Python, Ruby, and PHP are all happy on ARM on Linux.
C and C++? Just recompile and link against ARM64 libraries.

The easiest first wins are usually stateless services sitting behind a load balancer, sidecars like log forwarders, or any kind of queue worker where raw throughput is king.

A calm path to migration

Heroic, caffeine-fueled weekend migrations are for rookies. A calm, boring checklist is how professionals do it.

Phase 1: Test in a safe place

Launch a Graviton sibling of your current instance family (e.g., a c7g.large instead of a c6i.large). Replay production traffic to it or run your standard benchmarks. Compare CPU utilization, latency, and error rates. No surprises allowed.

Phase 2: Build for both worlds

It’s time to create multi-arch container images. docker buildx is the tool for the job. This command builds an image for both chip architectures and pushes them to your registry under a single tag.

# Build and push an image for both amd64 and arm64 from one command
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag "${YOUR_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/my-web-app:v1.2.3" \
  --push .

Phase 3: Canary and verify

Slowly introduce the new instances. Route just 5% of traffic to the Graviton pool using weighted target groups. Stare intently at your dashboards. Your “golden signals”, latency, traffic, errors, and saturation, should look identical across both pools.

Here’s a conceptual Terraform snippet of what that weighting looks like:

resource "aws_lb_target_group" "x86_pool" {
  name     = "my-app-x86-pool"
  # ... other config
}

resource "aws_lb_target_group" "arm_pool" {
  name     = "my-app-arm-pool"
  # ... other config
}

resource "aws_lb_listener_rule" "weighted_routing" {
  listener_arn = aws_lb_listener.frontend.arn
  priority     = 100

  action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.x86_pool.arn
        weight = 95
      }
      target_group {
        arn    = aws_lb_target_group.arm_pool.arn
        weight = 5
      }
    }
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}

Phase 4: Full rollout with a parachute

If the canary looks healthy, gradually increase traffic: 25%, 50%, then 100%. Keep the old x86 pool warm for a day or two, just in case. It’s your escape hatch. Once it’s done, go show the finance team the new, smaller bill. They love that.
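
The "advance only if the canary looks healthy" rule from Phases 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not a deployment tool: the `PoolStats` fields, the tolerance values, and the 85% saturation ceiling are all assumptions made up for the example, and you would feed it numbers from whatever monitoring you already trust.

```python
from dataclasses import dataclass

# Traffic percentages for the Graviton pool, taken from the rollout steps above.
ROLLOUT_STEPS = [5, 25, 50, 100]


@dataclass
class PoolStats:
    """Golden-signal snapshot for one target group (hypothetical shape)."""
    p99_latency_ms: float
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    cpu_utilization: float  # saturation proxy, 0.0 to 1.0


def canary_is_healthy(baseline, canary, latency_slack=0.10, error_slack=0.001):
    """Compare the ARM pool against the x86 pool with some tolerance.

    The slack values and the 0.85 ceiling are illustrative assumptions.
    """
    latency_ok = canary.p99_latency_ms <= baseline.p99_latency_ms * (1 + latency_slack)
    errors_ok = canary.error_rate <= baseline.error_rate + error_slack
    saturation_ok = canary.cpu_utilization < 0.85
    return latency_ok and errors_ok and saturation_ok


def next_arm_weight(current, baseline, canary):
    """Advance one rollout step if healthy, otherwise fall back to 0 (the parachute)."""
    if not canary_is_healthy(baseline, canary):
        return 0
    later_steps = [step for step in ROLLOUT_STEPS if step > current]
    return later_steps[0] if later_steps else current
```

The returned number is the weight you would put on the ARM target group; the x86 pool gets the remainder.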

Common gotchas and easy fixes

Here are a few fun ways to ruin your Friday afternoon, and how to avoid them.

The sneaky base image: You built your beautiful ARM application… on an x86 foundation. Your FROM amazonlinux:2023 defaulted to the amd64 architecture. Your container dies instantly, usually with an “exec format error.” The fix: Explicitly pin your base images to an ARM64 version, like FROM --platform=linux/arm64 public.ecr.aws/amazonlinux/amazonlinux:2023.
The native extension puzzle: Your Python, Ruby, or Node.js app fails because a native dependency couldn’t be built. The fix: Ensure you’re building on an ARM machine or using pre-compiled manylinux wheels that support aarch64.
The lagging agent: Your favorite observability tool’s agent doesn’t have an official ARM64 build yet. The fix: Check if they have a containerized version or gently nudge their support team. Most major vendors are on board now.
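
One cheap way to catch the first two gotchas early is to have the service check its own architecture at startup. The sketch below uses only Python's standard library; the alias table and the function names are my own, not part of any AWS or Docker tooling.

```python
import platform
from typing import Optional

# Different tools name the same architecture differently:
# the kernel says aarch64/x86_64, Docker says arm64/amd64.
_ALIASES = {
    "aarch64": "arm64",
    "arm64": "arm64",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def normalize_arch(machine: str) -> str:
    """Map a machine string to Docker-style naming; unknown values pass through."""
    key = machine.strip().lower()
    return _ALIASES.get(key, key)


def running_on(expected: str, machine: Optional[str] = None) -> bool:
    """True if this process (or the given machine string) matches the expected arch."""
    actual = machine if machine is not None else platform.machine()
    return normalize_arch(actual) == normalize_arch(expected)
```

Logging `normalize_arch(platform.machine())` once at boot makes it obvious in your dashboards which pool a given container actually landed on.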

A shift in mindset

For decades, we’ve treated the processor as a given, an unchangeable law of physics in our digital world. The x86 architecture was simply the landscape on which we built everything. Graviton isn’t just a new hill on that landscape; it’s a sign the tectonic plates are shifting beneath our feet. This is more than a cost-saving trick; it’s an invitation to question the expensive assumptions we’ve been living with for years.

You don’t need a degree in electrical engineering to benefit from this, though it might help you win arguments on Hacker News. All you really need is a healthy dose of professional curiosity and a good benchmark script.

So here’s the experiment. Pick one of your workhorse stateless services, the ones that do the boring, repetitive work without complaining. The digital equivalent of a dishwasher. Build a multi-arch image for it. Cordon off a tiny, five-percent slice of your traffic and send it to a Graviton pool. Then, watch. Treat your service like a lab specimen. Don’t just glance at the CPU percentage; analyze the cost-per-million-requests. Scrutinize the p99 latency.
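
For the two numbers worth scrutinizing, here is a minimal sketch of the arithmetic. The prices and request rates you pass in are yours to supply (nothing here is a real AWS price), and the percentile uses the simple nearest-rank method, which is one of several reasonable definitions.

```python
def cost_per_million_requests(hourly_price_usd, instance_count, requests_per_second):
    """Fleet cost in USD for serving one million requests at a steady rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    requests_per_hour = requests_per_second * 3600
    fleet_cost_per_hour = hourly_price_usd * instance_count
    return fleet_cost_per_hour / requests_per_hour * 1_000_000


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latency samples."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.99 * n), avoiding floating-point surprises.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Run both functions once for the x86 pool and once for the Graviton pool over the same time window, and compare the pairs rather than any single figure.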

If the numbers tell a happy story, you haven’t just tweaked a deployment. You’ve fundamentally changed the economics of that service. You’ve found a powerful new lever to pull. If they don’t, you’ve lost a few hours and gained something more valuable: hard data. You’ve replaced a vague “what if” with a definitive “we tried that.”

Either way, you’ve sent a clear message to that smug monthly invoice. You’re paying attention. And you’re getting smarter. Doing the same work for less money isn’t a stunt. It’s just good engineering.

October 19, 2025 by Fernando SRE Cloud stuff Computer Science stuff DevOps stuff SRE stuff

Your Terraform S3 backend is confused not broken

You’ve done everything right. You wrote your Terraform config with the care of someone assembling IKEA furniture while mildly sleep-deprived. You double-checked your indentation (because yes, it matters). You even remembered to enable encryption, something your future self will thank you for while sipping margaritas on a beach far from production outages.

And then, just as you run terraform init, Terraform stares back at you like a cat that’s just been asked to fetch the newspaper.

Error: Failed to load state: NoSuchBucket: The specified bucket does not exist

But… you know the bucket exists. You saw it in the AWS console five minutes ago. You named it something sensible like company-terraform-states-prod. Or maybe you didn’t. Maybe you named it tf-bucket-please-dont-delete in a moment of vulnerability. Either way, it’s there.

So why is Terraform acting like you asked it to store your state in Narnia?

The truth is, Terraform’s S3 backend isn’t broken. It’s just spectacularly bad at telling you what’s wrong. It doesn’t throw tantrums, it just fails silently, or with error messages so vague they could double as fortune cookie advice.

Let’s decode its passive-aggressive signals together.

The backend block that pretends to listen

At the heart of remote state management lies the backend “s3” block. It looks innocent enough:

terraform {
  backend "s3" {
    bucket         = "my-team-terraform-state"
    key            = "networking/main.tfstate"
    region         = "us-west-2"
    dynamodb_table = "tf-lock-table"
    encrypt        = true
  }
}

Simple, right? But this block is like a toddler with a walkie-talkie: it only hears what it wants to hear. If one tiny detail is off (region, permissions, bucket name), it won’t say “Hey, your bucket is in Ohio but you told me it’s in Oregon.” It’ll just shrug and fail.

And because Terraform backends are loaded before variable interpolation, you can’t use variables inside this block. Yes, really. Inside the block, you’re stuck with literal strings. It’s like being forced to write your grocery list in permanent marker. The one escape hatch is partial configuration: leave a value out of the block and supply it at init time with terraform init -backend-config="bucket=my-team-terraform-state".

The four ways Terraform quietly sabotages you

Over the years, I’ve learned that S3 backend errors almost always fall into one of four buckets (pun very much intended).

1. The credentials that vanished into thin air

Terraform needs AWS credentials. Not “kind of.” Not “maybe.” It needs them like a coffee machine needs beans. But it won’t tell you they’re missing, it’ll just say the bucket doesn’t exist, even if you’re looking at it in the console.

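Why? Because the error you get depends on who S3 thinks is asking. With missing, expired, or wrong-account credentials, S3 typically answers 403 Forbidden (AccessDenied), and by the time that reaches your terminal it rarely says “your credentials are wrong.” Helpful for security. Infuriating for debugging.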

Fix it: Make sure your credentials are loaded via environment variables, AWS CLI profile, or IAM roles if you’re on an EC2 instance. And no, copying your colleague’s .aws/credentials file while they’re on vacation doesn’t count as “secure.”

2. The region that lied to everyone

You created your bucket in eu-central-1. Your backend says us-east-1. Terraform tries to talk to the bucket in Virginia. The bucket, being in Frankfurt, doesn’t answer.

Result? Another “bucket not found” error. Because of course.

S3 buckets are region-locked, but the error message won’t mention regions. It assumes you already know. (Spoiler: you don’t.)

Fix it: Run this to check your bucket’s real region:

aws s3api get-bucket-location --bucket my-team-terraform-state

Then update your backend block accordingly. (If the command returns a null LocationConstraint, the bucket is in us-east-1.) And maybe add a sticky note to your monitor: “Regions matter. Always.”
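
If you want to automate that comparison, the sketch below takes the parsed JSON from get-bucket-location and checks it against the region in your backend block. The function names are hypothetical; the two special cases (a null constraint for us-east-1 and the legacy value EU for eu-west-1) are quirks of the S3 API itself.

```python
def bucket_region(get_bucket_location_response):
    """Translate a GetBucketLocation response into a normal region name."""
    constraint = get_bucket_location_response.get("LocationConstraint")
    if constraint is None:
        return "us-east-1"   # S3 reports the original region as null
    if constraint == "EU":
        return "eu-west-1"   # legacy value still returned for some old buckets
    return constraint


def region_mismatch(backend_region, get_bucket_location_response):
    """Return a human-readable complaint, or None if the regions agree."""
    actual = bucket_region(get_bucket_location_response)
    if actual == backend_region:
        return None
    return f"backend says {backend_region}, bucket lives in {actual}"
```

Feed it the output of the CLI command above after parsing it with `json.loads`.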

3. The lock table that forgot to show up

State locking with DynamoDB is one of Terraform’s best features; it stops two engineers from simultaneously destroying the same VPC like overeager toddlers with a piñata.

But if you declare a dynamodb_table in your backend and that table doesn’t exist? Terraform won’t create it for you. It’ll just fail with a cryptic message about “unable to acquire state lock.”

Fix it: Create the table manually (or with separate Terraform code). It only needs one attribute: a partition key named LockID of type String. And make sure your IAM user or role has dynamodb:GetItem, dynamodb:PutItem, and dynamodb:DeleteItem permissions on it.

Think of DynamoDB as the bouncer at a club: if it’s not there, anyone can stumble in and start redecorating.
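
As a rough sketch, here is what that table and its permissions look like as plain data. The helper names are mine, and on-demand billing is an assumption chosen because lock traffic is tiny; the dictionary is shaped so it could be passed to boto3's create_table if you happen to use boto3 (a third-party library, not required by anything above).

```python
# IAM actions the lock table needs, as listed in the text.
LOCK_ACTIONS = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]


def lock_table_spec(table_name):
    """Arguments for creating a lock table with a single string key, LockID.

    Assumed usage: boto3.client("dynamodb").create_table(**lock_table_spec(name))
    """
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": "LockID", "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": "LockID", "KeyType": "HASH"}],
        "BillingMode": "PAY_PER_REQUEST",  # assumption: on-demand capacity
    }


def lock_policy_statement(table_arn):
    """One IAM policy statement granting the lock permissions on one table."""
    return {"Effect": "Allow", "Action": list(LOCK_ACTIONS), "Resource": table_arn}
```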

4. The missing safety nets

Versioning and encryption aren’t strictly required, but skipping them is like driving without seatbelts because “nothing bad has happened yet.”

Without versioning, a bad terraform apply can overwrite your state forever. No undo. No recovery. Just you, your terminal, and the slow realization that you’ve deleted production.

Enable versioning:

aws s3api put-bucket-versioning \
  --bucket my-team-terraform-state \
  --versioning-configuration Status=Enabled

And always set encrypt = true. Your state file contains secrets, IDs, and the blueprint of your infrastructure. Treat it like your diary, not your shopping list.

Debugging without losing your mind

When things go sideways, don’t guess. Ask Terraform nicely for more details:

TF_LOG=DEBUG terraform init

Yes, it spits out a firehose of logs. But buried in there is the actual AWS API call, and the real error code. Look for lines containing AWS request or ErrorResponse. That’s where the truth hides.
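
Scrolling a firehose by eye gets old quickly, so here is a small filter you could run over a saved log (setting TF_LOG_PATH alongside TF_LOG writes the output to a file). The marker strings come from the paragraph above plus a few well-known S3 error codes; the exact wording of debug lines varies between Terraform versions, so treat the list as a starting point rather than a contract.

```python
# Substrings worth stopping for. The first two come from the text above;
# the rest are standard S3 error codes.
MARKERS = (
    "AWS request",
    "ErrorResponse",
    "NoSuchBucket",
    "AccessDenied",
    "PermanentRedirect",
)


def interesting_lines(log_text, markers=MARKERS):
    """Return only the log lines that mention one of the markers."""
    hits = []
    for line in log_text.splitlines():
        if any(marker in line for marker in markers):
            hits.append(line.strip())
    return hits
```

Something like `interesting_lines(open("init.log").read())` usually shrinks thousands of lines down to a handful (the file name is just an example).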

Also, never run terraform init once and assume it’s locked in. If you change your backend config, you must run:

terraform init -reconfigure

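Otherwise, Terraform notices that the settings no longer match what it cached in .terraform/ and refuses to continue until you choose. Use -reconfigure to start fresh with the new backend, or terraform init -migrate-state if you want the existing state copied across. It’s stubborn like that.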

A few quiet rules for peaceful coexistence

After enough late-night debugging sessions, I’ve adopted a few personal commandments:

One project, one bucket. Don’t mix dev and prod states in the same bucket. It’s like keeping your tax documents and grocery receipts in the same shoebox, technically possible, spiritually exhausting.
Name your state files clearly. Use paths like prod/web.tfstate instead of final-final-v3.tfstate.
Never commit backend configs with real bucket names to public repos. (Yes, people still do this. No, it’s not cute.)
Test your backend setup in a sandbox first. A $0.02 bucket and a tiny DynamoDB table can save you a $10,000 mistake.

It’s not you, it’s the docs

Terraform’s S3 backend works beautifully, once everything aligns. The problem isn’t the tool. It’s that the error messages assume you’re psychic, and the documentation reads like it was written by someone who’s never made a mistake in their life.

But now you know its tells. The fake “bucket not found.” The silent region betrayal. The locking table that ghosts you.

Next time it acts up, don’t panic. Pour a coffee, check your region, verify your credentials, and whisper gently: “I know you’re trying your best.”

Because honestly? It is.

October 13, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Playing detective with dead Kubernetes nodes

It arrives without warning, a digital tap on the shoulder that quickly turns into a full-blown alarm. Maybe you’re mid-sentence in a meeting, or maybe you’re just enjoying a rare moment of quiet. Suddenly, a shriek from your phone cuts through everything. It’s the on-call alert, flashing a single, dreaded message: NodeNotReady.

Your beautifully orchestrated city of containers, a masterpiece of modern engineering, now has a major power outage in one of its districts. One of your worker nodes, a once-diligent and productive member of the cluster, has gone completely silent. It’s not responding to calls, it’s not picking up new work, and its existing jobs are in limbo. In the world of Kubernetes, this isn’t just a technical issue; it’s a ghosting of the highest order.

Before you start questioning your life choices or sacrificing a rubber chicken to the networking gods, take a deep breath. Put on your detective’s trench coat. We have a case to solve.

First on the scene, the initial triage

Every good investigation starts by surveying the crime scene and asking the most basic question: What the heck happened here? In our world, this means a quick and clean interrogation of the Kubernetes API server. It’s time for a roll call.

kubectl get nodes -o wide

This little command is your first clue. It lines up all your nodes and points a big, accusatory finger at the one in the NotReady state.

NAME                    STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-master-1            Ready      master   90d   v1.28.2   10.128.0.2       34.67.123.1     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-7b5d    NotReady   <none>   45d   v1.28.2   10.128.0.5       35.190.45.6     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-fg9h    Ready      <none>   45d   v1.28.2   10.128.0.4       35.190.78.9     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9

There’s our problem child: k8s-worker-node-7b5d. Now that we’ve identified our silent suspect, it’s time to pull it into the interrogation room for a more personal chat.

kubectl describe node k8s-worker-node-7b5d

The output of describe is where the juicy gossip lives. You’re not just looking at specs; you’re looking for a story. Scroll down to the Conditions and, most importantly, the Events section at the bottom. This is where the node often leaves a trail of breadcrumbs explaining exactly why it decided to take an unscheduled vacation.

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:50:05 +0200   KubeletNotReady              container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Events:
  Type     Reason                   Age                  From                       Message
  ----     ------                   ----                 ----                       -------
  Normal   Starting                 25m                  kubelet                    Starting kubelet.
  Warning  ContainerRuntimeNotReady 5m12s (x120 over 25m) kubelet                    container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Aha! Look at that. The Events log is screaming for help. A repeating warning, ContainerRuntimeNotReady, points to a CNI (Container Network Interface) plugin having a full-blown tantrum. We’ve moved from a mystery to a specific lead.
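
When the cluster has more than a handful of nodes, the same triage can be scripted. The sketch below works on the parsed output of kubectl get nodes -o json and pulls out the Ready condition for every node that is not reporting True. The function name is hypothetical; the field names are the ones the Kubernetes API uses for node objects.

```python
def not_ready_nodes(node_list):
    """Return (name, reason, message) for every node whose Ready condition is not True.

    `node_list` is the dict you get from json.load() on
    `kubectl get nodes -o json`.
    """
    findings = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        for condition in node.get("status", {}).get("conditions", []):
            if condition.get("type") == "Ready" and condition.get("status") != "True":
                findings.append(
                    (name, condition.get("reason", ""), condition.get("message", ""))
                )
    return findings
```

Note that a Ready status of Unknown (the control plane has stopped hearing from the kubelet) is reported here too, since it is just as much a suspect as False.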

The usual suspects, a rogues’ gallery

When a node goes quiet, the culprit is usually one of a few repeat offenders. Let’s line them up.

1. The silent saboteur, network issues

This is the most common villain. Your node might be perfectly healthy, but if it can’t talk to the control plane, it might as well be on a deserted island. Think of the control plane as the central office trying to call its remote employee (the node). If the phone line is cut, the office assumes the employee is gone. This can be caused by firewall rules blocking ports, misconfigured VPC routes, or a DNS server that’s decided to take the day off.

2. The overworked informant, the kubelet

The kubelet is the control plane’s informant on every node. It’s a tireless little agent that reports on the node’s health and carries out orders. But sometimes, this agent gets sick. It might have crashed or stalled, or it might be struggling with misconfigured credentials (like expired TLS certificates) and unable to authenticate with the mothership. If the informant goes silent, the node is immediately marked as a person of interest.

You can check on its health directly on the node:

# SSH into the problematic node
ssh user@<node-ip>

# Check the kubelet's vital signs
systemctl status kubelet

A healthy output should say active (running). Anything else, and you’ve found a key piece of evidence.

3. The glutton, resource exhaustion

Your node has a finite amount of CPU, memory, and disk space. If a greedy application (or a swarm of them) consumes everything, the node itself can become starved. The kubelet and other critical system daemons need resources to breathe. Without them, they suffocate and stop reporting in. It’s like one person eating the entire buffet, leaving nothing for the hosts of the party.

A quick way to check for gluttons is with:

kubectl top node <your-problem-child-node-name>

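If you see CPU or memory usage kissing 100%, you’ve likely found your culprit. (This command relies on metrics-server being installed, and a node that has stopped reporting may show <unknown> instead of numbers.)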
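
To scan a whole cluster for gluttons, you can parse the table that kubectl top node prints. This sketch assumes the usual five-column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); the 90% threshold and the function name are arbitrary choices for the example.

```python
def hot_nodes(top_output, threshold=90):
    """Return (node_name, [resources over threshold]) from `kubectl top node` text."""
    hot = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name, cpu_pct, mem_pct = fields[0], fields[2], fields[4]
        over = []
        for label, value in (("cpu", cpu_pct), ("memory", mem_pct)):
            # Values look like "97%"; anything else (e.g. "<unknown>") is skipped.
            if value.endswith("%") and value[:-1].isdigit():
                if int(value[:-1]) >= threshold:
                    over.append(label)
        if over:
            hot.append((name, over))
    return hot
```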

The forensic toolkit: digging deeper

If the initial triage and lineup didn’t reveal the killer, it’s time to break out the forensic tools and get our hands dirty.

Sifting through the diary with journalctl

The journalctl command is your window into the kubelet’s soul (or, more accurately, its log files). This is where it writes down its every thought, fear, and error.

# On the node, tail the kubelet's logs for clues
journalctl -u kubelet -f --since "10 minutes ago"

Look for recurring error messages, failed connection attempts, or anything that looks suspiciously out of place.
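
Spotting "recurring" by eye is hard when the same line repeats hundreds of times, so a small counter helps. The sketch below assumes the kubelet writes klog-style lines (a severity letter, date, time, thread ID, then file:line]) and groups warnings and errors by the source location that emitted them. The regular expression is my own approximation of that format, not an official parser.

```python
import re
from collections import Counter

# Matches e.g. "E1013 09:50:05.123456    1234 kubelet.go:2855] ..."
# Only warnings (W), errors (E) and fatals (F) are counted.
KLOG_RE = re.compile(r"\b([EWF])\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ (\S+:\d+)\]")


def recurring_errors(log_text, top=5):
    """Return the most frequent (severity, source_location) pairs with their counts."""
    counts = Counter()
    for line in log_text.splitlines():
        match = KLOG_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts.most_common(top)
```

Save the journalctl output to a file (without -f, so the command exits) and pass its contents in; the source location at the top of the list is usually the thread worth pulling.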

Quarantining the patient with drain

Before you start performing open-heart surgery on the node, it’s wise to evacuate the civilians. The kubectl drain command gracefully evicts all the pods from the node, allowing them to be rescheduled elsewhere.

kubectl drain k8s-worker-node-7b5d --ignore-daemonsets --delete-emptydir-data

This isolates the patient, letting you work without causing a city-wide service outage.

Confirming the phone lines with curl

Don’t just trust the error messages. Verify them. From the problematic node, try to contact the API server directly. This tells you if the fundamental network path is even open.

# From the problem node, try to reach the API server endpoint
curl -k https://<api-server-ip>:<port>/healthz

If you get ok, the basic connection is fine. If it times out or gets rejected, you’ve confirmed a networking black hole.

Crime prevention: keeping your nodes out of trouble

Solving the case is satisfying, but a true detective also works to prevent future crimes.

Set up a neighborhood watch: Implement robust monitoring with tools like Prometheus and Grafana. Set up alerts for high resource usage, disk pressure, and node status changes. It’s better to spot a prowler before they break in.
Install self-healing robots: Most managed Kubernetes services (GKE, EKS, AKS) offer node auto-repair features. If a node fails its health checks, the platform will automatically attempt to repair it or replace it. Turn this on. It’s your 24/7 robotic police force.
Enforce city zoning laws: Use resource requests and limits on your deployments. This prevents any single application from building a resource-hogging skyscraper that blocks the sun for everyone else.
Schedule regular health checkups: Keep your cluster components, operating systems, and container runtimes updated. Many NotReady mysteries are caused by long-solved bugs that you could have avoided with a simple patch.

The case is closed for now

So there you have it. The rogue node is back in line, the pods are humming along, and the city of containers is once again at peace. You can hang up your trench coat, put your feet up, and enjoy that lukewarm coffee you made three hours ago. The mystery is solved.

But let’s be honest. Debugging a NotReady node is less like a thrilling Sherlock Holmes novel and more like trying to figure out why your toaster only toasts one side of the bread. It’s a methodical, often maddening, process of elimination. You start with grand theories of network conspiracies and end up discovering the culprit was a single, misplaced comma in a YAML file, the digital equivalent of the butler tripping over the rug.

So the next time an alert yanks you from your peaceful existence, don’t panic. Remember that you are a digital detective, a whisperer of broken machines. Your job is to patiently ask the right questions until the silent, uncooperative suspect finally confesses. After all, in the world of Kubernetes, a node is never truly dead. It’s just being dramatic and waiting for a good detective to find the clues, and maybe, just maybe, restart its kubelet. The city is safe… until the next time. And there is always a next time.

October 13, 2025 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff