AWSMap for smarter AWS migrations

Most AWS migrations begin with a noble ambition and a faintly ridiculous problem.

The ambition is to modernise an estate, reduce risk, tidy the architecture, and perhaps, if fortune smiles, stop paying for three things nobody remembers creating.

The ridiculous problem is that before you can migrate anything, you must first work out what is actually there.

That sounds straightforward until you inherit an AWS account with the accumulated habits of several teams, three naming conventions, resources scattered across regions, and the sort of IAM sprawl that suggests people were granting permissions with the calm restraint of a man feeding pigeons. At that point, architecture gives way to archaeology.

I do not work for AWS, and this is not a sponsored love letter to a shiny console feature. I am an AWS and GCP architect working in the industry, and I have used AWSMap when assessing environments ahead of migration work. The reason I am writing about it is simple enough. It is one of those practical tools that solves a very real problem, and somehow remains less widely known than it deserves.

AWSMap is a third-party command-line utility that inventories AWS resources across regions and services, then lets you explore the result through HTML reports, SQL queries, and plain-English questions. In other words, it turns the early phase of a migration from endless clicking into something closer to a repeatable assessment process.

That does not make it perfect, and it certainly does not replace native AWS services. But in the awkward first hours of understanding an inherited environment, it can be remarkably useful.

The migration problem before the migration

A cloud migration plan usually looks sensible on paper. There will be discovery, analysis, target architecture, dependency mapping, sequencing, testing, cutover, and the usual brave optimism seen in project plans everywhere.

In reality, the first task is often much humbler. You are trying to answer questions that should be easy and rarely are.

What is running in this account?

Which regions are actually in use?

Are there old snapshots, orphaned EIPs, forgotten load balancers, or buckets with names that sound important enough to frighten everyone into leaving them alone?

Which workloads are genuinely active, and which are just historical luggage with a monthly invoice attached?

You can answer those questions from the AWS Management Console, of course. Given enough tabs, enough patience, and a willingness to spend part of your afternoon wandering through services you had not planned to visit, you will eventually get there. But that is not a particularly elegant way to begin a migration.

This is where AWSMap becomes handy. Instead of treating discovery as a long guided tour of the console, it treats it as a data collection exercise.

What AWSMap does well

At its core, AWSMap scans an AWS environment and produces an inventory of resources. The current public package description on PyPI claims coverage of more than 150 AWS services, while the v1.5.0 release notes mention 140-plus, which is a good reminder that the coverage evolves. The important point is not the exact number on a given Tuesday morning, but that it covers a broad enough slice of the estate to be genuinely useful in early assessments.

What makes the tool more interesting is what it does after the scan.

It can generate a standalone HTML report, store results locally in SQLite, let you query the inventory with SQL, run named audit queries, and translate plain-English prompts into database queries without sending your infrastructure metadata off to an LLM service. The release notes for v1.5.0 describe local SQLite storage, raw SQL querying, named queries, typo-tolerant natural language questions, tag filtering, scoped account views, and browsable examples.

That combination matters because migrations are rarely single, clean events. They are usually a series of discoveries, corrections, and mildly awkward conversations. Having the inventory preserved locally means the account does not need to be rediscovered from scratch every time someone asks a new question two days later.

The report you can actually hand to people

One of the surprisingly practical parts of AWSMap is the report output.

The tool can generate a self-contained HTML report that opens locally in a browser. That sounds almost suspiciously modest, but it is useful precisely because it is modest. You can attach it to a ticket, share it with a teammate, or open it during a workshop without building a whole reporting pipeline first. The v1.5.0 release notes describe the report as a single, standalone HTML file with filtering, search, charts, and export options.

That makes it suitable for the sort of migration meeting where someone says, “Can we quickly check whether eu-west-1 is really the only active region?” and you would rather not spend the next ten minutes performing a slow ritual through five console pages.

A simple scan might look like this:

awsmap -p client-prod

If you want to narrow the blast radius a bit and focus on a few services that often matter early in migration discovery, you could do this:

awsmap -p client-prod -s ec2,rds,elb,lambda,iam

And if the account is a thicket of shared infrastructure, tags can help reduce the noise:

awsmap -p client-prod -t Environment=Production -t Owner=platform-team

That kind of filtering is helpful when the account contains equal parts business workload and historical clutter, which is to say, most real accounts.

Why SQLite is more important than it sounds

The feature I like most is not the report. It is the local SQLite database.

Every scan can be stored locally, so the inventory becomes queryable over time instead of vanishing the moment the terminal output scrolls away. The default local database path is ~/.awsmap/inventory.db, and scan results from different runs can accumulate there for later analysis.

This changes the character of the tool quite a bit. It stops being a disposable scanner and becomes something closer to a field notebook.

Suppose you scan a client account today, then return to the same work three days later, after someone mentions an old DR region nobody had documented. Without persistence, you start from scratch. With persistence, you ask the database.

That is a much more civilised way to work.
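Because the inventory is ordinary SQLite, you can also open it with any SQLite client or a few lines of Python. Here is a minimal sketch of the kind of analysis the local database enables, using an in-memory stand-in for the real file; the `resources` table and its columns are my assumptions, modelled on the queries awsmap accepts, not the tool's actual schema.

```python
import sqlite3

# Stand-in for ~/.awsmap/inventory.db; the schema below is an assumption
# modelled on the queries awsmap accepts, not the tool's actual DDL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources (account TEXT, region TEXT, service TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO resources VALUES (?, ?, ?, ?)",
    [
        ("client-prod", "eu-west-1", "ec2", "web-1"),
        ("client-prod", "eu-west-1", "ec2", "web-2"),
        ("client-prod", "us-east-1", "rds", "legacy-db"),
    ],
)

# The workshop question: which regions are actually in use?
rows = conn.execute(
    "SELECT region, COUNT(*) AS total FROM resources "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(rows)  # [('eu-west-1', 2), ('us-east-1', 1)]
```

The point is not the toy data; it is that three days later, the same file still answers questions the terminal scrollback forgot.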

A query for the busiest services in the collected inventory might look like this:

awsmap query "SELECT service, COUNT(*) AS total
FROM resources
GROUP BY service
ORDER BY total DESC
LIMIT 12"

And a more migration-focused query might be something like:

awsmap query "SELECT account, region, service, name
FROM resources
WHERE service IN ('ec2', 'rds', 'elb', 'lambda')
ORDER BY account, region, service, name"

Neither query is glamorous, but migrations are not built on glamour. They are built on being able to answer dull, important questions reliably.

Security and hygiene checks without the acrobatics

AWSMap also includes named queries for common audit scenarios, which is useful for two reasons.

First, most people do not wake up eager to write SQL joins against IAM relationships. Second, migration assessments almost always drift into security checks sooner or later.

The public release notes describe named queries for scenarios such as admin users, public S3 buckets, unencrypted EBS volumes, unused Elastic IPs, and secrets without rotation.

That means you can move from “What exists?” to “What looks questionable?” without much ceremony.

For example:

awsmap query -n admin-users
awsmap query -n public-s3-buckets
awsmap query -n ebs-unencrypted
awsmap query -n unused-eips

Those are not, strictly speaking, migration-only questions. But they are precisely the kind of questions that surface during migration planning, especially when the destination design is meant to improve governance rather than merely relocate the furniture.

Asking questions in plain English

One of the nicer additions in the newer version is the ability to ask plain-English questions.

That is the sort of feature that normally causes one to brace for disappointment. But here the approach is intentionally local and deterministic. This functionality is a built-in parser rather than an LLM-based service, which means no API keys, no network calls to an external model, and no need to ship resource metadata somewhere mysterious.

That matters in enterprise environments where the phrase “just send the metadata to a third-party AI service” tends to receive the warm reception usually reserved for wasps.

Some examples:

awsmap ask show me lambda functions by region
awsmap ask list databases older than 180 days
awsmap ask find ec2 instances without Owner tag

Even when the exact wording varies, the basic idea is appealing. Team members who do not want to write SQL can still interrogate the inventory. That lowers the barrier for using the tool during workshops, handovers, and review sessions.
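To make the deterministic approach concrete, here is a toy sketch of the technique: a keyword table mapped onto SQL fragments, with stdlib fuzzy matching supplying the typo tolerance. This is my illustration of the general idea, not awsmap's actual parser, and every name in it is invented.

```python
import difflib

# Toy keyword -> service mapping (invented for illustration).
SERVICE_WORDS = {
    "lambda": "lambda",
    "functions": "lambda",
    "databases": "rds",
    "instances": "ec2",
    "buckets": "s3",
}

def ask(question):
    """Translate a plain-English question into SQL, deterministically and locally."""
    filters = set()
    for word in question.lower().split():
        # difflib gives cheap typo tolerance: 'lamda' still matches 'lambda'.
        match = difflib.get_close_matches(word, SERVICE_WORDS, n=1, cutoff=0.8)
        if match:
            filters.add(SERVICE_WORDS[match[0]])
    where = " OR ".join(f"service='{s}'" for s in sorted(filters)) or "1=1"
    return f"SELECT region, service, name FROM resources WHERE {where}"

print(ask("show me lamda functions by region"))
```

No API keys, no network calls, and the worst possible failure mode is a slightly wrong WHERE clause rather than a leaked inventory.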

Where AWSMap fits next to AWS native services

This is the part worth stating clearly.

AWSMap is useful, but it is not a replacement for AWS Resource Explorer, AWS Config, or every other native mechanism you might use for discovery, governance, and inventory.

AWS Resource Explorer can search across supported resource types and, since 2024, can also discover all tagged AWS resources using the tag:all operator. AWS documentation also notes an important limitation for IAM tags in Resource Explorer search.

AWS Config, meanwhile, continues to expand the resource types it can record, assess, and aggregate. AWS has announced multiple additions in 2025 and 2026 alone, which underlines that the native inventory and compliance story is still moving quickly.

So why use AWSMap at all?

Because its strengths are slightly different.

It is local.

It is quick to run.

It gives you a portable HTML report.

It stores results in SQLite for later interrogation.

It lets you query the inventory directly without setting up a broader governance platform first.

That makes it particularly handy in the early assessment phase, in consultancy-style discovery work, or in those awkward inherited environments where you need a fast baseline before deciding what the more permanent controls should be.

The weak points worth admitting

No serious article about a tool should pretend the tool has descended from heaven in perfect condition, so here are the caveats.

First, coverage breadth is not the same thing as universal depth. A tool can support a large number of services and still provide uneven detail between them. That is true of almost every inventory tool ever made.

Second, the quality of the result still depends on the credentials and permissions you use. If your access is partial, your inventory will be partial, and no amount of cheerful HTML will alter that fact.

Third, local storage is convenient, but it also means you should be disciplined about how scan outputs are handled on your machine, especially if you are working with client environments. Convenience and hygiene should remain on speaking terms.

Fourth, for organisation-wide governance, compliance history, managed rules, and native integrations, AWS services such as Config still have an obvious place. AWSMap is best seen as a sharp assessment tool, not a universal control plane.

That is not a criticism so much as a matter of proper expectations.

A practical workflow for migration discovery

If I were using AWSMap at the start of a migration assessment, the workflow would be something like this.

First, run a broad scan of the account or profile you care about.

awsmap -p client-prod

Then, if the account is noisy, refine the scope.

awsmap -p client-prod -s ec2,rds,elb,iam,route53
awsmap -p client-prod --exclude-defaults

Next, use a few named queries to surface obvious issues.

awsmap query -n public-s3-buckets
awsmap query -n secrets-no-rotation
awsmap query -n admin-users

After that, ask targeted questions in either SQL or plain English.

awsmap ask list load balancers by region
awsmap ask show databases with no backup tag
awsmap query "SELECT region, COUNT(*) AS total
FROM resources
WHERE service='ec2'
GROUP BY region
ORDER BY total DESC"

And finally, keep the HTML report and local inventory as a baseline for later design discussions.

That is where the tool earns its keep. It gives you a reasonably fast, reasonably structured picture of an estate before the migration plan turns into a debate based on memory, folklore, and screenshots in old slide decks.

When the guessing stops

There is a particular kind of misery in cloud work that comes from being asked to improve an environment before anyone has properly described it.

Tools do not eliminate that misery, but some of them reduce it to a more manageable size.

AWSMap is one of those.

It is not the only way to inventory AWS resources. It is not a substitute for native governance services. It is not magic. But it is practical, fast to understand, and surprisingly helpful when the first job in a migration is simply to stop guessing.

That alone makes it worth knowing about.

And in cloud migrations, a tool that helps replace guessing with evidence is already doing better than half the room.

Tracing the origin of S3 objects with Amazon Athena

A while back, I opened the console for one of my busier S3 buckets and there, nestling among hundreds of perfectly ordinary files, sat something called quarterly_report_20260305.pdf. It looked exactly like the mystery jar of jam you find at the back of the fridge after a family gathering: you know it wasn’t there yesterday, yet nobody owns up to putting it there. In shared buckets, this sort of thing happens all the time, and the console, bless it, only tells you the last modified date. It never whispers who the actual culprit was.

That is where Amazon Athena and S3 server access logs come in. With a few straightforward steps, you can turn those silent log files into a clear answer delivered in plain SQL. The whole process feels pleasantly like detective work you can finish while your tea is still hot, and the article you are reading now should take you no more than ten minutes from start to finish.

The tech stack

Amazon S3 is the sturdy cupboard that holds nearly everything these days, yet for all its virtues, it keeps the identity of each uploader politely hidden. To make that identity visible, you must first switch on server access logging. Once enabled, every request (upload, download, delete) is written as plain text and dropped into a bucket you choose. Logging itself is free to enable, though you pay the modest storage cost of the log files in the destination bucket, a small price for sanity.

Enter Amazon Athena, a serverless query service that lets you run ordinary SQL straight against those log files without moving a single byte. The first time I used it, I felt the same quiet thrill you get when you finally locate the right key in a drawer full of oddments: everything suddenly becomes possible with almost no effort.

Step 1. Ensure logging is enabled

Go to the Properties tab of your source bucket and turn on Server access logging. Point the logs at a separate bucket (I like to call mine something obvious like logs-companyname) and give them a clean prefix such as access-logs/.

A small practical note I learned after waiting in vain one afternoon: the logs can take up to a few hours to appear, rather like a letter that has gone round the long way. They do not backfill old activity, so if you need history, you must wait until new events occur. Also, check the target bucket’s policy; a missing permission once left me staring at empty folders and muttering mild British curses at the screen.
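If you prefer to script this step instead of clicking through the Properties tab, the same switch can be flipped with boto3. The sketch below uses invented bucket names; put_bucket_logging is the real API call, and the helper is split out only so the payload stays visible and testable without AWS credentials.

```python
def logging_status(log_bucket, prefix="access-logs/"):
    # The BucketLoggingStatus payload that put_bucket_logging expects.
    return {"LoggingEnabled": {"TargetBucket": log_bucket, "TargetPrefix": prefix}}

def enable_access_logging(source_bucket, log_bucket):
    import boto3  # imported here so the helper above works without AWS credentials
    s3 = boto3.client("s3")
    # Remember: the logging service also needs permission to write to log_bucket;
    # a missing bucket policy is the classic cause of mysteriously empty folders.
    s3.put_bucket_logging(
        Bucket=source_bucket,
        BucketLoggingStatus=logging_status(log_bucket),
    )

# enable_access_logging("my-company-data", "my-company-logs")
```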

Step 2. Create the Athena table with partitioning

Now tell Athena how to read the logs. The following DDL uses partition projection, a quietly brilliant feature that means you never have to run MSCK REPAIR TABLE again. It works out the date folders automatically.

Run this in the Athena query editor (I have changed the names and paths from my real ones so nothing sensitive slips through):

CREATE DATABASE IF NOT EXISTS s3_logs_database;

CREATE EXTERNAL TABLE IF NOT EXISTS access_logs_table (
    bucketowner string,
    bucket_name string,
    requestdatetime string,
    remoteip string,
    requester string,
    requestid string,
    operation string,
    key string,
    request_uri string,
    httpstatus string,
    errorcode string,
    bytessent bigint,
    objectsize bigint,
    totaltime string,
    turnaroundtime string,
    referrer string,
    useragent string,
    versionid string,
    hostid string,
    sigv string,
    ciphersuite string,
    authtype string,
    endpoint string,
    tlsversion string,
    accesspointarn string,
    aclrequired string
)
PARTITIONED BY (timestamp string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)
LOCATION 's3://my-company-logs/access-logs/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.timestamp.type' = 'date',
    'projection.timestamp.range' = '2024/01/01,NOW',
    'projection.timestamp.format' = 'yyyy/MM/dd',
    'projection.timestamp.interval' = '1',
    'projection.timestamp.interval.unit' = 'DAYS',
    'storage.location.template' = 's3://my-company-logs/access-logs/${timestamp}/'
);

Replace the LOCATION and storage.location.template with your own bucket and prefix. The regex looks like the sort of thing a medieval monk would doodle in the margin, but it is the official AWS pattern and works reliably.

Step 3. Uncover the upload details

With the table ready, finding who dropped that quarterly report is almost childishly simple. Here is the query I use (again with changed details):

SELECT 
    requestdatetime AS upload_time,
    requester,
    remoteip,
    operation,
    key AS file_path
FROM access_logs_table
WHERE timestamp = '2026/03/05'
  AND key LIKE '%quarterly_report_20260305%'
  AND operation LIKE '%PUT%'
LIMIT 10;

The requester column is the star of the show: it shows the IAM user ARN or role that performed the action. The requestdatetime column gives the exact second in UTC, and operation confirms it was indeed an upload. I once traced a suspicious file only to discover it was my own weekend script behaving like an overenthusiastic puppy that had wandered off and come back with a stick.

If the requester says Anonymous, you know the upload came from public access, a gentle reminder to tighten the bucket policy before the next family gathering in the cloud.

Cost saving tips with partitions

Athena charges five dollars per terabyte scanned. Without the timestamp filter, you can accidentally scan months of logs while hunting for one file, an expense roughly equivalent to buying an entire chocolate cake when you only wanted one slice. With partitioning, the query reads only the folder for that single day, and the bill usually stays under the price of a decent biscuit.
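The arithmetic is worth seeing once. At five dollars per scanned terabyte, the gap between an unfiltered scan and a single partitioned day is stark; the data sizes below are invented purely for illustration.

```python
PRICE_PER_TB = 5.00  # Athena's per-terabyte-scanned price, as quoted above

def scan_cost(gb_scanned):
    # Cost in dollars for a query that scans gb_scanned gigabytes.
    return round(gb_scanned / 1024 * PRICE_PER_TB, 4)

print(scan_cost(500))  # six months of unfiltered logs: 2.4414 dollars
print(scan_cost(3))    # one partitioned day: 0.0146 dollars
```

Neither number is ruinous on its own, but run the broad version a few dozen times a week and the chocolate cake adds up.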

I speak from experience: one forgetful broad query once produced a polite note from AWS that made my eyebrows rise. Since then, I filter on timestamp first and sleep better at night.

Wrapping up

What began as a mild annoyance with a stray file has turned into a small daily pleasure. A few lines of SQL, a properly partitioned table, and the cloud cupboard becomes pleasantly transparent. The same technique helps with compliance checks, pipeline debugging, and the quiet satisfaction of knowing exactly what is going on in your storage.

If you have a few minutes this afternoon, turn on logging for a test bucket and give the table a try. You may find, as I did, that the small effort pays back in clarity and calm far beyond the few pennies it costs. And the next time something mysterious appears in your S3 cupboard, you will know exactly who left it there and when.

Why generic auto scaling is terrible for healthcare pipelines

Let us talk about healthcare data pipelines. Running high volume payer processing pipelines is a lot like hosting a mandatory potluck dinner for a group of deeply eccentric people with severe and conflicting dietary restrictions. Each payer behaves with maddening uniqueness. One payer bursts through the door, demanding an entire roasted pig, which they intend to consume in three minutes flat. This requires massive, short-lived computational horsepower. Another payer arrives with a single boiled pea and proceeds to chew it methodically for the next five hours, requiring a small but agonizingly persistent trickle of processing power.

On top of this culinary nightmare, there are strict rules of etiquette. You absolutely must digest the member data before you even look at the claims data. Eligibility files must be validated before anyone is allowed to touch the dessert tray of downstream jobs. The workload is not just heavy. It is incredibly uneven and delightfully complicated.

Buying folding chairs for a banquet

On paper, Amazon Web Services' managed auto scaling mechanisms should fix this problem. They are designed to look at a growing pile of work and automatically hire more help. But applying generic auto scaling to healthcare pipelines is like a restaurant manager seeing a line out the door and solving the problem by buying fifty identical plastic folding chairs.

The manager does not care that one guest needs a high chair and another requires a reinforced steel bench. Auto scaling reacts to the generic brute force of the system load. It cannot look at a specific payer and tailor the compute shape to fit their weird eating habits. It cannot enforce the strict social hierarchy of job priorities. It scales the infrastructure, but it completely fails to scale the intention.

This is why we abandoned the generic approach and built our own dynamic EC2 provisioning system. Instead of maintaining a herd of generic servers waiting around for something to do, we create bespoke servers on demand based on a central configuration table.

The ruthless nightclub bouncer of job scheduling

Let us look at how this actually works regarding prioritization. Our system relies on that central configuration table to dictate order. Think of this table as the guest list at an obnoxiously exclusive nightclub. Our scheduler acts as the ruthless bouncer.

When jobs arrive at the queue, the bouncer checks the list. Member data? Right this way to the VIP lounge, sir. Claims data? Stand on the curb behind the velvet rope until the members are comfortably seated. Generic auto scaling has no native concept of this social hierarchy. It just sees a mob outside the club and opens the front doors wide. Our dynamic approach gives us perfect, tyrannical control over who gets processed first, ensuring our pipelines execute in a beautifully deterministic way. We spin up exactly the compute we specify, exactly when we want it.
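A stripped-down sketch of the bouncer, with the guest list reduced to a dictionary. The real system reads this from a central configuration table, and every name below is invented; only the sorting discipline is the point.

```python
# The guest list: lower number = higher priority. In production this lives
# in a configuration table; a dict stands in for the sketch.
JOB_PRIORITY = {
    "member": 0,       # VIP lounge: digested before anything else
    "eligibility": 1,  # validated before anyone touches downstream jobs
    "claims": 2,       # behind the velvet rope until members are seated
}

def admit_order(queued_jobs):
    """Return jobs in the order the bouncer lets them through the door."""
    return sorted(queued_jobs, key=lambda job: JOB_PRIORITY.get(job["type"], 99))

queue = [
    {"type": "claims", "payer": "payer-a"},
    {"type": "member", "payer": "payer-b"},
    {"type": "eligibility", "payer": "payer-a"},
]
print([job["type"] for job in admit_order(queue)])  # ['member', 'eligibility', 'claims']
```

Because the sort is stable, jobs of equal priority keep their arrival order, which keeps the determinism the pipelines depend on.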

Leaving your car running in the garage

Then there is the financial absurdity of warm pools. Standard auto scaling often relies on keeping a baseline of idle instances warm and ready, just in case a payer decides to drop a massive batch of files at two in the morning.

Keeping idle servers running is the technological equivalent of leaving your car engine idling in the closed garage all night just in case you get a sudden craving for a carton of milk at dawn. It is expensive, it is wasteful, and it makes you look a bit foolish when the AWS bill arrives.

Our dynamic system operates with a baseline of zero. We experience one hundred percent burst efficiency because we only pay for the exact compute we use, precisely when we use it. Cost savings happen naturally when you refuse to pay for things that are sitting around doing nothing.

A delightfully brutal server lifecycle

The operational model we ended up with is almost comically simple compared to traditional methods. A generic scaling group requires complex scaling policies, tricky cooldown periods, and endless tweaking of CloudWatch alarms. It is like managing a highly sensitive, moody teenager.

Our dynamic EC2 model is wonderfully ruthless. We create the instance and inject it with a single, highly specific purpose via a startup script. The instance wakes up, processes the healthcare data with absolute precision, and then politely self destructs so it stops billing us. They are the mayflies of the cloud computing world. They live just long enough to do their job, and then they vanish. There are no orphaned instances wandering the cloud.
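The mayfly lifecycle can be sketched as a launch request: a startup script that runs exactly one job and then powers off, paired with the shutdown behaviour set to terminate so that powering off is self-destruction rather than a nap. The helper names, AMI, and job path below are invented; InstanceInitiatedShutdownBehavior and UserData are real run_instances parameters.

```python
def build_user_data(job_command):
    # Startup script injected at launch: do one job, then power off.
    return "\n".join([
        "#!/bin/bash",
        job_command,
        "shutdown -h now",  # with 'terminate' behaviour below, this is self-destruction
    ])

def launch_params(ami_id, instance_type, job_command):
    # Pass this dict to boto3: ec2.run_instances(**launch_params(...))
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Powering off from inside the instance terminates it, not merely stops it,
        # so there is nothing left behind to keep billing us.
        "InstanceInitiatedShutdownBehavior": "terminate",
        "UserData": build_user_data(job_command),
    }

params = launch_params("ami-0123456789abcdef0", "c5.4xlarge", "/opt/jobs/process_payer_a.sh")
```

The roasted-pig payer gets a large instance type, the boiled-pea payer gets a small one, and both vanish the moment the plate is clean.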

This dynamic provisioning model has fundamentally altered how we digest payer workloads. We have somehow achieved a weird but perfect holy grail of cloud architecture. We get the granular flexibility of serverless functions, the raw, unadulterated horsepower of dedicated EC2 instances, and the stingy cost efficiency of a pure event-driven design.

If your processing jobs vary wildly from payer to payer, and if you care deeply about enforcing priorities without burning money on idle metal, building a disposable compute army might be exactly what your architecture is missing. We said goodbye to our idle servers, and honestly, we do not miss them at all.

The lazy cloud architect guide to AWS automation

The shortcuts I use on every project now, after learning that scale mostly changes the bill, not the mistakes.

Let me tell you how this started. I used to measure my productivity by how many AWS services I could haphazardly stitch together in a single afternoon. Big mistake.

One night, I was deploying what should have been a boring, routine feature. Nothing fancy. Just basic plumbing. Six hours later, I was still babysitting the deployment, clicking through the AWS console like a caffeinated lab rat, re-running scripts, and manually patching up tiny human errors.

That is when the epiphany hit me like a rogue server rack. I was not slow because AWS is a labyrinth of complexity. I was slow because I was doing things manually that AWS already knows how to do in its sleep.

The patterns below did not come from sanitized tutorials. They were forged in the fires of shipping systems under immense pressure and desperately wanting my weekends back.

Event-driven everything and absolutely no polling

If you are polling, you are essentially paying Jeff Bezos for the privilege of wasting your own time. Polling is the digital equivalent of sitting in the backseat of a car and constantly asking, “Are we there yet?” every five seconds.

AWS is an event machine. Treat it like one. Instead of writing cron jobs that anxiously ask the database if something changed, just let AWS tap you on the shoulder when it actually happens.

Where this shines:

  • File uploads
  • Database updates
  • Infrastructure state changes
  • Cross-account automation

Example of reacting to an S3 upload instantly:

def lambda_handler(event, context):
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']

        # Stop asking if the file is there. AWS just handed it to you.
        trigger_completely_automated_workflow(bucket_name, object_key)

No loops. No waiting. Just action.

Pro tip: Event-driven systems fail less frequently simply because they do less work. They are the lazy geniuses of the cloud world.

Immutable deployments or nothing

SSH is not a deployment strategy. It is a desperate cry for help.

If your deployment plan involves SSH, SCP, or uttering the cursed phrase “just this one quick change in production”, you do not have a system. You have a fragile ecosystem built on hope and duct tape. I stopped “fixing” servers years ago. Now, I just murder them and replace them with fresh clones.

The pattern is brutally simple:

  1. Build once
  2. Deploy new
  3. Destroy old

Example of launching a new EC2 version programmatically:

import boto3

ec2_client = boto3.client('ec2', region_name='eu-west-1')
response = ec2_client.run_instances(
    ImageId='ami-0123456789abcdef0', # Totally fake AMI
    InstanceType='t3a.nano',
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Purpose', 'Value': 'EphemeralClone'}]
    }]
)

It is like doing open-heart surgery. Instead of trying to fix the heart while the patient is running a marathon, just build a new patient with a healthy heart and disintegrate the old one. When something breaks, I do not debug the server. I debug the build process. That is where the real parasites live.

Infrastructure as code for the forgettable things

Most teams only use IaC for the big, glamorous stuff. VPCs. Kubernetes clusters. Massive databases.

This is completely backwards. It is like wearing a bespoke tuxedo but forgetting your underwear. The small, forgettable resources are the ones that will inevitably bite you when you least expect it.

What I automate with religious fervor:

  • IAM roles
  • Alarms
  • Schedules
  • Policies
  • Log retention

Example of creating a CloudWatch alarm in code:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-1')
cloudwatch.put_metric_alarm(
    AlarmName="QueueIsExploding",
    MetricName="ApproximateNumberOfMessagesVisible",
    Namespace="AWS/SQS",
    Dimensions=[{"Name": "QueueName", "Value": "important-work-queue"}],
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=1,
    Period=300,
    Statistic="Sum"
)

If it matters in production, it lives in code. No exceptions.

Let Step Functions own the flow

Early in my career, I crammed all my business logic into Lambdas. Retries, branching, timeouts, bizarre edge cases. I treated them like a digital junk drawer.

I do not do that anymore. Lambdas should be as dumb and fast as a golden retriever chasing a tennis ball.

The new rule: One Lambda equals one job. If you need a workflow, use Step Functions. They are the micromanaging middle managers your architecture desperately needs.

Example of a simple workflow state:

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:DoOneThingWell",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 3,
      "MaxAttempts": 2
    }
  ],
  "Next": "CelebrateSuccess"
}

This separation makes debugging highly visual, makes retries explicit, and makes onboarding the new guy infinitely less painful. Your future self will thank you.

Kill cron jobs and use managed schedulers

Cron jobs are perfectly fine until they suddenly are not.

They are the ghosts of your infrastructure. They are completely invisible until they fail, and when they do fail, they die in absolute silence like a ninja with a sudden heart condition. AWS gives you managed scheduling. Just use it.

Why this is fundamentally faster:

  • Central visibility
  • Built-in retries
  • IAM-native permissions

Example of creating a scheduled rule:

import boto3

eventbridge = boto3.client('events', region_name='eu-west-1')
eventbridge.put_rule(
    Name="TriggerNightlyChaos",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
    Description="Wakes up the system when nobody is looking"
)

Automation should be highly observable. Cron jobs are just waiting in the dark to ruin your Tuesday.

Bake cost controls into automation

Speed without cost awareness is just a highly efficient way to bankrupt your employer. The fastest teams I have ever worked with were not just shipping fast. They were failing cheaply.

What I automate now with the ruthlessness of a debt collector:

  • Budget alerts
  • Resource TTLs
  • Auto-shutdowns for non-production environments

Example of tagging resources with an expiration date:

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')
ec2.create_tags(
    Resources=['i-0deadbeef12345678'],
    Tags=[
        {"Key": "TerminateAfter", "Value": "2026-12-31"},
        {"Key": "Owner", "Value": "TheVoid"}
    ]
)

Leaving resources without an owner or an expiration date is like leaving the stove on, except this stove bills you by the millisecond. Anything without a TTL is just technical debt waiting to invoice you.
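Of course, a TTL tag only bites if something sweeps for it. Here is a minimal sketch of the reaper's decision logic; the tag names match the example above, and the sweep loop that calls describe_instances and terminates offenders is left out.

```python
from datetime import date

def is_expired(tags, today=None):
    """tags: a list of {'Key': ..., 'Value': ...} dicts, as EC2 describe calls return them."""
    today = today or date.today()
    for tag in tags:
        if tag["Key"] == "TerminateAfter":
            return date.fromisoformat(tag["Value"]) < today
    # No TTL at all: not expired, but worth flagging as unowned technical debt.
    return False

tags = [{"Key": "TerminateAfter", "Value": "2026-12-31"},
        {"Key": "Owner", "Value": "TheVoid"}]
print(is_expired(tags, today=date(2027, 1, 15)))  # True
```

Run something like this on a schedule (using the managed scheduler from the previous section, naturally) and the stove turns itself off.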

A quote I live by: “Automation does not cut costs by magic. It cuts costs by quietly preventing the expensive little mistakes humans call normal.”

The death of the cloud hero

These patterns did not make me faster because they are particularly clever. They made me faster because they completely eliminated the need to make decisions.

Less clicking. Less remembering. Absolutely zero heroics.

If you want to move ten times faster on AWS, stop asking what to build next. Once automation is in charge, real speed usually arrives as work you no longer have to remember.

How we ditched AWS ELB and accidentally built a time machine

I was staring at our AWS bill at two in the morning, nursing my third cup of coffee, when I realized something that should have been obvious months earlier. We were paying more to distribute our traffic than to process it. Our Application Load Balancer, that innocent-looking service that simply forwards packets from point A to point B, was consuming $3,900 every month. That is $46,800 a year. For a traffic cop. A very expensive traffic cop that could not even handle our peak loads without breaking into a sweat.

The particularly galling part was that we had accepted this as normal. Everyone uses AWS load balancers, right? They are the standard, the default, the path of least resistance. It is like paying rent for an apartment you only use to store your shoes. Technically functional, financially absurd.

So we did what any reasonable engineering team would do at that hour. We started googling. And that is how we discovered IPVS, a technology so old that half our engineering team had not been born when it was first released. IPVS stands for IP Virtual Server, which sounds like something from a 1990s hacker movie, and honestly, that is not far off. It was written in 1998 by a fellow named Wensong Zhang, who presumably had no idea that twenty-eight years later, a group of bleary-eyed engineers would be using his code to save more than forty-six thousand dollars a year.

The expensive traffic cop

To understand why we were so eager to jettison our load balancer, you need to understand how AWS pricing works. Or rather, how it accumulates like barnacles on the hull of a ship, slowly dragging you down until you wonder why you are moving so slowly.

An Application Load Balancer costs $0.0225 per hour. That sounds reasonable, about sixteen dollars a month. But then there are LCUs, or Load Balancer Capacity Units, which charge you for every new connection, every rule evaluation, every processed byte. It is like buying a car and then discovering you have to pay extra every time you turn the steering wheel.

In practice, this meant our ALB was consuming fifteen to twenty percent of our entire infrastructure budget. Not for compute, not for storage, not for anything that actually creates value. Just for forwarding packets. It was the technological equivalent of paying a butler to hand you the remote control.

The ALB also had some architectural quirks that made us scratch our heads. It terminated TLS, which sounds helpful until you realize we were already terminating TLS at our ingress. So we were decrypting traffic, then re-encrypting it, then decrypting it again. It was like putting on a coat to go outside, then taking it off and putting on another identical coat, then finally going outside. The security theater was strong with this one.

A trip to 1999

I should confess that when we started this project, I had no idea what IPVS even stood for. I had heard it mentioned in passing by a colleague who used to work at a large Chinese tech company, where apparently everyone uses it. He described it with the kind of reverence usually reserved for vintage wine or classic cars. “It just works,” he said, which in engineering terms is the highest possible praise.

IPVS, I learned, lives inside the Linux kernel itself. Not in a container, not in a microservice, not in some cloud-managed abstraction. In the actual kernel. This means when a packet arrives at your server, the kernel looks at it, consults its internal routing table, and forwards it directly. No context switches, no user-space handoffs, no “let me ask my manager” delays. Just pure, elegant packet forwarding.

The first time I saw it in action, I felt something I had not felt in years of cloud engineering. I felt wonder. Here was code written when Bill Clinton was president, when the iPod was still three years away, when people used modems to connect to the internet. And it was outperforming a service that AWS charges thousands of dollars for. It was like discovering that your grandfather’s pocket watch keeps better time than your smartwatch.

How the magic happens

Our setup is almost embarrassingly simple. We run a DaemonSet called ipvs-router on dedicated, tiny nodes in each Availability Zone. Each pod does four things, and it does them with the kind of efficiency that makes you question everything else in your stack.

First, it claims an Elastic IP using kube-vip, a CNCF project that lets Kubernetes pods take ownership of spare EIPs. No AWS load balancer required. The pod simply announces “this IP is mine now”, and the network obliges. It feels almost rude how straightforward it is.

Second, it programs IPVS in the kernel. IPVS builds an L4 load-balancing table that forwards packets at line rate. No proxies, no user-space hops. The kernel becomes your load balancer, which is a bit like discovering your car engine can also make excellent toast. Unexpected, but delightful.

Third, it syncs with Kubernetes endpoints. A lightweight controller watches for new pods, and when one appears, IPVS adds it to the rotation in less than a hundred milliseconds. Scaling feels instantaneous because, well, it basically is.

But the real trick is the fourth thing. We use something called Direct Server Return, or DSR. Here is how it works. When a request comes in, it travels from the client to IPVS to the pod. But the response goes directly from the pod back to the client, bypassing the load balancer entirely. The load balancer never sees response traffic. That is how we get ten times the throughput. It is like having a traffic cop who only directs cars into the city but does not care how they leave.

The code that makes it work

Here is what our DaemonSet looks like. I have simplified it slightly for readability, but this is essentially what runs in our production cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ipvs-router
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ipvs-router
  template:
    metadata:
      labels:
        app: ipvs-router
    spec:
      hostNetwork: true
      containers:
      - name: ipvs-router
        image: ghcr.io/kube-vip/kube-vip:v0.8.0
        args:
        - manager
        env:
        - name: vip_arp
          value: ""true""
        - name: port
          value: ""443""
        - name: vip_interface
          value: eth0
        - name: vip_cidr
          value: ""32""
        - name: cp_enable
          value: ""true""
        - name: cp_namespace
          value: kube-system
        - name: svc_enable
          value: ""true""
        - name: vip_leaderelection
          value: ""true""
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW

The key here is hostNetwork: true, which gives the pod direct access to the host’s network stack. Combined with the NET_ADMIN capability, this allows IPVS to manipulate the kernel’s routing tables directly. It requires a certain level of trust in your containers, but then again, so does running a load balancer in the first place.

We also use a custom controller to sync Kubernetes endpoints with IPVS. Here is the core logic:

# Simplified endpoint sync logic
import subprocess

from kubernetes import client

k8s_client = client.CoreV1Api()  # assumes kube config is already loaded
# VIP is the Elastic IP this node announces via kube-vip

def sync_endpoints(service_name, namespace):
    # Get current endpoints from Kubernetes
    endpoints = k8s_client.list_namespaced_endpoints(
        namespace=namespace,
        field_selector=f"metadata.name={service_name}"
    )

    # Extract pod IPs
    pod_ips = []
    for subset in endpoints.items[0].subsets:
        for address in subset.addresses:
            pod_ips.append(address.ip)

    # Build IPVS rules using ipvsadm
    for ip in pod_ips:
        subprocess.run([
            "ipvsadm", "-a", "-t",
            f"{VIP}:443", "-r", f"{ip}:443", "-g"
        ], check=True)

    # The -g flag enables Direct Server Return (DSR)
    return len(pod_ips)

The numbers that matter

Let me tell you about the math, because the math is almost embarrassing for AWS. Our old ALB took about five milliseconds to set up a new connection. IPVS takes less than half a millisecond. That is not an improvement. That is a different category of existence. It is the difference between walking to the shops and being teleported there.

While our ALB would start getting nervous around one hundred thousand concurrent connections, IPVS just does not. It could handle millions. The only limit is how much memory your kernel has, which in our case meant we could have hosted the entire internet circa 2003 without breaking a sweat.

In terms of throughput, our ALB topped out around 2.5 gigabits per second. IPVS saturates the 25-gigabit NIC on our c7g.medium instances. That is ten times the throughput, for those keeping score at home. The load balancer stopped being the bottleneck, which was refreshing because previously it had been like trying to fill a swimming pool through a drinking straw.

But the real kicker is the cost. Here is the breakdown. We run one c7g.medium spot instance per availability zone, three zones total. Each costs about $0.017 per hour. That is $0.051 per hour for compute. We also have three Elastic IPs at $0.005 per hour each, which is $0.015 per hour. With Direct Server Return, outbound transfer costs are effectively zero because responses bypass the load balancer entirely.

The total? A mere $0.066 per hour, or about $0.022 per hour per availability zone. That is just over two cents an hour per zone. Let’s not call it optimization, let’s call it a financial exorcism. We went from shelling out $3,900 a month to a modest $48. The savings alone could fund a very capable engineer’s caffeine habit.
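The arithmetic is simple enough to sanity-check in a few lines, using the rates quoted above and AWS’s customary 730-hour billing month:

```python
HOURS_PER_MONTH = 730            # AWS's usual billing approximation

spot_rate = 0.017                # c7g.medium spot, USD per hour (as observed)
eip_rate = 0.005                 # Elastic IP, USD per hour
zones = 3

hourly = zones * (spot_rate + eip_rate)      # one router node + one EIP per AZ
monthly = hourly * HOURS_PER_MONTH

old_monthly = 3900               # what the ALB was costing
yearly_savings = (old_monthly - monthly) * 12

print(f"${hourly:.3f}/hour, ${monthly:.0f}/month, ~${yearly_savings:,.0f}/year saved")
```

Spot prices wobble, so treat the numbers as a snapshot rather than a quote, but the shape of the conclusion does not change.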

But what about L7 routing

At this point, you might be raising a valid objection. IPVS is dumb L4. It does not inspect HTTP headers, it does not route based on gRPC metadata, and it does not care about your carefully crafted REST API conventions. It just forwards packets based on IP and port. It is the postal worker of the networking world. Reliable, fast, and utterly indifferent to what is in the envelope.

This is where we layer in Envoy, because intelligence should live where it makes sense. Here is how the request flow works. A client connects to one of our Elastic IPs. IPVS forwards that connection to a random healthy pod. Inside that pod, an Envoy sidecar inspects the HTTP/2 headers or gRPC metadata and routes to the correct internal service.

The result is L4 performance at the edge and L7 intelligence at the pod. We get the speed of kernel-level packet forwarding combined with the flexibility of modern service mesh routing. It is like having a Formula 1 engine in a car that also has comfortable seats and a good sound system. Best of both worlds. Our Envoy configuration looks something like this:

static_resources:
  listeners:
  - name: ingress_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 443
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          ""@type"": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress
          route_config:
            name: local_route
            virtual_hosts:
            - name: api
              domains:
              - ""api.ourcompany.com""
              routes:
              - match:
                  prefix: ""/v1/users""
                route:
                  cluster: user_service
              - match:
                  prefix: ""/v1/orders""
                route:
                  cluster: order_service
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              ""@type"": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

The afternoon we broke everything

I should mention that our first attempt did not go smoothly. In fact, it went so poorly that we briefly considered pretending the whole thing had never happened and going back to our expensive ALBs.

The problem was DNS. We pointed our api.ourcompany.com domain at the new Elastic IPs, and then we waited. And waited. And nothing happened. Traffic was still going to the old ALB. It turned out that our DNS provider had a TTL of one hour, which meant that even after we updated the record, most clients were still using the old IP address for, well, an hour.

But that was not the real problem. The real problem was that we had forgotten to update our health checks. Our monitoring system was still pinging the old ALB’s health endpoint, which was now returning 404s because we had deleted the target group. So our alerts were going off, our pagers were buzzing, and our on-call engineer was having what I can only describe as a difficult afternoon.

We fixed it, of course. Updated the health checks, waited for DNS to propagate, and watched as traffic slowly shifted to the new setup. But for about thirty minutes, we were flying blind, which is not a feeling I recommend to anyone who values their peace of mind.

Deploying this yourself

If you are thinking about trying this yourself, the good news is that it is surprisingly straightforward. The bad news is that you will need to know your way around Kubernetes and be comfortable with the idea of pods manipulating kernel networking tables. If that sounds terrifying, perhaps stick with your ALB. It is expensive, but it is someone else’s problem.

Here is the deployment process in a nutshell. First, deploy the DaemonSet. Then allocate some spare Elastic IPs in your subnet; the pods will auto-claim them using kube-vip. Also, ensure your worker node IAM roles have permission to reassociate Elastic IPs, or your pods will shout into the void without anyone listening. There is one particular quirk of AWS networking that can ruin your afternoon: the source/destination check. By default, EC2 instances reject traffic that does not match their assigned IP address. Since this setup explicitly relies on handling traffic for IP addresses the instance does not technically ‘own’ (the virtual IPs), AWS treats that as suspicious activity and drops the packets. You must disable the source/destination check on any instance running these router pods. It is a simple checkbox in the console, but forgetting it is the difference between a working load balancer and a black hole. Finally, update your DNS to point at the new IPs, using latency-based routing if you want to be fancy. Then watch as your ALB target group drains, and delete the ALB next week after you are confident everything is working.
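The checkbox, and the IAM permissions, can of course be scripted so nobody forgets them. A rough sketch with boto3 — the client is passed in, the instance IDs are placeholders, and the IAM statement is my guess at the minimal set of actions kube-vip needs, so verify it against your own setup:

```python
def disable_source_dest_check(ec2, instance_ids):
    """Let router nodes handle traffic for VIPs they do not 'own'.

    `ec2` is a boto3 EC2 client. Forgetting this turns the
    load balancer into a black hole: EC2 silently drops the packets.
    """
    for instance_id in instance_ids:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            SourceDestCheck={"Value": False},
        )

# Minimal IAM statement so kube-vip can claim and move Elastic IPs
EIP_POLICY_STATEMENT = {
    "Effect": "Allow",
    "Action": [
        "ec2:DescribeAddresses",
        "ec2:AssociateAddress",
        "ec2:DisassociateAddress",
    ],
    "Resource": "*",
}
```

Run the function against every node in the router DaemonSet’s node group, ideally from the same automation that provisions the nodes, so a replacement instance does not quietly reintroduce the black hole.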

The whole setup takes about three hours the first time, and maybe thirty minutes if you do it again. Three hours of work for $46,000 per year in savings. That is $15,000 per hour, which is not a bad rate by anyone’s standards.

What we learned about cloud computing

Three months after we made the switch, I found myself at an AWS conference, listening to a presentation about their newest managed load balancing service. It was impressive, all machine learning and auto-scaling and intelligent routing. It was also, I calculated quietly, about four hundred times more expensive than our little IPVS setup.

I did not say anything. Some lessons are better learned the hard way. And as I sat there, sipping my overpriced conference coffee, I could not help but smile.

AWS managed services are built for speed of adoption and lowest-common-denominator use cases. They are not built for peak efficiency, extreme performance, or cost discipline. For foundational infrastructure like load balancing, a little DIY unlocks exponential gains.

The embarrassing truth is that we should have done this years ago. We were so accustomed to reaching for managed services that we never stopped to ask whether we actually needed them. It took a 2 AM coffee-fueled bill review to make us question the assumptions we had been carrying around.

Sometimes the future of cloud computing looks a lot like 1999. And honestly, that is exactly what makes it beautiful. There is something deeply satisfying about discovering that the solution to your expensive modern problem was solved decades ago by someone working on a much simpler internet, with much simpler tools, and probably much more sleep.

Wensong Zhang, wherever you are, thank you. Your code from 1998 is still making engineers happy in 2026. That is not a bad legacy for any piece of software.

The author would like to thank his patient colleagues who did not complain (much) during the DNS propagation incident, and the kube-vip maintainers who answered his increasingly desperate questions on Slack.

AWS architecture choices I would not repeat

I was holding a lukewarm Americano in my left hand and a lukewarm sense of dread in my right when the Slack notifications started arriving. It was one of those golden hour afternoons where the light hits your monitor at exactly the wrong angle, turning your screen into a mirror that reflects your own panic back at you. CloudWatch was screaming. Not the dignified beep of a minor alert, but the full banshee wail of latency charts gone vertical.

My coffee had developed that particular skin on top that lukewarm coffee gets when you have forgotten it exists. I stared at the graph. Our system, which I had personally architected with the confidence of a man who had read half a documentation page, was melting in real time. The app was not even big. We had fewer concurrent users than a mid-sized bowling league, yet there we were. Throttling errors stacked up like dirty dishes in a shared apartment kitchen. Cold starts multiplied like rabbits on a vitamin regimen. Costs were rising faster than my blood pressure, which at that moment could have powered a small turbine.

That afternoon changed how I design systems. After four years of writing Python and just enough AWS experience to be dangerous, I learned the cardinal rule. Most architectures that look elegant at small scale are just disasters wearing tuxedos. Here is how I built a Rube Goldberg machine of regret, and how I eventually stopped lighting my own infrastructure on fire.

The Godzilla Lambda and the art of overeating

At first, it felt elegant. One Lambda function to handle everything. Image resizing, email sending, report generation, user authentication, and probably the kitchen sink if I had thought to attach plumbing. One deployment. One mental model. One massive mistake.

I called it my Swiss Army knife approach. Except this particular knife weighed eighty pounds and required three weeks’ notice to open. The function had more conditional branches than a family tree in a soap opera. If the event type was ‘resize_image’, it did one thing. If it was ‘send_email’, it did another. It was essentially a diner where the chef was also the waiter, the dishwasher, and the person who had to physically restrain customers who complained about the meatloaf.

The cold starts were spectacular. My function would wake up slower than a teenager on a Monday morning after an all-night gaming session. It dragged itself into consciousness, looked around, and slowly remembered it had responsibilities. Deployments became existential gambles. Change a comma in the email formatting logic, and you risk taking down the image processing pipeline that paying customers actually cared about. Logs turned into a crime scene where every suspect had the same fingerprint.

The automation scripts I had written to manage this beast were just duct tape on top of more duct tape. They had to account for the fact that the entry point was a fragile monolith masquerading as serverless elegance.

Now I build small, single-purpose functions. Each one does exactly one thing, like a very boring but highly reliable employee. My resize handler resizes. My email handler emails. They do not mingle. They do not gossip. They do not share IAM policies at the same coffee station.

Here is the only snippet of code you need to see today, mostly because it is so short it could fit in a tweet from someone with a short attention span.

def handler(event, context):
    return process_invoice(event.get("invoice_id"))

That is it. No if statements doing interpretive dance. No switch cases having an identity crisis. If a Lambda needs more than one IAM policy, it is already too big. It is like needing two different keys to open your refrigerator. If that is the case, you have designed a refrigerator incorrectly.

Using HTTP to check the mailbox

API Gateway is powerful. It is also expensive, verbose, and absolutely overkill for workflows where no human is holding a browser. I learned this the day I decided to route every single background job through API Gateway because I valued consistency over solvency. My AWS bill arrived looking like a phone number. A long one.

I was using HTTP requests for internal automation. Let that sink in. I was essentially hiring a limousine to drive across the street to check my mailbox. Every time a background job needed to trigger another background job, it went through API Gateway. That meant authentication layers, request validation, and pricing tiers designed for enterprise traffic, all to handle my little cron job that cleaned up temporary files.

Debugging was a nightmare wrapped in an OAuth flow. I spent three hours one Tuesday trying to figure out why an internal service could not authenticate, only to realize I had designed a system where my left hand needed to show my right hand three forms of government ID just to borrow a stapler.

The fix was to remember that computers can talk to each other without pretending to be web browsers. I switched to event-driven architecture using SNS and SQS. Now my producers throw messages into a queue like dropping letters into a mailbox, and they do not care who picks them up. The consumers grab what they need when they are ready.

import json

import boto3

sns_client = boto3.client("sns")
sns_client.publish(
    TopicArn=REPORT_GENERATION_TOPIC,
    Message=json.dumps({"customer_id": "CUST-8842", "report_type": "quarterly"})
)

The producers have no idea who consumes the message. They do not need to know. It is like leaving a note on the fridge instead of calling your roommate on their cell phone every time you need to tell them the milk is sour. If humans are not calling the endpoint, it probably should not be HTTP. Save your API Gateway budget for something that actually faces the internet, like that side project you will never finish.
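For completeness, here is the other side of the fridge note: a consumer sketch, assuming the SNS topic fans out to an SQS queue with raw message delivery enabled. The SQS client and queue URL are passed in, and `handle` stands in for whatever does the real work:

```python
import json

def drain_reports_queue(sqs, queue_url, handle):
    """Pull pending report jobs and acknowledge only after success.

    `sqs` is a boto3 SQS client. Long polling (WaitTimeSeconds)
    avoids hammering the API with empty receives.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        handle(job)                      # e.g. generate the quarterly report
        sqs.delete_message(              # if handle() raised, the message stays
            QueueUrl=queue_url,          # and SQS redelivers it later
            ReceiptHandle=message["ReceiptHandle"],
        )
```

Deleting the message only after `handle` returns is the whole trick: a crashed consumer simply leaves the letter in the mailbox for the next one.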

The server with amnesia

This one still stings. I used to run cron jobs on EC2 instances. Backups, cleanup scripts, data pipelines, all scheduled on a server that I treated like a reliable employee instead of the forgetful intern it actually was.

It worked perfectly until the instance restarted. Which instances do. They reboot for maintenance, for updates, for mysterious AWS reasons that arrive in emails written in that particular corporate tone that suggests everything is fine while your world burns. Every time the server came back up, it had the memory of a goldfish with a head injury. Scheduled jobs vanished into the ether. Backups did not happen. Cleanup scripts sat idle while storage costs climbed.

I spent three mornings a week SSHing into instances like a nervous parent checking if a sleeping teenager is still breathing. I would type crontab -l with the same trepidation one might use when opening a credit card statement after a vacation. Is everything there? Did it forget? Is the database backup running, or am I going to explain to the CEO why our disaster recovery plan is actually just a disaster?

If your automation depends on a server staying alive, it is not automation. It is hope dressed up in a shell script.

I replaced it with EventBridge and Lambda. EventBridge does not forget. EventBridge does not take vacations. EventBridge does not require you to log in at 3 AM in your pajamas to check if it is still breathing. It triggers the function, the function does the work, and if something breaks, it either retries or sends a message to a dead letter queue where you can ignore it at your leisure during business hours.
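Wiring that up takes two calls. A hedged sketch — the rule name, ARNs, and retry numbers are all placeholders, and `events` would be `boto3.client("events")`:

```python
def schedule_nightly_job(events, rule_name, function_arn, dlq_arn):
    """Create a cron rule and point it at a Lambda, with retries and a DLQ."""
    events.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 2 * * ? *)",   # 02:00 UTC, every night
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "nightly-job",
            "Arn": function_arn,
            "RetryPolicy": {
                "MaximumRetryAttempts": 3,
                "MaximumEventAgeInSeconds": 3600,
            },
            # Failed events land here for business-hours archaeology
            "DeadLetterConfig": {"Arn": dlq_arn},
        }],
    )
```

Unlike a crontab, all of this survives reboots, shows up in the console, and is governed by IAM rather than by whoever last had SSH access.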

Trusting the database to save itself

I trusted RDS autoscaling because the documentation made it sound intelligent. Like having a butler who watches your dinner party and quietly brings more chairs when guests arrive. The reality was more like having a butler who stands in the corner watching the house catch fire, then asks if you would like a chair.

The database would hit a traffic spike. Connections would pile up like shoppers at a Black Friday doorbuster sale. The application layer would be perfectly healthy, humming along, wondering why the database was on fire. By the time RDS autoscaling decided to add capacity, the damage was done. The connection pool had already exhausted itself. Automation scripts designed to recover the situation could not even connect to run their recovery logic. It was like calling the fire department only to find out they start driving when they smell smoke, not when the alarm rings.

Now I automate predictive scaling. It is not fancy. It is just intentional. I have scripts that check expected connection loads against current capacity. If we are going to hit five hundred connections, the script starts warming up a larger instance class before we need it. It is like preheating an oven instead of shoving a turkey into a cold metal box and hoping for the best.
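A stripped-down sketch of that script, with the clients passed in and the five-hundred-connection figure from above. The 80% headroom and the target instance class are assumptions, not gospel:

```python
from datetime import datetime, timedelta

def should_scale_up(current_connections, max_connections=500, headroom=0.8):
    """Preheat the oven: scale at 80% of capacity, not at 100%."""
    return current_connections >= max_connections * headroom

def preemptive_resize(cloudwatch, rds, db_id, bigger_class="db.r6g.xlarge"):
    """Check recent connection load and warm a larger class if needed.

    `cloudwatch` and `rds` are boto3 clients for those services.
    """
    now = datetime.utcnow()
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_id}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = stats.get("Datapoints", [])
    current = datapoints[-1]["Average"] if datapoints else 0
    if should_scale_up(current):
        rds.modify_db_instance(
            DBInstanceIdentifier=db_id,
            DBInstanceClass=bigger_class,
            ApplyImmediately=True,
        )
```

Note that `modify_db_instance` causes a brief failover on Multi-AZ setups, so "before we need it" really does mean before, not during, the spike.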

AWS gives you primitives. Architecture is deciding when not to trust the defaults, because the defaults are designed to keep AWS running, not to keep you sane.

Reading tea leaves in a hurricane

I once thought centralized logging meant dumping everything into CloudWatch and calling it observability. This is the equivalent of shoveling all your mail into a closet and claiming you have a filing system. Technically true, practically useless.

My automation depended on parsing these logs. I wrote regex patterns that looked like ancient Sumerian curses. They would match error messages sometimes, ignore them other times, and occasionally trigger alerts on completely irrelevant noise because someone had logged the word error in a debugging statement about their lunch order.

During incidents, I would stare at these logs trying to find patterns. It was like trying to identify a specific scream in a horror movie marathon. Everything was urgent. Nothing was actionable. My scripts could not tell the difference between a critical database failure and a debug message about cache expiration. They were essentially reading entrails.

Structured logs saved my sanity. Now everything gets dumped as JSON with actual fields. Event types, durations, identifiers, all labeled and searchable. My automation can trigger follow-up jobs when specific events complete. It can detect anomalies by looking at actual numeric fields instead of trying to parse human-readable text like some kind of desperate fortune teller.

logger.info(
    "task_completed",
    extra={
        "job_type": "inventory_sync",
        "warehouse_id": "WH-15",
        "duration_ms": 1420,
        "items_processed": 847
    }
)

Logs are not for humans anymore. They are for systems. Humans should read dashboards. Systems should read logs. Confuse the two, and you end up with alerts that cry wolf at 3 AM because someone spelled success wrong.
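The standard library will happily emit that structure if you hand it a formatter. A minimal sketch — in production I would reach for a library like python-json-logger, but the mechanics fit in a dozen lines:

```python
import json
import logging

# Attributes every LogRecord carries by default; anything else came via extra=
RESERVED = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message"}

class JsonFormatter(logging.Formatter):
    """One JSON object per line: logs for systems, dashboards for humans."""
    def format(self, record):
        payload = {"event": record.getMessage(), "level": record.levelname}
        for key, value in record.__dict__.items():
            if key not in RESERVED:
                payload[key] = value   # job_type, duration_ms, and friends
        return json.dumps(payload)
```

Attach it to a handler and every `extra=` field becomes a searchable key instead of a phrase buried in free text, which is exactly what the anomaly-detection scripts need.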

The quiet killer wearing a price tag

This is the one that really hurts. Everything worked. Latency was acceptable. Automation was smooth. The system scaled. Then the bill arrived, and I nearly spilled my coffee onto the keyboard. If cost is not part of your architecture, scale will punish you like a gym teacher who has decided you need motivation.

I had built something that scaled technically but not financially. It was like designing an airplane that flies beautifully but requires fuel that costs more than the GDP of a small nation. Every request through API Gateway, every idle EC2 waiting for a cron job that might not come, every poorly optimized Lambda running for fifteen seconds because I had not bothered to trim the dependencies, it all added up.

Now I automate cost checks. Before expensive jobs run, they estimate their impact. If the daily budget threshold approaches, the system starts making choices. It defers non-critical tasks. It sends warnings. It acts like a responsible adult at a bar when the tab starts getting too high.

def should_process_batch(estimated_cost, daily_spend):
    remaining_budget = DAILY_LIMIT - daily_spend
    return estimated_cost < (remaining_budget * 0.8)

Simple guardrails save real money. There is a saying I keep taped to my monitor now. If it scales technically but not financially, it does not scale. It is just a very efficient way to go bankrupt.
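Where does `daily_spend` come from? One option is the Cost Explorer API. A sketch, assuming `ce` is `boto3.client("ce")` — and note that this API charges $0.01 per request, so cache the answer rather than asking on every batch:

```python
from datetime import date, timedelta

def todays_spend(ce):
    """Return today's unblended cost in USD, straight from Cost Explorer."""
    today = date.today()
    result = ce.get_cost_and_usage(
        TimePeriod={
            "Start": today.isoformat(),
            "End": (today + timedelta(days=1)).isoformat(),
        },
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
    )
    return float(result["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
```

Cost Explorer data lags by several hours, so this is a guardrail against runaway days, not a real-time meter; pair it with budget alerts for the fast-moving disasters.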

The art of rehearsed failure

Every bad decision I made had the same DNA. I optimized for speed of development. I ignored the longevity of automation. I trusted defaults because reading the full documentation seemed like work for people who had more time than I did. I treated AWS like a magic wand instead of a very powerful, very expensive tool that requires respect.

Good architecture is not about services. It is about failure modes you have already rehearsed in your head. It is about assuming you will forget what you built in six months, because you will. It is about assuming growth will happen, failure will happen, and at some point, you will be trying to debug this thing while your phone buzzes with angry messages from people who just want the system to work.

Build like you are designing a kitchen for a very forgetful, very busy chef who might be slightly drunk. Label everything. Make the dangerous stuff hard to do by accident. Keep the receipts. And for the love of all that is holy, do not put cron jobs on EC2.

Your cloud bill is just a mirror of your engineering soul

I have a theory that usually gets me uninvited to the best tech parties. It is a controversial opinion, the kind that makes people shift uncomfortably in their ergonomic chairs and check their phones. Here it is. AWS is not expensive. AWS is actually a remarkably fair judge of character. Most of us are just bad at using it. We are not unlucky, nor are we victims of some grand conspiracy by Jeff Bezos to empty our bank accounts. We are simply lazy in ways that we are too embarrassed to admit.

I learned this the hard way, through a process that felt less like a financial audit and more like a very public intervention.

The expensive silence of a six-figure mistake

Last year, our AWS bill crossed a number that made the people in finance visibly sweat. It was a six-figure sum appearing monthly, a recurring nightmare dressed up as an invoice. The immediate reactions from the team were predictable: a chorus of denial that sounded like a broken record. People started whispering about the insanity of cloud pricing. We talked about negotiating discounts, even though we had no leverage. There was serious talk of going multi-cloud, which is usually just a way to double your problems while hoping for a synergy that never comes. Someone even suggested going back to on-prem servers, which is the technological equivalent of moving back in with your parents because your rent is too high.

We were looking for a villain, but the only villain in the room was our own negligence. Instead of pointing fingers at Amazon, we froze all new infrastructure for two weeks. We locked the doors and audited why every single dollar existed. It was painful. It was awkward. It was necessary.

We hired a therapist for our infrastructure

What we found was not a technical failure. It was a behavioral disorder. We found that AWS was not charging us for scale. It was charging us for our profound indifference. It was like leaving the water running in every sink in the house and then blaming the utility company for the price of water.

We had EC2 instances sized “just to be safe.” This is the engineering equivalent of buying a pair of XXXL sweatpants just in case you decide to take up sumo wrestling next Tuesday. We were paying for capacity we did not need, for a traffic spike that existed only in our anxious imaginations.

We discovered Kubernetes clusters wheezing along at 15% utilization. Imagine buying a Ferrari to drive to the mailbox at the end of the driveway once a week. That was our cluster. Expensive, powerful, and utterly bored.

There were NAT Gateways nobody remembered creating, chugging along in the background and charging us by the gigabyte to forward traffic. It was like paying a toll to cross a bridge that went nowhere. We had RDS instances over-provisioned for traffic that never arrived, like a restaurant staffed with fifty waiters for a lunch crowd of three.

Perhaps the most revealing discovery was our log retention policy. We were keeping CloudWatch logs forever because “storage is cheap.” It is not cheap when you are hoarding digital exhaust like a cat lady hoarding newspapers. We had autoscaling enabled without upper bounds, which is a bit like giving your credit card to a teenager and telling them to have fun. We had Lambdas retrying silently into infinity, little workers banging their heads against a wall forever.

None of this was AWS being greedy. This was engineering apathy. This was the result of a comforting myth that engineers love to tell themselves.

The hoarding habit of the modern engineer

“If it works, do not touch it.”

This mantra makes sense for stability. It is a lovely sentiment for a grandmother’s antique clock. It is a disaster for a cloud budget. AWS does not reward working systems. It rewards intentional systems. Every unmanaged default becomes a subscription you never canceled, a gym membership you keep paying for because you are too lazy to pick up the phone and cancel it.

Big companies can survive this kind of bad cloud usage because they can hide the waste in the couch cushions of their massive budgets. Startups cannot. For a startup, a few bad decisions can double your runway burn, force hiring freezes, and kill experimentation before it begins. I have seen companies rip out AWS, not because the technology failed, but because they never learned how to say no to it. They treated the cloud like an all-you-can-eat buffet and forgot that someone eventually has to pay the bill.

Denial is a terrible financial strategy

If your AWS bill feels random, you do not understand your system. If cost surprises you, your architecture is opaque. It is like finding a surprise charge on your credit card and realizing you have no idea what you bought. It is a loss of control.

We realized that if we needed a “FinOps tool” to explain our bill, our infrastructure was already too complex. We did not need another dashboard. We needed a mirror.

The boring magic of actually caring

We did not switch clouds. We did not hire expensive consultants to tell us what we already knew. We did not buy magic software to fix our mess. We did four boring, profoundly unsexy things.

First, every resource needed an owner. We stopped treating servers like communal property. If you spun it up, you fed it. Second, every service needed a cost ceiling. We put a leash on the spending. Third, every autoscaler needed a maximum limit. We stopped the machines from reproducing without permission. Fourth, every log needed a delete date. We learned to take out the trash.
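Two of those rules lend themselves to automation. Here is a sketch of the audit logic, with the resource records and field names invented purely for illustration, not taken from any AWS API:

```python
# Illustrative audit for two of the rules above: every resource needs
# an owner, and every log group needs a delete date. The record format
# and field names are invented for this sketch.

def find_orphans(resources):
    """Return resources that have no 'owner' tag."""
    return [r for r in resources if not r.get("tags", {}).get("owner")]

def find_immortal_logs(log_groups):
    """Return log groups with no retention set (i.e. kept forever)."""
    return [g for g in log_groups if g.get("retention_in_days") is None]

resources = [
    {"id": "i-0aa1", "tags": {"owner": "payments-team"}},
    {"id": "i-0bb2", "tags": {}},  # nobody claims this one
]
log_groups = [
    {"name": "/aws/lambda/ingest", "retention_in_days": 14},
    {"name": "/aws/lambda/legacy", "retention_in_days": None},  # hoarded forever
]

print([r["id"] for r in find_orphans(resources)])           # ['i-0bb2']
print([g["name"] for g in find_immortal_logs(log_groups)])  # ['/aws/lambda/legacy']
```

The point is not the code; it is that "every resource needs an owner" is a checkable invariant, not a slogan.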

The results were almost insulting in their simplicity. Costs dropped 43% in 30 days. There were no outages. There were no late night heroics. We did not rewrite the core platform. We just applied a little bit of discipline.

Why this makes engineers uncomfortable

Cost optimization exposes bad decisions. It forces you to admit that you over-engineered a solution. It forces you to admit that you scaled too early. It forces you to admit that you trusted defaults because you were too busy to read the manual. It forces you to admit that you avoided the hard conversations about budget.

It is much easier to blame AWS. It is comforting to think of them as a villain. It is harder to admit that we built something nobody questioned.

The brutal honesty of the invoice

AWS is not the villain here. It is a mirror. It shows you exactly how careless or thoughtful your architecture is, and it translates that carelessness into dollars. You can call it expensive. You can call it unfair. You can migrate to another cloud provider. But until you fix how you design systems, every cloud will punish you the same way. The problem is not the landlord. The problem is how you are living in the house.

This brings me to a final question that every engineering leader should ask themselves. If your AWS bill doubled tomorrow, would you know why? Would you know exactly where the money was going? Would you know what to delete first?

If the answer is no, the problem is not AWS. And deep down, in the quiet moments when the invoice arrives, you already know that. This article might make some people angry. That is good. Anger is cheaper than denial. And frankly, it is much better for your bottom line.

Let IAM handle the secrets you can avoid

There are two kinds of secrets in cloud security.

The first kind is the legitimate kind: a third-party API token, a password for something you do not control, a certificate you cannot simply wish into existence.

The second kind is the kind we invent because we are in a hurry: long-lived access keys, copied into a config file, then copied into a Docker image, then copied into a ticket, then copied into the attacker’s weekend plans.

This article is about refusing to participate in that second category.

Not because secrets are evil. Because static credentials are the “spare house key under the flowerpot” of AWS. Convenient, popular, and a little too generous with access for something that can be photographed.

The goal is not “no secrets exist.” The goal is no secrets live in code, in images, or in long-lived credentials.

If you do that, your security posture stops depending on perfect human behavior, which is great because humans are famously inconsistent. (We cannot all be trusted with a jar of cookies, and we definitely cannot all be trusted with production AWS keys.)

Why this works in real life

AWS already has a mechanism designed to prevent your applications from holding permanent credentials: IAM roles and temporary credentials (STS).

When your Lambda runs with an execution role, AWS hands it short-lived credentials automatically. They rotate on their own. There is nothing to copy, nothing to stash, nothing to rotate in a spreadsheet named FINAL-final-rotation-plan.xlsx.

What remains are the unavoidable secrets, usually tied to systems outside AWS. For those, you store them in AWS Secrets Manager and retrieve them at runtime. Not at build time. Not at deploy time. Not by pasting them into an environment variable and calling it “secure” because you used uppercase letters.

This gives you a practical split:

  • Avoidable secrets are replaced by IAM roles and temporary credentials
  • Unavoidable secrets go into Secrets Manager, encrypted and tightly scoped
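To make the "avoidable" half concrete: the temporary credentials STS issues to a role carry a session token and a hard expiry, which is what makes rotation automatic. A sketch of the shape, where the field names mirror the STS response but every value is fake:

```python
# Shape of the temporary credentials STS issues to an assumed role.
# Field names mirror the STS response; all values here are fake.
from datetime import datetime, timedelta, timezone

creds = {
    "AccessKeyId": "ASIAEXAMPLEONLY",
    "SecretAccessKey": "fake-secret-material",
    "SessionToken": "fake-session-token",
    "Expiration": datetime.now(timezone.utc) + timedelta(hours=1),
}

def is_expired(credentials, now=None):
    """Temporary credentials die on their own; nothing to rotate by hand."""
    now = now or datetime.now(timezone.utc)
    return now >= credentials["Expiration"]

print(is_expired(creds))  # False: still inside the one-hour window
```

A long-lived access key has no `Expiration` field, which is exactly the problem: nothing forces anyone to ever replace it.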

The architecture in one picture

A simple flow to keep in mind:

  1. A Lambda function runs with an IAM execution role
  2. The function fetches one third-party API key from Secrets Manager at runtime
  3. The function calls the third-party API and writes results to DynamoDB
  4. Network access to Secrets Manager stays private through a VPC interface endpoint (when the Lambda runs in a VPC)

The best part is what you do not see.

No access keys. No “temporary” keys that have been temporary since 2021. No secrets baked into ZIPs or container layers.

What this protects you from

This pattern is not a magic spell. It is a seatbelt.

It helps reduce the chance of:

  • Credentials leaking through Git history, build logs, tickets, screenshots, or well-meaning copy-paste
  • Forgotten key rotation schedules that quietly become “never”
  • Overpowered policies that turn a small bug into a full account cleanup
  • Unnecessary public internet paths for sensitive AWS API calls

Now let’s build it, step by step, with code snippets that are intentionally sanitized.

Step 1: build an IAM execution role with tight policies

The execution role is the front door key your Lambda carries.

If you give it access to everything, it will eventually use that access, if only because your future self will forget why it was there and leave it in place “just in case.”

Keep it boring. Keep it small.

Here is an example IAM policy for a Lambda that only needs to:

  • write to one DynamoDB table
  • read one secret from Secrets Manager
  • decrypt using one KMS key (optional, depending on how you configure encryption)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteToOneTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:111122223333:table/app-results-prod"
    },
    {
      "Sid": "ReadOneSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:thirdparty/weather-api-key-*"
    },
    {
      "Sid": "DecryptOnlyThatKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:eu-west-1:111122223333:key/12345678-90ab-cdef-1234-567890abcdef",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "secretsmanager.eu-west-1.amazonaws.com"
        }
      }
    }
  ]
}

A few notes that save you from future regret:

  • The secret ARN ends with -* because Secrets Manager appends a random suffix.
  • The KMS condition helps ensure the key is used only through Secrets Manager, not as a general-purpose decryption service.
  • You can skip the explicit kms:Decrypt statement if you use the AWS-managed key and accept the default behavior, but customer-managed keys are common in regulated environments.
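A cheap way to keep a policy boring is to lint it before it ships. This is not an official AWS tool, just a sketch of the check you would want in a pipeline, and it is no substitute for IAM Access Analyzer:

```python
# Minimal least-privilege lint: flag Allow statements that use a
# wildcard Action or Resource. A sketch, not an official AWS tool.
def wildcard_statements(policy):
    findings = []
    for stmt in policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Both fields may be a single string or a list in IAM JSON.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if "*" in actions or "*" in resources:
            findings.append(stmt.get("Sid", "<no Sid>"))
    return findings

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Sid": "Fine", "Effect": "Allow",
         "Action": ["dynamodb:PutItem"],
         "Resource": "arn:aws:dynamodb:eu-west-1:111122223333:table/app-results-prod"},
    ],
}

print(wildcard_statements(policy))  # ['TooBroad']
```

Run something like this in CI and a wildcard never survives long enough for your future self to defend it "just in case."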

Step 2: store the unavoidable secret properly

Secrets Manager is not a place to dump everything. It is a place to store what you truly cannot avoid.

A third-party API key is a perfect example because IAM cannot replace it. AWS cannot assume a role in someone else’s SaaS.

Use a JSON secret so you can extend it later without creating a new secret every time you add a field.

{
  "api_key": "REDACTED-EXAMPLE-TOKEN"
}

If you like the CLI (and I do, because buttons are too easy to misclick), create the secret like this:

aws secretsmanager create-secret \
  --name "thirdparty/weather-api-key" \
  --description "Token for the Weatherly API used by the ingestion Lambda" \
  --secret-string '{"api_key":"REDACTED-EXAMPLE-TOKEN"}' \
  --region eu-west-1

Then configure:

  • encryption with a customer-managed KMS key if required
  • rotation if the provider supports it (rotation is amazing when it is real, and decorative when the vendor does not allow it)

If the vendor does not support rotation, you still benefit from central storage, access control, audit logging, and removing the secret from code.

Step 3: lock down secret access with a resource policy

Identity-based policies on the Lambda role are necessary, but resource policies are a nice extra lock.

Think of it like this: your role policy is the key. The resource policy is the bouncer who checks the wristband.

Here is a resource policy that allows only one role to read the secret.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyIngestionRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    },
    {
      "Sid": "DenyEverythingElse",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
        }
      }
    }
  ]
}

This is intentionally strict. Strict is good. Strict is how you avoid writing apology emails.
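If you manage this from code rather than the console, Secrets Manager has a `PutResourcePolicy` API for attaching the document. A hedged sketch: the helper name below is my own, and the live call is commented out because it needs real AWS credentials:

```python
import json

def build_single_reader_policy(role_arn):
    """Assemble a resource policy that lets exactly one role read the secret."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowOnlyIngestionRole",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "*",
        }],
    })

policy_doc = build_single_reader_policy(
    "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
)

# Attaching it would look like this (requires AWS credentials, so it
# stays commented out in this sketch):
# import boto3
# boto3.client("secretsmanager").put_resource_policy(
#     SecretId="thirdparty/weather-api-key",
#     ResourcePolicy=policy_doc,
# )

print(json.loads(policy_doc)["Statement"][0]["Principal"]["AWS"])
```

Building the document in code also means the bouncer's guest list lives in version control, where changes get reviewed.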

Step 4: keep Secrets Manager traffic private with a VPC endpoint

If your Lambda runs inside a VPC, it will not automatically have internet access. That is often the point.

In that case, you do not want the function reaching Secrets Manager through a NAT gateway if you can avoid it. NAT works, but it is like walking your valuables through a crowded shopping mall because the back door is locked.

Use an interface VPC endpoint for Secrets Manager.

Here is a Terraform example (sanitized) that creates the endpoint and limits access using a dedicated security group.

resource "aws_security_group" "secrets_endpoint_sg" {
  name        = "secrets-endpoint-sg"
  description = "Allow HTTPS from Lambda to Secrets Manager endpoint"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.eu-west-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.secrets_endpoint_sg.id]
}

If your Lambda is not in a VPC, you do not need this step. The function will reach Secrets Manager over AWS’s managed network path by default.

If you want to go further, consider adding a DynamoDB gateway endpoint too, so your function can write to DynamoDB without touching the public internet.

Step 5: retrieve the secret at runtime without turning logs into a confession

This is where many teams accidentally reinvent the problem.

They remove the secret from the code, then log it. Or they put it in an environment variable because “it is not in the repository,” which is a bit like saying “the spare key is not under the flowerpot, it is under the welcome mat.”

The clean approach is:

  • store only the secret name (not the secret value) as configuration
  • retrieve the value at runtime
  • cache it briefly to reduce calls and latency
  • never print it, even when debugging, especially when debugging

Here is a Python example for AWS Lambda with a tiny TTL cache.

import json
import os
import time
import boto3

_secrets_client = boto3.client("secretsmanager")
_cached_value = None
_cached_until = 0

SECRET_ID = os.getenv("THIRDPARTY_SECRET_ID", "thirdparty/weather-api-key")
CACHE_TTL_SECONDS = int(os.getenv("SECRET_CACHE_TTL_SECONDS", "300"))


def _get_api_key() -> str:
    global _cached_value, _cached_until

    now = int(time.time())
    if _cached_value and now < _cached_until:
        return _cached_value

    resp = _secrets_client.get_secret_value(SecretId=SECRET_ID)
    payload = json.loads(resp["SecretString"])

    api_key = payload["api_key"]
    _cached_value = api_key
    _cached_until = now + CACHE_TTL_SECONDS
    return api_key


def lambda_handler(event, context):
    api_key = _get_api_key()

    # Use the key without ever logging it.
    # call_weatherly_api and write_to_dynamodb are app-specific helpers,
    # assumed to be defined elsewhere in the module.
    results = call_weatherly_api(api_key=api_key, city=event.get("city", "Seville"))

    write_to_dynamodb(results)

    return {
        "status": "ok",
        "items": len(results) if hasattr(results, "__len__") else 1
    }

This snippet is intentionally short. The important part is the pattern:

  • minimal secret access
  • controlled cache
  • zero secret output

If you prefer a library, AWS provides a Secrets Manager caching client for some runtimes, and AWS Lambda Powertools can help with structured logging. Use them if they fit your stack.

Step 6: make security noisy with logs and alarms

Security without visibility is just hope with a nicer font.

At a minimum:

  • enable CloudTrail in the account
  • ensure Secrets Manager events are captured
  • alert on unusual secret access patterns

A simple and practical approach is a CloudWatch metric filter for GetSecretValue events coming from unexpected principals. Another is to build a dashboard showing:

  • Lambda errors
  • Secrets Manager throttles
  • sudden spikes in secret reads
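The logic behind such an alert is simple enough to express directly. A sketch over CloudTrail-style event records, with the record shape simplified and all ARNs invented for illustration:

```python
# Flag GetSecretValue calls made by principals outside an allowlist.
# The event shape is a simplified stand-in for CloudTrail records.
EXPECTED_PRINCIPALS = {
    "arn:aws:iam::111122223333:role/lambda-ingestion-prod",
}

def suspicious_secret_reads(events):
    return [
        e for e in events
        if e.get("eventName") == "GetSecretValue"
        and e.get("principalArn") not in EXPECTED_PRINCIPALS
    ]

events = [
    {"eventName": "GetSecretValue",
     "principalArn": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"},
    {"eventName": "GetSecretValue",
     "principalArn": "arn:aws:iam::111122223333:user/intern-laptop"},
    {"eventName": "PutItem",
     "principalArn": "arn:aws:iam::111122223333:user/intern-laptop"},
]

print([e["principalArn"] for e in suspicious_secret_reads(events)])
# ['arn:aws:iam::111122223333:user/intern-laptop']
```

In production you would express the same predicate as a CloudWatch metric filter over the CloudTrail log group, wired to an alarm.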

Here is a tiny Terraform example that keeps your Lambda logs from living forever (because storage is forever, but your attention span is not).

resource "aws_cloudwatch_log_group" "lambda_logs" {
  name              = "/aws/lambda/lambda-ingestion-prod"
  retention_in_days = 14
}

Also consider:

  • IAM Access Analyzer to spot risky resource policies
  • AWS Config rules or guardrails if your organization uses them
  • an alarm on unexpected NAT data processing if you intended to keep traffic private

Common mistakes I have made, so you do not have to

I am listing these because I have either done them personally or watched them happen in slow motion.

  1. Using a wildcard secret policy
    secretsmanager:GetSecretValue on * feels convenient until it is a breach multiplier.
  2. Putting secret values into environment variables
    Environment variables are not evil, but they are easy to leak through debugging, dumps, tooling, or careless logging. Store secret names there, not secret contents.
  3. Retrieving secrets at build time
    Build logs live forever in the places you forget to clean. Runtime retrieval keeps secrets out of build systems.
  4. Logging too much while debugging
    The fastest way to leak a secret is to print it “just once.” It will not be just once.
  5. Skipping the endpoint and relying on NAT by accident
    The NAT gateway is not evil either. It is just an expensive and unnecessary hallway if a private door exists.

A two minute checklist you can steal

  • Your Lambda uses an IAM execution role, not access keys
  • The role policy scopes Secrets Manager access to one secret ARN pattern
  • The secret has a resource policy that only allows the expected role
  • Secrets are encrypted with KMS when required
  • The secret value is never stored in code, images, build logs, or environment variables
  • If Lambda runs in a VPC, you use an interface VPC endpoint for Secrets Manager
  • You have CloudTrail enabled and you can answer “who accessed this secret” without guessing

Extra thoughts

If you remove long-lived credentials from your applications, you remove an entire class of problems.

You stop rotating keys that should never have existed in the first place.

You stop pretending that “we will remember to clean it up later” is a security strategy.

And you get a calmer life, which is underrated in engineering.

Let IAM handle the secrets you can avoid.

Then let Secrets Manager handle the secrets you cannot.

And let your code do what it was meant to do: process data, not babysit keys like they are a toddler holding a permanent marker.

How Dropbox saved millions by leaving AWS

Most of us treat cloud storage like a magical, bottomless attic. You throw your digital clutter into a folder: PDFs of tax returns from 2014, blurred photos of a cat that has long since passed away, unfinished drafts of novels, and you forget about them. It feels weightless. It feels ephemeral. But somewhere in a windowless concrete bunker in Virginia or Oregon, a spinning platter of rust is working very hard to keep those cat photos alive. And every time that platter spins, a meter is running.

For the first decade of its existence, Dropbox was essentially a very polished, user-friendly frontend for Amazon’s garage. When you saved a file to Dropbox, their servers handled the metadata (the index card that says where the file is), but the actual payload (the bytes themselves) was quietly ushered into Amazon S3. It was a brilliant arrangement. It allowed a small startup to scale without worrying about hard drives catching fire or power supplies exploding.

But then Dropbox grew up. And when you grow up, living in a hotel starts to get expensive.

By 2015, Dropbox was storing hundreds of petabytes of data. The problem wasn’t just the storage fee, which is akin to paying rent. The real killer was the “egress” and request fees. Amazon’s business model is brilliantly designed to function like the Hotel California: you can check out any time you like, but leaving with your luggage is going to cost you a fortune. Every time a user opened a file, edited a document, or synced a folder, a tiny cash register dinged in Jeff Bezos’s headquarters.

The bill was no longer just an operating expense. It was an existential threat. The unit economics were starting to look less like a software business and more like a philanthropy dedicated to funding Amazon’s R&D department.

So, they decided to do something that is generally considered suicidal in the modern software era. They decided to leave the cloud.

The audacity of building your own closet

In Silicon Valley, telling investors you plan to build your own data centers is like telling your spouse you plan to perform your own appendectomy using a steak knife and a YouTube tutorial. It is seen as messy, dangerous, and generally regressive. The prevailing wisdom is that hardware is a commodity, a utility like electricity or sewage, and you should let the professionals handle the sludge.

Dropbox ignored this. They launched a project with the internally ironic name “Magic Pocket.” The goal was to build a storage system from scratch that was cheaper than Amazon S3 but just as reliable.

To understand the scale of this bad idea, you have to understand that S3 is a miracle of engineering. It boasts “eleven nines” of durability (99.999999999%). That means if you store 10,000 files, you might lose one every 10 million years. Replicating that level of reliability requires an obsessive, almost pathological attention to detail.

Dropbox wasn’t just buying servers from Dell and plugging them in. They were designing their own chassis. They realized that standard storage servers were too generic. They needed density. They built a custom box nicknamed “Diskotech” (because engineers love puns almost as much as they love caffeine) that could cram up to a petabyte of storage into a rack unit that was barely deeper than a coffee table.

But hardware has a nasty habit of obeying the laws of physics, and physics is often annoying.

Good vibrations and bad hard drives

When you pack hundreds of spinning hard drives into a tight metal box, you encounter a phenomenon that sounds like a joke but is actually a nightmare: vibration.

Hard drives are mechanical divas. They consist of magnetic platters spinning at 7,200 revolutions per minute, with a read/write head hovering nanometers above the surface. If the drive vibrates too much, that head can’t find the track. It misses. It has to wait for the platter to spin around again. This introduces latency. If enough drives in a rack vibrate in harmony, the performance drops off a cliff.

The Dropbox team found that even the fans cooling the servers were causing acoustic vibrations that made the hard drives sulk. They had to become experts in firmware, dampening materials, and the resonant frequencies of sheet metal. It is the kind of problem you simply do not have when you rent space in the cloud. In the cloud, a vibrating server is someone else’s ticket. When you own the metal, it’s your weekend.

Then there was the software. They couldn’t just use off-the-shelf Linux tools. They wrote their own storage software in Rust. At the time, Rust was the new kid on the block, a language that promised memory safety without the garbage collection pauses of Go or Java. Using a relatively new language to manage the world’s most precious data was a gamble, but it paid off. It allowed them to squeeze every ounce of efficiency out of the CPU, keeping the power bill (and the heat) down.

The great migration was a stealth mission

Building the “Magic Pocket” was only half the battle. The other half was moving 500 petabytes of data from Amazon to these new custom-built caverns without losing a single byte and without any user noticing.

They adopted a strategy that I like to call the “belt, suspenders, and duct tape” approach. For a long period, they used a technique called dual writing. Every time you uploaded a file, Dropbox would save a copy to Amazon S3 (the old reliable) and a copy to their new Magic Pocket (the risky experiment).

They then spent months just verifying the data. They would ask the Magic Pocket to retrieve a file, compare it to the S3 version, and check if they matched perfectly. It was a paranoia-fueled audit. Only when they were absolutely certain that the new system wasn’t eating homework did they start disconnecting the Amazon feed.
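The dual-write-and-verify idea is worth pinning down, because it generalizes to almost any storage migration. A toy sketch, with two dict-backed stores standing in for S3 and Magic Pocket:

```python
import hashlib

old_store = {}   # stands in for S3
new_store = {}   # stands in for Magic Pocket

def dual_write(key, data):
    """During migration, every upload lands in both systems."""
    old_store[key] = data
    new_store[key] = data

def verify(key):
    """Compare content hashes; only matching data earns trust."""
    old_digest = hashlib.sha256(old_store[key]).hexdigest()
    new_digest = hashlib.sha256(new_store[key]).hexdigest()
    return old_digest == new_digest

dual_write("resume.pdf", b"definitely a real resume")
print(verify("resume.pdf"))  # True: safe to serve this key from the new system
```

Only keys that pass verification get cut over to the new system; anything that fails stays on the old path while you investigate.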

They treated the migration like a bomb disposal operation. They moved users over silently. One day, you were fetching your resume from an AWS server in Virginia; the next day, you were fetching it from a custom Dropbox server in Texas. The transfer speeds were often better, but nobody sent out a press release. The ultimate sign of success in infrastructure engineering is that nobody knows you did anything at all.

The savings were vulgar

The financial impact was immediate and staggering. Over the two years following the migration, Dropbox saved nearly $75 million in operating costs. Their gross margins, the holy grail of SaaS financials, jumped from a worrisome 33% to a healthy 67%.

By owning the hardware, they cut out the middleman’s profit margin. They also gained the ability to use “Shingled Magnetic Recording” (SMR) drives. These are cheaper, high-density drives that are notoriously slow at writing data because the data tracks overlap like roof shingles (hence the name). Standard databases hate them. But because Dropbox wrote their own software specifically for their own use case (write once, read many), they could use these cheap, slow drives without the performance penalty.

This is the hidden superpower of leaving the cloud: optimization. AWS has to build servers that work reasonably well for everyone, from Netflix to the CIA to a teenager running a Minecraft server. That means they are optimized for the average. Dropbox optimized for the specific. They built a suit that fit them perfectly, rather than buying a “one size fits all” poncho from the rack.

Why you should probably not do this

If you are reading this and thinking, “I should build my own data center,” please stop. Go for a walk. Drink some water.

Dropbox’s success is the exception that proves the rule. They had a very specific workload (huge files, rarely modified) and a scale (exabytes) that justified the massive R&D expense. They had the budget to hire world-class engineers who dream in Rust and understand the acoustic properties of cooling fans.

For 99% of companies, the cloud is still the right answer. The premium you pay to AWS or Google is not just for storage; it is an insurance policy against complexity. You are paying so that you never have to think about a failed power supply unit at 3:00 AM on a Sunday. You are paying so that you don’t have to negotiate contracts for fiber optic cables or worry about the price of real estate in Nevada.

However, Dropbox didn’t leave the cloud entirely. And this is the punchline.

Today, Dropbox is a hybrid. They store the files, the cold, heavy, static blocks of data, in their own Magic Pocket. But the metadata? The search functions? The flashy AI features that summarize your documents? That all still runs in the cloud.

They treat the public cloud like a utility kitchen. When they need to cook up something complex that requires thousands of CPUs for an hour, they rent them from Amazon or Google. When they just need to store the leftovers, they put them in their own fridge.

Adulthood is knowing when to rent

The story of Dropbox leaving the cloud is not really about leaving. It is about maturity.

In the early days of a startup, you prioritize speed. You pay the “cloud tax” because it allows you to move fast and break things. But there comes a point where the tax becomes a burden.

Dropbox realized that renting is great for flexibility, but ownership is the only way to build equity. They turned a variable cost (a bill that grows every time a user uploads a photo) into a fixed cost (a warehouse full of depreciating assets). It is less sexy. It requires more plumbing.

But there is a quiet dignity in owning your own mess. Dropbox looked at the cloud, with its infinite promise and infinite invoices, and decided that sometimes, the most radical innovation is simply buying a screwdriver, rolling up your sleeves, and building the shelf yourself. Just be prepared for the vibration.

The paranoia that keeps Netflix online

On a particularly bleak Monday, October 20, the internet suffered a collective nervous breakdown. Amazon Web Services decided to take a spontaneous nap, and the digital world effectively dissolved. Slack turned into a $27 billion paperweight, leaving office workers forced to endure the horror of unfiltered face-to-face conversation. Disney+ went dark, stranding thousands of toddlers mid-episode of Bluey and forcing parents to confront the terrifying reality of their own unsupervised children. DoorDash robots sat frozen on sidewalks like confused Daleks, threatening the national supply of lukewarm tacos.

Yet, in a suburban basement somewhere in Ohio, a teenager named Tyler streamed all four seasons of Stranger Things in 4K resolution. He did not see a single buffering wheel. He had no idea the cloud was burning down around him.

This is the central paradox of Netflix. They have engineered a system so pathologically untrusting, so convinced that the world is out to get it, that actual infrastructure collapses register as nothing more than a mild inconvenience. I spent weeks digging through technical documentation and bothering former Netflix engineers to understand how they pulled this off. What I found was not just a story of brilliant code. It is a story of institutional paranoia so profound it borders on performance art.

The paranoid bouncer at the door

When you click play on The Crown, your request does not simply waltz into the Netflix servers. It first has to get past the digital equivalent of a nightclub bouncer who suspects everyone of trying to sneak in a weapon. This is Amazon’s Elastic Load Balancer, or ELB.

Most load balancers are polite traffic cops. They see a server and wave you through. Netflix’s ELB is different. It assumes that every server is about three seconds away from exploding.

Picture a nightclub with 47 identical dance floors. The bouncer’s job is to frisk you, judge your shoes, and shove you toward the floor least likely to collapse under the weight of too many people doing the Macarena. The ELB does this millions of times per second. It does not distribute traffic evenly because “even” implies trust. Instead, it routes you to the server with the least outstanding requests. It is constantly taking the blood pressure of the infrastructure.

If a server takes ten milliseconds too long to respond, the ELB treats it like a contagion. It cuts it off. It ghosts it. This is the first commandment of the Netflix religion. Trust nothing. Especially not the hardware you rent by the hour from a company that also sells lawnmowers and audiobooks.
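Stripped of metaphor, "least outstanding requests" routing plus a latency cutoff is a few lines of logic. A toy sketch, with the server records and thresholds invented for illustration:

```python
# Route to the healthy server with the fewest in-flight requests,
# ejecting anything whose latency crosses a cutoff. Numbers invented.
LATENCY_CUTOFF_MS = 50

def pick_server(servers):
    healthy = [s for s in servers if s["latency_ms"] <= LATENCY_CUTOFF_MS]
    if not healthy:
        raise RuntimeError("every dance floor has collapsed")
    return min(healthy, key=lambda s: s["outstanding"])

servers = [
    {"name": "floor-1", "outstanding": 120, "latency_ms": 12},
    {"name": "floor-2", "outstanding": 40,  "latency_ms": 18},
    {"name": "floor-3", "outstanding": 5,   "latency_ms": 900},  # contagion
]

print(pick_server(servers)["name"])  # floor-2: fewest requests among the healthy
```

Note that floor-3 has by far the fewest requests and still loses, because the latency check ghosts it first. That is the whole philosophy in two conditions.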

The traffic controller with a god complex

Once you make it past the bouncer, you meet Zuul.

Zuul is the API gateway, but that is a boring term for what is essentially a micromanager with a caffeine addiction. Zuul is the middle manager who insists on being copied on every single email and then rewrites them because he didn’t like your tone.

Its job is to route your request to the right backend service. But Zuul is neurotic. It operates through a series of filters that feel less like software engineering and more like airport security theater. There is an inbound filter that authenticates you (the TSA agent squinting at your passport), an endpoint filter that routes you (the air traffic controller), and an outbound filter that scrubs the response (the PR agent who makes sure the server didn’t say anything offensive).
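The filter-chain pattern is worth seeing in miniature. Below is a hypothetical Python sketch in the spirit of Zuul's inbound, endpoint, and outbound phases; the class names, context keys, and routing rule are all invented for illustration, not Zuul's actual API.

```python
# Hypothetical filter chain in the spirit of Zuul's three phases.
# All names and context keys here are invented for illustration.

class AuthFilter:
    """Inbound phase: the TSA agent squinting at your passport."""
    def apply(self, ctx):
        if ctx.get("token") != "valid":
            ctx["response"] = {"status": 401}
            ctx["halted"] = True  # no point routing an impostor
        return ctx

class RouteFilter:
    """Endpoint phase: the air traffic controller picking a backend."""
    def apply(self, ctx):
        backend = ("playback-service" if ctx["path"].startswith("/play")
                   else "api-service")
        ctx["response"] = {"status": 200, "served_by": backend}
        return ctx

class ScrubFilter:
    """Outbound phase: the PR agent removing anything embarrassing."""
    def apply(self, ctx):
        ctx["response"].pop("internal_debug", None)
        return ctx

def gateway(ctx, filters):
    # Run each filter in order; stop early if one halts the chain
    for f in filters:
        ctx = f.apply(ctx)
        if ctx.get("halted"):
            break
    return ctx["response"]
```

A valid request flows through all three filters and comes out routed and scrubbed; a bad token never makes it past the first one. Micromanagement, but with a clean separation of neuroses.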

All of this runs on the Netty server framework, which sounds cute but is actually a non-blocking, event-driven octopus capable of juggling tens of thousands of open connections without dropping a single packet. During the outage, while other companies’ gateways were choking on retries, Zuul continued to sort traffic with the cold detachment of a bureaucrat stamping forms during a fire drill.
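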

A dysfunctional family of specialists

Inside the architecture, there is no single “Netflix” application. There is a squabbling family of thousands of microservices. These are tiny, specialized programs that refuse to speak to each other directly and communicate only through carefully negotiated contracts.

You have Uncle User Profiles, who sits in the corner nursing a grudge about that time you watched seventeen episodes of Is It Cake? at 3 AM. There is Aunt Recommendations, a know-it-all who keeps suggesting The Office because you watched five minutes of it in 2018. Then there is Cousin Billing, who only shows up when money is involved and otherwise sulks in the basement.

This family is held together by a concept called “circuit breaking.” In the old days, they used a library called Hystrix. Think of Hystrix as a court-ordered family therapist with a taser.

When a service fails, let’s say the subtitles database catches fire, most applications would keep trying to call it, waiting for a response that will never come, until the entire system locks up. Netflix does not have time for that. If the subtitle service fails, the circuit breaker pops. The therapist steps in and says, “Uncle Subtitles is having an episode and is not allowed to talk for the next thirty seconds.”

The system then serves a fallback. Maybe you don’t get subtitles for a minute. Maybe you don’t get your personalized list of “Top Picks for Fernando.” But the video plays. The application degrades gracefully rather than failing catastrophically. It is the digital equivalent of losing a limb but continuing to run the marathon because you have a really good playlist going.

Here is a simplified view of how this “fail fast” logic looks in the configuration. It is basically a list of rules for ignoring people who are slow to answer:

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 1000
      circuitBreaker:
        requestVolumeThreshold: 20
        sleepWindowInMilliseconds: 5000

Translated to human English, this configuration says: “If you take more than one second to answer me, you are dead to me. And once I have seen at least twenty of our conversations go badly enough, I am going to ignore you for five seconds until you get your act together.” The twenty is a minimum volume of requests before the breaker will even consider tripping, not a count of failures, and the five seconds is how long the sulk lasts before it deigns to try you again.
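The family therapist is easy enough to sketch. Here is a toy circuit breaker in Python that models the open/closed behaviour from the config above; the one-second timeout half is left out for brevity, and the parameter names are mine rather than Hystrix's.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: a therapist with a taser.

    A sketch only, with invented parameter names. After at least
    `volume_threshold` calls, if the error rate crosses
    `error_percentage`, the circuit opens and every caller gets the
    fallback until `sleep_window` seconds have passed. The timeout
    half of the real config is omitted here for brevity.
    """

    def __init__(self, volume_threshold=20, sleep_window=5.0,
                 error_percentage=50, clock=time.monotonic):
        self.volume_threshold = volume_threshold
        self.sleep_window = sleep_window
        self.error_percentage = error_percentage
        self.clock = clock
        self.calls = 0
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.sleep_window:
                return fallback()      # circuit open: Uncle Subtitles may not talk
            self.opened_at = None      # sleep window over: one more chance
            self.calls = self.failures = 0
        self.calls += 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if (self.calls >= self.volume_threshold and
                    100 * self.failures / self.calls >= self.error_percentage):
                self.opened_at = now   # too many failures: open the circuit
            return fallback()
        return result
```

Wrap the subtitle lookup in `call(fetch_subtitles, lambda: None)` and a dying service degrades into "no subtitles for a bit" instead of a frozen application. That is the whole graceful-degradation trick in a dozen lines.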

The digital hoarder pantry

At the scale Netflix operates, data storage is less about organization and more about controlled hoarding. They use a system that only makes sense if you have given up on the concept of minimalism.

They use Cassandra, a NoSQL database, to store user history. Cassandra is like a grandmother who saves every newspaper from 1952 because “you never know.” It is designed to be distributed and replicated. You can lose half your hard drives, and Cassandra will simply shrug and serve the data from a replica on another node.

But the real genius, and the reason they survived the apocalypse, is EVCache. This is their homemade caching system based on Memcached. It is a massive pantry where they store snacks they know you will want before you even ask for them.

Here is the kicker. They do not just cache movie data. They cache their own credentials.

When AWS went down, the specific service that failed was often IAM (Identity and Access Management). This is the service that checks if your computer is allowed to talk to the database. When IAM died, servers all over the world suddenly forgot who they were. They were having an identity crisis.

Netflix servers did not care. They had cached their credentials locally. They had pre-loaded the permissions. It is like filling your basement with canned goods, not because you anticipate a zombie apocalypse, but because you know the grocery store manager personally and you know he is unreliable. While other companies were frantically trying to call AWS to ask, “Who am I?”, Netflix’s servers were essentially lip-syncing their way through the performance using pre-recorded tapes.
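The pattern here, caching credentials and serving stale ones when the issuer is down, is worth a sketch. This is a hypothetical Python illustration, not Netflix's or EVCache's actual API: `fetch` stands in for the real call to the credentials service, and the names and TTL are invented.

```python
import time

class CredentialCache:
    """Hypothetical sketch of the canned-goods strategy for credentials.

    `fetch` stands in for the real call to an identity service; all
    names here are assumptions. Fresh credentials are re-fetched on a
    TTL; if the fetch fails and we have anything cached, we serve the
    stale copy rather than suffer an identity crisis.
    """

    def __init__(self, fetch, ttl=300.0, clock=time.monotonic):
        self.fetch = fetch
        self.ttl = ttl
        self.clock = clock
        self.cached = None
        self.fetched_at = None

    def get(self):
        now = self.clock()
        fresh = (self.cached is not None and
                 now - self.fetched_at < self.ttl)
        if fresh:
            return self.cached
        try:
            self.cached = self.fetch()
            self.fetched_at = now
        except Exception:
            if self.cached is None:
                raise  # nothing in the basement: genuinely stuck
            # The grocery store is closed: serve the canned goods
        return self.cached
```

The crucial design choice is the `except` branch: most systems treat an expired cache entry as worthless, while this one treats "stale but present" as infinitely better than "fresh but unobtainable." During an IAM outage, that one branch is the difference between lip-syncing and silence.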

Hiring a saboteur to guard the vault

This is where the engineering culture goes from sensible to beautifully unhinged. Netflix employs the Simian Army.

This is not a metaphor. It is a suite of software tools designed to break things. The most famous is Chaos Monkey. Its job is to randomly shut down live production servers during business hours. It just kills them. No warning. No mercy.

Then there is Chaos Kong. Chaos Kong does not just kill a server. It simulates the destruction of an entire AWS region. It nukes the East Coast.

Let that sink in for a moment. Netflix pays engineers very high salaries to build software that attacks their own infrastructure. It is like hiring a pyromaniac to work as a fire inspector. Sure, he will find every flammable material in the building, but usually by setting it on fire first.
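The core of the idea fits in a few lines. Here is a toy Chaos Monkey in Python; it is a sketch only, with invented names, and the real tool honours opt-outs, schedules, and business hours rather than rampaging unconditionally.

```python
import random

def chaos_monkey(instances, kill_probability=0.1, rng=random.random):
    """Toy Chaos Monkey: walk the fleet, terminate instances at random.

    A sketch with invented names; the real tool respects opt-out
    tags and schedules. Here 'terminate' just drops the instance
    from the list and announces the murder.
    """
    survivors = []
    for instance in instances:
        if rng() < kill_probability:
            # No warning. No mercy.
            print(f"Chaos Monkey terminated {instance}")
        else:
            survivors.append(instance)
    return survivors
```

The `rng` parameter exists so the carnage can be made deterministic in a test, which is its own small joke: even the randomness that kills your servers gets unit-tested first.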

I spoke with a former engineer who described their “region evacuation” drills. “We basically declare war on ourselves,” she told me. “At 10 AM on a Tuesday, usually after the second coffee, we decide to kill us-east-1. The first time we did it, half the company needed therapy. Now? We can evacuate a region in six minutes. It’s boring.”

This is the secret sauce. The reason Netflix stayed up is that they have rehearsed the outage so many times that it feels like a chore. While other companies were discovering their disaster recovery plans were written in crayon, Netflix engineers were calmly executing a routine they practice more often than they practice dental hygiene.

Building your own highway system

There is a final plot twist. When you hit play, the video, strictly speaking, does not come from the cloud. It comes from Open Connect.

Netflix realized years ago that the public internet is a dirt road full of potholes. So they built their own private highway. They designed physical hardware, bright red boxes packed with hard drives, and shipped them to Internet Service Providers (ISPs) all over the world.

These boxes sit inside the data centers of your local internet provider. They are like mini-warehouses. When a new season of The Queen’s Gambit comes out, Netflix pre-loads it onto these boxes at 4 AM when nobody is using the internet.

So when you stream the show, the data is not traveling from an Amazon data center in Virginia. It is traveling from a box down the street. It might travel five miles instead of two thousand.

It is an invasive, brilliant strategy. It is like Netflix insisted on installing a mini-fridge in your neighbor’s garage just to ensure your beer is three degrees colder. During the cloud outage, even if the “brain” of Netflix (the control plane in AWS) was having a seizure, the “body” (the video files in Open Connect) was fine. The content was already local. The cloud could burn, but the movie was already in the house.

The beautiful absurdity of it all

The irony is delicious. Netflix is AWS’s biggest customer and its biggest success story. Yet they survive on AWS by fundamentally refusing to trust AWS. They cache credentials, they pre-pull images, they build their own delivery network, and they unleash monkeys to destroy their own servers just to prove they can survive the murder attempt.

They have weaponized Murphy’s Law. They built a company where the unofficial motto seems to be “Everything fails, all the time, so let’s get good at failing.”

So the next time the internet breaks and your Slack goes silent, do not panic. Just open Netflix. Somewhere in the dark, a Chaos Monkey is pulling a plug, a paranoid bouncer is shoving traffic away from a burning server, and your binge-watching will continue uninterrupted. The internet might be held together by duct tape and hubris, but Netflix has invested in really, really expensive duct tape.