Cloud stuff

Random Thoughts on Cloud Computing

They left AWS to save money. Coming back cost even more

Not long ago, a partner I work with told me about a company that decided it had finally had enough of AWS.

The monthly bill had become the sort of document people opened with the facial expression usually reserved for dental estimates. Consultants were invited in. Spreadsheets were produced. Serious people said serious things about control, efficiency, and the wisdom of getting off the cloud treadmill.

The conclusion sounded almost virtuous. Leave AWS, move the workloads to a colocation facility, buy the hardware, and stop renting what could surely be owned more cheaply.

It was neat. It was rational. It was, for a while, deeply satisfying.

And then reality arrived, carrying invoices.

The company spent a substantial sum getting out of AWS. Servers were bought. Contracts were signed. Staff had to be hired to manage all the things cloud providers manage quietly in the background while everyone else gets on with their jobs. Not long after, the economics began to fray. Reversing course cost even more than leaving had in the first place.

That is the part worth paying attention to.

Not because it makes for a dramatic story, though it does. Not because it is especially rare, but because it is not. It matters because it exposes one of the oldest tricks in infrastructure decision-making. Companies compare a visible bill with an invisible burden, decide the bill is the scandal, and only later discover that the burden was doing quite a lot of useful work.

The spreadsheet seduction

On paper, the move away from AWS looked wonderfully sensible.

The cloud bill was obvious, monthly, and impolite enough to keep turning up. On-premises looked calmer. Hardware could be amortized. Rack space, power, and bandwidth could be priced. With a bit of care, the whole thing could be made to resemble prudence.

This is where many repatriation plans become dangerously persuasive. The cloud is cast as an extravagant landlord. On-premises is presented as the mature decision to stop renting and finally buy the house.

Unfortunately, a data center is not a house. It is closer to owning a very large hotel whose plumbing, wiring, keys, security, fire precautions, laundry, and unexpected midnight incidents are all your responsibility, except the guests are servers and none of them leave a tip.

The spreadsheet had done a decent job of pricing the obvious things. Hardware. Colocation space. Power. Connectivity.

What the spreadsheet priced badly were all the dull, expensive capabilities that public cloud tends to bundle into the bill. Managed failover. Backup automation. Key rotation. Elastic capacity. Security controls. Compliance support. Monitoring that does not depend on a specific engineer being awake, available, and emotionally prepared.

What looked like cloud excess turned out to include a great deal of cloud competence.

That distinction matters.

A large cloud bill is easy to resent because it is visible. Operational competence is harder to resent because it tends to be hidden in the walls.

What the cloud had been doing all along

One of the costliest mistakes in infrastructure is confusing convenience with fluff.

A managed database can look expensive right up to the moment you have to build and test failover yourself, define recovery objectives, handle maintenance windows, rotate credentials, validate backups, and explain to auditors why one awkward part of the process still depends on a human remembering to do something after lunch.

A content delivery network may seem like a luxury until you try to reproduce low-latency delivery, edge caching, certificate handling, resilience, and attack mitigation with a mixture of hardware, internal effort, procurement delays, and hope.

The company, in this case, had not really been paying AWS only for compute and storage. It had been paying AWS to absorb a long list of repetitive operational chores, specialized platform decisions, and uncomfortable edge cases.

Once those chores came back in-house, they did not return politely.

Redundancy stopped being a feature and became a budget line, followed by an implementation plan, followed by a maintenance burden. Security controls that had once been inherited now had to be selected, deployed, documented, checked, and defended. Compliance work that had once been partly automated became a steady stream of evidence gathering, procedural discipline, and administrative repetition.

Cloud bills can look high. So can plumbing bills. You only discover plumbing's emotional value when it stops working.

The talent tax

The easiest part of moving on premises is buying equipment.

The harder part is finding enough people who know how to run the surrounding world properly.

Cloud expertise is now common enough that many companies can hire engineers comfortable with infrastructure as code, IAM, managed services, container platforms, observability, autoscaling, and cost controls. Strong cloud engineers are not cheap, but they are at least visible in the market.

Deep on-premises expertise is another matter. People who are strong in storage, backup infrastructure, virtualization, physical networking, hardware lifecycle, and operational recovery still exist, but they are not standing about in large numbers waiting to be discovered. They are experienced, expensive, and often well aware of their market value.

There is also a cultural issue that rarely appears in repatriation slide decks. A great many engineers would rather write Terraform than troubleshoot a hardware issue under unflattering lighting at two in the morning. This is not a moral failure. It is simple market gravity. The industry has spent years abstracting away routine infrastructure pain because abstraction is usually a better use of skilled human attention.

The partner who told me this story was particularly clear on this point. The staffing line looked manageable in planning. In practice, it turned into one of the most stubborn and underestimated parts of the whole effort.

Cloud is not cheap because expertise is cheap. Cloud is often cheaper because rebuilding enough expertise inside one company is very expensive.

Why utilization lies so beautifully

Projected utilization is one of those numbers that becomes more charming the less time it spends near reality.

Many repatriation models assume that servers will be well used, capacity will be planned sensibly, and waste will be modest. It sounds disciplined. Responsible, even.

Real workloads behave less like equations and more like kitchens during a family gathering. There are quiet periods, sudden rushes, abandoned experiments, quarter-end panics, new projects that arrive with urgency and no warning, and services no one remembers until they break.

Elasticity is not a decorative feature added by cloud providers to justify themselves. It is one of the main ways organizations avoid buying for peak demand and then spending the rest of the year paying for machinery to sit about waiting.

Without elasticity, you provision for the busiest day and fund the silence in between.

Silence, in infrastructure, is expensive.

A half-used on-premises platform still consumes power, occupies space, demands maintenance, requires patching, and waits patiently for a workload spike that visits only now and then. Spare capacity has excellent manners. It makes no fuss. It simply eats money quietly and on schedule.

This was one of the turning points in the story I heard. Forecast utilization turned out to be far more flattering than actual utilization. Once that happened, the economics began to sag under their own good intentions.

The cost of becoming slower

Traditional total-cost comparisons handle direct spending reasonably well. They are much worse at pricing lost momentum.

When a company runs on a large cloud platform, it does not merely rent infrastructure. It also gains access to a constant flow of improvements and options. Better analytics tools. New security integrations. Managed AI services. Identity features. Database capabilities. Deployment patterns. Networking enhancements. Observability tooling.

No single addition changes everything overnight. The effect is cumulative. It is a thousand small conveniences arriving over time and sparing teams from having to rebuild ordinary civilization every quarter.

An on-premises platform can be stable and well run. For the right workloads, that may be perfectly acceptable. But it does not evolve at the pace of a hyperscaler. Upgrades become projects. New capabilities require procurement, testing, staffing, and patience. The platform becomes more careful and, usually, slower.

That slower pace does not always show up neatly in a spreadsheet, but engineers feel it almost immediately.

While competitors are experimenting with new managed services or shipping new capabilities faster, the repatriated organization may be spending its time improving backup procedures, standardizing tools, negotiating maintenance arrangements, or replacing hardware that has chosen an inconvenient moment to become philosophical.

There is nothing glamorous about that. There is also nothing free about it.

Who should actually consider on-premises

None of this means on-premises is foolish.

That would be a lazy conclusion, and lazy conclusions are where expensive architecture plans begin.

For some organizations, on-premises remains entirely reasonable. It makes sense for highly predictable workloads with very little variability. It can make sense in tightly regulated environments where legal, sovereignty, or operational constraints sharply limit the use of public cloud. And at a very large scale, some organizations genuinely can justify building substantial parts of their own platform.

But most companies tempted by repatriation are not in that category.

They are not hyperscalers. They are not all running flat, perfectly predictable workloads. They are not all boxed in by constraints that make public cloud impossible. More often, they are reacting to a painful cloud bill caused by weak cost governance, poor workload fit, loose architecture discipline, or a lack of serious FinOps.

That is a very different problem.

Leaving AWS because you are using AWS badly is a bit like selling your refrigerator because the groceries keep going off while the door is open. The appliance may not be the heart of the matter.

The middle ground companies skip past

One of the stranger features of cloud debates is how quickly they become binary.

Either remain in public cloud forever, or march solemnly back to racks and cages as if returning to a lost ancestral craft.

There is, of course, a middle ground.

Some workloads do benefit from local placement because of latency, residency, plant integration, or operational constraints. But needing hardware closer to the ground does not automatically mean rebuilding the entire service model from scratch. The more useful question is often not whether the hardware should be local, but whether the control plane, automation model, and day-to-day operations should still feel cloud-like.

That is a much more practical conversation.

A company may need some infrastructure nearby while still gaining enormous value from managed identity, familiar APIs, consistent automation, and operational patterns learned in the cloud. This tends to sound less heroic than a full repatriation story, but heroism is not a particularly reliable basis for infrastructure strategy.

The partner who described this case said as much. If they had explored the middle road earlier, they might have kept the local advantages they wanted without assuming quite so much of the surrounding operational burden.

What a real repatriation audit should include

Any company seriously considering a move off AWS should pause long enough to perform an audit that is a little less enchanted by ownership.

Start with the full cloud picture, not just the line items everyone enjoys complaining about. Include engineering effort, compliance automation, security services, platform speed, operational overhead, and the cost of scaling quickly when demand changes.

Then build the on-premises model with uncommon honesty. Price round-the-clock operations. Price redundancy properly. Price backup and recovery as if they matter, because they do. Price refresh cycles, maintenance contracts, spare capacity, patching, testing, physical security, audit evidence, and the awkward certainty that hardware fails when it is least convenient.

Then ask a cultural question, not just a financial one. How many of your engineers actually want to spend more of their time dealing with the physical stack and the operational plumbing that comes with it?

That answer matters more than many executives would like.

A strategy that looks cheaper on paper but nudges your best engineers toward the door is not, in any meaningful sense, cheaper.

Finally, compare repatriation not only against your current cloud bill, but against what a disciplined cloud optimization program could achieve. Rightsizing, storage improvements, better instance strategy, autoscaling discipline, reserved capacity planning, architecture cleanup, and proper FinOps can all change the economics without requiring anyone to rediscover the intimate emotional texture of broken hardware.

The bill behind the bill

What has stayed with me about this story is that it was never really a story about AWS.

It was a story about accounting for the wrong thing.

The visible bill was treated as the entire problem. The hidden work behind the bill was treated as background scenery. Once the company moved off AWS, the scenery walked to the front of the stage and began sending invoices.

That is the trap.

Cloud can absolutely be expensive. Plenty of organizations run it badly and pay for the privilege. But on-premises is not automatically the sober adult in the room. Quite often, it is simply a different payment model, one that hides more of the cost in staffing, slower delivery, operational fragility, maintenance overhead, and all the unlovely little chores that cloud platforms had been taking care of out of sight.

The lesson from this case was not that every workload belongs in AWS forever. It was that infrastructure decisions become dangerous when they are made in reaction to irritation rather than in response to a full economic picture.

Leaving the cloud may still be the right answer for some organizations. For many others, the more useful answer is much less theatrical. Use the cloud better. Govern it better. Design it properly. Understand what you are paying for before deciding you would prefer to rebuild it yourself.

A large monthly cloud bill can be offensive to look at.

The bill that arrives after a bad attempt to escape it is usually less offensive than heartbreaking.

And heartbreak, unlike EC2, rarely comes with autoscaling.

Why Crossplane is the Kubernetes therapy your multi-cloud setup needs

Let us be perfectly honest about multi-cloud environments. They are not a harmonious symphony of computing power. They are a logistical nightmare, roughly equivalent to hosting a dinner party where one guest only eats raw vegan food, another demands a deep-fried turkey, and the third will only consume blue candy. You are running around three different kitchens trying to keep everyone alive and happy while speaking three different languages.

For years, we relied on Terraform or its open-source sibling OpenTofu to manage this chaos. These tools are fantastic, but they come with a terrifying piece of baggage known as the state file. The state file is essentially a fragile, highly sensitive diary holding the deepest, darkest secrets of your infrastructure. If that file gets corrupted or someone forgets to lock it, your cloud provider develops sudden amnesia and forgets where it put the database.

Kubernetes evolved quite a bit while we were busy babysitting our state files. It stopped being just a container orchestrator and started trying to run the whole house. Every major cloud provider released their own Kubernetes operator. Suddenly, you could manage a storage bucket or a database directly from inside your cluster. But there was a catch. The operators refused to speak to each other. You essentially hired a team of brilliant specialists who absolutely hate each other.

This is exactly where Crossplane steps in to act as the universal, unbothered therapist for your infrastructure.

Meet your new obsessive infrastructure butler

Crossplane does not care about vendor rivalries. It installs itself into your Kubernetes cluster and uses the native Kubernetes reconciliation loop to manage your external cloud resources.

If you are unfamiliar with the reconciliation loop, think of it as an aggressively helpful, obsessive-compulsive butler. You hand this butler a piece of YAML paper stating that you require a specific storage bucket in a specific region. The butler goes out, builds the bucket, and then stands there staring at it forever. If a rogue developer logs into the cloud console and manually deletes that bucket, the butler simply builds it again before the developer has even finished their morning coffee. It is relentless, slightly unnerving, and exactly what you want to keep your infrastructure in check.

Because Crossplane lives inside Kubernetes, you do not need to run a separate pipeline just to execute an infrastructure plan. The cluster itself is the engine. You declare what you want, and the cluster makes reality match your desires.

The anatomy of a multi-cloud combo meal

To understand how this actually works without getting bogged down in endless documentation, you only need to understand three main concepts.

First, you have Providers. These are the translator modules. You install the AWS Provider, the Azure Provider, or the Google Cloud Provider, and suddenly your Kubernetes cluster knows how to speak their specific dialects.
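
To make that concrete, here is roughly what installing and configuring the AWS provider can look like. The package version tag and the credentials secret name are placeholders, not gospel; the ProviderConfig name matches the one referenced by the bucket example further down.

apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws-s3
spec:
  # Version tag is an example; pin whatever your platform team has blessed
  package: xpkg.upbound.io/upbound/provider-aws-s3:v1.1.0
---
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: aws-default-provider
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-credentials        # hypothetical secret holding the AWS keys
      key: creds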

Next, you have Managed Resources. These are the raw ingredients. A single virtual machine, a single virtual network, or a single database instance. You can deploy these directly, but asking a developer to configure twenty different Managed Resources just to get a working application is like handing them a live chicken and a sack of flour, then telling them to make a sandwich.

This brings us to the real magic of Crossplane, which is the Composite Resource.

Composite Resources allow you to bundle all those raw ingredients into a single, easy-to-digest package. It is the infrastructure equivalent of a fast-food drive-through. A developer does not need to know about subnets, security groups, or routing tables. They just submit a claim for a “Standard Web Database” value meal. Crossplane takes that simple request and translates it into the complex web of resources required behind the scenes.
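
As a sketch of what ordering that value meal might look like, assuming a platform team has already published a CompositeResourceDefinition and Composition behind a made-up platform.example.org API group, the developer-facing claim can be as small as this:

apiVersion: platform.example.org/v1alpha1
kind: StandardWebDatabase
metadata:
  name: checkout-db
  namespace: team-checkout
spec:
  parameters:
    size: small          # whatever knobs the XRD schema chooses to expose
    region: eu-west-1

Everything else, the subnets, security groups, and parameter plumbing, gets composed behind the counter.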

Looking at the code without falling asleep

To prove that this is not just theoretical nonsense, let us look at what it takes to command two completely different cloud providers from the exact same place.

Normally, doing this requires switching between different tools, authenticating multiple times, and praying you do not execute the wrong command in the wrong terminal. With Crossplane, you just throw your YAML files into the cluster.

Here is a sanitized, totally harmless example of how you might ask AWS for a storage bucket.

apiVersion: s3.aws.upbound.io/v1beta1
kind: Bucket
metadata:
  name: acme-corp-financial-reports
spec:
  forProvider:
    region: eu-west-1
  providerConfigRef:
    name: aws-default-provider

And right next to it, in the exact same directory, you can drop this snippet to demand a Resource Group from Azure.

apiVersion: azure.upbound.io/v1beta1
kind: ResourceGroup
metadata:
  name: rg-marketing-dev-01
spec:
  forProvider:
    location: West Europe
  providerConfigRef:
    name: azure-default-provider

You apply these manifests, and Crossplane handles the authentication, the API calls, and the aggressive babysitting of the resources. There is no Terraform state file to protect; the desired state lives inside the cluster as Kubernetes objects, which is what makes the whole thing feel like GitOps magic.
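
Applying and checking on them is ordinary kubectl work. The file names are simply whatever you saved the two snippets as, and kubectl get managed works because Crossplane groups its external resources under the managed category.

kubectl apply -f aws-bucket.yaml -f azure-resource-group.yaml

# One view of everything Crossplane is currently babysitting
kubectl get managed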

The ugly truth about operating at scale

Of course, getting rid of the state file is like going to a music festival without a cell phone. It sounds incredibly liberating until you lose your friends and cannot find your way home.

Operating Crossplane at scale is not always a walk in the park. When things go wrong during provisioning, and they absolutely will go wrong, you do not get a neatly formatted error summary. Because there is no central state file to reference, finding out why a resource failed requires interrogating the Kubernetes API directly.

You type a command to check the status of your resources, and the cluster vomits a massive wall of text onto your screen. It is like trying to find a typo in a phone book while someone shouts at you in a foreign language. Running multiple kubectl commands just to figure out why an Azure database refused to spin up gets very old, very fast.
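
In practice, the interrogation tends to look something like this, using the bucket from the earlier example. The useful detail almost always hides in the status conditions and the events.

# Did the resource ever become Synced and Ready?
kubectl get buckets.s3.aws.upbound.io acme-corp-financial-reports

# The conditions and events usually contain the real cloud API error
kubectl describe buckets.s3.aws.upbound.io acme-corp-financial-reports

# Recent cluster events, newest at the bottom
kubectl get events --sort-by=.metadata.creationTimestamp | tail -n 20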

To survive this chaos, you cannot rely on manual terminal commands. You must pair Crossplane with a dedicated GitOps tool like ArgoCD or FluxCD.

These tools act as the adult in the room. They keep track of what was actually deployed, provide a visual dashboard, and translate the cluster’s internal panic into something a human being can actually read. They give you the visibility that Crossplane lacks out of the box.

Ultimately, moving to Crossplane is a paradigm shift. It requires letting go of the comfortable, procedural workflows of traditional infrastructure as code and embracing the chaotic, eventual consistency of Kubernetes. It has a learning curve that might make you pull your hair out initially, but once you set up your Composite Resources and your GitOps pipelines, you will never want to go back to juggling state files again.

Azure DevOps to GCP without static keys

Static service account keys have an odd domestic quality to them. They begin life as a sensible convenience and, after a few months, end up tucked into variable groups, copied into wikis, or lurking in a repository with the innocent menace of a spare house key under a flowerpot. They work, certainly. So does leaving your front door on the latch. The problem is not whether it works. The problem is how long you can keep pretending it is a good idea.

This article shows how to let Azure DevOps authenticate to Google Cloud without creating or storing a long-lived service account key. Instead, Azure DevOps presents a short-lived OIDC token, Google Cloud checks that token against a workload identity provider, and the pipeline receives temporary Google credentials only for the duration of the job.

The result is cleaner, safer, and far less likely to produce the sort of sentence nobody enjoys reading in a postmortem, namely, “we found an old credential in a place that should not have contained a credential.”

Why this setup is worth the trouble

The old pattern is familiar. You create a Google Cloud service account, download a JSON key, store it somewhere “temporary”, and then spend the next year hoping nobody has copied it into four other places. Even if the key never leaks, it still becomes one more secret to rotate, one more thing to explain to auditors, and one more awkward dependency between your pipeline and a file that should not really exist.

Workload Identity Federation replaces that with short-lived trust. Azure DevOps proves who it is at runtime. Google Cloud verifies that proof. No static key is issued, no secret needs to be rotated, and there is much less housekeeping disguised as security.

Strictly speaking, you can grant permissions directly to the federated principal in Google Cloud. In this article, I am using service account impersonation instead. It is a little easier to reason about, it fits neatly with how many teams already model CI identities, and it behaves consistently across a wide range of Google Cloud services.

What is actually happening

Under the hood, the flow is less mystical than it first appears.

Azure DevOps has a service connection that can mint an OIDC ID token for the running pipeline. Google Cloud has a workload identity pool and an OIDC provider configured to trust tokens issued by that Azure DevOps organization. When the pipeline runs, it retrieves the token, writes a small credential configuration file, and uses that file to exchange the token for temporary Google credentials. Those credentials are then used to impersonate a Google Cloud service account with the exact roles needed for the job.

If you prefer a more ordinary analogy, think of it as a reception desk in an office building. Azure DevOps arrives with a temporary visitor badge. Google Cloud checks whether the badge was issued by a reception desk it trusts, whether it belongs to the expected visitor, and whether that visitor is allowed through the next door. If all of that checks out, access is granted for a while and then expires. Nobody hands over the master keys to the building.

Preparing Azure DevOps

The Azure DevOps side is simpler than it first looks, although the menus do their best to suggest otherwise.

Create an Azure Resource Manager service connection in your Azure DevOps project and use these settings:

  • Identity type: App registration (automatic)
  • Credential: Workload identity federation
  • Scope level: Subscription

Yes, you still need to select a subscription even if your real destination is Google Cloud. It feels slightly like being asked for your train ticket while boarding a ferry, but that is the supported path.

Once the service connection is saved, note down two values from the Workload Identity federation details section:

  • Issuer
  • Subject identifier

The issuer identifies your Azure DevOps organization. The subject identifier identifies the service connection. In practice, the subject identifier follows this pattern:

sc://your-organization/your-project/your-service-connection

That detail matters because Google Cloud will ultimately trust this specific identity, not merely “some pipeline from somewhere in the general direction of Azure.”

A practical naming note is worth making here. Choose a stable, descriptive service connection name early. Renaming things later is always possible in the same way that replacing the plumbing in a bathroom is possible. The word possible is doing quite a lot of work.

Teaching Google Cloud to trust Azure DevOps

Now we move to Google Cloud, where the important trick is to trust the right thing in the right way.

Create a dedicated workload identity pool and OIDC provider. You can do this from the console, but the CLI version is easier to keep, review, and repeat.

export IDENTITY_PROJECT_ID="acme-identity-hub"
export IDENTITY_PROJECT_NUMBER="998877665544"
export POOL_ID="ado-pool"
export PROVIDER_ID="ado-oidc"
export ISSUER_URI="https://vstoken.dev.azure.com/11111111-2222-3333-4444-555555555555"

# Enable the required APIs

gcloud services enable \
  iam.googleapis.com \
  cloudresourcemanager.googleapis.com \
  iamcredentials.googleapis.com \
  sts.googleapis.com \
  --project="$IDENTITY_PROJECT_ID"

# Create the workload identity pool

gcloud iam workload-identity-pools create "$POOL_ID" \
  --project="$IDENTITY_PROJECT_ID" \
  --location="global" \
  --display-name="Azure DevOps pool" \
  --description="Federation trust for Azure DevOps pipelines"

# Create the OIDC provider

gcloud iam workload-identity-pools providers create-oidc "$PROVIDER_ID" \
  --project="$IDENTITY_PROJECT_ID" \
  --location="global" \
  --workload-identity-pool="$POOL_ID" \
  --display-name="Azure DevOps provider" \
  --issuer-uri="$ISSUER_URI" \
  --allowed-audiences="api://AzureADTokenExchange" \
  --attribute-mapping="google.subject=assertion.sub.extract('/sc/{service_connection}')"

There are two details here that are easy to get wrong.

First, the allowed audience for the provider is “api://AzureADTokenExchange”. It is not a random per-connection UUID, and it is not the audience string that later appears inside the external account credential file used by the pipeline.

Second, the attribute mapping should not map “google.subject” to “assertion.aud”. For Azure DevOps, the supported workaround for the 127-byte subject limit is to extract the service connection portion from the “sub” claim:

google.subject=assertion.sub.extract('/sc/{service_connection}')

This matters because the raw Azure DevOps subject can be too long for “google.subject”. Extracting the useful part solves the length issue neatly and still gives Google Cloud a stable subject to authorize.

You do not need an attribute condition for Azure DevOps. The issuer is already tenant-specific, which keeps this case pleasantly less dramatic than some other CI systems.

Creating the service account

Next, create the Google Cloud service account that your pipeline will impersonate.

The exact roles depend on what your pipeline needs to do. If the job only uploads artifacts to Cloud Storage, grant a storage role and stop there. If it deploys Cloud Run services, grant the Cloud Run roles it genuinely needs. This is one of those rare moments in cloud engineering where restraint is both morally admirable and operationally useful.

Here is a simple example:

export DEPLOY_PROJECT_ID="acme-observability-dev"
export SERVICE_ACCOUNT_NAME="ci-deployer"
export SERVICE_ACCOUNT_EMAIL="${SERVICE_ACCOUNT_NAME}@${DEPLOY_PROJECT_ID}.iam.gserviceaccount.com"
export FEDERATED_SUBJECT="your-organization/your-project/your-service-connection"

# Create the service account

gcloud iam service-accounts create "$SERVICE_ACCOUNT_NAME" \
  --project="$DEPLOY_PROJECT_ID" \
  --display-name="CI deployer for Azure DevOps"

# Grant only the roles your pipeline really needs

gcloud projects add-iam-policy-binding "$DEPLOY_PROJECT_ID" \
  --member="serviceAccount:${SERVICE_ACCOUNT_EMAIL}" \
  --role="roles/storage.objectAdmin"

# Allow the federated Azure DevOps identity to impersonate the service account

gcloud iam service-accounts add-iam-policy-binding "$SERVICE_ACCOUNT_EMAIL" \
  --project="$DEPLOY_PROJECT_ID" \
  --role="roles/iam.workloadIdentityUser" \
  --member="principal://iam.googleapis.com/projects/${IDENTITY_PROJECT_NUMBER}/locations/global/workloadIdentityPools/${POOL_ID}/subject/${FEDERATED_SUBJECT}"

The “FEDERATED_SUBJECT” value must match the subject produced by your attribute mapping. In plain English, that means the service connection identity that Google Cloud should trust. If the pool lives in one project and the service account lives in another, that is fine, but be careful to use the project number of the identity project in the principal URI.

Building the pipeline

Now for the part everyone actually came for.

The pipeline below uses the AzureCLI task to obtain the Azure DevOps OIDC token, stores it in a temporary file, writes an external account credential file for Google Cloud, signs in with “gcloud”, and then runs a test command.

trigger:
- main

pool:
  vmImage: 'ubuntu-latest'

variables:
  azureServiceConnection: 'gcp-federation-prod'
  gcpProjectId: 'acme-observability-dev'
  gcpProjectNumber: '998877665544'
  gcpPoolId: 'ado-pool'
  gcpProviderId: 'ado-oidc'
  gcpServiceAccount: 'ci-deployer@acme-observability-dev.iam.gserviceaccount.com'
  GOOGLE_APPLICATION_CREDENTIALS: '$(Pipeline.Workspace)/gcp-wif.json'

steps:
- checkout: self

- task: AzureCLI@2
  displayName: 'Authenticate to Google Cloud with workload identity federation'
  inputs:
    azureSubscription: '$(azureServiceConnection)'
    addSpnToEnvironment: true
    scriptType: 'bash'
    scriptLocation: 'inlineScript'
    inlineScript: |
      set -euo pipefail

      TOKEN_FILE="$(Pipeline.Workspace)/ado-token.jwt"
      printf '%s' "$idToken" > "$TOKEN_FILE"

      cat > "$GOOGLE_APPLICATION_CREDENTIALS" <<EOF
      {
        "type": "external_account",
        "audience": "//iam.googleapis.com/projects/$(gcpProjectNumber)/locations/global/workloadIdentityPools/$(gcpPoolId)/providers/$(gcpProviderId)",
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "token_url": "https://sts.googleapis.com/v1/token",
        "credential_source": {
          "file": "$TOKEN_FILE"
        },
        "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/$(gcpServiceAccount):generateAccessToken"
      }
      EOF

      gcloud auth login --cred-file="$GOOGLE_APPLICATION_CREDENTIALS" --quiet
      gcloud config set project "$(gcpProjectId)" --quiet

      echo "Authenticated as federated workload"
      gcloud storage buckets list --limit=5

A couple of details are doing more work here than they appear to be doing.

“addSpnToEnvironment: true” is essential. Without it, the task does not expose the “idToken” variable to your script. The pipeline then behaves like a very polite person who has shown up for an exam without bringing a pen.

The “audience” inside the generated JSON file is also important. This is the full resource name of the workload identity provider in Google Cloud. It is not the same thing as the allowed audience configured on the provider itself. The two values serve different purposes, which is perfectly reasonable once you know it and deeply annoying before you do.

An alternative credential file approach

If you prefer to generate the configuration file with “gcloud” rather than writing JSON inline, you can do that too:

gcloud iam workload-identity-pools create-cred-config \
  "projects/${gcpProjectNumber}/locations/global/workloadIdentityPools/${gcpPoolId}/providers/${gcpProviderId}" \
  --service-account="${gcpServiceAccount}" \
  --credential-source-file="$TOKEN_FILE" \
  --output-file="$GOOGLE_APPLICATION_CREDENTIALS"

That version is perfectly serviceable and often a little tidier if you dislike heredocs. I have shown the explicit JSON version in the main pipeline because it makes each moving part visible, which is useful while learning or troubleshooting.

Common pitfalls

There are a few places where people lose an afternoon.

The token exists, but the pipeline still fails

Make sure the AzureCLI task is using the correct service connection and that “addSpnToEnvironment” is enabled. If “$idToken” is empty, the problem is usually on the Azure DevOps side, not in Google Cloud.
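
A small guard near the top of the inline script, something like the snippet below, makes that failure loud instead of mysterious:

# Fail fast if the AzureCLI task did not expose the OIDC token
if [ -z "${idToken:-}" ]; then
  echo "idToken is empty; check the service connection name and addSpnToEnvironment" >&2
  exit 1
fi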

The principal binding looks right, but impersonation is denied

Check the project number in the principal URI. It must be the project number that owns the workload identity pool, not necessarily the project where the service account lives.

Also, check the federated subject. Because of the attribute mapping, the subject is the extracted service connection path, not the raw OIDC subject, and not a made-up shorthand invented during a stressful coffee break.
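
Both checks are quick to run with gcloud, reusing the variables defined earlier:

# What the provider actually maps into google.subject, and which issuer and audience it trusts
gcloud iam workload-identity-pools providers describe "$PROVIDER_ID" \
  --project="$IDENTITY_PROJECT_ID" \
  --location="global" \
  --workload-identity-pool="$POOL_ID"

# Which federated principals are allowed to impersonate the service account
gcloud iam service-accounts get-iam-policy "$SERVICE_ACCOUNT_EMAIL" \
  --project="$DEPLOY_PROJECT_ID"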

The pipeline freezes on an authentication prompt

Use ‘--quiet’ with ‘gcloud auth login’ and similar commands. CI jobs are many things, but conversationalists they are not.

Hosted agents are not available

If your Azure DevOps organization has not yet been granted hosted parallelism, use a self-hosted agent temporarily. In that case, make sure the machine already has ‘az’ and ‘gcloud’ installed and available on the ‘PATH’.

A minimal self-hosted pool declaration looks like this:

pool:
  name: 'Default'

On Windows, remember to switch the script type to PowerShell or PowerShell Core and adjust the environment variable syntax accordingly.

Leaving the keys behind

This setup removes one of the more tiresome habits of cross-cloud automation, namely, manufacturing a secret only to spend the rest of its natural life protecting it from yourself. Azure DevOps can obtain a short-lived token, Google Cloud can verify it, and your pipeline can impersonate a tightly scoped service account without anybody downloading a JSON key and promising to delete it later.

That is the technical benefit. The practical benefit is even nicer. Once this is in place, your pipeline starts to feel less like a cupboard full of labelled jars, some of which may or may not contain explosives, and more like a system that knows who it is, proves it when asked, and then gets on with the job.

Which, in cloud engineering, is about as close as one gets to elegance.

AWSMap for smarter AWS migrations

Most AWS migrations begin with a noble ambition and a faintly ridiculous problem.

The ambition is to modernise an estate, reduce risk, tidy the architecture, and perhaps, if fortune smiles, stop paying for three things nobody remembers creating.

The ridiculous problem is that before you can migrate anything, you must first work out what is actually there.

That sounds straightforward until you inherit an AWS account with the accumulated habits of several teams, three naming conventions, resources scattered across regions, and the sort of IAM sprawl that suggests people were granting permissions with the calm restraint of a man feeding pigeons. At that point, architecture gives way to archaeology.

I do not work for AWS, and this is not a sponsored love letter to a shiny console feature. I am an AWS and GCP architect working in the industry, and I have used AWSMap when assessing environments ahead of migration work. The reason I am writing about it is simple enough. It is one of those practical tools that solves a very real problem, and somehow remains less widely known than it deserves.

AWSMap is a third-party command-line utility that inventories AWS resources across regions and services, then lets you explore the result through HTML reports, SQL queries, and plain-English questions. In other words, it turns the early phase of a migration from endless clicking into something closer to a repeatable assessment process.

That does not make it perfect, and it certainly does not replace native AWS services. But in the awkward first hours of understanding an inherited environment, it can be remarkably useful.

The migration problem before the migration

A cloud migration plan usually looks sensible on paper. There will be discovery, analysis, target architecture, dependency mapping, sequencing, testing, cutover, and the usual brave optimism seen in project plans everywhere.

In reality, the first task is often much humbler. You are trying to answer questions that should be easy and rarely are.

What is running in this account?

Which regions are actually in use?

Are there old snapshots, orphaned EIPs, forgotten load balancers, or buckets with names that sound important enough to frighten everyone into leaving them alone?

Which workloads are genuinely active, and which are just historical luggage with a monthly invoice attached?

You can answer those questions from the AWS Management Console, of course. Given enough tabs, enough patience, and a willingness to spend part of your afternoon wandering through services you had not planned to visit, you will eventually get there. But that is not a particularly elegant way to begin a migration.

This is where AWSMap becomes handy. Instead of treating discovery as a long guided tour of the console, it treats it as a data collection exercise.

What AWSMap does well

At its core, AWSMap scans an AWS environment and produces an inventory of resources. The current public package description on PyPI describes it as covering more than 150 AWS services, while version 1.5.0 covers 140-plus services, which is a good reminder that the coverage evolves. The important point is not the exact number on a given Tuesday morning, but that it covers a broad enough slice of the estate to be genuinely useful in early assessments.

What makes the tool more interesting is what it does after the scan.

It can generate a standalone HTML report, store results locally in SQLite, let you query the inventory with SQL, run named audit queries, and translate plain-English prompts into database queries without sending your infrastructure metadata off to an LLM service. The release notes for v1.5.0 describe local SQLite storage, raw SQL querying, named queries, typo-tolerant natural language questions, tag filtering, scoped account views, and browsable examples.

That combination matters because migrations are rarely single, clean events. They are usually a series of discoveries, corrections, and mildly awkward conversations. Having the inventory preserved locally means the account does not need to be rediscovered from scratch every time someone asks a new question two days later.

The report you can actually hand to people

One of the surprisingly practical parts of AWSMap is the report output.

The tool can generate a self-contained HTML report that opens locally in a browser. That sounds almost suspiciously modest, but it is useful precisely because it is modest. You can attach it to a ticket, share it with a teammate, or open it during a workshop without building a whole reporting pipeline first. The v1.5.0 release notes describe the report as a single, standalone HTML file with filtering, search, charts, and export options.

That makes it suitable for the sort of migration meeting where someone says, “Can we quickly check whether eu-west-1 is really the only active region?” and you would rather not spend the next ten minutes performing a slow ritual through five console pages.

A simple scan might look like this:

awsmap -p client-prod

If you want to narrow the blast radius a bit and focus on a few services that often matter early in migration discovery, you could do this:

awsmap -p client-prod -s ec2,rds,elb,lambda,iam

And if the account is a thicket of shared infrastructure, tags can help reduce the noise:

awsmap -p client-prod -t Environment=Production -t Owner=platform-team

That kind of filtering is helpful when the account contains equal parts business workload and historical clutter, which is to say, most real accounts.

Why SQLite is more important than it sounds

The feature I like most is not the report. It is the local SQLite database.

Every scan can be stored locally, so the inventory becomes queryable over time instead of vanishing the moment the terminal output scrolls away. The default local database path is ‘~/.awsmap/inventory.db’, and the scan results from different runs can accumulate there for later analysis.

This changes the character of the tool quite a bit. It stops being a disposable scanner and becomes something closer to a field notebook.

Suppose you scan a client account today, then return to the same work three days later, after someone mentions an old DR region nobody had documented. Without persistence, you start from scratch. With persistence, you ask the database.

That is a much more civilised way to work.

A query for the busiest services in the collected inventory might look like this:

awsmap query "SELECT service, COUNT(*) AS total
FROM resources
GROUP BY service
ORDER BY total DESC
LIMIT 12"

And a more migration-focused query might be something like:

awsmap query "SELECT account, region, service, name
FROM resources
WHERE service IN ('ec2', 'rds', 'elb', 'lambda')
ORDER BY account, region, service, name"

Neither query is glamorous, but migrations are not built on glamour. They are built on being able to answer dull, important questions reliably.
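
And because it is plain SQLite, nothing stops you from opening the same file directly with the stock sqlite3 client, assuming the default database path and the resources table used in the queries above:

sqlite3 -header -csv ~/.awsmap/inventory.db \
  "SELECT account, service, COUNT(*) AS total
   FROM resources
   GROUP BY account, service
   ORDER BY total DESC;" > inventory-summary.csv

That sort of export is handy when the inventory needs to land in a spreadsheet for people who would rather never hear the word terminal.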

Security and hygiene checks without the acrobatics

AWSMap also includes named queries for common audit scenarios, which is useful for two reasons.

First, most people do not wake up eager to write SQL joins against IAM relationships. Second, migration assessments almost always drift into security checks sooner or later.

The public release notes describe named queries for scenarios such as admin users, public S3 buckets, unencrypted EBS volumes, unused Elastic IPs, and secrets without rotation.

That means you can move from “What exists?” to “What looks questionable?” without much ceremony.

For example:

awsmap query -n admin-users
awsmap query -n public-s3-buckets
awsmap query -n ebs-unencrypted
awsmap query -n unused-eips

Those are not, strictly speaking, migration-only questions. But they are precisely the kind of questions that surface during migration planning, especially when the destination design is meant to improve governance rather than merely relocate the furniture.

Asking questions in plain English

One of the nicer additions in the newer version is the ability to ask plain-English questions.

That is the sort of feature that normally causes one to brace for disappointment. But here the approach is intentionally local and deterministic. This functionality is a built-in parser rather than an LLM-based service, which means no API keys, no network calls to an external model, and no need to ship resource metadata somewhere mysterious.

That matters in enterprise environments where the phrase “just send the metadata to a third-party AI service” tends to receive the warm reception usually reserved for wasps.

Some examples:

awsmap ask show me lambda functions by region
awsmap ask list databases older than 180 days
awsmap ask find ec2 instances without Owner tag

Even when the exact wording varies, the basic idea is appealing. Team members who do not want to write SQL can still interrogate the inventory. That lowers the barrier for using the tool during workshops, handovers, and review sessions.

Where AWSMap fits next to AWS native services

This is the part worth stating clearly.

AWSMap is useful, but it is not a replacement for AWS Resource Explorer, AWS Config, or every other native mechanism you might use for discovery, governance, and inventory.

AWS Resource Explorer can search across supported resource types and, since 2024, can also discover all tagged AWS resources using the ‘tag:all’ operator. AWS documentation also notes an important limitation for IAM tags in Resource Explorer search.

AWS Config, meanwhile, continues to expand the resource types it can record, assess, and aggregate. AWS has announced multiple additions in 2025 and 2026 alone, which underlines that the native inventory and compliance story is still moving quickly.

So why use AWSMap at all?

Because its strengths are slightly different.

It is local.

It is quick to run.

It gives you a portable HTML report.

It stores results in SQLite for later interrogation.

It lets you query the inventory directly without setting up a broader governance platform first.

That makes it particularly handy in the early assessment phase, in consultancy-style discovery work, or in those awkward inherited environments where you need a fast baseline before deciding what the more permanent controls should be.

The weak points worth admitting

No serious article about a tool should pretend the tool has descended from heaven in perfect condition, so here are the caveats.

First, coverage breadth is not the same thing as universal depth. A tool can support a large number of services and still provide uneven detail between them. That is true of almost every inventory tool ever made.

Second, the quality of the result still depends on the credentials and permissions you use. If your access is partial, your inventory will be partial, and no amount of cheerful HTML will alter that fact.

Third, local storage is convenient, but it also means you should be disciplined about how scan outputs are handled on your machine, especially if you are working with client environments. Convenience and hygiene should remain on speaking terms.

Fourth, for organisation-wide governance, compliance history, managed rules, and native integrations, AWS services such as Config still have an obvious place. AWSMap is best seen as a sharp assessment tool, not a universal control plane.

That is not a criticism so much as a matter of proper expectations.

A practical workflow for migration discovery

If I were using AWSMap at the start of a migration assessment, the workflow would be something like this.

First, run a broad scan of the account or profile you care about.

awsmap -p client-prod

Then, if the account is noisy, refine the scope.

awsmap -p client-prod -s ec2,rds,elb,iam,route53
awsmap -p client-prod --exclude-defaults

Next, use a few named queries to surface obvious issues.

awsmap query -n public-s3-buckets
awsmap query -n secrets-no-rotation
awsmap query -n admin-users

After that, ask targeted questions in either SQL or plain English.

awsmap ask list load balancers by region
awsmap ask show databases with no backup tag
awsmap query "SELECT region, COUNT(*) AS total
FROM resources
WHERE service='ec2'
GROUP BY region
ORDER BY total DESC"

And finally, keep the HTML report and local inventory as a baseline for later design discussions.

That is where the tool earns its keep. It gives you a reasonably fast, reasonably structured picture of an estate before the migration plan turns into a debate based on memory, folklore, and screenshots in old slide decks.

When the guessing stops

There is a particular kind of misery in cloud work that comes from being asked to improve an environment before anyone has properly described it.

Tools do not eliminate that misery, but some of them reduce it to a more manageable size.

AWSMap is one of those.

It is not the only way to inventory AWS resources. It is not a substitute for native governance services. It is not magic. But it is practical, fast to understand, and surprisingly helpful when the first job in a migration is simply to stop guessing.

That alone makes it worth knowing about.

And in cloud migrations, a tool that helps replace guessing with evidence is already doing better than half the room.

Tracing the origin of S3 objects with Amazon Athena

A while back, I opened the console for one of my busier S3 buckets and there, nestling among hundreds of perfectly ordinary files, sat something called quarterly_report_20260305.pdf. It looked exactly like the mystery jar of jam you find at the back of the fridge after a family gathering: you know it wasn’t there yesterday, yet nobody owns up to putting it there. In shared buckets, this sort of thing happens all the time, and the console, bless it, only tells you the last modified date. It never whispers who the actual culprit was.

That is where Amazon Athena and S3 server access logs come in. With a few straightforward steps, you can turn those silent log files into a clear answer delivered in plain SQL. The whole process feels pleasantly like detective work you can finish while your tea is still hot, and the article you are reading now should take you no more than ten minutes from start to finish.

The tech stack

Amazon S3 is the sturdy cupboard that holds nearly everything these days, yet for all its virtues, it keeps the identity of each uploader politely hidden. To make that identity visible, you must first switch on server access logging. Once enabled, every request (upload, download, delete) is written as plain text and dropped into a bucket you choose. Logging itself is free to enable, though you do pay the modest storage cost of the log objects in the destination bucket, a small price for sanity.

Enter Amazon Athena, a serverless query service that lets you run ordinary SQL straight against those log files without moving a single byte. The first time I used it, I felt the same quiet thrill you get when you finally locate the right key in a drawer full of oddments: everything suddenly becomes possible with almost no effort.

Step 1. Ensure logging is enabled

Go to the Properties tab of your source bucket and turn on Server access logging. Point the logs at a separate bucket (I like to call mine something obvious like logs-companyname) and give them a clean prefix such as access-logs/.
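
The console works fine for this, but if you prefer something repeatable, the same switch can be flipped with the AWS CLI. The bucket names below are the invented ones used later in this article, so adjust them, and keep in mind that whatever folder layout the logs land in has to line up with the LOCATION and storage.location.template in the Athena table you are about to create.

aws s3api put-bucket-logging \
  --bucket my-source-bucket \
  --bucket-logging-status '{
    "LoggingEnabled": {
      "TargetBucket": "my-company-logs",
      "TargetPrefix": "access-logs/"
    }
  }'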

A small practical note I learned after waiting in vain one afternoon: the logs can take up to a few hours to appear, rather like a post that has gone round the long way. They do not backfill old activity, so if you need history, you must wait until new events occur. Also, check the target bucket’s policy; a missing permission once left me staring at empty folders and muttering mild British curses at the screen.

Step 2. Create the Athena table with partitioning

Now tell Athena how to read the logs. The following DDL uses partition projection, a quietly brilliant feature that means you never have to run MSCK REPAIR TABLE again. It works out the date folders automatically.

Run this in the Athena query editor (I have changed the names and paths from my real ones so nothing sensitive slips through):

CREATE DATABASE IF NOT EXISTS s3_logs_database;

CREATE EXTERNAL TABLE IF NOT EXISTS access_logs_table (
    bucketowner string,
    bucket_name string,
    requestdatetime string,
    remoteip string,
    requester string,
    requestid string,
    operation string,
    key string,
    request_uri string,
    httpstatus string,
    errorcode string,
    bytessent bigint,
    objectsize bigint,
    totaltime string,
    turnaroundtime string,
    referrer string,
    useragent string,
    versionid string,
    hostid string,
    sigv string,
    ciphersuite string,
    authtype string,
    endpoint string,
    tlsversion string,
    accesspointarn string,
    aclrequired string
)
PARTITIONED BY (`timestamp` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*)(?: ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*))?.*$'
)
LOCATION 's3://my-company-logs/access-logs/'
TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.timestamp.type' = 'date',
    'projection.timestamp.range' = '2024/01/01,NOW',
    'projection.timestamp.format' = 'yyyy/MM/dd',
    'projection.timestamp.interval' = '1',
    'projection.timestamp.interval.unit' = 'DAYS',
    'storage.location.template' = 's3://my-company-logs/access-logs/${timestamp}/'
);

Replace the LOCATION and storage.location.template with your own bucket and prefix. The regex looks like the sort of thing a medieval monk would doodle in the margin, but it is the official AWS pattern and works reliably.

Step 3. Uncover the upload details

With the table ready, finding who dropped that quarterly report is almost childishly simple. Here is the query I use (again with changed details):

SELECT 
    requestdatetime AS upload_time,
    requester,
    remoteip,
    operation,
    key AS file_path
FROM access_logs_table
WHERE timestamp = '2026/03/05'
  AND key LIKE '%quarterly_report_20260305%'
  AND operation LIKE '%PUT%'
LIMIT 10;

The requester column is the star of the show: it shows the IAM user ARN or role that performed the action. Requestdatetime gives the exact second in UTC. Operation tells you it was indeed an upload. I once traced a suspicious file only to discover it was my own weekend script behaving like an overenthusiastic puppy that had wandered off and come back with a stick.

If the requester column shows only a dash, the request was unauthenticated, which means the upload came from public access, a gentle reminder to tighten the bucket policy before the next family gathering in the cloud.

Cost saving tips with partitions

Athena charges five dollars per terabyte scanned. Without the timestamp filter, you can accidentally scan months of logs while hunting for one file, an expense roughly equivalent to buying an entire chocolate cake when you only wanted one slice. With partitioning, the query reads only the folder for that single day, and the bill usually stays under the price of a decent biscuit.

I speak from experience: one forgetful broad query once produced a polite note from AWS that made my eyebrows rise. Since then, I filter on timestamp first and sleep better at night.

Wrapping up

What began as a mild annoyance with a stray file has turned into a small daily pleasure. A few lines of SQL, a properly partitioned table, and the cloud cupboard becomes pleasantly transparent. The same technique helps with compliance checks, pipeline debugging, and the quiet satisfaction of knowing exactly what is going on in your storage.

If you have a few minutes this afternoon, turn on logging for a test bucket and give the table a try. You may find, as I did, that the small effort pays back in clarity and calm far beyond the few pennies it costs. And the next time something mysterious appears in your S3 cupboard, you will know exactly who left it there and when.

Why generic auto scaling is terrible for healthcare pipelines

Let us talk about healthcare data pipelines. Running high volume payer processing pipelines is a lot like hosting a mandatory potluck dinner for a group of deeply eccentric people with severe and conflicting dietary restrictions. Each payer behaves with maddening uniqueness. One payer bursts through the door, demanding an entire roasted pig, which they intend to consume in three minutes flat. This requires massive, short-lived computational horsepower. Another payer arrives with a single boiled pea and proceeds to chew it methodically for the next five hours, requiring a small but agonizingly persistent trickle of processing power.

On top of this culinary nightmare, there are strict rules of etiquette. You absolutely must digest the member data before you even look at the claims data. Eligibility files must be validated before anyone is allowed to touch the dessert tray of downstream jobs. The workload is not just heavy. It is incredibly uneven and delightfully complicated.

Buying folding chairs for a banquet

On paper, the managed auto scaling mechanisms from Amazon Web Services should fix this problem. They are designed to look at a growing pile of work and automatically hire more help. But applying generic auto scaling to healthcare pipelines is like a restaurant manager seeing a line out the door and solving the problem by buying fifty identical plastic folding chairs.

The manager does not care that one guest needs a high chair and another requires a reinforced steel bench. Auto scaling reacts to the generic brute force of the system load. It cannot look at a specific payer and tailor the compute shape to fit their weird eating habits. It cannot enforce the strict social hierarchy of job priorities. It scales the infrastructure, but it completely fails to scale the intention.

This is why we abandoned the generic approach and built our own dynamic EC2 provisioning system. Instead of maintaining a herd of generic servers waiting around for something to do, we create bespoke servers on demand based on a central configuration table.

The ruthless nightclub bouncer of job scheduling

Let us look at how this actually works regarding prioritization. Our system relies on that central configuration table to dictate order. Think of this table as the guest list at an obnoxiously exclusive nightclub. Our scheduler acts as the ruthless bouncer.

When jobs arrive at the queue, the bouncer checks the list. Member data? Right this way to the VIP lounge, sir. Claims data? Stand on the curb behind the velvet rope until the members are comfortably seated. Generic auto scaling has no native concept of this social hierarchy. It just sees a mob outside the club and opens the front doors wide. Our dynamic approach gives us perfect, tyrannical control over who gets processed first, ensuring our pipelines execute in a beautifully deterministic way. We spin up exactly the compute we specify, exactly when we want it.
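To make the bouncer slightly less metaphorical, here is a minimal sketch of the lookup. The job types, priorities, and instance shapes are invented for illustration, and our real configuration table lives in a database rather than a Python dict, but the logic is the same: sort the queue by the priority column before launching anything.

# Hypothetical configuration table: job names, priorities, and shapes are illustrative.
JOB_CONFIG = {
    "member_eligibility": {"priority": 1, "instance_type": "r6i.4xlarge"},
    "claims_batch":       {"priority": 2, "instance_type": "c6i.8xlarge"},
    "provider_roster":    {"priority": 3, "instance_type": "t3.large"},
}

def launch_job_instance(job, instance_type):
    # Stand-in for the provisioning call; the real thing calls ec2.run_instances.
    print(f"launching {instance_type} for {job['type']}")

def dispatch(queued_jobs):
    # Members go to the VIP lounge; claims wait behind the velvet rope.
    ordered = sorted(queued_jobs, key=lambda job: JOB_CONFIG[job["type"]]["priority"])
    for job in ordered:
        launch_job_instance(job, JOB_CONFIG[job["type"]]["instance_type"])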

Leaving your car running in the garage

Then there is the financial absurdity of warm pools. Standard auto scaling often relies on keeping a baseline of idle instances warm and ready, just in case a payer decides to drop a massive batch of files at two in the morning.

Keeping idle servers running is the technological equivalent of leaving your car engine idling in the closed garage all night just in case you get a sudden craving for a carton of milk at dawn. It is expensive, it is wasteful, and it makes you look a bit foolish when the AWS bill arrives.

Our dynamic system operates with a baseline of zero. We experience one hundred percent burst efficiency because we only pay for the exact compute we use, precisely when we use it. Cost savings happen naturally when you refuse to pay for things that are sitting around doing nothing.

A delightfully brutal server lifecycle

The operational model we ended up with is almost comically simple compared to traditional methods. A generic scaling group requires complex scaling policies, tricky cooldown periods, and endless tweaking of CloudWatch alarms. It is like managing a highly sensitive, moody teenager.

Our dynamic EC2 model is wonderfully ruthless. We create the instance and inject it with a single, highly specific purpose via a startup script. The instance wakes up, processes the healthcare data with absolute precision, and then politely self destructs so it stops billing us. They are the mayflies of the cloud computing world. They live just long enough to do their job, and then they vanish. There are no orphaned instances wandering the cloud.
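For the curious, the whole lifecycle fits in one boto3 call. This is a simplified sketch rather than our production provisioner: the AMI, bucket, and job script are placeholders, but the two details that matter are real run_instances options. The user data script gives the instance its single purpose, and InstanceInitiatedShutdownBehavior set to terminate means the shutdown at the end of the script doubles as the funeral.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Startup script: do the one job, then end it all. Paths and bucket are placeholders.
user_data = """#!/bin/bash
aws s3 cp s3://example-pipeline/jobs/claims_batch.sh /tmp/job.sh
bash /tmp/job.sh
shutdown -h now
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",                # placeholder AMI
    InstanceType="c6i.8xlarge",                     # shape chosen from the config table
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    InstanceInitiatedShutdownBehavior="terminate",  # shutdown means terminate, so no orphans
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Purpose", "Value": "claims_batch"}],
    }],
)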

This dynamic provisioning model has fundamentally altered how we digest payer workloads. We have somehow achieved a weird but perfect holy grail of cloud architecture. We get the granular flexibility of serverless functions, the raw, unadulterated horsepower of dedicated EC2 instances, and the stingy cost efficiency of a pure event-driven design.

If your processing jobs vary wildly from payer to payer, and if you care deeply about enforcing priorities without burning money on idle metal, building a disposable compute army might be exactly what your architecture is missing. We said goodbye to our idle servers, and honestly, we do not miss them at all.

The lazy cloud architect guide to AWS automation

The shortcuts I use on every project now, after learning that scale mostly changes the bill, not the mistakes.

Let me tell you how this started. I used to measure my productivity by how many AWS services I could haphazardly stitch together in a single afternoon. Big mistake.

One night, I was deploying what should have been a boring, routine feature. Nothing fancy. Just basic plumbing. Six hours later, I was still babysitting the deployment, clicking through the AWS console like a caffeinated lab rat, re-running scripts, and manually patching up tiny human errors.

That is when the epiphany hit me like a rogue server rack. I was not slow because AWS is a labyrinth of complexity. I was slow because I was doing things manually that AWS already knows how to do in its sleep.

The patterns below did not come from sanitized tutorials. They were forged in the fires of shipping systems under immense pressure and desperately wanting my weekends back.

Event-driven everything and absolutely no polling

If you are polling, you are essentially paying Jeff Bezos for the privilege of wasting your own time. Polling is the digital equivalent of sitting in the backseat of a car and constantly asking, “Are we there yet?” every five seconds.

AWS is an event machine. Treat it like one. Instead of writing cron jobs that anxiously ask the database if something changed, just let AWS tap you on the shoulder when it actually happens.

Where this shines:

  • File uploads
  • Database updates
  • Infrastructure state changes
  • Cross-account automation

Example of reacting to an S3 upload instantly:

def lambda_handler(event, context):
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']

        # Stop asking if the file is there. AWS just handed it to you.
        trigger_completely_automated_workflow(bucket_name, object_key)

No loops. No waiting. Just action.

Pro tip: Event-driven systems fail less frequently simply because they do less work. They are the lazy geniuses of the cloud world.

Immutable deployments or nothing

SSH is not a deployment strategy. It is a desperate cry for help.

If your deployment plan involves SSH, SCP, or uttering the cursed phrase “just this one quick change in production”, you do not have a system. You have a fragile ecosystem built on hope and duct tape. I stopped “fixing” servers years ago. Now, I just murder them and replace them with fresh clones.

The pattern is brutally simple:

  1. Build once
  2. Deploy new
  3. Destroy old

Example of launching a new EC2 version programmatically:

import boto3

ec2_client = boto3.client('ec2', region_name='eu-west-1')
response = ec2_client.run_instances(
    ImageId='ami-0123456789abcdef0', # Totally fake AMI
    InstanceType='t3a.nano',
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Purpose', 'Value': 'EphemeralClone'}]
    }]
)

It is like doing open-heart surgery. Instead of trying to fix the heart while the patient is running a marathon, just build a new patient with a healthy heart and disintegrate the old one. When something breaks, I do not debug the server. I debug the build process. That is where the real parasites live.

Infrastructure as code for the forgettable things

Most teams only use IaC for the big, glamorous stuff. VPCs. Kubernetes clusters. Massive databases.

This is completely backwards. It is like wearing a bespoke tuxedo but forgetting your underwear. The small, forgettable resources are the ones that will inevitably bite you when you least expect it.

What I automate with religious fervor:

  • IAM roles
  • Alarms
  • Schedules
  • Policies
  • Log retention

Example of creating a CloudWatch alarm in code:

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="QueueIsExploding",
    MetricName="ApproximateNumberOfMessagesVisible",
    Namespace="AWS/SQS",
    # Without the queue dimension the alarm watches nothing; the name is a placeholder.
    Dimensions=[{"Name": "QueueName", "Value": "my-important-queue"}],
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
    EvaluationPeriods=1,
    Period=300,
    Statistic="Sum"
)

If it matters in production, it lives in code. No exceptions.

Let Step Functions own the flow

Early in my career, I crammed all my business logic into Lambdas. Retries, branching, timeouts, bizarre edge cases. I treated them like a digital junk drawer.

I do not do that anymore. Lambdas should be as dumb and fast as a golden retriever chasing a tennis ball.

The new rule: One Lambda equals one job. If you need a workflow, use Step Functions. They are the micromanaging middle managers your architecture desperately needs.

Example of a simple workflow state:

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:DoOneThingWell",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 3,
      "MaxAttempts": 2
    }
  ],
  "Next": "CelebrateSuccess"
}

This separation makes debugging highly visual, makes retries explicit, and makes onboarding the new guy infinitely less painful. Your future self will thank you.

Kill cron jobs and use managed schedulers

Cron jobs are perfectly fine until they suddenly are not.

They are the ghosts of your infrastructure. They are completely invisible until they fail, and when they do fail, they die in absolute silence like a ninja with a sudden heart condition. AWS gives you managed scheduling. Just use it.

Why this is fundamentally faster:

  • Central visibility
  • Built-in retries
  • IAM-native permissions

Example of creating a scheduled rule:

import boto3

eventbridge = boto3.client("events")  # EventBridge still answers to its old API name

eventbridge.put_rule(
    Name="TriggerNightlyChaos",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
    Description="Wakes up the system when nobody is looking"
)

Automation should be highly observable. Cron jobs are just waiting in the dark to ruin your Tuesday.

Bake cost controls into automation

Speed without cost awareness is just a highly efficient way to bankrupt your employer. The fastest teams I have ever worked with were not just shipping fast. They were failing cheaply.

What I automate now with the ruthlessness of a debt collector:

  • Budget alerts
  • Resource TTLs
  • Auto-shutdowns for non-production environments

Example of tagging resources with an expiration date:

import boto3

ec2 = boto3.client("ec2")

ec2.create_tags(
    Resources=['i-0deadbeef12345678'],
    Tags=[
        {"Key": "TerminateAfter", "Value": "2026-12-31"},
        {"Key": "Owner", "Value": "TheVoid"}
    ]
)

Leaving resources without an owner or an expiration date is like leaving the stove on, except this stove bills you by the millisecond. Anything without a TTL is just technical debt waiting to invoice you.

A quote I live by: “Automation does not cut costs by magic. It cuts costs by quietly preventing the expensive little mistakes humans call normal.”

The death of the cloud hero

These patterns did not make me faster because they are particularly clever. They made me faster because they completely eliminated the need to make decisions.

Less clicking. Less remembering. Absolutely zero heroics.

If you want to move ten times faster on AWS, stop asking what to build next and start asking what you can stop doing by hand. Once automation is in charge, real speed shows up as work you no longer have to remember.

How we ditched AWS ELB and accidentally built a time machine

I was staring at our AWS bill at two in the morning, nursing my third cup of coffee, when I realized something that should have been obvious months earlier. We were paying more to distribute our traffic than to process it. Our Application Load Balancer, that innocent-looking service that simply forwards packets from point A to point B, was consuming $3,900 every month. That is $46,800 a year. For a traffic cop. A very expensive traffic cop that could not even handle our peak loads without breaking into a sweat.

The particularly galling part was that we had accepted this as normal. Everyone uses AWS load balancers, right? They are the standard, the default, the path of least resistance. It is like paying rent for an apartment you only use to store your shoes. Technically functional, financially absurd.

So we did what any reasonable engineering team would do at that hour. We started googling. And that is how we discovered IPVS, a technology so old that half our engineering team had not been born when it was first released. IPVS stands for IP Virtual Server, which sounds like something from a 1990s hacker movie, and honestly, that is not far off. It was written in 1998 by a fellow named Wensong Zhang, who presumably had no idea that twenty-eight years later, a group of bleary-eyed engineers would be using his code to save more than forty-six thousand dollars a year.

The expensive traffic cop

To understand why we were so eager to jettison our load balancer, you need to understand how AWS pricing works. Or rather, how it accumulates like barnacles on the hull of a ship, slowly dragging you down until you wonder why you are moving so slowly.

An Application Load Balancer costs $0.0225 per hour. That sounds reasonable, about sixteen dollars a month. But then there are LCUs, or Load Balancer Capacity Units, which charge you for every new connection, every rule evaluation, every processed byte. It is like buying a car and then discovering you have to pay extra every time you turn the steering wheel.

In practice, this meant our ALB was consuming fifteen to twenty percent of our entire infrastructure budget. Not for compute, not for storage, not for anything that actually creates value. Just for forwarding packets. It was the technological equivalent of paying a butler to hand you the remote control.

The ALB also had some architectural quirks that made us scratch our heads. It terminated TLS, which sounds helpful until you realize we were already terminating TLS at our ingress. So we were decrypting traffic, then re-encrypting it, then decrypting it again. It was like putting on a coat to go outside, then taking it off and putting on another identical coat, then finally going outside. The security theater was strong with this one.

A trip to 1999

I should confess that when we started this project, I had no idea what IPVS even stood for. I had heard it mentioned in passing by a colleague who used to work at a large Chinese tech company, where apparently everyone uses it. He described it with the kind of reverence usually reserved for vintage wine or classic cars. “It just works,” he said, which in engineering terms is the highest possible praise.

IPVS, I learned, lives inside the Linux kernel itself. Not in a container, not in a microservice, not in some cloud-managed abstraction. In the actual kernel. This means when a packet arrives at your server, the kernel looks at it, consults its internal routing table, and forwards it directly. No context switches, no user-space handoffs, no “let me ask my manager” delays. Just pure, elegant packet forwarding.

The first time I saw it in action, I felt something I had not felt in years of cloud engineering. I felt wonder. Here was code written when Bill Clinton was president, when the iPod was still three years away, when people used modems to connect to the internet. And it was outperforming a service that AWS charges thousands of dollars for. It was like discovering that your grandfather’s pocket watch keeps better time than your smartwatch.

How the magic happens

Our setup is almost embarrassingly simple. We run a DaemonSet called ipvs-router on dedicated, tiny nodes in each Availability Zone. Each pod does four things, and it does them with the kind of efficiency that makes you question everything else in your stack.

First, it claims an Elastic IP using kube-vip, a CNCF project that lets Kubernetes pods take ownership of spare EIPs. No AWS load balancer required. The pod simply announces “this IP is mine now”, and the network obliges. It feels almost rude how straightforward it is.

Second, it programs IPVS in the kernel. IPVS builds an L4 load-balancing table that forwards packets at line rate. No proxies, no user-space hops. The kernel becomes your load balancer, which is a bit like discovering your car engine can also make excellent toast. Unexpected, but delightful.

Third, it syncs with Kubernetes endpoints. A lightweight controller watches for new pods, and when one appears, IPVS adds it to the rotation in less than a hundred milliseconds. Scaling feels instantaneous because, well, it basically is.

But the real trick is the fourth thing. We use something called Direct Server Return, or DSR. Here is how it works. When a request comes in, it travels from the client to IPVS to the pod. But the response goes directly from the pod back to the client, bypassing the load balancer entirely. The load balancer never sees response traffic. That is how we get ten times the throughput. It is like having a traffic cop who only directs cars into the city but does not care how they leave.

The code that makes it work

Here is what our DaemonSet looks like. I have simplified it slightly for readability, but this is essentially what runs in our production cluster:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ipvs-router
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ipvs-router
  template:
    metadata:
      labels:
        app: ipvs-router
    spec:
      hostNetwork: true
      containers:
      - name: ipvs-router
        image: ghcr.io/kube-vip/kube-vip:v0.8.0
        args:
        - manager
        env:
        - name: vip_arp
          value: "true"
        - name: port
          value: "443"
        - name: vip_interface
          value: eth0
        - name: vip_cidr
          value: "32"
        - name: cp_enable
          value: "true"
        - name: cp_namespace
          value: kube-system
        - name: svc_enable
          value: "true"
        - name: vip_leaderelection
          value: "true"
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW

The key here is hostNetwork: true, which gives the pod direct access to the host’s network stack. Combined with the NET_ADMIN capability, this allows IPVS to manipulate the kernel’s routing tables directly. It requires a certain level of trust in your containers, but then again, so does running a load balancer in the first place.

We also use a custom controller to sync Kubernetes endpoints with IPVS. Here is the core logic:

# Simplified endpoint sync logic
import subprocess

from kubernetes import client, config

config.load_incluster_config()
k8s_client = client.CoreV1Api()

def sync_endpoints(service_name, namespace):
    # Get current endpoints from Kubernetes
    endpoints = k8s_client.list_namespaced_endpoints(
        namespace=namespace,
        field_selector=f"metadata.name={service_name}"
    )

    # Extract pod IPs
    pod_ips = []
    for subset in endpoints.items[0].subsets:
        for address in subset.addresses:
            pod_ips.append(address.ip)

    # Build IPVS rules using ipvsadm.
    # VIP is the Elastic IP this router claimed via kube-vip.
    # The -g flag enables Direct Server Return (DSR).
    for ip in pod_ips:
        subprocess.run([
            "ipvsadm", "-a", "-t",
            f"{VIP}:443", "-r", f"{ip}:443", "-g"
        ])

    return len(pod_ips)

The numbers that matter

Let me tell you about the math, because the math is almost embarrassing for AWS. Our old ALB took about five milliseconds to set up a new connection. IPVS takes less than half a millisecond. That is not an improvement. That is a different category of existence. It is the difference between walking to the shops and being teleported there.

While our ALB would start getting nervous around one hundred thousand concurrent connections, IPVS just does not. It could handle millions. The only limit is how much memory your kernel has, which in our case meant we could have hosted the entire internet circa 2003 without breaking a sweat.

In terms of throughput, our ALB topped out around 2.5 gigabits per second. IPVS saturates the 25-gigabit NIC on our c7g.medium instances. That is ten times the throughput, for those keeping score at home. The load balancer stopped being the bottleneck, which was refreshing because previously it had been like trying to fill a swimming pool through a drinking straw.

But the real kicker is the cost. Here is the breakdown. We run one c7g.medium spot instance per availability zone, three zones total. Each costs about $0.017 per hour. That is $0.051 per hour for compute. We also have three Elastic IPs at $0.005 per hour each, which is $0.015 per hour. With Direct Server Return, outbound transfer costs are effectively zero because responses bypass the load balancer entirely.

The total? A mere $0.066 per hour. That works out to roughly $0.022 per hour per zone, a little over two cents an hour for each availability zone. Let’s not call it optimization, let’s call it a financial exorcism. We went from shelling out $3,900 a month to a modest $48. The savings alone could probably afford a very capable engineer’s caffeine habit.

But what about L7 routing

At this point, you might be raising a valid objection. IPVS is dumb L4. It does not inspect HTTP headers, it does not route based on gRPC metadata, and it does not care about your carefully crafted REST API conventions. It just forwards packets based on IP and port. It is the postal worker of the networking world. Reliable, fast, and utterly indifferent to what is in the envelope.

This is where we layer in Envoy, because intelligence should live where it makes sense. Here is how the request flow works. A client connects to one of our Elastic IPs. IPVS forwards that connection to a random healthy pod. Inside that pod, an Envoy sidecar inspects the HTTP/2 headers or gRPC metadata and routes to the correct internal service.

The result is L4 performance at the edge and L7 intelligence at the pod. We get the speed of kernel-level packet forwarding combined with the flexibility of modern service mesh routing. It is like having a Formula 1 engine in a car that also has comfortable seats and a good sound system. Best of both worlds. Our Envoy configuration looks something like this:

static_resources:
  listeners:
  - name: ingress_listener
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 443
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          ""@type"": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress
          route_config:
            name: local_route
            virtual_hosts:
            - name: api
              domains:
              - ""api.ourcompany.com""
              routes:
              - match:
                  prefix: ""/v1/users""
                route:
                  cluster: user_service
              - match:
                  prefix: ""/v1/orders""
                route:
                  cluster: order_service
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              ""@type"": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

The afternoon we broke everything

I should mention that our first attempt did not go smoothly. In fact, it went so poorly that we briefly considered pretending the whole thing had never happened and going back to our expensive ALBs.

The problem was DNS. We pointed our api.ourcompany.com domain at the new Elastic IPs, and then we waited. And waited. And nothing happened. Traffic was still going to the old ALB. It turned out that our DNS provider had a TTL of one hour, which meant that even after we updated the record, most clients were still using the old IP address for, well, an hour.

But that was not the real problem. The real problem was that we had forgotten to update our health checks. Our monitoring system was still pinging the old ALB’s health endpoint, which was now returning 404s because we had deleted the target group. So our alerts were going off, our pagers were buzzing, and our on-call engineer was having what I can only describe as a difficult afternoon.

We fixed it, of course. Updated the health checks, waited for DNS to propagate, and watched as traffic slowly shifted to the new setup. But for about thirty minutes, we were flying blind, which is not a feeling I recommend to anyone who values their peace of mind.

Deploying this yourself

If you are thinking about trying this yourself, the good news is that it is surprisingly straightforward. The bad news is that you will need to know your way around Kubernetes and be comfortable with the idea of pods manipulating kernel networking tables. If that sounds terrifying, perhaps stick with your ALB. It is expensive, but it is someone else’s problem.

Here is the deployment process in a nutshell. First, deploy the DaemonSet. Then allocate some spare Elastic IPs in your subnet; the pods will auto-claim them using kube-vip. Also, ensure your worker node IAM roles have permission to reassociate Elastic IPs, or your pods will shout into the void without anyone listening. Update your DNS to point at the new IPs, using latency-based routing if you want to be fancy. Then watch as your ALB target group drains, and delete the ALB next week after you are confident everything is working.

There is one particular quirk in AWS networking that can ruin your afternoon: the source/destination check. By default, EC2 instances are configured to reject traffic that does not match their assigned IP address. Since our setup explicitly relies on handling traffic for IP addresses that the instance does not technically ‘own’ (our Virtual IPs), AWS treats this as suspicious activity and drops the packets. You must disable the source/destination check on any instance running these router pods. It is a simple checkbox in the console, but forgetting it is the difference between a working load balancer and a black hole.
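If you prefer to script the checkbox, the same flag is available through the API. A minimal sketch, assuming you already know the instance IDs of your router nodes (in practice we look them up by tag):

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Placeholder instance IDs for the nodes running ipvs-router pods.
for instance_id in ["i-0aaa1111bbbb2222c", "i-0ddd3333eeee4444f"]:
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        SourceDestCheck={"Value": False},  # let the node handle traffic for VIPs it does not own
    )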

The whole setup takes about three hours the first time, and maybe thirty minutes if you do it again. Three hours of work for $46,000 per year in savings. That is $15,000 per hour, which is not a bad rate by anyone’s standards.

What we learned about Cloud computing

Three months after we made the switch, I found myself at an AWS conference, listening to a presentation about their newest managed load balancing service. It was impressive, all machine learning and auto-scaling and intelligent routing. It was also, I calculated quietly, about four hundred times more expensive than our little IPVS setup.

I did not say anything. Some lessons are better learned the hard way. And as I sat there, sipping my overpriced conference coffee, I could not help but smile.

AWS managed services are built for speed of adoption and lowest-common-denominator use cases. They are not built for peak efficiency, extreme performance, or cost discipline. For foundational infrastructure like load balancing, a little DIY unlocks exponential gains.

The embarrassing truth is that we should have done this years ago. We were so accustomed to reaching for managed services that we never stopped to ask whether we actually needed them. It took a 2 AM coffee-fueled bill review to make us question the assumptions we had been carrying around.

Sometimes the future of cloud computing looks a lot like 1999. And honestly, that is exactly what makes it beautiful. There is something deeply satisfying about discovering that the solution to your expensive modern problem was solved decades ago by someone working on a much simpler internet, with much simpler tools, and probably much more sleep.

Wensong Zhang, wherever you are, thank you. Your code from 1998 is still making engineers happy in 2026. That is not a bad legacy for any piece of software.

The author would like to thank his patient colleagues who did not complain (much) during the DNS propagation incident, and the kube-vip maintainers who answered his increasingly desperate questions on Slack.

AWS architecture choices I would not repeat

I was holding a lukewarm Americano in my left hand and a lukewarm sense of dread in my right when the Slack notifications started arriving. It was one of those golden hour afternoons where the light hits your monitor at exactly the wrong angle, turning your screen into a mirror that reflects your own panic back at you. CloudWatch was screaming. Not the dignified beep of a minor alert, but the full banshee wail of latency charts gone vertical.

My coffee had developed that particular skin on top that lukewarm coffee gets when you have forgotten it exists. I stared at the graph. Our system, which I had personally architected with the confidence of a man who had read half a documentation page, was melting in real time. The app was not even big. We had fewer concurrent users than a mid-sized bowling league, yet there we were. Throttling errors stacked up like dirty dishes in a shared apartment kitchen. Cold starts multiplied like rabbits on a vitamin regimen. Costs were rising faster than my blood pressure, which at that moment could have powered a small turbine.

That afternoon changed how I design systems. After four years of writing Python and just enough AWS experience to be dangerous, I learned the cardinal rule. Most architectures that look elegant at small scale are just disasters wearing tuxedos. Here is how I built a Rube Goldberg machine of regret, and how I eventually stopped lighting my own infrastructure on fire.

The Godzilla Lambda and the art of overeating

At first, it felt elegant. One Lambda function to handle everything. Image resizing, email sending, report generation, user authentication, and probably the kitchen sink if I had thought to attach plumbing. One deployment. One mental model. One massive mistake.

I called it my Swiss Army knife approach. Except this particular knife weighed eighty pounds and required three weeks’ notice to open. The function had more conditional branches than a family tree in a soap opera. If the event type was ‘resize_image’, it did one thing. If it was ‘send_email’, it did another. It was essentially a diner where the chef was also the waiter, the dishwasher, and the person who had to physically restrain customers who complained about the meatloaf.

The cold starts were spectacular. My function would wake up slower than a teenager on a Monday morning after an all-night gaming session. It dragged itself into consciousness, looked around, and slowly remembered it had responsibilities. Deployments became existential gambles. Change a comma in the email formatting logic, and you risk taking down the image processing pipeline that paying customers actually cared about. Logs turned into a crime scene where every suspect had the same fingerprint.

The automation scripts I had written to manage this beast were just duct tape on top of more duct tape. They had to account for the fact that the entry point was a fragile monolith masquerading as serverless elegance.

Now I build small, single-purpose functions. Each one does exactly one thing, like a very boring but highly reliable employee. My resize handler resizes. My email handler emails. They do not mingle. They do not gossip. They do not share IAM policies at the same coffee station.

Here is the only snippet of code you need to see today, mostly because it is so short it could fit in a tweet from someone with a short attention span.

def handler(event, context):
    return process_invoice(event.get("invoice_id"))

That is it. No if statements doing interpretive dance. No switch cases having an identity crisis. If a Lambda needs more than one IAM policy, it is already too big. It is like needing two different keys to open your refrigerator. If that is the case, you have designed a refrigerator incorrectly.

Using HTTP to check the mailbox

API Gateway is powerful. It is also expensive, verbose, and absolutely overkill for workflows where no human is holding a browser. I learned this the day I decided to route every single background job through API Gateway because I valued consistency over solvency. My AWS bill arrived looking like a phone number. A long one.

I was using HTTP requests for internal automation. Let that sink in. I was essentially hiring a limousine to drive across the street to check my mailbox. Every time a background job needed to trigger another background job, it went through API Gateway. That meant authentication layers, request validation, and pricing tiers designed for enterprise traffic, all to handle my little cron job that cleaned up temporary files.

Debugging was a nightmare wrapped in an OAuth flow. I spent three hours one Tuesday trying to figure out why an internal service could not authenticate, only to realize I had designed a system where my left hand needed to show my right hand three forms of government ID just to borrow a stapler.

The fix was to remember that computers can talk to each other without pretending to be web browsers. I switched to event-driven architecture using SNS and SQS. Now my producers throw messages into a queue like dropping letters into a mailbox, and they do not care who picks them up. The consumers grab what they need when they are ready.

sns_client = boto3.client("sns")
sns_client.publish(
    TopicArn=REPORT_GENERATION_TOPIC,
    Message=json.dumps({"customer_id": "CUST-8842", "report_type": "quarterly"})
)

The producers have no idea who consumes the message. They do not need to know. It is like leaving a note on the fridge instead of calling your roommate on their cell phone every time you need to tell them the milk is sour. If humans are not calling the endpoint, it probably should not be HTTP. Save your API Gateway budget for something that actually faces the internet, like that side project you will never finish.
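The consumer side is just as short. Here is a hedged sketch of a Lambda subscribed to that topic; generate_report is a stand-in for whatever work the message actually describes.

import json

def generate_report(customer_id, report_type):
    # Stand-in for the real worker; the actual report logic lives here.
    print(f"generating {report_type} report for {customer_id}")

def handler(event, context):
    # SNS invokes the function directly; no API Gateway, no forms of ID at the stapler drawer.
    for record in event["Records"]:
        payload = json.loads(record["Sns"]["Message"])
        generate_report(payload["customer_id"], payload["report_type"])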

The Server with amnesia

This one still stings. I used to run cron jobs on EC2 instances. Backups, cleanup scripts, data pipelines, all scheduled on a server that I treated like a reliable employee instead of the forgetful intern it actually was.

It worked perfectly until the instance restarted. Which instances do. They reboot for maintenance, for updates, for mysterious AWS reasons that arrive in emails written in that particular corporate tone that suggests everything is fine while your world burns. Every time the server came back up, it had the memory of a goldfish with a head injury. Scheduled jobs vanished into the ether. Backups did not happen. Cleanup scripts sat idle while storage costs climbed.

I spent three mornings a week SSHing into instances like a nervous parent checking if a sleeping teenager is still breathing. I would type crontab -l with the same trepidation one might use when opening a credit card statement after a vacation. Is everything there? Did it forget? Is the database backup running, or am I going to explain to the CEO why our disaster recovery plan is actually just a disaster?

If your automation depends on a server staying alive, it is not automation. It is hope dressed up in a shell script.

I replaced it with EventBridge and Lambda. EventBridge does not forget. EventBridge does not take vacations. EventBridge does not require you to log in at 3 AM in your pajamas to check if it is still breathing. It triggers the function, the function does the work, and if something breaks, it either retries or sends a message to a dead letter queue where you can ignore it at your leisure during business hours.
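The wiring is three API calls, sketched here with placeholder names and ARNs: create the schedule, point it at the function, and let EventBridge invoke it.

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# All names and ARNs below are placeholders.
rule = events.put_rule(
    Name="nightly-db-backup",
    ScheduleExpression="cron(0 3 * * ? *)",  # 03:00 UTC every day, no SSH required
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-db-backup",
    Targets=[{
        "Id": "backup-function",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:nightly-db-backup",
    }],
)

# Give EventBridge permission to invoke the function.
lambda_client.add_permission(
    FunctionName="nightly-db-backup",
    StatementId="allow-eventbridge-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)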

Trusting the Database to save itself

I trusted RDS autoscaling because the documentation made it sound intelligent. Like having a butler who watches your dinner party and quietly brings more chairs when guests arrive. The reality was more like having a butler who stands in the corner watching the house catch fire, then asks if you would like a chair.

The database would hit a traffic spike. Connections would pile up like shoppers at a Black Friday doorbuster sale. The application layer would be perfectly healthy, humming along, wondering why the database was on fire. By the time RDS autoscaling decided to add capacity, the damage was done. The connection pool had already exhausted itself. Automation scripts designed to recover the situation could not even connect to run their recovery logic. It was like calling the fire department only to find out they start driving when they smell smoke, not when the alarm rings.

Now I automate predictive scaling. It is not fancy. It is just intentional. I have scripts that check expected connection loads against current capacity. If we are going to hit five hundred connections, the script starts warming up a larger instance class before we need it. It is like preheating an oven instead of shoving a turkey into a cold metal box and hoping for the best.
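The script itself is unglamorous. A simplified sketch with invented thresholds and identifiers: read recent connection counts from CloudWatch, and if the forecast crosses the line, move to the bigger instance class before the rush, not after.

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
rds = boto3.client("rds")

DB_INSTANCE = "orders-db"       # placeholder identifier
SCALE_UP_AT = 500               # connection count we never want to hit unprepared
BIG_CLASS = "db.r6g.2xlarge"    # the class we preheat

def recent_connections():
    # Average DatabaseConnections over the last fifteen minutes.
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE}],
        StartTime=now - timedelta(minutes=15),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = stats["Datapoints"]
    return max(p["Average"] for p in points) if points else 0

def preheat_if_needed(expected_peak):
    # Scale when the forecast says trouble, not when the pool is already exhausted.
    if max(expected_peak, recent_connections()) >= SCALE_UP_AT * 0.8:
        rds.modify_db_instance(
            DBInstanceIdentifier=DB_INSTANCE,
            DBInstanceClass=BIG_CLASS,
            ApplyImmediately=True,
        )

preheat_if_needed(expected_peak=650)  # e.g. from tomorrow's batch schedule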

AWS gives you primitives. Architecture is deciding when not to trust the defaults, because the defaults are designed to keep AWS running, not to keep you sane.

Reading tea leaves in a hurricane

I once thought centralized logging meant dumping everything into CloudWatch and calling it observability. This is the equivalent of shoveling all your mail into a closet and claiming you have a filing system. Technically true, practically useless.

My automation depended on parsing these logs. I wrote regex patterns that looked like ancient Sumerian curses. They would match error messages sometimes, ignore them other times, and occasionally trigger alerts on completely irrelevant noise because someone had logged the word error in a debugging statement about their lunch order.

During incidents, I would stare at these logs trying to find patterns. It was like trying to identify a specific scream in a horror movie marathon. Everything was urgent. Nothing was actionable. My scripts could not tell the difference between a critical database failure and a debug message about cache expiration. They were essentially reading entrails.

Structured logs saved my sanity. Now everything gets dumped as JSON with actual fields. Event types, durations, identifiers, all labeled and searchable. My automation can trigger follow-up jobs when specific events complete. It can detect anomalies by looking at actual numeric fields instead of trying to parse human-readable text like some kind of desperate fortune teller.

logger.info(
    "task_completed",
    extra={
        "job_type": "inventory_sync",
        "warehouse_id": "WH-15",
        "duration_ms": 1420,
        "items_processed": 847
    }
)

Logs are not for humans anymore. They are for systems. Humans should read dashboards. Systems should read logs. Confuse the two, and you end up with alerts that cry wolf at 3 AM because someone spelled success wrong.
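If you want the JSON part without adopting a logging framework, a tiny formatter on top of the standard library will do. A minimal sketch, assuming the extra fields shown above; most real setups use a library, but the principle is identical.

import json
import logging

class JsonFormatter(logging.Formatter):
    # Attributes every LogRecord has by default; anything else arrived via `extra`.
    _STANDARD = set(logging.makeLogRecord({}).__dict__)

    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        entry.update({k: v for k, v in record.__dict__.items() if k not in self._STANDARD})
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

With that in place, the task_completed call above comes out as a single JSON line a machine can actually query, instead of a sentence a human has to squint at.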

The quiet killer wearing a price tag

This is the one that really hurts. Everything worked. Latency was acceptable. Automation was smooth. The system scaled. Then the bill arrived, and I nearly spilled my coffee onto the keyboard. If cost is not part of your architecture, scale will punish you like a gym teacher who has decided you need motivation.

I had built something that scaled technically but not financially. It was like designing an airplane that flies beautifully but requires fuel that costs more than the GDP of a small nation. Every request through API Gateway, every idle EC2 waiting for a cron job that might not come, every poorly optimized Lambda running for fifteen seconds because I had not bothered to trim the dependencies, it all added up.

Now I automate cost checks. Before expensive jobs run, they estimate their impact. If the daily budget threshold approaches, the system starts making choices. It defers non-critical tasks. It sends warnings. It acts like a responsible adult at a bar when the tab starts getting too high.

def should_process_batch(estimated_cost, daily_spend):
    remaining_budget = DAILY_LIMIT - daily_spend
    return estimated_cost < (remaining_budget * 0.8)

Simple guardrails save real money. There is a saying I keep taped to my monitor now. If it scales technically but not financially, it does not scale. It is just a very efficient way to go bankrupt.

The art of rehearsed failure

Every bad decision I made had the same DNA. I optimized for speed of development. I ignored the longevity of automation. I trusted defaults because reading the full documentation seemed like work for people who had more time than I did. I treated AWS like a magic wand instead of a very powerful, very expensive tool that requires respect.

Good architecture is not about services. It is about failure modes you have already rehearsed in your head. It is about assuming you will forget what you built in six months, because you will. It is about assuming growth will happen, failure will happen, and at some point, you will be trying to debug this thing while your phone buzzes with angry messages from people who just want the system to work.

Build like you are designing a kitchen for a very forgetful, very busy chef who might be slightly drunk. Label everything. Make the dangerous stuff hard to do by accident. Keep the receipts. And for the love of all that is holy, do not put cron jobs on EC2.

GCP services DevOps engineers rely on

I have spent the better part of three years wrestling with Google Cloud Platform, and I am still not entirely convinced it wasn’t designed by a group of very clever people who occasionally enjoy a quiet laugh at the rest of us. The thing about GCP, you see, is that it works beautifully right up until the moment it doesn’t. Then it fails with such spectacular and Byzantine complexity that you find yourself questioning not just your career choices but the fundamental nature of causality itself.

My first encounter with Cloud Build was typical of this experience. I had been tasked with setting up a CI/CD pipeline for a microservices architecture, which is the modern equivalent of being told to build a Swiss watch while someone steadily drops marbles on your head. Jenkins had been our previous solution, a venerable old thing that huffed and puffed like a steam locomotive and required more maintenance than a Victorian greenhouse. Cloud Build promised to handle everything serverlessly, which is a word that sounds like it ought to mean something, but in practice simply indicates you won’t know where your code is running and you certainly won’t be able to SSH into it when things go wrong.

The miracle, when it arrived, was decidedly understated. I pushed some poorly written Go code to a repository and watched as Cloud Build sprang into life like a sleeper agent receiving instructions. It ran my tests, built a container, scanned it for vulnerabilities, and pushed it to storage. The whole process took four minutes and cost less than a cup of tea. I sat there in my home office, the triumph slowly dawning, feeling rather like a man who has accidentally trained his cat to make coffee. I had done almost nothing, yet everything had happened. This is the essential GCP magic, and it is deeply unnerving.

The vulnerability scanner is particularly wonderful in that quietly horrifying way. It examines your containers and produces a list of everything that could possibly go wrong, like a pilot’s pre-flight checklist written by a paranoid witchfinder general. On one memorable occasion, it flagged a critical vulnerability in a library I wasn’t even aware we were using. It turned out to be nested seven dependencies deep, like a Russian doll of potential misery. Fixing it required updating something else, which broke something else, which eventually led me to discover that our entire authentication layer was held together by a library last maintained in 2018 by someone who had subsequently moved to a commune in Oregon. The scanner was right, of course. It always is. It is the most anxious and accurate employee you will ever meet.

Google Kubernetes Engine or how I learned to stop worrying and love the cluster

If Cloud Build is the efficient butler, GKE is the robot overlord you find yourself oddly grateful to. My initial experience with Kubernetes was self-managed, which taught me many things, primarily that I do not have the temperament to manage Kubernetes. I spent weeks tuning etcd, debugging network overlays, and developing what I can only describe as a personal relationship with a persistent volume that refused to mount. It was less a technical exercise and more a form of digitally enhanced psychotherapy.

GKE’s Autopilot mode sidesteps all this by simply making the nodes disappear. You do not manage nodes. You do not upgrade nodes. You do not even, strictly speaking, know where the nodes are. They exist in the same conceptual space as socks that vanish from laundry cycles. You request resources, and they materialise, like summoning a very specific and obliging genie. The first time I enabled Autopilot, I felt I was cheating somehow, as if I had been given the answers to an exam I had not revised for.

The real genius is Workload Identity, a feature that allows pods to access Google services without storing secrets. Before this, secret management was a dark art involving base64 encoding and whispered incantations. We kept our API keys in Kubernetes secrets, which is rather like keeping your house keys under the doormat and hoping burglars are too polite to look there. Workload Identity removes all this by using magic, or possibly certificates, which are essentially the same thing in cloud computing. I demonstrated it to our security team, and their reaction was instructive. They smiled, which security people never do, and then they asked me to prove it was actually secure, which took another three days and several diagrams involving stick figures.

Istio integration completes the picture, though calling it integration suggests a gentle handshake when it is more like being embraced by a very enthusiastic octopus. It gives you observability, security, and traffic management at the cost of considerable complexity and a mild feeling that you have lost control of your own architecture. Our first Istio deployment doubled our pod count and introduced latency that made our application feel like it was wading through treacle. Tuning it took weeks and required someone with a master’s degree in distributed systems and the patience of a saint. When it finally worked, it was magnificent. Requests flowed like water, security policies enforced themselves with silent efficiency, and I felt like a man who had tamed a tiger through sheer persistence and a lot of treats.

Cloud Deploy and the gentle art of not breaking everything

Progressive delivery sounds like something a management consultant would propose during a particularly expensive lunch, but Cloud Deploy makes it almost sensible. The service orchestrates rollouts across environments with strategies like canary and blue-green, which are named after birds and colours because naming things is hard, and DevOps engineers have a certain whimsical desperation about them.

My first successful canary deployment felt like performing surgery on a patient who was also the anaesthetist. We routed 5 percent of traffic to the new version and watched our metrics like nervous parents at a school play. When errors spiked, I expected a frantic rollback procedure involving SSH and tarballs. Instead, I clicked a button, and everything reverted in thirty seconds. The old version simply reappeared, fully formed, like a magic trick performed by someone who actually understands magic. I walked around the office for the rest of the day with what my colleagues described as a smug grin, though I prefer to think of it as the justified expression of someone who has witnessed a minor miracle.

The integration with Cloud Build creates a pipeline so smooth it is almost suspicious. Code commits trigger builds, builds trigger deployments, deployments trigger monitoring alerts, and alerts trigger automated rollbacks. It is a closed loop, a perpetual motion machine of software delivery. I once watched this entire chain execute while I was making a sandwich. By the time I had finished my ham and pickle on rye, a critical bug had been introduced, detected, and removed from production without any human intervention. I was simultaneously impressed and vaguely concerned about my own obsolescence.

Artifact Registry where containers go to mature

Storing artifacts used to involve a self-hosted Nexus repository that required weekly sacrifices of disk space and RAM. Artifact Registry is Google’s answer to this, a fully managed service that stores Docker images, Helm charts, and language packages with the solemnity of a wine cellar for code.

The vulnerability scanning here is particularly thorough, examining every layer of your container with the obsessive attention of someone who alphabetises their spice rack. It once flagged a high-severity issue in a base image we had been using for six months. The vulnerability allowed arbitrary code execution, which is the digital equivalent of leaving your front door open with a sign saying “Free laptops inside.” We had to rebuild and redeploy forty services in two days. The scanner, naturally, had known about this all along but had been politely waiting for us to notice.

Geo-replication is another feature that seems obvious until you need it. Our New Zealand team was pulling images from a European registry, which meant every deployment involved sending gigabytes of data halfway around the world. This worked about as well as shouting instructions across a rugby field during a storm. Moving to a regional registry in New Zealand cut our deployment times by half and our egress fees by a third. It also taught me that cloud networking operates on principles that are part physics, part economics, and part black magic.

Cloud Operations Suite or how I learned to love the machine that watches me

Observability in GCP is orchestrated by the Cloud Operations Suite, formerly known as Stackdriver. The rebranding was presumably because Stackdriver sounded too much like a dating app for developers, which is a missed opportunity if you ask me.

The suite unifies logs, metrics, traces, and dashboards into a single interface that is both comprehensive and bewildering. The first time I opened Cloud Monitoring, I was presented with more graphs than a hedge fund’s annual report. CPU, memory, network throughput, disk IOPS, custom metrics, uptime checks, and SLO burn rates. It was beautiful and terrifying, like watching the inner workings of a living organism that you have created but do not fully understand.

Setting up SLOs felt like writing a promise to my future self. “I, a DevOps engineer of sound mind, do hereby commit to maintaining 99.9 percent availability.” The system then watches your service like a particularly judgmental deity and alerts you the moment you transgress. I once received a burn rate alert at 2 AM because a pod had been slightly slow for ten minutes. I lay in bed, staring at my phone, wondering whether to fix it or simply accept that perfection was unattainable and go back to sleep. I fixed it, of course. We always do.

The integration with BigQuery for long-term analysis is where things get properly clever. We export all our logs and run SQL queries to find patterns. This is essentially data archaeology, sifting through digital sediment to understand why something broke three weeks ago. I discovered that our highest error rates always occurred on Tuesdays between 2 and 3 PM. Why? A scheduled job that had been deprecated but never removed, running on a server everyone had forgotten about. Finding it felt like discovering a Roman coin in your garden, exciting but also slightly embarrassing that you hadn’t noticed it before.
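For what it is worth, the archaeology itself is only a few lines once the logs are exported. The project, dataset, and field names below are illustrative, since exported log schemas vary by sink, but the shape of the dig is always the same: group the errors, count them, and look for the Tuesday-shaped bump.

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder project, dataset, and table; adjust to wherever your sink exports logs.
query = """
SELECT
  EXTRACT(DAYOFWEEK FROM timestamp) AS day_of_week,
  EXTRACT(HOUR FROM timestamp) AS hour_of_day,
  COUNT(*) AS errors
FROM `example-project.log_archive.stderr_export`
WHERE severity = 'ERROR'
GROUP BY day_of_week, hour_of_day
ORDER BY errors DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.day_of_week, row.hour_of_day, row.errors)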

Cloud Monitoring and Logging the digital equivalent of a nervous system

Cloud Logging centralises petabytes of data from services that generate logs with the enthusiasm of a teenager documenting their lunch. Querying this data feels like using a search engine that actually works, which is disconcerting when you’re used to grep and prayer.

I once spent an afternoon tracking down a memory leak using Cloud Profiler, a service that shows you exactly where your code is being wasteful with RAM. It highlighted a function that was allocating memory like a government department allocates paper clips, with cheerful abandon and no regard for consequences. The function turned out to be logging entire database responses for debugging purposes, in production, for six months. We had archived more debug data than actual business data. The developer responsible, when confronted, simply shrugged and said it had seemed like a good idea at the time. This is the eternal DevOps tragedy. Everything seems like a good idea at the time.

Uptime checks are another small miracle. We have probes hitting our endpoints from locations around the world, like a global network of extremely polite bouncers constantly asking, “Are you open?” When Mumbai couldn’t reach our service but London could, it led us to discover a regional DNS issue that would have taken days to diagnose otherwise. The probes had saved us, and they had done so without complaining once, which is more than can be said for the on-call engineer who had to explain it to management at 6 AM.

Cloud Functions and Cloud Run, where code goes to hide

Serverless computing in GCP comes in two flavours. Cloud Functions are for small, event-driven scripts, like having a very eager intern who only works when you clap. Cloud Run is for containerised applications that scale to zero, which is an economical way of saying they disappear when nobody needs them and materialise when they do, like an introverted ghost.

I use Cloud Functions for automation tasks that would otherwise require cron jobs on a VM that someone has to maintain. One function resizes GKE clusters based on Cloud Monitoring alerts. When CPU utilisation exceeds 80 percent for five minutes, the function spins up additional nodes. When it drops below 20 percent, it scales down. This is brilliant until you realise you’ve created a feedback loop and the cluster is now oscillating between one node and one hundred nodes every ten minutes. Tuning the thresholds took longer than writing the function, which is the serverless way.
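The function itself is shorter than the incident reports it caused. Here is a heavily simplified sketch, with made-up node counts, a placeholder node pool path, and the assumption that the alert arrives as a Cloud Monitoring webhook; the real version reads its thresholds from configuration instead of hard-coding them.

from google.cloud import container_v1

gke = container_v1.ClusterManagerClient()

# Placeholder node pool path.
NODE_POOL = "projects/example-project/locations/europe-west3/clusters/prod/nodePools/default-pool"

def resize_cluster(request):
    # Cloud Monitoring webhook payload: "open" means the CPU alert fired, "closed" means it recovered.
    incident = request.get_json()["incident"]
    target_nodes = 8 if incident["state"] == "open" else 3   # invented numbers
    gke.set_node_pool_size(request={"name": NODE_POOL, "node_count": target_nodes})
    return "resized", 200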

Cloud Run hosts our internal tools, the dashboards, and debug interfaces that developers need but nobody wants to provision infrastructure for. Deploying is gloriously simple. You push a container, it runs. The cold start time is sub-second, which means Google has solved a problem that Lambda users have been complaining about for years, presumably by bargaining with physics itself. I once deployed a debugging tool during an incident response. It was live before the engineer who requested it had finished describing what they needed. Their expression was that of someone who had asked for a coffee and been given a flying saucer.

Terraform and Cloud Deployment Manager arguing with machines about infrastructure

Infrastructure as Code is the principle that you should be able to rebuild your entire environment from a text file, which is lovely in theory and slightly terrifying in practice. Terraform, using the GCP provider, is the de facto standard. It is also a source of endless frustration and occasional joy.

The state file is the heart of the problem. It is a JSON representation of your infrastructure that Terraform keeps in Cloud Storage, and it is the single source of truth until someone deletes it by accident, at which point the truth becomes rather more philosophical. We lock the state during applies, which prevents conflicts but also means that if an apply hangs, everyone is blocked. I have spent afternoons staring at a terminal, watching Terraform ponder the nature of a load balancer, like a stoned philosophy student contemplating a spoon.

Deployment Manager is Google’s native IaC tool, which uses YAML and is therefore slightly less powerful but considerably easier to read. I use it for simple projects where Terraform would be like using a sledgehammer to crack a nut, if the sledgehammer required you to understand graph theory. The two tools coexist uneasily, like cats who tolerate each other for the sake of the humans.

Drift detection is where things get properly philosophical. Terraform tells you when reality has diverged from your code, which happens more often than you’d think. Someone clicks something in the console, a service account is modified, a firewall rule is added for “just a quick test.” The plan output shows these changes like a disappointed teacher marking homework in red pen. You can either apply the correction or accept that your infrastructure has developed a life of its own and is now making decisions independently. Sometimes I let the drift stand, just to see what happens. This is how accidents become features.

IAM and Cloud Asset Inventory, the endless game of who can do what

Identity and Access Management in GCP is both comprehensive and maddening. Every API call is authenticated and authorised, which is excellent for security but means you spend half your life granting permissions to service accounts. A service account, for the uninitiated, is a machine pretending to be a person so it can ask Google for things. They are like employees who never take a holiday but also never buy you a birthday card.
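
The granting itself is mundane: read the policy, append a binding, write it back, and let the etag referee any simultaneous edits. A rough sketch with the Resource Manager client, where the project, role, and service account are all invented for the occasion:

```python
from google.cloud import resourcemanager_v3
from google.iam.v1 import policy_pb2

projects = resourcemanager_v3.ProjectsClient()
RESOURCE = "projects/my-project"  # placeholder

# Classic read-modify-write; the policy's etag makes concurrent edits fail
# loudly instead of silently overwriting each other.
policy = projects.get_iam_policy(request={"resource": RESOURCE})
policy.bindings.append(
    policy_pb2.Binding(
        role="roles/storage.objectViewer",
        members=["serviceAccount:backup-runner@my-project.iam.gserviceaccount.com"],
    )
)
projects.set_iam_policy(request={"resource": RESOURCE, "policy": policy})
```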

Workload Identity Federation allows these synthetic employees to impersonate each other across clouds, which is identity management crossed with method acting. We use it to let our AWS workloads access GCP resources, a process that feels rather like introducing two friends who are suspicious of each other and speak different languages. When it works, it is seamless. When it fails, the error messages are so cryptic they may as well be in Linear B.

Cloud Asset Inventory catalogues every resource in your organisation, which is invaluable for audits and deeply unsettling when you realise just how many things you’ve created and forgotten about. I once ran a report and discovered seventeen unused load balancers, three buckets full of logs from a project that ended in 2023, and a Cloud SQL instance that had been running for six months with no connections. The bill was modest, but the sense of waste was profound. I felt like a hoarder being confronted with their own clutter.
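
That sort of archaeology is a single search call. The sketch below asks for Cloud SQL instances and load-balancer forwarding rules across an organisation; the organisation ID is made up, and the asset types are just the ones I happened to be ashamed of.

```python
from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()

# Search the whole organisation for resources of the given types.
results = client.search_all_resources(
    request={
        "scope": "organizations/123456789012",  # placeholder org ID
        "asset_types": [
            "sqladmin.googleapis.com/Instance",
            "compute.googleapis.com/ForwardingRule",
        ],
    }
)
for resource in results:
    print(resource.asset_type, resource.location, resource.name)
```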

For European enterprises, the GDPR compliance features are critical. We export audit logs to BigQuery and run queries to prove data residency. The auditors, when they arrived, were suspicious of everything, which is their job. They asked for proof that data never left the europe-west3 region. I showed them VPC Service Controls, which are like digital border guards that shoot packets trying to cross geographical boundaries. They seemed satisfied, though one of them asked me to explain Kubernetes, and I saw his eyes glaze over in the first thirty seconds. Some concepts are simply too abstract for mortal minds.
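
The proof itself is unglamorous: a query over the exported audit logs, counting anything that claims to live outside the region. The dataset, table pattern, and column paths below are assumptions about how one particular log sink lays things out, so treat it as a shape rather than a recipe.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Hypothetical layout: an audit-log sink exporting activity logs into an
# `audit_logs` dataset, with the resource location recorded as a label.
query = """
SELECT resource.labels.location AS location, COUNT(*) AS calls
FROM `my-project.audit_logs.cloudaudit_googleapis_com_activity_*`
WHERE resource.labels.location NOT LIKE 'europe-west3%'
GROUP BY location
ORDER BY calls DESC
"""
for row in client.query(query).result():
    print(row.location, row.calls)
```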

Eventarc and Cloud Scheduler, the nervous system of the cloud

Eventarc routes events from over 100 sources to your serverless functions, creating event-driven architectures that are both elegant and impossible to debug. An event is a notification that something happened, somewhere, and now something else should happen somewhere else. It is causality at a distance, action at a remove.

I have an Eventarc trigger that fires when a vulnerability is found, sending a message to Pub/Sub, which fans out to multiple subscribers. One subscriber posts to Slack, another creates a ticket, and a third quarantines the image. It is a beautiful, asynchronous ballet that I cannot fully trace. When it fails, it fails silently, like a mime having a heart attack. The dead-letter queue catches the casualties, which I check weekly like a coroner reviewing unexplained deaths.
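
The Slack-posting subscriber is about as small as a Cloud Function gets. This sketch assumes an Eventarc Pub/Sub trigger and a webhook URL sitting in an environment variable; the shape of the vulnerability payload is whatever the publisher felt like sending that day.

```python
import base64
import json
import os
import urllib.request

import functions_framework

@functions_framework.cloud_event
def notify_slack(cloud_event):
    # Eventarc delivers the Pub/Sub envelope; the payload is base64-encoded JSON.
    envelope = cloud_event.data["message"]
    finding = json.loads(base64.b64decode(envelope["data"]).decode("utf-8"))

    # Field names here are assumptions about what the scanner publishes.
    text = f":rotating_light: {finding.get('image', 'unknown image')}: {finding.get('cve', 'no CVE given')}"
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical incoming-webhook URL
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```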

Cloud Scheduler handles cron jobs, which are the digital equivalent of remembering to take the bins out. We have schedules that scale down non-production environments at night, saving money and carbon. I once set the timezone incorrectly and scaled down the production cluster at midday. The outage lasted three minutes, but the shame lasted considerably longer. The team now calls me “the scheduler whisperer,” which is not the compliment it sounds like.
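
Creating such a job is a few lines; the part I now never skip is the explicit time zone. The job name and topic below are placeholders, a sketch rather than our actual configuration.

```python
from google.cloud import scheduler_v1

client = scheduler_v1.CloudSchedulerClient()
parent = "projects/my-project/locations/europe-west3"  # placeholder

job = scheduler_v1.Job(
    name=f"{parent}/jobs/scale-down-nonprod",
    schedule="0 20 * * 1-5",        # 8 PM on weekdays
    time_zone="Europe/Berlin",      # spell it out; the default is UTC
    pubsub_target=scheduler_v1.PubsubTarget(
        topic_name="projects/my-project/topics/scale-down",
        data=b'{"environment": "non-prod", "action": "scale-down"}',
    ),
)
client.create_job(request={"parent": parent, "job": job})
```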

The real power comes from chaining these services. A Monitoring alert triggers Eventarc, which invokes a Cloud Function, which schedules a follow-up check through Cloud Scheduler, which triggers another function to remediate. It is a Rube Goldberg machine built of code, more complex than it needs to be, but weirdly satisfying when it works. I have built systems that heal themselves, which is either the pinnacle of DevOps achievement or the first step towards Skynet. I prefer to think it is the former.

The map we all pretend to understand

Every DevOps journey, no matter how anecdotal, eventually requires what consultants call a “high-level architecture overview” and what I call a desperate attempt to comprehend the incomprehensible. During my second year on GCP, I created exactly such a diagram to explain to our CFO why we were spending $47,000 a month on something called “Cross-Regional Egress.” The CFO remained unmoved, but the diagram became my Rosetta Stone for navigating the platform’s ten core services.

I’ve reproduced it here partly because I spent three entire afternoons aligning boxes in Lucidchart, and partly because even the most narrative-driven among us occasionally needs to see the forest’s edge while wandering through the trees. Consider it the technical appendix you can safely ignore, unless you’re the poor soul actually implementing any of this.

There it is, in all its tabular glory. Five rows that represent roughly fifteen thousand hours of human effort, and at least three separate incidents involving accidentally deleted production namespaces. The arrows are neat and tidy, which is more than can be said for any actual implementation.

I keep a laminated copy taped to my monitor, not because I consult it (I have the contents memorised, along with the scars that accompany each service), but because it serves as a reminder that even the most chaotic systems can be reduced to something that looks orderly on a PowerPoint slide. The real magic lives in the gaps between those tidy boxes, where service accounts mysteriously expire, where network policies behave like quantum particles, and where the monthly bill arrives with numbers that seem generated by a random number generator with a grudge.

A modest proposal for surviving GCP

That table represents the map. What follows is the territory, with all its muddy bootprints and unexpected cliffs.

After three years, I have learned that the best DevOps engineers are not the ones with the most certificates. They are the ones who have learned to read the runes, who know which logs matter and which can be ignored, who have developed an intuitive sense for when a deployment is about to fail and can smell a misconfigured IAM binding at fifty paces. They are part sysadmin, part detective, part wizard.

The platform makes many things possible, but it does not make them easy. It is infrastructure for grown-ups, which is to say it trusts you to make expensive mistakes and learn from them. My advice is to start small, automate everything, and keep a sense of humour. You will need it the first time you accidentally delete a production bucket and discover that the undo button is marked “open a support ticket and wait.”

Store your manifests in Git and let Cloud Deploy handle the applying. Define SLOs and let the machines judge you. Tag resources for cost allocation and prepare to be horrified by the results. Replicate artifacts across regions because the internet is not as reliable as we pretend. And above all, remember that the cloud is not magic. It is simply other people’s computers running other people’s code, orchestrated by APIs that are occasionally documented and frequently misunderstood.

We build on these foundations because they let us move faster, scale further, and sleep slightly better at night, knowing that somewhere in a data centre in Belgium, a robot is watching our servers and will wake us only if things get truly interesting.

That is the theory, anyway. In practice, I still keep my phone on loud, just in case.