SRE

Your Terraform S3 backend is confused, not broken

You’ve done everything right. You wrote your Terraform config with the care of someone assembling IKEA furniture while mildly sleep-deprived. You double-checked your indentation (because yes, it matters). You even remembered to enable encryption, something your future self will thank you for while sipping margaritas on a beach far from production outages.

And then, just as you run terraform init, Terraform stares back at you like a cat that’s just been asked to fetch the newspaper.

Error: Failed to load state: NoSuchBucket: The specified bucket does not exist

But… you know the bucket exists. You saw it in the AWS console five minutes ago. You named it something sensible like company-terraform-states-prod. Or maybe you didn’t. Maybe you named it tf-bucket-please-dont-delete in a moment of vulnerability. Either way, it’s there.

So why is Terraform acting like you asked it to store your state in Narnia?

The truth is, Terraform’s S3 backend isn’t broken. It’s just spectacularly bad at telling you what’s wrong. It doesn’t throw tantrums, it just fails silently, or with error messages so vague they could double as fortune cookie advice.

Let’s decode its passive-aggressive signals together.

The backend block that pretends to listen

At the heart of remote state management lies the backend “s3” block. It looks innocent enough:

terraform {
  backend "s3" {
    bucket         = "my-team-terraform-state"
    key            = "networking/main.tfstate"
    region         = "us-west-2"
    dynamodb_table = "tf-lock-table"
    encrypt        = true
  }
}

Simple, right? But this block is like a toddler with a walkie-talkie: it only hears what it wants to hear. If one tiny detail is off (region, permissions, bucket name), it won’t say “Hey, your bucket is in Ohio but you told me it’s in Oregon.” It’ll just shrug and fail.

And because Terraform backends are loaded before variable interpolation, you can’t use variables inside this block. Yes, really. You’re stuck with hardcoded strings. It’s like being forced to write your grocery list in permanent marker.
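The standard escape hatch is partial backend configuration: leave the values you want to vary out of the block and pass them at init time. A minimal sketch, using the bucket and region from the example above as placeholders:

# Supplies the settings you left out of the (partial) backend "s3" block
terraform init \
  -backend-config="bucket=my-team-terraform-state" \
  -backend-config="region=us-west-2"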

The four ways Terraform quietly sabotages you

Over the years, I’ve learned that S3 backend errors almost always fall into one of four buckets (pun very much intended).

1. The credentials that vanished into thin air

Terraform needs AWS credentials. Not “kind of.” Not “maybe.” It needs them like a coffee machine needs beans. But it won’t tell you they’re missing, it’ll just say the bucket doesn’t exist, even if you’re looking at it in the console.

Why? Because without valid credentials, AWS returns a 403 Forbidden, and Terraform interprets that as “bucket not found” to avoid leaking information. Helpful for security. Infuriating for debugging.

Fix it: Make sure your credentials are loaded via environment variables, AWS CLI profile, or IAM roles if you’re on an EC2 instance. And no, copying your colleague’s .aws/credentials file while they’re on vacation doesn’t count as “secure.”
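A thirty-second sanity check that the credential chain Terraform will use actually resolves to someone:

# If this errors out, so will terraform init
aws sts get-caller-identity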

2. The region that lied to everyone

You created your bucket in eu-central-1. Your backend says us-east-1. Terraform tries to talk to the bucket in Virginia. The bucket, being in Frankfurt, doesn’t answer.

Result? Another “bucket not found” error. Because of course.

S3 buckets are region-locked, but the error message won’t mention regions. It assumes you already know. (Spoiler: you don’t.)

Fix it: Run this to check your bucket’s real region:

aws s3api get-bucket-location --bucket my-team-terraform-state

Then update your backend block accordingly. And maybe add a sticky note to your monitor: “Regions matter. Always.”

3. The lock table that forgot to show up

State locking with DynamoDB is one of Terraform’s best features; it stops two engineers from simultaneously destroying the same VPC like overeager toddlers with a piñata.

But if you declare a dynamodb_table in your backend and that table doesn’t exist? Terraform won’t create it for you. It’ll just fail with a cryptic message about “unable to acquire state lock.”

Fix it: Create the table manually (or with separate Terraform code). It only needs one attribute: LockID (string). And make sure your IAM user has dynamodb:GetItem, PutItem, and DeleteItem permissions on it.
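If the CLI is closer to hand than more Terraform, a minimal table like this does the job (the name matches the backend block above; billing mode is a matter of taste):

# One string attribute, LockID, used as the hash key
aws dynamodb create-table \
  --table-name tf-lock-table \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST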

Think of DynamoDB as the bouncer at a club: if it’s not there, anyone can stumble in and start redecorating.

4. The missing safety nets

Versioning and encryption aren’t strictly required, but skipping them is like driving without seatbelts because “nothing bad has happened yet.”

Without versioning, a bad terraform apply can overwrite your state forever. No undo. No recovery. Just you, your terminal, and the slow realization that you’ve deleted production.

Enable versioning:

aws s3api put-bucket-versioning \
  --bucket my-team-terraform-state \
  --versioning-configuration Status=Enabled

And always set encrypt = true. Your state file contains secrets, IDs, and the blueprint of your infrastructure. Treat it like your diary, not your shopping list.
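For good measure, you can also set default encryption on the bucket itself, so anything that lands there is encrypted even outside Terraform. A sketch using SSE-S3 (swap in aws:kms and a key if that is your policy):

aws s3api put-bucket-encryption \
  --bucket my-team-terraform-state \
  --server-side-encryption-configuration \
  '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'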

Debugging without losing your mind

When things go sideways, don’t guess. Ask Terraform nicely for more details:

TF_LOG=DEBUG terraform init

Yes, it spits out a firehose of logs. But buried in there is the actual AWS API call, and the real error code. Look for lines containing AWS request or ErrorResponse. That’s where the truth hides.

Also, never run terraform init once and assume it’s locked in. If you change your backend config, you must run:

terraform init -reconfigure

Otherwise, Terraform will keep using the old settings cached in .terraform/. It’s stubborn like that.

A few quiet rules for peaceful coexistence

After enough late-night debugging sessions, I’ve adopted a few personal commandments:

  • One project, one bucket. Don’t mix dev and prod states in the same bucket. It’s like keeping your tax documents and grocery receipts in the same shoebox: technically possible, spiritually exhausting.
  • Name your state files clearly. Use paths like prod/web.tfstate instead of final-final-v3.tfstate.
  • Never commit backend configs with real bucket names to public repos. (Yes, people still do this. No, it’s not cute.)
  • Test your backend setup in a sandbox first. A $0.02 bucket and a tiny DynamoDB table can save you a $10,000 mistake.

It’s not you, it’s the docs

Terraform’s S3 backend works beautifully, once everything aligns. The problem isn’t the tool. It’s that the error messages assume you’re psychic, and the documentation reads like it was written by someone who’s never made a mistake in their life.

But now you know its tells. The fake “bucket not found.” The silent region betrayal. The locking table that ghosts you.

Next time it acts up, don’t panic. Pour a coffee, check your region, verify your credentials, and whisper gently: “I know you’re trying your best.”

Because honestly? It is.

Playing detective with dead Kubernetes nodes

It arrives without warning, a digital tap on the shoulder that quickly turns into a full-blown alarm. Maybe you’re mid-sentence in a meeting, or maybe you’re just enjoying a rare moment of quiet. Suddenly, a shriek from your phone cuts through everything. It’s the on-call alert, flashing a single, dreaded message: NodeNotReady.

Your beautifully orchestrated city of containers, a masterpiece of modern engineering, now has a major power outage in one of its districts. One of your worker nodes, a once-diligent and productive member of the cluster, has gone completely silent. It’s not responding to calls, it’s not picking up new work, and its existing jobs are in limbo. In the world of Kubernetes, this isn’t just a technical issue; it’s a ghosting of the highest order.

Before you start questioning your life choices or sacrificing a rubber chicken to the networking gods, take a deep breath. Put on your detective’s trench coat. We have a case to solve.

First on the scene, the initial triage

Every good investigation starts by surveying the crime scene and asking the most basic question: What the heck happened here? In our world, this means a quick and clean interrogation of the Kubernetes API server. It’s time for a roll call.

kubectl get nodes -o wide

This little command is your first clue. It lines up all your nodes and points a big, accusatory finger at the one in the Not Ready state.

NAME                    STATUS     ROLES    AGE   VERSION   INTERNAL-IP      EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-master-1            Ready      master   90d   v1.28.2   10.128.0.2       34.67.123.1     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-7b5d    NotReady   <none>   45d   v1.28.2   10.128.0.5       35.190.45.6     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9
k8s-worker-node-fg9h    Ready      <none>   45d   v1.28.2   10.128.0.4       35.190.78.9     Ubuntu 22.04.1 LTS   5.15.0-78-generic   containerd://1.6.9

There’s our problem child: k8s-worker-node-7b5d. Now that we’ve identified our silent suspect, it’s time to pull it into the interrogation room for a more personal chat.

kubectl describe node k8s-worker-node-7b5d

The output of describe is where the juicy gossip lives. You’re not just looking at specs; you’re looking for a story. Scroll down to the Conditions and, most importantly, the Events section at the bottom. This is where the node often leaves a trail of breadcrumbs explaining exactly why it decided to take an unscheduled vacation.

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:45:30 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Mon, 13 Oct 2025 09:55:12 +0200   Mon, 13 Oct 2025 09:50:05 +0200   KubeletNotReady              container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Events:
  Type     Reason                   Age                  From                       Message
  ----     ------                   ----                 ----                       -------
  Normal   Starting                 25m                  kubelet                    Starting kubelet.
  Warning  ContainerRuntimeNotReady 5m12s (x120 over 25m) kubelet                    container runtime network not ready: CNI plugin reporting error: rpc error: code = Unavailable desc = connection error

Aha! Look at that. The Events log is screaming for help. A repeating warning, ContainerRuntimeNotReady, points to a CNI (Container Network Interface) plugin having a full-blown tantrum. We’ve moved from a mystery to a specific lead.
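Since the Events point straight at the CNI, a quick look at the plugin’s own pods is a cheap next step, whichever flavor you run (the grep below just covers the common plugins):

# Look for CrashLoopBackOff, or CNI pods stuck on the NotReady node
kubectl get pods -n kube-system -o wide | grep -iE 'calico|flannel|cilium|weave'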

The usual suspects, a rogues’ gallery

When a node goes quiet, the culprit is usually one of a few repeat offenders. Let’s line them up.

1. The silent saboteur, network issues

This is the most common villain. Your node might be perfectly healthy, but if it can’t talk to the control plane, it might as well be on a deserted island. Think of the control plane as the central office trying to call its remote employee (the node). If the phone line is cut, the office assumes the employee is gone. This can be caused by firewall rules blocking ports, misconfigured VPC routes, or a DNS server that’s decided to take the day off.

2. The overworked informant, the kubelet

The kubelet is the control plane’s informant on every node. It’s a tireless little agent that reports on the node’s health and carries out orders. But sometimes, this agent gets sick. It might have crashed, stalled, or be struggling with misconfigured credentials (like expired TLS certificates), leaving it unable to authenticate with the mothership. If the informant goes silent, the node is immediately marked as a person of interest.

You can check on its health directly on the node:

# SSH into the problematic node
ssh user@<node-ip>

# Check the kubelet's vital signs
systemctl status kubelet

A healthy output should say active (running). Anything else, and you’ve found a key piece of evidence.
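If it reports anything other than active (running), a restart is a perfectly respectable first move (assuming this is the node from the example above):

# Pick the agent back up and watch its vital signs
sudo systemctl restart kubelet
sudo systemctl status kubelet --no-pager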

3. The glutton, resource exhaustion

Your node has a finite amount of CPU, memory, and disk space. If a greedy application (or a swarm of them) consumes everything, the node itself can become starved. The kubelet and other critical system daemons need resources to breathe. Without them, they suffocate and stop reporting in. It’s like one person eating the entire buffet, leaving nothing for the hosts of the party.

A quick way to check for gluttons is with:

kubectl top node <your-problem-child-node-name>

If you see CPU or memory usage kissing 100%, you’ve likely found your culprit.
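If kubectl top itself is unavailable (metrics-server has a talent for disappearing exactly when you need it), the same story is visible from the node over SSH; the path below assumes a typical kubelet layout:

# Disk and memory headroom the kubelet cares about
df -h /var/lib/kubelet
free -h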

The forensic toolkit: digging deeper

If the initial triage and lineup didn’t reveal the killer, it’s time to break out the forensic tools and get our hands dirty.

Sifting through the diary with journalctl

The journalctl command is your window into the kubelet’s soul (or, more accurately, its log files). This is where it writes down its every thought, fear, and error.

# On the node, tail the kubelet's logs for clues
journalctl -u kubelet -f --since "10 minutes ago"

Look for recurring error messages, failed connection attempts, or anything that looks suspiciously out of place.

Quarantining the patient with drain

Before you start performing open-heart surgery on the node, it’s wise to evacuate the civilians. The kubectl drain command gracefully evicts all the pods from the node, allowing them to be rescheduled elsewhere.

kubectl drain k8s-worker-node-7b5d --ignore-daemonsets --delete-emptydir-data

This isolates the patient, letting you work without causing a city-wide service outage.
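Once the node is healthy again, remember to lift the quarantine, or it will sit there cordoned forever, silently judging you:

# Allow the scheduler to place pods on the node again
kubectl uncordon k8s-worker-node-7b5d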

Confirming the phone lines with curl

Don’t just trust the error messages. Verify them. From the problematic node, try to contact the API server directly. This tells you if the fundamental network path is even open.

# From the problem node, try to reach the API server endpoint
curl -k https://<api-server-ip>:<port>/healthz

If you get ok, the basic connection is fine. If it times out or gets rejected, you’ve confirmed a networking black hole.

Crime prevention: keeping your nodes out of trouble

Solving the case is satisfying, but a true detective also works to prevent future crimes.

  • Set up a neighborhood watch: Implement robust monitoring with tools like Prometheus and Grafana. Set up alerts for high resource usage, disk pressure, and node status changes. It’s better to spot a prowler before they break in.
  • Install self-healing robots: Most cloud providers (GKE, EKS, AKS) offer node auto-repair features. If a node fails its health checks, the platform will automatically attempt to repair it or replace it. Turn this on. It’s your 24/7 robotic police force.
  • Enforce city zoning laws: Use resource requests and limits on your deployments. This prevents any single application from building a resource-hogging skyscraper that blocks the sun for everyone else.
  • Schedule regular health checkups: Keep your cluster components, operating systems, and container runtimes updated. Many Not Ready mysteries are caused by long-solved bugs that you could have avoided with a simple patch.

The case is closed for now

So there you have it. The rogue node is back in line, the pods are humming along, and the city of containers is once again at peace. You can hang up your trench coat, put your feet up, and enjoy that lukewarm coffee you made three hours ago. The mystery is solved.

But let’s be honest. Debugging a Not Ready node is less like a thrilling Sherlock Holmes novel and more like trying to figure out why your toaster only toasts one side of the bread. It’s a methodical, often maddening, process of elimination. You start with grand theories of network conspiracies and end up discovering the culprit was a single, misplaced comma in a YAML file, the digital equivalent of the butler tripping over the rug.

So the next time an alert yanks you from your peaceful existence, don’t panic. Remember that you are a digital detective, a whisperer of broken machines. Your job is to patiently ask the right questions until the silent, uncooperative suspect finally confesses. After all, in the world of Kubernetes, a node is never truly dead. It’s just being dramatic and waiting for a good detective to find the clues, and maybe, just maybe, restart its kubelet. The city is safe… until the next time. And there is always a next time.

When invisible limits beat warm Lambdas

My team had a problem. Or rather, we had a cause. A noble crusade that consumed our sprints, dominated our Slack channels, and haunted our architectural diagrams. We were on a relentless witch hunt for the dreaded Lambda cold start.

We treated those extra milliseconds of spin-up time like a personal insult from Jeff Bezos himself. We became amateur meteorologists, tracking “cold start storms” across regions. We had dashboards so finely tuned they could detect the faint, quantum flutter of an EC2 instance thinking about starting up. We proudly spent over $3,000 a month on provisioned concurrency¹, a financial sacrifice to the gods of AWS to keep our functions perpetually toasty.

We had done it. Cold starts were a solved problem. We celebrated with pizza and self-congratulatory Slack messages. The system was invincible.

Or so we thought.

The 2:37 am wake-up call

It was a Tuesday, of course. The kind of quiet, unassuming Tuesday that precedes all major IT disasters. At 2:37 AM, my phone began its unholy PagerDuty screech. The alert was as simple as it was terrifying: “API timeouts.”

I stumbled to my laptop, heart pounding, expecting to see a battlefield. Instead, I saw a paradox.

The dashboards were an ocean of serene green.

  • Cold starts? 0%. Our $3,000 was working perfectly. Our Lambdas were warm, cozy, and ready for action.
  • Lambda health? 100%. Every function was executing flawlessly, not an error in sight.
  • Database queries? 100% failure rate.

It was like arriving at a restaurant to find the chefs in the kitchen, knives sharpened and stoves hot, but not a single plate of food making it to the dining room. Our Lambdas were warm, our dashboards were green, and our system was dying. It turns out that for $3,000 a month, you can keep your functions perfectly warm while they helplessly watch your database burn to the ground.

We had been playing Jenga with AWS’s invisible limits, and someone had just pulled the wrong block.

Villain one, The great network card famine

Every Lambda function that needs to talk to services within your VPC, like a database, requires a virtual network card, an Elastic Network Interface (ENI). It’s the function’s physical connection to the world. And here’s the fun part that AWS tucks away in its documentation: your account has a default, region-wide limit on these. Usually around 250.

We discovered this footnote from 2018 when the Marketing team, in a brilliant feat of uncoordinated enthusiasm, launched a flash promo.

Our traffic surged. Lambda, doing its job beautifully, began to scale. 100 concurrent executions. 200. Then 300.

The 251st request didn’t fail. Oh no, that would have been too easy. Instead, it just… waited. For fourteen seconds. It was waiting in a silent, invisible line for AWS to slowly hand-carve a new network card from the finest, artisanal silicon.

Our “optimized” system had become a lottery.

  • The winners: Got an existing ENI and a zippy 200ms response.
  • The losers: Waited 14,000ms for a network card to materialize out of thin air, causing their request to time out.

The worst part? This doesn’t show up as a Lambda error. It just looks like your code is suddenly, inexplicably slow. We were hunting for a bug in our application, but the culprit was a bureaucrat in the AWS networking department.

Do this right now. Seriously. Open a terminal and check your limit. Don’t worry, we’ll wait.

# This command reveals the 'Maximum network interfaces per Region' quota.
# You might be surprised at what you find.
aws service-quotas get-service-quota \
  --service-code vpc \
  --quota-code L-F678F1CE
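While you are in there, it is worth a rough count of how many ENIs the account is already using in the region:

# Roughly how much of that quota is already spoken for
aws ec2 describe-network-interfaces --query 'length(NetworkInterfaces)' --output text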

Villain two, The RDS proxy’s velvet rope policy

Having identified the ENI famine, we thought we were geniuses. But fixing that only revealed the next layer of our self-inflicted disaster. Our Lambdas could now get network cards, but they were all arriving at the database party at once, only to be stopped at the door.

We were using RDS Proxy, the service AWS sells as the bouncer for your database, managing connections so your Aurora instance doesn’t get overwhelmed. What we failed to appreciate is that this bouncer has its own… peculiar rules. The proxy itself has CPU limits. When hundreds of Lambdas tried to get a connection simultaneously, the proxy’s CPU spiked to 100%.

It didn’t crash. It just became incredibly, maddeningly slow. It was like a nightclub bouncer enforcing a strict one-in, one-out policy, not because the club was full, but because he could only move his arms so fast. The queue of connections grew longer and longer, each one timing out, while the database inside sat mostly idle, wondering where everybody went.

The humbling road to recovery

The fixes weren’t complex, but they were humbling. They forced us to admit that our beautiful, perfectly-tuned relational database architecture was, for some tasks, the wrong tool for the job.

  1. The great VPC escape
    For any Lambda that only needed to talk to public AWS services like S3 or SQS, we ripped it out of the VPC. This is Lambda 101, but we had put everything in the VPC for “security.” Moving them out meant they no longer needed an ENI to function. We implemented VPC Endpoints², allowing these functions to access AWS services over a private link without the ENI overhead.
  2. RDS proxy triage
    For the databases we couldn’t escape, we treated the proxy like the delicate, overworked bouncer it was. We massively over-provisioned the proxy instances, giving them far more CPU than they should ever need. We also implemented client-side jitter, a small, random delay before retrying a connection, to stop our Lambdas from acting like a synchronized mob storming the gates.
  3. The nuclear option, DynamoDB
    For one critical, high-throughput service, we did the unthinkable. We migrated it from Aurora to DynamoDB. The hardest part wasn’t the code; it was the ego. It was admitting that the problem didn’t require a Swiss Army knife when all we needed was a hammer. The team’s reaction after the migration was telling: “Wait… you mean we don’t need to worry about connection pooling at all?” Every developer, after their first taste of NoSQL freedom.

The real lesson we learned

Obsessing over cold starts is like meticulously polishing the chrome on your car’s engine while the highway you’re on is crumbling into a sinkhole. It’s a visible, satisfying metric to chase, but it often distracts from the invisible, systemic limits that will actually kill you.

Yes, optimize your cold starts. Shave off those milliseconds. But only after you’ve pressure-tested your system for the real bottlenecks. The unsexy ones. The ones buried in AWS service quota pages and 5-year-old forum posts.

Stop micro-optimizing the 50ms you can see and start planning for the 14-second delays you can’t. We learned that the hard way, at 2:37 AM on a Tuesday.

¹ The official term for ‘setting a pile of money on fire to keep your functions toasty’.

² A fancy AWS term for ‘a private, secret tunnel to an AWS service so your Lambda doesn’t have to go out into the scary public internet’. It’s like an employee-only hallway in a giant mall.

What replaces Transit Gateway on Google Cloud

Spoiler: There is no single magic box. There is a tidy drawer of parts that click together so cleanly you stop missing the box.

The first time I asked a team to set up “Transit Gateway on Google Cloud,” I received the sort of polite silence you reserve for relatives who ask where you keep the fax machine. On AWS, you reach for Transit Gateway and call it a day. On Azure, you reach for Virtual WAN and its Virtual Hubs. On Google Cloud, you reach for… a shorter shopping list: one global VPC, Network Connectivity Center with VPC spokes when you need a hub, VPC Peering when you do not, Private Service Connect for producer‑consumer traffic, and Cloud Router to keep routes honest.

Once you stop searching for a product name and start wiring the right parts, transit on Google Cloud turns out to be pleasantly boring.

The short answer

  • Inter‑VPC at scale → Network Connectivity Center (NCC) with VPC spokes
  • One‑to‑one VPC connectivity → VPC Peering (non‑transitive)
  • Private access to managed or third‑party services → Private Service Connect (PSC)
  • Hybrid connectivity → Cloud Router + HA VPN or Interconnect with dynamic routing mode set to Global

That’s the toolkit most teams actually need. The rest of this piece is simply: where each part shines, where it bites, and how to string them together without leaving teeth marks.

How do the other clouds solve it?

  • AWS: VPCs are regional. Transit Gateway acts as the hub; if you span regions, you peer TGWs. It is a well‑lit path and a single product name.
  • Azure: VNets are regional. Virtual WAN gives you a global fabric with per‑region Virtual Hubs, optionally “secured” with an integrated firewall.
  • Google Cloud: a VPC is global (routing table and firewalls are global, subnets remain regional). You do not need a separate “global transit” box to make two instances in different regions talk. When you outgrow simple, add NCC with VPC spokes for hub‑and‑spoke, PSC for services, and Cloud Router for dynamic routing.

Different philosophies, same goal. Google Cloud leans into a global network and small, specialized parts.

What a global VPC really means

A Google Cloud VPC gives you a global control plane. You define routes and firewall rules once, and they apply across regions; you place subnets per region where compute lives. That split is why multi‑region feels natural on GCP without an extra transit layer. Not everything is magic, though:

  • Cloud Router, VPN, and Interconnect are regional attachments. You can and often should set dynamic routing mode to Global so learned routes propagate across the VPC, but the physical attachment still sits in a region.
  • Global does not mean chaotic. IAM, firewall rules, hierarchical policies, and VPC Service Controls provide the guardrails you actually want.

Choosing the right part

Network Connectivity Center with VPC spokes

Use it when you have many VPCs and want managed transit without building a mesh of N×N peerings. NCC gives you a hub‑and‑spoke model where spokes exchange routes through the hub, including hybrid spokes via Cloud Router. Think “default” once your VPC count creeps into the double digits.

Use when you need inter‑VPC transit at scale, clear centralization, and easy route propagation.

Avoid when you have only two or three VPCs that will never grow. Simpler is nicer.

VPC Peering

Use it for simple 1:1 connectivity. It is non‑transitive by design. If A peers with B and B peers with C, A does not automatically reach C. This is not a bug; it is a guardrail. If you catch yourself drawing triangles, take the hint and move to NCC.

Use when two VPCs need to talk, and that’s the end of the story.

Avoid when you need full‑mesh or centralized inspection.
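For the simple case, the wiring really is two commands, one from each side. A sketch with placeholder project and network names:

# From the app project: peer app-vpc with data-vpc in another project
gcloud compute networks peerings create app-to-data \
  --network=app-vpc \
  --peer-project=data-project \
  --peer-network=data-vpc
# Repeat from the data project with the roles reversed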

Private Service Connect

Use it when a consumer VPC needs private access to a producer (managed Google service like Cloud SQL, or a third‑party/SaaS running behind a producer endpoint). PSC is not inter‑VPC transit; it is producer‑consumer plumbing with private IPs and tight control.

Use when you want “just the sauce” from a service without crossing the public internet.

Avoid when you are trying to stitch two application VPCs together. That is a job for NCC or peering.

Cloud Router with HA VPN or Interconnect

Use it for hybrid. Cloud Router speaks BGP and exchanges routes dynamically with your on‑prem or colo edge. Set dynamic routing to Global so routes learned in one region are known across the VPC. Remember that the attachments are regional; plan for redundancy per region.

Use when you want fewer static routes and less drift between environments.

Avoid when you expected a single global attachment. That is not how physics—or regions—work.
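The routing-mode switch itself is a one-liner; the network name here is a placeholder:

# Routes learned by any Cloud Router now propagate to every region of this VPC
gcloud compute networks update my-shared-vpc --bgp-routing-mode=global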

Three quick patterns

Multi‑region application in one VPC

One global VPC, regional subnets in us‑east1, europe‑west1, and asia‑east1. Instances talk across regions without extra kit. If the app grows into multiple VPCs per domain (core, data, edge), bring in NCC as the hub.

Mergers and acquisitions without a month of rewiring

Projects in Google Cloud are movable between folders and even organizations, subject to permissions and policy guardrails. That turns “lift and splice” into a routine operation rather than a quarter‑long saga. Be upfront about prerequisites: billing, liens, org policy, and compliance can slow a move; plan them, do not hand‑wave them.

Shared services with clean tenancy

Run shared services in a host project via Shared VPC. Attach service projects for each team. For an external partner, use VPC Peering or PSC, depending on whether they need network adjacency or just a service endpoint. If many internal VPCs need those shared bits, let NCC be the meeting place.

ASCII sketch of the hub
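Something like this, with VPC spokes meeting at the NCC hub and the hybrid world arriving via Cloud Router:

                  on-prem / colo
                        |
           Cloud Router + HA VPN / Interconnect
                        |
   core-vpc ------- [ NCC hub ] ------- data-vpc
                        |
                     edge-vpc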

Pitfalls you can dodge

  • Expecting peering to be transitive. It is not. If your diagram starts to look like spaghetti, stop and bring in NCC.
  • Treating Cloud Router as global. It is regional. The routing mode can be Global; the attachment is not. Plan per‑region redundancy.
  • Using PSC as inter‑VPC glue. PSC is for producer‑consumer privacy, not general transit.
  • Forgetting DNS. Cross‑project and cross‑VPC name resolution needs deliberate configuration. Decide where you publish private zones and who can see them.
  • Over‑centralizing inspection. The global VPC makes central stacks attractive, but latency budgets are still a thing. Place controls where the traffic lives.

Security that scales with freedom

A global VPC does not mean a free‑for‑all. The security model leans on identity and context rather than IP folklore.

  • IAM everywhere for least privilege and clear ownership.
  • VPC firewall rules with hierarchical policy for the sharp edges.
  • VPC Service Controls for data perimeter around managed services.
  • Cloud Armor and load balancers at the edge, where they belong.

The result is a network that is permissive where it should be and stubborn where it must be.

A tiny buying guide for your brain

  • Two VPCs, done in a week → VPC Peering
  • Ten VPCs, many teams, add partners next quarter → NCC with VPC spokes
  • Just need private access to Cloud SQL or third‑party services → PSC
  • Datacenter plus cloud, please keep routing sane → Cloud Router with HA VPN or Interconnect, dynamic routing Global

If you pick the smallest thing that works today and the most boring thing that still works next year, you will almost always land on the right square.

Where the magic isn’t

Transit Gateway is a great product name. It just happens to be the wrong shopping query on Google Cloud. You are not assembling a monolith; you are pulling the right pieces from a drawer that has been neatly labeled for years. NCC connects the dots, Peering keeps simple things simple, PSC keeps services private, and Cloud Router shakes hands with the rest of your world. None of them is glamorous. All of them are boring in the way electricity is boring when it works.

If you insist on a single giant box, you will end up using it as a hammer. Google Cloud encourages a tidier vice: choose the smallest thing that does the job, then let the global VPC and dynamic routing do the quiet heavy lifting. Need many VPCs to talk without spaghetti? NCC with spokes. Need two VPCs and a quiet life? Peering. Need only the sauce from Cloud SQL or a partner? PSC. Need the campus to meet the cloud without sticky notes of static routes? Cloud Router with HA VPN or Interconnect. Label the bag, not every screw.

The punchline is disappointingly practical. When teams stop hunting for a product name, they start shipping features. Incidents fall in number and in temperature. The network diagram loses its baroque flourishes and starts looking like something you could explain before your coffee cools.

So yes, keep admiring Transit Gateway as a name. Then close the tab and open the drawer you already own. Put the parts back in the same place when you are done, teach the interns what each one is for, and get back to building the thing your users actually came for. The box you were searching for was never the point; the drawer is how you move faster without surprises.

The mutability mirage in Cloud

We’ve all been there. A DevOps engineer squints at a script, muttering, “But I changed it, it has to be mutable.” Meanwhile, the cloud infrastructure blinks back, unimpressed, as if to say, “Sure, you swapped the sign. That doesn’t make the building mutable.”

This isn’t just a coding quirk. It’s a full-blown identity crisis in the world of cloud architecture and DevOps, where confusing reassignment with mutability can lead to anything from baffling bugs to midnight firefighting sessions. Let’s dissect why your variables are lying to you, and why it matters more than you think.

The myth of the mutable variable

Picture this: You’re editing a configuration file for a cloud service. You tweak a value, redeploy, and poof, it works. Naturally, you assume the system is mutable. But what if it isn’t? What if the platform quietly discarded your old configuration and spun up a new one, like a magician swapping a rabbit for a hat?

This is the heart of the confusion. In programming, mutability isn’t about whether something changes; it’s about how it changes. A mutable object alters its state in place, like a chameleon shifting colors. An immutable one? It’s a one-hit wonder: once created, it’s set in stone. Any “change” is just a new object in disguise.

What mutability really means

Let’s cut through the jargon. A mutable object, say, a Python list, lets you tweak its contents without breaking a sweat. Add an item, remove another, and it’s still the same list. Check its memory address with id(), and it stays consistent.

Now take a string. Try to “modify” it:

greeting = "Hello"  
greeting += " world"

Looks like a mutation, right? Wrong. The original greeting is gone, replaced by a new string. The memory address? Different. The variable name greeting is just a placeholder, now pointing to a new object, like a GPS rerouting you to a different street.

This isn’t pedantry. It’s the difference between adjusting the engine of a moving car and replacing the entire car because you wanted a different color.

The great swap

Why does this illusion persist? Because programming languages love to hide the smoke and mirrors. In functional programming, for instance, operations like map() or filter() return new collections, never altering the original. Yet the syntax, data = transform(data), feels like mutation.

Even cloud infrastructure plays this game. Consider immutable server deployments: you don’t “update” an AWS EC2 instance. You bake a new AMI and replace the old one. The outcome is change, but the mechanism is substitution. Confusing the two leads to chaos, like assuming you can repaint a house without leaving the living room.

The illusion of change

Here’s where things get sneaky. When you write:

counter = 5  
counter += 1  

You’re not mutating the number 5. You’re discarding it for a shiny new 6. The variable counter is merely a label, not the object itself. It’s like renaming a book after you’ve already read it: The Great Gatsby didn’t change; you just called it The Even Greater Gatsby and handed it to someone else.

This trickery is baked into language design. Python’s tuples are immutable, but you can reassign the variable holding them. Java’s String class is famously unyielding, yet developers swear they “changed” it daily. The culprit? Syntax that masks object creation as modification.

Why cloud and DevOps care

In cloud architecture, this distinction is a big deal. Mutable infrastructure, like manually updating a server, invites inconsistency and “works on my machine” disasters. Immutable infrastructure, by contrast, treats servers as disposable artifacts. Changes mean new deployments, not tweaks.

This isn’t just trendy. It’s survival. Imagine two teams modifying a shared configuration. If the object is mutable, chaos ensues: race conditions, broken dependencies, the works. If it’s immutable, each change spawns a new, predictable version. No guessing. No debugging at 3 a.m.

Performance matters too. Creating new objects has overhead, yes, but in distributed systems, the trade-off for reliability is often worth it. As the old adage goes: “You can optimize for speed or sanity. Pick one.”

How not to fall for the trick

So how do you avoid this trap?

  1. Check the documentation. Is the type labeled mutable? If it’s a string, tuple, or frozenset, assume it’s playing hard to get.
  2. Test identity. In Python, use id(). In Java, compare references. If the address changes, you’ve been duped (see the quick check just after this list).
  3. Prefer immutability for shared data. Your future self will thank you when the system doesn’t collapse under concurrent edits.
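Here is that identity test as a pair of shell one-liners, no project scaffolding required:

# A list keeps its id() after mutation; a "changed" string gets a brand-new one
python3 -c 'xs = [1, 2]; print(id(xs)); xs.append(3); print(id(xs))'
python3 -c 's = "Hello"; print(id(s)); s += " world"; print(id(s))'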

And if all else fails, ask: “Did I alter the object, or did I just point to a new one?” If the answer isn’t obvious, grab a coffee. You’ll need it.

The cloud doesn’t change, it blinks

Let’s be brutally honest: in the cloud, assuming something is mutable because it changes is like assuming your toaster is self-repairing because the bread pops up different shades of brown. You tweak a Kubernetes config, redeploy, and poof, it’s “updated.” But did you mutate the cluster or merely summon a new one from the void? In the world of DevOps, this confusion isn’t just a coding quirk; it’s the difference between a smooth midnight rollout and a 3 a.m. incident war room where your coffee tastes like regret.

Cloud infrastructure doesn’t change; it reincarnates. When you “modify” an AWS Lambda function, you’re not editing a living organism. You’re cremating the old version and baptizing a new one in S3. The same goes for Terraform state files or Docker images: what looks like a tweak is a full-scale resurrection. Mutable configurations? They’re the digital equivalent of duct-taping a rocket mid-flight. Immutable ones? They’re the reason your team isn’t debugging why the production database now speaks in hieroglyphics.

And let’s talk about the real villain: configuration drift. It’s the gremlin that creeps into mutable systems when no one’s looking. One engineer tweaks a server, another “fixes” a firewall rule, and suddenly your cloud environment has the personality of a broken vending machine. Immutable infrastructure laughs at this. It’s the no-nonsense librarian who will replace the entire catalog if you so much as sneeze near the Dewey Decimal System.

So the next time a colleague insists, “But I changed it!” with the fervor of a street magician, lean in and whisper: “Ah, yes. Just like how I ‘changed’ my car by replacing it with a new one. Did you mutate the object, or did you just sacrifice it to the cloud gods?” Then watch their face, the same bewildered blink as your AWS console when you accidentally set min_instances = 0 on a critical service.

The cloud doesn’t get frustrated. It doesn’t sigh. It blinks. Once. Slowly. And in that silent judgment, you’ll finally grasp the truth: change is inevitable. Mutability is a choice. Choose wisely, or spend eternity debugging the ghost of a server that thought it was mutable.

(And for the love of all things scalable: stop naming your variables temp.)

Parenting your Kubernetes using hierarchical namespaces

Let’s be honest. Your Kubernetes cluster, on its bad days, feels less like a sleek, futuristic platform and more like a chaotic shared apartment right after college. The frontend team is “borrowing” CPU from the backend team, the analytics project left its sensitive data lying around in a public bucket, and nobody knows who finished the last of the memory reserves.

You tried to bring order. You dutifully handed out digital rooms to each team using namespaces. For a while, there was peace. But then those teams had their own little sub-projects, staging, testing, that weird experimental feature no one talks about, and your once-flat world devolved into a sprawling city with no zoning laws. The shenanigans continued, just inside slightly smaller boxes.

What you need isn’t more rules scribbled on a whiteboard. You need a family tree. It’s time to introduce some much-needed parental supervision into your cluster. It’s time for Hierarchical Namespaces.

The origin of the namespace rebellion

In the beginning, Kubernetes gave us namespaces, and they were good. The goal was simple: create virtual walls to stop teams from stealing each other’s lunch (metaphorically speaking, of course). Each namespace was its own isolated island, a sovereign nation with its own rules. This “flat earth” model worked beautifully… until it didn’t.

As organizations scaled, their clusters turned into bustling archipelagos of hundreds of namespaces. Managing them felt like being an air traffic controller for a fleet of paper airplanes in a hurricane. Teams realized that a flat structure was basically a free-for-all party where every guest could raid the fridge, as long as they stayed in their designated room. There was no easy way to apply a single rule, like a network policy or a set of permissions, to a group of related namespaces. The result was a maddening copy-paste-a-thon of YAML files, a breeding ground for configuration drift and human error.

The community needed a way to group these islands, to draw continents. And so, the Hierarchical Namespace Controller (HNC) was born, bringing a simple, powerful concept to the table: namespaces can have parents.

What this parenting gig gets you

Adopting a hierarchical structure isn’t just about satisfying your inner control freak. It comes with some genuinely fantastic perks that make cluster management feel less like herding cats.

  • The “Because I said so” principle: This is the magic of policy inheritance. Any Role, RoleBinding, or NetworkPolicy you apply to a parent namespace automatically cascades down to all its children and their children, and so on. It’s the parenting dream: set a rule once, and watch it magically apply to everyone. No more duplicating RBAC roles for the dev, staging, and testing environments of the same application.
  • The family budget: You can set a resource quota on a parent namespace, and it becomes the total budget for that entire branch of the family tree. For instance, team-alpha gets 100 CPU cores in total. Their dev and qa children can squabble over that allowance, but together, they can’t exceed it. It’s like giving your kids a shared credit card instead of a blank check.
  • Delegated authority: You can make a developer an admin of a “team” namespace. Thanks to inheritance, they automatically become an admin of all the sub-namespaces under it. They get the freedom to manage their own little kingdoms (staging, testing, feature-x) without needing to ping a cluster-admin for every little thing. You’re teaching them responsibility (while keeping the master keys to the kingdom, of course).

Let’s wrangle some namespaces

Convinced? I thought so. The good news is that bringing this parental authority to your cluster isn’t just a fantasy. Let’s roll up our sleeves and see how it works.

Step 0: Install the enforcer

Before we can start laying down the law, we need to invite the enforcer. The Hierarchical Namespace Controller (HNC) doesn’t come built-in with Kubernetes. You have to install it first.

You can typically install the latest version with a single kubectl command:

kubectl apply -f https://github.com/kubernetes-sigs/hierarchical-namespaces/releases/latest/download/hnc-manager.yaml

Wait a minute for the controller to be up and running in its own hnc-system namespace. Once it’s ready, you’ll have a new superpower: the kubectl hns plugin.

Step 1: Create the parent namespace

First, let’s create a top-level namespace for a project. We’ll call it project-phoenix. This will be our proud parent.

kubectl create namespace project-phoenix

Step 2: Create some children

Now, let’s give project-phoenix a couple of children: staging and testing. Wait, what’s that hns command below? That’s not your standard kubectl. That’s the magic wand the HNC just gave you. You’re telling it to create each namespace and neatly tuck it under its parent.

kubectl hns create staging -n project-phoenix
kubectl hns create testing -n project-phoenix

Step 3: Admire your family tree

To see your beautiful new hierarchy in all its glory, you can ask HNC to draw you a picture.

kubectl hns tree project-phoenix

You’ll get a satisfyingly clean ASCII art diagram of your new family structure:
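project-phoenix
├── [s] staging
└── [s] testing

(The [s] marks namespaces that HNC created as subnamespaces.)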

You can even create grandchildren. Let’s give the staging namespace its own child for a specific feature branch.

kubectl hns create feature-login-v2 -n staging
kubectl hns tree project-phoenix

And now your tree looks even more impressive:

Step 4: Witness the magic of inheritance

Let’s prove that this isn’t all smoke and mirrors. We’ll create a Role in the parent namespace that allows viewing Pods.

# viewer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-viewer
  namespace: project-phoenix
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Apply it:

kubectl apply -f viewer-role.yaml

Now, let’s give a user, let’s call her jane.doe, that role in the parent namespace.

kubectl create rolebinding jane-viewer --role=pod-viewer --user=jane.doe -n project-phoenix

Here’s the kicker. Even though we only granted Jane permission in project-phoenix, she can now magically view pods in the staging and feature-login-v2 namespaces as well.

# This command would work for Jane!
kubectl auth can-i get pods -n staging --as=jane.doe
# YES

# And even in the grandchild namespace!
kubectl auth can-i get pods -n feature-login-v2 --as=jane.doe
# YES

No copy-pasting required. The HNC saw the binding in the parent and automatically propagated it down the entire tree. That’s the power of parenting.

A word of caution from a fellow parent

As with real parenting, this new power comes with its own set of challenges. It’s not a silver bullet, and you should be aware of a few things before you go building a ten-level deep namespace dynasty.

  • Complexity can creep in: A deep, sprawling tree of namespaces can become its own kind of nightmare to debug. Who has access to what? Which quota is affecting this pod? Keep your hierarchy logical and as flat as you can get away with. Just because you can create a great-great-great-grandchild namespace doesn’t mean you should.
  • Performance is not free: The HNC is incredibly efficient, but propagating policies across thousands of namespaces does have a cost. For most clusters, it’s negligible. For mega-clusters, it’s something to monitor.
  • Not everyone obeys the parents: Most core Kubernetes resources (RBAC, Network Policies, Resource Quotas) play nicely with HNC. But not all third-party tools or custom controllers are hierarchy-aware. They might only see the flat world, so always test your specific tools.

Go forth and organize

Hierarchical Namespaces are the organizational equivalent of finally buying drawer dividers for that one kitchen drawer, you know the one. The one where the whisk is tangled with the batteries and a single, mysterious key. They transform your cluster from a chaotic free-for-all into a structured, manageable hierarchy that actually reflects how your organization works. It’s about letting you set rules with confidence and delegate with ease.

So go ahead, embrace your inner cluster parent. Bring some order to the digital chaos. Your future self, the one who isn’t spending a Friday night debugging a rogue pod in the wrong environment, will thank you. Just don’t be surprised when your newly organized child namespaces start acting like teenagers, asking for the production Wi-Fi password or, heaven forbid, the keys to the cluster-admin car. After all, with great power comes great responsibility… and a much, much cleaner kubectl get ns output.

Building living systems with WebSockets

For the longest time, communication on the web has been a painfully formal affair. It’s like sending a letter. Your browser meticulously writes a request, sends it off via the postal service (HTTP), and then waits. And waits. Eventually, the server might write back with an answer. If you want to know if anything has changed five seconds later, you have to send another letter. It’s slow, it’s inefficient, and frankly, the postman is starting to give you funny looks.

This constant pestering, “Anything new? How about now? Now?”, is the digital equivalent of a child on a road trip asking, “Are we there yet?” It’s called polling, and it’s the clumsy foundation upon which much of the old web was built. For applications that need to feel alive, this just won’t do.

What if, instead of sending a flurry of letters, we could just open a phone line directly to the server? A dedicated, always-on connection where both sides can just shout information at each other the moment it happens. That, in a nutshell, is the beautiful, chaotic, and nonstop chatter of WebSockets. It’s the technology that finally gave our distributed systems a voice.

The secret handshake that starts the party

So how do you get access to this exclusive, real-time conversation? You can’t just barge in. You have to know the secret handshake.

The process starts innocently enough, with a standard HTTP request. It looks like any other request, but it carries a special, almost magical, header: Upgrade: websocket. This is the client subtly asking the server, “Hey, this letter-writing thing is a drag. Can we switch to a private line?”

If the server is cool, and equipped for a real conversation, it responds with a special status code, 101 Switching Protocols. This isn’t just an acknowledgment; it’s an agreement. The server is saying, “Heck yes. The formal dinner party is over. Welcome to the after-party.” At that moment, the clumsy, transactional nature of HTTP is shed like a heavy coat, and the connection transforms into a sleek, persistent, two-way WebSocket tunnel. The phone line is now open.
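You can watch the handshake yourself with nothing fancier than curl, sending the upgrade headers by hand; the endpoint below is a placeholder, and the key is just a base64 nonce:

# A willing server answers with HTTP/1.1 101 Switching Protocols
curl --include --no-buffer \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: SGVsbG8sIHdvcmxkIQ==" \
  https://ws.example.com/chat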

So what can we do with all this chatter?

Once you have this open line, the possibilities become far more interesting than just fetching web pages. You can build systems that breathe.

The art of financial eavesdropping

Think of a stock trading platform. With HTTP, you’d be that sweaty-palmed investor hitting refresh every two seconds, hoping to catch a price change. With WebSockets, the server just whispers the new prices in your ear the microsecond they change. It’s the difference between reading yesterday’s newspaper and having a live feed from the trading floor piped directly into your brain.

Keeping everyone on the same page literally

Remember the horror of emailing different versions of a document back and forth? “Report_Final_v2_Johns_Edits_Final_FINAL.docx”. Collaborative tools like Google Docs killed that nightmare, and WebSockets were the murder weapon. When you type, your keystrokes are streamed to everyone else in the document instantly. It’s a seamless, shared consciousness, not a series of disjointed monologues.

Where in the world is my taxi

Ride-sharing apps like Uber would be a farce without a live map. You don’t want a “snapshot” of where your driver was 30 seconds ago; you want to see that little car icon gliding smoothly toward you. WebSockets provide that constant stream of location data, turning a map from a static picture into a living, moving window.

When the conversation gets too loud

Of course, hosting a party where a million people are all talking at once isn’t exactly a walk in the park. This is where our brilliant WebSocket-powered dream can turn into a bit of a logistical headache.

A server that could happily handle thousands of brief HTTP requests might suddenly break into a cold sweat when asked to keep tens of thousands of WebSocket phone lines open simultaneously. Each connection consumes memory and resources. It’s like being a party host who promised to have a deep, meaningful conversation with every single guest, all at the same time. Eventually, you’re just going to collapse from exhaustion.

And what happens if the line goes dead? A phone can be hung up, but a digital connection can just… fade into the void. Is the client still there, listening quietly? Or did their Wi-Fi die mid-sentence? To avoid talking to a ghost, servers have to periodically poke the client with a ping message. If they get a pong back, all is well. If not, the server sadly hangs up, freeing the line for someone who actually wants to talk.

How to be a good conversation host

Taming this beast requires a bit of cleverness. You can’t just throw one server at the problem and hope for the best.

Load balancers become crucial, but they need to be smarter. A simple load balancer that just throws requests at any available server is a disaster for WebSockets. It’s like trying to continue a phone conversation while the operator keeps switching you to a different person who has no idea what you were talking about. You need “sticky sessions,” which is a fancy way of saying the load balancer is smart enough to remember which server you were talking to and keeps you connected to it.

Security also gets a fun new twist. An always-on connection is a wonderfully persistent doorway into your system. If you’re not careful about who you’re talking to and what they’re saying (WSS, the secure version, is non-negotiable), you might find you’ve invited a Trojan horse to your party.

A world that talks back

So, no, WebSockets aren’t just another tool in the shed. They represent a philosophical shift. It’s the moment we stopped treating the web like a library of dusty, static books and started treating it like a bustling, chaotic city square. We traded the polite, predictable, and frankly boring exchange of letters for the glorious, unpredictable mess of a real-time human conversation.

It means our applications can now have a pulse. They can be surprised, they can interrupt, and they can react with the immediacy of a startled cat. Building these living systems is certainly noisier and requires a different kind of host, one who’s part traffic cop and part group therapist. But by embracing the chaos, we create experiences that don’t just respond; they engage, they live. And isn’t building a world that actually talks back infinitely more fun?

BigQuery learns to read between the lines

Keyword search is the friend who hears every word and misses the point. Vector search is the friend who nods, squints a little, and says, “You want a safe family SUV that will not make your wallet cry.” This story is about teaching BigQuery to be the second friend.

I wanted semantic search without renting another database, shipping nightly exports, or maintaining yet another dashboard only I remember to feed. The goal was simple and a little cheeky: keep the data in BigQuery, add embeddings with Vertex AI, create a vector index, and still use boring old SQL to filter by price and mileage. Results should read like good advice, not a word-count contest.

Below is a practical pattern that works well for catalogs, internal knowledge bases, and “please find me the thing I mean” situations. It is light on ceremony, honest about trade‑offs, and opinionated where it needs to be.

Why keyword search keeps missing the point

  • Humans ask for meanings, not tokens. “Family SUV that does not guzzle” is intent, not keywords.
  • Catalogs are messy. Price, mileage, features, and descriptions live in different columns and dialects.
  • Traditional search treats text like a bag of Scrabble tiles. Embeddings turn it into geometry where similar meanings sit near each other.

If you have ever typed “cheap laptop with decent battery” and received a gaming brick with neon lighting, you know the problem.

Keep data where it already lives

No new database. BigQuery already stores your rows, talks SQL, and now speaks vectors. The plan:

  1. Build a clean content string per row so the model has a story to understand.
  2. Generate embeddings in BigQuery via a remote Vertex AI model.
  3. Store those vectors in a table and, when it makes sense, add a vector index.
  4. Search with a natural‑language query embedding and filter with plain SQL.

Prepare a clean narrative for each row

Your model will eat whatever you feed it. Feed it something tidy. The goal is a single content field with labeled parts, so the embedding has clues.

-- Demo names and values are fictitious
CREATE OR REPLACE TABLE demo_cars.search_base AS
SELECT
  listing_id,
  make,
  model,
  year,
  price_usd,
  mileage_km,
  body_type,
  fuel,
  features,
  CONCAT(
    'make=', make, ' | ',
    'model=', model, ' | ',
    'year=', CAST(year AS STRING), ' | ',
    'price_usd=', CAST(price_usd AS STRING), ' | ',
    'mileage_km=', CAST(mileage_km AS STRING), ' | ',
    'body=', body_type, ' | ',
    'fuel=', fuel, ' | ',
    'features=', ARRAY_TO_STRING(features, ', ')
  ) AS content
FROM demo_cars.listings
WHERE status = 'active';

Housekeeping that pays off

  • Normalize units and spellings early. “20k km” is cute; 20000 is useful (a quick sketch follows this list).
  • Keep labels short and consistent. Your future self will thank you.
  • Avoid stuffing everything. Noise in, noise out.
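
A rough sketch of that normalization, assuming a hypothetical raw_listings table where mileage sometimes arrives as free text in a raw_mileage column:

-- Hypothetical raw_mileage column: turn '20k km' or '20000 km' into a plain INT64
SELECT
  listing_id,
  CAST(
    REGEXP_REPLACE(
      REGEXP_REPLACE(LOWER(TRIM(raw_mileage)), r'k\s*km$', '000'),  -- '20k km'   -> '20000'
      r'\s*km$', ''                                                 -- '20000 km' -> '20000'
    ) AS INT64
  ) AS mileage_km
FROM demo_cars.raw_listings;

Do this once in the staging query, not at query time, so the content field and your SQL filters agree on what a kilometer is.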

Turn text into vectors without hand waving

We will assume you have a BigQuery remote model that points to your Vertex AI text‑embedding endpoint. Choose a modern embedding model and be explicit about the task type: use RETRIEVAL_DOCUMENT for rows and RETRIEVAL_QUERY for user queries. That hint matters.

Embed the documents

-- Store document embeddings alongside your base table
CREATE OR REPLACE TABLE demo_cars.search_with_vec AS
SELECT
  listing_id,
  make, model, year, price_usd, mileage_km, body_type, fuel, features,
  -- ML.GENERATE_EMBEDDING passes through every input column and adds the embedding
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `demo.embed_text`,
  TABLE demo_cars.search_base,
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);

ML.GENERATE_EMBEDDING takes the whole base table in one pass and returns every input column plus its embedding, so there is no row-by-row ceremony. If you prefer, materialize the embeddings in their own table and JOIN on listing_id to minimize churn.

Build an index when it helps and skip it when it does not

BigQuery can scan vectors without an index, which is fine for small tables and prototypes. For larger tables, add an IVF index with cosine distance.

-- Optional but recommended beyond a few thousand rows
CREATE VECTOR INDEX search_with_vec_idx
ON demo_cars.search_with_vec(embedding)
OPTIONS(
  distance_type = 'COSINE',
  index_type = 'IVF',
  ivf_options = '{"num_lists": 128}'
);

Rules of thumb

  • Start without an index for quick experiments. Add the index when latency or cost asks for it.
  • Tune num_lists only after measuring. Guessing is cardio for your CPU.

Ask in plain English, filter in plain SQL

Here is the heart of it. One short block that embeds the query, runs vector search, then applies filters your finance team actually understands.

-- Natural language wish
DECLARE user_query STRING DEFAULT 'family SUV with lane assist under 18000 USD';

WITH q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (SELECT user_query AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT base.listing_id, base.make, base.model, base.year, base.price_usd, base.mileage_km, base.body_type
FROM VECTOR_SEARCH(
  TABLE demo_cars.search_with_vec, 'embedding',
  (SELECT qvec FROM q), query_column_to_search => 'qvec',
  top_k => 20, distance_type => 'COSINE'
)
WHERE base.price_usd <= 18000
  AND base.body_type = 'SUV'
ORDER BY base.price_usd ASC;

This is the “hybrid search” pattern, with the two halves working shoulder to shoulder: the vectors shortlist plausible candidates, and SQL draws the hard lines. You get relevance and guardrails.

Measure quality and cost without a research grant

You do not need a PhD rubric, just a habit.

Relevance sanity check

  • Write five real queries from your users. Note how many good hits appear in the top ten. If it is fewer than six, look at your content field. It is almost always the content.

Latency

  • Time the query with and without the vector index. Keep an eye on top‑k and filters. If you filter out 90% of candidates, you can often keep top‑k low.

Cost

  • Avoid regenerating embeddings. Upserts should only touch changed rows. Schedule small nightly or hourly batches, not heroic full refreshes.
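
A minimal sketch of that incremental habit, reusing the demo tables above. It only picks up brand-new listings; catching edited rows would additionally need a change timestamp, which the demo schema does not have:

-- Embed only listings that do not have a vector yet
INSERT INTO demo_cars.search_with_vec
  (listing_id, make, model, year, price_usd, mileage_km, body_type, fuel, features, embedding)
SELECT
  listing_id,
  make, model, year, price_usd, mileage_km, body_type, fuel, features,
  ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `demo.embed_text`,
  (SELECT * FROM demo_cars.search_base
   WHERE listing_id NOT IN (SELECT listing_id FROM demo_cars.search_with_vec)),
  STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_DOCUMENT' AS task_type)
);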

Where things wobble and how to steady them

Vague user queries

  • Add example phrasing in your product UI. Even two placeholders nudge users into better intent.

Sparse or noisy text

  • Enrich your content with compact labels and the two or three features people actually ask for. Resist the urge to dump raw logs.

Synonyms of the trade

  • Lightweight mapping helps. If your users say “lane keeping” and your data says “lane assist,” consider normalizing in content.

Region mismatches

  • Keep your dataset, remote connection, and model in compatible regions. Latency enjoys proximity. Downtime enjoys misconfigurations.

Run it day after day without drama

A few operational notes that keep the lights on:

  • Track changes by listing_id and only re‑embed those rows.
  • BigQuery refreshes vector indexes asynchronously, so you rarely rebuild by hand; after heavy churn, give the index time to catch up before judging quality or latency.
  • Keep one “golden query set” around for spot checks after schema or model changes.

Takeaways you can tape to your monitor

  • Keep data in BigQuery and add meaning with embeddings.
  • Build one tidy content string per row. Labels beat prose.
  • Use RETRIEVAL_DOCUMENT for rows and RETRIEVAL_QUERY for the user’s text.
  • Start without an index; add IVF with cosine when volume demands it.
  • Let vectors shortlist and let SQL make the final call.

Tiny bits you might want later

An alternative query that biases toward newer listings

DECLARE user_query STRING DEFAULT 'compact hybrid with good safety under 15000 USD';
WITH q AS (
  SELECT ml_generate_embedding_result AS qvec
  FROM ML.GENERATE_EMBEDDING(
    MODEL `demo.embed_text`,
    (SELECT user_query AS content),
    STRUCT(TRUE AS flatten_json_output, 'RETRIEVAL_QUERY' AS task_type)
  )
)
SELECT base.listing_id, base.make, base.model, base.year, base.price_usd
FROM VECTOR_SEARCH(
  TABLE demo_cars.search_with_vec, 'embedding',
  (SELECT qvec FROM q), query_column_to_search => 'qvec',
  top_k => 15, distance_type => 'COSINE'
)
WHERE base.price_usd <= 15000
ORDER BY base.year DESC, base.price_usd ASC
LIMIT 10;

Quick checklist before you ship

  • The remote model exists and is reachable from BigQuery.
  • Dataset and connection share a region you actually meant to use.
  • content strings are consistent and free of junk units.
  • Embeddings updated only for changed rows.
  • Vector index present on tables that need it and not on those that do not.

If keyword search is literal‑minded, this setup is the polite interpreter who knows what you meant, forgives your typos, and still respects the house rules. You keep your data in one place, you use one language to query it, and you get answers that feel like common sense rather than a thesaurus attack. That is the job.

Ingress and egress on EKS made understandable

Getting traffic in and out of a Kubernetes cluster isn’t a magic trick. It’s more like running the city’s most exclusive nightclub. It’s a world of logistics, velvet ropes, bouncers, and a few bureaucratic tollbooths on the way out. Once you figure out who’s working the front door and who’s stamping passports at the exit, the rest is just good manners.

Let’s take a quick tour of the establishment.

A ninety-second tour of the premises

There are really only two journeys you need to worry about in this club.

Getting In: A hopeful guest (the client) looks up the address (DNS), arrives at the front door, and is greeted by the head bouncer (Load Balancer). The bouncer checks the guest list and directs them to the right party room (Service), where they can finally meet up with their friend (the Pod).

Getting Out: One of our Pods needs to step out for some fresh air. It gets an escort from the building’s internal security (the Node’s ENI), follows the designated hallways (VPC routing), and is shown to the correct exit—be it the public taxi stand (NAT Gateway), a private car service (VPC Endpoint), or a connecting tunnel to another venue (Transit Gateway).

The secret sauce in EKS is that our Pods aren’t just faceless guests; the AWS VPC CNI gives them real VPC IP addresses. This means the building’s security rules (Security Groups, route tables, and NACLs) aren’t just theoretical policies. They are the very real guards and locked doors that decide whether a packet’s journey ends in success or a silent, unceremonious death.

Getting past the velvet rope

In Kubernetes, Ingress is the set of rules that governs the front door. But rules on paper are useless without someone to enforce them. That someone is a controller, a piece of software that translates your guest list into actual, physical bouncers in AWS.

The head of security for EKS is the AWS Load Balancer Controller. You hand it an Ingress manifest, and it sets up the door staff.

  • For your standard HTTP web traffic, it deploys an Application Load Balancer (ALB). Think of the ALB as a meticulous, sharp-dressed bouncer who doesn’t just check your name. It inspects your entire invitation (the HTTP request), looks at the specific event you’re trying to attend (/login or /api/v1), and only then directs you to the right room.
  • For less chatty protocols like raw TCP, UDP, or when you need sheer, brute-force throughput, it calls in a Network Load Balancer (NLB). The NLB is the big, silent type. It checks that you have a ticket and shoves you toward the main hall. It’s incredibly fast but doesn’t get involved in the details.

This whole operation can be made public or private. For internal-only events, the controller sets up an internal ALB or NLB and uses a private Route 53 zone, hiding the party from the public internet entirely.

The modern VIP system

The classic Ingress system works, but it can feel a bit like managing your guest list with a stack of sticky notes. The rules for routing, TLS, and load balancer behavior are all crammed into a single resource, creating a glorious mess of annotations.

This is where the Gateway API comes in. It’s the successor to Ingress, designed by people who clearly got tired of deciphering annotation soup. Its genius lies in separating responsibilities.

  • The Platform team (the club owners) manages the Gateway. They decide where the entrances are, what protocols are allowed (HTTP, TCP), and handle the big-picture infrastructure like TLS certificates.
  • The Application teams (the party hosts) manage Routes (HTTPRoute, TCPRoute, etc.). They just point to an existing Gateway and define the rules for their specific application, like “send traffic for app.example.com/promo to my service.”

This creates a clean separation of duties, offers richer features for traffic management without resorting to custom annotations, and makes your setup far more portable across different environments.
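
A rough sketch of that split, with placeholder class, namespaces, hostnames, and service names, and TLS left out to keep it short:

# Owned by the platform team: the entrance itself
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public-web
  namespace: infra
spec:
  gatewayClassName: example-gateway-class   # placeholder: whatever controller you run
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All
---
# Owned by an application team: the rules for one app
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: promo-route
  namespace: shop
spec:
  parentRefs:
    - name: public-web
      namespace: infra
  hostnames:
    - app.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /promo
      backendRefs:
        - name: promo-service
          port: 8080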

The art of the graceful exit

So, your Pods are happily running inside the club. But what happens when they need to call an external API, pull an image, or talk to a database? They need to get out. This is egress, and it’s mostly about navigating the building’s corridors and exits.

  • The public taxi stand: For general internet access from private subnets, Pods are sent to a NAT Gateway. It works, but it’s like a single, expensive taxi stand for the whole neighborhood. Every trip costs money, and if it gets too busy, you’ll see it on your bill. Pro tip: Put one NAT in each Availability Zone to avoid paying extra for your Pods to take a cross-town cab just to get to the taxi stand.
  • The private car service: When your Pods need to talk to other AWS services (like S3, ECR, or Secrets Manager), sending them through the public internet is a waste of time and money. Use VPC endpoints instead. Think of this as a pre-booked black car service. It creates a private, secure tunnel directly from your VPC to the AWS service. It’s faster, cheaper, and the traffic never has to brave the public internet (a one-liner follows this list).
  • The diplomatic passport: The worst way to let Pods talk to AWS APIs is by attaching credentials to the node itself. That’s like giving every guest in the club a master key. Instead, we use IRSA (IAM Roles for Service Accounts). This elegantly binds an IAM role directly to a Pod’s service account. It’s the equivalent of issuing your Pod a diplomatic passport. It can present its credentials to AWS services with full authority, no shared keys required.
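
For instance, a gateway endpoint for S3 is a single call; the IDs and region below are placeholders:

# Placeholder IDs: give the VPC a private hallway to S3 instead of the public taxi stand
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc1234567890def \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-west-2.s3 \
  --route-table-ids rtb-0abc1234567890def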

Setting the house rules

By default, Kubernetes networking operates with the cheerful, chaotic optimism of a free-for-all music festival. Every Pod can talk to every other Pod. In production, this is not a feature; it’s a liability. You need to establish some house rules.

Your two main tools for this are Security Groups and NetworkPolicy.

Security Groups are your Pod’s personal bodyguards. They are stateful and wrap around the Pod’s network interface, meticulously checking every incoming and outgoing connection against a list you define. They are an AWS-native tool and very precise.
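
If you run the AWS VPC CNI with security groups for Pods enabled, you can assign those bodyguards per Pod with a SecurityGroupPolicy. A minimal sketch, with placeholder names and group ID:

# Attach a dedicated security group to every Pod labeled app: payments
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: payments-sgp
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments
  securityGroups:
    groupIds:
      - sg-0abc1234567890def   # placeholder: the bodyguard's badge number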

NetworkPolicy, on the other hand, is the club’s internal security team. You need an enforcement crew to make the rules stick in EKS (the VPC CNI’s built-in network policy support, or a third-party firm like Calico or Cilium), but once you do, you can create powerful rules like “Pods in the ‘database’ room can only accept connections from Pods in the ‘backend’ room on port 5432.”

The most sane approach is to start with a default deny policy. This is the bouncer’s universal motto: “If your name’s not on the list, you’re not getting in.” Block all egress by default, then explicitly allow only the connections your application truly needs.
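
A default-deny egress policy is short enough to memorize; a minimal sketch for a placeholder backend namespace:

# Nothing leaves this namespace until an explicit allow rule says otherwise
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: backend
spec:
  podSelector: {}     # selects every Pod in the namespace
  policyTypes:
    - Egress          # no egress rules listed, so all outbound traffic is blocked

The first thing you will add back is DNS (port 53 to kube-dns), or the club goes very quiet while everyone forgets each other’s names.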

A few recipes from the bartender

Full configurations are best kept in a Git repository, but here are a few cocktail recipes to show the key ingredients.

Recipe 1: Public HTTPS with a custom domain. This Ingress manifest tells the AWS Load Balancer Controller to set up a public-facing ALB, listen on port 443, use a specific TLS certificate from ACM, and route traffic for app.yourdomain.com to the webapp service.

# A modern Ingress for your web application
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webapp-ingress
  annotations:
    # Set the bouncer to be public
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Talk to Pods directly for better performance
    alb.ingress.kubernetes.io/target-type: ip
    # Listen for secure traffic
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
    # Here's the TLS certificate to wear
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:123456789012:certificate/your-cert-id
spec:
  ingressClassName: alb
  rules:
    - host: app.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: webapp-service
                port:
                  number: 8080

Recipe 2: A diplomatic passport for S3 access. This gives our Pod a ServiceAccount annotated with an IAM role ARN. Any Pod that uses this service account can now talk to AWS APIs (like S3) with the permissions granted by that role, thanks to IRSA.

# The ServiceAccount with its IAM credentials
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-reader-sa
  annotations:
    # This is the diplomatic passport: the ARN of the IAM role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EKS-S3-Reader-Role
---
# The Deployment that uses the passport
apiVersion: apps/v1
kind: Deployment
metadata:
  name: report-generator
spec:
  replicas: 1
  selector:
    matchLabels: { app: reporter }
  template:
    metadata:
      labels: { app: reporter }
    spec:
      # Use the service account we defined above
      serviceAccountName: s3-reader-sa
      containers:
        - name: processor
          image: your-repo/report-generator:v1.5.2
          ports:
            - containerPort: 8080

A short closing worth remembering

When you boil it all down, Ingress is just the etiquette you enforce at the front door. Egress is the paperwork required for a clean exit. In EKS, the etiquette is defined by Kubernetes resources, while the paperwork is pure AWS networking. Neither one cares about your intentions unless you write them down clearly.

So, draw the path for traffic both ways, pick the right doors for the job, give your Pods a proper identity, and set the tolls where they make sense. If you do, the cluster will behave, the bill will behave, and your on-call shifts might just start tasting a lot more like sleep.

What your DNS logs are saying behind your back

There’s a dusty shelf in every network closet where good intentions go to die. Or worse, to gossip. You centralize DNS for simplicity. You enable logging for accountability. You peer VPCs for convenience. A few sprints later, your DNS logs have become that chatty neighbor who sees every car that comes and goes, remembers every visitor, and pieces together a startlingly accurate picture of your life.

They aren’t leaking passwords or secret keys. They’re leaking something just as valuable: the blueprints of your digital house.

This post walks through a common pattern that quietly spills sensitive clues through AWS Route 53 Resolver query logging. We’ll skip the dry jargon and focus on the story. You’ll leave with a clear understanding of the problem, a checklist to investigate your own setup, and a handful of small, boring changes that buy you a lot of peace.

The usual suspects are a disaster recipe in three easy steps

This problem rarely stems from one catastrophic mistake. It’s more like three perfectly reasonable decisions that meet for lunch and end up burning down the restaurant. Let’s meet the culprits.

1. The pragmatic architect

In a brilliant move of pure common sense, this hero centralizes DNS resolution into a single, shared network VPC. “One resolver to rule them all,” they think. It simplifies configuration, reduces operational overhead, and makes life easier for everyone. On paper, it’s a flawless idea.

2. The visibility aficionado

Driven by the noble quest for observability, this character enables Route 53 query logging on that shiny new central resolver. “What gets measured, gets managed,” they wisely quote. To be extra helpful, they associate this logging configuration with every single VPC that peers with the network VPC. After all, data is power. Another flawless idea.

3. The easy-going permissions manager

The logs have to land somewhere, usually a CloudWatch Log Group or an S3 bucket. Our third protagonist, needing to empower their SRE and Ops teams, grants them broad read access to this destination. “They need it to debug things,” is the rationale. “They’re the good guys.” A third, utterly flawless idea.

Separately, these are textbook examples of good cloud architecture. Together, they’ve just created the perfect surveillance machine: a centralized, all-seeing eye that diligently writes down every secret whisper and then leaves the diary on the coffee table for anyone to read.

So what is actually being spilled

The real damage comes from the metadata. DNS queries are the internal monologue of your applications, and your logs are capturing every single thought. A curious employee, a disgruntled contractor, or even an automated script can sift through these logs and learn things like:

  • Service Hostnames that tell a story: Names like billing-api.prod.internal or customer-data-primary-db.restricted.internal do more than just resolve to an IP. They reveal your service names, their environments, and even their importance.
  • Secret project names: That new initiative you haven’t announced yet? If its services are making DNS queries like project-phoenix-auth-service.dev.internal, the secret’s already out.
  • Architectural hints: Hostnames often contain roles like etl-worker-3.prod, admin-gateway.staging, or sre-jumpbox.ops.internal. These are the labels on your architectural diagrams, printed in plain text.
  • Cross-Environment chatter: The most dangerous leak of all. When a query from a dev VPC successfully resolves a hostname in the prod environment (e.g., prod-database.internal), you’ve just confirmed a path between them exists. That’s a security finding waiting to happen.

Individually, these are harmless breadcrumbs. But when you have millions of them, anyone can connect the dots and draw a complete, and frankly embarrassing, map of your entire infrastructure.

Put on your detective coat and investigate your own house

Feeling a little paranoid? Good. Let’s channel that energy into a quick investigation. You don’t need a magnifying glass, just your AWS command line.

Step 1 Find the secret diaries

First, we need to find out where these confessions are being stored. This command asks AWS to list all your Route 53 query logging configurations. It’s the equivalent of asking, “Where are all the diaries kept?”

aws route53resolver list-resolver-query-log-configs \
  --query 'ResolverQueryLogConfigs[].{Name:Name, Id:Id, DestinationArn:DestinationArn, VpcCount:AssociationCount}'

Take note of the DestinationArn for any configs with a high VpcCount. Those are your prime suspects. That ARN is the scene of the crime.

Step 2 Check who has the keys

Now that you know where the logs are, the million-dollar question is: who can read them?

If the destination is a CloudWatch Log Group, examine its resource-based policy and also review the IAM policies associated with your user roles. Are there wildcard permissions like logs:Get* or logs:* attached to broad groups?
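
Two quick commands cover both sides of that question; the role and log group names below are placeholders:

# Any account-wide resource policies granting access to CloudWatch Logs?
aws logs describe-resource-policies

# Can this role actually read the DNS log group? (placeholder ARNs)
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/SRE-ReadOnly \
  --action-names logs:GetLogEvents logs:FilterLogEvents \
  --resource-arns arn:aws:logs:us-east-1:123456789012:log-group:central-dns-query-logs:*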

If it’s an S3 bucket, check the bucket policy. Does it look something like this?

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:root"
      },
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::central-network-dns-logs/*"
    }
  ]
}

This policy generously delegates access to the entire account: any IAM user or role whose own permissions include s3:GetObject can read every log. It’s the digital equivalent of leaving your front door wide open.

Step 3 Listen for the juicy gossip

Finally, let’s peek inside the logs themselves. Using CloudWatch Log Insights, you can run a query to find out if your non-production environments are gossiping about your production environment.

fields @timestamp, query_name, vpc_id
| filter query_name like /\.prod\.internal/
| filter vpc_id not like /vpc-prod-environment-id/
| stats count(*) as queryCount by vpc_id
| sort queryCount desc

This query looks for any log entries that mention your production domain (.prod.internal) but did not originate from a production VPC. Any results here are a flashing red light, indicating that your environments are not as isolated as you thought.

The fix is housekeeping, not heroics

The good news is that you don’t need to re-architect your entire network. The solution isn’t some heroic, complex project. It’s just boring, sensible housekeeping.

  1. Be granular with your logging: Don’t use a single, central log destination for every VPC. Create separate logging configurations for different environments (prod, staging, dev). Send production logs to a highly restricted location and development logs to a more accessible one.
  2. Practice a little scrutiny: Just because a resolver is shared doesn’t mean its logs have to be. Associate your logging configurations only with the specific VPCs that absolutely need it (a command sketch follows this list).
  3. Embrace the principle of least privilege: Your IAM and S3 bucket policies should be strict. Access to production DNS logs should be an exception, not the rule, requiring a specific IAM role that is audited and temporary.
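
Scoping the associations from the second item is one CLI call per VPC; the config and VPC IDs below are placeholders:

# Stop the central config from recording a dev VPC's every thought (placeholder IDs)
aws route53resolver disassociate-resolver-query-log-config \
  --resolver-query-log-config-id rqlc-central-0abc123 \
  --resource-id vpc-0dev1234567890abc

# Associate the restricted prod config only with the prod VPC (placeholder IDs)
aws route53resolver associate-resolver-query-log-config \
  --resolver-query-log-config-id rqlc-prod-0abc123 \
  --resource-id vpc-0prod123456789abc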

That’s it. No drama, no massive refactor. Just a few small tweaks to turn your chatty neighbor back into a silent, useful tool. Because at the end of the day, the best secret-keeper is the one who never heard the secret in the first place.