LeastPrivilege

RBAC is not least privilege, and your cluster is the proof

Your security scanner ran last night. It came back green. RBAC is configured, there are no critical findings, and you closed the tab with the quiet satisfaction of someone who has done the responsible thing. The cluster is locked down. You can go to lunch.

Here is the uncomfortable part. A green scanner answers the question “Is access controlled?” It does not answer the question “Is access minimal?” Those are different questions, and most teams conflate them because the first one is easy to check and the second one requires reading things nobody wants to read on a Tuesday.

RBAC answers the first. Least privilege requires answering both. And a perfectly valid RBAC configuration can be, at the very same time, a perfectly generous one. The scanner has no opinion about generosity.

The ClusterRole you inherited from a Helm chart in March

Kubernetes ships three aggregated ClusterRoles out of the box (admin, edit, view), and they have a quietly alarming property. They absorb permissions. Any ClusterRole carrying the label ‘rbac.authorization.k8s.io/aggregate-to-edit: “true”’ gets automatically folded into ‘edit’, with no human in the loop and no diff to review.

This is convenient right up until it is not. When you installed that operator back in March, its Helm chart shipped a CRD and a ClusterRole with the aggregation label attached, because that is the polite, idiomatic way to do it. From the moment ‘helm install’ finished, every subject bound to ‘edit’ in your cluster silently gained permissions over a brand new resource type. Nobody approved it. Nobody saw it. The controller did exactly what it was designed to do, which is the part that should worry you.

So the RoleBinding still says ‘edit’. The word has not changed. What it grants has, several times, across several chart upgrades, and the only record of the expansion is scattered across ClusterRole objects nobody has opened since they were applied.

The takeaway is small and annoying: every time you install a chart, check what it aggregated. ‘kubectl get clusterrole -l rbac.authorization.k8s.io/aggregate-to-edit=true’ is two minutes of your life and occasionally a genuine surprise.

That ServiceAccount reads secrets, all of them, probably

Consider a ServiceAccount with ‘get’ on secrets in a single namespace. On paper, this looks narrow and tidy. The reviewer who approved it was right to approve it. The problem is that RBAC grants do not live in isolation; they live next to whatever else is running in that namespace.

If that namespace also hosts External Secrets Operator, a Vault Agent sidecar, or a CSI secrets driver, the secrets sitting there are not application trivia. They are the synced, materialized credentials that those tools pulled from somewhere more important. A grant that reads “can view secrets in ‘team-a’” can, depending on the architecture around it, mean “can read the cloud provider credentials that External Secrets faithfully copied into ‘team-a’ thirty seconds ago.”

Nothing here is broken. Every component is behaving as documented. That is exactly why it slips past review: each piece is reasonable, and the risk only exists in the seam between them, where no single Role definition is looking.

So when you audit a secrets grant, do not read the Role. Read the room. Ask what else lives in that namespace and what those neighbors keep in their pockets.

Creating a Pod sometimes creates a root shell on the node

This is the one people refuse to believe until you show them.

If Pod Security Admission is not enforced in ‘restricted’ mode, a subject with ‘create’ on pods is, functionally, a subject with a path to the node. They can define a pod that mounts the host root filesystem as a volume, sets ‘hostPID: true’, runs ‘privileged: true’, or maps a host port to quietly intercept traffic. From inside that pod, the node is no longer a node; it is a directory.

None of this is a vulnerability. There is no CVE to patch, because Kubernetes is doing precisely what the spec permits. The escalation lives in the gap between two true statements: “we have RBAC” and “nobody can reach the node.” Both can be accurate. Together, they can still be a hole you could drive a cluster through.

The fix is not more RBAC. It is admission control. Enforce PSA ‘restricted’ as the namespace default, and treat every exception as a decision someone wrote down and owns, rather than a default nobody chose.

Three commands that will ruin your afternoon

Theory is comfortable. Here is the part where you actually look.

‘kubectl-who-can’ answers the blunt question: who can perform this verb on this resource, right now. ‘kubectl who-can create pods -n production’ is a fast way to find out that the list is longer than you remembered.

‘rakkess’ produces a full access matrix for a given subject, so you can stare at an entire grid of green checkmarks belonging to a ServiceAccount that, in principle, only needed to read a config map.

‘rbac-tool lookup’ lists everything a specific subject can do across the whole cluster, which is the tool you run when you have a name and a bad feeling.

I will set an honest expectation. The first time you run any of these against a cluster older than a year, you will find at least one thing nobody intended, and there is a decent chance it will be something you granted. This is not a moral failing. It is entropy. Permissions accrete the same way junk drawers do, one reasonable decision at a time.

The scanner will still be green, that is no longer the point

Here is where I am supposed to hand you a fix that makes the scary parts go away. I cannot, because least privilege in Kubernetes is not a configuration state you reach and then defend. It is a process you keep doing, slightly grudgingly, forever.

Start subjects at zero and grant only what the audit log proves they actually use. Tools like ‘audit2rbac’ can generate tight RBAC from real API server audit events, which is to say from evidence rather than from optimism. Enforce PSA ‘restricted’ by default. Audit aggregated ClusterRoles every time you install a chart. Rotate ServiceAccount tokens, because a credential that never expires is just a future incident with good patience.

Do all of that, and run the scanner again. It will still be green. It was always going to be green. The result has not changed at all. The only thing that has changed is the question you now know to ask, and that, inconveniently, was the whole job.

There is no universal answer here, only better-informed trade-offs, and the faint suspicion that your next audit will find something too. It usually does.

Let IAM handle the secrets you can avoid

There are two kinds of secrets in cloud security.

The first kind is the legitimate kind: a third-party API token, a password for something you do not control, a certificate you cannot simply wish into existence.

The second kind is the kind we invent because we are in a hurry: long-lived access keys, copied into a config file, then copied into a Docker image, then copied into a ticket, then copied into the attacker’s weekend plans.

This article is about refusing to participate in that second category.

Not because secrets are evil. Because static credentials are the “spare house key under the flowerpot” of AWS. Convenient, popular, and a little too generous with access for something that can be photographed.

The goal is not “no secrets exist.” The goal is no secrets live in code, in images, or in long-lived credentials.

If you do that, your security posture stops depending on perfect human behavior, which is great because humans are famously inconsistent. (We cannot all be trusted with a jar of cookies, and we definitely cannot all be trusted with production AWS keys.)

Why this works in real life

AWS already has a mechanism designed to prevent your applications from holding permanent credentials: IAM roles and temporary credentials (STS).

When your Lambda runs with an execution role, AWS hands it short-lived credentials automatically. They rotate on their own. There is nothing to copy, nothing to stash, nothing to rotate in a spreadsheet named FINAL-final-rotation-plan.xlsx.

What remains are the unavoidable secrets, usually tied to systems outside AWS. For those, you store them in AWS Secrets Manager and retrieve them at runtime. Not at build time. Not at deploy time. Not by pasting them into an environment variable and calling it “secure” because you used uppercase letters.

This gives you a practical split:

Avoidable secrets are replaced by IAM roles and temporary credentials
Unavoidable secrets go into Secrets Manager, encrypted and tightly scoped

The architecture in one picture

A simple flow to keep in mind:

A Lambda function runs with an IAM execution role
The function fetches one third-party API key from Secrets Manager at runtime
The function calls the third-party API and writes results to DynamoDB
Network access to Secrets Manager stays private through a VPC interface endpoint (when the Lambda runs in a VPC)

The best part is what you do not see.

No access keys. No “temporary” keys that have been temporary since 2021. No secrets baked into ZIPs or container layers.

What this protects you from

This pattern is not a magic spell. It is a seatbelt.

It helps reduce the chance of:

Credentials leaking through Git history, build logs, tickets, screenshots, or well-meaning copy-paste
Forgotten key rotation schedules that quietly become “never.”
Overpowered policies that turn a small bug into a full account cleanup
Unnecessary public internet paths for sensitive AWS API calls

Now let’s build it, step by step, with code snippets that are intentionally sanitized.

Step 1 build an IAM execution role with tight policies

The execution role is the front door key your Lambda carries.

If you give it access to everything, it will eventually use that access, if only because your future self will forget why it was there and leave it in place “just in case.”

Keep it boring. Keep it small.

Here is an example IAM policy for a Lambda that only needs to:

write to one DynamoDB table
read one secret from Secrets Manager
decrypt using one KMS key (optional, depending on how you configure encryption)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteToOneTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:111122223333:table/app-results-prod"
    },
    {
      "Sid": "ReadOneSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:thirdparty/weather-api-key-*"
    },
    {
      "Sid": "DecryptOnlyThatKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:eu-west-1:111122223333:key/12345678-90ab-cdef-1234-567890abcdef",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "secretsmanager.eu-west-1.amazonaws.com"
        }
      }
    }
  ]
}

A few notes that save you from future regret:

The secret ARN ends with -* because Secrets Manager appends a random suffix.
The KMS condition helps ensure the key is used only through Secrets Manager, not as a general-purpose decryption service.
You can skip the explicit kms:Decrypt statement if you use the AWS-managed key and accept the default behavior, but customer-managed keys are common in regulated environments.

Step 2 store the unavoidable secret properly

Secrets Manager is not a place to dump everything. It is a place to store what you truly cannot avoid.

A third-party API key is a perfect example because IAM cannot replace it. AWS cannot assume a role in someone else’s SaaS.

Use a JSON secret so you can extend it later without creating a new secret every time you add a field.

{
  "api_key": "REDACTED-EXAMPLE-TOKEN"
}

If you like the CLI (and I do, because buttons are too easy to misclick), create the secret like this:

aws secretsmanager create-secret \
  --name "thirdparty/weather-api-key" \
  --description "Token for the Weatherly API used by the ingestion Lambda" \
  --secret-string '{"api_key":"REDACTED-EXAMPLE-TOKEN"}' \
  --region eu-west-1

Then configure:

encryption with a customer-managed KMS key if required
rotation if the provider supports it (rotation is amazing when it is real, and decorative when the vendor does not allow it)

If the vendor does not support rotation, you still benefit from central storage, access control, audit logging, and removing the secret from code.

Step 3 lock down secret access with a resource policy

Identity-based policies on the Lambda role are necessary, but resource policies are a nice extra lock.

Think of it like this: your role policy is the key. The resource policy is the bouncer who checks the wristband.

Here is a resource policy that allows only one role to read the secret.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyIngestionRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    },
    {
      "Sid": "DenyEverythingElse",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
        }
      }
    }
  ]
}

This is intentionally strict. Strict is good. Strict is how you avoid writing apology emails.

Step 4 keep Secrets Manager traffic private with a VPC endpoint

If your Lambda runs inside a VPC, it will not automatically have internet access. That is often the point.

In that case, you do not want the function reaching Secrets Manager through a NAT gateway if you can avoid it. NAT works, but it is like walking your valuables through a crowded shopping mall because the back door is locked.

Use an interface VPC endpoint for Secrets Manager.

Here is a Terraform example (sanitized) that creates the endpoint and limits access using a dedicated security group.

resource "aws_security_group" "secrets_endpoint_sg" {
  name        = "secrets-endpoint-sg"
  description = "Allow HTTPS from Lambda to Secrets Manager endpoint"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.eu-west-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.secrets_endpoint_sg.id]
}

If your Lambda is not in a VPC, you do not need this step. The function will reach Secrets Manager over AWS’s managed network path by default.

If you want to go further, consider adding a DynamoDB gateway endpoint too, so your function can write to DynamoDB without touching the public internet.

Step 5 retrieve the secret at runtime without turning logs into a confession

This is where many teams accidentally reinvent the problem.

They remove the secret from the code, then log it. Or they put it in an environment variable because “it is not in the repository,” which is a bit like saying “the spare key is not under the flowerpot, it is under the welcome mat.”

The clean approach is:

store only the secret name (not the secret value) as configuration
retrieve the value at runtime
cache it briefly to reduce calls and latency
never print it, even when debugging, especially when debugging

Here is a Python example for AWS Lambda with a tiny TTL cache.

import json
import os
import time
import boto3

_secrets_client = boto3.client("secretsmanager")
_cached_value = None
_cached_until = 0

SECRET_ID = os.getenv("THIRDPARTY_SECRET_ID", "thirdparty/weather-api-key")
CACHE_TTL_SECONDS = int(os.getenv("SECRET_CACHE_TTL_SECONDS", "300"))


def _get_api_key() -> str:
    global _cached_value, _cached_until

    now = int(time.time())
    if _cached_value and now < _cached_until:
        return _cached_value

    resp = _secrets_client.get_secret_value(SecretId=SECRET_ID)
    payload = json.loads(resp["SecretString"])

    api_key = payload["api_key"]
    _cached_value = api_key
    _cached_until = now + CACHE_TTL_SECONDS
    return api_key


def lambda_handler(event, context):
    api_key = _get_api_key()

    # Use the key without ever logging it
    results = call_weatherly_api(api_key=api_key, city=event.get("city", "Seville"))

    write_to_dynamodb(results)

    return {
        "status": "ok",
        "items": len(results) if hasattr(results, "__len__") else 1
    }

This snippet is intentionally short. The important part is the pattern:

minimal secret access
controlled cache
zero secret output

If you prefer a library, AWS provides a Secrets Manager caching client for some runtimes, and AWS Lambda Powertools can help with structured logging. Use them if they fit your stack.

Step 6 make security noisy with logs and alarms

Security without visibility is just hope with a nicer font.

At a minimum:

enable CloudTrail in the account
ensure Secrets Manager events are captured
alert on unusual secret access patterns

A simple and practical approach is a CloudWatch metric filter for GetSecretValue events coming from unexpected principals. Another is to build a dashboard showing:

Lambda errors
Secrets Manager throttles
sudden spikes in secret reads

Here is a tiny Terraform example that keeps your Lambda logs from living forever (because storage is forever, but your attention span is not).

resource "aws_cloudwatch_log_group" "lambda_logs" {
  name              = "/aws/lambda/lambda-ingestion-prod"
  retention_in_days = 14
}

Also consider:

IAM Access Analyzer to spot risky resource policies
AWS Config rules or guardrails if your organization uses them
an alarm on unexpected NAT data processing if you intended to keep traffic private

Common mistakes I have made, so you do not have to

I am listing these because I have either done them personally or watched them happen in slow motion.

Using a wildcard secret policy
secretsmanager:GetSecretValue on * feels convenient until it is a breach multiplier.
Putting secret values into environment variables
Environment variables are not evil, but they are easy to leak through debugging, dumps, tooling, or careless logging. Store secret names there, not secret contents.
Retrieving secrets at build time
Build logs live forever in the places you forget to clean. Runtime retrieval keeps secrets out of build systems.
Logging too much while debugging
The fastest way to leak a secret is to print it “just once.” It will not be just once.
Skipping the endpoint and relying on NAT by accident
The NAT gateway is not evil either. It is just an expensive and unnecessary hallway if a private door exists.

A two minute checklist you can steal

Your Lambda uses an IAM execution role, not access keys
The role policy scopes Secrets Manager access to one secret ARN pattern
The secret has a resource policy that only allows the expected role
Secrets are encrypted with KMS when required
The secret value is never stored in code, images, build logs, or environment variables
If Lambda runs in a VPC, you use an interface VPC endpoint for Secrets Manager
You have CloudTrail enabled and you can answer “who accessed this secret” without guessing

Extra thoughts

If you remove long-lived credentials from your applications, you remove an entire class of problems.

You stop rotating keys that should never have existed in the first place.

You stop pretending that “we will remember to clean it up later” is a security strategy.

And you get a calmer life, which is underrated in engineering.

Let IAM handle the secrets you can avoid.

Then let Secrets Manager handle the secrets you cannot.

And let your code do what it was meant to do: process data, not babysit keys like they are a toddler holding a permanent marker.

December 14, 2025 by Fernando SRE Cloud stuff DevOps stuff

Avoiding security gaps by limiting IAM Role permissions

Think about how often we take security for granted. You move into a new apartment and forget to lock the door because nothing bad has ever happened. Then, one day, someone strolls in, helps themselves to your fridge, sits on your couch, and even uses your WiFi. Feels unsettling, right? That’s exactly what happens in AWS when an IAM role is granted far more permissions than it needs, leaving the door wide open for potential security risks.

This is where the principle of least privilege comes in. It’s a fancy way of saying: “Give just enough permissions for the job to get done, and nothing more.” But how do we figure out exactly what permissions an application needs? Enter AWS CloudTrail and Access Analyzer, two incredibly useful tools that help us tighten security without breaking functionality.

The problem of overly generous permissions

Let’s say you have an application running in AWS, and you assign it a role with AdministratorAccess. It can now do anything in your AWS account, from spinning up EC2 instances to deleting databases. Most of the time, it doesn’t even need 90% of these permissions. But if an attacker gets access to that role, you’re in serious trouble.

What we need is a way to see what permissions the application is actually using and then build a custom policy that includes only those permissions. That’s where CloudTrail and Access Analyzer come to the rescue.

Watching everything with CloudTrail

AWS CloudTrail is like a security camera that records every API call made in your AWS environment. It logs who did what, which service they accessed, and when they did it. If you enable CloudTrail for your AWS account, it will capture all activity, giving you a clear picture of which permissions your application uses.

So, the first step is simple: Turn on CloudTrail and let it run for a while. This will collect valuable data on what the application is doing.

Generating a Custom Policy with Access Analyzer

Now that we have a log of the application’s activity, we can use AWS IAM Access Analyzer to create a tailor-made policy instead of guessing. Access Analyzer looks at the CloudTrail logs and automatically generates a policy containing only the permissions that were used.

It’s like watching a security camera playback of who entered your house and then giving house keys only to the people who actually needed access.

Why this works so well

This approach solves multiple problems at once:

Precise permissions: You stop giving unnecessary access because now you know exactly what is needed.
Automated policy generation: Instead of manually writing a policy full of guesswork, Access Analyzer does the heavy lifting.
Better security: If an attacker compromises the role, they get access only to a limited set of actions, reducing damage.
Following best practices: Least privilege is a fundamental rule in cloud security, and this method makes it easy to follow.

Recap

Instead of blindly granting permissions and hoping for the best, enable CloudTrail, track what your application is doing, and let Access Analyzer craft a custom policy. This way, you ensure that your IAM roles only have the permissions they need, keeping your AWS environment secure without unnecessary exposure.

Security isn’t about making things difficult. It’s about making sure that only the right people, and applications, have access to the right things. Just like locking your door at night.

January 31, 2025 by Fernando SRE Cloud stuff SRE stuff