How Does etcd Work in Kubernetes?

Kubernetes has emerged as a dominant player in the container orchestration world, providing robust solutions for managing containerized applications. At the heart of Kubernetes lies etcd, an essential component often compared to the “brain” of the system. This comparison is appropriate, as etcd plays a crucial role in maintaining a Kubernetes cluster’s overall state and health. Understanding how etcd works within Kubernetes is key to grasping the fundamentals of Kubernetes itself.

The Core Function of etcd in Kubernetes

Etcd is a distributed key-value store that serves as the primary data store for Kubernetes. Its main function is to store all the cluster data, such as configuration data, secrets, service discovery information, and the state of all the resources in the cluster. This centralized data store acts as the single source of truth for the entire cluster, ensuring consistency and reliability in the information that Kubernetes needs to operate efficiently.

Cluster Data Storage

In Kubernetes, etcd stores all the persistent data of the cluster. This includes:

  • Cluster configuration: All the configuration settings required to manage the cluster.
  • State of the cluster: Information about all the nodes, pods, services, and other resources.
  • Service discovery: Data that helps in the discovery of services within the cluster.
  • Secrets: Sensitive information like passwords, tokens, and keys.

By acting as the single source of truth, etcd ensures that the cluster’s state is accurately maintained and can be reliably queried and updated as needed.

Consistency and Availability

Etcd achieves strong consistency and high availability through the Raft consensus algorithm. Raft is designed to ensure that even in the presence of failures, etcd maintains a consistent state across all nodes, as long as a majority of them remain healthy. This is crucial for Kubernetes, as it relies on etcd to provide a consistent view of the cluster’s state.

The Raft Consensus Algorithm

Raft works by electing a leader among the etcd nodes, which then manages all write operations. The leader replicates these changes to the follower nodes, ensuring that all nodes have the same data. If the leader fails, a new leader is elected from the followers; because elections require a majority, etcd clusters typically run an odd number of members. This process ensures that etcd remains available and consistent, even in the face of node failures.
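You can observe this yourself with etcdctl: the endpoint status subcommand prints a table whose IS LEADER column shows which member currently holds the leader role. The certificate paths below are placeholders, matching the backup examples later in this section:

ETCDCTL_API=3 etcdctl endpoint status --cluster -w table \
  --cacert=<path-to-cafile> \
  --cert=<path-to-certfile> \
  --key=<path-to-keyfile>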

Interaction with the Kubernetes API

When users or administrators interact with Kubernetes through its API, any changes made to resources (such as creating or modifying pods, services, or deployments) are stored in etcd. The Kubernetes API server communicates directly with etcd to persist these changes. This interaction is fundamental to Kubernetes’ ability to maintain and manage the cluster’s desired state.
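You can see this layout directly. By default, the API server stores its objects under the /registry key prefix in etcd; the values are binary protobuf, so listing the keys is the readable part (certificate paths are placeholders):

ETCDCTL_API=3 etcdctl get /registry/deployments --prefix --keys-only \
  --cacert=<path-to-cafile> \
  --cert=<path-to-certfile> \
  --key=<path-to-keyfile>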

The “Watch” Functionality

One of the powerful features of etcd is its ability to watch for changes in the data it stores. Kubernetes leverages this functionality to detect changes in the cluster’s state quickly and efficiently. When a change occurs, etcd notifies Kubernetes, which can then take appropriate actions to ensure the cluster’s desired state is maintained.
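The same primitive is available from the command line. For example, to stream every change under the pods prefix (certificate paths are placeholders):

ETCDCTL_API=3 etcdctl watch /registry/pods/ --prefix \
  --cacert=<path-to-cafile> \
  --cert=<path-to-certfile> \
  --key=<path-to-keyfile>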

Deployment of etcd in Kubernetes

In a typical Kubernetes setup, etcd is deployed on the control plane nodes. For production environments, it is recommended to use a dedicated etcd cluster. This approach enhances the reliability and availability of etcd, as it reduces the risk of resource contention with other control plane components.

Best Practices for Deployment

  • Dedicated etcd cluster: Ensures high availability and performance.
  • High availability setup: Deploy etcd with multiple nodes, an odd number (typically three or five), so that a quorum survives individual node failures.
  • Regular backups: Ensuring that regular backups of the etcd data are taken to safeguard against data loss.

Security Considerations

Security is a critical aspect of etcd deployment in Kubernetes. Typically, etcd is configured with mutual TLS (mTLS) authentication to secure communication between etcd nodes and between etcd and other Kubernetes components. This ensures that only authenticated and authorized entities can access the sensitive data stored in etcd.
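As a sketch, the relevant etcd flags look like this; in kubeadm-based clusters they appear in etcd’s static-pod manifest, and the file paths are placeholders:

etcd \
  --cert-file=<server-cert> --key-file=<server-key> \
  --trusted-ca-file=<ca-cert> --client-cert-auth \
  --peer-cert-file=<peer-cert> --peer-key-file=<peer-key> \
  --peer-trusted-ca-file=<ca-cert> --peer-client-cert-auth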

Backup and Recovery

Given that etcd contains all the critical data of a Kubernetes cluster, regular backups are essential. In the event of a failure or data corruption, having recent backups allows administrators to restore the cluster to a known good state. Kubernetes provides tools and best practices for performing regular backups of etcd data.

Tools for etcd Backup

Several tools can be used to back up etcd:

  1. etcdctl: This is the official command-line tool for interacting with etcd. It allows you to perform backups and restores with the following commands:

To make a backup:

ETCDCTL_API=3 etcdctl snapshot save <backup-file-path> \
  --endpoints=<etcd-endpoint> \
  --cacert=<path-to-cafile> \
  --cert=<path-to-certfile> \
  --key=<path-to-keyfile>

To restore from a backup:

ETCDCTL_API=3 etcdctl snapshot restore <backup-file-path> \
  --data-dir=<new-data-dir>
  2. Velero: An open-source tool primarily used for backing up and restoring Kubernetes resources, but it can also be configured to back up etcd data. Velero is popular in production environments due to its efficient and automated backup management capabilities.
    • To use Velero with etcd, a specific plugin can be configured to back up etcd data alongside Kubernetes resources.
  3. Kubernetes Operator: Some Kubernetes operators are designed specifically for managing etcd and may include backup and restore functionalities. For example, the etcd-operator by CoreOS (now archived) provided advanced management capabilities for etcd, including automated backups.
  4. Kubernetes CronJobs: CronJobs can be set up in Kubernetes to execute etcdctl commands at regular intervals, automating periodic backups; a sketch of such a CronJob follows.
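Here is a minimal sketch of such a CronJob, reusing the placeholders from the etcdctl examples above. The image, schedule, and hostPath are assumptions to adapt to your environment, and you would also need to mount the certificate files into the container:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *" # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true # assumption: etcd is reachable on the node network
          restartPolicy: OnFailure
          containers:
          - name: backup
            image: <etcd-image> # any image that ships etcdctl
            command:
            - /bin/sh
            - -c
            - >
              ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db
              --endpoints=<etcd-endpoint>
              --cacert=<path-to-cafile>
              --cert=<path-to-certfile>
              --key=<path-to-keyfile>
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            hostPath: # assumption: a node-local directory; prefer a PVC or object storage in production
              path: /var/backups/etcd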

Best Practices for Backup

  • Backup Frequency: Perform regular backups, ideally daily, and before making any significant changes to the cluster.
  • Secure Storage: Store backups in secure and redundant locations, such as cloud storage with appropriate retention policies.
  • Recovery Testing: Periodically test the recovery process to ensure that backups are valid and can be restored correctly.

By incorporating these practices and tools, administrators can ensure that critical etcd data is protected and can be effectively restored in the event of a disaster.

Performance Characteristics

Etcd is designed to handle high volumes of write operations, making it well-suited for the dynamic nature of Kubernetes clusters. It can manage thousands of writes per second, ensuring that even in large-scale deployments, etcd can keep up with the demands of the cluster.

End Note

Etcd acts as the brain of Kubernetes, storing and managing all the critical information about the cluster. Its distributed, consistent, and highly available design makes it an ideal choice for this role. By understanding how etcd works and its importance in the Kubernetes ecosystem, administrators and developers can better appreciate the robustness and reliability of Kubernetes, ensuring smooth and efficient operation even at scale.

The Power of Event-Driven Scaling in Kubernetes: KEDA

Kubernetes is a compelling platform for managing containerized applications but can be complex. One area where Kubernetes shines is its ability to scale applications based on demand. However, traditional scaling methods in Kubernetes might not always be the most efficient, especially when dealing with event-driven workloads. This is where KEDA (Kubernetes Event-Driven Autoscaling) comes into play.

What is KEDA?

KEDA stands for Kubernetes Event-Driven Autoscaling. It is an open-source component that allows Kubernetes to scale applications based on events. This means that instead of only scaling your applications based on metrics like CPU or memory usage, you can scale them based on specific events or external metrics such as the number of messages in a queue, the rate of requests to an endpoint, or custom metrics from various sources.

Key Features and Functionalities

  1. Event-Driven Scaling: KEDA enables scaling based on the number of events that need to be processed, rather than just CPU or memory metrics.
  2. Lightweight Component: KEDA is designed to be a lightweight addition to your Kubernetes cluster, ensuring it doesn’t interfere with other components.
  3. Flexibility: It integrates seamlessly with Kubernetes’ Horizontal Pod Autoscaler (HPA), extending its functionality without overwriting or duplicating it.
  4. Built-In Scalers: KEDA comes with over 50 built-in scalers for various platforms, including cloud services, databases, messaging systems, telemetry systems, CI/CD tools, and more.
  5. Support for Multiple Workloads: It can scale various types of workloads, including deployments, jobs, and custom resources.
  6. Scaling to Zero: KEDA allows scaling down to zero pods when there are no events to process, optimizing resource usage and reducing costs.
  7. Extensibility: You can use community-maintained or custom scalers to support unique event sources.
  8. Provider-Agnostic: KEDA supports event triggers from a wide range of cloud providers and products.
  9. Azure Functions Integration: It allows you to run and scale Azure Functions in Kubernetes for production workloads.
  10. Resource Optimization: KEDA helps build sustainable platforms by optimizing workload scheduling and scaling to zero when not needed.

Advantages of Using KEDA

  1. Efficiency: By scaling based on actual events, KEDA ensures that your application only uses the resources it needs, improving efficiency and potentially reducing costs.
  2. Flexibility: With support for a wide range of event sources and integration with HPA, KEDA provides a flexible scaling solution.
  3. Simplicity: It simplifies the configuration of event-driven scaling in Kubernetes, abstracting the complexities of integrating different event sources.
  4. Seamless Integration: KEDA works well with existing Kubernetes components and can be easily integrated into your current infrastructure.

Optimizing a Retail Application

Imagine you are managing an online retail application. During normal hours, traffic is relatively steady, but during sales events, the number of orders can spike dramatically. Here’s how KEDA can help:

  1. Order Processing: Your application uses a message queue to handle order processing. Normally, the queue has a manageable number of messages, but during a sale, the number of messages can skyrocket.
  2. Scaling with KEDA: KEDA can monitor the message queue and automatically scale the order processing service based on the number of messages. As more orders come in, additional instances of the service are started to handle the load, preventing delays and improving the customer experience (see the sketch after this list).
  3. Cost Management: Once the sale is over and the message count drops, KEDA will scale down the service, ensuring that you are not paying for unused resources.
  4. Scaling to Zero: When there are no orders to process, KEDA can scale the order processing service down to zero pods, further reducing costs.
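Here is a minimal sketch of what this looks like as a KEDA ScaledObject. The namespace, Deployment name, and RabbitMQ trigger are illustrative assumptions; substitute whichever queue system your shop actually uses:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
  namespace: shop # hypothetical namespace
spec:
  scaleTargetRef:
    name: order-processor # the Deployment that consumes the order queue
  minReplicaCount: 0 # scale to zero when the queue is empty
  maxReplicaCount: 30
  triggers:
  - type: rabbitmq # assumption: orders arrive on a RabbitMQ queue
    metadata:
      queueName: orders
      mode: QueueLength
      value: "20" # aim for roughly 20 messages per replica
      hostFromEnv: RABBITMQ_HOST # connection string taken from the workload's environment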

In a few words

KEDA is a powerful tool that brings the benefits of event-driven scaling to Kubernetes. Its ability to scale applications based on events makes it an ideal choice for dynamic workloads. By integrating with a variety of event sources and providing a simple yet flexible way to configure scaling, KEDA helps optimize resource usage, enhance performance, and manage costs effectively. Whether you’re running an e-commerce platform, processing data streams, or managing microservices, KEDA can help ensure your applications are always running efficiently.

In essence, KEDA is about making your applications responsive to real-world events, ensuring they are always ready to meet demand without wasting resources. It’s a valuable addition to any Kubernetes toolkit, offering a smarter, more efficient way to handle scaling.

Storage Classes in Kubernetes: Let’s Manage Persistent Data

One essential aspect in Kubernetes is how to handle persistent storage, and this is where Kubernetes Storage Classes come into play. In this article, we’ll explore what Storage Classes are, their key components, and how to use them effectively with practical examples.
If you’re working with applications that need to store data persistently (like databases, file systems, or even just configuration files), you’ll want to understand how these work.

What is a Storage Class?

Imagine you’re running a library (that’s our Kubernetes cluster). Now, you need different types of shelves for different kinds of books, some for heavy encyclopedias, some for delicate rare books, and others for popular paperbacks. In Kubernetes, Storage Classes are like these different types of shelves. They define the types of storage available in your cluster.

Storage Classes allow you to dynamically provision storage resources based on the needs of your applications. It’s like having a librarian who can create the perfect shelf for each book as soon as it arrives.

Key Components of a Storage Class

Let’s break down the main parts of a Storage Class:

  1. Provisioner: This is the system that will create the actual storage. It’s like our librarian who creates the shelves.
  2. Parameters: These are specific instructions for the provisioner. For example, “Make this shelf extra sturdy” or “This shelf should be fireproof”.
  3. Reclaim Policy: This determines what happens to the storage when it’s no longer needed. Do we keep the shelf (Retain) or dismantle it (Delete)?
  4. Volume Binding Mode: This decides when the actual storage is created. It’s like choosing between having shelves ready in advance or building them only when a book arrives.

Creating a Storage Class

Now, let’s create our first Storage Class. We’ll use AWS EBS (Elastic Block Store) as an example. Don’t worry if you’re unfamiliar with AWS; the concepts are similar for other cloud providers.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

Let’s break this down:

  • name: fast-storage: This is the name we’re giving our Storage Class.
  • provisioner: ebs.csi.aws.com: This tells Kubernetes to use the AWS EBS CSI driver to create the storage.
  • parameters: type: gp3: This specifies that we want to use gp3 EBS volumes, which are a type of fast SSD storage in AWS.
  • reclaimPolicy: Delete: This means the storage will be deleted when it’s no longer needed.
  • volumeBindingMode: WaitForFirstConsumer: This tells Kubernetes to wait until a Pod actually needs the storage before creating it.

Using a Storage Class

Now that we have our Storage Class, how do we use it? We use it when creating a Persistent Volume Claim (PVC). A PVC is like a request for storage from an application.

Here’s an example of a PVC that uses our Storage Class:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-storage
  resources:
    requests:
      storage: 5Gi

Let’s break this down too:

  • name: my-app-storage: This is the name of our PVC.
  • accessModes: ReadWriteOnce: This means a single node can mount the storage as read-write.
  • storageClassName: fast-storage: This is where we specify which Storage Class to use; it matches the name we gave our Storage Class earlier.
  • storage: 5Gi: This is requesting 5 gigabytes of storage.
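To actually consume this claim, a Pod references it by name in its volumes section. Here is a minimal sketch (the image and mount path are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: nginx # stand-in image for illustration
    volumeMounts:
    - name: data
      mountPath: /var/lib/data # where the volume appears inside the container
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-storage # matches the PVC name above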

Real-World Use Case

Let’s imagine we’re running a photo-sharing application. We need fast storage for the database that stores user information and slower, cheaper storage for the actual photos.

We could create two Storage Classes:

  1. A “fast-storage” class (like the one we created above) for the database.
  2. A “bulk-storage” class for the photos, perhaps using a different type of EBS volume that’s cheaper but slower (see the sketch after this list).
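As a sketch, the “bulk-storage” class could look like the following, assuming st1 (throughput-optimized HDD) volumes as the cheaper tier:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: bulk-storage
provisioner: ebs.csi.aws.com
parameters:
  type: st1 # assumption: throughput-optimized HDD, cheaper per gigabyte than gp3
reclaimPolicy: Retain # keep the photo volumes even if a claim is deleted
volumeBindingMode: WaitForFirstConsumer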

Then, we’d create two PVCs (Persistent Volume Claims), one for each Storage Class. Our database Pod would use the PVC with the “fast-storage” class, while our photo storage Pod would use the PVC with the “bulk-storage” class.

This way, we’re optimizing our storage usage (and costs) based on the needs of different parts of our application.

In Summary

Storage Classes in Kubernetes provide a flexible and powerful way to manage different types of storage for your applications. By understanding and using Storage Classes, you can ensure your applications have the storage they need while keeping your infrastructure efficient and cost-effective.

Whether you’re working with AWS EBS, Google Cloud Persistent Disk, or any other storage backend, Storage Classes are an essential tool in your Kubernetes toolkit.

Understanding Kubernetes Network Policies: A Friendly Guide

In Kubernetes, effectively managing communication between different parts of your application is crucial for security and efficiency. That’s where Network Policies come into play. In this article, we’ll explore what Kubernetes Network Policies are, how they work, and provide some practical examples using YAML files. We’ll break it down in simple terms. Let’s go for it!

What are Kubernetes Network Policies?

Kubernetes Network Policies are rules that define how groups of Pods (the smallest deployable units in Kubernetes) can interact with each other and with other network endpoints. These policies allow or restrict traffic based on several factors, such as namespaces, labels, and ports.

Key Concepts

Network Policy

A Network Policy specifies the traffic rules for Pods. It can control both incoming (Ingress) and outgoing (Egress) traffic. Think of it as a security guard that only lets certain types of traffic in or out based on predefined rules.

Selectors

Selectors are used to choose which Pods the policy applies to. They can be based on labels (key-value pairs assigned to Pods), namespaces, or both. This flexibility allows for precise control over traffic flow.

Ingress and Egress Rules

  • Ingress Rules: These control incoming traffic to Pods. They define what sources can send traffic to the Pods and under what conditions.
  • Egress Rules: These control outgoing traffic from Pods. They specify what destinations the Pods can send traffic to and under what conditions.

Practical Examples with YAML

Let’s look at some practical examples to understand how Network Policies are defined and applied in Kubernetes.

Example 1: Allow Ingress Traffic from Specific Pods

Suppose we have a database Pod that should only receive traffic from application Pods labeled role=app. Here’s how we can define this policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-db
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: db
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: app

In this example:

  • podSelector selects Pods with the label role=db.
  • ingress rule allows traffic from Pods with the label role=app.

Example 2: Deny All Ingress Traffic

If you want to ensure that no Pods can communicate with a particular group of Pods, you can define a policy to deny all ingress traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: sensitive
  ingress: []

In this other example:

  • podSelector selects Pods with the label role=sensitive.
  • An empty ingress rule (ingress: []) means no traffic is allowed in.
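A common companion to this pattern is a namespace-wide default deny. An empty podSelector ({}) matches every Pod in the namespace, and listing Ingress under policyTypes with no ingress rules denies all incoming traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {} # an empty selector matches every Pod in the namespace
  policyTypes:
  - Ingress # no ingress rules are listed, so all incoming traffic is denied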

Example 3: Allow Egress Traffic to Specific External IPs

Now, let’s say we have a Pod that needs to send traffic to a specific external service, such as a payment gateway. We can define an egress policy for this:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-external
  namespace: default
spec:
  podSelector:
    matchLabels:
      role: payment-client
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 203.0.113.0/24
    ports:
    - protocol: TCP
      port: 443

In this last example:

  • podSelector selects Pods with the label role=payment-client.
  • policyTypes: [Egress] limits the policy to outgoing traffic. Without it, Kubernetes would also treat this as an ingress policy (ingress is assumed by default) and, with no ingress rules defined, block all incoming traffic to these Pods.
  • egress rule allows traffic to the external IP range 203.0.113.0/24 on port 443 (the standard HTTPS port). Keep in mind that once a Pod is selected by an egress policy, all other outbound traffic, including DNS lookups on port 53, is blocked unless explicitly allowed.

In Summary

Kubernetes Network Policies are powerful tools that help you control traffic flow within your cluster. You can create a secure and efficient network environment for your applications by using selectors and defining ingress and egress rules.
I hope this guide has demystified the concept of Network Policies and shown you how to implement them with practical examples. Remember, the key to mastering Kubernetes is practice, so try out these examples and see how they can enhance your deployments.

Understanding Kubernetes Garbage Collection

How Kubernetes Garbage Collection Works

Kubernetes is an open-source platform designed to automate the deployment, scaling, and operation of application containers. One essential feature of Kubernetes is garbage collection, a process that helps manage and clean up unused or unnecessary resources within a cluster. But how does this work?

Kubernetes garbage collection resembles a janitor who cleans up behind the scenes. It automatically identifies and removes resources that are no longer needed, such as old pods, completed jobs, and other transient data. This helps keep the cluster efficient and prevents it from running out of resources.

Key Concepts:

  1. Pods: The smallest and simplest Kubernetes object. A pod represents a single instance of a running process in your cluster.
  2. Controllers: Ensure that the cluster is in the desired state by managing pods, replica sets, deployments, etc.
  3. Garbage Collection: Removes objects that are no longer referenced or needed, similar to how a computer’s garbage collector frees up memory.
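Much of this reference tracking is driven by owner references: dependent objects record their owner in their metadata, and when the owner is deleted, the garbage collector cascades the deletion to the dependents. Below is a trimmed sketch of what Kubernetes sets automatically on a ReplicaSet created by a Deployment; the name suffix and uid are illustrative placeholders:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: example-deployment-5d59d67564 # the hash suffix is illustrative
  ownerReferences:
  - apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
    uid: d9607e19-f88f-11e6-a518-42010a800195 # placeholder UID
    controller: true
    blockOwnerDeletion: true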

How It Helps

Garbage collection in Kubernetes plays a crucial role in maintaining the health and efficiency of your cluster:

  1. Resource Management: By cleaning up unused resources, it ensures that your cluster has enough capacity to run new and existing applications smoothly.
  2. Cost Efficiency: Reduces the cost associated with maintaining unnecessary resources, especially in cloud environments where you pay for what you use.
  3. Improved Performance: Keeps your cluster performant by avoiding resource starvation and ensuring that the nodes are not overwhelmed with obsolete objects.
  4. Simplified Operations: Automates routine cleanup tasks, reducing the manual effort needed to maintain the cluster.

Setting Up Kubernetes Garbage Collection

Setting up garbage collection in Kubernetes involves configuring various aspects of your cluster. Below are the steps to set up garbage collection effectively:

1. Configure Pod Garbage Collection

Pod garbage collection automatically removes terminated pods once their number exceeds a threshold, freeing up resources. Note that this is not configured on the Node object; it is controlled by the --terminated-pod-gc-threshold flag of the kube-controller-manager (default: 12500 terminated pods).

Example (a trimmed sketch of a kubeadm-style static-pod manifest; the file path and surrounding fields vary by setup):

# /etc/kubernetes/manifests/kube-controller-manager.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --terminated-pod-gc-threshold=1000 # start deleting terminated pods once more than 1000 exist

2. Set Up TTL for Finished Resources

The TTL (Time To Live) controller helps manage finished resources such as completed or failed jobs by setting a lifespan for them.

Example YAML:

apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  ttlSecondsAfterFinished: 3600 # Deletes the job 1 hour after completion
  template:
    spec:
      containers:
      - name: example
        image: busybox
        command: ["echo", "Hello, Kubernetes!"]
      restartPolicy: Never

3. Configure Deployment Garbage Collection

Deployment garbage collection manages the history of deployments, removing old replicas to save space and resources.

Example YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  revisionHistoryLimit: 3 # Keeps the latest 3 revisions and deletes the rest
  replicas: 2
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2

Pros and Cons of Kubernetes Garbage Collection

Pros:

  • Automated Cleanup: Reduces manual intervention by automatically managing and removing unused resources.
  • Resource Efficiency: Frees up cluster resources, ensuring they are available for active workloads.
  • Cost Savings: Helps in reducing costs, especially in cloud environments where resource usage is directly tied to expenses.

Cons:

  • Configuration Complexity: Requires careful configuration to ensure critical resources are not inadvertently deleted.
  • Monitoring Needs: Regular monitoring is necessary to ensure the garbage collection process is functioning as intended and not impacting active workloads.

In Summary

Kubernetes garbage collection is a vital feature that helps maintain the efficiency and health of your cluster by automatically managing and cleaning up unused resources. By understanding how it works, how it benefits your operations, and how to set it up correctly, you can ensure your Kubernetes environment remains optimized and cost-effective.

Implementing garbage collection involves configuring pod, TTL, and deployment garbage collection settings, each serving a specific role in the cleanup process. While it offers significant advantages, balancing these with the potential complexities and monitoring requirements is essential to achieve the best results.

Understanding Kubernetes RBAC: Safeguarding Your Cluster

Role-Based Access Control (RBAC) stands as a cornerstone for securing and managing access within the Kubernetes ecosystem. Think of Kubernetes as a bustling city, with myriad services, pods, and nodes acting like different entities within it. Just like a city needs a comprehensive system to manage who can access what – be it buildings, resources, or services – Kubernetes requires a robust mechanism to control access to its numerous resources. This is where RBAC comes into play.

RBAC is not just a security feature; it’s a fundamental framework that helps maintain order and efficiency in Kubernetes’ complex environments. It’s akin to a sophisticated security system, ensuring that only authorized individuals have access to specific areas, much like keycard access in a high-security building. In Kubernetes, these “keycards” are roles and permissions, meticulously defined and assigned to users or groups.

This system is vital in a landscape where operations are distributed and responsibilities are segmented. RBAC allows granular control over who can do what, which is crucial in a multi-tenant environment. Without RBAC, managing permissions would be akin to leaving the doors of a secure facility unlocked, potentially leading to unauthorized access and chaos.

At its core, Kubernetes RBAC revolves around a few key concepts: defining roles with specific permissions, assigning these roles to users or groups, and ensuring that access rights are precisely tailored to the needs of the cluster. This ensures that operations within the Kubernetes environment are not only secure but also efficient and streamlined.

By embracing RBAC, organizations step into a realm of enhanced security, where access is not just controlled but intelligently managed. It’s a journey from a one-size-fits-all approach to a customized, role-based strategy that aligns with the diverse and dynamic needs of Kubernetes clusters. In the following sections, we’ll delve deeper into the intricacies of RBAC, unraveling its layers and revealing how it fortifies Kubernetes environments against security threats while facilitating smooth operational workflows.

User Accounts vs. Service Accounts in RBAC: A unique aspect of Kubernetes RBAC is its distinction between user accounts (human users or groups) and service accounts (software resources). This broad approach to defining “subjects” in RBAC policies is different from many other systems that primarily focus on human users.

Flexible Resource Definitions: RBAC in Kubernetes is notable for its flexibility in defining resources, which can include pods, logs, ingress controllers, or custom resources. This is in contrast to more restrictive systems that manage predefined resource types.

Roles and ClusterRoles: RBAC differentiates between Roles, which are namespace-specific, and ClusterRoles, which apply to the entire cluster. This distinction allows for more granular control of permissions within namespaces and broader control at the cluster level.

  • Role Example: A Role in the “default” namespace granting read access to pods:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
  • ClusterRole Example: A ClusterRole granting read access to secrets across all namespaces:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: secret-reader
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "watch", "list"]

Managing Permissions with Verbs:

In Kubernetes RBAC, the concept of “verbs” is pivotal to how access controls are defined and managed. These verbs are essentially the actions that can be performed on resources within the Kubernetes environment. Unlike traditional access control systems that may offer a binary allow/deny model, Kubernetes RBAC verbs introduce a nuanced and highly granular approach to defining permissions.

Understanding Verbs in RBAC:

  1. Core Verbs:
    • Get: Allows reading a specific resource.
    • List: Permits listing all instances of a resource.
    • Watch: Enables watching changes to a particular resource.
    • Create: Grants the ability to create new instances of a resource.
    • Update: Provides permission to modify existing resources.
    • Patch: Similar to update, but for making partial changes.
    • Delete: Allows the removal of specific resources.
  2. Extended Verbs:
    • Bind: Allows a subject to create bindings that reference a given role; it is used to control privilege escalation.
    • Escalate: Permits creating or updating a role with more permissions than the subject itself holds.
    • Impersonate: Allows acting as another user, group, or ServiceAccount.

Note that executing commands in a container (kubectl exec) is not a standalone verb: it is granted with the create verb on the pods/exec subresource.

Practical Application of Verbs:

The power of verbs in RBAC lies in their ability to define precisely what a user or a service account can do with each resource. For example, a role that includes the “get,” “list,” and “watch” verbs for pods would allow a user to view pods and receive updates about changes to them but would not permit the user to create, update, or delete pods.

Customizing Access with Verbs:

This system allows administrators to tailor access rights at a very detailed level. For instance, in a scenario where a team needs to monitor deployments but should not change them, their role can include verbs like “get,” “list,” and “watch” for deployments, but exclude “create,” “update,” or “delete.”
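As a sketch, such a read-only monitoring role could look like this (the namespace and role name are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default # assumption: the team works in the "default" namespace
  name: deployment-viewer
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"] # read-only: no create, update, or delete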

Flexibility and Security:

This flexibility is crucial for maintaining security in a Kubernetes environment. By assigning only the necessary permissions, administrators can adhere to the principle of least privilege, reducing the risk of unauthorized access or accidental modifications.

Verbs and Scalability:

Moreover, verbs in Kubernetes RBAC make the system scalable. As the complexity of the environment grows, administrators can continue to manage permissions effectively by defining roles with the appropriate combination of verbs, tailored to the specific needs of users and services.

RBAC Best Practices: Implementing RBAC effectively involves understanding and applying best practices, such as ensuring least privilege, regularly auditing and reviewing RBAC settings, and understanding the implications of role bindings within and across namespaces.

Real-World Use Case: Imagine a scenario where an organization needs to limit developers’ access to specific namespaces for deploying applications while restricting access to other cluster areas. By defining appropriate Roles and RoleBindings, Kubernetes RBAC allows precise control over what developers can do, significantly enhancing both security and operational efficiency.

The Synergy of RBAC and ServiceAccounts in Kubernetes Security

In the realm of Kubernetes, RBAC is not merely a feature; it’s the backbone of access management, playing a crucial role in maintaining a secure and efficient operation. However, to fully grasp the essence of Kubernetes security, one must understand the synergy between RBAC and ServiceAccounts.

Understanding ServiceAccounts:

ServiceAccounts in Kubernetes are pivotal for automating processes within the cluster. They are special kinds of accounts used by applications and pods, as opposed to human operators. Think of ServiceAccounts as robot users – automated entities performing specific tasks in the Kubernetes ecosystem. These tasks range from running a pod to managing workloads or interacting with the Kubernetes API.

The Role of ServiceAccounts in RBAC:

Where RBAC is the rulebook defining what can be done, ServiceAccounts are the players acting within those rules. RBAC policies can be applied to ServiceAccounts, thereby regulating the actions these automated players can take. For example, a ServiceAccount tied to a pod can be granted permissions through RBAC to access certain resources within the cluster, ensuring that the pod operates within the bounds of its designated privileges.

Integrating ServiceAccounts with RBAC:

Integrating ServiceAccounts with RBAC allows Kubernetes administrators to assign specific roles to automated processes, thereby providing a nuanced and secure access control system. This integration ensures that not only are human users regulated, but also that automated processes adhere to the same stringent security protocols.

Practical Applications. The CI/CD Pipeline:

In a Continuous Integration and Continuous Deployment (CI/CD) pipeline, tasks like code deployment, automated testing, and system monitoring are integral. These tasks are often automated and run within the Kubernetes environment. The challenge lies in ensuring these automated processes have the necessary permissions to perform their functions without compromising the security of the Kubernetes cluster.

Role of ServiceAccounts:

  1. Automated Task Execution: ServiceAccounts are perfect for CI/CD pipelines. Each part of the pipeline, be it a deployment process or a testing suite, can have its own ServiceAccount. This ensures that the permissions are tightly scoped to the needs of each task.
  2. Specific Permissions: For instance, a ServiceAccount for a deployment tool needs permissions to update pods and services, while a monitoring tool’s ServiceAccount might only need to read pod metrics and log data.
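As a sketch, the pipeline’s identity is simply a ServiceAccount that its Pods run under. The manifests below use the same account name as the RoleBinding example later in this section; the Pod image and command are placeholders:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: deployment-service-account # matches the RoleBinding shown later in this section
  namespace: deployment
---
apiVersion: v1
kind: Pod
metadata:
  name: deploy-runner
  namespace: deployment
spec:
  serviceAccountName: deployment-service-account # the Pod talks to the API as this account
  containers:
  - name: runner
    image: bitnami/kubectl # assumption: any image containing the tooling your pipeline needs
    command: ["sleep", "3600"]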

Applying RBAC for Fine-Grained Control:

  • Defining Roles: With RBAC, specific roles can be created for different stages of the CI/CD pipeline. These roles define precisely what operations are permissible by the ServiceAccount associated with each stage.
  • Example Role for Deployment: A role for the deployment stage may include verbs like ‘create’, ‘update’, and ‘delete’ for resources such as pods and deployments.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: deployment
  name: deployment-manager
rules:
- apiGroups: ["apps", ""]
  resources: ["deployments", "pods"]
  verbs: ["create", "update", "delete"]
  • Binding Roles to ServiceAccounts: Each role is then bound to the appropriate ServiceAccount, ensuring that the permissions align with the task’s requirements.
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: deployment-manager-binding
  namespace: deployment
subjects:
- kind: ServiceAccount
  name: deployment-service-account
  namespace: deployment
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io
  • Isolation and Security: This setup not only isolates each task’s permissions but also minimizes the risk of a security breach. If a part of the pipeline is compromised, the attacker has limited permissions, confined to a specific role and namespace.

Enhancing CI/CD Security:

  1. Least Privilege Principle: The principle of least privilege is effectively enforced. Each ServiceAccount has only the permissions necessary to perform its designated task, nothing more.
  2. Audit and Compliance: The explicit nature of RBAC roles and ServiceAccount bindings makes it easier to audit and ensure compliance with security policies.
  3. Streamlined Operations: Administrators can manage and update permissions as the pipeline evolves, ensuring that the CI/CD processes remain efficient and secure.

The Harmony of Automation and Security:

In conclusion, the combination of RBAC and ServiceAccounts forms a harmonious balance between automation and security in Kubernetes. This synergy ensures that every action, whether performed by a human or an automated process, is under the purview of meticulously defined permissions. It’s a testament to Kubernetes’ foresight in creating an ecosystem where operational efficiency and security go hand in hand.