Helm

Your Kubernetes rollback is lying

The PagerDuty alert screams. The new release, born just minutes ago with such promising release notes, is coughing up blood in production. The team’s Slack channel is a frantic mess of flashing red emojis. Someone, trying to sound like the adult in the room, yells the magic word: “ROLLBACK!”

And so, Helm, our trusty tow-truck operator, rides in with a smile, waving its friendly green check marks. The dashboards, those silent accomplices, beam with the serene glow of healthy metrics. Kubernetes probes, ever so polite, confirm that the resurrected pods are, in fact, “breathing.”

Then, production face-plants. Hard.

The feeling is like putting a cartoon-themed bandage on a burst water pipe and then wondering, with genuine surprise, why the living room has become a swimming pool. This article is the autopsy of those “perfect” rollbacks. We’re going to uncover why your monitoring is a pathological liar, how network traffic becomes a double agent, and what to do so that the next time Helm gives you a thumbs-up, you can actually believe it.

A state that refuses to time-travel

The first, most brutal lie a rollback tells you is that it can turn back time. A helm rollback is like the “rewind” button on an old VCR remote; it diligently rewinds the tape (your YAML manifests), but it has absolutely no power to make the actors on screen younger.

Your application’s state is one of those stubborn actors.

While your ConfigMaps and Secrets might dutifully revert to their previous versions, your data lives firmly in the present. If your new release included a database migration that added a column, rolling back the application code doesn’t magically make that column disappear. Now your old code is staring at a database schema from the future, utterly confused, like a medieval blacksmith being handed an iPad.

The same goes for PersistentVolumeClaims, external caches like Redis, or messages sitting in a Kafka queue. The rollback command whispers sweet nothings about returning to a “known good state,” but it’s only talking about itself. The rest of your universe has moved on, and it refuses to travel back with you.
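
To make this concrete, here is a minimal sketch of a common pattern: running migrations from a Helm pre-upgrade hook. The Job name, image, and migrate command are all hypothetical. Notice the asymmetry: a helm rollback reverts the Deployment manifest, but nothing ever runs a corresponding “migrate down,” so the schema stays in the future.

# Hypothetical migration Job, fired as a Helm pre-upgrade hook.
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-api-migrate
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: checkout-api-migrations:2.0.0 # hypothetical image
          command: ["./migrate", "up"] # adds the column; nothing ever runs "down"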

The overly polite doorman

The second culprit in our investigation is the Kubernetes probe. Think of the readinessProbe as an overly polite doorman at a fancy party. Its job is to check if a guest (your pod) is ready to enter. But its definition of “ready” can be dangerously optimistic.

Many applications, especially those running on the JVM, have what we’ll call a “warming up” period. When a pod starts, the process is running, the HTTP port is open, and it will happily respond to a simple /health check. The doorman sees a guest in a tuxedo and says, “Looks good to me!” and opens the door.

What the doorman doesn’t see is that this guest is still stretching, yawning, and trying to remember where they are. The application’s caches are cold, its connection pools are empty, and its JIT compiler is just beginning to think about maybe, possibly, optimizing some code. The first few dozen requests it receives will be painfully slow or, worse, time out completely.

So while your readinessProbe is giving you a green light, your first wave of users is getting a face full of errors. For these sleepy applications, you need a more rigorous bouncer.

A startupProbe is that bouncer. It gives the app a generous amount of time to get its act together before even letting the doorman (readiness and liveness probes) start their shift.

# The startupProbe gives our sleepy JVM app up to 5 minutes to wake up
# before the other probes start their shift.
startupProbe:
  httpGet:
    path: /health/ready
    port: 8080
  # Kubelet will try 30 times with a 10-second interval (300 seconds).
  # If the app isn't ready by then, the pod will be restarted.
  failureThreshold: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20

Without it, your rollback creates a fleet of pods that are technically alive but functionally useless, and Kubernetes happily sends them a flood of unsuspecting users.

Traffic, the double agent

And that brings us to our final suspect: the network traffic itself. In a modern setup using a service mesh like Istio or Linkerd, traffic routing is a sophisticated dance. But even the most graceful dancer can trip.

When you roll back, a new ReplicaSet is created with the old pod specification. The service mesh sees these new pods starting up, asks the doorman (readinessProbe) if they’re good to go, gets an enthusiastic “yes!”, and immediately starts sending them a percentage of live production traffic.

This is where all our problems converge. Your service mesh, in its infinite efficiency, has just routed 50% of your user traffic to a platoon of sleepy, confused pods that are trying to talk to a database from the future.

Let’s look at the evidence. This VirtualService, which we now call “The 50/50 Disaster Splitter,” was routing traffic with criminal optimism.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: v1-stable
          weight: 50 # 50% to the (theoretically) working pods
        - destination:
            host: checkout-api-svc
            subset: v1-rollback
          weight: 50 # 50% to the pods we just dragged from the past

The service mesh isn’t malicious. It’s just an incredibly efficient tool that is very good at following bad instructions. It sees a green light and hits the accelerator.

A survival guide that won’t betray you

So, you’re in the middle of a fire, and the “break glass in case of emergency” button is a lie. What do you do? You need a playbook that acknowledges reality.

Step 0: Breathe and isolate the blast radius

Before you even think about rolling back, stop the bleeding. The fastest way to do that is often at the traffic level. Use your service mesh or ingress controller to immediately shift 100% of traffic back to the last known good version. Don’t wait for new pods to start. This is a surgical move that takes seconds and gives you breathing room.
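
In Istio terms, stopping the bleeding can be a single weight change. Here is a sketch in the spirit of the VirtualService from earlier; the host is the same, but the subset names are illustrative and assume matching DestinationRule subsets exist.

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: checkout-api-vs
  namespace: prod-eu-central
spec:
  hosts:
    - "checkout.api.internal"
  http:
    - route:
        - destination:
            host: checkout-api-svc
            subset: last-known-good
          weight: 100 # all traffic back to safety, in seconds
        - destination:
            host: checkout-api-svc
            subset: suspect-release
          weight: 0 # the suspects get nothing while we investigate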

Step 1: Declare an incident and gather the detectives

Get the right people on a call. Announce that this is not a “quick rollback” but an incident investigation. Your goal is to understand why the release failed, not just to hit the undo button.

Step 2: Perform the autopsy (while the system is stable)

With traffic safely routed away from the wreckage, you can now investigate. Check the logs of the failed pods. Look at the database. Is there a schema mismatch? A bad configuration? This is where you find the real killer.
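
A starting kit for that investigation, reusing the release and namespace names from earlier (the app=checkout-api label is an assumption; adjust to your setup):

# What did the failed pods say before they died?
kubectl logs deploy/checkout-api -n prod-eu-central --previous

# Why were pods restarted, killed, or rescheduled?
kubectl describe pod -l app=checkout-api -n prod-eu-central
kubectl get events -n prod-eu-central --sort-by=.lastTimestamp

# What exactly did the failed release change?
helm history checkout-api -n prod-eu-central
helm get values checkout-api -n prod-eu-central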

Step 3: Plan the counter-offensive (which might not be a rollback)

Sometimes, the safest path forward is a roll forward. A small hotfix that corrects the issue might be faster and less risky than trying to force the old code to work with a new state. A rollback should be a deliberate, planned action, not a panic reflex. If you must roll back, do it with the knowledge you’ve gained from your investigation.

Step 4: The deliberate, cautious rollback

If you’ve determined a rollback is the only way, do it methodically.

  1. Scale down the broken deployment:
    kubectl scale deployment/checkout-api --replicas=0 -n prod-eu-central
  2. Execute the Helm rollback:
    helm rollback checkout-api 1 -n prod-eu-central
  3. Watch the new pods like a hawk: Monitor their logs and key metrics as they come up. Don’t trust the green check marks.
  4. Perform a canary release: Once the new pods look genuinely healthy, use your service mesh to send them 1% of the traffic. Then 10%. Then 50%. Then 100%. You are now in control, not the blind optimism of the automation; a sketch follows this list.
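
A sketch of that dial: the http section of the VirtualService from earlier, with the same illustrative subset names. Bump the rollback subset’s weight only when the metrics stay boring.

http:
  - route:
      - destination:
          host: checkout-api-svc
          subset: v1-stable
        weight: 99
      - destination:
          host: checkout-api-svc
          subset: v1-rollback
        weight: 1 # start the resurrected pods on a 1% diet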

The truth will set you free

A Kubernetes rollback isn’t a time machine. It’s a YAML editor with a fancy title. It doesn’t understand your data, it doesn’t appreciate your app’s need for a morning coffee, and it certainly doesn’t grasp the nuances of traffic routing under pressure.

Treating a rollback as a simple, safe undo button is the fastest way to turn a small incident into a full-blown outage. By understanding the lies it tells, you can build a process that trusts human investigation over deceptive green lights. So the next time a deployment goes sideways, don’t just reach for the rollback lever. Reach for your detective’s hat instead.

Helm or Kustomize for deploying to Kubernetes?

Choosing the right tool for continuous deployments is a big decision. It’s like picking the right vehicle for a road trip. Do you go for the thrill of a sports car or the reliability of a sturdy truck? In our world, the “cargo” is your application, and we want to ensure it reaches its destination smoothly and efficiently.

Two popular tools for this task are Helm and Kustomize. Both help you manage and deploy applications on Kubernetes, but they take different approaches. Let’s dive in, explore how they work, and help you decide which one might be your ideal travel buddy.

What is Helm?

Imagine Helm as a Kubernetes package manager, similar to apt or yum if you’ve worked with Linux before. It bundles all your application’s Kubernetes resources (like deployments, services, etc.) into a neat Helm chart package. This makes installing, upgrading, and even rolling back your application straightforward.

Think of a Helm chart as a blueprint for your application’s desired state in Kubernetes. Instead of manually configuring each element, you have a pre-built plan that tells Kubernetes exactly how to construct your environment. Helm provides a command-line tool, helm, to create these charts. You can start with a basic template and customize it to suit your needs, like a pre-fabricated house that you can modify to match your style. Here’s what a typical Helm chart looks like:

mychart/
  Chart.yaml        # Describes the chart
  templates/        # Contains template files
    deployment.yaml # Template for a Deployment
    service.yaml    # Template for a Service
  values.yaml       # Default configuration values
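
The day-to-day commands are short; a sketch, with my-release as a hypothetical release name:

helm create mychart                 # scaffold the layout shown above
helm lint mychart                   # catch template mistakes early
helm install my-release ./mychart   # first deployment
helm upgrade my-release ./mychart   # roll out a change
helm rollback my-release 1          # return to revision 1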

Helm makes it easy to reuse configurations across different projects and share your charts with others, providing a practical way to manage the complexity of Kubernetes applications.

What is Kustomize?

Now, let’s talk about Kustomize. Imagine Kustomize as a powerful customization tool for Kubernetes, a versatile toolkit designed to modify and adapt existing Kubernetes configurations. It provides a way to create variations of your deployment without having to rewrite or duplicate configurations. Think of it as having a set of advanced tools to tweak, fine-tune, and adapt everything you already have. Kustomize allows you to take a base configuration and apply overlays to create different variations for various environments, making it highly flexible for scenarios like development, staging, and production.

Kustomize works by applying patches and transformations to your base Kubernetes YAML files. Instead of duplicating the entire configuration for each environment, you define a base once, and then Kustomize helps you apply environment-specific changes on top. Imagine you have a basic configuration, and Kustomize is your stencil and spray paint set, letting you add layers of detail to suit different environments while keeping the base consistent. Here’s what a typical Kustomize project might look like:

base/
  deployment.yaml
  service.yaml

overlays/
  dev/
    kustomization.yaml
    patches/
      deployment.yaml
  prod/
    kustomization.yaml
    patches/
      deployment.yaml

The structure is straightforward: you have a base directory that contains your core configurations, and an overlays directory that includes different environment-specific customizations. This makes Kustomize particularly powerful when you need to maintain multiple versions of an application across different environments, like development, staging, and production, without duplicating configurations.
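
The wiring lives in the kustomization.yaml files. A sketch of what overlays/prod/kustomization.yaml might contain for the layout above:

# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: patches/deployment.yaml

# Apply the prod variant with:
#   kubectl apply -k overlays/prod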

By keeping base definitions consistent and only modifying what’s necessary for each environment, you keep your configurations DRY (Don’t Repeat Yourself), which reduces errors, simplifies maintenance, and makes your deployments more consistent and reliable.

Helm vs Kustomize: different approaches

Helm uses templating to generate Kubernetes manifests. It takes your chart’s templates and values, combines them, and produces the final YAML files that Kubernetes needs. This templating mechanism allows for a high level of flexibility, but it also adds a level of complexity, especially when managing different environments or configurations. With Helm, the user must define various parameters in values.yaml files, which are then injected into templates, offering a powerful but sometimes intricate method of managing deployments.
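
As a small illustration of that mechanism (the names are ours, not from any real chart), a template references values like this:

# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  replicas: {{ .Values.replicaCount }}

# values.yaml
replicaCount: 3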

Kustomize, by contrast, uses a patching approach, starting from a base configuration and applying layers of customizations. Instead of generating new YAML files from scratch, Kustomize lets you define a consistent base once and then apply overlays for different environments, such as development, staging, or production. This means you do not need to maintain separate full configurations for each environment, making it easier to ensure consistency and reduce duplication. Kustomize’s patching mechanism is particularly powerful for teams that want to stay DRY, changing only what’s necessary for each environment without touching the shared base configuration. This also helps minimize configuration drift, keeping environments aligned and easier to manage over time.
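
For contrast with the template above, here is a sketch of the prod patch from the layout shown earlier: the patch declares only the delta, and Kustomize merges it onto the base Deployment.

# overlays/prod/patches/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web # must match the name in base/deployment.yaml
spec:
  replicas: 5 # prod overrides only this field; everything else comes from the base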

Ease of use

Helm can be a bit intimidating at first due to its templating language and chart structure. It’s like jumping straight onto a motorcycle, whereas Kustomize might feel more like learning to ride a bike with training wheels. Kustomize is generally easier to pick up if you are already familiar with standard Kubernetes YAML files.

Packaging and reusability

Helm excels when it comes to packaging and distributing applications. Helm charts can be shared, reused, and maintained, making them perfect for complex applications with many dependencies. Kustomize, on the other hand, is focused on customizing existing configurations rather than packaging them for distribution.
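
A sketch of that distribution flow, assuming Helm 3.8+ and a hypothetical OCI registry:

helm package mychart # produces mychart-0.1.0.tgz
helm push mychart-0.1.0.tgz oci://registry.example.com/charts
helm install my-release oci://registry.example.com/charts/mychart --version 0.1.0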

Integration with kubectl

Both tools integrate well with Kubernetes’ command-line tool, kubectl. Helm has its own CLI, helm, which extends kubectl capabilities, while Kustomize can be directly used with kubectl via the -k flag.
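
Side by side, using the hypothetical names from earlier:

kubectl apply -k overlays/dev      # Kustomize via kubectl's -k flag
kubectl kustomize overlays/dev     # render the overlay without applying it
helm install my-release ./mychart  # Helm's separate CLI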

Declarative vs. Imperative

Kustomize follows a declarative model: you describe what you want, and it figures out how to get there. Helm can be used both declaratively and imperatively, offering more flexibility but also more complexity if you want to take a hands-on approach.
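
A rough sketch of the difference, with the same hypothetical names:

helm install my-release ./mychart            # imperative: create this release now
helm upgrade --install my-release ./mychart  # idempotent: converge on the chart's state
kubectl apply -k overlays/prod               # declarative: the overlay is the source of truth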

Release history management

Helm provides built-in release management, keeping track of the history of your deployments so you can easily roll back to a previous version if needed. Kustomize lacks this feature, which means you need to handle versioning and rollback strategies separately.
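
A sketch of what that bookkeeping looks like in practice:

helm history my-release                    # one line per revision, with status
helm get manifest my-release --revision 3  # inspect exactly what revision 3 contained
helm rollback my-release 3                 # return to revision 3 (with all the caveats from the first half of this article)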

CI/CD integration

Both Helm and Kustomize can be integrated into your CI/CD pipelines, but their roles and strengths differ slightly. Helm is frequently chosen for its ability to package and deploy entire applications. Its charts encapsulate all necessary components, making it a great fit for automated, repeatable deployments where consistency and simplicity are key. Helm also provides versioning, which lets you manage releases effectively and roll back if something goes wrong, a capability that is extremely useful in CI/CD scenarios.

Kustomize, on the other hand, excels at adapting deployments to fit different environments without altering the original base configurations. It allows you to easily apply changes based on the environment, such as development, staging, or production, by layering customizations on top of the base YAML files. This makes Kustomize a valuable tool for teams that need flexibility across multiple environments, ensuring that you maintain a consistent base while making targeted adjustments as needed.

In practice, many DevOps teams find that combining both tools provides the best of both worlds: Helm for packaging and managing releases, and Kustomize for environment-specific customizations. By leveraging their unique capabilities, you can build a more robust, flexible CI/CD pipeline that meets the diverse needs of your application deployment processes.

Helm and Kustomize together

Here’s an interesting twist: you can use Helm and Kustomize together! For instance, you can use Helm to package your base application, and then apply Kustomize overlays for environment-specific customizations. This combo allows for the best of both worlds, standardized base configurations from Helm and flexible customizations from Kustomize.
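
One way to wire this up (paths and names are illustrative): render the chart to plain YAML with helm template, commit the output as a Kustomize base, and let overlays do the per-environment work.

# Render the chart once; the output becomes the Kustomize base.
helm template my-release ./mychart --values values.yaml > base/helm-rendered.yaml

# base/kustomization.yaml lists helm-rendered.yaml under resources,
# and each overlay patches it exactly as shown earlier.
kubectl apply -k overlays/prod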

Use cases for combining Helm and Kustomize

  • Environment-specific customizations: Use Kustomize to apply environment-specific configurations to a Helm chart. This allows you to maintain a single base chart while still customizing for development, staging, and production environments.
  • Third-party Helm charts: Instead of forking a third-party Helm chart to make changes, Kustomize lets you apply those changes directly on top, making it a cleaner and more maintainable solution.
  • Secrets and ConfigMaps management: Kustomize allows you to manage sensitive data, such as secrets and ConfigMaps, separately from Helm charts, which can help improve both security and maintainability.

Final thoughts

So, which tool should you choose? The answer depends on your needs and preferences. If you’re looking for a comprehensive solution to package and manage complex Kubernetes applications, Helm might be the way to go. On the other hand, if you want a simpler way to tweak configurations for different environments without diving into templating languages, Kustomize may be your best bet.

My advice? If the application is for internal use within your organization, use Kustomize. If the application is to be distributed to third parties, use Helm.