HighAvailability

How to ensure high availability for pods in Kubernetes

I was thinking the other day about these Kubernetes pods, and how they’re like little spaceships floating around in the cluster. But what happens if one of those spaceships suddenly vanishes? Poof! Gone! That’s a real problem. So I started wondering, how can we ensure our pods are always there, ready to do their job, even if things go wrong? It’s like trying to keep a juggling act going while someone’s moving the floor around you…

Let me tell you about this tool called Karpenter. It’s like a super-efficient hotel manager for our Kubernetes worker nodes, always trying to arrange the “guests” (our applications) most cost-effectively. Sometimes, this means moving guests from one room to another to save on operating costs. In Kubernetes terminology, we call this “consolidation.”

The dancing pods challenge

Here’s the thing: We have this wonderful hotel manager (Karpenter) who’s doing a fantastic job, keeping costs down by constantly optimizing room assignments. But what about our guests (the applications)? They might get a bit dizzy with all this moving around, and sometimes, their important work gets disrupted.

So, the question is: How do we keep our applications running smoothly while still allowing Karpenter to do its magic? It’s like trying to keep a circus performance going while the stage crew rearranges the set in the middle of the act.

Understanding the moving parts

Before we explore the solutions, let’s take a peek behind the scenes and see what happens when Karpenter decides to relocate our applications. It’s quite a fascinating process:

First, Karpenter puts up a “Do Not Disturb” sign (technically called a taint) on the node it wants to clear. Then, it finds new accommodations for all the applications. Finally, it carefully moves each application to its new location.

Think of it as a well-choreographed dance where each step must be perfectly timed to avoid any missteps.

The art of high availability

Now, for the exciting part, we have some clever tricks up our sleeves to ensure our applications keep running smoothly:

  1. The buddy system: The first rule of high availability is simple: never go it alone! Instead of running a single instance of your application, run at least two. It’s like having a backup singer, if one voice falters, the show goes on. In Kubernetes, we do this by setting replicas: 2 in our deployment configuration.
  2. Strategic placement: Here’s a neat trick: we can tell Kubernetes to spread our application copies across different physical machines. It’s like not putting all your eggs in one basket. We use something called “Pod Topology Spread Constraints” for this. Here’s how it looks in practice:
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: your-app
  1. Setting boundaries: Remember when your parents set rules about how many cookies you could eat? We do something similar in Kubernetes with PodDisruptionBudgets (PDB). We tell Kubernetes, “Hey, you must always keep at least 50% of my application instances running.” This prevents our hotel manager from getting too enthusiastic about rearranging things.
  2. The “Do Not Disturb” sign: For those special cases where we absolutely don’t want an application to be moved, we can put up a permanent “Do Not Disturb” sign using the karpenter.sh/do-not-disrupt: “true” annotation. It’s like having a VIP guest who gets to keep their room no matter what.

The complete picture

The beauty of this system lies in how all the pieces work together. Think of it as a safety net with multiple layers:

  • Multiple instances ensure basic redundancy.
  • Strategic placement keeps instances separated.
  • PodDisruptionBudgets prevent too many moves at once.
  • And when necessary, we can completely prevent disruption.

A real example

Let me paint you a picture. Imagine you’re running a critical web service. Here’s how you might set it up:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-web-service
spec:
  replicas: 2
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "false"  # We allow movement, but with controls
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-web-service-pdb
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: critical-web-service

The result

With these patterns in place, our applications become incredibly resilient. They can handle node failures, scale smoothly, and even survive Karpenter’s optimization efforts without any downtime. It’s like having a self-healing system that keeps your services running no matter what happens behind the scenes.

High availability isn’t just about having multiple copies of our application, it’s about thoughtfully designing how those copies are managed and maintained. By understanding and implementing these patterns, we are not just running applications in Kubernetes; we are crafting reliable, resilient services that can weather any storm.

The next time you deploy an application to Kubernetes, think about these patterns. They might just save you from that dreaded 3 AM wake-up call about your service being down!

Building resilient AWS infrastructure

Imagine building a house of cards. One slight bump and the whole structure comes tumbling down. Now, imagine if your AWS infrastructure was like that house of cards, a scary thought, right? That’s exactly what happens when we have single points of failure in our cloud architecture.

We’re setting ourselves up for trouble when we build systems that depend on a single critical component. If that component fails, the entire system can come crashing down like the house of cards. Instead, we want our infrastructure to resemble a well-engineered skyscraper: stable, robust, and designed with the foresight that no one piece can bring everything down. By thinking ahead and using the right tools, we can build systems that are resilient, adaptable, and ready for anything.

Why should you care about high availability?

Let me start with a story. A few years ago, a major e-commerce company lost millions in revenue when its primary database server crashed during Black Friday. The problem? They had no redundancy in place. It was like trying to cross a river with just one bridge,  when that bridge failed, they were completely stuck. In the cloud era, having a single point of failure isn’t just risky, it’s entirely avoidable.

AWS provides us with incredible tools to build resilient systems, kind of like having multiple bridges, boats, and even helicopters to cross that river. Let’s explore how to use these tools effectively.

Starting at the edge with DNS and content delivery

Think of DNS as the reception desk of your application. AWS Route 53, its DNS service, is like having multiple receptionists who know exactly where to direct your visitors, even if one of them takes a break. Here’s how to make it bulletproof:

  • Health checks: Route 53 constantly monitors your endpoints, like a vigilant security guard. If something goes wrong, it automatically redirects traffic to healthy resources.
  • Multiple routing policies: You can set up different routing rules based on:
    • Geolocation: Direct users based on their location.
    • Latency: Route traffic to the endpoint that provides the lowest latency.
    • Failover: Automatically direct users to a backup endpoint if the primary one fails.
  • Application recovery controller: Think of this as your disaster recovery command center. It manages complex failover scenarios automatically, giving you the control needed to respond effectively.

But we can make things even better. CloudFront, AWS’s content delivery network, acts like having local stores in every neighborhood instead of one central warehouse. This ensures users get data from the closest location, reducing latency. Add AWS Shield and WAF, and you’ve got bouncers at every door, protecting against DDoS attacks and malicious traffic.

The load balancing dance

Load balancers are like traffic cops directing cars at a busy intersection. The key is choosing the right one:

  • Application Load Balancer (ALB): Ideal for HTTP/HTTPS traffic, like a sophisticated traffic controller that knows where each type of vehicle needs to go.
  • Network Load Balancer (NLB): When you need ultra-high performance and static IP addresses, think of it as an express lane for your traffic, ideal for low-latency use cases.
  • Cross-Zone Load Balancing: This is a feature of both Application Load Balancers (ALB) and Network Load Balancers (NLB). It ensures that even if one availability zone is busier than others, the traffic gets distributed evenly like a good parent sharing cookies equally among children.

The art of auto scaling

Auto Scaling Groups (ASG) are like having a smart hiring manager who knows exactly when to bring in more help and when to reduce staff. Here’s how to make them work effectively:

  • Multiple Availability Zones: Never put all your eggs in one basket. Spread your instances across different AZs to avoid single points of failure.
  • Launch Templates: Think of these as detailed job descriptions for your instances. They ensure consistency in the configuration of your resources and make it easy to replicate settings whenever needed.
  • Scaling Policies: Use CloudWatch alarms to trigger scaling actions based on metrics like CPU usage or request count. For instance, if CPU utilization exceeds 70%, you can automatically add more instances to handle the increased load. This type of setup is known as a Tracking Policy, where scaling actions are determined by monitoring key metrics like CPU utilization.

The backend symphony

Your backend layer needs to be as resilient as the front end. Here’s how to achieve that:

  • Stateless design: Each server should be like a replaceable worker, able to handle any task without needing to remember previous interactions. Stateless servers make scaling easier and ensure that no single instance becomes critical.
  • Caching strategy: ElastiCache acts like a team’s shared notebook, frequently needed information is always at hand. This reduces the load on your databases and improves response times.
  • Message queuing: Services like SQS and MSK ensure that if one part of your system gets overwhelmed, messages wait patiently in line instead of getting lost. This decouples your components, making the whole system more resilient.

The data layer foundation

Your data layer is like the foundation of a building, it needs to be rock solid. Here’s how to achieve that:

  • RDS Multi-AZ: Your database gets a perfect clone in another availability zone, ready to take over in milliseconds if needed. This provides fault tolerance for critical data.
  • DynamoDB Global tables: Think of these as synchronized notebooks in different offices around the world. They allow you to read and write data across regions, providing low-latency access and redundancy.
  • Aurora Global Database: Imagine having multiple synchronized libraries across different continents. With Aurora, you get global resilience with fast failover capabilities that ensure continuity even during regional outages.

Monitoring and management

You need eyes and ears everywhere in your infrastructure. Here’s how to set that up:

  • AWS Systems Manager: This serves as your central command center for configuration management, enabling you to automate operational tasks across your AWS resources.
  • CloudWatch: It’s your all-seeing eye for monitoring and alerting. Set alarms for resource usage, errors, and performance metrics to get notified before small issues escalate.
  • AWS Config: It’s like having a compliance officer, constantly checking that everything in your environment follows the rules and best practices. By the way, Have you wondered how to configure AWS Config if you need to apply its rules across an infrastructure spread over multiple regions? We’ll cover that in another article.

Best practices and common pitfalls

Here are some golden rules to live by:

  • Regular testing: Don’t wait for a real disaster to test your failover mechanisms. Conduct frequent disaster recovery drills to ensure that your systems and teams are ready for anything.
  • Documentation: Keep clear runbooks, they’re like instruction manuals for your infrastructure, detailing how to respond to incidents and maintain uptime.
  • Avoid these common mistakes:
    • Forgetting to test failover procedures
    • Neglecting to monitor all components, especially those that may seem trivial
    • Assuming that AWS services alone guarantee high availability, resilience requires thoughtful architecture

In a Few Words

Building a truly resilient AWS infrastructure is like conducting an orchestra, every component needs to play its part perfectly. But with careful planning and the right use of AWS services, you can create a system that stays running even when things go wrong.

The goal isn’t just to eliminate single points of failure, it’s to build an infrastructure so resilient that your users never even notice when something goes wrong. Because the best high-availability system is one that makes downtime invisible.

Ready to start building your bulletproof infrastructure? Start with one component at a time, test thoroughly, and gradually build up to a fully resilient system. Your future self (and your users) will thank you for it.

Route 53 --> Alias Record --> External Load Balancer --> ASG (Front Layer)
           |                                                     |
           v                                                     v
       CloudFront                               Internal Load Balancer
           |                                                     |
           v                                                     v
  AWS Shield + WAF                                    ASG (Back Layer)
                                                             |
                                                             v
                                              Database (Multi-AZ) or 
                                              DynamoDB Global

Types of Failover in Amazon Route 53 Explained Easily

Imagine Amazon Route 53 as a city’s traffic control system that directs cars (internet traffic) to different streets (servers or resources) based on traffic conditions and road health (the health and configuration of your AWS resources).

Active-Active Failover

In an active-active scenario, you have two streets leading to your destination (your website or application), and both are open to traffic all the time. If one street gets blocked (a server fails), traffic simply continues flowing through the other street. This is useful when you want to balance the load between two resources that are always available.

Active-active failover gives you access to all resources during normal operation. In this example, both region 1 and region 2 are active all the time. When a resource becomes unavailable, Route 53 can detect that it’s unhealthy and stop including it when responding to queries.

Active-Passive Failover

In active-passive failover, you have one main street that you prefer all traffic to use (the primary resource) and a secondary street that’s only used if the main one is blocked (the secondary resource is activated only if the primary fails). This method is useful when you have a preferred resource to handle requests but need a backup in case it fails.

Use an active-passive failover configuration when you want a primary resource or group of resources to be available the majority of the time and you want a secondary resource or group of resources to be on standby in case all the primary resources become unavailable.

Configuring Active-Passive Failover with One Primary and One Secondary Resource

This approach is like having one big street and one small street. You use the big street whenever possible because it can handle more traffic or get you to your destination more directly. You only use the small street if there’s construction or a blockage on the big street.

Configuring Active-Passive Failover with Multiple Primary and Secondary Resources

Now imagine you have several big streets and several small streets. All the big ones are your preferred options, and all the small ones are your backup options. Depending on how many big streets are available, you’ll direct traffic to them before considering using the small ones.

Configuring Active-Passive Failover with Weighted Records

This is like having multiple streets leading to your destination, but you give each street a “weight” based on how often you want it used. Some streets (resources) are preferred more than others, and that preference is adjusted by weight. You still have a backup street for when your preferred options aren’t available.

Evaluating Target Health

“Evaluate Target Health” is like having traffic sensors that instantly tell you if a street is blocked. If you’re routing traffic to AWS resources for which you can create alias records, you don’t need to set up separate health checks for those resources. Instead, you enable “Evaluate Target Health” on your alias records, and Route 53 will automatically check the health of those resources. This simplifies setup and keeps your traffic flowing to streets (resources) that are open and healthy without needing additional health configurations.

In short, Amazon Route 53 offers a powerful set of tools that you can use to manage the availability and resilience of your applications through a variety of ways to apply failover configurations. Implementation of such knowledge into the practice of failover strategy will result in keeping your application up and available for the users in cases when any kind of resource fails or gets a downtime outage.