Resilience

AWS Fault Injection Service, the unknown service

Let’s discuss something near and dear to every AWS Architect and DevOps Engineer’s heart: resilience. Or, as I like to call it, “making sure your digital baby doesn’t throw a tantrum when things go sideways.”

We’ve all been there. You build this beautiful, intricate system in the cloud, like a magnificent sandcastle. It’s got auto-scaling, high availability, the works. You’re feeling pretty proud of yourself. Then, BAM! Some unforeseen event, a tiny ripple in the force of the internet, and your sandcastle starts to crumble. Panic ensues.

But what if, instead of waiting for disaster to strike, you could be a bit… mischievous? What if you could poke and prod your system before it has a meltdown in front of your users? Enter AWS Fault Injection Service (FIS), a service that’s about as well-known as a quiet librarian at a rock concert, but far more useful.

What’s this FIS thing, anyway?

Think of FIS as your friendly neighborhood chaos monkey, but with a PhD in engineering and a strict code of conduct. It’s a fully managed service that lets you run controlled chaos experiments on your AWS workloads. Yes, you read that right: you can intentionally break things, but in a safe and measured way. It’s like playing Jenga, but only for advanced players.

Why would you do that, you ask? Well, my friends, it’s all about finding those hidden weaknesses before they become major headaches. It’s like giving your application a stress test, similar to how doctors check your heart’s health. You want to see how it handles the pressure before it’s out there running a marathon in the real world. The idea is simple: you don’t know how strong the dam is until the river pushes against it.

Why is this CHAOS stuff so important?

In the old days (you know, like five years ago), we tested for predictable failures. Server goes down? No problem, we have a backup! But the cloud is a complex beast, and failures can be, well, weird. Latency spikes, partial network outages, API throttling… it’s a jungle out there.

FIS helps you simulate these real-world, often unpredictable scenarios. By deliberately injecting faults, you expose how your system behaves under stress. This way you discover whether the great ideas on your whiteboard actually translate into a resilient system in the cloud.

This isn’t just about avoiding downtime, though that’s a big plus. It’s about:

  • Improving Reliability: Find and fix weak points, leading to a more robust and dependable system.
  • Boosting Performance: Identify bottlenecks and optimize your application’s response under duress.
  • Validating Your Assumptions: Does your fancy auto-scaling work as intended? FIS will tell you.
  • Building Confidence: Knowing your system can handle the unexpected gives you peace of mind. And maybe, just maybe, you can sleep through the night without getting paged. A DevOps Engineer can dream, right?

Let’s get our hands dirty (Virtually, of course)

So, how does this magical chaos tool work? FIS operates through experiment templates. These are like recipes for disaster (the good kind, of course). In these templates, you define the following (a minimal code sketch follows the list):

  • Actions: What kind of mischief do you want to unleash? FIS offers a menu of pre-built actions, like:
    • aws:ec2:stop-instances: Stop EC2 instances. You pick which ones.
    • aws:ec2:terminate-instances: Terminate EC2 instances. Poof, they are gone.
    • aws:ssm:send-command: Run a script on an instance that causes, for example, CPU or memory stress.
    • aws:fis:inject-api-throttle-error: Throttle the AWS API calls made by a targeted IAM role, so you can see how your code copes with throttling.
  • Targets: Where do you want to inject these faults? You can target specific EC2 instances, ECS clusters, EKS clusters, RDS databases… You get the idea. You can select the resources by tags, by name, by percentage… You have plenty of options here.
  • Stop Conditions: This is your “emergency brake.” You define CloudWatch alarms that, if triggered, automatically halt the experiment. Safety first, people! If the experiment starts affecting more components than expected, the stop condition will be your friend.
  • IAM Role: This role is very important. It grants the FIS service permission to inject faults into your resources. Follow least privilege and assign only the permissions the experiment actually needs, nothing more.
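Here is a minimal sketch of such a template created with boto3’s FIS client, using the stock aws:ec2:stop-instances action. Every ARN, tag, name, and duration below is a placeholder you would swap for your own resources.

```python
import boto3

fis = boto3.client("fis")

template = fis.create_experiment_template(
    clientToken="demo-stop-ec2-template",  # idempotency token
    description="Stop tagged EC2 instances and verify the system self-heals",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # least-privilege role assumed by FIS
    targets={
        "web-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # pick instances by tag
            "selectionMode": "PERCENT(50)",           # only half of them
        }
    },
    actions={
        "stop-instances": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT10M"},  # bring them back after 10 minutes
            "targets": {"Instances": "web-instances"},
        }
    },
    stopConditions=[
        {   # emergency brake: halt the experiment if this CloudWatch alarm fires
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:high-error-rate",
        }
    ],
    tags={"team": "platform", "purpose": "resilience-testing"},
)
print(template["experimentTemplate"]["id"])
```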

Once you’ve crafted your experiment template, you can run it and watch the magic (or mayhem) unfold. FIS provides detailed logs and integrates with CloudWatch, so you can monitor the impact in real time.
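Starting a run from an existing template and watching its state could look roughly like this; the template id is a placeholder for the one returned above.

```python
import time

import boto3

fis = boto3.client("fis")

# "EXTxxxxxxxxxxxxx" stands in for the experiment template id created earlier.
experiment = fis.start_experiment(
    clientToken="demo-stop-ec2-run-1",
    experimentTemplateId="EXTxxxxxxxxxxxxx",
)
experiment_id = experiment["experiment"]["id"]

# Poll until the experiment finishes, is stopped by a stop condition, or fails.
while True:
    state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
    print(state["status"], state.get("reason", ""))
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(15)
```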

FIS in the Wild

Let’s say you have a microservices architecture running on ECS. You want to test how your system handles the failure of a critical service. With FIS, you could create an experiment (sketched in code after this list) that:

  • Action: Stops a percentage of the tasks in your critical service (the aws:ecs:stop-task action).
  • Target: Your ECS service, specifically the tasks tagged as “critical-service.”
  • Stop Condition: A CloudWatch alarm that triggers if your application’s latency exceeds a certain threshold or the error rate increases.
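As a sketch, the target and action definitions for that template could look like the dictionaries below, which you would plug into create_experiment_template as shown earlier. The tags, alarm ARN, and percentage are placeholders, and the exact keys are worth double-checking against the FIS action reference.

```python
# Target: ECS tasks carrying the "critical-service" tag, 30% of them at a time.
ecs_targets = {
    "critical-tasks": {
        "resourceType": "aws:ecs:task",
        "resourceTags": {"service": "critical-service"},
        "selectionMode": "PERCENT(30)",
    }
}

# Action: stop the selected tasks and let ECS (and your alarms) do the talking.
ecs_actions = {
    "stop-critical-tasks": {
        "actionId": "aws:ecs:stop-task",
        "targets": {"Tasks": "critical-tasks"},
    }
}

# Stop condition: bail out if the latency or error-rate alarm fires.
ecs_stop_conditions = [
    {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:api-latency-p99",
    }
]
```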

By running this experiment, you can observe how your other services react, whether your load balancing works as expected, and if your system can gracefully recover.

Or, imagine you want to test the resilience of your RDS database. You could simulate a failover (sketched after this list) by:

  • Action: aws:rds:reboot-db-instances with the forceFailover parameter set to true.
  • Target: Your primary RDS instance.
  • Stop Condition: A CloudWatch alarm that monitors the database’s availability.
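A sketch of the matching target and action definitions, assuming the aws:rds:reboot-db-instances action and its forceFailover parameter; the DB instance ARN and alarm ARN are placeholders.

```python
# Target: the primary (Multi-AZ) RDS instance, identified by its ARN.
rds_targets = {
    "primary-db": {
        "resourceType": "aws:rds:db",
        "resourceArns": ["arn:aws:rds:eu-west-1:123456789012:db:orders-primary"],
        "selectionMode": "ALL",
    }
}

# Action: reboot the instance and force a failover to the standby replica.
rds_actions = {
    "reboot-with-failover": {
        "actionId": "aws:rds:reboot-db-instances",
        "parameters": {"forceFailover": "true"},
        "targets": {"DBInstances": "primary-db"},
    }
}

# Stop condition: a CloudWatch alarm watching database availability.
rds_stop_conditions = [
    {
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:eu-west-1:123456789012:alarm:orders-db-unavailable",
    }
]
```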

This allows you to validate your read replica setup and ensure a smooth transition in case of a real-world primary instance failure.

I remember one time I was helping a startup that had a critical application running on EC2. They were convinced their auto-scaling was flawless. We used FIS to simulate a sudden loss of capacity by terminating a bunch of instances. Guess what? Their auto-scaling took longer to kick in than they expected, leading to a brief period of performance degradation. Thanks to the experiment, they were able to fix the issue, avoiding real user impact in the future.

My Two Cents (and Maybe a Few More)

I’ve been around the AWS block a few times, and I can tell you that FIS is a game-changer. It’s not just about breaking things; it’s about understanding things. It’s about building systems that are not just robust on paper but resilient in the face of the unpredictable chaos of the real world.

Building resilient AWS infrastructure

Imagine building a house of cards. One slight bump and the whole structure comes tumbling down. Now, imagine if your AWS infrastructure were like that house of cards. A scary thought, right? That’s exactly what happens when we have single points of failure in our cloud architecture.

We’re setting ourselves up for trouble when we build systems that depend on a single critical component. If that component fails, the entire system can come crashing down like the house of cards. Instead, we want our infrastructure to resemble a well-engineered skyscraper: stable, robust, and designed with the foresight that no one piece can bring everything down. By thinking ahead and using the right tools, we can build systems that are resilient, adaptable, and ready for anything.

Why should you care about high availability?

Let me start with a story. A few years ago, a major e-commerce company lost millions in revenue when its primary database server crashed during Black Friday. The problem? They had no redundancy in place. It was like trying to cross a river with just one bridge: when that bridge failed, they were completely stuck. In the cloud era, having a single point of failure isn’t just risky, it’s entirely avoidable.

AWS provides us with incredible tools to build resilient systems, kind of like having multiple bridges, boats, and even helicopters to cross that river. Let’s explore how to use these tools effectively.

Starting at the edge with DNS and content delivery

Think of DNS as the reception desk of your application. AWS Route 53, its DNS service, is like having multiple receptionists who know exactly where to direct your visitors, even if one of them takes a break. Here’s how to make it bulletproof:

  • Health checks: Route 53 constantly monitors your endpoints, like a vigilant security guard. If something goes wrong, it automatically redirects traffic to healthy resources.
  • Multiple routing policies: You can set up different routing rules based on:
    • Geolocation: Direct users based on their location.
    • Latency: Route traffic to the endpoint that provides the lowest latency.
    • Failover: Automatically direct users to a backup endpoint if the primary one fails (see the sketch after this list).
  • Route 53 Application Recovery Controller: Think of this as your disaster recovery command center. It manages complex failover scenarios automatically, giving you the control needed to respond effectively.
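As an example, a basic DNS failover setup, a health check on the primary endpoint plus a primary/secondary record pair, could be sketched with boto3 like this; the hosted zone id, domain names, and IP addresses are placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Health check that probes the primary endpoint every 30 seconds.
health_check = route53.create_health_check(
    CallerReference="primary-endpoint-check-1",  # idempotency token
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000000000",
    ChangeBatch={
        "Changes": [
            {   # primary record, served only while the health check passes
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "198.51.100.10"}],
                    "HealthCheckId": health_check["HealthCheck"]["Id"],
                },
            },
            {   # secondary record, used automatically when the primary is unhealthy
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "secondary",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "203.0.113.20"}],
                },
            },
        ]
    },
)
```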

But we can make things even better. CloudFront, AWS’s content delivery network, acts like having local stores in every neighborhood instead of one central warehouse. This ensures users get data from the closest location, reducing latency. Add AWS Shield and WAF, and you’ve got bouncers at every door, protecting against DDoS attacks and malicious traffic.

The load balancing dance

Load balancers are like traffic cops directing cars at a busy intersection. The key is choosing the right one:

  • Application Load Balancer (ALB): Ideal for HTTP/HTTPS traffic, like a sophisticated traffic controller that knows where each type of vehicle needs to go.
  • Network Load Balancer (NLB): The choice when you need ultra-high performance and static IP addresses. Think of it as an express lane for your traffic, ideal for low-latency use cases.
  • Cross-Zone Load Balancing: Available on both Application Load Balancers (ALB), where it’s on by default, and Network Load Balancers (NLB), where you enable it yourself (see the sketch after this list). It ensures that even if one Availability Zone is busier than the others, traffic gets distributed evenly, like a good parent sharing cookies equally among children.
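Enabling cross-zone load balancing on an NLB is a single attribute change; a minimal boto3 sketch, with a placeholder load balancer ARN:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Turn on cross-zone load balancing for an NLB (ALBs already have it enabled).
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:eu-west-1:123456789012:loadbalancer/net/my-nlb/0123456789abcdef",
    Attributes=[
        {"Key": "load_balancing.cross_zone.enabled", "Value": "true"},
    ],
)
```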

The art of auto scaling

Auto Scaling Groups (ASG) are like having a smart hiring manager who knows exactly when to bring in more help and when to reduce staff. Here’s how to make them work effectively:

  • Multiple Availability Zones: Never put all your eggs in one basket. Spread your instances across different AZs to avoid single points of failure.
  • Launch Templates: Think of these as detailed job descriptions for your instances. They ensure consistency in the configuration of your resources and make it easy to replicate settings whenever needed.
  • Scaling Policies: Use CloudWatch alarms to trigger scaling actions based on metrics like CPU usage or request count. Even simpler, a target tracking scaling policy tells the Auto Scaling group to keep a metric such as average CPU utilization at a chosen value, say 70%, adding or removing instances automatically to stay there (see the sketch after this list).
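A minimal sketch of such a target tracking policy with boto3; the Auto Scaling group name is a placeholder.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Keep average CPU utilization of the group around 70%; Auto Scaling adds
# instances when the metric is above the target and removes them when below.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="front-layer-asg",
    PolicyName="keep-cpu-at-70",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 70.0,
    },
)
```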

The backend symphony

Your backend layer needs to be as resilient as the front end. Here’s how to achieve that:

  • Stateless design: Each server should be like a replaceable worker, able to handle any task without needing to remember previous interactions. Stateless servers make scaling easier and ensure that no single instance becomes critical.
  • Caching strategy: ElastiCache acts like a team’s shared notebook: frequently needed information is always at hand. This reduces the load on your databases and improves response times.
  • Message queuing: Services like SQS and MSK ensure that if one part of your system gets overwhelmed, messages wait patiently in line instead of getting lost. This decouples your components, making the whole system more resilient (a minimal SQS sketch follows this list).
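Here’s what that decoupling looks like with SQS in a minimal boto3 sketch; the queue name and message body are placeholders.

```python
import boto3

sqs = boto3.client("sqs")

queue_url = sqs.create_queue(QueueName="orders-queue")["QueueUrl"]

# Producer side: hand the work off to the queue and move on.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"order_id": 42}')

# Consumer side: long-poll for work, process it, then delete the message so it
# isn't redelivered. If the consumer crashes mid-way, the message reappears.
messages = sqs.receive_message(
    QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
).get("Messages", [])

for message in messages:
    print("processing", message["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```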

The data layer foundation

Your data layer is like the foundation of a building: it needs to be rock solid. Here’s how to achieve that:

  • RDS Multi-AZ: Your database gets a synchronized standby in another availability zone, ready to take over automatically, usually within a minute or two, if the primary fails (see the sketch after this list). This provides fault tolerance for critical data.
  • DynamoDB Global tables: Think of these as synchronized notebooks in different offices around the world. They allow you to read and write data across regions, providing low-latency access and redundancy.
  • Aurora Global Database: Imagine having multiple synchronized libraries across different continents. With Aurora, you get global resilience with fast failover capabilities that ensure continuity even during regional outages.
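As a sketch, Multi-AZ is a single flag when you create (or later modify) the RDS instance; the identifiers and credentials below are placeholders, and real passwords belong in Secrets Manager.

```python
import boto3

rds = boto3.client("rds")

# Provision a PostgreSQL instance with a synchronous standby in a second AZ.
rds.create_db_instance(
    DBInstanceIdentifier="orders-db",
    Engine="postgres",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=50,  # GiB
    MasterUsername="app_admin",
    MasterUserPassword="use-secrets-manager-instead",  # placeholder only
    MultiAZ=True,  # the flag that gives you the standby replica
)
```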

Monitoring and management

You need eyes and ears everywhere in your infrastructure. Here’s how to set that up:

  • AWS Systems Manager: This serves as your central command center for configuration management, enabling you to automate operational tasks across your AWS resources.
  • CloudWatch: It’s your all-seeing eye for monitoring and alerting. Set alarms for resource usage, errors, and performance metrics to get notified before small issues escalate (a minimal alarm sketch follows this list).
  • AWS Config: It’s like having a compliance officer, constantly checking that everything in your environment follows the rules and best practices. By the way, have you wondered how to configure AWS Config when you need to apply its rules across an infrastructure spread over multiple regions? We’ll cover that in another article.
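For example, here is a minimal boto3 sketch of an alarm on ALB 5XX errors that notifies an SNS topic, and which could also double as an FIS stop condition; the ARNs and the load balancer dimension are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the ALB returns more than 50 target 5XX responses per minute
# for three consecutive minutes, and notify the on-call SNS topic.
cloudwatch.put_metric_alarm(
    AlarmName="high-error-rate",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall-alerts"],
)
```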

Best practices and common pitfalls

Here are some golden rules to live by:

  • Regular testing: Don’t wait for a real disaster to test your failover mechanisms. Conduct frequent disaster recovery drills to ensure that your systems and teams are ready for anything.
  • Documentation: Keep clear runbooks; they’re like instruction manuals for your infrastructure, detailing how to respond to incidents and maintain uptime.
  • Avoid these common mistakes:
    • Forgetting to test failover procedures
    • Neglecting to monitor all components, especially those that may seem trivial
    • Assuming that AWS services alone guarantee high availability; resilience requires thoughtful architecture

In a Few Words

Building a truly resilient AWS infrastructure is like conducting an orchestra: every component needs to play its part perfectly. But with careful planning and the right use of AWS services, you can create a system that stays running even when things go wrong.

The goal isn’t just to eliminate single points of failure; it’s to build an infrastructure so resilient that your users never even notice when something goes wrong. Because the best high-availability system is one that makes downtime invisible.

Ready to start building your bulletproof infrastructure? Start with one component at a time, test thoroughly, and gradually build up to a fully resilient system. Your future self (and your users) will thank you for it.

Putting it all together, the layered architecture looks roughly like this:

Route 53 --> Alias Record --> External Load Balancer --> ASG (Front Layer)
    |                                                            |
    v                                                            v
CloudFront                                          Internal Load Balancer
    |                                                            |
    v                                                            v
AWS Shield + WAF                                        ASG (Back Layer)
                                                                 |
                                                                 v
                                                    Database (Multi-AZ) or
                                                    DynamoDB Global Tables