SysAdmin

An irreverent tour of Linux disk space and RAM mysteries

Linux feels a lot like living in a loft apartment: the pipes are on display, every clank echoes, and when something leaks, you’re the first to squelch through the puddle. This guide hands you a mop, half a dozen snappy commands that expose where your disk space and memory have wandered off to, plus a couple of click‑friendly detours. Expect prose that winks, occasionally rolls its eyes, and never ever sounds like tax law.

Why checking disk and memory matters

Think of storage and RAM as the pantry and fridge in a shared flat. Ignore them for a week, and you end up with three half‑finished jars of salsa (log files) and leftovers from roommates long gone (orphaned kernels). A five‑minute audit every Friday spares you the frantic sprint for extra space, or worse, the freeze just before a production deploy.

Disk panic survival kit

Get the big picture fast

df is the bird’s‑eye drone shot of your mounted filesystems: everything lines up like contestants at a weigh‑in.

# Exclude temporary filesystems for clarity
$ df -hT -x tmpfs -x devtmpfs

-h prints friendly sizes, -T shows filesystem type, and the two -x flags hide the short‑lived stuff.
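If you would rather have the shell do the squinting, here is a minimal sketch that flags filesystems above a usage threshold. The function name flag_full and the 80 % cut-off are illustrative choices, not anything df provides, and the parsing assumes mount points without spaces.

```shell
#!/usr/bin/env bash
# flag_full THRESHOLD: read `df -P` output on stdin and print
# "mountpoint usage%" for every filesystem at or above THRESHOLD percent.
# Assumes mount points contain no spaces.
flag_full() {
  awk -v t="${1:-80}" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t + 0) print $6, use "%"
  }'
}

# -P forces the one-line-per-filesystem POSIX layout, which is easier to parse than -h.
df -P -x tmpfs -x devtmpfs 2>/dev/null | flag_full 80
```

No output means nothing has crossed the line yet.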

Zoom in on space hogs

du is your tape measure. Pair it with a little sort and head for instant gossip about the top offenders in any directory:

# Top 10 fattest directories under /var
$ sudo du -h --max-depth=1 /var 2>/dev/null | sort -hr | head -n 10

If /var/log looks like it skipped leg day and went straight for bulking season, you’ve found the culprit.

Bring in the interactive detective

When scrolling text gets dull, ncdu adds caffeine and colour:

# Install on most Debian‑based distros
$ sudo apt install ncdu

# Start at root (may take a minute)
$ sudo ncdu /

Navigate with the arrow keys, press d to delete, and feel the instant gratification of reclaiming gigabytes. It’s the Marie Kondo of storage.

Visualise block devices

# Tree view of drives, partitions, and mount points
$ lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT --tree

Handy when that phantom 8 GB USB stick from last week still lurks in /media like an uninvited houseguest.

Memory and swap reality check

Check the ledger

The free command is a quick wallet peek, straightforward, and slightly judgemental:

$ free -h

Focus on the available column; that’s what you can still spend without the kernel reaching for its credit card (a.k.a. swap).
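For scripts and alerts, the same figure can be read straight from /proc/meminfo, where the kernel exposes it as MemAvailable. A small sketch follows; the function name mem_available_pct is made up for this example.

```shell
# mem_available_pct [FILE]: print MemAvailable as a whole percentage of MemTotal.
# FILE defaults to /proc/meminfo; passing another path is handy for testing.
mem_available_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Only report where /proc/meminfo exists (that is, on Linux).
if [ -r /proc/meminfo ]; then
  echo "Available: $(mem_available_pct)%"
fi
```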

Real‑time spy cam

# Refresh every two seconds (top defaults to three), ordered by RAM gluttons
$ top -d 2 -o %MEM

Prefer your monitoring colourful and charming? Try htop:

$ sudo apt install htop
$ htop

Use F6 to sort by RES (resident memory) and watch your browser tabs duke it out for supremacy.

Meet RAM’s couch‑surfing cousin

Swap steps in when RAM is full. Think of it as sleeping on the living‑room sofa: doable, but slow and slightly undignified.

# Show active swap files or partitions
$ swapon --show

Seeing swap above 20 % during regular use? Either add RAM or conjure an emergency swap file:

$ sudo fallocate -l 2G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile

Remember to append it to /etc/fstab so it survives a reboot.
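The fstab line for a swap file conventionally looks like the entry below. This sketch wraps it in a hypothetical helper, add_swap_entry, that appends the line only if it is not already there, so running it twice does no harm. It assumes the swap file lives at /swapfile, as in the commands above.

```shell
# add_swap_entry [FSTAB]: append the swap file line once, skipping it if already present.
# FSTAB defaults to /etc/fstab (which needs root); the function name is illustrative.
add_swap_entry() {
  fstab=${1:-/etc/fstab}
  entry='/swapfile none swap sw 0 0'
  grep -qxF "$entry" "$fstab" 2>/dev/null || printf '%s\n' "$entry" >> "$fstab"
}

# Usage, as root:
#   add_swap_entry
```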

Prefer clicking to typing

Yes, there’s a GUI for that. GNOME Disks and KSysGuard both display live graphs and won’t judge your typos. On Ubuntu, you can run:

$ sudo apt install gnome-disk-utility

Launch it from the menu and watch I/O spikes climb like toddlers on a sugar rush.

Quick reference cheat sheet

Show all mounts minus temp stuff
Command: df -hT -x tmpfs -x devtmpfs
Memory aid: df = disk fly‑over
Top ten heaviest directories
Command: du -h --max-depth=1 /path | sort -hr | head
Memory aid: du = directory weight
Interactive cleanup
Command: ncdu /
Memory aid: ncdu = du after espresso
Live RAM counter
Command: free -h
Memory aid: free = funds left
Spot memory‑hogging apps
Command: top -o %MEM
Memory aid: top = talent show
Swap usage
Command: swapon --show
Memory aid: swap on stage

Stick this list on your clipboard; your future self will thank you.

Wrapping up without a bow

You now own the detective kit for disk and memory mysteries, no cosmic metaphors, just straight talk with a wink. Run df -hT right now; if the numbers give you heartburn, take three deep breaths and start paging through ncdu. Storage leaks and RAM gluttons are inevitable, but letting them linger is optional.

Found an even better one‑liner? Drop it in the comments and make the rest of us look lazy. Until then, happy sleuthing, and may your logs stay trim and your swap forever bored.

Free that stuck Linux port and get on with your day

A rogue process squatting on port 8080 is the tech-equivalent of leaving your front-door key in the lock: nothing else gets in or out, and the neighbours start gossiping. Ports are exclusive party venues; one process per port, no exceptions. When an app crashes, restarts awkwardly, or you simply forget it’s still running, it grips that port like a toddler with the last cookie, triggering the dreaded “address already in use” error and freezing your deployment plans.

Below is a brisk, slightly irreverent field guide to evicting those squatters, gracefully when possible, forcefully when they ignore polite knocks, and automatically so you can get on with more interesting problems.

When ports act like gate crashers

Ports are finite. Your Linux box has 65535 of them, but every service worth its salt wants one of the “good seats” (80, 443, 5432…). Let a single zombie process linger, and you’ll be running deployment whack-a-mole all afternoon. Keeping ports free is therefore less superstition and more basic hygiene, like throwing out last night’s takeaway before the office starts to smell.

Spot the culprit

Before brandishing a digital axe, find out who is hogging the socket.

lsof, the bouncer with the clipboard

sudo lsof -Pn -iTCP:8080 -sTCP:LISTEN

lsof prints the PID, the user, and even whether our offender is IPv4 or IPv6. It’s as chatty as the security guard who tells you exactly which cousin tried to crash the wedding.

ss, the Formula 1 mechanic

Modern distros prefer ss; it’s faster and less creaky than netstat.

sudo ss -lptn sport = :8080

fuser, the debt collector

When subtlety fails, fuser spells out which processes own the file or socket:

sudo fuser -v 8080/tcp

It displays the PID and the user, handy for blaming Dave from QA by name.

Tip: Add the -k flag to fuser to kill offenders in one swoop. It sends SIGKILL unless you name a gentler signal, such as -TERM. Great for scripts, dangerous for fingers-on-keyboard humans.

Gentle persuasion first

A well-behaved process will exit graciously if you offer it a polite SIGTERM (15):

kill -15 3245     # give the app a chance to clean up

Think of it as tapping someone’s shoulder at closing time: “Finish your drink, mate.”

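If it doesn’t listen, you can also try SIGINT (2), the Ctrl-C of signals. SIGHUP (1) is a different tool: many daemons treat it as a cue to reload their config without dying, though a process that doesn’t handle it will simply exit.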
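To see why the polite knock matters, here is a toy sketch of a process that tidies up when SIGTERM arrives. The worker function and the marker file are invented for the demo; a real service would flush buffers or close connections instead.

```shell
#!/usr/bin/env bash
# Demo: a background worker that cleans up when it receives SIGTERM.
marker=$(mktemp)

worker() {
  # The trap runs on SIGTERM; SIGKILL can never be trapped.
  trap 'echo "cleaned up" > "$marker"; exit 0' TERM
  while :; do sleep 0.1; done
}

worker &
pid=$!
sleep 0.5          # give the worker time to install its trap
kill -15 "$pid"    # the polite request
wait "$pid"        # block until the worker has exited
cat "$marker"
```

Swap kill -15 for kill -9 and the marker file stays empty, which is exactly the cleanup you lose.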

Bring out the big stick

Sometimes you need the digital equivalent of cutting the mains power. SIGKILL (9) is that guillotine:

kill -9 3245      # immediate, unsentimental termination

No cleanup, no goodbye note, just a corpse on the floor. Databases hate this, log files dislike it, and system-wide supervisors may auto-restart the process, so use sparingly.

One-liners for the impatient

sudo kill -9 $(sudo ss -lptn sport = :8080 | awk 'NR==2{split($NF,a,"pid=");split(a[2],b,",");print b[1]}')

Single line, single breath, done. It’s the Fast & Furious of port freeing, but remember: copy-paste speed correlates strongly with “oops-I-just-killed-production”.
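A slightly calmer variant separates looking from shooting. The sketch below only extracts PIDs from ss output, including the case where several processes share the socket; extract_pids is a name made up for this example.

```shell
# extract_pids: read `ss -lptn` output on stdin and print each distinct PID once.
extract_pids() {
  grep -o 'pid=[0-9]*' | cut -d= -f2 | sort -un
}

# Usage: list the PIDs first, then decide what to kill.
#   sudo ss -lptn 'sport = :8080' | extract_pids
```

Reading the list before piping it into kill costs five seconds and occasionally saves production.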

Automate the cleanup

A pocket Bash script

#!/usr/bin/env bash
port=${1:-3000}
pid=$(ss -lptn "sport = :$port" | awk 'NR==2 {split($NF,a,"pid="); split(a[2],b,","); print b[1]}')

if [[ -n $pid ]]; then
  echo "Port $port is busy (PID $pid). Sending SIGTERM."
  kill -15 "$pid"
  sleep 2
  kill -0 "$pid" 2>/dev/null && echo "Still alive; escalating..." && kill -9 "$pid"
else
  echo "Port $port is already free."
fi

Drop it in ~/bin/freeport, mark executable, and call freeport 8080 before every dev run. Fewer keystrokes, fewer swearwords.

systemd, your tireless janitor

Create a service unit so systemd restarts your app when it crashes, but not when you manually murder it:

[Unit]
Description=Watchdog for MyApp on 8080

[Service]
ExecStart=/usr/local/bin/myapp
Restart=on-failure
# Don’t restart if we SIGKILLed it (comments need their own line in unit files)
RestartPreventExitStatus=SIGKILL

[Install]
WantedBy=multi-user.target

Save it as /etc/systemd/system/myapp.service, enable it with sudo systemctl enable --now myapp.service, grab coffee, forget ports ever mattered.

Ansible for the herd

- name: Free port 8080 across dev boxes
  hosts: dev
  become: true
  tasks:
    - name: Terminate offender on 8080
      shell: |
        pid=$(ss -lptn 'sport = :8080' | awk 'NR==2{split($NF,a,"pid=");split(a[2],b,",");print b[1]}')
        [ -n "$pid" ] && kill -15 "$pid" || echo "Nothing to kill"

Run it before each CI deploy; your colleagues will assume you possess sorcery.

A few cautionary tales

Containers restart themselves. Kill a process inside Docker, and the orchestrator may spin it right back up. Either stop the container or adjust restart policies.
Dependency dominoes. Shooting a backend API can topple every microservice that chats to it. Check systemctl status or your Kubernetes liveness probes before opening fire.
Sudo isn’t seasoning. Use it only when the victim process belongs to another user. Over-salting scripts with sudo causes security heartburn.

Wrap-up

Freeing a port isn’t arcane black magic; it’s janitorial work that keeps your development velocity brisk and your ops team sane. Identify the squatter, ask it nicely to leave, evict it if it refuses, and automate the routine so you rarely have to think about it again. Got a port-conflict horror story involving 3 a.m. pager alerts and too much caffeine? Tell me in the comments; schadenfreude is a powerful teacher.

Now shut that laptop lid and actually get on with your day. The ports are free, and so are you.

June 28, 2025 by Fernando SRE DevOps stuff Linux Stuff SRE stuff

Linux commands for the pathologically curious

We all get comfortable. We settle into our favorite chair, our favorite IDE, and our little corner of the Linux command line. We master ls, grep, and cd, and we walk around with the quiet confidence of someone who knows their way around. But the terminal isn’t a neat, modern condo; it’s a sprawling, old mansion filled with secret passages, dusty attics, and bizarre little tools left behind by generations of developers.

Most people stick to the main hallways, completely unaware of the weird, wonderful, and handy commands hiding just behind the wallpaper. These aren’t your everyday tools. These are the secret agents, the oddballs, and the unsung heroes of your operating system. Let’s meet a few of them.

The textual anarchists

Some commands don’t just process text; they delight in mangling it in beautiful and chaotic ways.

First, meet rev, the command-line equivalent of a party trick that turns out to be surprisingly useful. It takes whatever you give it and spits it out backward.

echo "desserts" | rev

This, of course, returns stressed. Coincidence? The terminal thinks not. At first glance, you might dismiss it as a tool for a nerdy poetry slam. But the next time you’re faced with a bizarrely reversed data string from some ancient legacy system, you’ll be typing rev and looking like a wizard.

If rev is a neat trick, shuf is its chaotic cousin. This command takes the lines in your file and shuffles them into a completely random order.

# Create a file with a few choices
echo -e "Order Pizza\nDeploy to Production\nTake a Nap" > decisions.txt

# Let the terminal decide your fate
shuf -n 1 decisions.txt

Why would you want to do this? Maybe you need to randomize a playlist, test an algorithm, or run a lottery for who has to fix the next production bug. shuf is an agent of chaos, and sometimes, chaos is exactly what you need.

Then there’s tac, which is cat spelled backward for a very good reason. While the ever-reliable cat shows you a file from top to bottom, tac shows it to you from bottom to top. This might sound trivial, but anyone who has ever tried to read a massive log file will see the genius.

# Instantly see the last 5 errors in a huge log file
tac /var/log/syslog | grep -i "error" | head -n 5

This lets you get straight to the juicy, most recent details without an eternity of scrolling.

The obsessive organizers

After all that chaos, you might need a little order. The terminal has a few neat freaks ready to help.

The nl command is like cat’s older, more sophisticated cousin who insists on numbering everything. It adds formatted line numbers to a file, turning a simple text document into something that looks official.

# Add line numbers to a script
nl backup_script.sh

Now you can professionally refer to “the critical bug on line 73” during your next code review.

But for true organizational bliss, there’s column. This magnificent tool takes messy, delimited text and formats it into beautiful, perfectly aligned columns.

# Let's say you have a file 'users.csv' like this:
# Name,Role,Location
# Alice,Dev,Remote
# Bob,Sysadmin,Office

cat users.csv | column -t -s,

This command transforms your comma-vomit into a table fit for a king. It’s so satisfying it should be prescribed as a form of therapy.

The tireless workers

Next, we have the commands that just do their job, repeatedly and without complaint.

In the entire universe of Linux, there is no command more agreeable than yes. Its sole purpose in life is to output a string over and over until you tell it to stop.

# Automate the confirmation for a script that keeps asking
yes | sudo apt install my-awesome-package

This is the digital equivalent of nodding along until the installation is complete. It is the ultimate tool for the lazy, the efficient, and the slightly tyrannical system administrator.

If yes is the eternal optimist, watch is the eternal observer. This command executes another program periodically, showing its output in real time.

# Monitor the number of established network connections every 2 seconds
watch -n 2 "ss -t | grep ESTAB | wc -l"

It turns your terminal into a live dashboard. It’s the command-line equivalent of binge-watching your system’s health, and it’s just as addictive.

For an even nosier observer, try dstat. It’s the town gossip of your system, an all-in-one tool that reports on everything from CPU stats to disk I/O.

# Get a running commentary of your system's vitals
dstat -tcnmd

This gives you a timestamped report on cpu, network, disk, and memory usage. It’s like top and iostat had a baby and it came out with a Ph.D. in system performance.

The specialized professionals

Finally, we have the specialists, the commands built for one hyper-specific and crucial job.

The look command is a dictionary search on steroids. It performs a lightning-fast search on a sorted file and prints every line that starts with your string.

# Find all words in the dictionary starting with 'compu'
look compu /usr/share/dict/words

It’s the hyper-efficient librarian who finds “computer,” “computation,” and “compulsion” before you’ve even finished your thought.

For more complex relationships, comm acts as a file comparison counselor. It takes two sorted files and tells you which lines are unique to each and which they share.

# File 1: developers.txt (sorted)
# alice
# bob
# charlie

# File 2: admins.txt (sorted)
# alice
# david
# eve

# See who is just a dev, just an admin, or both
comm developers.txt admins.txt

Perfect for figuring out who has access to what, or who is on both teams and thus doing twice the work.
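By default comm prints three columns at once, which gets hard to read. The numeric flags suppress columns, so you can ask one question at a time. A quick sketch using the same two sample files, written to a temporary directory:

```shell
dir=$(mktemp -d)
printf 'alice\nbob\ncharlie\n' > "$dir/developers.txt"
printf 'alice\ndavid\neve\n'   > "$dir/admins.txt"

# Each digit hides a column: 1 = only in file 1, 2 = only in file 2, 3 = in both.
both=$(comm -12 "$dir/developers.txt" "$dir/admins.txt")         # on both teams
only_devs=$(comm -23 "$dir/developers.txt" "$dir/admins.txt")    # devs who are not admins
only_admins=$(comm -13 "$dir/developers.txt" "$dir/admins.txt")  # admins who are not devs

echo "Doing twice the work: $both"
```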

The desire to procrastinate productively is a noble one, and Linux is here to help. Meet at. This command lets you schedule a job to run once at a specific time.

# Schedule a server reboot for 3 AM tomorrow.
# After hitting enter, you type the command(s) and press Ctrl+D.
at 3:00am tomorrow
reboot
^D (Ctrl+D)

Now you can go to sleep and let your past self handle the dirty work. It’s time travel for the command line.

And for the true control freak, there’s chrt. This command manipulates the real-time scheduling priority of a process. In simple terms, you can tell the kernel that your program is a VIP.

# Run a high-priority data processing script
sudo chrt -f 99 ./process_critical_data.sh

This tells the kernel, “Out of the way, peasants! This script is more important than whatever else you were doing.” With great power comes great responsibility, so use it wisely.

Keep digging

So there you have it, a brief tour of the digital freak show lurking inside your Linux system. These commands are the strange souvenirs left behind by generations of programmers, each one a solution to a problem you probably never knew existed. Your terminal is a treasure chest, but it’s one where half the gold coins might just be cleverly painted bottle caps. Each of these tools walks the fine line between a stroke of genius and a cry for help. The fun part isn’t just memorizing them, but that sudden, glorious moment of realization when one of these oddballs becomes the only thing in the world that can save your day.

June 14, 2025 by Fernando SRE DevOps stuff Linux Stuff

Secure and simplify EC2 access with AWS Session Manager

Accessing EC2 instances used to be a hassle. Bastion hosts, SSH keys, firewall rules, each piece added another layer of complexity and potential security risks. You had to open ports, distribute keys, and constantly manage access. It felt like setting up an intricate vault just to perform simple administrative tasks.

AWS Session Manager changes the game entirely. No exposed ports, no key distribution nightmares, and a complete audit trail of every session. Think of it as replacing traditional keys and doors with a secure, on-demand teleportation system, one that logs everything.

How AWS Session Manager works

Session Manager is part of AWS Systems Manager, a fully managed service that provides secure, browser-based, and CLI-based access to EC2 instances without needing SSH or RDP. Here’s how it works:

An SSM Agent runs on the instance and communicates outbound to AWS Systems Manager.
When you start a session, AWS verifies your identity and permissions using IAM.
Once authorized, a secure channel is created between your local machine and the instance, without opening any inbound ports.

This approach significantly reduces the attack surface. There is no need to open port 22 (SSH) or 3389 (RDP) for bastion hosts. Moreover, since authentication and authorization are managed by IAM policies, you no longer have to distribute or rotate SSH keys.

Setting up AWS Session Manager

Getting started with Session Manager is straightforward. Here’s a step-by-step guide:

1. Ensure the SSM agent is installed

Most modern Amazon Machine Images (AMIs) come with the SSM Agent pre-installed. If yours doesn’t, install it manually. On Amazon Linux, the following commands do the job (Ubuntu ships the agent as a snap instead, and other distributions have their own packages):

sudo yum install -y amazon-ssm-agent
sudo systemctl enable amazon-ssm-agent
sudo systemctl start amazon-ssm-agent

2. Create an IAM Role for EC2

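Your EC2 instance needs an IAM role, attached as an instance profile, so the SSM Agent can communicate with AWS Systems Manager; the AWS managed policy AmazonSSMManagedInstanceCore covers that side. The people who open sessions need permissions of their own. Attach a policy to the connecting IAM user or role that grants at least the following: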

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:StartSession"
      ],
      "Resource": [
        "arn:aws:ec2:REGION:ACCOUNT_ID:instance/INSTANCE_ID"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:TerminateSession",
        "ssm:ResumeSession"
      ],
      "Resource": [
        "arn:aws:ssm:REGION:ACCOUNT_ID:session/${aws:username}-*"
      ]
    }
  ]
}

Replace REGION, ACCOUNT_ID, and INSTANCE_ID with your actual values. For best security practices, apply the principle of least privilege by restricting access to specific instances or tags.
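On the instance side, the role usually only needs the AWS managed policy AmazonSSMManagedInstanceCore. A minimal sketch of attaching it from the CLI follows; the helper name attach_ssm_core and the role name in the usage line are placeholders, not values from this guide.

```shell
# attach_ssm_core ROLE_NAME: attach the AWS managed policy the SSM Agent relies on.
# ROLE_NAME is whatever you called the instance role; the function name is illustrative.
attach_ssm_core() {
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
}

# Usage (needs AWS credentials with IAM permissions):
#   attach_ssm_core MyEc2SsmRole
```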

3. Connect to your instance

Once the IAM role is attached, you’re ready to connect.

From the AWS Console: Navigate to EC2 > Instances, select your instance, click Connect, and choose Session Manager.

From the AWS CLI (with the Session Manager plugin installed), run:

aws ssm start-session --target i-xxxxxxxxxxxxxxxxx

That’s it, no SSH keys, no VPNs, no open ports.

Built-in security and auditing

Session Manager doesn’t just improve security, it also enhances compliance and auditing. Every session can be logged to Amazon S3 or CloudWatch Logs, capturing a full record of all executed commands. This ensures complete visibility into who accessed which instance and what actions were taken.

To enable logging, navigate to AWS Systems Manager > Session Manager, configure Session Preferences, and enable logging to an S3 bucket or CloudWatch Log Group.

Why Session Manager is better than traditional methods

Let’s compare Session Manager with traditional access methods:

Feature	Bastion Host & SSH	AWS Session Manager
Open inbound ports	Yes (22, 3389)	No
Requires SSH keys	Yes	No
Key rotation required	Yes	No
Logs session activity	Manual setup	Built-in
Works for on-premises	No	Yes

Session Manager removes unnecessary complexity. No more juggling bastion hosts, no more worrying about expired SSH keys, and no more open ports that expose your infrastructure to unnecessary risks.

Real-world applications and operational benefits

Session Manager is not just a theoretical improvement, it delivers real-world value in multiple scenarios:

Developers can quickly access production or staging instances without security concerns.
System administrators can perform routine maintenance without managing SSH key distribution.
Security teams gain complete visibility into instance access and command history.
Hybrid cloud environments benefit from unified access across AWS and on-premises infrastructure.

With these advantages, Session Manager aligns perfectly with modern cloud-native security principles, helping teams focus on operations rather than infrastructure headaches.

In summary

AWS Session Manager isn’t just another tool; it’s a fundamental shift in how we access EC2 instances securely. If you’re still relying on bastion hosts and SSH keys, it’s time to rethink your approach. Try it out, configure logging, and experience a simpler, more secure way to manage your instances. You might never go back to the old ways.

February 14, 2025 by Fernando SRE Cloud stuff SRE stuff

Deciphering AWS Network Mysteries with Reachability Analyzer

Let’s talk about the cloud, specifically, the tangled web of networks we build inside AWS. You spin up your Virtual Private Clouds (VPCs), toss in some subnets, sprinkle in a few security groups, configure those route tables, and before you know it, you’ve got a more complex network than a Rube Goldberg machine. Everything works great… until it doesn’t. A connection fails, an application times out, and you’re left scratching your head. Where do you even begin to troubleshoot?

This is the exact headache that AWS Reachability Analyzer is designed to cure. It is not the best-known tool in the AWS toolbox, but believe me, it’s a lifesaver when diagnosing network connectivity issues. This article will explore what Reachability Analyzer is, how this handy tool works its magic, and why you should use it to keep your AWS network humming along smoothly.

What exactly is AWS Reachability Analyzer?

So, what’s the deal with Reachability Analyzer? Think of it as your network detective. It’s a configuration analysis tool that lets you test the connectivity between a source and a destination within your AWS environment. The beauty of it is that it doesn’t send any live traffic. Instead, it does something much smarter.

This nifty tool analyzes your network configuration, your security groups, Network Access Control Lists (NACLs), route tables, and all that jazz. It then builds a virtual model of your network and simulates the path that traffic would take. This way it determines whether packets starting their journey at the source could reach their intended destination.

Reachability Analyzer is part of the VPC service but tightly integrates with AWS Network Manager. If you’re dealing with a global network spanning multiple regions, Network Manager lets you run these reachability analyses centrally, giving you a bird’s-eye view of connectivity across your entire infrastructure.

It’s essential to understand what Reachability Analyzer doesn’t do. It won’t test your application-level connectivity or tell you anything about latency. It strictly focuses on the network layer, making sure the path is clear, based on your setup. It also does not take into account firewall rules of the OS, or the capacity of the resources to handle the traffic.

The perks of using Reachability Analyzer

Why bother with Reachability Analyzer? Let me break down the key benefits:

Pinpoint Connectivity Problems Fast: No more endless digging through logs or running manual traceroutes. Reachability Analyzer quickly identifies the root cause of connectivity issues, saving you precious time and frustration.
Validate Your Network Setup: It helps ensure your network is configured exactly as you intended and that your security policies are correctly enforced.
Plan Network Changes with Confidence: Before making any changes to your network, you can use Reachability Analyzer to simulate the impact and avoid accidental outages.
Boost Your Security Posture: By uncovering potential configuration flaws, it helps you strengthen your network’s defenses.
Easy Peasy to Use: The interface is intuitive. You don’t need to be a networking guru to use it effectively.
Identify Components Involved: It shows you hop-by-hop the details of the virtual path between the origin and the destination, giving you visibility of the resources involved in the connection.

Reachability Analyzer in Action

Let’s get our hands dirty with some practical examples to see how Reachability Analyzer shines in real-world scenarios:

Scenario 1 – EC2 Instance Can’t Talk to RDS Database

Your application running on an EC2 instance is throwing a tantrum and can’t connect to your RDS database, even though they’re in the same VPC. Reachability Analyzer to the rescue! You set up an analysis between the EC2 instance’s Elastic Network Interface (ENI) and the RDS instance’s ENI.

Bam! Reachability Analyzer might reveal that the RDS security group is the culprit. It’s not allowing inbound traffic from the EC2 instance’s security group on the database port. The problem is identified, and you can fix the security group rule with surgical precision.

Scenario 2 – Testing Connectivity After Route Table Tweaks

You’ve just modified a route table to direct traffic between two subnets through a firewall. Now you need to be sure that connectivity is still working as expected.

Simply create an analysis between an instance in the source subnet and one in the destination subnet. Reachability Analyzer will show you the complete path, including the hop through the firewall. If there’s a hiccup in the route table or the firewall configuration, you’ll spot it immediately.

Scenario 3 – VPN Connectivity Woes

You’ve set up a VPN connection between your VPC and your on-premise network, but your users are complaining that they can’t access resources on-premise. Time to bring in Reachability Analyzer.

Run an analysis from an instance in your VPC to an IP address of a server in your on-premise network. Reachability Analyzer might show you that your subnet’s route table is missing a route to the on-premise network via the Virtual Private Gateway (VGW). Or maybe there is a problem with the configuration of your VPN tunnel. The results will give you the clues you need to troubleshoot the VPN setup.

Scenario 4 – Transit Gateway Validation

You are using a Transit Gateway to connect multiple VPCs, and you need to verify connectivity between them.

Configure tests between instances in different VPCs attached to the Transit Gateway. Reachability Analyzer will show you if the Transit Gateway route tables are correctly configured and if the VPCs can communicate through the resource. It can also help determine if there are asymmetric routing issues, where traffic flows in one direction but not the other.

How to use Reachability Analyzer

Ready to give it a spin? Here’s a simple step-by-step guide:

Access the Tool: Head over to the AWS Management Console, navigate to the VPC section, and you’ll find Reachability Analyzer there. Or, if you are using Network Manager, you can find it in that section.
Create an Analysis:
- Select your source and destination. This could be an EC2 instance, an ENI, an Internet Gateway, a VPN Gateway, and more.
- Specify the protocol (TCP or UDP) and optionally, the destination port.
- If needed and applicable, enter the source IP address or port.
Run the Analysis: Hit the “Create and run analysis path” button and let Reachability Analyzer do its thing.
Interpret the Results:
- The tool will tell you if the destination is “Reachable” or “Not reachable.”
- If there’s a problem, it will provide a detailed breakdown of the path, showing you exactly which component is blocking the connection and an explanation of why.
Run the Analysis from Network Manager: If you have a global network, run the reachability analysis from Network Manager for a broader view.
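The same steps can be scripted with the AWS CLI, which is handy in a pipeline. The sketch below creates a path, starts an analysis, waits for it to finish, and prints whether a path was found. The function name check_reachability, the fixed TCP protocol, and the five-second polling interval are choices made for this example.

```shell
# check_reachability SOURCE_ID DEST_ID PORT
# Creates a path, starts an analysis, waits for it to finish, and prints the
# NetworkPathFound value. Function and variable names are illustrative.
check_reachability() {
  path_id=$(aws ec2 create-network-insights-path \
    --source "$1" --destination "$2" \
    --protocol tcp --destination-port "$3" \
    --query 'NetworkInsightsPath.NetworkInsightsPathId' --output text) || return 1

  analysis_id=$(aws ec2 start-network-insights-analysis \
    --network-insights-path-id "$path_id" \
    --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' --output text) || return 1

  # Poll until the analysis leaves the "running" state.
  while :; do
    status=$(aws ec2 describe-network-insights-analyses \
      --network-insights-analysis-ids "$analysis_id" \
      --query 'NetworkInsightsAnalyses[0].Status' --output text) || return 1
    [ "$status" = "running" ] || break
    sleep 5
  done

  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$analysis_id" \
    --query 'NetworkInsightsAnalyses[0].NetworkPathFound' --output text
}

# Usage (needs AWS credentials; the IDs are placeholders):
#   check_reachability eni-SOURCE eni-DESTINATION 5432
```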

Wrapping Up

AWS Reachability Analyzer is a powerful tool that simplifies network troubleshooting and gives you greater control over your AWS environment. It’s like having X-ray vision for your network. So, next time you encounter a connectivity mystery in your AWS setup, don’t panic. Fire up Reachability Analyzer, and you will have answers in minutes. Try it out, experiment, and unlock the secrets of your network.

January 23, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

AWS Fault Injection service, the unknown service

Let’s discuss something near and dear to every AWS Architect and DevOps Engineer’s heart: resilience. Or, as I like to call it, “making sure your digital baby doesn’t throw a tantrum when things go sideways.”

We’ve all been there. Like a magnificent sandcastle, you build this beautiful, intricate system in the cloud. It’s got auto-scaling, high availability, and the works. You’re feeling pretty proud of yourself. Then, BAM! Some unforeseen event, a tiny ripple in the force of the internet, and your sandcastle starts to crumble. Panic ensues.

But what if, instead of waiting for disaster to strike, you could be a bit… mischievous? What if you could poke and prod your system before it has a meltdown in front of your users? Enter AWS Fault Injection Service (FIS), formerly known as Fault Injection Simulator, a service that’s about as well-known as a quiet librarian at a rock concert, but far more useful.

What’s this FIS thing, anyway?

Think of FIS as your friendly neighborhood chaos monkey but with a PhD in engineering and a strict code of conduct. It’s a fully managed service that lets you run controlled chaos experiments on your AWS workloads. Yes, you read that right. You can intentionally break things but in a safe and measured way. It is like playing Jenga but only for advanced players.

Why would you do that, you ask? Well, my friends, it’s all about finding those hidden weaknesses before they become major headaches. It’s like giving your application a stress test, similar to how doctors check your heart’s health. You want to see how it handles the pressure before it’s out there running a marathon in the real world. The idea is simple: you don’t know how strong the dam will be until you put the river on it.

Why is this CHAOS stuff so important?

In the old days (you know, like five years ago), we tested for predictable failures. Server goes down? No problem, we have a backup! But the cloud is a complex beast, and failures can be, well, weird. Latency spikes, partial network outages, API throttling… it’s a jungle out there.

FIS helps you simulate these real-world, often unpredictable scenarios. By deliberately injecting faults, you expose how your system behaves under stress. That’s how you find out whether the great ideas on your whiteboard translate into a resilient system in the cloud.

This isn’t just about avoiding downtime, though that’s a big plus. It’s about:

Improving Reliability: Find and fix weak points, leading to a more robust and dependable system.
Boosting Performance: Identify bottlenecks and optimize your application’s response under duress.
Validating Your Assumptions: Does your fancy auto-scaling work as intended? FIS will tell you.
Building Confidence: Knowing your system can handle the unexpected gives you peace of mind. And maybe, just maybe, you can sleep through the night without getting paged. A DevOps Engineer can dream, right?

Let’s get our hands dirty (Virtually, of course)

So, how does this magical chaos tool work? FIS operates through experiment templates. These are like recipes for disaster (the good kind, of course). In these templates, you define:

Actions: What kind of mischief do you want to unleash? FIS offers a menu of pre-built actions, like:
- aws:ec2:stop-instances: Stop EC2 instances. You pick which ones.
- aws:ec2:terminate-instances: Terminate EC2 instances. Poof, they are gone.
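- aws:ssm:send-command: Run an SSM document on an instance that causes, for example, CPU stress (AWSFIS-Run-CPU-Stress), memory stress, or network latency (AWSFIS-Run-Network-Latency).
- aws:fis:inject-api-throttle-error: Inject throttling errors into supported AWS API calls made by a target IAM role.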
Targets: Where do you want to inject these faults? You can target specific EC2 instances, ECS clusters, EKS clusters, RDS databases… You get the idea. You can select the resources by tags, by name, by percentage… You have plenty of options here.
Stop Conditions: This is your “emergency brake.” You define CloudWatch alarms that, if triggered, will automatically halt the experiment. Safety first, people! If the experiment starts affecting more components than expected, the stop condition will be your friend.
IAM Role: This role is very important. It will give the FIS service permission to inject the fault into your resources. Remember to assign only the necessary permissions, nothing more.
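
The four ingredients above map onto the JSON that FIS expects. Here is a minimal sketch that prints a template for stopping one tagged EC2 instance and starting it again after five minutes. The tag chaos-ready=true, the account ID 111122223333, the alarm name, and the role name are placeholders invented for illustration, so swap in your own before using it.

```shell
# Print a minimal FIS experiment template as JSON.
# The tag, account ID, alarm ARN, and role ARN are hypothetical placeholders.
fis_template() {
  cat <<'EOF'
{
  "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
  "targets": {
    "oneInstance": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "chaos-ready": "true" },
      "selectionMode": "COUNT(1)"
    }
  },
  "actions": {
    "stopInstance": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": { "startInstancesAfterDuration": "PT5M" },
      "targets": { "Instances": "oneInstance" }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:111122223333:alarm:HighErrorRate"
    }
  ],
  "roleArn": "arn:aws:iam::111122223333:role/FisExperimentRole",
  "tags": { "Name": "stop-one-instance" }
}
EOF
}

# Save it for review; creating the template in AWS is a separate, explicit step:
#   fis_template > template.json
#   aws fis create-experiment-template --cli-input-json file://template.json
```

Keeping the template in a file (and in version control) means the experiment gets reviewed like any other change before it touches real resources.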

Once you’ve crafted your experiment template, you can run it and watch the magic (or mayhem) unfold. FIS provides detailed logs and integrates with CloudWatch, so you can monitor the impact in real time.
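
Running and watching an experiment from a terminal might look roughly like the sketch below. It assumes the AWS CLI is installed and configured, and that you pass in the ID of a template you have already created. The helper names are made up for this example.

```shell
# Succeed (return 0) once an FIS experiment has reached a final state.
# pending, initiating, running, and stopping are treated as still in flight.
experiment_finished() {
  case "$1" in
    completed|stopped|failed|cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Start an experiment from a template ID and poll its status every 10 seconds.
# Nothing runs until you call this function yourself.
run_experiment() {
  id=$(aws fis start-experiment --experiment-template-id "$1" \
         --query 'experiment.id' --output text) || return 1
  while :; do
    state=$(aws fis get-experiment --id "$id" \
              --query 'experiment.state.status' --output text) || return 1
    echo "experiment $id: $state"
    experiment_finished "$state" && break
    sleep 10
  done
}

# Usage (replace the placeholder with a real template ID):
#   run_experiment <template-id>
```

A status of stopped usually means a stop condition fired or someone hit the brake manually, which is worth a closer look before the next run.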

FIS in the Wild

Let’s say you have a microservices architecture running on ECS. You want to test how your system handles the failure of a critical service. With FIS, you could create an experiment that:

Action: Terminates a percentage of the tasks in your critical service.
Target: Your ECS service, specifically the tasks tagged as “critical-service.”
Stop Condition: A CloudWatch alarm that triggers if your application’s latency exceeds a certain threshold or the error rate increases.

By running this experiment, you can observe how your other services react, whether your load balancing works as expected, and if your system can gracefully recover.

Or, imagine you want to test the resilience of your RDS database. You could simulate a failover by:

Action: aws:rds:reboot-db-instances with the forceFailover parameter set to true.
Target: Your primary RDS instance.
Stop Condition: A CloudWatch alarm that monitors the database’s availability.

This allows you to validate your Multi-AZ setup and ensure a smooth transition to the standby in case of a real-world primary instance failure.

I remember one time I was helping a startup that had a critical application running on EC2. They were convinced their auto-scaling was flawless. We used FIS to simulate a sudden loss of capacity by terminating a bunch of instances. Guess what? Their auto-scaling took longer to kick in than they expected, leading to a brief period of performance degradation. Thanks to the experiment, they were able to fix the issue, avoiding real user impact in the future.

My Two Cents (and Maybe a Few More)

I’ve been around the AWS block a few times, and I can tell you that FIS is a game-changer. It’s not just about breaking things; it’s about understanding things. It’s about building systems that are not just robust on paper but resilient in the face of the unpredictable chaos of the real world.

January 18, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Managing SSL certificates with SNI on AWS ALB and NLB

The challenge of hosting multiple SSL-Secured sites

Let’s talk about security on the web. You want your website to be secure. Of course, you do! That’s where HTTPS and those little SSL/TLS certificates come in. They’re like the secret handshakes of the internet, ensuring that the information flowing between your site and visitors is safe from prying eyes. But here’s the thing: back in the day, if you wanted a bunch of websites, each with its own certificate, you needed a separate IP address for every one of them. Imagine having to get a new phone number for every person you wanted to call! It was a real headache and cost a pretty penny, too, especially if you were running a whole bunch of websites.

Defining SNI as a modern SSL/TLS extension

Now, what if I told you there was a clever way around this whole IP address mess? That’s where this little gem called Server Name Indication (SNI) comes in. It’s like a smart little addition to the way websites and browsers talk to each other securely. Think of it this way, your server’s IP address is like a big apartment building, and each website is a different apartment. Without SNI, it’s like visitors can only shout the building’s address (the IP address). The doorman (the server) wouldn’t know which apartment to send them to. SNI fixes that. It lets the visitor whisper both the building address and the apartment number (the website’s name) right at the start. Pretty neat.

Understanding the SNI handshake process

So, how does this SNI thing work? Let’s lift the hood and take a peek at the engine, shall we? It all happens during this little dance called the SSL/TLS handshake, the very beginning of a secure connection.

Client Hello: First, the client (like your web browser) says “Hello!” to the server. But now, thanks to SNI, it also whispers the name of the website it wants to talk to. It is like saying “Hey, I want to connect, and by the way, I’m looking for ‘www.example.com’”.
Server Selection: The server gets this message and, because it’s a smart cookie, it checks the SNI part. It uses that website name to pick out the right secret handshake (the SSL certificate) from its big box of handshakes.
Server Hello: The server then says “Hello!” back, showing off the certificate it picked.
Secure Connection: The client checks if the handshake looks legit, and if it does, boom! You’ve got yourself a secure connection. It’s like a secret club where everyone knows the password, and they’re all speaking in code so no one else can understand.
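
The Server Selection step boils down to a lookup keyed on the name in the Client Hello. Here is a toy sketch of that lookup. The hostnames and certificate labels are invented for illustration, and the wildcard match is looser than what a real TLS stack does (a real wildcard certificate covers only one label).

```shell
# Pick a certificate label for the server name sent in the Client Hello.
# Hostnames and labels are hypothetical; real servers do this inside the TLS stack.
pick_cert() {
  case "$1" in
    www.example.com)  echo "cert-example-com" ;;
    shop.example.net) echo "cert-shop-example-net" ;;
    *.example.org)    echo "cert-wildcard-example-org" ;;
    *)                echo "cert-default" ;;   # no SNI sent, or an unknown name
  esac
}

pick_cert "www.example.com"
pick_cert "blog.example.org"
pick_cert ""

# To watch a real server make this choice, openssl can send any name you like:
#   openssl s_client -connect www.example.com:443 -servername www.example.com </dev/null
```

The fallback branch matters: a client that sends no server name at all still gets a certificate, just the default one.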

AWS load balancers and SNI as a perfect match

Now, let’s bring this into the world of Amazon Web Services (AWS). They’ve got these things called load balancers, which are like traffic cops for websites, directing visitors to the right place. The newer ones, Application Load Balancers (ALB) and Network Load Balancers (NLB), are big fans of SNI. It means you can have a whole bunch of websites, each with its own certificate, all hiding behind one of these load balancers. Each of those websites could be running on different computers (EC2 instances, as they call them), but the load balancer, thanks to SNI, knows exactly where to send the visitors.
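
In practice, “a whole bunch of certificates on one load balancer” means attaching extra certificates to an existing HTTPS (ALB) or TLS (NLB) listener. The sketch below only prints the AWS CLI call unless you set RUN=1. The function name is made up, and both ARNs are whatever you pass in.

```shell
# Print (or, with RUN=1, execute) the AWS CLI call that attaches an extra
# certificate to an existing listener. $1 = listener ARN, $2 = certificate ARN.
add_listener_cert() {
  listener_arn=$1
  cert_arn=$2
  set -- aws elbv2 add-listener-certificates \
    --listener-arn "$listener_arn" \
    --certificates "CertificateArn=$cert_arn"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"          # really call AWS
  else
    echo "$@"     # dry run: show the command only
  fi
}

# Usage (placeholders, not real ARNs):
#   add_listener_cert <listener-arn> <certificate-arn>
```

The dry-run default is deliberate: it is easier to paste a printed command into a change ticket than to explain a surprise certificate.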

CloudFront’s adoption of SNI for secure content delivery at scale

And it’s not just load balancers. AWS has this other thing called CloudFront, which is like a super-fast delivery service for websites. It makes sure your website loads quickly for people all over the world. And guess what? CloudFront loves SNI, too. It lets you have different secret handshakes (certificates) for different websites, even if they’re all being delivered through the same CloudFront setup. Just remember, the old-timer, Classic Load Balancer (CLB), doesn’t know this SNI trick. It’s a bit behind the times, so keep that in mind.

Cost savings through optimized resource utilization

Why should you care about all this? Well, for starters, it saves you money! Instead of needing a whole bunch of IP addresses (which cost money), you can use just one with SNI. It is like sharing an office space instead of everyone renting their own building.

Simplified management by streamlining certificate handling

And it makes your life a whole lot easier, too. Managing those secret handshakes (certificates) can be a real pain. But with SNI, you can manage them all in one place on your load balancer. It is way simpler than running around to a dozen different offices to update everyone’s secret handshake.

Enhanced scalability for efficient infrastructure growth

And if your website gets popular, no problem, SNI lets you add new websites to your load balancer without breaking a sweat. You don’t have to worry about getting new IP addresses every time you want to launch a new site. It’s like adding new apartments to your building without having to change the building’s address.

Client compatibility to ensure broad support

Now, I have to be honest with you. There might be some really, really old web browsers out there that haven’t heard of SNI. But, honestly, they’re becoming rarer than a dodo bird. Most browsers these days are smart enough to handle SNI, so you don’t have to worry about it.

SNI as a cornerstone of modern Web hosting on AWS

So, there you have it. SNI is like a secret weapon for running websites securely and efficiently on AWS. It’s a clever little trick that saves you money, simplifies your life, and lets your website grow without any headaches. It is proof that even small changes to the way things work on the internet can make a huge difference. When you’re building things on AWS, remember SNI. It’s like having a master key that unlocks a whole bunch of possibilities for a secure and scalable future. It’s a neat piece of engineering if you ask me.

January 5, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

User-Agent and Amazon CloudFront behaviors explained

We all love a good glass of lemonade, right? But let’s be honest: “One size fits all” doesn’t always work. Some like it sweet, some like it tart, and some like it with a twist. Running a successful lemonade stand or website means understanding these individual preferences. The first step? Listening to your customers, or in the case of the web, understanding the information their browsers send you.

The internet works similarly. Websites are like your lemonade stand, and users’ browsers are the customers coming up to ask for a drink. But instead of just saying “lemonade, please,” browsers send a whole bunch of information with their requests, tucked away in “headers.”

The User-Agent, your browser’s secret identity

One of these headers is the mighty “User-Agent.” Think of it as your browser’s secret identity. It tells the website, “Hey, I’m Chrome on a Windows laptop!” or “Howdy, I’m Safari on an iPhone!”

This is super important because, just like you’d tweak your lemonade recipe, websites want to serve the best experience for each device. A website designed for a big desktop screen might look cramped and clunky on a tiny phone. Using the User-Agent, the website can say, “Aha! This is a mobile user, let me send them the mobile-optimized version of my page!”
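
To make that concrete, here is a deliberately rough sketch of how a server-side script might sort User-Agent strings into buckets. Real device detection has far more edge cases, and the patterns and the function name are simplifications chosen for this example.

```shell
# Roughly classify a User-Agent string as tablet, mobile, or desktop.
# The patterns are a simplification; production detection is much messier.
device_class() {
  case "$1" in
    *iPad*|*Tablet*)            echo "tablet" ;;
    *Mobi*|*iPhone*|*Android*)  echo "mobile" ;;
    *)                          echo "desktop" ;;
  esac
}

device_class "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1"
device_class "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
```

The tablet check comes first on purpose, because an iPad’s User-Agent also contains the word Mobile.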

Now, let’s say your lemonade stand has become so popular that you need help. You hire someone to stand at the end of the block and direct people to you. This helper is like Amazon CloudFront, a content delivery network (CDN) that makes your website faster by storing copies of it all over the world.

CloudFront, the speedy delivery guy

CloudFront is brilliant. It’s like having mini lemonade stands everywhere, so customers get their drinks quicker. But there’s a catch. By default, CloudFront is a bit too eager to simplify things. It might think, “Lemonade is lemonade! Everyone gets the same!” and drop some of those important headers before contacting your origin. The User-Agent gets special treatment: CloudFront replaces it with its own value, Amazon CloudFront.

This can lead to situations where users don’t get the optimal experience. For instance, mobile users might be served a clunky desktop version of a website, leading to frustration and a poor user experience. It becomes evident that CloudFront, while powerful, needs a little guidance to handle these nuances.

Behaviors, teaching CloudFront some manners

Luckily, CloudFront is a fast learner. You can teach it to handle those headers properly using “Behaviors.” Think of behaviors as special instructions you give to CloudFront. You can say things like, “Hey CloudFront, when someone asks for my website, please forward the User-Agent header to my origin server.” The “origin server” is where your website’s content ultimately resides. Typically, this is an Application Load Balancer (ALB) acting as a single point of contact and distributing traffic to a group of EC2 instances running your web application.

The solution, straight from the horse’s mouth

So, to ensure the best user experience for all visitors of a website delivered through CloudFront, you need to configure the CloudFront distribution’s behavior. Specifically, you tell it to forward the User-Agent header. This way, the website (your origin server) will know what kind of device is asking for the page and can serve the right version.
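
One way to express that instruction is an origin request policy that lists User-Agent, which you then attach to the behavior. The sketch below prints such a policy config. The policy name is a made-up placeholder, and this is just one possible setup rather than the only way to forward the header.

```shell
# Print an origin request policy config that forwards only the User-Agent header.
# The policy name is a hypothetical placeholder.
ua_policy_config() {
  cat <<'EOF'
{
  "Name": "forward-user-agent",
  "Comment": "Forward the viewer User-Agent header to the origin",
  "HeadersConfig": {
    "HeaderBehavior": "whitelist",
    "Headers": { "Quantity": 1, "Items": ["User-Agent"] }
  },
  "CookiesConfig": { "CookieBehavior": "none" },
  "QueryStringsConfig": { "QueryStringBehavior": "none" }
}
EOF
}

# Create the policy, then attach it to the distribution's behavior:
#   ua_policy_config > policy.json
#   aws cloudfront create-origin-request-policy \
#     --origin-request-policy-config file://policy.json
```

One caveat worth keeping in mind: an origin request policy forwards the header but does not add it to the cache key, so if the origin returns different pages per device, the cache settings need to account for that too.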

Why not add the User-Agent to the origin custom headers, as an alternative approach? Well, that’s like handing the lemonade stand one pre-printed name tag and sticking it on every customer. Origin custom headers carry fixed values that CloudFront attaches to every request, so they can’t tell the origin what each visitor’s browser actually said. Forwarding the header as part of the standard request is much cleaner and more reliable.

Wrapping it up, keep it simple and smart

And there you have it! The User-Agent header is a browser’s way of saying what it is, and CloudFront behaviors let you customize how your website handles that information. By understanding these simple concepts, you can make sure your website is serving the right experience to every user, whether they’re on a phone, a tablet, or a good old-fashioned desktop computer.

The internet, just like a good lemonade recipe, is all about understanding your audience and delivering the best experience possible. And sometimes, all it takes is a little tweak in the right place.

December 28, 2024 by Fernando SRE Cloud stuff

How to check if a folder is used by services on Linux

You know that feeling when you’re spring cleaning your Linux system and spot that mysterious folder lurking around forever? Your finger hovers over the delete key, but something makes you pause. Smart move! Before removing any folder, wouldn’t it be nice to know if any services are actively using it? It’s like checking if someone’s sitting in a chair before moving it. Today, I’ll show you how to do that, and I promise to keep it simple and fun.

Why should you care?

You see, in the world of DevOps and SysOps, understanding which services are using your folders is becoming increasingly important. It’s like being a detective in your own system – you need to know what’s happening behind the scenes to avoid accidentally breaking things. Think of it as checking if the room is empty before turning off the lights!

Meet your two best friends lsof and fuser

Let me introduce you to two powerful tools that will help you become this system detective: lsof and fuser. They’re like X-ray glasses for your Linux system, letting you see invisible connections between processes and files.

The lsof command as your first tool

lsof stands for “list open files” (pretty straightforward, right?). Here’s how you can use it:

lsof +D /path/to/your/folder

This command is like asking, “Hey, who’s using stuff in this folder?” The system will then show you a list of all processes that are accessing files in that directory. It’s that simple!

Let’s break down what you’ll see:

COMMAND: The name of the program using the folder
PID: A unique number identifying the process (like its ID card)
USER: Who’s running the process
FD: File descriptor (don’t worry too much about this one)
TYPE: Type of file
DEVICE: Device numbers
SIZE/OFF: Size of the file, or the current offset within it
NODE: Inode number (system’s way of tracking files)
NAME: Path to the file

The fuser command as your second tool

Now, let’s meet fuser. It’s like lsof’s cousin, but with a different approach:

fuser -v /path/to/your/folder

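This command shows you which processes are using the folder but in a more concise way. It’s perfect when you want a quick overview without too many details. Keep in mind that, given a directory, fuser reports processes using the directory itself (for example, as their working directory); add -m to list every process using any file on the filesystem that contains it.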

Examples

Let’s say you have a folder called /var/www/html and you want to check if your web server is using it:

lsof +D /var/www/html

You might see something like:

COMMAND  PID   USER      FD  TYPE  DEVICE  SIZE/OFF  NODE   NAME
apache2  1234  www-data  3r  REG   252,0   12345     67890  /var/www/html/index.html

This tells you that Apache is reading files from that folder, good to know before making any changes!

Pro tips and best practices

Always check before deleting: When in doubt, it’s better to check twice than to break something once. It’s like looking both ways before crossing the street!

Watch out for performance: The lsof +D command checks all subfolders too, which can be slow for large directories. For quicker checks of just the folder’s top level, you can use:

lsof +d /path/to/folder

Combine commands for better insights: You can pipe these commands with grep for more specific searches:

lsof +D /path/to/folder | grep service_name

Troubleshooting common scenarios

Sometimes you might run these commands and get no output. Don’t panic! This usually means no processes are currently using the folder. However, remember that:

Some processes might open and close files quickly
You might need sudo privileges to see everything
System processes might be using files in ways that aren’t immediately visible
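
On minimal systems (slim containers, for instance) lsof and fuser may not even be installed. As a fallback, the same kind of information can be read straight from /proc. The sketch below is Linux-only and is not how lsof itself is implemented; the function name is made up. It only sees processes you are allowed to inspect, so run it with sudo for the full picture.

```shell
# List PIDs that hold an open file under DIR, or have their working
# directory there. Linux-only sketch that reads /proc directly.
pids_using_dir() {
  dir=$(cd "$1" && pwd -P) || return 1      # resolve symlinks in the path
  for link in /proc/[0-9]*/fd/* /proc/[0-9]*/cwd; do
    target=$(readlink "$link" 2>/dev/null) || continue   # process gone or no access
    case "$target" in
      "$dir"|"$dir"/*)
        pid=${link#/proc/}                  # strip the /proc/ prefix
        echo "${pid%%/*}"                   # keep only the PID component
        ;;
    esac
  done | sort -un
}

# Usage:
#   pids_using_dir /var/www/html
```

An empty result carries the same caveats as an empty lsof listing: short-lived file access and processes owned by other users can slip past.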

Conclusion

Understanding which services are using your folders is crucial in modern DevOps and SysOps environments. With lsof and fuser, you have powerful tools at your disposal to make informed decisions about your system’s folders.

Remember, the key is to always check before making changes. It’s better to spend a minute checking than an hour fixing it! These tools are your friends in maintaining a healthy and stable Linux system.

Quick reference

# Check folder usage with lsof
lsof +D /path/to/folder

# Quick check with fuser
fuser -v /path/to/folder

# Check specific service
lsof +D /path/to/folder | grep service_name

# Check folder without recursion
lsof +d /path/to/folder

The commands we’ve explored today are just the beginning of your journey into better Linux system management. As you become more comfortable with these tools, you’ll find yourself naturally integrating them into your daily DevOps and SysOps routines. They’ll become an essential part of your system maintenance toolkit, helping you make informed decisions and prevent those dreaded “Oops, I shouldn’t have deleted that” moments.

Being cautious with system modifications isn’t about being afraid to make changes, it’s about making changes confidently because you understand what you’re working with. Whether you’re managing a single server or orchestrating a complex cloud infrastructure, these simple yet powerful commands will help you maintain system stability and peace of mind.

Keep exploring, keep learning, and most importantly, keep your Linux systems running smoothly. The more you practice these techniques, the more natural they’ll become. And remember, in the world of system administration, a minute of checking can save hours of troubleshooting!

December 25, 2024 by Fernando SRE DevOps stuff Linux Stuff SRE stuff

Beyond 404, Exploring the Universe of Elastic Load Balancer Errors

In the world of cloud computing, Elastic Load Balancers (ELBs) play a crucial role in distributing incoming application traffic across multiple targets, such as EC2 instances, containers, and IP addresses. As a Cloud Architect or DevOps engineer, understanding the error messages associated with ELBs is essential for maintaining robust and reliable systems. This article aims to demystify the most common ELB error messages, providing you with the knowledge to quickly identify and resolve issues.

The Power of Load Balancers

Before we explore the error messages, let’s briefly recap the main features of Load Balancers:

Traffic Distribution: ELBs efficiently distribute incoming application traffic across multiple targets.
High Availability: They improve application fault tolerance by automatically routing traffic away from unhealthy targets.
Auto Scaling: ELBs work seamlessly with Auto Scaling groups to handle varying loads.
Security: They can offload SSL/TLS decryption, reducing the computational burden on your application servers.
Health Checks: Regular health checks ensure that traffic is only routed to healthy targets.

Now, let’s explore the error messages you might encounter when working with ELBs.

Decoding ELB Error Messages

When troubleshooting issues with your ELB, you’ll often encounter HTTP status codes. These codes are divided into two main categories:

4xx errors: Client-side errors
5xx errors: Server-side errors

Understanding this distinction is crucial for pinpointing the source of the problem and implementing the appropriate solution.

Client-Side Errors (4xx)

These errors indicate that the issue originates from the client’s request. Some common 4xx errors include:

400 Bad Request: The request was malformed or invalid.
401 Unauthorized: The request lacks valid authentication credentials.
403 Forbidden: The client cannot access the requested resource.
404 Not Found: The requested resource doesn’t exist on the server.

Server-Side Errors (5xx)

These errors suggest that the problem lies with the server. Common 5xx errors include:

500 Internal Server Error: A generic error message when the server encounters an unexpected condition.
502 Bad Gateway: The server received an invalid response from an upstream server.
503 Service Unavailable: The server is temporarily unable to handle the request.
504 Gateway Timeout: The server didn’t receive a timely response from an upstream server.
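
To see which side is misbehaving in practice, you can tally status codes from Application Load Balancer access logs. A minimal sketch, assuming the standard ALB access log layout, where the load balancer’s status code is the ninth space-separated field; the function name is invented for this example.

```shell
# Count responses per ELB status code from ALB access log lines on stdin,
# labelling each code as a client-side (4xx) or server-side (5xx) error.
summarize_status() {
  awk '{ count[$9]++ }
       END {
         for (code in count) {
           side = (code ~ /^4/) ? "client" : ((code ~ /^5/) ? "server" : "other")
           print code, side, count[code]
         }
       }' | sort
}

# Usage, with log files already downloaded from S3 (they are gzip-compressed):
#   gunzip -c *.log.gz | summarize_status
```

Each output line holds the status code, which side it points to, and how many times it appeared.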

The Frustrating HTTP 504: Gateway Timeout Error

The 504 Gateway Timeout error deserves special attention due to its frequency and the frustration it can cause. This error occurs when the ELB doesn’t receive a response from the target within the configured timeout period (the idle timeout, which defaults to 60 seconds on an Application Load Balancer).

Common causes of 504 errors include:

Overloaded backend servers
Network connectivity issues
Misconfigured timeout settings
Database query timeouts

To resolve 504 errors, you may need to:

Increase the timeout settings on your ELB
Optimize your application’s performance
Scale your backend resources
Check for and resolve any network issues
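
For the first item, the knob on an Application Load Balancer is the idle_timeout.timeout_seconds attribute, which accepts values from 1 to 4000 seconds. The sketch below validates a value and prints the matching AWS CLI command rather than running it; the function name is made up.

```shell
# Print the AWS CLI command that sets an ALB's idle timeout.
# $1 = load balancer ARN, $2 = timeout in seconds (1 to 4000).
idle_timeout_cmd() {
  case "$2" in
    ''|*[!0-9]*)
      echo "timeout must be a whole number of seconds" >&2
      return 1 ;;
  esac
  if [ "$2" -lt 1 ] || [ "$2" -gt 4000 ]; then
    echo "ALB idle timeout must be between 1 and 4000 seconds" >&2
    return 1
  fi
  echo aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn "$1" \
    --attributes "Key=idle_timeout.timeout_seconds,Value=$2"
}

# Usage (placeholder ARN):
#   idle_timeout_cmd <load-balancer-arn> 120
```

Raising the timeout buys slow requests more time, but it also hides the slowness, so it works best alongside the other items on the list rather than instead of them.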

List of Common Error Messages

Here’s a more comprehensive list of error messages you might encounter:

400 Bad Request
401 Unauthorized
403 Forbidden
404 Not Found
408 Request Timeout
413 Payload Too Large
500 Internal Server Error
501 Not Implemented
502 Bad Gateway
503 Service Unavailable
504 Gateway Timeout
505 HTTP Version Not Supported

Tips to Avoid Errors and Quickly Identify Problems

Implement robust logging and monitoring: Use tools like CloudWatch to track ELB metrics and set up alarms for quick notification of issues.
Regularly review and optimize your application: Conduct performance testing to identify bottlenecks before they cause problems in production.
Use health checks effectively: Configure appropriate health check settings to ensure traffic is only routed to healthy targets.
Implement circuit breakers: Use circuit breakers in your application to prevent cascading failures.
Practice proper error handling: Ensure your application handles errors gracefully and provides meaningful error messages.
Keep your infrastructure up-to-date: Regularly patch your target instances and review your listeners’ security policies; the load balancer itself is managed and updated by AWS.
Use AWS X-Ray: Implement AWS X-Ray to gain insights into request flows and quickly identify the root cause of errors.
Implement proper security measures: Use security groups, network ACLs, and SSL/TLS to secure your ELB and prevent unauthorized access.
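
As a starting point for the monitoring tip, here is a sketch that prints a CloudWatch alarm command for 5xx responses generated by an Application Load Balancer itself. The threshold and periods are arbitrary starting values, not recommendations, and the function name is invented for this example.

```shell
# Print a put-metric-alarm command for 5xx responses generated by an ALB.
# $1 = alarm name, $2 = LoadBalancer dimension value (app/<name>/<id>).
# Threshold and periods are arbitrary starting points.
elb_5xx_alarm_cmd() {
  echo aws cloudwatch put-metric-alarm \
    --alarm-name "$1" \
    --namespace AWS/ApplicationELB \
    --metric-name HTTPCode_ELB_5XX_Count \
    --dimensions "Name=LoadBalancer,Value=$2" \
    --statistic Sum --period 60 --evaluation-periods 5 \
    --threshold 10 --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching
}

# Usage (placeholder values):
#   elb_5xx_alarm_cmd my-alb-5xx app/<name>/<id>
```

Add --alarm-actions with an SNS topic ARN if you want the alarm to notify someone rather than just change state.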

In a few words

Understanding Elastic Load Balancer error messages is crucial for maintaining a robust and reliable cloud infrastructure. By familiarizing yourself with common error codes, their causes, and potential solutions, you’ll be better equipped to troubleshoot issues quickly and effectively.

Remember, the key to managing ELB errors lies in proactive monitoring, regular optimization, and a deep understanding of your application’s architecture. By following the tips provided and continuously improving your knowledge, you’ll be well-prepared to handle any ELB-related challenges that come your way.

As cloud architectures continue to evolve, staying informed about the latest best practices and error-handling techniques will be essential for success in your role as a Cloud Architect or DevOps engineer.

July 12, 2024 by Fernando SRE Cloud stuff DevOps stuff SRE stuff