L2 Support Engineer · Fintech · Week 6.5
Alerts & Alertmanager
Monitoring tells you the numbers. Alerting is what happens when those numbers cross a limit — the right person gets notified at the right time with the right information. Today you learn how to configure that entire chain.
Alert Rules · Severity Levels · Notifications · Routing · Proactive Response
01 The Simple Idea First
Real-life Analogy
Think of Alertmanager like a fire alarm system in a building. The smoke detectors measure temperature and smoke (monitoring). When a sensor crosses its threshold — the alarm fires (alert rule triggered). But different alarms have different responses — a kitchen smoke detector triggers a different response than a server room fire alarm.
Alertmanager is the intelligent routing system between the alarm and the response. It decides: which alarm goes to which person, how urgent it is, whether to silence repeated alerts, and whether to group similar alerts into one notification instead of 50 separate ones.
What is Alertmanager?
Alertmanager is a component that sits between your monitoring system (Prometheus, Grafana) and your notification channels (Slack, email, PagerDuty, WhatsApp). It receives alerts when rules fire, processes them, and routes them to the right destination.
Without Alertmanager, you would get raw alert pings with no context. With it, you get structured notifications that tell you: what fired, how severe it is, which team should respond, and what the alert means — grouped and deduplicated so you are not flooded with 200 identical messages.
Key features: Routing (send different alerts to different people), Grouping (bundle related alerts into one), Silencing (mute during maintenance), and Inhibition (suppress low-priority alerts when a high-priority one is firing).
02 Alert Rules — How They Work
What is an Alert Rule?
An alert rule is a condition that — when true for a defined period of time — fires an alert. It has three parts: the condition (what to check), the duration (how long it must be true before firing), and the labels (metadata like severity and team).
The duration matters because a metric might spike for 2 seconds and recover. You don't want an alert for that. You set the duration to 5 minutes — meaning the condition must be true continuously for 5 minutes before it fires. This prevents noise from short-lived spikes.
Alert rule structure — Prometheus format (simplified)
# Rule 1 — High CPU Alert
alert: HighCPU
condition: cpu_usage_percent > 90
for: 5m # must be true for 5 mins before firing
severity: critical
message: "CPU above 90% for 5 minutes on {{ server }}"
# Rule 2 — Disk Warning Alert
alert: DiskSpaceWarning
condition: disk_usage_percent > 80
for: 10m # 10 mins — disk fills slowly
severity: warning
message: "Disk at {{ value }}% on {{ server }}"
# Rule 3 — Payment Failure Rate
alert: PaymentFailureRateHigh
condition: error_rate_percent > 5
for: 2m # fire fast — 5% error rate is critical
severity: critical
message: "Payment error rate at {{ value }}% — P1 candidate"
03 Alert Severity Levels — What Each One Means
Why severity levels matter
Not every alert needs to wake someone up at 2 AM. Severity levels let Alertmanager route different alerts to different people in different ways. A warning at 3 AM waits until 9 AM. A critical at 3 AM wakes the on-call engineer immediately. Getting severity right is what makes alerting useful instead of annoying.
CRITICAL — Page the on-call engineer now
Active outage or imminent data loss. Needs immediate human response regardless of time of day. Example: Payment success rate below 80%, DB unreachable, disk at 100%, CPU pinned at 98% for 10 minutes. Route to: PagerDuty phone call + Slack #incidents + WhatsApp.
WARNING — Needs attention soon, not instantly
System is degraded but not fully broken. Can wait until working hours or the next check-in. Example: Disk at 84%, CPU at 85%, error rate at 3%, response time creeping up. Route to: Slack #monitoring channel only, no phone call.
INFO — Informational, no action required
Something worth noting but not a problem. For awareness only. Example: Deployment completed, backup finished, a service restarted successfully, traffic is higher than usual but still within safe limits. Route to: Slack #ops-log only.
RESOLVED — The problem cleared itself
A previous alert is no longer firing. Alertmanager sends a resolution notification automatically. Example: CPU came back down, disk was cleaned, error rate returned to normal. Route to: Same channel as the original alert, marked as resolved.
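The severity-to-channel mapping above is exactly what an Alertmanager routing tree encodes. Here is a minimal sketch of that config; the receiver names, channel names, and the key/webhook placeholders are illustrative choices, not values fixed by this course.
alertmanager.yml — routing by severity (excerpt, sketch)
route:
  receiver: slack-monitoring            # default route: warnings land in the #monitoring channel
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall        # criticals page the on-call engineer and post to #incidents
    - match:
        severity: info
      receiver: slack-ops-log           # info notifications go to #ops-log only

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
    slack_configs:
      - channel: '#incidents'
        api_url: '<slack-webhook-url>'
        send_resolved: true             # also send the RESOLVED notification
  - name: slack-monitoring
    slack_configs:
      - channel: '#monitoring'
        api_url: '<slack-webhook-url>'
        send_resolved: true
  - name: slack-ops-log
    slack_configs:
      - channel: '#ops-log'
        api_url: '<slack-webhook-url>'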
04 How Alertmanager Routes Alerts
🔔 Alert Lifecycle — From Rule to Notification
Step 1 — Monitoring
Metric crosses the threshold
Prometheus or your monitoring tool sees that a metric (CPU, error rate, disk) has crossed the value defined in the alert rule and has stayed above it for the defined duration.
cpu_usage > 90% for 5 minutes → Alert fires
Step 2 — Alertmanager receives
Alert delivered to Alertmanager
The monitoring system sends the alert to Alertmanager with its labels — severity, name, affected server, and timestamp. Alertmanager now processes it.
Alert: HighCPU | severity=critical | server=pay-server-01
Step 3 — Routing
Alertmanager decides who to notify
Based on routing rules, Alertmanager decides which receiver gets this alert. Critical alerts go to the on-call team via PagerDuty. Warnings go to Slack. Grouping combines multiple related alerts into one message.
severity=critical → PagerDuty + Slack #incidents
Step 4 — Silencing check
Is there a silence active?
Alertmanager checks if anyone has silenced this alert (e.g. during planned maintenance). If silenced — the notification is suppressed. If not silenced — it proceeds to delivery.
Silence active? No → proceed | Yes → suppress
Step 5 — Notification sent
Engineer is notified
The notification is delivered to the configured receiver — Slack message, PagerDuty call, email, or WhatsApp. The engineer receives it, acknowledges, and begins investigation.
Slack: 🚨 CRITICAL — HighCPU on pay-server-01 for 5 min
Step 6 — Resolution
Auto-resolved notification when metric recovers
When CPU drops below 90%, the alert condition is no longer true. Alertmanager automatically sends a RESOLVED notification to the same channels so the team knows the issue cleared.
Slack: ✅ RESOLVED — HighCPU on pay-server-01 — cleared
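The hand-off in Step 2 can also be reproduced by hand: Alertmanager accepts alerts over HTTP as a JSON array of label/annotation objects. A minimal sketch follows, assuming an Alertmanager instance listening on its default port 9093 on localhost; today's lab does not run one, so read this as payload anatomy rather than a command you must execute.
terminal — push a test alert into Alertmanager (sketch)
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "HighCPU",
          "severity": "critical",
          "server": "pay-server-01"
        },
        "annotations": {
          "summary": "CPU above 90% for 5 minutes"
        }
      }]'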
05 4 Key Alertmanager Concepts
📋 Alertmanager Features — What Each One Does
| Feature | What it does | When you use it |
| Routing | Sends different alerts to different people based on labels like severity or team | Always — every alert needs a route. Critical → on-call, warning → team Slack |
| Grouping | Bundles multiple related alerts into one notification instead of sending 50 separate messages | When a full outage fires 20 alerts simultaneously — you get 1 grouped message |
| Silencing | Mutes alerts temporarily so planned maintenance doesn't trigger false alarms | Before any maintenance window — silence alerts for the affected server |
| Inhibition | Suppresses low-priority alerts when a higher-priority alert for the same system is already firing | When a DB is down — suppresses all downstream alerts caused by the DB being down |
💡 Grouping is the most important for day-to-day L2 work. When a DB goes down, 15 different services will each fire their own alert. Without grouping you get 15 Slack messages. With grouping you get 1 message: "15 services affected — root cause likely DB."
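Grouping and inhibition live in the same Alertmanager config file. A minimal sketch of the relevant fields is below; the timing values are chosen for illustration, and the DBUnreachable alert name and server label are taken from the examples on this page.
alertmanager.yml — grouping and inhibition (excerpt, sketch)
route:
  group_by: ['alertname']          # alerts sharing a name collapse into one notification
  group_wait: 30s                  # wait 30s to collect related alerts before the first message
  group_interval: 5m               # wait 5m before announcing new alerts added to the group
  repeat_interval: 4h              # re-send a still-firing group every 4 hours

inhibit_rules:
  - source_match:
      alertname: DBUnreachable     # while this critical alert is firing...
      severity: critical
    target_match:
      severity: warning            # ...suppress warnings that share the same server label
    equal: ['server']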
06 Hands-on Lab — Configure an Alert
Lab scenario
You are the L2 engineer setting up alerts for the payment server. You need to configure three alerts — high CPU, disk warning, and payment failure rate — and simulate one of them firing so you see the full notification chain in action.
🔬 Lab: Build and Test an Alert System on Kali
Bash · Kali Linux
Create the alert rules config file
This file defines all your alert rules — condition, duration, severity, and message. In a real setup this would go into Prometheus. Here we simulate the structure.
terminal — create alert rules file
cat > ~/alert-rules.conf << 'EOF'
# ============================================
# Payment Server Alert Rules
# Engineer: Urwah Shafiq | Week 6.5 Day 5
# ============================================
[alert:HighCPU]
condition = cpu_percent > 90
for = 5m
severity = critical
team = l2-oncall
notify = slack,pagerduty,whatsapp
message = CPU above 90% for 5 minutes. Investigate immediately.
[alert:DiskWarning]
condition = disk_percent > 80
for = 10m
severity = warning
team = l2-team
notify = slack
message = Disk usage at {value}%. Schedule cleanup within 24 hours.
[alert:DiskCritical]
condition = disk_percent > 90
for = 2m
severity = critical
team = l2-oncall
notify = slack,pagerduty
message = DISK CRITICAL at {value}%. DB writes will fail. Act now.
[alert:PaymentFailureRate]
condition = error_rate_percent > 5
for = 2m
severity = critical
team = l2-oncall
notify = slack,pagerduty,whatsapp
message = Payment error rate at {value}%. P1 candidate. Open bridge.
[alert:QueueDepthGrowing]
condition = queue_depth > 50
for = 5m
severity = warning
team = l2-team
notify = slack
message = MQ queue depth at {value}. Consumer may be slow or down.
EOF
echo "Alert rules file created."
→ alert-rules.conf created with 5 configured alert rules across 2 severity levels (critical and warning).
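A quick sanity check before moving on: confirm the heredoc actually wrote all five rules and count how many fall into each severity.
terminal — verify the rules file
grep -c '^\[alert:' ~/alert-rules.conf                 # expect: 5
grep '^severity' ~/alert-rules.conf | sort | uniq -c   # expect: 3 critical, 2 warning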
Build the alert evaluation script
This script reads real metrics from your server and evaluates each rule — just as Prometheus evaluates its alert rules on a fixed interval before handing anything that fires to Alertmanager.
alert-evaluator.sh — create this script
#!/bin/bash
# Alert Evaluator — checks real metrics against rules
# chmod +x alert-evaluator.sh && ./alert-evaluator.sh
echo "========================================"
echo " Alert Evaluator — $(date)"
echo "========================================"
ALERTS_FIRED=0
# --- RULE 1: CPU ---
CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}' | tr -d '%id,')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc 2>/dev/null || echo "0")
if [ $(echo "$CPU_USED > 90" | bc 2>/dev/null) -eq 1 ] 2>/dev/null; then
echo "[CRITICAL] HighCPU — CPU at ${CPU_USED}% → Notify: Slack + PagerDuty + WhatsApp"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] CPU at ${CPU_USED}% — below 90% threshold"
fi
# --- RULE 2 & 3: DISK ---
DISK=$(df / | awk 'NR==2{print $5}' | tr -d '%')
if [ "$DISK" -ge 90 ]; then
echo "[CRITICAL] DiskCritical — Disk at ${DISK}% → Notify: Slack + PagerDuty"
ALERTS_FIRED=$((ALERTS_FIRED+1))
elif [ "$DISK" -ge 80 ]; then
echo "[ WARN ] DiskWarning — Disk at ${DISK}% → Notify: Slack"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] Disk at ${DISK}% — below 80% threshold"
fi
# --- RULE 4: MEMORY/SWAP ---
SWAP=$(free -m | awk 'NR==3{print $3}')
if [ "$SWAP" -gt 500 ]; then
echo "[ WARN ] HighSwap — Swap at ${SWAP}MB → Notify: Slack"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] Swap at ${SWAP}MB — within normal range"
fi
# --- RULE 5: LOG ERROR RATE ---
ERR=$(grep -c "ERROR" ~/payment-service.log 2>/dev/null)   # grep -c prints 0 when there are no matches
ERR=${ERR:-0}                                              # file missing: grep prints nothing, so default to 0
if [ "$ERR" -gt 5 ]; then
echo "[CRITICAL] PaymentErrors — $ERR errors in log → Notify: All channels"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] Error count: $ERR — below threshold"
fi
echo "----------------------------------------"
echo " Total alerts fired: $ALERTS_FIRED"
[ "$ALERTS_FIRED" -gt 0 ] \
&& echo " Action required!" \
|| echo " All systems OK."
echo "========================================"
terminal — save and run
chmod +x ~/alert-evaluator.sh
./alert-evaluator.sh
→ Script reads real CPU, disk, memory, and log error count. Evaluates each rule. Prints CRITICAL, WARN, or OK for every one with the notification channel it would fire to.
Trigger a real alert — simulate disk threshold crossing
Lower the disk threshold to a number you will definitely cross — so you can see the CRITICAL alert fire with your real disk value.
terminal — force a CRITICAL alert to fire
# Check your current disk % first
df -h / | awk 'NR==2{print "Current disk: "$5}'
# Run the evaluator with a lower threshold to simulate a critical alert
DISK=$(df / | awk 'NR==2{print $5}' | tr -d '%')
FAKE_THRESHOLD=1 # set threshold to 1% — your disk will always be above this
if [ "$DISK" -ge "$FAKE_THRESHOLD" ]; then
echo "[CRITICAL] DiskCritical — Disk at ${DISK}%"
echo " → Routing to: Slack #incidents + PagerDuty"
echo " → Message: DISK CRITICAL. DB writes will fail. Act now."
echo " → Severity: CRITICAL | Team: l2-oncall"
fi
→ CRITICAL alert fires with your real disk value. You see exactly what notification text and routing would be sent — disk critical, P1 level, notify l2-oncall team.
Set up silence — simulate a maintenance window
Before planned maintenance, you silence alerts so normal maintenance activity does not page everyone.
silence-manager.sh — create this
#!/bin/bash
# Simulates creating and checking a silence
SILENCE_FILE="~/active-silence.txt"
NOW=$(date +%s)
END_TIME=$(date -d "+2 hours" +%s 2>/dev/null || date -v +2H +%s)
echo "Creating silence for maintenance window..."
echo "SILENCE_ACTIVE=true" > ~/active-silence.txt
echo "SILENCE_REASON=Planned maintenance - disk cleanup" >> ~/active-silence.txt
echo "SILENCE_BY=Urwah Shafiq" >> ~/active-silence.txt
echo "SILENCE_DURATION=2 hours" >> ~/active-silence.txt
echo ""
echo "Silence created:"
cat ~/active-silence.txt
echo ""
echo "All alerts will now be suppressed for 2 hours."
echo "Run: rm ~/active-silence.txt — to clear the silence."
terminal
chmod +x ~/silence-manager.sh
./silence-manager.sh
→ Silence created. Shows reason, creator, duration. This is what you do before every maintenance window — silence first, do the work, clear the silence when done.
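One gap worth closing: alert-evaluator.sh as written never looks at the silence file, so it would keep firing during the maintenance window. Below is a small sketch of the check you could paste near the top of the evaluator, using the same ~/active-silence.txt created above.
alert-evaluator.sh — optional silence check (sketch, add near the top)
# Skip all notifications while a silence is active
if [ -f "$HOME/active-silence.txt" ]; then
    echo "[SILENCED] Active silence found, all alerts suppressed:"
    cat "$HOME/active-silence.txt"
    exit 0
fi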
Schedule the alert evaluator with crontab
Run the evaluator every 5 minutes automatically — this is your personal Alertmanager running on Kali.
terminal
# Schedule alert evaluation every 5 minutes
(crontab -l 2>/dev/null; echo "# Alert evaluator — every 5 minutes") | crontab -
(crontab -l 2>/dev/null; echo "*/5 * * * * $HOME/alert-evaluator.sh >> $HOME/alert-log.txt 2>&1") | crontab -
echo "Scheduled. Active cron jobs:"
crontab -l
→ Alert evaluator runs every 5 minutes. Output saved to ~/alert-log.txt. Your proactive monitoring is now live. ✅
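Once the cron job has run a few times, ~/alert-log.txt becomes your alert history. A few quick checks you can run against it (the file only exists after the first scheduled run):
terminal — review the alert history
tail -n 40 ~/alert-log.txt                               # the most recent evaluation runs
grep -c 'CRITICAL' ~/alert-log.txt                       # how many CRITICAL lines so far
grep 'Total alerts fired' ~/alert-log.txt | tail -n 5    # fired-alert count over the last 5 runs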
07 Real L2 Scenarios
01
It is 3 AM. CPU spikes to 94% and stays there. Alertmanager fires the HighCPU CRITICAL rule after 5 minutes. PagerDuty calls the on-call engineer. Slack #incidents gets a message. WhatsApp pings the L2 lead. The engineer wakes up, investigates, kills the runaway process. Total response time: 8 minutes. Nobody had to be watching a screen.
02
Disk fills to 83% at 11 PM. DiskWarning fires — severity WARNING — goes to Slack only, no phone call. The L2 engineer sees it at 9 AM, schedules cleanup that morning. If it had reached 91% instead, DiskCritical would have fired and paged someone. The two-level alert prevented a 3 AM wake-up for a non-emergency.
03
Planned maintenance at 2 AM — disk cleanup on the payment server. Before starting, you create a silence for 2 hours. During cleanup, disk temporarily hits 95% as old logs are moved. No alerts fire because of the silence. You complete the maintenance, remove the silence at 4 AM, and normal alerting resumes. Zero false alarms.
04
DB goes down. Alertmanager receives 18 alerts simultaneously — from the payment service, the MQ consumer, the API gateway, and the batch job, all screaming that DB is unreachable. Grouping bundles them into 1 message: "18 alerts firing — DBUnreachable is the common label. Likely root cause: DB outage." One message, clear context, engineer knows exactly what to investigate.
✅ Week 6.5 · Day 5 Outcomes
- Explain what Alertmanager is and why it sits between monitoring and notifications
- Describe the structure of an alert rule — condition, duration, severity, message, and receiver
- Know the 4 severity levels — CRITICAL, WARNING, INFO, RESOLVED — and what action each requires
- Explain the 4 key Alertmanager features — Routing, Grouping, Silencing, and Inhibition
- Trace the full alert lifecycle — metric fires → Alertmanager receives → routes → silence check → notification sent → resolved
- Complete the lab — create 5 alert rules in a config file, build alert-evaluator.sh that checks real metrics and maps them to rules, trigger a CRITICAL alert, create a silence for a maintenance window, and schedule automatic evaluation with crontab
- Understand why the duration field matters — prevents noise from short-lived spikes
- Understand why grouping matters — prevents alert storms from flooding your team during an outage