L2 Support Engineer · Fintech · Week 6.5
Alerts & Alertmanager
Monitoring tells you the numbers. Alerting is what happens when those numbers cross a limit — the right person gets notified at the right time with the right information. Today you learn how to configure that entire chain.
Alert Rules · Severity Levels · Notifications · Routing · Proactive Response
01 The Simple Idea First
Real-life Analogy
Think of Alertmanager like a fire alarm system in a building. The smoke detectors measure temperature and smoke (monitoring). When a sensor crosses its threshold — the alarm fires (alert rule triggered). But different alarms have different responses — a kitchen smoke detector triggers a different response than a server room fire alarm.
Alertmanager is the intelligent routing system between the alarm and the response. It decides: which alarm goes to which person, how urgent it is, whether to silence repeated alerts, and whether to group similar alerts into one notification instead of 50 separate ones.
What is Alertmanager?
Alertmanager is a component that sits between your monitoring system (Prometheus, Grafana) and your notification channels (Slack, email, PagerDuty, WhatsApp). It receives alerts when rules fire, processes them, and routes them to the right destination.
Without Alertmanager, you would get raw alert pings with no context. With it, you get structured notifications that tell you: what fired, how severe it is, which team should respond, and what the alert means — grouped and deduplicated so you are not flooded with 200 identical messages.
Key features: Routing (send different alerts to different people), Grouping (bundle related alerts into one), Silencing (mute during maintenance), and Inhibition (suppress low-priority alerts when a high-priority one is firing).
02 Alert Rules — How They Work
What is an Alert Rule?
An alert rule is a condition that — when true for a defined period of time — fires an alert. It has three parts: the condition (what to check), the duration (how long it must be true before firing), and the labels (metadata like severity and team).
The duration matters because a metric might spike for 2 seconds and recover. You don't want an alert for that. You set the duration to 5 minutes — meaning the condition must be true continuously for 5 minutes before it fires. This prevents noise from short-lived spikes.
Alert rule structure — Prometheus format (simplified)
# Rule 1 — High CPU Alert
alert: HighCPU
condition: cpu_usage_percent > 90
for: 5m # must be true for 5 mins before firing
severity: critical
message: "CPU above 90% for 5 minutes on {{ server }}"
# Rule 2 — Disk Warning Alert
alert: DiskSpaceWarning
condition: disk_usage_percent > 80
for: 10m # 10 mins — disk fills slowly
severity: warning
message: "Disk at {{ value }}% on {{ server }}"
# Rule 3 — Payment Failure Rate
alert: PaymentFailureRateHigh
condition: error_rate_percent > 5
for: 2m # fire fast — 5% error rate is critical
severity: critical
message: "Payment error rate at {{ value }}% — P1 candidate"
03 Alert Severity Levels — What Each One Means
Why severity levels matter
Not every alert needs to wake someone up at 2 AM. Severity levels let Alertmanager route different alerts to different people in different ways. A warning at 3 AM waits until 9 AM. A critical at 3 AM wakes the on-call engineer immediately. Getting severity right is what makes alerting useful instead of annoying.
CRITICAL — Page the on-call engineer now
Active outage or imminent data loss. Needs immediate human response regardless of time of day. Example: Payment success rate below 80%, DB unreachable, disk at 100%, CPU pinned at 98% for 10 minutes. Route to: PagerDuty phone call + Slack #incidents + WhatsApp.
WARNING — Needs attention soon, not instantly
System is degraded but not fully broken. Can wait until working hours or the next check-in. Example: Disk at 84%, CPU at 85%, error rate at 3%, response time creeping up. Route to: Slack #monitoring channel only, no phone call.
INFO — Informational, no action required
Something worth noting but not a problem. For awareness only. Example: Deployment completed, backup finished, a service restarted successfully, traffic is higher than usual but still within safe limits. Route to: Slack #ops-log only.
RESOLVED — The problem cleared itself
A previous alert is no longer firing. Alertmanager sends a resolution notification automatically. Example: CPU came back down, disk was cleaned, error rate returned to normal. Route to: Same channel as the original alert, marked as resolved.
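The severity-to-channel mapping above is exactly what an Alertmanager routing tree encodes. Here is a minimal sketch of that config; the receiver names, channel names, and the key/webhook placeholders are illustrative choices, not values fixed by this course.
alertmanager.yml — routing by severity (excerpt, sketch)
route:
  receiver: slack-monitoring            # default route: warnings land in the #monitoring channel
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall        # criticals page the on-call engineer and post to #incidents
    - match:
        severity: info
      receiver: slack-ops-log           # info notifications go to #ops-log only

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: '<pagerduty-integration-key>'
    slack_configs:
      - channel: '#incidents'
        api_url: '<slack-webhook-url>'
        send_resolved: true             # also send the RESOLVED notification
  - name: slack-monitoring
    slack_configs:
      - channel: '#monitoring'
        api_url: '<slack-webhook-url>'
        send_resolved: true
  - name: slack-ops-log
    slack_configs:
      - channel: '#ops-log'
        api_url: '<slack-webhook-url>'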
04 How Alertmanager Routes Alerts
🔔 Alert Lifecycle — From Rule to Notification
Step 1 — Monitoring
Metric crosses the threshold
Prometheus or your monitoring tool sees that a metric (CPU, error rate, disk) has crossed the value defined in the alert rule and has stayed above it for the defined duration.
cpu_usage > 90% for 5 minutes → Alert fires
Step 2 — Alertmanager receives
Alert delivered to Alertmanager
The monitoring system sends the alert to Alertmanager with its labels — severity, name, affected server, and timestamp. Alertmanager now processes it.
Alert: HighCPU | severity=critical | server=pay-server-01
Step 3 — Routing
Alertmanager decides who to notify
Based on routing rules, Alertmanager decides which receiver gets this alert. Critical alerts go to the on-call team via PagerDuty. Warnings go to Slack. Grouping combines multiple related alerts into one message.
severity=critical → PagerDuty + Slack #incidents
Step 4 — Silencing check
Is there a silence active?
Alertmanager checks if anyone has silenced this alert (e.g. during planned maintenance). If silenced — the notification is suppressed. If not silenced — it proceeds to delivery.
Silence active? No → proceed | Yes → suppress
Step 5 — Notification sent
Engineer is notified
The notification is delivered to the configured receiver — Slack message, PagerDuty call, email, or WhatsApp. The engineer receives it, acknowledges, and begins investigation.
Slack: 🚨 CRITICAL — HighCPU on pay-server-01 for 5 min
Step 6 — Resolution
Auto-resolved notification when metric recovers
When CPU drops below 90%, the alert condition is no longer true. Alertmanager automatically sends a RESOLVED notification to the same channels so the team knows the issue cleared.
Slack: ✅ RESOLVED — HighCPU on pay-server-01 — cleared
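The hand-off in Step 2 can also be reproduced by hand: Alertmanager accepts alerts over HTTP as a JSON array of label/annotation objects. A minimal sketch follows, assuming an Alertmanager instance listening on its default port 9093 on localhost; today's lab does not run one, so read this as payload anatomy rather than a command you must execute.
terminal — push a test alert into Alertmanager (sketch)
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {
          "alertname": "HighCPU",
          "severity": "critical",
          "server": "pay-server-01"
        },
        "annotations": {
          "summary": "CPU above 90% for 5 minutes"
        }
      }]'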
05 4 Key Alertmanager Concepts
📋 Alertmanager Features — What Each One Does
| Feature | What it does | When you use it |
| Routing | Sends different alerts to different people based on labels like severity or team | Always — every alert needs a route. Critical → on-call, warning → team Slack |
| Grouping | Bundles multiple related alerts into one notification instead of sending 50 separate messages | When a full outage fires 20 alerts simultaneously — you get 1 grouped message |
| Silencing | Mutes alerts temporarily so planned maintenance doesn't trigger false alarms | Before any maintenance window — silence alerts for the affected server |
| Inhibition | Suppresses low-priority alerts when a higher-priority alert for the same system is already firing | When a DB is down — suppresses all downstream alerts caused by the DB being down |
💡 Grouping is the most important for day-to-day L2 work. When a DB goes down, 15 different services will each fire their own alert. Without grouping you get 15 Slack messages. With grouping you get 1 message: "15 services affected — root cause likely DB."
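Grouping and inhibition live in the same Alertmanager config file. A minimal sketch of the relevant fields is below; the timing values are chosen for illustration, and the DBUnreachable alert name and server label are taken from the examples on this page.
alertmanager.yml — grouping and inhibition (excerpt, sketch)
route:
  group_by: ['alertname']          # alerts sharing a name collapse into one notification
  group_wait: 30s                  # wait 30s to collect related alerts before the first message
  group_interval: 5m               # wait 5m before announcing new alerts added to the group
  repeat_interval: 4h              # re-send a still-firing group every 4 hours

inhibit_rules:
  - source_match:
      alertname: DBUnreachable     # while this critical alert is firing...
      severity: critical
    target_match:
      severity: warning            # ...suppress warnings that share the same server label
    equal: ['server']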
06 Hands-on Lab — Configure an Alert
Lab scenario
You are the L2 engineer setting up alerts for the payment server. You need to configure three alerts — high CPU, disk warning, and payment failure rate — and simulate one of them firing so you see the full notification chain in action.
🔬 Lab: Build and Test an Alert System on Kali
Bash · Kali Linux
Create the alert rules config file
This file defines all your alert rules — condition, duration, severity, and message. In a real setup this would go into Prometheus. Here we simulate the structure.
terminal — create alert rules file
cat > ~/alert-rules.conf << 'EOF'
# ============================================
# Payment Server Alert Rules
# Engineer: Urwah Shafiq | Week 6.5 Day 5
# ============================================
[alert:HighCPU]
condition = cpu_percent > 90
for = 5m
severity = critical
team = l2-oncall
notify = slack,pagerduty,whatsapp
message = CPU above 90% for 5 minutes. Investigate immediately.
[alert:DiskWarning]
condition = disk_percent > 80
for = 10m
severity = warning
team = l2-team
notify = slack
message = Disk usage at {value}%. Schedule cleanup within 24 hours.
[alert:DiskCritical]
condition = disk_percent > 90
for = 2m
severity = critical
team = l2-oncall
notify = slack,pagerduty
message = DISK CRITICAL at {value}%. DB writes will fail. Act now.
[alert:PaymentFailureRate]
condition = error_rate_percent > 5
for = 2m
severity = critical
team = l2-oncall
notify = slack,pagerduty,whatsapp
message = Payment error rate at {value}%. P1 candidate. Open bridge.
[alert:QueueDepthGrowing]
condition = queue_depth > 50
for = 5m
severity = warning
team = l2-team
notify = slack
message = MQ queue depth at {value}. Consumer may be slow or down.
EOF
echo "Alert rules file created."
→ alert-rules.conf created with 5 configured alert rules across 2 severity levels (critical and warning).
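A quick sanity check before moving on: confirm the heredoc actually wrote all five rules and count how many fall into each severity.
terminal — verify the rules file
grep -c '^\[alert:' ~/alert-rules.conf                 # expect: 5
grep '^severity' ~/alert-rules.conf | sort | uniq -c   # expect: 3 critical, 2 warning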
Build the alert evaluation script
This script reads real metrics from your server and evaluates each rule — just as Prometheus evaluates its alert rules on a fixed interval before handing anything that fires to Alertmanager.
alert-evaluator.sh — create this script
#!/bin/bash
# Alert Evaluator — checks real metrics against rules
# chmod +x alert-evaluator.sh && ./alert-evaluator.sh
echo "========================================"
echo " Alert Evaluator — $(date)"
echo "========================================"
ALERTS_FIRED=0
# --- RULE 1: CPU ---
CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}' | tr -d '%id,')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc 2>/dev/null || echo "0")
if [ $(echo "$CPU_USED > 90" | bc 2>/dev/null) -eq 1 ] 2>/dev/null; then
echo "[CRITICAL] HighCPU — CPU at ${CPU_USED}% → Notify: Slack + PagerDuty + WhatsApp"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] CPU at ${CPU_USED}% — below 90% threshold"
fi
# --- RULE 2 & 3: DISK ---
DISK=$(df / | awk 'NR==2{print $5}' | tr -d '%')
if [ "$DISK" -ge 90 ]; then
echo "[CRITICAL] DiskCritical — Disk at ${DISK}% → Notify: Slack + PagerDuty"
ALERTS_FIRED=$((ALERTS_FIRED+1))
elif [ "$DISK" -ge 80 ]; then
echo "[ WARN ] DiskWarning — Disk at ${DISK}% → Notify: Slack"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] Disk at ${DISK}% — below 80% threshold"
fi
# --- RULE 4: MEMORY/SWAP ---
SWAP=$(free -m | awk 'NR==3{print $3}')
if [ "$SWAP" -gt 500 ]; then
echo "[ WARN ] HighSwap — Swap at ${SWAP}MB → Notify: Slack"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] Swap at ${SWAP}MB — within normal range"
fi
# --- RULE 5: LOG ERROR RATE ---
ERR=$(grep -c "ERROR" ~/payment-service.log 2>/dev/null)   # grep -c prints 0 when there are no matches
ERR=${ERR:-0}                                              # file missing: grep prints nothing, so default to 0
if [ "$ERR" -gt 5 ]; then
echo "[CRITICAL] PaymentErrors — $ERR errors in log → Notify: All channels"
ALERTS_FIRED=$((ALERTS_FIRED+1))
else
echo "[ OK ] Error count: $ERR — below threshold"
fi
echo "----------------------------------------"
echo " Total alerts fired: $ALERTS_FIRED"
[ "$ALERTS_FIRED" -gt 0 ] \
&& echo " Action required!" \
|| echo " All systems OK."
echo "========================================"
terminal — save and run
chmod +x ~/alert-evaluator.sh
./alert-evaluator.sh
→ Script reads real CPU, disk, memory, and log error count. Evaluates each rule. Prints CRITICAL, WARN, or OK for every one with the notification channel it would fire to.
Trigger a real alert — simulate disk threshold crossing
Lower the disk threshold to a number you will definitely cross — so you can see the CRITICAL alert fire with your real disk value.
terminal — force a CRITICAL alert to fire
# Check your current disk % first
df -h / | awk 'NR==2{print "Current disk: "$5}'
# Run the evaluator with a lower threshold to simulate a critical alert
DISK=$(df / | awk 'NR==2{print $5}' | tr -d '%')
FAKE_THRESHOLD=1 # set threshold to 1% — your disk will always be above this
if [ "$DISK" -ge "$FAKE_THRESHOLD" ]; then
echo "[CRITICAL] DiskCritical — Disk at ${DISK}%"
echo " → Routing to: Slack #incidents + PagerDuty"
echo " → Message: DISK CRITICAL. DB writes will fail. Act now."
echo " → Severity: CRITICAL | Team: l2-oncall"
fi
→ CRITICAL alert fires with your real disk value. You see exactly what notification text and routing would be sent — disk critical, P1 level, notify l2-oncall team.
Set up silence — simulate a maintenance window
Before planned maintenance, you silence alerts so normal maintenance activity does not page everyone.
silence-manager.sh — create this
#!/bin/bash
# Simulates creating and checking a silence
SILENCE_FILE="~/active-silence.txt"
NOW=$(date +%s)
END_TIME=$(date -d "+2 hours" +%s 2>/dev/null || date -v +2H +%s)
echo "Creating silence for maintenance window..."
echo "SILENCE_ACTIVE=true" > ~/active-silence.txt
echo "SILENCE_REASON=Planned maintenance - disk cleanup" >> ~/active-silence.txt
echo "SILENCE_BY=Urwah Shafiq" >> ~/active-silence.txt
echo "SILENCE_DURATION=2 hours" >> ~/active-silence.txt
echo ""
echo "Silence created:"
cat ~/active-silence.txt
echo ""
echo "All alerts will now be suppressed for 2 hours."
echo "Run: rm ~/active-silence.txt — to clear the silence."
terminal
chmod +x ~/silence-manager.sh
./silence-manager.sh
→ Silence created. Shows reason, creator, duration. This is what you do before every maintenance window — silence first, do the work, clear the silence when done.
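One gap worth closing: alert-evaluator.sh as written never looks at the silence file, so it would keep firing during the maintenance window. Below is a small sketch of the check you could paste near the top of the evaluator, using the same ~/active-silence.txt created above.
alert-evaluator.sh — optional silence check (sketch, add near the top)
# Skip all notifications while a silence is active
if [ -f "$HOME/active-silence.txt" ]; then
    echo "[SILENCED] Active silence found, all alerts suppressed:"
    cat "$HOME/active-silence.txt"
    exit 0
fi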
Schedule the alert evaluator with crontab
Run the evaluator every 5 minutes automatically — this is your personal Alertmanager running on Kali.
terminal
# Schedule alert evaluation every 5 minutes
(crontab -l 2>/dev/null; echo "# Alert evaluator — every 5 minutes") | crontab -
(crontab -l 2>/dev/null; echo "*/5 * * * * $HOME/alert-evaluator.sh >> $HOME/alert-log.txt 2>&1") | crontab -
echo "Scheduled. Active cron jobs:"
crontab -l
→ Alert evaluator runs every 5 minutes. Output saved to ~/alert-log.txt. Your proactive monitoring is now live. ✅
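Once the cron job has run a few times, ~/alert-log.txt becomes your alert history. A few quick checks you can run against it (the file only exists after the first scheduled run):
terminal — review the alert history
tail -n 40 ~/alert-log.txt                               # the most recent evaluation runs
grep -c 'CRITICAL' ~/alert-log.txt                       # how many CRITICAL lines so far
grep 'Total alerts fired' ~/alert-log.txt | tail -n 5    # fired-alert count over the last 5 runs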
07 Real L2 Scenarios
01
It is 3 AM. CPU spikes to 94% and stays there. Alertmanager fires the HighCPU CRITICAL rule after 5 minutes. PagerDuty calls the on-call engineer. Slack #incidents gets a message. WhatsApp pings the L2 lead. The engineer wakes up, investigates, kills the runaway process. Total response time: 8 minutes. Nobody had to be watching a screen.
02
Disk fills to 83% at 11 PM. DiskWarning fires — severity WARNING — goes to Slack only, no phone call. The L2 engineer sees it at 9 AM, schedules cleanup that morning. If it had reached 91% instead, DiskCritical would have fired and paged someone. The two-level alert prevented a 3 AM wake-up for a non-emergency.
03
Planned maintenance at 2 AM — disk cleanup on the payment server. Before starting, you create a silence for 2 hours. During cleanup, disk temporarily hits 95% as old logs are moved. No alerts fire because of the silence. You complete the maintenance, remove the silence at 4 AM, and normal alerting resumes. Zero false alarms.
04
DB goes down. Alertmanager receives 18 alerts simultaneously — from the payment service, the MQ consumer, the API gateway, and the batch job, all screaming that DB is unreachable. Grouping bundles them into 1 message: "18 alerts firing — DBUnreachable is the common label. Likely root cause: DB outage." One message, clear context, engineer knows exactly what to investigate.
✅ Week 6.5 · Day 5 Outcomes
- Explain what Alertmanager is and why it sits between monitoring and notifications
- Describe the structure of an alert rule — condition, duration, severity, message, and receiver
- Know the 4 severity levels — CRITICAL, WARNING, INFO, RESOLVED — and what action each requires
- Explain the 4 key Alertmanager features — Routing, Grouping, Silencing, and Inhibition
- Trace the full alert lifecycle — metric fires → Alertmanager receives → routes → silence check → notification sent → resolved
- Complete the lab — create 5 alert rules in a config file, build alert-evaluator.sh that checks real metrics and maps them to rules, trigger a CRITICAL alert, create a silence for a maintenance window, and schedule automatic evaluation with crontab
- Understand why the duration field matters — prevents noise from short-lived spikes
- Understand why grouping matters — prevents alert storms from flooding your team during an outage