L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 8 — Final Project
Monitoring Project
This is the capstone of the monitoring track. You build a complete health checklist from scratch, run a full monitoring drill end to end, and handle a simulated incident independently — using every skill from Days 1 through 7.
Health Checklist
Full Drill
Independent Handling
Capstone
01 The Simple Idea First
Real-life Analogy
Think of a pilot's final check before solo flight. They have trained on every individual instrument — altimeter, fuel gauge, flaps, throttle. The solo flight is not about learning anything new. It is about proving you can use all of them together, in the right order, without anyone telling you what to do next.
Today is your solo flight. You have learned system metrics, logs, alerts, app monitoring, and correlation. The monitoring project puts all of that into one drill — from opening your dashboard in the morning, through detecting an incident, to writing the final RCA. Independent, start to finish.
What you are building today
By the end of today you will have produced three things that you keep and use permanently:
1. A Personal Health Checklist — a structured list of every check you run every morning when you arrive, in the correct order, with thresholds you have memorised.
2. A Master Monitoring Script — one script that runs every check automatically and produces a full morning report. You schedule it so it runs before you arrive.
3. A Full Incident RCA — a complete, correlated, evidence-based root cause analysis from a simulated incident — written as a professional Jira post-incident report.
02 Part 1 — The L2 Health Checklist
What is a health checklist and why does it matter?
A health checklist is a fixed, ordered set of checks you run every single morning before support begins. It removes guesswork and ensures nothing is missed. The order matters — you check the most impactful things first. The thresholds are memorised — you do not consult documentation during an S1.
Pilots use checklists. Surgeons use checklists. Every safety-critical profession uses them. L2 engineers should too. Once your checklist is a habit, you will catch problems every week before any client reports them.
Phase 01: System Resource Checks
First 2 minutes of every shift — before checking emails or tickets
1. Disk Space — Check First
Run df -h on all servers. Below 80% = OK. 80–90% = Warning. Above 90% = Critical — act before anything else. Disk full stops everything. It is always the first check.
2. Memory & Swap
Run free -h. RAM free above 30% = OK. Swap above 100MB = Warning. High swap means RAM is exhausted — everything slows down.
3. CPU & Load Average
Run ps aux --sort=-%cpu | head -5 and cat /proc/loadavg. CPU below 70% = OK. Above 90% = find the process immediately.
4. Network Connectivity
Run ping -c 3 8.8.8.8. Under 20ms and 0% loss = OK. Any packet loss = investigate. Failures here explain external timeouts.
Phase 02: Application Health Checks
Minutes 3–5 — check the software layer after confirming server health
5. Log Error Count
Run grep -c "ERROR" ~/payment-service.log. 0 errors = OK. 1–5 = Warning, investigate. Above 5 = escalate. Compare to yesterday's baseline.
6. Error Type Breakdown
If errors found: run grep "ERROR" ~/payment-service.log | awk '{print $3}' | sort | uniq -c. In this log format, field 3 is the error type. Identify the most common one. DB timeout = DB issue. Socket timeout = external issue.
7. PENDING Transaction Check
Search for transactions stuck PENDING for more than 30 minutes. Any found = investigate the callback chain. Money may have moved but the system does not know. (A scripted version of this age check is sketched just after the checklist.)
8. Queue Depth
Check MQ queue depth. At or near 0 = OK. Growing steadily = consumer may be down. A growing queue with no processing is always an incident in the making.
Phase 03: Alerts & Incident Review
Minutes 5–10 — check what happened overnight and what needs follow-up
9. Check Alert Log
Run cat ~/alert-log.txt. Any CRITICAL alerts fired overnight? If yes — investigate and update the open Jira ticket before clients call.
10. Dead Letter Queue
Check if any messages are in the dead letter queue. Any messages there = something failed all retries overnight. Requires manual intervention — reprocess or escalate.
11. Open Jira Tickets
Review all P1/P2 tickets opened in the last 24 hours. Are any still open with no update? Every open P1 needs an update every 30 minutes during business hours.
12. Morning Handover Report
Write a 3-line handover in the team Slack: system status, any overnight alerts, and one action for the morning. 30 seconds to write, saves 10 minutes of "what happened last night" questions. (A webhook sketch for posting this automatically follows the checklist.)
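Two of the items above are easy to script even though they are not part of the master script in Part 2. The first sketch automates item 7: it flags PENDING entries older than 30 minutes by comparing each line's [HH:MM:SS] timestamp with the current time. This is a minimal sketch, assuming the same log format as the drill log later on this page and GNU date; the script name is an example, not a course deliverable.
check-stuck-pending.sh — sketch for checklist item 7
#!/bin/bash
# check-stuck-pending.sh — flag PENDING log entries older than 30 minutes (sketch).
# Assumes the [HH:MM:SS] timestamp format used elsewhere on this page and GNU date.
# Limitation: entries carried over from a previous day are not handled by this simple version.
LOG="${1:-$HOME/payment-service.log}"
MAX_AGE=$((30 * 60))        # 30 minutes in seconds
NOW=$(date +%s)
grep "PENDING" "$LOG" 2>/dev/null | while read -r line; do
    ts=$(echo "$line" | grep -o '^\[[0-9:]*\]' | tr -d '[]')    # e.g. 10:15:02
    [ -z "$ts" ] && continue
    entry=$(date -d "$ts" +%s 2>/dev/null) || continue          # today at that time
    age=$(( NOW - entry ))
    [ "$age" -gt "$MAX_AGE" ] && echo "[WARNING ] Stuck PENDING ($((age / 60)) min): $line"
done
Item 12 can also be pushed into Slack automatically once the report exists. A sketch, assuming your team uses a Slack incoming webhook (the URL below is a placeholder and the three message values are examples):
terminal — post the handover to Slack (sketch)
# Post the 3-line handover via a Slack incoming webhook (sketch; the URL is a placeholder).
WEBHOOK="https://hooks.slack.com/services/T000/B000/XXXXXXXX"
STATUS="All systems healthy"
OVERNIGHT="No CRITICAL alerts overnight"
ACTION="No action needed this morning"
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"Morning handover\\n1. Status: $STATUS\\n2. Overnight: $OVERNIGHT\\n3. Action: $ACTION\"}" \
  "$WEBHOOK"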
03 Part 2 — The Master Monitoring Script
🔬 Lab: Build the Master Monitoring Script
Full Drill · Kali Linux
Build the complete master script — the automatable core of the checklist in one run
This is the script that runs every morning at 07:45 — 15 minutes before your shift starts. By the time you sit down, the report is waiting for you.
l2-morning-drill.sh — full master script
#!/bin/bash
# ============================================
# l2-morning-drill.sh
# L2 Daily Health Check — automated checks from the 12-item checklist
# Runs every morning at 07:45 via crontab
# chmod +x l2-morning-drill.sh && ./l2-morning-drill.sh
# ============================================
REPORT="$HOME/morning-report-$(date +%Y-%m-%d).txt"
LOG="$HOME/payment-service.log"
declare -i ISSUES=0   # -i so ISSUES+=1 below does arithmetic, not string concatenation
plog() { echo "$1" | tee -a "$REPORT"; }
plog "============================================"
plog " L2 Morning Health Check — $(date)"
plog "============================================"
plog ""
plog "=== PHASE 1: SYSTEM RESOURCES ==="
# Check 1 — Disk
DISK=$(df / | awk 'NR==2{print $5}' | tr -d '%')
[ "$DISK" -ge 90 ] && { plog "[CRITICAL] Disk at $DISK%"; ISSUES+=1; } \
|| { [ "$DISK" -ge 80 ] && { plog "[WARNING ] Disk at $DISK%"; ISSUES+=1; } \
|| plog "[ OK ] Disk at $DISK%"; }
# Check 2 — Memory
SWAP=$(free -m | awk 'NR==3{print $3}')
MEMF=$(free -h | awk 'NR==2{print $7}')
[ "$SWAP" -gt 100 ] && { plog "[WARNING ] Swap ${SWAP}MB — RAM may be low. Free: $MEMF"; ISSUES+=1; } \
|| plog "[ OK ] RAM free: $MEMF | Swap: ${SWAP}MB"
# Check 3 — CPU Load
LOAD=$(cat /proc/loadavg | awk '{print $1}')
TOP_PROC=$(ps aux --sort=-%cpu | awk 'NR==2{print $11, $3"%"}')
plog "[ OK ] Load avg: $LOAD | Top: $TOP_PROC"
# Check 4 — Network
ping -c 2 -W 2 8.8.8.8 >/dev/null 2>&1 \
&& plog "[ OK ] Network: external connectivity confirmed" \
|| { plog "[WARNING ] Network: cannot reach 8.8.8.8"; ISSUES+=1; }
plog ""
plog "=== PHASE 2: APPLICATION HEALTH ==="
# Check 5 — Log errors
ERRS=$(grep -c "ERROR" "$LOG" 2>/dev/null); ERRS=${ERRS:-0}   # grep -c already prints 0; default only covers a missing file
[ "$ERRS" -gt 5 ] && { plog "[CRITICAL] $ERRS errors in log"; ISSUES+=1; } \
|| { [ "$ERRS" -gt 0 ] && { plog "[WARNING ] $ERRS errors in log"; ISSUES+=1; } \
|| plog "[ OK ] Log errors: 0"; }
# Check 6 — Error types
[ "$ERRS" -gt 0 ] && {
plog " Error breakdown:"
grep "ERROR" "$LOG" 2>/dev/null | awk '{print $4}' | sort | uniq -c | \
while read line; do plog " $line"; done
}
# Check 7 — PENDING transactions
PEND=$(grep -c "PENDING" "$LOG" 2>/dev/null); PEND=${PEND:-0}
[ "$PEND" -gt 0 ] && { plog "[WARNING ] $PEND PENDING entries in log"; ISSUES+=1; } \
|| plog "[ OK ] No stuck PENDING entries"
plog ""
plog "=== PHASE 3: ALERTS & OVERNIGHT ==="
# Check 8 — Overnight alerts
CRIT_ALERTS=$(grep -c "CRITICAL" ~/alert-log.txt 2>/dev/null); CRIT_ALERTS=${CRIT_ALERTS:-0}
[ "$CRIT_ALERTS" -gt 0 ] && { plog "[CRITICAL] $CRIT_ALERTS critical alerts fired overnight!"; ISSUES+=1; } \
|| plog "[ OK ] No critical alerts overnight"
# Summary
plog ""
plog "============================================"
plog " MORNING SUMMARY: $ISSUES issue(s) found"
[ "$ISSUES" -eq 0 ] && plog " Status: ALL SYSTEMS HEALTHY" \
|| plog " Status: ACTION REQUIRED — review above"
plog " Report: $REPORT"
plog "============================================"
terminal — save, run, and schedule
chmod +x ~/l2-morning-drill.sh
./l2-morning-drill.sh
# Schedule — runs at 07:45 every weekday
(crontab -l 2>/dev/null; echo "45 7 * * 1-5 $HOME/l2-morning-drill.sh") | crontab -
crontab -l
→ Master script runs the automated checks from the checklist, prints CRITICAL/WARNING/OK for each, counts total issues, and saves a dated report. Scheduled to run at 07:45 every weekday.
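One practical note on the cron entry above: cron runs with a minimal environment and its output is easy to lose, so if the report ever fails to appear there is little to debug with. A hedged variant of the same entry that also captures stdout and stderr is shown below; the cron-drill.log name is just an example, and you would use this instead of (not in addition to) the entry above.
terminal — optional: capture cron output
# Same schedule, but append the script's own output to a log so a failed run leaves a trace.
(crontab -l 2>/dev/null; \
 echo "45 7 * * 1-5 $HOME/l2-morning-drill.sh >> $HOME/cron-drill.log 2>&1") | crontab -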
Run the full monitoring drill — simulate an incident
Create a simulated incident log and metrics file, then work through the full drill independently using the health checklist and correlation skills.
terminal — create the drill scenario
# Create the drill incident log
cat > ~/payment-service.log << 'EOF'
[08:00:00] [INFO ] Service started. TXN rate normal at 140/sec.
[08:45:10] [WARN ] DB connection pool at 79%
[08:45:25] [WARN ] DB connection pool at 91%
[08:45:30] [WARN ] DB connection pool at 97%
[08:45:35] [ERROR] DB_CONNECTION_TIMEOUT TXN-901 FAILED
[08:45:35] [ERROR] DB_CONNECTION_TIMEOUT TXN-902 FAILED
[08:45:36] [ERROR] DB_CONNECTION_TIMEOUT TXN-903 FAILED
[08:45:37] [ERROR] DB_CONNECTION_REFUSED port 5432 not responding
[08:45:37] [ERROR] TXN-904 FAILED DB process appears down
[08:46:00] [ERROR] MAX_RETRY_EXCEEDED TXN-901 moved to dead letter
[09:00:00] [INFO ] DB connection restored. Pool at 10%.
[09:00:02] [INFO ] TXN-905 processed successfully. Recovery confirmed.
EOF
# Run the master drill script
./l2-morning-drill.sh
→ Script detects the errors in the log: 6 ERROR lines, above the escalation threshold of 5, so the report shows CRITICAL for log errors with the error-type breakdown underneath. This is your trigger to investigate.
Independently investigate — apply correlation
Using the skills from Days 1–7, trace the incident without guidance. Find: first warning, first error, affected transactions, root cause, recovery time.
terminal — independent investigation
# 1. How many errors and what type?
grep "ERROR" ~/payment-service.log | awk '{print $4}' | sort | uniq -c
# 2. What warned us before the crash?
grep -B 3 -m 1 "ERROR" ~/payment-service.log
# 3. Which transactions were affected?
grep "FAILED\|dead letter" ~/payment-service.log | grep -o "TXN-[0-9]*"
# 4. When did it start and when did it recover?
grep -m 1 "ERROR" ~/payment-service.log | awk '{print "First error:", $1, $2}'
grep "restored" ~/payment-service.log | awk '{print "Recovery :", $1, $2}'
→ DB_CONNECTION_TIMEOUT x3 and DB_CONNECTION_REFUSED x1 (plus one MAX_RETRY_EXCEEDED and one generic TXN-904 failure). Build-up: pool 79→91→97% in 20 seconds. First error 08:45:35. Recovery 09:00:00. TXN-901 in dead letter.
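The outage duration quoted in the RCA below ("14 min 25 sec") can be computed from the same two timestamps instead of by hand. A minimal sketch, assuming GNU date and the drill log above:
terminal — compute the outage duration (sketch)
# Outage window = first ERROR timestamp to the first "restored" timestamp.
LOG=~/payment-service.log
START=$(grep -m 1 "ERROR" "$LOG" | grep -o '^\[[0-9:]*\]' | tr -d '[]')    # 08:45:35
END=$(grep -m 1 "restored" "$LOG" | grep -o '^\[[0-9:]*\]' | tr -d '[]')   # 09:00:00
SECS=$(( $(date -d "$END" +%s) - $(date -d "$START" +%s) ))
echo "Outage: $START -> $END ($((SECS / 60)) min $((SECS % 60)) sec)"      # 14 min 25 sec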
Write the complete post-incident RCA
Write the full Jira post-incident report — timeline, root cause, impact, actions, next steps. This is the final deliverable of the monitoring project.
Final RCA — Post-Incident Report
POST-INCIDENT REPORT
Engineer : Urwah Shafiq (L2)
Date : $(date)
Severity : P1 — Payment service down
======================================
TIMELINE:
08:00:00 — Service running normally, 140 TXN/sec
08:45:10 — WARN: DB pool 79% (first warning sign)
08:45:30 — WARN: DB pool 97% (critical threshold)
08:45:35 — ERROR: DB_CONNECTION_TIMEOUT — pool exhausted
08:45:37 — ERROR: DB_CONNECTION_REFUSED — DB process down
08:46:00 — TXN-901 moved to dead letter queue
09:00:00 — DB restored. Recovery confirmed in logs.
ROOT CAUSE:
DB connection pool exhausted at 08:45:35.
Pool filled from 79% to 97% in 20 seconds —
suspected connection leak in application code.
DB process subsequently crashed at 08:45:37.
IMPACT:
Transactions failed: TXN-901, 902, 903, 904
Dead letter: TXN-901 (needs reprocessing)
Duration: 08:45:35 — 09:00:00 (14 min 25 sec)
ACTIONS TAKEN:
DBA restarted DB service at 08:55
DB recovered at 09:00
TXN-901 reprocessed from dead letter
NEXT STEPS:
L3 to investigate connection leak in payment service
DBA to increase connection pool limit as short-term fix
Add pool % alert at 80% threshold for earlier warning
→ Complete professional post-incident report. Every event has a timestamp. Root cause is evidence-based. Actions and next steps are clear. ✅
04 What You Now Own Permanently
🏆 Your Monitoring Toolkit — Built Across Week 6.5
| Deliverable | What it does | When you use it |
| --- | --- | --- |
| l2-morning-drill.sh | Automated morning script covering system resources, log health, and overnight alerts | Every morning at 07:45 — auto-runs, report waiting when you arrive |
| alert-evaluator.sh | Evaluates real metrics against alert rules every 5 minutes | Continuous — runs via crontab, alerts you when thresholds cross |
| app-health-check.sh | Reads API log, reports latency, error rate, throughput per endpoint | When a client reports API issues — run for instant health picture |
| resource-check.sh | All 4 resources checked with OK/WARNING/CRITICAL verdicts | During any incident — first thing you run to rule out resource cause |
| Health Checklist (12 items) | Structured morning check across 3 phases — system, app, alerts | Every shift start — 10 minutes, nothing missed |
| Correlation Method (5 steps) | Define window → metrics → logs → align → write RCA | Every incident investigation — consistent, defensible root cause |
05 Real L2 Scenarios
01
You arrive at work. Today's dated morning report is already waiting in your home directory. It shows: 1 issue — disk at 83%. Before your first coffee, you know exactly what to do. You archive old logs, disk drops to 66%, and you document it in Jira. Day started proactively, not reactively.
02
Alert fires at 09:00. You run the health checklist in sequence — disk OK, memory OK, CPU at 91%. You check ps aux — Java payment service at 89% CPU. You check logs — DB_CONNECTION_TIMEOUT errors 30 seconds before CPU spike. You diagnose DB failure, not CPU overload, in under 2 minutes. Correct diagnosis, correct escalation.
03
Your lead is on leave. A P1 fires at 14:00. You handle it independently — health checklist, correlation, RCA, bridge communication, Jira documentation, client notification. When your lead returns they read the post-incident report and say "this is exactly what I would have done." That is what independent handling looks like.
04
Bi-weekly review meeting. Manager asks about the 09:00 incident last Tuesday. You open the morning report from that day and the post-incident RCA you wrote. Full timeline, root cause, impact, actions — all documented with timestamps. The meeting takes 5 minutes instead of 30. The records speak for themselves.
✅ Week 6.5 · Day 8 — Monitoring Project Outcomes
- Build and memorise a 12-item health checklist covering system resources, application health, and overnight alerts — in the correct phase order
- Understand why the checklist order matters — disk first, then memory, then CPU, then application, then alerts
- Build the master l2-morning-drill.sh script that runs the automated checks, classifies each as OK/WARNING/CRITICAL, and saves a dated report
- Schedule the master script to run at 07:45 every weekday so the report is ready before your shift starts
- Complete the full monitoring drill — detect an incident from the report, investigate using correlation, identify root cause from log timestamps and metric sequence
- Write a complete professional post-incident report — timeline with timestamps, evidence-based root cause, impact count, actions taken, and next steps
- Demonstrate independent handling — running the full investigation and RCA without guidance, start to finish
- Own your complete monitoring toolkit: morning drill script, alert evaluator, app health checker, resource checker, health checklist, and correlation method