L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 8 — Final Project

Monitoring Project

This is the capstone of the monitoring track. You build a complete health checklist from scratch, run a full monitoring drill end to end, and handle a simulated incident independently — using every skill from Days 1 through 7.

Topics: Health Checklist · Full Drill · Independent Handling · Capstone
01 The Simple Idea First
Real-life Analogy

Think of a pilot's final check before solo flight. They have trained on every individual instrument — altimeter, fuel gauge, flaps, throttle. The solo flight is not about learning anything new. It is about proving you can use all of them together, in the right order, without anyone telling you what to do next.

Today is your solo flight. You have learned system metrics, logs, alerts, app monitoring, and correlation. The monitoring project puts all of that into one drill — from opening your dashboard in the morning, through detecting an incident, to writing the final RCA. Independent, start to finish.

What you are building today

By the end of today you will have produced three things that you keep and use permanently:

1. A Personal Health Checklist — a structured list of every check you run every morning when you arrive, in the correct order, with thresholds you have memorised.

2. A Master Monitoring Script — one script that runs every check automatically and produces a full morning report. You schedule it so it runs before you arrive.

3. A Full Incident RCA — a complete, correlated, evidence-based root cause analysis from a simulated incident — written as a professional Jira post-incident report.

02 Part 1 — The L2 Health Checklist

What is a health checklist and why does it matter?

A health checklist is a fixed, ordered set of checks you run every single morning before support begins. It removes guesswork and ensures nothing is missed. The order matters — you check the most impactful things first. The thresholds are memorised — you do not consult documentation during an S1.

Pilots use checklists. Surgeons use checklists. Every safety-critical profession uses them. L2 engineers should too. Once your checklist is a habit, you will routinely catch problems before any client reports them.

01
System Resource Checks
First 2 minutes of every shift — before checking emails or tickets

System Resources — Run in This Order

Phase 1 of 3
1
Disk Space — Check First

Run df -h on all servers. Below 80% = OK. 80–90% = Warning. Above 90% = Critical — act before anything else. Disk full stops everything. It is always the first check.

2
Memory & Swap

Run free -h. RAM free above 30% = OK. Swap above 100MB = Warning. High swap means RAM is exhausted — everything slows down.

3
CPU & Load Average

Run ps aux --sort=-%cpu | head -5 and cat /proc/loadavg. CPU below 70% = OK. Above 90% = find the process immediately.

4
Network Connectivity

Run ping -c 3 8.8.8.8. Under 20ms and 0% loss = OK. Any packet loss = investigate. Failures here explain external timeouts.

02
Application Health Checks
Minutes 3–5 — check the software layer after confirming server health

Application Layer — API & Log Checks

Phase 2 of 3
5
Log Error Count

Run grep -c "ERROR" ~/payment-service.log. 0 errors = OK. 1–5 = Warning, investigate. Above 5 = escalate. Compare to yesterday's baseline.

6
Error Type Breakdown

If errors found: run grep "ERROR" ~/payment-service.log | awk '{print $3}' | sort | uniq -c. In this log format, field 1 is the timestamp, field 2 the level, and field 3 the error type. Identify the most common error type. DB timeout = DB issue. Socket timeout = external issue.

7
PENDING Transaction Check

Search for transactions stuck PENDING for more than 30 minutes. Any found = investigate the callback chain. Money may have moved but the system does not know.
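The 30-minute rule can be scripted. A minimal sketch, assuming a hypothetical log where each line starts with a [HH:MM:SS] timestamp and stuck transactions contain the word PENDING; the demo pins "now" to a fixed time so the output is deterministic (live, you would use $(date +%H:%M:%S)):

```shell
# Demo fixture (path and line format are assumptions for this sketch)
LOG=/tmp/pending-demo.log
cat > "$LOG" << 'EOF'
[08:00:00] [INFO ] TXN-800 PENDING awaiting callback
[08:40:00] [INFO ] TXN-810 PENDING awaiting callback
EOF

NOW="08:45:00"   # pinned "current time" for the demo

# Convert HH:MM:SS to seconds since midnight
to_secs() { echo "$1" | awk -F: '{print $1*3600 + $2*60 + $3}'; }

NOW_S=$(to_secs "$NOW")
grep "PENDING" "$LOG" | while read -r line; do
  ts=$(echo "$line" | sed 's/^\[\([0-9:]*\)\].*/\1/')
  age=$(( NOW_S - $(to_secs "$ts") ))
  if [ "$age" -gt 1800 ]; then echo "STUCK >30min: $line"; fi
done
# → STUCK >30min: [08:00:00] [INFO ] TXN-800 PENDING awaiting callback
```

TXN-810 is only 5 minutes old, so it is correctly left alone; the 45-minute-old TXN-800 is flagged for callback-chain investigation.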

8
Queue Depth

Check MQ queue depth. At or near 0 = OK. Growing steadily = consumer may be down. A growing queue with no processing is always an incident in the making.
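The "growing steadily" test can be sketched against a depth history. Everything here is an assumption for the demo: a hypothetical file of "HH:MM:SS depth" samples written by whatever collector you have, and a 100-message threshold (adapt both to your MQ tooling):

```shell
# Hypothetical depth samples, one line per minute
STATS=/tmp/queue-depth-demo.txt
cat > "$STATS" << 'EOF'
08:40:00 12
08:41:00 240
08:42:00 515
EOF

FIRST=$(head -1 "$STATS" | awk '{print $2}')
LAST=$(tail -1 "$STATS" | awk '{print $2}')
if [ "$LAST" -gt "$FIRST" ] && [ "$LAST" -gt 100 ]; then
  echo "[WARNING ] Queue depth growing: $FIRST -> $LAST (check the consumer)"
else
  echo "[ OK ] Queue depth $LAST"
fi
# → [WARNING ] Queue depth growing: 12 -> 515 (check the consumer)
```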

03
Alerts & Incident Review
Minutes 5–10 — check what happened overnight and what needs follow-up

Alerts & Overnight Activity

Phase 3 of 3
9
Check Alert Log

Run cat ~/alert-log.txt. Did any CRITICAL alerts fire overnight? If yes — investigate and update the open Jira ticket before clients call.

10
Dead Letter Queue

Check if any messages are in the dead letter queue. Any messages there = something failed all retries overnight. Requires manual intervention — reprocess or escalate.
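What "check the dead letter queue" looks like depends entirely on your broker. As a placeholder sketch, assume failed messages land as files in a hypothetical directory (adapt to your MQ's own DLQ tooling or a DB table in a real setup):

```shell
# Hypothetical DLQ-as-directory demo (path and file format are assumptions)
DLQ=/tmp/dlq-demo
mkdir -p "$DLQ" && rm -f "$DLQ"/*
echo '{"txn":"TXN-901","reason":"MAX_RETRY_EXCEEDED"}' > "$DLQ/TXN-901.json"

COUNT=$(ls "$DLQ" | wc -l)
if [ "$COUNT" -gt 0 ]; then
  echo "[WARNING ] $COUNT message(s) in dead letter queue: reprocess or escalate"
  ls "$DLQ"
else
  echo "[ OK ] Dead letter queue empty"
fi
```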

11
Open Jira Tickets

Review all P1/P2 tickets opened in the last 24 hours. Are any still open with no update? Every open P1 needs an update every 30 minutes during business hours.
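In Jira this review can be saved as a filter so it is one click each morning. A sketch in JQL, where the project key PAY is a made-up example and the priority names assume your instance maps P1/P2 onto the priority field:

```
project = PAY AND priority in (P1, P2)
  AND created >= -24h AND statusCategory != Done
ORDER BY priority ASC, updated ASC
```

Sorting by updated ascending puts the most-neglected ticket at the top of the list.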

12
Morning Handover Report

Write a 3-line handover in the team Slack: system status, any overnight alerts, and one action for the morning. 30 seconds to write, saves 10 minutes of "what happened last night" questions.
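A sketch of the three lines, with invented example values; the actual Slack posting step (an incoming webhook or bot) is deliberately left out:

```shell
# Example handover; all figures below are invented for illustration
HANDOVER=/tmp/handover-demo.txt
cat > "$HANDOVER" << 'EOF'
STATUS : All healthy (disk 64%, 0 log errors, queues empty)
ALERTS : 1 WARNING overnight (swap 140MB at 03:12, cleared by 06:10)
ACTION : Reprocess TXN-901 from dead letter queue before 10:00
EOF
cat "$HANDOVER"
```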

03 Part 2 — The Master Monitoring Script

🔬 Lab: Build the Master Monitoring Script

Full Drill · Kali Linux
01
Build the complete master script — every automatable check in one run
This is the script that runs every morning at 07:45 — 15 minutes before your shift starts. By the time you sit down, the report is waiting for you.
l2-morning-drill.sh — full master script
#!/bin/bash
# ============================================
# l2-morning-drill.sh
# L2 Daily Health Check — every automatable checklist item
# Runs every morning at 07:45 via crontab
# chmod +x l2-morning-drill.sh && ./l2-morning-drill.sh
# ============================================

REPORT="$HOME/morning-report-$(date +%Y-%m-%d).txt"
LOG="$HOME/payment-service.log"
ISSUES=0

plog() { echo "$1" | tee -a "$REPORT"; }

plog "============================================"
plog " L2 Morning Health Check — $(date)"
plog "============================================"
plog ""
plog "=== PHASE 1: SYSTEM RESOURCES ==="

# Check 1 — Disk
DISK=$(df / | awk 'NR==2{print $5}' | tr -d '%')
[ "$DISK" -ge 90 ] && { plog "[CRITICAL] Disk at $DISK%"; ISSUES+=1; } \
|| { [ "$DISK" -ge 80 ] && { plog "[WARNING ] Disk at $DISK%"; ISSUES+=1; } \
|| plog "[ OK ] Disk at $DISK%"; }

# Check 2 — Memory
SWAP=$(free -m | awk 'NR==3{print $3}')
MEMF=$(free -h | awk 'NR==2{print $7}')
[ "$SWAP" -gt 100 ] && { plog "[WARNING ] Swap ${SWAP}MB — RAM may be low. Free: $MEMF"; ISSUES+=1; } \
|| plog "[ OK ] RAM free: $MEMF | Swap: ${SWAP}MB"

# Check 3 — CPU Load (informational: eyeball the top process against the 90% rule)
LOAD=$(awk '{print $1}' /proc/loadavg)
TOP_PROC=$(ps aux --sort=-%cpu | awk 'NR==2{print $11, $3"%"}')
plog "[ OK ] Load avg: $LOAD | Top: $TOP_PROC"

# Check 4 — Network
ping -c 2 -W 2 8.8.8.8 >/dev/null 2>&1 \
&& plog "[ OK ] Network: external connectivity confirmed" \
|| { plog "[WARNING ] Network: cannot reach 8.8.8.8"; ISSUES+=1; }

plog ""
plog "=== PHASE 2: APPLICATION HEALTH ==="

# Check 5 — Log errors
ERRS=$(grep -c "ERROR" "$LOG" 2>/dev/null || echo "0")
[ "$ERRS" -gt 5 ] && { plog "[CRITICAL] $ERRS errors in log"; ISSUES+=1; } \
|| { [ "$ERRS" -gt 0 ] && { plog "[WARNING ] $ERRS errors in log"; ISSUES+=1; } \
|| plog "[ OK ] Log errors: 0"; }

# Check 6 — Error types
[ "$ERRS" -gt 0 ] && {
  plog " Error breakdown:"
  grep "ERROR" "$LOG" 2>/dev/null | awk '{print $4}' | sort | uniq -c | \
  while read line; do plog " $line"; done
}

# Check 7 — PENDING transactions
PEND=$(grep -c "PENDING" "$LOG" 2>/dev/null)
PEND=${PEND:-0}
[ "$PEND" -gt 0 ] && { plog "[WARNING ] $PEND PENDING entries in log"; ISSUES=$((ISSUES+1)); } \
|| plog "[ OK ] No stuck PENDING entries"

plog ""
plog "=== PHASE 3: ALERTS & OVERNIGHT ==="

# Check 8 — Overnight alerts
CRIT_ALERTS=$(grep -c "CRITICAL" ~/alert-log.txt 2>/dev/null)
CRIT_ALERTS=${CRIT_ALERTS:-0}
[ "$CRIT_ALERTS" -gt 0 ] && { plog "[CRITICAL] $CRIT_ALERTS critical alerts fired overnight!"; ISSUES=$((ISSUES+1)); } \
|| plog "[ OK ] No critical alerts overnight"

# Summary
plog ""
plog "============================================"
plog " MORNING SUMMARY: $ISSUES issue(s) found"
[ "$ISSUES" -eq 0 ] && plog " Status: ALL SYSTEMS HEALTHY" \
|| plog " Status: ACTION REQUIRED — review above"
plog " Report: $REPORT"
plog "============================================"
terminal — save, run, and schedule
chmod +x ~/l2-morning-drill.sh
./l2-morning-drill.sh

# Schedule — runs at 07:45 every weekday
(crontab -l 2>/dev/null; echo "45 7 * * 1-5 $HOME/l2-morning-drill.sh") | crontab -
crontab -l
→ Master script runs every automatable check, prints CRITICAL/WARNING/OK for each, counts total issues, and saves a dated report. Scheduled to run at 07:45 every weekday. Checklist items that need human judgment (Jira review, handover) stay manual.
02
Run the full monitoring drill — simulate an incident
Create a simulated incident log and metrics file, then work through the full drill independently using the health checklist and correlation skills.
terminal — create the drill scenario
# Create the drill incident log
cat > ~/payment-service.log << 'EOF'
[08:00:00] [INFO ] Service started. TXN rate normal at 140/sec.
[08:45:10] [WARN ] DB connection pool at 79%
[08:45:25] [WARN ] DB connection pool at 91%
[08:45:30] [WARN ] DB connection pool at 97%
[08:45:35] [ERROR] DB_CONNECTION_TIMEOUT TXN-901 FAILED
[08:45:35] [ERROR] DB_CONNECTION_TIMEOUT TXN-902 FAILED
[08:45:36] [ERROR] DB_CONNECTION_TIMEOUT TXN-903 FAILED
[08:45:37] [ERROR] DB_CONNECTION_REFUSED port 5432 not responding
[08:45:37] [ERROR] TXN-904 FAILED DB process appears down
[08:46:00] [ERROR] MAX_RETRY_EXCEEDED TXN-901 moved to dead letter
[09:00:00] [INFO ] DB connection restored. Pool at 10%.
[09:00:02] [INFO ] TXN-905 processed successfully. Recovery confirmed.
EOF

# Run the master drill script
./l2-morning-drill.sh
→ Script detects the 6 ERROR lines in the drill log and reports CRITICAL for log errors, with the error-type breakdown underneath. This is your trigger to investigate.
03
Independently investigate — apply correlation
Using the skills from Days 1–7, trace the incident without guidance. Find: first warning, first error, affected transactions, root cause, recovery time.
terminal — independent investigation
# 1. How many errors and what type?
grep "ERROR" ~/payment-service.log | awk '{print $4}' | sort | uniq -c

# 2. What warned us before the crash?
grep -B 3 -m 1 "ERROR" ~/payment-service.log

# 3. Which transactions were affected?
grep "FAILED\|dead letter" ~/payment-service.log | grep -o "TXN-[0-9]*"

# 4. When did it start and when did it recover?
grep -m 1 "ERROR" ~/payment-service.log | awk '{print "First error:", $1, $2}'
grep "restored" ~/payment-service.log | awk '{print "Recovery :", $1, $2}'
→ DB_CONNECTION_TIMEOUT x3 + DB_CONNECTION_REFUSED x1. Build-up: pool 79→91→97% in 20 seconds. First error 08:45:35. Recovery 09:00:00. TXN-901 in dead letter.
04
Write the complete post-incident RCA
Write the full Jira post-incident report — timeline, root cause, impact, actions, next steps. This is the final deliverable of the monitoring project.
Final RCA — Post-Incident Report
POST-INCIDENT REPORT
Engineer : Urwah Shafiq (L2)
Date : $(date)
Severity : P1 — Payment service down
======================================

TIMELINE:
08:00:00 — Service running normally, 140 TXN/sec
08:45:10 — WARN: DB pool 79% (first warning sign)
08:45:30 — WARN: DB pool 97% (critical threshold)
08:45:35 — ERROR: DB_CONNECTION_TIMEOUT — pool exhausted
08:45:37 — ERROR: DB_CONNECTION_REFUSED — DB process down
08:46:00 — TXN-901 moved to dead letter queue
09:00:00 — DB restored. Recovery confirmed in logs.

ROOT CAUSE:
DB connection pool exhausted at 08:45:35.
Pool filled from 79% to 97% in 20 seconds —
suspected connection leak in application code.
DB process subsequently crashed at 08:45:37.

IMPACT:
Transactions failed: TXN-901, 902, 903, 904
Dead letter: TXN-901 (needs reprocessing)
Duration: 08:45:35 — 09:00:00 (14 min 25 sec)

ACTIONS TAKEN:
DBA restarted DB service at 08:55
DB recovered at 09:00
TXN-901 reprocessed from dead letter

NEXT STEPS:
L3 to investigate connection leak in payment service
DBA to increase connection pool limit as short-term fix
Add pool % alert at 80% threshold for earlier warning
→ Complete professional post-incident report. Every event has a timestamp. Root cause is evidence-based. Actions and next steps are clear. ✅
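The last NEXT STEPS item can be prototyped immediately. A sketch, assuming a hypothetical metric file holding the current pool percentage; wire it into alert-evaluator.sh however that script actually reads its metrics:

```shell
# Demo metric file; in reality a collector would write the live pool %
echo "85" > /tmp/db-pool-demo.txt

POOL=$(cat /tmp/db-pool-demo.txt)
if [ "$POOL" -ge 80 ]; then
  echo "[ALERT] DB connection pool at ${POOL}% (crossed the 80% early-warning threshold)"
else
  echo "[ OK ] DB connection pool at ${POOL}%"
fi
# → [ALERT] DB connection pool at 85% (crossed the 80% early-warning threshold)
```

With this in place, the 79→97% build-up from the incident would have alerted at 08:45:10 instead of failing transactions at 08:45:35.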
04 What You Now Own Permanently
🏆 Your Monitoring Toolkit — Built Across Week 6.5
Deliverable | What it does | When you use it
l2-morning-drill.sh | Morning script automating the checklist's resource, log, and alert checks | Every morning at 07:45 — auto-runs, report waiting when you arrive
alert-evaluator.sh | Evaluates real metrics against alert rules every 5 minutes | Continuous — runs via crontab, alerts you when thresholds cross
app-health-check.sh | Reads the API log; reports latency, error rate, and throughput per endpoint | When a client reports API issues — run for an instant health picture
resource-check.sh | Checks all 4 resources with OK/WARNING/CRITICAL verdicts | During any incident — first thing you run to rule out a resource cause
Health Checklist (12 items) | Structured morning check across 3 phases — system, app, alerts | Every shift start — 10 minutes, nothing missed
Correlation Method (5 steps) | Define window → metrics → logs → align → write RCA | Every incident investigation — consistent, defensible root cause
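Step 1 of the correlation method (define the window) comes down to cutting every data source to the same time range. A sketch on a tiny demo log: it works because the [HH:MM:SS] prefix sorts lexicographically within a single day:

```shell
# Demo log with one line before, inside, and at the edge of the window
LOG=/tmp/corr-demo.log
cat > "$LOG" << 'EOF'
[08:00:00] [INFO ] service normal
[08:45:35] [ERROR] DB_CONNECTION_TIMEOUT
[09:00:00] [INFO ] DB connection restored
EOF

# Keep only lines whose timestamp falls inside the incident window
awk -v s="[08:40:00]" -v e="[09:00:00]" '$1 >= s && $1 <= e' "$LOG"
# → prints the 08:45:35 ERROR line and the 09:00:00 recovery line
```

Run the same window cut over the metrics file, then place the two outputs side by side for the align step.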
05 Real L2 Scenarios
01

You arrive at work. The morning report is already in ~/morning-report-today.txt. It shows: 1 issue — disk at 83%. Before your first coffee, you know exactly what to do. You archive old logs, disk drops to 66%, and you document it in Jira. Day started proactively, not reactively.

02

Alert fires at 09:00. You run the health checklist in sequence — disk OK, memory OK, CPU at 91%. You check ps aux — Java payment service at 89% CPU. You check logs — DB_CONNECTION_TIMEOUT errors 30 seconds before CPU spike. You diagnose DB failure, not CPU overload, in under 2 minutes. Correct diagnosis, correct escalation.

03

Your lead is on leave. A P1 fires at 14:00. You handle it independently — health checklist, correlation, RCA, bridge communication, Jira documentation, client notification. When your lead returns they read the post-incident report and say "this is exactly what I would have done." That is what independent handling looks like.

04

Bi-weekly review meeting. Manager asks about the 09:00 incident last Tuesday. You open the morning report from that day and the post-incident RCA you wrote. Full timeline, root cause, impact, actions — all documented with timestamps. The meeting takes 5 minutes instead of 30. The records speak for themselves.

✅ Week 6.5 · Day 8 — Monitoring Project Outcomes