L2 Support Engineer · Fintech · Week 3
Day 5
Today's Topic
Health Check Script
One script that checks everything — disk, CPU, memory — and tells you if your server is healthy or needs attention. Every L2 engineer needs this in their toolkit.
Disk Check
CPU Check
Memory Check
Monitoring Automation
01 The Simple Idea First
Real-life Analogy
Think of a pilot's pre-flight checklist. Before every flight, the pilot checks fuel level, engine status, pressure, temperature — every critical metric in a fixed order. They don't skip anything and they don't do it from memory — they follow the same checklist every single time.
Your health check script is that checklist. Every morning before support starts, you run it and instantly know if the server is ready or if something needs attention.
02 The 4 Things You Always Check
How full is the hard drive? When disk hits 100% the server stops writing logs, transactions fail, everything breaks.
Command: df -h
What to check: The Use% column on the / (root) partition.
How much RAM is available? When RAM runs out the server uses disk as backup (swap) — much slower, causes lag and crashes.
Command: free -h
What to check: Available column and Swap used.
How hard is the processor working? High CPU means the server is struggling — transactions slow down and timeouts increase.
Command: top or mpstat
What to check: %idle — the lower the idle, the more stressed.
Are there any ERROR lines in the application logs? Even if the server looks healthy, the app could be throwing errors.
Command: grep -c "ERROR" app.log
What to check: Any count above 0 means something needs attention.
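A plain grep -c counts every ERROR in the file, including ones from days ago that were already handled. A small hedged variation (the sample file and log format here are illustrative, matching the format used in this lesson) counts only today's errors:

```shell
#!/bin/bash
# Count only today's ERROR lines so stale errors from previous days don't
# trigger a fresh alert. Assumes the [YYYY-MM-DD HH:MM:SS] [LEVEL] log
# format used in this lesson; /tmp/app-sample.log is a throwaway demo file.
LOG=/tmp/app-sample.log
TODAY=$(date +%F)
printf '[%s 09:00:00] [ERROR] demo failure\n' "$TODAY" > "$LOG"
printf '[2020-01-01 09:00:00] [ERROR] old failure\n' >> "$LOG"
TODAY_ERRORS=$(grep -c "^\[$TODAY.*\[ERROR\]" "$LOG")
echo "Today's errors: $TODAY_ERRORS"
```

On a real server, point LOG at the actual application log instead of the demo file.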
| Metric | OK | Warning | Critical — Act Now |
|---|---|---|---|
| Disk Space | Below 80% | 80% – 90% | Above 90% |
| RAM Available | Above 30% | 15% – 30% | Below 15% |
| CPU Idle | Above 30% | 10% – 30% | Below 10% |
| Swap Used | 0 – low | Growing | High / Maxed |
| Log Errors | 0 errors | 1 – 10 errors | More than 10 |
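Every row of the table follows the same pattern, so the classification itself can live in one small helper. A sketch for "higher is worse" metrics such as disk Use% (the classify function name is ours for illustration; it does not appear in the scripts below):

```shell
#!/bin/bash
# classify VALUE WARN CRIT: prints OK, WARNING, or CRITICAL for a
# "higher is worse" metric such as disk Use%. Helper name is illustrative.
classify() {
  local value=$1 warn=$2 crit=$3
  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL"
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING"
  else
    echo "OK"
  fi
}

classify 75 80 90   # prints OK
classify 85 80 90   # prints WARNING
classify 93 80 90   # prints CRITICAL
```

For "lower is worse" metrics like CPU idle or RAM available, flip the comparisons from -ge to -le.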
03 The Scripts — Every One Written Out
📌 Run order: Create the sample log first (Script 1), then run the health check (Script 2). The full monitoring script (Script 3) combines everything.
create-sample-log.sh — copy exactly as is
#!/bin/bash
# Creates a sample payment log for testing
# Run this FIRST before any other script
LOG_FILE="$HOME/payment-service.log"
echo "Creating sample log at $LOG_FILE ..."
cat > "$LOG_FILE" << 'EOF'
[2024-03-15 14:01:55] [INFO ] TXN-9820 received. Amount: 2000 Status: QUEUED
[2024-03-15 14:01:57] [INFO ] TXN-9820 processed successfully Status: SUCCESS
[2024-03-15 14:02:00] [INFO ] TXN-9821 received. Amount: 8500 Status: QUEUED
[2024-03-15 14:02:02] [WARN ] DB connection pool at 82% capacity
[2024-03-15 14:02:03] [WARN ] DB connection pool at 91% capacity
[2024-03-15 14:02:04] [WARN ] DB connection pool at 97% capacity CRITICAL
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT failed to acquire connection
[2024-03-15 14:02:05] [ERROR] TXN-9821 FAILED unable to write to database
[2024-03-15 14:02:06] [ERROR] DB_CONNECTION_TIMEOUT failed to acquire connection
[2024-03-15 14:02:06] [ERROR] TXN-9822 FAILED unable to write to database
[2024-03-15 14:02:07] [WARN ] Retry attempt 1/3 for TXN-9822
[2024-03-15 14:02:09] [WARN ] Retry attempt 2/3 for TXN-9822
[2024-03-15 14:02:12] [ERROR] Max retries exceeded TXN-9822 dead-letter queue
[2024-03-15 14:02:15] [INFO ] TXN-9824 received. Amount: 1200 Status: QUEUED
[2024-03-15 14:02:16] [INFO ] TXN-9824 processed successfully Status: SUCCESS
[2024-03-15 14:03:00] [ERROR] SOCKET_TIMEOUT no response from SBP after 10000ms
[2024-03-15 14:03:01] [ERROR] TXN-9825 FAILED external gateway unreachable
EOF
echo "Done. Log created."
echo "ERROR lines : $(grep -c 'ERROR' "$LOG_FILE")"
echo "WARN lines : $(grep -c 'WARN' "$LOG_FILE")"
echo "INFO lines : $(grep -c 'INFO' "$LOG_FILE")"
✅ How to run: chmod +x create-sample-log.sh && ./create-sample-log.sh
disk-check.sh
#!/bin/bash
# Checks disk usage and alerts if above limit
DISK_LIMIT=80
DISK_USED=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
echo "[DISK] Usage: $DISK_USED%"
if [ "$DISK_USED" -gt "$DISK_LIMIT" ]; then
echo "[DISK] WARNING: Disk at $DISK_USED% — above $DISK_LIMIT% limit!"
else
echo "[DISK] OK — below threshold"
fi
✅ How to run: chmod +x disk-check.sh && ./disk-check.sh
memory-check.sh
#!/bin/bash
# Checks RAM usage and swap status
echo "[MEMORY] Current RAM usage:"
free -h
# Get available memory in MB
MEM_AVAIL=$(free -m | awk 'NR==2 {print $7}')
MEM_TOTAL=$(free -m | awk 'NR==2 {print $2}')
SWAP_USED=$(free -m | awk 'NR==3 {print $3}')
echo ""
echo "[MEMORY] Available: ${MEM_AVAIL}MB of ${MEM_TOTAL}MB"
echo "[MEMORY] Swap used: ${SWAP_USED}MB"
if [ "$SWAP_USED" -gt 100 ]; then
echo "[MEMORY] WARNING: High swap usage — RAM is running low!"
else
echo "[MEMORY] OK — Swap usage is normal"
fi
✅ How to run: chmod +x memory-check.sh && ./memory-check.sh
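memory-check.sh alerts on swap only. A sketch that also applies the table's RAM thresholds (below 30% warning, below 15% critical); it assumes the 'available' column that modern GNU free prints as column 7 of the Mem: row:

```shell
#!/bin/bash
# Apply the RAM-available thresholds from the table: below 30% = warning,
# below 15% = critical. Assumes GNU free with the 'available' column.
MEM_AVAIL=$(free -m | awk 'NR==2 {print $7}')
MEM_TOTAL=$(free -m | awk 'NR==2 {print $2}')
MEM_PCT=$(( MEM_AVAIL * 100 / MEM_TOTAL ))
echo "[MEMORY] Available: ${MEM_PCT}% of total"
if [ "$MEM_PCT" -lt 15 ]; then
  echo "[MEMORY] CRITICAL: below 15% available"
elif [ "$MEM_PCT" -lt 30 ]; then
  echo "[MEMORY] WARNING: below 30% available"
else
  echo "[MEMORY] OK"
fi
```

If your free version lacks the available column, use the free column plus cached memory instead.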
cpu-check.sh
#!/bin/bash
# Checks CPU usage using top snapshot
# Get CPU idle % from top (run once, not interactive)
# Note: idle is usually field 8 of the Cpu(s) line, but the position
# varies between top versions; adjust the $8 below if yours differs
CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}' | tr -d '%id,')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc)
echo "[CPU] Usage : ${CPU_USED}%"
echo "[CPU] Idle : ${CPU_IDLE}%"
if [ "$(echo "$CPU_IDLE < 20" | bc)" -eq 1 ]; then
echo "[CPU] WARNING: CPU idle below 20% — server is under heavy load!"
else
echo "[CPU] OK — CPU load is acceptable"
fi
# Show top 3 CPU-eating processes
echo ""
echo "[CPU] Top 3 processes by CPU:"
ps aux --sort=-%cpu | awk 'NR==2,NR==4 {print $11, $3"%"}'
✅ How to run: chmod +x cpu-check.sh && ./cpu-check.sh
⚠️ Note: If bc is not installed run: sudo apt install bc -y
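The field position of id in top's output varies between versions, and bc is an extra dependency. A sketch of a more defensive parse that scans for the id field and does the subtraction in awk instead of bc (a captured top line is used as SAMPLE so the parsing is reproducible):

```shell
#!/bin/bash
# Find the idle value by scanning comma-separated fields for " id" instead
# of hard-coding a field number; SAMPLE is a captured top line for demo.
SAMPLE='%Cpu(s):  5.9 us,  2.0 sy,  0.0 ni, 91.2 id,  0.6 wa,  0.0 hi,  0.3 si,  0.0 st'
CPU_IDLE=$(echo "$SAMPLE" | awk -F',' '{
  for (i = 1; i <= NF; i++)
    if ($i ~ / id/) { gsub(/[^0-9.]/, "", $i); print $i }
}')
# Subtract in awk so bc is not needed
CPU_USED=$(awk -v idle="$CPU_IDLE" 'BEGIN { printf "%.1f", 100 - idle }')
echo "Idle: ${CPU_IDLE}%  Used: ${CPU_USED}%"
```

On a live server, replace SAMPLE with $(top -bn1 | grep "Cpu(s)").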
📌 This is the main script — combines disk + memory + CPU + log error check all in one. This is what gets scheduled with crontab.
server-health-check.sh — FULL SCRIPT
#!/bin/bash
# ============================================
# server-health-check.sh
# Full server health check — disk, RAM, CPU, logs
# Run: chmod +x server-health-check.sh
# Then: ./server-health-check.sh
# ============================================
# --- CONFIG — change these if needed ---
DISK_WARN=80 # warn above 80%
DISK_CRIT=90 # critical above 90%
SWAP_WARN=100 # warn if swap above 100MB
LOG_FILE="$HOME/payment-service.log"
REPORT="$HOME/health-report.txt"
# ---------------------------------------
# --- HEADER ---
echo "============================================"
echo " SERVER HEALTH CHECK"
echo " Date : $(date)"
echo "============================================"
# --- DISK CHECK ---
echo ""
echo "[1/4] DISK SPACE"
echo "--------------------------------------------"
DISK_USED=$(df / | awk 'NR==2 {print $5}' | tr -d '%')
df -h /
echo ""
if [ "$DISK_USED" -ge "$DISK_CRIT" ]; then
echo "STATUS: CRITICAL — Disk at $DISK_USED%! Clean up immediately!"
elif [ "$DISK_USED" -ge "$DISK_WARN" ]; then
echo "STATUS: WARNING — Disk at $DISK_USED%. Schedule cleanup soon."
else
echo "STATUS: OK — Disk at $DISK_USED%"
fi
# --- MEMORY CHECK ---
echo ""
echo "[2/4] MEMORY (RAM)"
echo "--------------------------------------------"
free -h
echo ""
SWAP_USED=$(free -m | awk 'NR==3 {print $3}')
if [ "$SWAP_USED" -gt "$SWAP_WARN" ]; then
echo "STATUS: WARNING — Swap used: ${SWAP_USED}MB. RAM may be low."
else
echo "STATUS: OK — Swap used: ${SWAP_USED}MB"
fi
# --- CPU CHECK ---
echo ""
echo "[3/4] CPU USAGE"
echo "--------------------------------------------"
echo "Top 5 processes by CPU:"
ps aux --sort=-%cpu | awk 'NR==1,NR==6 {printf "%-20s %s\n", $11, $3"%"}'
echo ""
# Load average (1min 5min 15min) — above 1.0 per core is stressed
LOAD=$(awk '{print $1, $2, $3}' /proc/loadavg)
echo "Load average (1m 5m 15m): $LOAD"
# --- LOG ERROR CHECK ---
echo ""
echo "[4/4] LOG ERRORS"
echo "--------------------------------------------"
if [ -f "$LOG_FILE" ]; then
# grep -c always prints a count (0 when nothing matches), so no fallback
# is needed; appending "|| echo 0" would produce a broken two-word value
ERR_COUNT=$(grep -c "ERROR" "$LOG_FILE")
WARN_COUNT=$(grep -c "WARN" "$LOG_FILE")
echo "Log file : $LOG_FILE"
echo "Errors : $ERR_COUNT"
echo "Warnings : $WARN_COUNT"
echo ""
if [ "$ERR_COUNT" -gt 10 ]; then
echo "STATUS: CRITICAL — $ERR_COUNT errors found! Investigate now."
echo "Top errors:"
grep "ERROR" "$LOG_FILE" | awk '{print $4}' | sort | uniq -c | sort -rn | head -3
elif [ "$ERR_COUNT" -gt 0 ]; then
echo "STATUS: WARNING — $ERR_COUNT errors found. Monitor closely."
else
echo "STATUS: OK — No errors in log"
fi
else
echo "Note: Log file not found at $LOG_FILE"
fi
# --- FOOTER ---
echo ""
echo "============================================"
echo " Check complete. Save a history with: ./server-health-check.sh >> $REPORT"
echo "============================================"
✅ How to run: chmod +x server-health-check.sh && ./server-health-check.sh
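The full script prints the load average but leaves the interpretation to you. A sketch of an automatic check against core count, following the rule of thumb in the script's comment (sustained load above 1.0 per core means the CPU is saturated); it is Linux-specific:

```shell
#!/bin/bash
# Compare the 1-minute load average to the number of CPU cores.
# Linux-specific: reads /proc/loadavg and uses nproc.
CORES=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
# awk handles the floating-point comparison, so bc is not required
STRESSED=$(awk -v l="$LOAD1" -v c="$CORES" 'BEGIN { print (l > c) ? 1 : 0 }')
if [ "$STRESSED" -eq 1 ]; then
  echo "[CPU] WARNING: load $LOAD1 exceeds $CORES core(s)"
else
  echo "[CPU] OK: load $LOAD1 across $CORES core(s)"
fi
```

This could be pasted into the CPU section of server-health-check.sh to give it a STATUS line like the other checks.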
schedule-health-check.sh
#!/bin/bash
# Gives all scripts permission and schedules them
# Run this LAST after all scripts are in place
# Give all scripts permission to run
chmod +x "$HOME/server-health-check.sh"
chmod +x "$HOME/disk-check.sh"
chmod +x "$HOME/memory-check.sh"
chmod +x "$HOME/cpu-check.sh"
echo "Permissions set."
echo "Scheduling cron jobs..."
# Add cron jobs (without removing existing ones).
# Note: running this script twice adds the same lines twice; clean up
# duplicates with: crontab -e
(
crontab -l 2>/dev/null
echo "# Daily health check at 8 AM"
echo "0 8 * * * $HOME/server-health-check.sh >> $HOME/health-report.txt 2>&1"
echo "# Disk check every 30 minutes"
echo "*/30 * * * * $HOME/disk-check.sh >> $HOME/disk-report.txt 2>&1"
) | crontab -
echo "Done. Your active cron jobs:"
echo ""
crontab -l
✅ How to run: chmod +x schedule-health-check.sh && ./schedule-health-check.sh
04 Hands-on Lab — Run Everything Step by Step
Copy all scripts to your home folder on Kali
Place all 6 scripts in /home/kali/ — then give them all permission at once.
terminal
chmod +x ~/*.sh
→ All 6 scripts are now executable ✅
Create the sample log file first
Other scripts need this log to exist.
terminal
./create-sample-log.sh
→ payment-service.log created with 17 lines ✅
Run each individual check to understand each metric
Test each one separately before running the full script.
terminal
./disk-check.sh
./memory-check.sh
./cpu-check.sh
→ Each check runs and shows OK or WARNING status ✅
Run the FULL health check script
This is the main script — runs all 4 checks together.
terminal
./server-health-check.sh
→ Full report: disk, RAM, CPU load, error count — all in one output ✅
Save output to a report file
Run it and save the result so you can review it later.
terminal
./server-health-check.sh >> ~/health-report.txt
cat ~/health-report.txt
→ health-report.txt created — full report saved ✅
Schedule it to run automatically every day
Run the scheduling script to set up crontab.
terminal
./schedule-health-check.sh
crontab -l
→ Cron jobs confirmed — health check runs every day at 8 AM automatically ✅
05 Real L2 Scenarios
01
You arrive at work and run ./server-health-check.sh as your first action. It shows disk at 93% — CRITICAL. Before any client calls you, you already know there's an issue and start the cleanup. Proactive support, not reactive.
02
An S1 fires and transactions are slow. You run the health check and it shows swap at 800MB — RAM is nearly full. You check ps aux --sort=-%mem | head -5 to find which process is eating memory, then report on the bridge immediately.
03
Manager asks: "What was the server status at 8 AM today?" — Because your health check is scheduled and saving to health-report.txt, you open the file and read the exact status from this morning. Full history, no manual checking needed.
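Pulling this morning's status out of a long history file can itself be scripted. A sketch using a small sample report (the path and contents below are illustrative; the Date and STATUS lines mirror what server-health-check.sh prints):

```shell
#!/bin/bash
# Build a small sample report, then extract the most recent run's
# date and status lines. The /tmp path and contents are illustrative.
REPORT=/tmp/health-report-sample.txt
cat > "$REPORT" <<'EOF'
 Date : Fri Mar 15 08:00:01 2024
STATUS: OK - Disk at 41%
STATUS: OK - Swap used: 0MB
STATUS: OK - No errors in log
EOF
# Each run writes one Date line and three STATUS lines, so tail -4
# shows the latest run's summary
grep -E "Date :|STATUS:" "$REPORT" | tail -4
```

Point REPORT at ~/health-report.txt to read the real history the cron job accumulates.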
04
The health check runs at 8 AM and finds 6 ERROR lines in the log. You investigate before clients call. You find a DB connection error from overnight, trace it through the DB tables, and have a root cause ready before the working day even starts.
✅ Week 3 · Day 5 Outcomes
- Understand the 4 key server health metrics — disk, RAM, CPU, and log errors
- Know the alert thresholds for each metric — OK, Warning, and Critical levels
- Write and run the individual check scripts — disk-check, memory-check, cpu-check
- Build and run the full server-health-check.sh combining all 4 checks with smart alerts
- Save health check output to a report file for historical reference
- Schedule the health check to run automatically every day at 8 AM using crontab
- Use the health check output during an S1 to report server status on the bridge instantly