L2 Support Engineer · Fintech · Week 3
Week 3 Day 5
Today's Topic

Health Check Script

One script that checks disk, memory, CPU, and application log errors, then tells you whether your server is healthy or needs attention. Every L2 engineer needs this in their toolkit.

Disk Check · CPU Check · Memory Check · Monitoring · Automation
01 The Simple Idea First
Real-life Analogy

Think of a pilot's pre-flight checklist. Before every flight, the pilot checks fuel level, engine status, pressure, temperature — every critical metric in a fixed order. They don't skip anything and they don't do it from memory — they follow the same checklist every single time.

Your health check script is that checklist. Every morning before support starts, you run it and instantly know if the server is ready or if something needs attention.

02 The 4 Things You Always Check
💾
Metric 1
Disk Space

How full is the hard drive? When the disk hits 100%, the server can no longer write logs, transactions fail, and everything that depends on disk writes breaks.

Command: df -h

What to check: The Use% column on the / (root) partition.
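To read that column from a script rather than by eye, a minimal sketch could look like the following. The helper name disk_use_pct is hypothetical, and df -P (POSIX format, one line per filesystem) is used instead of df -h so the column positions stay predictable.

```shell
#!/bin/bash
# disk_use_pct MOUNTPOINT  (hypothetical helper name)
# Reads `df -P` output on stdin and prints the Use% number for MOUNTPOINT.
disk_use_pct() {
  awk -v mp="$1" 'NR > 1 && $6 == mp { gsub(/%/, "", $5); print $5 }'
}

# Live usage: Use% of the root partition as a bare number
df -P / | disk_use_pct /
```

The bare number can then be compared against a limit with an ordinary integer test.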

🧠
Metric 2
Memory (RAM)

How much RAM is available? When RAM runs out, the server falls back to disk (swap), which is far slower and causes lag; if swap fills up as well, processes may be killed.

Command: free -h

What to check: Available column and Swap used.
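The Available column is easier to judge as a percentage of total RAM. The sketch below is one way to compute it; the helper name mem_avail_pct is hypothetical, and it assumes the current procps layout of free, where the seventh field of the Mem: row is "available".

```shell
#!/bin/bash
# mem_avail_pct  (hypothetical helper name)
# Reads `free -m` output on stdin and prints available RAM as a whole percent.
# Assumes the procps-ng layout, where field 7 of the Mem: row is "available".
mem_avail_pct() {
  awk '/^Mem:/ && $2 > 0 { printf "%d\n", $7 * 100 / $2 }'
}

# Live usage
free -m | mem_avail_pct
```

On older versions of free that lack the available column, the seventh field means something else, so check the header row before relying on this.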

⚙️
Metric 3
CPU Usage

How hard is the processor working? High CPU means the server is struggling — transactions slow down and timeouts increase.

Command: top or mpstat (mpstat is part of the sysstat package and may need to be installed)

What to check: %idle. The lower the idle percentage, the more heavily loaded the CPU is.
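Because top and mpstat format their output differently across versions, a script can also derive idle time itself. The sketch below swaps in a different technique from the commands above: it samples the aggregate cpu line of /proc/stat twice and reports the share of elapsed ticks spent idle. The helper name cpu_idle_pct is hypothetical.

```shell
#!/bin/bash
# cpu_idle_pct "FIRST" "SECOND"  (hypothetical helper name)
# Each argument is the aggregate "cpu ..." line from /proc/stat.
# Fields after the label: user nice system idle iowait irq softirq steal ...
cpu_idle_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    { total = 0
      for (i = 2; i <= 9 && i <= NF; i++) total += $i   # user .. steal
      idle = $5 }
    NR == 1 { t1 = total; i1 = idle }
    NR == 2 { dt = total - t1
              if (dt > 0) printf "%.1f\n", (idle - i1) * 100 / dt }'
}

# Live usage: two samples one second apart
s1=$(grep '^cpu ' /proc/stat); sleep 1; s2=$(grep '^cpu ' /proc/stat)
echo "CPU idle: $(cpu_idle_pct "$s1" "$s2")%"
```

A one-second window is an arbitrary choice here; a longer gap smooths out short spikes.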

📋
Metric 4
Log Errors

Are there any ERROR lines in the application logs? Even if the server looks healthy, the app could be throwing errors.

Command: grep -c "ERROR" app.log

What to check: grep -c reports the number of matching lines. Any count above 0 means something needs attention.
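One detail worth seeing in code: grep -c prints 0 but exits with status 1 when nothing matches, and prints nothing at all if the file is missing. The sketch below wraps that behaviour in a hypothetical helper, count_level, and matches the bracketed tag (for example "[ERROR") so the word ERROR inside a message body is not counted. The bracketed tag format is taken from the sample payment log used in this lesson.

```shell
#!/bin/bash
# count_level LEVEL FILE  (hypothetical helper name)
# Prints how many lines in FILE carry the tag "[LEVEL", e.g. "[ERROR".
count_level() {
  n=$(grep -c "\[$1" "$2" 2>/dev/null) || true   # status 1 just means "no match"
  echo "${n:-0}"                                 # empty output (missing file) becomes 0
}

# Demo on a throwaway file
tmp=$(mktemp)
printf '%s\n' '[INFO ] started' '[ERROR] DB timeout' '[WARN ] saw ERROR upstream' > "$tmp"
echo "ERROR lines: $(count_level ERROR "$tmp")"
echo "WARN lines : $(count_level WARN "$tmp")"
rm -f "$tmp"
```

In the demo, the third line mentions ERROR in its text but is tagged WARN, so it is counted only as a warning.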

⚠️ Alert Thresholds — When to Act
Metric          OK          Warning         Critical (Act Now)
Disk Space      Below 80%   80% – 90%       Above 90%
RAM Available   Above 30%   15% – 30%       Below 15%
CPU Idle        Above 30%   10% – 30%       Below 10%
Swap Used       0 – low     Growing         High / Maxed
Log Errors      0 errors    1 – 10 errors   More than 10
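The numeric rows of the threshold table can be expressed as two small functions, one for metrics where higher is worse and one where lower is worse. This is only a sketch: the function names are hypothetical, inputs must be whole numbers, and boundary values (exactly 80, 90, 30, 15, 10) are placed in the Warning band, which is one reading of the table.

```shell
#!/bin/bash
# Hypothetical helpers that turn a whole-number reading into a status word.
# classify_high VALUE WARN CRIT : higher is worse (disk used %)
# classify_low  VALUE WARN CRIT : lower is worse  (RAM available %, CPU idle %)
classify_high() {
  if   [ "$1" -gt "$3" ]; then echo "CRITICAL"
  elif [ "$1" -ge "$2" ]; then echo "WARNING"
  else echo "OK"; fi
}
classify_low() {
  if   [ "$1" -lt "$3" ]; then echo "CRITICAL"
  elif [ "$1" -le "$2" ]; then echo "WARNING"
  else echo "OK"; fi
}

# Sample readings, not live measurements
echo "Disk 85% used     : $(classify_high 85 80 90)"
echo "RAM 12% available : $(classify_low 12 30 15)"
echo "CPU 45% idle      : $(classify_low 45 30 10)"
```

The three sample readings come out as WARNING, CRITICAL, and OK respectively.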
03 The Scripts — Every One Written Out
📌 Run order: Create the sample log first (Script 1), then run the individual checks (Scripts 2–4). The full health check (Script 5) combines everything, and Script 6 schedules it with crontab.

Script 1 — Create Sample Log File

Run This First
create-sample-log.sh — copy exactly as is
#!/bin/bash
# Creates a sample payment log for testing
# Run this FIRST before any other script

LOG_FILE="$HOME/payment-service.log"

echo "Creating sample log at $LOG_FILE ..."

cat > "$LOG_FILE" << 'EOF'
[2024-03-15 14:01:55] [INFO ] TXN-9820 received. Amount: 2000 Status: QUEUED
[2024-03-15 14:01:57] [INFO ] TXN-9820 processed successfully Status: SUCCESS
[2024-03-15 14:02:00] [INFO ] TXN-9821 received. Amount: 8500 Status: QUEUED
[2024-03-15 14:02:02] [WARN ] DB connection pool at 82% capacity
[2024-03-15 14:02:03] [WARN ] DB connection pool at 91% capacity
[2024-03-15 14:02:04] [WARN ] DB connection pool at 97% capacity CRITICAL
[2024-03-15 14:02:05] [ERROR] DB_CONNECTION_TIMEOUT failed to acquire connection
[2024-03-15 14:02:05] [ERROR] TXN-9821 FAILED unable to write to database
[2024-03-15 14:02:06] [ERROR] DB_CONNECTION_TIMEOUT failed to acquire connection
[2024-03-15 14:02:06] [ERROR] TXN-9822 FAILED unable to write to database
[2024-03-15 14:02:07] [WARN ] Retry attempt 1/3 for TXN-9822
[2024-03-15 14:02:09] [WARN ] Retry attempt 2/3 for TXN-9822
[2024-03-15 14:02:12] [ERROR] Max retries exceeded TXN-9822 dead-letter queue
[2024-03-15 14:02:15] [INFO ] TXN-9824 received. Amount: 1200 Status: QUEUED
[2024-03-15 14:02:16] [INFO ] TXN-9824 processed successfully Status: SUCCESS
[2024-03-15 14:03:00] [ERROR] SOCKET_TIMEOUT no response from SBP after 10000ms
[2024-03-15 14:03:01] [ERROR] TXN-9825 FAILED external gateway unreachable
EOF

echo "Done. Log created."
echo "ERROR lines : $(grep -c 'ERROR' "$LOG_FILE")"
echo "WARN lines : $(grep -c 'WARN' "$LOG_FILE")"
echo "INFO lines : $(grep -c 'INFO' "$LOG_FILE")"
How to run: chmod +x create-sample-log.sh && ./create-sample-log.sh

Script 2 — Disk Space Check

Individual Check
disk-check.sh
#!/bin/bash
# Checks disk usage and alerts if above limit

DISK_LIMIT=80
# -P forces POSIX output: one line per filesystem, so NR==2 is always the / row
DISK_USED=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')

echo "[DISK] Usage: $DISK_USED%"

if [ "$DISK_USED" -gt "$DISK_LIMIT" ]; then
  echo "[DISK] WARNING: Disk at $DISK_USED% — above $DISK_LIMIT% limit!"
else
  echo "[DISK] OK — below threshold"
fi
How to run: chmod +x disk-check.sh && ./disk-check.sh

Script 3 — Memory (RAM) Check

Individual Check
memory-check.sh
#!/bin/bash
# Checks RAM usage and swap status

echo "[MEMORY] Current RAM usage:"
free -h

# Read values in MB; match rows by their label rather than by line number
MEM_AVAIL=$(free -m | awk '/^Mem:/ {print $7}')
MEM_TOTAL=$(free -m | awk '/^Mem:/ {print $2}')
SWAP_USED=$(free -m | awk '/^Swap:/ {print $3}')

echo ""
echo "[MEMORY] Available: ${MEM_AVAIL}MB of ${MEM_TOTAL}MB"
echo "[MEMORY] Swap used: ${SWAP_USED}MB"

if [ "$SWAP_USED" -gt 100 ]; then
  echo "[MEMORY] WARNING: High swap usage — RAM is running low!"
else
  echo "[MEMORY] OK — Swap usage is normal"
fi
How to run: chmod +x memory-check.sh && ./memory-check.sh

Script 4 — CPU Usage Check

Individual Check
cpu-check.sh
#!/bin/bash
# Checks CPU usage using top snapshot

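# Get CPU idle % from one non-interactive top snapshot.
# Split on commas and pick the "id" field: a fixed column number breaks
# when idle is 100.0, because top then prints "ni,100.0 id" with no space.
CPU_IDLE=$(LC_ALL=C top -bn1 | awk -F',' '/Cpu\(s\)/ {for (i = 1; i <= NF; i++) if ($i ~ /id/) {gsub(/[^0-9.]/, "", $i); print $i; exit}}')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc)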

echo "[CPU] Usage : ${CPU_USED}%"
echo "[CPU] Idle : ${CPU_IDLE}%"

if [ "$(echo "$CPU_IDLE < 20" | bc)" -eq 1 ]; then
  echo "[CPU] WARNING: CPU idle below 20% — server is under heavy load!"
else
  echo "[CPU] OK — CPU load is acceptable"
fi

# Show top 3 CPU-eating processes
echo ""
echo "[CPU] Top 3 processes by CPU:"
ps aux --sort=-%cpu | awk 'NR==2,NR==4 {print $11, $3"%"}'
How to run: chmod +x cpu-check.sh && ./cpu-check.sh
⚠️ Note: If bc is not installed, run: sudo apt install bc -y
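If installing bc is not an option, awk can do the same decimal comparison, and awk is already used throughout these scripts. The following is a sketch of that alternative, not a change to cpu-check.sh; the helper name is_below is hypothetical.

```shell
#!/bin/bash
# is_below VALUE LIMIT  (hypothetical helper name)
# Succeeds (exit status 0) when VALUE < LIMIT; decimals are fine.
is_below() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v + 0 < limit + 0) }'
}

CPU_IDLE=12.5   # sample reading, not a live measurement
if is_below "$CPU_IDLE" 20; then
  echo "[CPU] WARNING: idle ${CPU_IDLE}% is below 20%"
else
  echo "[CPU] OK"
fi
```

The function returns an exit status rather than printing 1 or 0, so it drops straight into an if statement.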

Script 5 — FULL Server Health Check (Main Script)

⭐ Main Script
📌 This is the main script — combines disk + memory + CPU + log error check all in one. This is what gets scheduled with crontab.
server-health-check.sh — FULL SCRIPT
#!/bin/bash
# ============================================
# server-health-check.sh
# Full server health check — disk, RAM, CPU, logs
# Run: chmod +x server-health-check.sh
# Then: ./server-health-check.sh
# ============================================

# --- CONFIG — change these if needed ---
DISK_WARN=80 # warn at 80% or above
DISK_CRIT=90 # critical at 90% or above
SWAP_WARN=100 # warn if swap above 100MB
LOG_FILE="$HOME/payment-service.log"
REPORT="$HOME/health-report.txt"
# ---------------------------------------

# --- HEADER ---
echo "============================================"
echo " SERVER HEALTH CHECK"
echo " Date : $(date)"
echo "============================================"

# --- DISK CHECK ---
echo ""
echo "[1/4] DISK SPACE"
echo "--------------------------------------------"
DISK_USED=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
df -h /
echo ""

if [ "$DISK_USED" -ge "$DISK_CRIT" ]; then
  echo "STATUS: CRITICAL — Disk at $DISK_USED%! Clean up immediately!"
elif [ "$DISK_USED" -ge "$DISK_WARN" ]; then
  echo "STATUS: WARNING — Disk at $DISK_USED%. Schedule cleanup soon."
else
  echo "STATUS: OK — Disk at $DISK_USED%"
fi

# --- MEMORY CHECK ---
echo ""
echo "[2/4] MEMORY (RAM)"
echo "--------------------------------------------"
free -h
echo ""
SWAP_USED=$(free -m | awk '/^Swap:/ {print $3}')

if [ "$SWAP_USED" -gt "$SWAP_WARN" ]; then
  echo "STATUS: WARNING — Swap used: ${SWAP_USED}MB. RAM may be low."
else
  echo "STATUS: OK — Swap used: ${SWAP_USED}MB"
fi

# --- CPU CHECK ---
echo ""
echo "[3/4] CPU USAGE"
echo "--------------------------------------------"
echo "Top 5 processes by CPU:"
ps aux --sort=-%cpu | awk 'NR==2,NR==6 {printf "%-20s %s\n", $11, $3"%"}'
echo ""
# Load average (1min 5min 15min): sustained values above 1.0 per core mean the CPU is saturated
LOAD=$(awk '{print $1, $2, $3}' /proc/loadavg)
echo "Load average (1m 5m 15m): $LOAD"

# --- LOG ERROR CHECK ---
echo ""
echo "[4/4] LOG ERRORS"
echo "--------------------------------------------"

if [ -f "$LOG_FILE" ]; then
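  # grep -c already prints 0 when nothing matches (and exits 1), so adding
  # "|| echo 0" would emit 0 twice and break the numeric tests below
  ERR_COUNT=$(grep -c "ERROR" "$LOG_FILE" 2>/dev/null)
  WARN_COUNT=$(grep -c "WARN" "$LOG_FILE" 2>/dev/null)
  ERR_COUNT=${ERR_COUNT:-0}
  WARN_COUNT=${WARN_COUNT:-0}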
  echo "Log file : $LOG_FILE"
  echo "Errors : $ERR_COUNT"
  echo "Warnings : $WARN_COUNT"
  echo ""
  if [ "$ERR_COUNT" -gt 10 ]; then
    echo "STATUS: CRITICAL — $ERR_COUNT errors found! Investigate now."
    echo "Top errors:"
    grep "ERROR" "$LOG_FILE" | awk '{print $4}' | sort | uniq -c | sort -rn | head -3
  elif [ "$ERR_COUNT" -gt 0 ]; then
    echo "STATUS: WARNING — $ERR_COUNT errors found. Monitor closely."
  else
    echo "STATUS: OK — No errors in log"
  fi
else
  echo "Note: Log file not found at $LOG_FILE"
fi

# --- FOOTER ---
echo ""
echo "============================================"
echo " Check complete. To keep a copy, append the output to: $REPORT"
echo "============================================"
How to run: chmod +x server-health-check.sh && ./server-health-check.sh
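The script's comment says a load average above 1.0 per core means stress, but the raw figure it prints is for the whole machine. A small sketch of the per-core arithmetic follows; the helper name load_per_core is hypothetical, and nproc is assumed to be available (it is part of GNU coreutils).

```shell
#!/bin/bash
# load_per_core LOAD CORES  (hypothetical helper name)
# Prints LOAD divided by CORES, to two decimals.
load_per_core() {
  awk -v avg="$1" -v cores="$2" 'BEGIN { if (cores > 0) printf "%.2f\n", avg / cores }'
}

# Live usage: 1-minute load average and CPU count from the running system
LOAD_1M=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
echo "1-min load per core: $(load_per_core "$LOAD_1M" "$CORES") (above 1.00 means stressed)"
```

For example, a load of 3.0 on a 4-core server works out to 0.75 per core, which is still under the line.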

Script 6 — Schedule Health Check with Crontab

Automation
schedule-health-check.sh
#!/bin/bash
# Gives all scripts permission and schedules them
# Run this LAST after all scripts are in place

# Give all scripts permission to run
chmod +x "$HOME/server-health-check.sh"
chmod +x "$HOME/disk-check.sh"
chmod +x "$HOME/memory-check.sh"
chmod +x "$HOME/cpu-check.sh"

echo "Permissions set."
echo "Scheduling cron jobs..."

# Add cron jobs (keeps existing entries; safe to run more than once)
add_cron() {
  # $1 = comment line, $2 = cron entry; skip if the entry is already installed
  crontab -l 2>/dev/null | grep -qF "$2" && return 0
  (crontab -l 2>/dev/null; echo "$1"; echo "$2") | crontab -
}
add_cron "# Daily health check at 8 AM" "0 8 * * * $HOME/server-health-check.sh >> $HOME/health-report.txt 2>&1"
add_cron "# Disk check every 30 minutes" "*/30 * * * * $HOME/disk-check.sh >> $HOME/disk-report.txt 2>&1"

echo "Done. Your active cron jobs:"
echo ""
crontab -l
How to run: chmod +x schedule-health-check.sh && ./schedule-health-check.sh
04 Hands-on Lab — Run Everything Step by Step

🔬 Lab: Build and Run the Full Health Check

Kali Linux · VirtualBox
01
Copy all scripts to your home folder on Kali
Place all 6 scripts in /home/kali/, then move into that folder and make them all executable in one command.
terminal
cd ~
chmod +x ~/*.sh
02
Create the sample log file first
The full health check reads this log, so it must exist first.
terminal
./create-sample-log.sh
→ payment-service.log created with 17 lines ✅
03
Run each individual check to understand each metric
Test each one separately before running the full script.
terminal
./disk-check.sh
./memory-check.sh
./cpu-check.sh
→ Each check runs and shows OK or WARNING status ✅
04
Run the FULL health check script
This is the main script — runs all 4 checks together.
terminal
./server-health-check.sh
→ Full report: disk, RAM, CPU load, error count — all in one output ✅
05
Save output to a report file
Run it and save the result so you can review it later.
terminal
./server-health-check.sh >> ~/health-report.txt
cat ~/health-report.txt
→ health-report.txt created — full report saved ✅
06
Schedule it to run automatically every day
Run the scheduling script to set up crontab.
terminal
./schedule-health-check.sh
crontab -l
→ Cron jobs confirmed — health check runs every day at 8 AM automatically ✅
05 Real L2 Scenarios
01

You arrive at work and run ./server-health-check.sh as your first action. It shows disk at 93% — CRITICAL. Before any client calls you, you already know there's an issue and start the cleanup. Proactive support, not reactive.

02

An S1 fires and transactions are slow. You run the health check and it shows swap at 800MB, so RAM is nearly full. You run ps aux --sort=-%mem | head -5 to find which processes are using the most memory, then report on the bridge immediately.

03

Manager asks: "What was the server status at 8 AM today?" — Because your health check is scheduled and saving to health-report.txt, you open the file and read the exact status from this morning. Full history, no manual checking needed.

04

The health check runs at 8 AM and finds 6 ERROR lines in the log. You investigate before clients call. You find a DB connection error from overnight, trace it through the DB tables, and have a root cause ready before the working day even starts.
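For scenarios like the manager's question, the report file grows with every run, so it helps to pull out just the latest result. The sketch below assumes the report was produced by server-health-check.sh as written in this lesson (a " SERVER HEALTH CHECK" banner, a " Date :" line, and "STATUS:" lines); the helper name last_run_status is hypothetical.

```shell
#!/bin/bash
# last_run_status FILE  (hypothetical helper name)
# Prints the Date line and every STATUS line from the most recent run in FILE.
last_run_status() {
  awk '/SERVER HEALTH CHECK/      { n = 0 }          # new run: drop the previous one
       /^ Date :/ || /^STATUS:/   { line[++n] = $0 }
       END { for (i = 1; i <= n; i++) print line[i] }' "$1"
}

REPORT="$HOME/health-report.txt"
if [ -f "$REPORT" ]; then
  last_run_status "$REPORT"
else
  echo "No report yet at $REPORT"
fi
```

This gives a short summary to read out on a call, while the full report stays in the file for later review.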

✅ Week 3 · Day 5 Outcomes