L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 2
System Metrics
CPU, Memory, Disk, and Network — the four resources every server runs on. When one of them gets stressed, everything slows down or breaks. Today you learn to read each one, understand what stress looks like, and trace it to its cause.
CPU · Memory · Disk · Network · Resource Spike · Troubleshooting
01 The Simple Idea First
Real-life Analogy
Think of a server like a person running a busy office. They have four things they depend on every second of the day — their brain to think (CPU), desk space to work (Memory/RAM), a filing cabinet to store things (Disk), and a phone to communicate (Network).
If their brain is overloaded — they think slowly and make mistakes. If the desk is full — they cannot start new work. If the filing cabinet is jammed — they cannot save or find anything. If the phone line is congested — communication stops.
Resource troubleshooting is finding which of the four is overwhelmed — and why.
02 The 4 System Resources — Deep Dive
⚙️ Resource 1 · CPU: The Brain
The CPU executes every instruction your server runs: processing transactions, running queries, handling API requests. When the CPU is overloaded, work queues up waiting for its turn, so everything slows down. A task that took 200ms can now take 2 seconds.
What high CPU looks like: Slow response times across all services. Applications hanging. Load average growing over time.
What to check: Which process is consuming it. How long it has been high. Whether it is trending up or plateauing.
top -bn1 | grep "Cpu"
ps aux --sort=-%cpu | head -6
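A load average only means something relative to the number of CPU cores. The sketch below is one way to normalize it, assuming a Linux host where /proc/loadavg and nproc are available; the function name load_per_core is illustrative and not part of the lesson's toolset.

```shell
#!/bin/bash
# Hypothetical helper: divide a load average by the core count.
# A result near or above 1.00 suggests the CPUs are saturated.
load_per_core() {
    local load="$1" cores="$2"
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Usage: 1-minute load average from /proc/loadavg, core count from nproc
read -r load1 _ < /proc/loadavg
echo "Load per core: $(load_per_core "$load1" "$(nproc)")"
```

On Linux the load average also counts tasks stuck waiting on disk I/O, so a high value alongside low CPU usage is a hint to look at disk rather than CPU.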
🧠 Resource 2 · Memory: The Desk
RAM holds everything the server is actively working on. When RAM fills up, the OS starts using disk as backup memory; this is called swap. Swap is orders of magnitude slower than RAM, so when swap usage grows, the server becomes painfully slow even if CPU looks fine.
What high memory looks like: Increasing swap usage. Applications crashing with OutOfMemory errors. Gradual slowdown that gets worse over hours.
What to check: Available RAM. Swap used. Which process is the biggest memory consumer.
free -h
ps aux --sort=-%mem | head -6
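To turn "available RAM" into a percentage you can compare against a threshold, one option is to read /proc/meminfo directly. This is a minimal sketch, assuming a Linux kernel recent enough to report MemAvailable; the function name mem_available_pct is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: percentage of RAM still available,
# read from meminfo-style input on stdin (values are in kB).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage: feed it the live kernel data
if [ -r /proc/meminfo ]; then
    echo "RAM available: $(mem_available_pct < /proc/meminfo)%"
fi
```

Reading from stdin keeps the function easy to try against a saved copy of /proc/meminfo from an incident.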
💾 Resource 3 · Disk: The Filing Cabinet
Disk stores everything permanently — logs, databases, application files. When disk hits 100%, the server cannot write anything new. The DB stops recording transactions. Log files stop growing. Applications crash trying to write.
What a full disk looks like: Write errors in logs. DB transactions failing. Applications unable to start.
What to check: Which partition is full. Which folder is taking the most space. Whether logs are the culprit.
df -h
du -sh /* 2>/dev/null | sort -rh | head -8
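Reading df -h by eye works for a handful of partitions. As a sketch of how the same check could be filtered automatically, the hypothetical function partitions_over below reads df -P output and prints only the mount points at or above a chosen percentage. It assumes mount points without spaces in their names.

```shell
#!/bin/bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the given percentage.
partitions_over() {
    local limit="$1"
    awk -v limit="$limit" 'NR > 1 {
        pct = $5; sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, pct "%"
    }'
}

# Usage: show partitions at 80% or more (prints nothing if all are below)
df -P | partitions_over 80
```

The -P flag asks df for the POSIX output format, which keeps each filesystem on a single line so the column positions stay predictable.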
🌐 Resource 4 · Network: The Phone Line
Network carries data between your server, the DB, external APIs (SBP, 1LINK), and clients. When the network is congested or a connection is broken, transactions time out even if every other resource is perfectly healthy.
What network issues look like: SOCKET_TIMEOUT and CONNECTION_REFUSED errors. Intermittent failures. Selective failures for external APIs only.
What to check: Can you reach the target server? Is packet loss occurring? Is bandwidth saturated?
ping -c 4 8.8.8.8
netstat -an | grep ESTABLISHED | wc -l
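Counting one state at a time hides the overall picture. The following sketch groups all TCP connections by state in a single pass, assuming netstat -an style output where the state is the last column; the function name count_tcp_states is illustrative.

```shell
#!/bin/bash
# Hypothetical helper: read `netstat -an` style lines on stdin and
# count TCP connections per state (the state is the last column).
count_tcp_states() {
    awk '$1 ~ /^tcp/ { states[$NF]++ }
         END { for (s in states) print states[s], s }' | sort -rn
}

# Usage: prints lines such as "<count> ESTABLISHED", highest count first
netstat -an 2>/dev/null | count_tcp_states
```

Comparing these counts with a reading taken on a quiet day makes an unusual pile-up in one state easy to spot.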
03 Resource Thresholds — When to Act
⚠️ Action Thresholds for All 4 Resources
| Resource | Healthy | Warning (monitor) | Critical (act now) |
|---|---|---|---|
| CPU Usage | Below 70% | 70% – 85% | Above 90%: find the process |
| RAM Available | Above 30% free | 15% – 30% free | Below 15% free: check swap |
| Swap Used | 0 – low | Growing steadily | High and maxed: RAM gone |
| Disk Usage | Below 80% | 80% – 90% | Above 90%: clean up now |
| Network Ping | Under 20ms | 20ms – 100ms | Over 200ms / packet loss |
| Open Connections | Normal baseline | 2x normal | Maxed out: connection leak |
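The thresholds above can be applied mechanically. A minimal sketch, using the disk row (warning at 80%, critical at 90%) as the example; the function name classify is hypothetical, and it only handles whole-number readings where higher is worse.

```shell
#!/bin/bash
# Hypothetical helper: map a "higher is worse" reading to a verdict.
# Usage: classify VALUE WARN_AT CRIT_AT
classify() {
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL"
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

classify 72 80 90   # disk at 72% -> OK
classify 85 80 90   # disk at 85% -> WARNING
classify 93 80 90   # disk at 93% -> CRITICAL
```

For readings where lower is worse, such as RAM available, one option is to pass the used percentage (100 minus the free percentage) instead.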
04 How the 4 Resources Affect Each Other
Resources do not fail in isolation
This is the most important thing to understand. A problem in one resource almost always creates symptoms in another. The visible symptom is not always the root cause.
Disk full → CPU spikes. When disk is full, applications start throwing errors and retrying. The retry loops consume CPU. You see high CPU but the real cause is disk.
RAM full → Disk activity spikes. When RAM is exhausted, the OS swaps to disk constantly. Disk read/write activity shoots up and the disk appears busy, but the real cause is RAM exhaustion.
Network congestion → CPU spikes. When the network is slow, connection queues build up. The application keeps retrying and juggling a growing backlog of pending requests, which consumes CPU. CPU looks high but the network is the bottleneck.
Rule: Always check all 4 resources together. Run disk, memory, CPU, and network checks before concluding anything. The noisy one is often covering for the real culprit.
💡 A reliable investigation order: check disk first (the most common silent killer) → check memory and swap → check CPU and which process → check network if all three above look fine.
05 Commands for Every Resource
Full resource investigation — run these in order
# ===== DISK — check first =====
df -h # all partitions with % used
du -sh /var/log/* 2>/dev/null | sort -rh | head -5 # biggest log folders
# ===== MEMORY — check second =====
free -h # RAM and swap overview
ps aux --sort=-%mem | head -5 # top memory consumers
# ===== CPU — check third =====
ps aux --sort=-%cpu | head -5 # top CPU consumers
cat /proc/loadavg # load average 1m 5m 15m
# ===== NETWORK — check if above 3 look fine =====
ping -c 4 8.8.8.8 # basic connectivity test
netstat -an | grep ESTABLISHED | wc -l # open connection count
netstat -an | grep TIME_WAIT | wc -l # recently closed connections waiting to expire (a high count means heavy connection churn)
06 Hands-on Lab — Detect a Resource Spike
🔬 Lab: Detect and Investigate a Resource Spike
Kali Linux · Real Commands
Create a full resource snapshot — your baseline
Before creating a spike, record what normal looks like. This is what you compare against when investigating.
terminal — record baseline
echo "=== BASELINE READING — $(date) ==="
echo ""
echo "-- DISK --"
df -h /
echo ""
echo "-- MEMORY --"
free -h
echo ""
echo "-- CPU LOAD --"
cat /proc/loadavg
echo ""
echo "-- NETWORK --"
ping -c 2 8.8.8.8 | tail -2
→ Save these numbers. They are your "normal." When a spike happens you compare the new reading against this.
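Saving the baseline to a file makes the later comparison mechanical instead of relying on memory. The sketch below is one way to do it, assuming a Linux host; the function name snapshot, the key names, and the file location are all hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the key readings as key=value lines.
snapshot() {
    local disk load mem
    disk=$(df -P / | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
    load=$(cut -d ' ' -f 1 /proc/loadavg)
    mem=$(awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo)
    printf 'disk_pct=%s\nload_1m=%s\nmem_avail_mb=%s\n' "$disk" "$load" "$mem"
}

# Hypothetical location for the saved baseline
BASELINE="${TMPDIR:-/tmp}/baseline.txt"

# Record once, while the system is healthy
snapshot > "$BASELINE"

# Later, during an incident, show only the readings that changed
snapshot | diff "$BASELINE" - || true
```

diff exits with a non-zero status when the readings differ, which is the expected case here, hence the trailing "|| true".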
Create a real CPU spike and catch it
Run the spike in background, then immediately investigate it the proper way.
terminal — create spike then investigate
# Step 1: create a 30 second CPU spike in background
timeout 30 yes > /dev/null &
echo "Spike started. PID: $!"
# Step 2: give the spike a few seconds to register, then investigate
sleep 5
ps aux --sort=-%cpu | head -6
# Step 3: what is load average now vs baseline?
cat /proc/loadavg
# Step 4: find which process, when it started, and how much CPU time it has used
ps aux --sort=-%cpu | awk 'NR==2 {print "PID: "$2, "CPU: "$3"%", "Started: "$9, "CPU time: "$10, "Process: "$11}'
→ "yes" process appears at top using high CPU. You can see the PID, CPU%, and load average increased from baseline — real resource spike detected.
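The three numbers in /proc/loadavg can also tell you whether a spike is building or fading: if the 1-minute figure is above the 15-minute figure, load is rising. A minimal sketch of that comparison, with a hypothetical function name load_trend:

```shell
#!/bin/bash
# Hypothetical helper: compare the 1-minute and 15-minute load averages.
load_trend() {
    awk -v one="$1" -v fifteen="$2" 'BEGIN {
        if (one > fifteen)      print "rising"
        else if (one < fifteen) print "falling"
        else                    print "steady"
    }'
}

# Usage: fields 1 and 3 of /proc/loadavg are the 1m and 15m averages
read -r l1 _ l15 _ < /proc/loadavg
echo "Load is $(load_trend "$l1" "$l15") (1m=$l1, 15m=$l15)"
```

Run it during the 30 second spike and again a few minutes later to watch the verdict change.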
Simulate a disk pressure check
Create a large temporary file to simulate disk pressure, check it, then clean it up.
terminal
# Create a temporary 100MB file (simulating log growth)
# $HOME rather than ~: not every shell expands ~ after "of="
dd if=/dev/zero of="$HOME/test-large-file.tmp" bs=1M count=100 2>/dev/null
echo "File created. Checking disk..."
# Check disk after creating the file
df -h ~
# Find the biggest files in home folder
du -sh ~/* 2>/dev/null | sort -rh | head -5
# Clean up — always delete test files
rm ~/test-large-file.tmp
echo "Cleaned up. Disk back to normal."
→ After creating the file, disk usage rises by about 100MB (on a large disk the percentage shown by df may not move). du lists the test file among the biggest items in your home folder. After rm, disk returns to baseline. This is how you find what is eating disk space.
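du -sh on a folder shows which directory is large but not which file inside it. One way to drill down is to rank individual files by size. This is a sketch, assuming GNU or POSIX find and du; the function name largest_files is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the N largest regular files under a directory.
# Sizes are in kilobytes; -xdev keeps find on a single filesystem.
largest_files() {
    local dir="$1" count="${2:-5}"
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
}

# Usage: the five biggest files under /var/log
largest_files /var/log 5
```

Files you are not allowed to read are silently skipped, so run it with sufficient privileges when investigating a real server.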
Check memory and swap — the hidden slowdown
Read the free command output carefully — available RAM and swap usage together tell the full story.
terminal — memory deep dive
# Full memory breakdown
free -h
# Just the numbers you need — available RAM and swap used
free -m | awk 'NR==2 {print "RAM Available: "$7"MB"} NR==3 {print "Swap Used: "$3"MB"}'
# Top 3 memory-consuming processes
ps aux --sort=-%mem | awk 'NR==2,NR==4 {printf "%-25s %s%%\n", $11, $4}'
→ Shows RAM available in MB, swap used, and top 3 memory hogs. On Kali this will likely be your browser and desktop environment — on a server it would be Java apps and the DB process.
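The ps output ranks processes by resident memory, but it does not show which ones have been pushed out to swap. On Linux that figure is exposed per process as VmSwap in /proc/PID/status. A minimal sketch of reading it, with a hypothetical helper name swap_of:

```shell
#!/bin/bash
# Hypothetical helper: print "SWAP_KB NAME" for one /proc/<pid>/status file.
# Kernel threads have no VmSwap line and produce no output.
swap_of() {
    awk '/^Name:/ { name = $2 } /^VmSwap:/ { print $2, name }' "$1" 2>/dev/null
}

# Usage: top 3 processes by swapped-out memory (in kB)
for status in /proc/[0-9]*/status; do
    swap_of "$status"
done | sort -rn | head -3
```

If every value is 0, nothing is swapped out and the slowdown has another cause.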
Build a full resource spike detection script
This script checks all 4 resources, compares against thresholds, and prints a verdict for each.
resource-check.sh — create and run
#!/bin/bash
# Full resource spike detector
# chmod +x resource-check.sh && ./resource-check.sh
echo "========================================"
echo " Resource Spike Check: $(date)"
echo "========================================"
# DISK: percent used on the root partition (-P keeps each entry on one line)
DISK=$(df -P / | awk 'NR==2{print $5}' | tr -d '%')
if [ "$DISK" -ge 90 ]; then
    echo "[DISK ] CRITICAL: $DISK% used!"
elif [ "$DISK" -ge 80 ]; then
    echo "[DISK ] WARNING : $DISK% used"
else
    echo "[DISK ] OK : $DISK% used"
fi
# SWAP: megabytes of swap in use (500MB is this script's warning line)
SWAP=$(free -m | awk 'NR==3{print $3}')
if [ "$SWAP" -gt 500 ]; then
    echo "[MEMORY] WARNING : Swap ${SWAP}MB, RAM low!"
else
    echo "[MEMORY] OK : Swap ${SWAP}MB"
fi
# CPU: 1-minute load average vs core count (rule of thumb: load above cores = saturated)
LOAD=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
if awk -v l="$LOAD" -v c="$CORES" 'BEGIN{exit !(l > c)}'; then
    echo "[CPU ] WARNING : Load avg $LOAD exceeds $CORES cores"
else
    echo "[CPU ] OK : Load avg $LOAD (1 min, $CORES cores)"
fi
echo "[CPU ] Top process: $(ps aux --sort=-%cpu | awk 'NR==2{print $11, $3"%"}')"
# NETWORK: one ping with a 2-second timeout
if ping -c 1 -W 2 8.8.8.8 >/dev/null 2>&1; then
    echo "[NETWORK] OK: external reach confirmed"
else
    echo "[NETWORK] WARNING: cannot reach 8.8.8.8"
fi
echo "========================================"
→ Single script covers all 4 resources. Each one prints OK, WARNING, or CRITICAL with the actual value. Schedule this with crontab to run every 30 minutes. ✅
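Scheduling the script every 30 minutes comes down to a single crontab entry. The sketch below builds that entry; the script and log paths are hypothetical and should be changed to wherever the files actually live. The install command is left commented out so nothing is changed by accident.

```shell
#!/bin/bash
# Hypothetical paths: adjust to where the script and its log actually live.
SCRIPT="$HOME/resource-check.sh"
LOG="$HOME/resource-check.log"

# Cron entry: "*/30" in the minute field means "every 30 minutes".
# Output and errors are appended to the log file.
CRON_LINE="*/30 * * * * $SCRIPT >> $LOG 2>&1"
echo "$CRON_LINE"

# To install it while keeping any existing entries, uncomment:
# ( crontab -l 2>/dev/null; echo "$CRON_LINE" ) | crontab -
```

After installing, crontab -l lists the active entries so you can confirm the line is there.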
07 Real L2 Scenarios
01
CPU looks fine at 45%. Memory looks fine. But transactions are timing out. You check network — ping to SBP gateway is 800ms with 30% packet loss. Not a server resource issue at all. Pure network problem between your server and SBP. You escalate to the network team and notify the bridge.
02
Server is extremely slow. CPU is at 95%. You investigate — top process is the Java app. But you also check memory — swap is at 2GB and RAM is 98% used. The real cause is RAM exhaustion causing constant swapping which in turn is driving up CPU. Fix the memory leak, CPU comes down on its own.
03
Applications start failing to write logs. DB transactions start failing. CPU is normal. Memory is normal. You check disk — disk is at 100%. You run du -sh /var/log/* — the application has been writing debug logs for a week and they are 48GB. You archive the old logs, disk drops to 62%, everything resumes within seconds.
04
Everything looks healthy at first glance. But response times are slow only for external API calls; internal calls are fast. You run netstat and find 2,400 connections in TIME_WAIT state, way above normal. TIME_WAIT sockets are connections the app has already closed, so a count this high means it is opening a new connection for every call instead of reusing pooled ones, and new calls are left waiting for free ports. L3 fixes the connection handling in the app, and the netstat count drops to normal.
✅ Week 6.5 · Day 2 Outcomes
- Explain what each of the 4 system resources does — CPU (processing), Memory (active work), Disk (storage), Network (communication)
- Understand how resources affect each other — full disk can spike CPU, RAM exhaustion spikes disk I/O, network congestion spikes CPU
- Know the key commands for each resource — df -h, free -h, ps aux, ping, netstat
- Apply the correct investigation order — Disk first, then Memory, then CPU, then Network
- Know the thresholds for each resource — what is OK, Warning, and Critical
- Complete the lab — record a baseline, create and detect a real CPU spike, simulate disk pressure, read memory and swap output, and build a full resource-check.sh script that covers all 4
- Schedule the resource check script with crontab for automatic monitoring every 30 minutes