L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 7
Incident Correlation
Metrics show you the numbers. Logs show you the events. Correlation is what happens when you look at both together at the same timestamp and ask — what caused what? Today you learn to connect those two stories into one clear root cause.
Logs + Metrics · Correlation · Timeline Analysis · RCA Skills · Root Cause
01 The Simple Idea First
Real-life Analogy
Think of a doctor diagnosing a patient. The heart monitor shows the pulse suddenly dropped at 3:42 PM (metric). The nurse's notes say the patient complained of chest pain at 3:40 PM (log). The medication record shows a drug was administered at 3:38 PM (another log).
The doctor correlates all three. Drug at 3:38 → chest pain at 3:40 → pulse drop at 3:42. That is the root cause chain. No single data source tells the full story — only by connecting them does the cause become clear.
Incident correlation in L2 is exactly this. CPU spike at 14:00 (metric) + DB timeout errors at 14:00 (log) + queue depth growing at 14:01 (another metric) = the DB caused the CPU spike which caused the queue to back up. Connected.
Why correlation matters — what happens without it
Without correlation: You see a CPU spike and restart the application. CPU comes down. 10 minutes later it spikes again. You restart again. This cycle repeats because you treated the symptom, not the cause.
With correlation: You see the CPU spike at 14:00 and immediately check the logs for 14:00 — you find DB timeouts. The DB is the cause of the spike, not the application itself. You fix the DB and CPU never spikes again.
The metric told you something is wrong. The log told you what is wrong. Correlating them told you why it is wrong. That is the full chain needed for a correct RCA.
02 What Metrics Tell You vs What Logs Tell You
Metrics — The Numbers
What metrics show
- CPU jumped to 92% at 14:00
- Error rate climbed to 8% at 14:01
- Response time hit 4.2 seconds at 14:02
- Queue depth grew to 48 at 14:03
- Transaction success rate dropped to 88%
Logs — The Events
What logs show
- [14:00:04] DB_CONNECTION_TIMEOUT TXN-501
- [14:00:05] TXN-502 FAILED cannot write to DB
- [14:00:06] Consumer paused — DB unreachable
- [14:01:00] DB_CONNECTION_REFUSED port 5432
- [14:01:30] Queue depth 28 — processing halted
💡 The correlation: Metrics show CPU spiked and error rate climbed at 14:00. Logs show DB_CONNECTION_TIMEOUT errors started at exactly 14:00:04. The timestamp match confirms — the DB failure caused the CPU spike and the error rate increase. One root cause explains all the metric changes.
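In practice that cross-check is one or two grep commands: take the minute the metric moved and pull every log line from that minute. A minimal sketch, with a placeholder log path (not a real file on your system):
terminal — pull the log lines for the minute a metric spiked (illustrative)
# Dashboard shows CPU and error rate jumping at 14:00, so grab that whole minute from the log
# (/var/log/payment-app/app.log is an assumed example path)
grep "\[14:00:" /var/log/payment-app/app.log
# Zoom in on the first ERROR of that minute, with 3 lines of build-up before it
grep "\[14:00:" /var/log/payment-app/app.log | grep -B 3 -m 1 "ERROR"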
03 Reading a Correlated Timeline
The correlated timeline — putting both data sources side by side
When you line up metrics and logs by timestamp, the cause-and-effect chain becomes visible. Metric changes usually follow the underlying log events by a short lag. Find the log event that happened just before a metric started moving — that is your root cause candidate.
📊 Incident Timeline — Metrics + Logs Together
- 13:58 · METRIC · CPU: 42% | Error rate: 0.3% | TXN success: 99.8% — everything normal
- 14:00:01 · LOG · WARN DB connection pool at 81% — first warning sign appears
- 14:00:03 · LOG · WARN DB connection pool at 96% — approaching full capacity
- 14:00:04 · CRITICAL · ERROR DB_CONNECTION_TIMEOUT — pool exhausted. TXN-501 FAILED. Root cause event.
- 14:00:15 · METRIC · CPU: 68% ↑ | Error rate: 3.2% ↑ — metrics start reacting to the DB failure
- 14:01:00 · LOG · ERROR DB_CONNECTION_REFUSED port 5432 — DB process crashed
- 14:01:20 · METRIC · CPU: 89% ↑ | Error rate: 8.4% ↑ | Queue depth: 28 ↑ — full impact across all metrics
- 14:15:00 · LOG · INFO DB connection restored. Pool at 12%. Processing resuming.
- 14:15:30 · METRIC · CPU: 44% ↓ | Error rate: 0.4% ↓ | Queue depth: 0 ↓ — all metrics recover together
💡 Read the timeline rule: The ERROR at 14:00:04 (DB pool exhausted) appeared before the metrics started moving at 14:00:15 — logs react faster than metrics. The last log event before the metrics begin to move is your root cause candidate; the WARNs leading up to it are the build-up.
04 Common Correlation Patterns in Fintech
🔗 Metric Change → Log Pattern → Root Cause
| Metric you see | Log pattern to search | Root cause |
| --- | --- | --- |
| CPU spike + error rate up | DB_CONNECTION_TIMEOUT in logs at the same time | DB pool exhausted — app retrying DB connections drives CPU |
| Response time slow + CPU normal | QUERY_TIMEOUT or slow query log entries | Slow DB query — all requests waiting for one query |
| Error rate up + CPU normal | SOCKET_TIMEOUT to external gateway | External API down — SBP or 1LINK not responding |
| Queue depth growing + no CPU spike | Consumer crashed or Consumer paused | MQ consumer down — messages piling up, not being consumed |
| RAM high + CPU high | OutOfMemoryError or GC overhead | Memory leak — app not releasing memory, causing GC pressure |
| All metrics normal but 1 endpoint slow | Slow log entries for the /raast/transfer endpoint only | Endpoint-specific issue — not system-wide, likely one targeted DB query |
| Disk I/O spiking + CPU up | Swap usage growing in free -h output | RAM exhausted — OS swapping to disk, causing both spikes |
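When an alert fires and you are not yet sure which row of this table you are in, a quick sweep of the log for all of these signatures at once narrows it down. A minimal sketch, assuming a placeholder log path; the search strings simply mirror the table above:
terminal — pattern sweep (illustrative)
LOG=/var/log/payment-app/app.log   # assumed example path, point this at your service's log
for pattern in "DB_CONNECTION_TIMEOUT" "QUERY_TIMEOUT" "SOCKET_TIMEOUT" \
               "Consumer (crashed|paused)" "OutOfMemoryError|GC overhead"; do
  # Count matching lines for each signature, case-insensitively
  printf "%-40s %s hit(s)\n" "$pattern" "$(grep -icE "$pattern" "$LOG")"
done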
05 Hands-on Lab — Trace a Real Issue
Lab scenario
A payment outage occurred between 14:00 and 14:20. You have a metrics snapshot and a log file. Your job is to correlate both data sources, build a timeline, identify the root cause, and write the complete RCA.
🔬 Lab: Correlate Metrics + Logs to Find Root Cause
Correlation · Timeline · RCA
Create the metrics snapshot file
This is what a monitoring dashboard would show — metric values at each timestamp during the incident.
terminal — create metrics file
cat > ~/metrics-snapshot.txt << 'EOF'
TIMESTAMP CPU% ERROR_RATE QUEUE_DEPTH RESP_TIME_MS
2024-03-15 13:55 41 0.2 2 180
2024-03-15 13:58 43 0.3 2 190
2024-03-15 14:00 45 0.3 2 195
2024-03-15 14:01 72 4.8 12 1840
2024-03-15 14:02 88 9.2 28 6200
2024-03-15 14:05 91 11.4 47 9100
2024-03-15 14:10 89 10.8 52 8800
2024-03-15 14:15 48 1.2 30 410
2024-03-15 14:18 42 0.4 8 200
2024-03-15 14:20 41 0.2 0 185
EOF
echo "Metrics snapshot created."
→ 10 metric snapshots across a 25-minute window. Normal → spike → recovery visible in the numbers.
Create the incident log file
This is what the log search returns for the same time window.
terminal — create incident log
cat > ~/incident.log << 'EOF'
[2024-03-15 13:55:00] [INFO ] Service running normally. 142 TXN/min
[2024-03-15 14:00:20] [WARN ] DB connection pool at 79%
[2024-03-15 14:00:40] [WARN ] DB connection pool at 88%
[2024-03-15 14:00:55] [WARN ] DB connection pool at 97%
[2024-03-15 14:01:02] [ERROR] DB_CONNECTION_TIMEOUT TXN-601 failed after 30000ms
[2024-03-15 14:01:03] [ERROR] TXN-601 FAILED cannot acquire DB connection
[2024-03-15 14:01:10] [ERROR] DB_CONNECTION_TIMEOUT TXN-602 failed after 30000ms
[2024-03-15 14:01:11] [ERROR] TXN-602 FAILED cannot acquire DB connection
[2024-03-15 14:01:30] [ERROR] DB_CONNECTION_REFUSED connection refused port 5432
[2024-03-15 14:01:31] [ERROR] TXN-603 FAILED DB service appears down
[2024-03-15 14:02:00] [ERROR] MQ consumer paused - DB unreachable. Queue depth: 28
[2024-03-15 14:05:00] [ERROR] Queue depth critical: 47 messages pending
[2024-03-15 14:15:01] [INFO ] DB connection restored. Pool at 8%. DB process restarted by DBA.
[2024-03-15 14:15:10] [INFO ] MQ consumer resumed. Processing queued messages.
[2024-03-15 14:20:00] [INFO ] Queue cleared. All pending transactions processed.
EOF
echo "Incident log created."
→ Incident log created. 15 entries covering the 13:55 baseline and the full 14:00–14:20 incident window.
Step 1 — Find the exact moment metrics started changing
Look at the metrics file and find where numbers first moved from normal.
terminal
# Show all metric readings — spot where numbers change
cat ~/metrics-snapshot.txt
# Find first row where error rate jumped above 1% (date and time are separate fields, so error rate is $4)
awk 'NR>1 && $4>1 {print "FIRST SPIKE:", $0; exit}' ~/metrics-snapshot.txt
# Find peak values — worst point of the incident (CPU is $3)
awk 'NR>1 {if($3>max_cpu){max_cpu=$3; peak=$0}} END{print "PEAK:", peak}' ~/metrics-snapshot.txt
→ Metrics first spiked at 14:01. CPU went from 45% to 72%, error rate from 0.3% to 4.8%. Peak was at 14:05 with CPU 91%, error rate 11.4%, queue depth 47.
Step 2 — Find the log event that happened just before the metrics moved
Metrics moved at 14:01. Search logs for what happened just before that — 14:00:xx.
terminal
# What happened in the log between 14:00 and 14:01?
grep "2024-03-15 14:00" ~/incident.log
# What was the first ERROR in the log?
grep -m 1 "ERROR" ~/incident.log
# Show the build-up — 3 lines before first error
grep -B 3 -m 1 "ERROR" ~/incident.log
→ WARN messages at 14:00:20, 14:00:40, 14:00:55 show the DB pool climbing. First ERROR at 14:01:02 — 42 seconds after the first WARN. Metrics started moving at 14:01 — exactly when the first DB error appeared. Correlation confirmed.
Step 3 — Confirm the correlation between metrics and logs
Verify that the log events and metric changes happened at matching timestamps.
terminal — build the correlation evidence
echo "=== CORRELATION ANALYSIS ==="
echo ""
echo "--- METRICS: When did values first change? ---"
awk 'NR>1 && $4>1 {print "  Error rate crossed 1% at:", $1, $2; exit}' ~/metrics-snapshot.txt
echo ""
echo "--- LOGS: First warning before spike ---"
grep -m 1 "WARN" ~/incident.log
echo ""
echo "--- LOGS: First error (root cause event) ---"
grep -m 1 "ERROR" ~/incident.log
echo ""
echo "--- LOGS: When did recovery happen? ---"
grep "restored\|resumed\|cleared" ~/incident.log
echo ""
echo "--- METRICS: When did values return to normal? ---"
awk 'NR>1 && $4<1 && prev_high==1 {print "  Metrics recovered at:", $1, $2; prev_high=0} NR>1 && $4>1 {prev_high=1}' ~/metrics-snapshot.txt
→ Log WARN at 14:00:20 → log ERROR at 14:01:02 → metric spike at 14:01 → log recovery at 14:15:01 → error rate back under 1% by 14:18. The timestamps line up cleanly. Root cause: DB connection pool exhaustion.
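To see the whole incident as one stream, you can also merge the metric rows and the log lines and sort them by timestamp, so every metric jump sits right next to the log event that preceded it. A minimal sketch using the two lab files; the output filename is just a suggestion:
terminal — merge metrics and logs into one time-ordered view
{
  # Reshape each metric row as "DATE HH:MM:00 METRIC CPU=..% ERR=..% QUEUE=.."
  awk 'NR>1 {print $1, $2":00", "METRIC", "CPU="$3"%", "ERR="$4"%", "QUEUE="$5}' ~/metrics-snapshot.txt
  # Strip the brackets around each log timestamp so the lines sort the same way
  sed 's/^\[\([^]]*\)\] /\1 LOG /' ~/incident.log
} | sort > ~/correlated-timeline.txt
cat ~/correlated-timeline.txt
→ One sorted stream: the 14:00:xx pool WARNs sit directly above the 14:01 metric row where CPU and error rate first jumped.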
Step 4 — Count affected transactions
How many transactions failed? What are their IDs? This goes in the RCA.
terminal
# List all failed transaction IDs
grep "FAILED" ~/incident.log | grep -o "TXN-[0-9]*"
# Count total failures
echo "Total failed TXNs: $(grep -c 'FAILED' ~/incident.log)"
# Outage duration from first error to recovery
echo "Outage start: $(grep -m1 'ERROR' ~/incident.log | awk '{print $1, $2}')"
echo "Outage end : $(grep -m1 'restored' ~/incident.log | awk '{print $1, $2}')"
→ 3 failed transactions: TXN-601, TXN-602, TXN-603. Outage lasted from 14:01:02 to 14:15:01 — approximately 14 minutes.
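You can also let the shell compute the outage duration instead of subtracting the timestamps by hand. A minimal sketch, assuming GNU date (the -d flag), which is standard on most Linux servers:
terminal — compute outage duration (assumes GNU date)
# Pull the bracketed timestamps of the first ERROR and the recovery line
start=$(grep -m1 'ERROR' ~/incident.log | awk -F'[][]' '{print $2}')
end=$(grep -m1 'restored' ~/incident.log | awk -F'[][]' '{print $2}')
# Convert both to epoch seconds and take the difference
secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
echo "Outage: $start -> $end (${secs}s, ~$(( (secs + 30) / 60 )) min)"
→ Outage: 2024-03-15 14:01:02 -> 2024-03-15 14:15:01 (839s, ~14 min), matching the figure quoted above.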
Write the complete correlated RCA
Use both the metric evidence and the log evidence to write a complete, professional Jira RCA.
Complete correlated RCA — ready for Jira
INCIDENT REPORT — Payment Outage 14:01–14:15
METRIC EVIDENCE:
Normal baseline: CPU 41-45%, Error rate 0.2-0.3%, Queue 2
14:01 — CPU jumped to 72%, Error rate 4.8%, Queue 12
14:05 — Peak: CPU 91%, Error rate 11.4%, Queue 47 (critical)
14:15 — Recovery: all metrics returned to baseline
LOG EVIDENCE:
14:00:20 — WARN DB pool 79% (first warning sign)
14:00:55 — WARN DB pool 97% (near exhaustion)
14:01:02 — ERROR DB_CONNECTION_TIMEOUT (root cause event)
14:01:30 — ERROR DB_CONNECTION_REFUSED (DB process crashed)
14:15:01 — INFO DB restored by DBA team
CORRELATION:
Log WARN at 14:00:20 preceded metric spike at 14:01 by 40s.
Log ERROR at 14:01:02 matches exactly with metric deterioration.
Log recovery at 14:15:01 matches metric recovery at 14:15.
ROOT CAUSE: DB connection pool exhausted at 14:01:02.
DB process crashed at 14:01:30. 3 transactions failed
(TXN-601, TXN-602, TXN-603). MQ consumer paused.
DBA restarted DB at 14:15. Full recovery by 14:20.
NEXT STEPS: Reconcile TXN-601/602/603. DBA to
investigate connection leak. Increase pool limit.
→ Complete correlated RCA with both metric and log evidence. Professional, timestamped, actionable. ✅
06 Real L2 Scenarios
01
CPU spike alert fires. Without correlation, you restart the app and CPU drops. 20 minutes later it spikes again. You check the logs at the timestamp of the spike — DB timeouts every time. The DB is the cause. You stop restarting the app and call the DBA. Problem solved permanently.
02
Error rate alert fires but CPU, disk, and memory are all healthy. You check logs at the alert timestamp — SOCKET_TIMEOUT to SBP gateway. No internal metric is spiking because the problem is external. You check SBP's status page, confirm they are degraded, and inform the bridge: our system is healthy, external dependency is down.
03
Manager asks: "Why did response time spike yesterday at 11 PM?" You pull metric snapshot — response time jumped from 200ms to 6.4 seconds at 23:00. You pull logs for 22:58–23:02 — a scheduled batch job started at 23:00 and ran a full table scan. Metric + log correlation = root cause in 4 minutes. Manager gets a clear answer.
04
Two alerts fire simultaneously — high CPU and high queue depth. Normally these look like two separate problems. You correlate the logs: consumer crashed at the same timestamp CPU started rising. One root cause, two alerts. Fix the consumer → CPU drops and queue drains. The correlation stopped you from treating them as two separate incidents.
✅ Week 6.5 · Day 7 Outcomes
- Explain what incident correlation is — looking at metrics and logs at the same timestamp to find cause-and-effect
- Understand what metrics tell you (the numbers changed) vs what logs tell you (what event caused the change)
- Know the rule — the log event just before a metric starts moving is your root cause candidate
- Read a correlated timeline and identify where the root cause event sits relative to metric changes
- Know 7 common correlation patterns — CPU spike + DB timeout logs, slow response + query timeout logs, error rate + external gateway logs
- Complete the lab — create both a metrics snapshot and an incident log, correlate them by timestamp, find the first warning sign, confirm the root cause event, count affected transactions, and write a complete correlated RCA with both metric and log evidence
- Understand why correlation prevents treating symptoms — without it you keep restarting the app; with it you fix the actual DB