L2 Support Engineer · Fintech · Week 6.5
Week 6.5 · Day 7
Incident Correlation
Metrics show you the numbers. Logs show you the events. Correlation is what happens when you look at both together at the same timestamp and ask — what caused what? Today you learn to connect those two stories into one clear root cause.
Logs + Metrics · Correlation · Timeline Analysis · RCA Skills · Root Cause
01 The Simple Idea First
Real-life Analogy
Think of a doctor diagnosing a patient. The heart monitor shows the pulse suddenly dropped at 3:42 PM (metric). The nurse's notes say the patient complained of chest pain at 3:40 PM (log). The medication record shows a drug was administered at 3:38 PM (another log).
The doctor correlates all three. Drug at 3:38 → chest pain at 3:40 → pulse drop at 3:42. That is the root cause chain. No single data source tells the full story — only by connecting them does the cause become clear.
Incident correlation in L2 is exactly this. CPU spike at 14:00 (metric) + DB timeout errors at 14:00 (log) + queue depth growing at 14:01 (another metric) = the DB caused the CPU spike which caused the queue to back up. Connected.
Why correlation matters — what happens without it
Without correlation: You see a CPU spike and restart the application. CPU comes down. 10 minutes later it spikes again. You restart again. This cycle repeats because you treated the symptom, not the cause.
With correlation: You see the CPU spike at 14:00 and immediately check the logs for 14:00 — you find DB timeouts. The DB is the cause of the spike, not the application itself. You fix the DB and CPU never spikes again.
The metric told you something is wrong. The log told you what is wrong. Correlating them told you why it is wrong. That is the full chain needed for a correct RCA.
02 What Metrics Tell You vs What Logs Tell You
Metrics — The Numbers
What metrics show
- CPU jumped to 92% at 14:00
- Error rate climbed to 8% at 14:01
- Response time hit 4.2 seconds at 14:02
- Queue depth grew to 48 at 14:03
- Transaction success rate dropped to 88%
Logs — The Events
What logs show
- [14:00:04] DB_CONNECTION_TIMEOUT TXN-501
- [14:00:05] TXN-502 FAILED cannot write to DB
- [14:00:06] Consumer paused — DB unreachable
- [14:01:00] DB_CONNECTION_REFUSED port 5432
- [14:01:30] Queue depth 28 — processing halted
💡 The correlation: Metrics show CPU spiked and error rate climbed at 14:00. Logs show DB_CONNECTION_TIMEOUT errors started at exactly 14:00:04. The timestamp match confirms — the DB failure caused the CPU spike and the error rate increase. One root cause explains all the metric changes.
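In practice that cross-check is one or two grep commands: take the minute the metric moved and pull every log line from that minute. A minimal sketch, with a placeholder log path (not a real file on your system):
terminal — pull the log lines for the minute a metric spiked (illustrative)
# Dashboard shows CPU and error rate jumping at 14:00, so grab that whole minute from the log
# (/var/log/payment-app/app.log is an assumed example path)
grep "\[14:00:" /var/log/payment-app/app.log
# Zoom in on the first ERROR of that minute, with 3 lines of build-up before it
grep "\[14:00:" /var/log/payment-app/app.log | grep -B 3 -m 1 "ERROR"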
03 Reading a Correlated Timeline
The correlated timeline — putting both data sources side by side
When you line up metrics and logs by timestamp, the cause-and-effect chain becomes visible. Metric changes usually follow the underlying log events by a short lag. Find the log event that happened just before a metric started moving — that is your root cause candidate.
📊 Incident Timeline — Metrics + Logs Together
- 13:58 · METRIC · CPU: 42% | Error rate: 0.3% | TXN success: 99.8% — everything normal
- 14:00:01 · LOG · WARN DB connection pool at 81% — first warning sign appears
- 14:00:03 · LOG · WARN DB connection pool at 96% — approaching full capacity
- 14:00:04 · CRITICAL · ERROR DB_CONNECTION_TIMEOUT — pool exhausted. TXN-501 FAILED. Root cause event.
- 14:00:15 · METRIC · CPU: 68% ↑ | Error rate: 3.2% ↑ — metrics start reacting to the DB failure
- 14:01:00 · LOG · ERROR DB_CONNECTION_REFUSED port 5432 — DB process crashed
- 14:01:20 · METRIC · CPU: 89% ↑ | Error rate: 8.4% ↑ | Queue depth: 28 ↑ — full impact across all metrics
- 14:15:00 · LOG · INFO DB connection restored. Pool at 12%. Processing resuming.
- 14:15:30 · METRIC · CPU: 44% ↓ | Error rate: 0.4% ↓ | Queue depth: 0 ↓ — all metrics recover together
💡 Read the timeline rule: The ERROR at 14:00:04 (DB pool exhausted) appeared before the metrics started moving at 14:00:15 — logs react faster than metrics. The last log event before the metrics begin to move is your root cause candidate; the WARNs leading up to it are the build-up.
04 Common Correlation Patterns in Fintech
🔗 Metric Change → Log Pattern → Root Cause
| Metric you see | Log pattern to search | Root cause |
| --- | --- | --- |
| CPU spike + error rate up | DB_CONNECTION_TIMEOUT in logs at the same time | DB pool exhausted — app retrying DB connections drives CPU |
| Response time slow + CPU normal | QUERY_TIMEOUT or slow query log entries | Slow DB query — all requests waiting for one query |
| Error rate up + CPU normal | SOCKET_TIMEOUT to external gateway | External API down — SBP or 1LINK not responding |
| Queue depth growing + no CPU spike | Consumer crashed or Consumer paused | MQ consumer down — messages piling up, not being consumed |
| RAM high + CPU high | OutOfMemoryError or GC overhead | Memory leak — app not releasing memory, causing GC pressure |
| All metrics normal but 1 endpoint slow | Slow log entries for the /raast/transfer endpoint only | Endpoint-specific issue — not system-wide, likely one targeted DB query |
| Disk I/O spiking + CPU up | Swap usage growing in free -h output | RAM exhausted — OS swapping to disk, causing both spikes |
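When an alert fires and you are not yet sure which row of this table you are in, a quick sweep of the log for all of these signatures at once narrows it down. A minimal sketch, assuming a placeholder log path; the search strings simply mirror the table above:
terminal — pattern sweep (illustrative)
LOG=/var/log/payment-app/app.log   # assumed example path, point this at your service's log
for pattern in "DB_CONNECTION_TIMEOUT" "QUERY_TIMEOUT" "SOCKET_TIMEOUT" \
               "Consumer (crashed|paused)" "OutOfMemoryError|GC overhead"; do
  # Count matching lines for each signature, case-insensitively
  printf "%-40s %s hit(s)\n" "$pattern" "$(grep -icE "$pattern" "$LOG")"
done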
05 Hands-on Lab — Trace a Real Issue
Lab scenario
A payment outage occurred between 14:00 and 14:20. You have a metrics snapshot and a log file. Your job is to correlate both data sources, build a timeline, identify the root cause, and write the complete RCA.
🔬 Lab: Correlate Metrics + Logs to Find Root Cause
Correlation · Timeline · RCA
Create the metrics snapshot file
This is what a monitoring dashboard would show — metric values at each timestamp during the incident.
terminal — create metrics file
cat > ~/metrics-snapshot.txt << 'EOF'
TIMESTAMP CPU% ERROR_RATE QUEUE_DEPTH RESP_TIME_MS
2024-03-15 13:55 41 0.2 2 180
2024-03-15 13:58 43 0.3 2 190
2024-03-15 14:00 45 0.3 2 195
2024-03-15 14:01 72 4.8 12 1840
2024-03-15 14:02 88 9.2 28 6200
2024-03-15 14:05 91 11.4 47 9100
2024-03-15 14:10 89 10.8 52 8800
2024-03-15 14:15 48 1.2 30 410
2024-03-15 14:18 42 0.4 8 200
2024-03-15 14:20 41 0.2 0 185
EOF
echo "Metrics snapshot created."
→ 10 metric snapshots across a 25-minute window. Normal → spike → recovery visible in the numbers.
Create the incident log file
This is what the log search returns for the same time window.
terminal — create incident log
cat > ~/incident.log << 'EOF'
[2024-03-15 13:55:00] [INFO ] Service running normally. 142 TXN/min
[2024-03-15 14:00:20] [WARN ] DB connection pool at 79%
[2024-03-15 14:00:40] [WARN ] DB connection pool at 88%
[2024-03-15 14:00:55] [WARN ] DB connection pool at 97%
[2024-03-15 14:01:02] [ERROR] DB_CONNECTION_TIMEOUT TXN-601 failed after 30000ms
[2024-03-15 14:01:03] [ERROR] TXN-601 FAILED cannot acquire DB connection
[2024-03-15 14:01:10] [ERROR] DB_CONNECTION_TIMEOUT TXN-602 failed after 30000ms
[2024-03-15 14:01:11] [ERROR] TXN-602 FAILED cannot acquire DB connection
[2024-03-15 14:01:30] [ERROR] DB_CONNECTION_REFUSED connection refused port 5432
[2024-03-15 14:01:31] [ERROR] TXN-603 FAILED DB service appears down
[2024-03-15 14:02:00] [ERROR] MQ consumer paused - DB unreachable. Queue depth: 28
[2024-03-15 14:05:00] [ERROR] Queue depth critical: 47 messages pending
[2024-03-15 14:15:01] [INFO ] DB connection restored. Pool at 8%. DB process restarted by DBA.
[2024-03-15 14:15:10] [INFO ] MQ consumer resumed. Processing queued messages.
[2024-03-15 14:20:00] [INFO ] Queue cleared. All pending transactions processed.
EOF
echo "Incident log created."
→ Incident log created. 15 entries covering the 13:55 baseline and the full 14:00–14:20 incident window.
Step 1 — Find the exact moment metrics started changing
Look at the metrics file and find where numbers first moved from normal.
terminal
# Show all metric readings — spot where numbers change
cat ~/metrics-snapshot.txt
# Find first row where error rate jumped above 1% (date and time are separate fields, so error rate is $4)
awk 'NR>1 && $4>1 {print "FIRST SPIKE:", $0; exit}' ~/metrics-snapshot.txt
# Find peak values — worst point of the incident (CPU is $3)
awk 'NR>1 {if($3>max_cpu){max_cpu=$3; peak=$0}} END{print "PEAK:", peak}' ~/metrics-snapshot.txt
→ Metrics first spiked at 14:01. CPU went from 45% to 72%, error rate from 0.3% to 4.8%. Peak was at 14:05 with CPU 91%, error rate 11.4%, queue depth 47.
Step 2 — Find the log event that happened just before the metrics moved
Metrics moved at 14:01. Search logs for what happened just before that — 14:00:xx.
terminal
# What happened in the log between 14:00 and 14:01?
grep "2024-03-15 14:00" ~/incident.log
# What was the first ERROR in the log?
grep -m 1 "ERROR" ~/incident.log
# Show the build-up — 3 lines before first error
grep -B 3 -m 1 "ERROR" ~/incident.log
→ WARN messages at 14:00:20, 14:00:40, 14:00:55 show the DB pool climbing. First ERROR at 14:01:02 — 42 seconds after the first WARN. Metrics started moving at 14:01 — exactly when the first DB error appeared. Correlation confirmed.
Step 3 — Confirm the correlation between metrics and logs
Verify that the log events and metric changes happened at matching timestamps.
terminal — build the correlation evidence
echo "=== CORRELATION ANALYSIS ==="
echo ""
echo "--- METRICS: When did values first change? ---"
awk 'NR>1 && $4>1 {print "  Error rate crossed 1% at:", $1, $2; exit}' ~/metrics-snapshot.txt
echo ""
echo "--- LOGS: First warning before spike ---"
grep -m 1 "WARN" ~/incident.log
echo ""
echo "--- LOGS: First error (root cause event) ---"
grep -m 1 "ERROR" ~/incident.log
echo ""
echo "--- LOGS: When did recovery happen? ---"
grep "restored\|resumed\|cleared" ~/incident.log
echo ""
echo "--- METRICS: When did values return to normal? ---"
awk 'NR>1 && $4<1 && prev_high==1 {print "  Metrics recovered at:", $1, $2; prev_high=0} NR>1 && $4>1 {prev_high=1}' ~/metrics-snapshot.txt
→ Log WARN at 14:00:20 → log ERROR at 14:01:02 → metric spike at 14:01 → log recovery at 14:15:01 → error rate back under 1% by 14:18. The timestamps line up cleanly. Root cause: DB connection pool exhaustion.
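To see the whole incident as one stream, you can also merge the metric rows and the log lines and sort them by timestamp, so every metric jump sits right next to the log event that preceded it. A minimal sketch using the two lab files; the output filename is just a suggestion:
terminal — merge metrics and logs into one time-ordered view
{
  # Reshape each metric row as "DATE HH:MM:00 METRIC CPU=..% ERR=..% QUEUE=.."
  awk 'NR>1 {print $1, $2":00", "METRIC", "CPU="$3"%", "ERR="$4"%", "QUEUE="$5}' ~/metrics-snapshot.txt
  # Strip the brackets around each log timestamp so the lines sort the same way
  sed 's/^\[\([^]]*\)\] /\1 LOG /' ~/incident.log
} | sort > ~/correlated-timeline.txt
cat ~/correlated-timeline.txt
→ One sorted stream: the 14:00:xx pool WARNs sit directly above the 14:01 metric row where CPU and error rate first jumped.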
Step 4 — Count affected transactions
How many transactions failed? What are their IDs? This goes in the RCA.
terminal
# List all failed transaction IDs
grep "FAILED" ~/incident.log | grep -o "TXN-[0-9]*"
# Count total failures
echo "Total failed TXNs: $(grep -c 'FAILED' ~/incident.log)"
# Outage duration from first error to recovery
echo "Outage start: $(grep -m1 'ERROR' ~/incident.log | awk '{print $1, $2}')"
echo "Outage end : $(grep -m1 'restored' ~/incident.log | awk '{print $1, $2}')"
→ 3 failed transactions: TXN-601, TXN-602, TXN-603. Outage lasted from 14:01:02 to 14:15:01 — approximately 14 minutes.
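You can also let the shell compute the outage duration instead of subtracting the timestamps by hand. A minimal sketch, assuming GNU date (the -d flag), which is standard on most Linux servers:
terminal — compute outage duration (assumes GNU date)
# Pull the bracketed timestamps of the first ERROR and the recovery line
start=$(grep -m1 'ERROR' ~/incident.log | awk -F'[][]' '{print $2}')
end=$(grep -m1 'restored' ~/incident.log | awk -F'[][]' '{print $2}')
# Convert both to epoch seconds and take the difference
secs=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
echo "Outage: $start -> $end (${secs}s, ~$(( (secs + 30) / 60 )) min)"
→ Outage: 2024-03-15 14:01:02 -> 2024-03-15 14:15:01 (839s, ~14 min), matching the figure quoted above.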
Write the complete correlated RCA
Use both the metric evidence and the log evidence to write a complete, professional Jira RCA.
Complete correlated RCA — ready for Jira
INCIDENT REPORT — Payment Outage 14:01–14:15
METRIC EVIDENCE:
Normal baseline: CPU 41-45%, Error rate 0.2-0.3%, Queue 2
14:01 — CPU jumped to 72%, Error rate 4.8%, Queue 12
14:05 — Peak: CPU 91%, Error rate 11.4%, Queue 47 (critical)
14:15 — Recovery: all metrics returned to baseline
LOG EVIDENCE:
14:00:20 — WARN DB pool 79% (first warning sign)
14:00:55 — WARN DB pool 97% (near exhaustion)
14:01:02 — ERROR DB_CONNECTION_TIMEOUT (root cause event)
14:01:30 — ERROR DB_CONNECTION_REFUSED (DB process crashed)
14:15:01 — INFO DB restored by DBA team
CORRELATION:
Log WARN at 14:00:20 preceded metric spike at 14:01 by 40s.
Log ERROR at 14:01:02 matches exactly with metric deterioration.
Log recovery at 14:15:01 matches metric recovery at 14:15.
ROOT CAUSE: DB connection pool exhausted at 14:01:02.
DB process crashed at 14:01:30. 3 transactions failed
(TXN-601, TXN-602, TXN-603). MQ consumer paused.
DBA restarted DB at 14:15. Full recovery by 14:20.
NEXT STEPS: Reconcile TXN-601/602/603. DBA to
investigate connection leak. Increase pool limit.
→ Complete correlated RCA with both metric and log evidence. Professional, timestamped, actionable. ✅
06 Real L2 Scenarios
01
CPU spike alert fires. Without correlation, you restart the app and CPU drops. 20 minutes later it spikes again. You check the logs at the timestamp of the spike — DB timeouts every time. The DB is the cause. You stop restarting the app and call the DBA. Problem solved permanently.
02
Error rate alert fires but CPU, disk, and memory are all healthy. You check logs at the alert timestamp — SOCKET_TIMEOUT to SBP gateway. No internal metric is spiking because the problem is external. You check SBP's status page, confirm they are degraded, and inform the bridge: our system is healthy, external dependency is down.
03
Manager asks: "Why did response time spike yesterday at 11 PM?" You pull metric snapshot — response time jumped from 200ms to 6.4 seconds at 23:00. You pull logs for 22:58–23:02 — a scheduled batch job started at 23:00 and ran a full table scan. Metric + log correlation = root cause in 4 minutes. Manager gets a clear answer.
04
Two alerts fire simultaneously — high CPU and high queue depth. Normally these look like two separate problems. You correlate the logs: consumer crashed at the same timestamp CPU started rising. One root cause, two alerts. Fix the consumer → CPU drops and queue drains. The correlation stopped you from treating them as two separate incidents.
✅ Week 6.5 · Day 7 Outcomes
- Explain what incident correlation is — looking at metrics and logs at the same timestamp to find cause-and-effect
- Understand what metrics tell you (the numbers changed) vs what logs tell you (what event caused the change)
- Know the rule — the log event just before a metric starts moving is your root cause candidate
- Read a correlated timeline and identify where the root cause event sits relative to metric changes
- Know 7 common correlation patterns — CPU spike + DB timeout logs, slow response + query timeout logs, error rate + external gateway logs
- Complete the lab — create both a metrics snapshot and an incident log, correlate them by timestamp, find the first warning sign, confirm the root cause event, count affected transactions, and write a complete correlated RCA with both metric and log evidence
- Understand why correlation prevents treating symptoms — without it you keep restarting the app; with it you fix the actual DB