L2 Support Engineer · Fintech · Week 6
Week 6 · Day 3 & Day 4

High CPU Scenario & Txn Timeout RCA

Day 3 — you learn to find what is eating your server's CPU and what to do about it. Day 4 — you learn to trace why transactions are timing out and find the delay in the chain.

Day 3 High CPU Scenario — Process Analysis Using top
01 The Simple Idea
Real-life Analogy

Think of your server's CPU like a chef in a kitchen. When one dish takes all the chef's attention — everything else in the kitchen slows down. Other dishes are left waiting, orders pile up, customers complain.

A high CPU spike is that one dish taking over. One process is consuming all the processor's time and starving everything else. Your job is to identify which process is the greedy one, understand why, and decide whether to wait, kill it, or escalate.

What causes a CPU spike?

Runaway process — a Java application, a batch job, or a database query that goes into an infinite loop or processes far more data than expected.

Sudden traffic surge — too many requests hitting the server at once. Each request uses CPU and together they overwhelm it.

Poorly optimised query — a database query doing a full table scan instead of using an index. On a large table this can pin the CPU for minutes.

Memory pressure — when RAM runs out, the OS starts swapping (using disk as RAM). Swap operations are CPU-heavy and cause the whole server to slow down.
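A quick way to confirm that last cause — a minimal sketch using standard tools, run on the affected server:

terminal — check for memory pressure (sketch)
# Is the server low on RAM and dipping into swap?
free -m

# Watch the si/so (swap-in/swap-out) columns — sustained
# non-zero values mean the OS is actively swapping
vmstat 1 3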

02 Reading the top Command — What Each Column Means
top — server process monitor (simulated)
top - 14:32:05 load average: 3.82, 2.41, 1.20
Tasks: 142 total, 3 running, 139 sleeping
%Cpu(s): 91.2 us, 2.1 sy, 0.0 ni, 6.7 id, 0.0 wa
MiB Mem: 8000 total, 620 free, 6800 used, 580 buff
PID     USER       %CPU   %MEM   COMMAND
18821 paymentapp 92.3  18.4  java PaymentProcessorService
1204  postgres   4.1   3.2   postgres: query worker
892   root       0.8   0.4   nginx: worker process
1055  root       0.2   0.1   sshd
📋 What Each top Column Tells You
Column | What it means | What to look for
PID | Process ID — the unique number of each running process | Note the PID of the heavy process — you need it to kill or investigate it
USER | Which user account is running this process | If it is running as root unexpectedly — suspicious
%CPU | How much of the CPU this process is using right now | Anything above 80% on a single process is your problem
%MEM | How much RAM this process is using | High CPU + high MEM together = serious resource hog
COMMAND | The name of the process | Tells you what application it is — java, python, nginx, postgres
load average | System load over the last 1, 5 and 15 minutes | Above 1.0 per CPU core = overloaded, e.g. 4.0 on a 4-core = 100% busy
id (idle %) | How much CPU is free / doing nothing | Below 10% idle means the server is critically overloaded
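To apply the load-per-core rule without mental math, here is a minimal sketch comparing /proc/loadavg to the core count:

terminal — load per core (sketch)
# Compare the 1-minute load average to the number of CPU cores
load=$(cut -d ' ' -f1 /proc/loadavg)
cores=$(nproc)
echo "load: $load, cores: $cores"
# Rule of thumb: load above the core count = overloaded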
03 High CPU Investigation — What to Do in Order

Step 1 — Confirm it is CPU and find the process

Run top and press P to sort by CPU. The guilty process is at the top. Note the PID and the process name. If it is a Java process — it is likely your payment application. If it is postgres — a DB query is doing it.

terminal — CPU investigation commands
# Run top and sort by CPU (press P after opening)
top
# then press P to sort highest CPU first

# Get a snapshot sorted by CPU — no interactive mode
ps aux --sort=-%cpu | head -10

# Inspect the suspect process — parent PID, CPU, memory, full command
ps -p 18821 -o pid,ppid,pcpu,pmem,cmd

# Per-thread view — find which java threads are running hot
ps -p 18821 -L -o pid,lwp,pcpu,cmd

# Check system load average
cat /proc/loadavg

# Check how long the process has been running
ps -p 18821 -o pid,etime,pcpu,cmd

Step 2 — Decide what to do with the process

Wait and monitor — if it is a legitimate batch job that will finish, let it run. Check if it should be running at this time. Watch if CPU is trending down.

Kill the process — if it is stuck in a loop, consuming CPU with no progress, and blocking everything else. Only do this with bridge approval on PROD.

Escalate to L3 — if you are not sure what the process is doing or if killing it might have side effects. Always better to escalate than to accidentally stop a critical service.

terminal — process control commands
# Graceful stop — tells the process to stop cleanly
kill -15 18821

# Force kill — only if graceful stop failed
kill -9 18821

# Watch CPU in real time — refresh every 2 seconds
watch -n 2 "ps aux --sort=-%cpu | head -5"

# Check if CPU came down after killing the process
top -bn1 | grep "Cpu(s)"
⚠️ Never use kill -9 on a database process. Forcing a DB to stop can corrupt data. Always use systemctl stop or kill -15 for DB processes, and always with DBA approval.
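If a database process genuinely has to be stopped, the clean route looks like this — a sketch assuming a systemd-managed PostgreSQL service named postgresql, and only ever with DBA approval:

terminal — stopping a DB service cleanly (sketch)
# Stop through systemd — lets the DB flush buffers and shut down safely
sudo systemctl stop postgresql

# Confirm it actually stopped
systemctl status postgresql --no-pager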
04 Day 3 Lab — Analyse a CPU Spike

🔬 Lab: Investigate a High CPU Scenario

Kali Linux · Real Commands
01
Create a CPU spike to investigate
This command deliberately uses CPU so you can practice investigating a real spike. It runs for 20 seconds then stops automatically.
terminal — create artificial CPU load
# Open a new terminal tab first, then run this
# This creates CPU load for 20 seconds then stops
timeout 20 yes > /dev/null &
echo "CPU load started. PID: $!"
→ A process starts consuming CPU. Note the PID printed. You now have a real CPU spike to investigate.
02
Find the heavy process using top and ps
In your original terminal — run these commands to find what is eating CPU.
terminal
# Snapshot — top 5 processes by CPU
ps aux --sort=-%cpu | head -6

# Check system load
cat /proc/loadavg

# Interactive top — press P to sort by CPU, q to quit
top
→ You will see the "yes" process at the top with high %CPU — that is the simulated heavy process. (Its PID differs from the one printed in step 1, which belongs to the timeout wrapper; killing the wrapper stops both.)
03
Investigate how long the process has been running
Take the PID from step 1. How long has this process been consuming CPU?
terminal — replace PID with the one from step 1
# How long has this process been running?
ps -p YOUR_PID -o pid,etime,pcpu,pmem,cmd

# Check what user is running it
ps -p YOUR_PID -o pid,user,pcpu,cmd
→ Shows run time, CPU%, and who is running it. On a real server this tells you if it's a legit process or something unexpected.
04
Make the decision — kill it or escalate
In this lab we know it is our test process — safe to kill. On PROD you would get bridge approval first.
terminal
# Graceful stop (replace with your PID)
kill -15 YOUR_PID

# Confirm CPU came back down
ps aux --sort=-%cpu | head -4
cat /proc/loadavg
→ CPU usage drops. Load average starts decreasing. Server returns to normal. ✅
05
Write the findings for Jira
Document what you found and what you did.
Jira entry
Issue : High CPU on Payment Server
Detected : 14:32 — CPU at 92%
Process : PID 18821 — java PaymentProcessorService
Run time : 47 minutes (abnormal — should complete in 5)
Action : Graceful kill with bridge approval at 14:35
Result : CPU returned to 8% within 30 seconds
Next steps : L3 to investigate why process ran for 47 mins
→ Jira RCA complete. ✅
Day 4 Transaction Timeout RCA — Latency & DB Query Analysis
05 The Simple Idea
Real-life Analogy

Think of a timeout like a restaurant where you order food and it never arrives. After 30 minutes you give up and leave. The restaurant didn't say your order was rejected — it just took too long and you lost patience.

A transaction timeout is the same. The system sent a request but waited too long for a reply and gave up. The root cause is always somewhere in the chain — the app was slow, the DB query took too long, the external API didn't respond in time, or the network had high latency.

What is a Timeout and why does it happen?

A timeout occurs when a system waits longer than the configured limit for a response. Instead of waiting forever, it gives up and marks the transaction as failed. Common timeout values: 5 seconds for APIs, 30 seconds for DB queries, 10 seconds for external gateways.
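You can watch a client-side timeout happen with curl — a minimal sketch; the URL is a placeholder:

terminal — reproduce a client-side timeout (sketch)
# Give up if the API does not answer within 5 seconds
curl --max-time 5 https://api.example.com/payments/status
# Exit code 28 means curl timed out waiting for the response
echo "exit code: $?"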

Common causes: A slow or locked database query. An external gateway (SBP, 1LINK) responding slowly. A network bottleneck between two servers. High CPU causing the app to process requests slowly. A full connection pool making new queries wait.

06 Understanding the Latency Chain

📊 Transaction Response Time — Where Time Is Spent

Normal API: 200 ms
Slow DB query: 2,500 ms
Slow gateway: 8,000 ms
Timeout: >30,000 ms

The Timeout Investigation Chain

Every timeout has a location. You trace it from the outside in — starting from what the client sees and working inward until you find where time is being lost.

Step 1: Check the log timestamp — how long between request received and timeout error? That is the total wait time.

Step 2: Was it an external timeout (SBP, 1LINK gateway) or an internal one (DB, app)? Check which system the timeout message names.

Step 3: If internal — check DB query times. A slow query is the most common cause of internal timeouts in fintech.
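In practice, steps 1 and 2 come down to a couple of grep commands — a sketch against a hypothetical app.log:

terminal — outside-in trace (sketch)
# Step 1 — total wait: first and last log lines for the transaction
grep "TXN-201" app.log | sed -n '1p;$p'

# Step 2 — which system does the timeout error actually name?
grep "TIMEOUT" app.log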

⏱️ Timeout Types — What Each One Tells You
Timeout Type | Where it happens | Root cause | Who to involve
DB_CONNECTION_TIMEOUT | App → Database | Pool exhausted or DB too slow | DBA
QUERY_TIMEOUT | Inside DB | Slow query — no index, full table scan | DBA
SOCKET_TIMEOUT | App → External (SBP/1LINK) | External system slow or down | Vendor
HTTP_TIMEOUT | Client → App API | App is slow to respond — high CPU or slow DB | L3 Dev
READ_TIMEOUT | Between internal services | Downstream service too slow | L3 Dev
LOCK_TIMEOUT | Inside DB | A long query holding a lock blocks others | DBA
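A quick way to see which row of this table you are dealing with — count the timeout types in the log (app.log is again a placeholder):

terminal — classify timeouts by type (sketch)
# Count each timeout type — tells you which team to involve first
grep -o "[A-Z_]*TIMEOUT" app.log | sort | uniq -c | sort -rn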
07 DB Query Analysis — Finding the Slow Query

How slow queries cause timeouts

A slow DB query locks a table or uses so much CPU that all other queries queue up behind it. The payment app is waiting for the DB to respond. If the wait exceeds the timeout limit — the transaction fails with QUERY_TIMEOUT even though the DB is technically running fine.

The fix is almost never to restart the DB. The fix is to identify the slow query, understand why it is slow, and either add an index, optimise the query, or kill the specific query that is causing the blockage.
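You can check the first part — whether a query uses an index — on the SQLite lab DB; a minimal sketch, with idx_txn_status as an example index name:

terminal — index check on the lab DB (sketch)
# Does this query scan the whole table or use an index?
sqlite3 ~/fintech_lab.db "EXPLAIN QUERY PLAN SELECT * FROM transactions WHERE status = 'PENDING';"
# A plain "SCAN" of the table = full table scan. An index fixes it:
sqlite3 ~/fintech_lab.db "CREATE INDEX IF NOT EXISTS idx_txn_status ON transactions(status);"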

SQL — find slow and long-running queries (PostgreSQL / SQLite)
-- PostgreSQL: Find all queries running longer than 30 seconds
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Find queries waiting for locks
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- SQLite (your lab DB): Check stuck transactions
-- In SQLite: simulate by looking at PENDING rows
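When a single blocking query has to go, the DBA can cancel just that query instead of touching the server — a sketch using psql, with 12345 as a placeholder pid taken from pg_stat_activity:

terminal — cancel one query, not the server (sketch)
# Gentle: cancel the running query but keep the session alive
psql -c "SELECT pg_cancel_backend(12345);"

# Stronger: terminate the whole backend session if cancel is ignored
psql -c "SELECT pg_terminate_backend(12345);"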
08 Day 4 Lab — Investigate a Transaction Timeout

🔬 Lab: Trace and Root-Cause a Timeout

Kali Linux · Log + SQL
01
Create a simulated timeout log
This log shows a realistic timeout scenario with latency build-up visible in the timestamps.
terminal
cat > ~/timeout-scenario.log << 'EOF'
[14:00:00.100] [INFO ] TXN-201 received. Starting processing.
[14:00:00.110] [INFO ] TXN-201 validated. Sending to DB.
[14:00:00.115] [INFO ] DB query started: SELECT balance FROM accounts WHERE id=201
[14:00:04.820] [WARN ] DB query slow: 4705ms elapsed. Expected < 200ms.
[14:00:08.900] [WARN ] DB query still running: 8785ms. Lock contention suspected.
[14:00:15.200] [WARN ] DB response time: 15085ms. Approaching timeout limit.
[14:00:30.115] [ERROR] QUERY_TIMEOUT: DB query exceeded 30000ms limit. Killed.
[14:00:30.116] [ERROR] TXN-201 FAILED: Timeout waiting for DB response.
[14:01:00.100] [INFO ] TXN-202 received. Starting processing.
[14:01:00.120] [ERROR] DB_CONNECTION_TIMEOUT: All connections busy. TXN-202 FAILED.
[14:01:30.100] [INFO ] TXN-203 received. Starting processing.
[14:01:30.115] [INFO ] TXN-203 validated. Sending to DB.
[14:01:30.320] [INFO ] TXN-203 SUCCESS. Response time: 205ms.
EOF
→ Timeout log created. 3 transactions — 2 failed, 1 succeeded after DB recovered.
02
Step 1 — Identify the total wait time
How long did the transaction wait before timing out? Find start and end timestamps.
terminal
# Find when TXN-201 started and when it failed
grep "TXN-201" ~/timeout-scenario.log

# Find all WARN lines — where did slowness start?
grep "WARN\|slow\|contention" ~/timeout-scenario.log

# Find the exact timeout error
grep "TIMEOUT" ~/timeout-scenario.log
→ TXN-201 started at 14:00:00, timed out at 14:00:30. Total wait: 30 seconds. DB query was the bottleneck.
03
Step 2 — Identify the type of timeout
Was it a DB timeout, a gateway timeout, or an API timeout?
terminal
# What type of timeout was it?
grep "TIMEOUT" ~/timeout-scenario.log | awk '{print $3}'

# How many transactions were affected?
grep -c "FAILED" ~/timeout-scenario.log

# Did any succeed after the timeout?
grep "SUCCESS" ~/timeout-scenario.log
→ QUERY_TIMEOUT = DB query was the cause. 2 transactions failed. 1 succeeded after recovery — confirms DB resolved itself.
04
Step 3 — Simulate checking for slow queries in SQLite
On a real server you query pg_stat_activity. Here we simulate with your SQLite lab database.
terminal — SQLite simulation
sqlite3 ~/fintech_lab.db << 'EOF'
-- Simulate finding stuck/slow transactions
SELECT txn_id, status, amount
FROM transactions
WHERE status = 'PENDING'
ORDER BY txn_id;

-- Simulate response time analysis by status
SELECT status, COUNT(*) as count, SUM(amount) as total
FROM transactions
GROUP BY status;
EOF
→ PENDING transactions = simulated stuck/slow queries. This shows how many are waiting and their total value at risk.
05
Write the Timeout RCA for Jira
Put all findings together into a clear RCA entry.
Jira RCA — Timeout
Incident : Transaction Timeout — TXN-201, TXN-202
Start Time : 14:00:00
Timeout At : 14:00:30 (30 second limit exceeded)
Type : QUERY_TIMEOUT — DB query exceeded limit
Root Cause : DB query on accounts table ran ~150x slower
than the expected <200ms before hitting the 30s
limit. WARN logs showed lock contention
at 14:00:04. A long-running query was holding
a lock and blocking all other queries behind it.
Impact : 2 transactions failed. TXN-203 succeeded after
lock was released — confirms self-recovery.
Action : DBA identified and killed the blocking query.
Next Steps : DBA to add index on accounts.id column.
Review query timeout limits with L3 team.
→ Complete, professional timeout RCA ready for Jira. ✅
09 Real L2 Scenarios
01

Monitoring shows CPU at 97% for 10 minutes. You run ps aux --sort=-%cpu | head -5 — a java batch job is at 96%. It has been running 3 hours — normally takes 20 minutes. Stuck in a loop. You announce on bridge, get approval, kill -15 the PID. CPU drops to 8% in 30 seconds.

02

Transactions are timing out but CPU is normal and DB is running. You check the log — SOCKET_TIMEOUT pointing to SBP gateway. It is not your system at all. You check SBP's status page — they are experiencing issues. You inform the client and the bridge: "External timeout — SBP gateway slow. Our system is healthy."

03

All IBFT transactions are timing out after exactly 10 seconds. You check the log — it is always exactly 10s, which means you are hitting the configured API timeout, not random latency. The DB query runs in 9.8 seconds every time, leaving the rest of the chain no headroom, so the total response crosses the 10s limit. Fix: DBA adds a missing index — the query drops to 80ms and the timeouts stop immediately.

04

CPU is at 85% but no single process stands out — all at 15–20% each. This is a traffic surge — too many requests simultaneously. Each is using its fair share but together they overwhelm the server. The fix is to add more capacity or implement rate limiting — not kill any process.
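A quick confirmation of this pattern — sum %CPU across every process; a high total with no single hog means a surge, not a runaway:

terminal — surge vs single hog (sketch)
# Sum %CPU across all processes (skip the header line)
ps aux | awk 'NR>1 {sum+=$3} END {print "total %CPU:", sum}'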

✅ Week 6 · Day 3 & 4 Outcomes