L2 Support Engineer · Fintech · Week 6
Week 6 · Day 3 & Day 4

High CPU Scenario & Txn Timeout RCA

Day 3 — you learn to find what is eating your server's CPU and what to do about it. Day 4 — you learn to trace why transactions are timing out and find the delay in the chain.

Day 3 High CPU Scenario — Process Analysis Using top
01 The Simple Idea
Real-life Analogy

Think of your server's CPU like a chef in a kitchen. When one dish takes all the chef's attention — everything else in the kitchen slows down. Other dishes are left waiting, orders pile up, customers complain.

A high CPU spike is that one dish taking over. One process is consuming all the processor's time and starving everything else. Your job is to identify which process is the greedy one, understand why, and decide whether to wait, kill it, or escalate.

What causes a CPU spike?

Runaway process — a Java application, a batch job, or a database query that goes into an infinite loop or processes far more data than expected.

Sudden traffic surge — too many requests hitting the server at once. Each request uses CPU and together they overwhelm it.

Poorly optimised query — a database query doing a full table scan instead of using an index. On a large table this can pin the CPU for minutes.

Memory pressure — when RAM runs out, the OS starts swapping (using disk as RAM). Swap operations are CPU-heavy and cause the whole server to slow down.
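A quick way to confirm that last cause — a minimal sketch using standard tools, run on the affected server:

terminal — check for memory pressure (sketch)
# Is the server low on RAM and dipping into swap?
free -m

# Watch the si/so (swap-in/swap-out) columns — sustained
# non-zero values mean the OS is actively swapping
vmstat 1 3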

02 Reading the top Command — What Each Column Means
top — server process monitor (simulated)
top - 14:32:05 load average: 3.82, 2.41, 1.20
Tasks: 142 total, 3 running, 139 sleeping
%Cpu(s): 91.2 us, 2.1 sy, 0.0 ni, 6.7 id, 0.0 wa
MiB Mem: 8000 total, 620 free, 6800 used, 580 buff
PID     USER       %CPU   %MEM   COMMAND
18821 paymentapp 92.3  18.4  java PaymentProcessorService
1204  postgres   4.1   3.2   postgres: query worker
892   root       0.8   0.4   nginx: worker process
1055  root       0.2   0.1   sshd
📋 What Each top Column Tells You
Column | What it means | What to look for
PID | Process ID — the unique number of each running process | Note the PID of the heavy process — you need it to kill or investigate it
USER | Which user account is running this process | If it is running as root unexpectedly — suspicious
%CPU | How much of the CPU this process is using right now | Anything above 80% on a single process is your problem
%MEM | How much RAM this process is using | High CPU + high MEM together = serious resource hog
COMMAND | The name of the process | Tells you what application it is — java, python, nginx, postgres
load average | System load over the last 1, 5 and 15 minutes | Above 1.0 per CPU core = overloaded, e.g. 4.0 on a 4-core = 100% busy
id (idle %) | How much CPU is free / doing nothing | Below 10% idle means the server is critically overloaded
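To apply the load-per-core rule without mental math, here is a minimal sketch comparing /proc/loadavg to the core count:

terminal — load per core (sketch)
# Compare the 1-minute load average to the number of CPU cores
load=$(cut -d ' ' -f1 /proc/loadavg)
cores=$(nproc)
echo "load: $load, cores: $cores"
# Rule of thumb: load above the core count = overloaded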
03 High CPU Investigation — What to Do in Order

Step 1 — Confirm it is CPU and find the process

Run top and press P to sort by CPU. The guilty process is at the top. Note the PID and the process name. If it is a Java process — it is likely your payment application. If it is postgres — a DB query is doing it.

terminal — CPU investigation commands
# Run top and sort by CPU (press P after opening)
top
# then press P to sort highest CPU first

# Get a snapshot sorted by CPU — no interactive mode
ps aux --sort=-%cpu | head -10

# Inspect the suspect process — parent PID, CPU, memory, full command
ps -p 18821 -o pid,ppid,pcpu,pmem,cmd

# Per-thread view — find which java threads are running hot
ps -p 18821 -L -o pid,lwp,pcpu,cmd

# Check system load average
cat /proc/loadavg

# Check how long the process has been running
ps -p 18821 -o pid,etime,pcpu,cmd

Step 2 — Decide what to do with the process

Wait and monitor — if it is a legitimate batch job that will finish, let it run. Check if it should be running at this time. Watch if CPU is trending down.

Kill the process — if it is stuck in a loop, consuming CPU with no progress, and blocking everything else. Only do this with bridge approval on PROD.

Escalate to L3 — if you are not sure what the process is doing or if killing it might have side effects. Always better to escalate than to accidentally stop a critical service.

terminal — process control commands
# Graceful stop — tells the process to stop cleanly
kill -15 18821

# Force kill — only if graceful stop failed
kill -9 18821

# Watch CPU in real time — refresh every 2 seconds
watch -n 2 "ps aux --sort=-%cpu | head -5"

# Check if CPU came down after killing the process
top -bn1 | grep "Cpu(s)"
⚠️ Never use kill -9 on a database process. Forcing a DB to stop can corrupt data. Always use systemctl stop or kill -15 for DB processes, and always with DBA approval.
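If a database process genuinely has to be stopped, the clean route looks like this — a sketch assuming a systemd-managed PostgreSQL service named postgresql, and only ever with DBA approval:

terminal — stopping a DB service cleanly (sketch)
# Stop through systemd — lets the DB flush buffers and shut down safely
sudo systemctl stop postgresql

# Confirm it actually stopped
systemctl status postgresql --no-pager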
04 Day 3 Lab — Analyse a CPU Spike

🔬 Lab: Investigate a High CPU Scenario

Kali Linux · Real Commands
01
Create a CPU spike to investigate
This command deliberately uses CPU so you can practice investigating a real spike. It runs for 20 seconds then stops automatically.
terminal — create artificial CPU load
# Open a new terminal tab first, then run this
# This creates CPU load for 20 seconds then stops
timeout 20 yes > /dev/null &
echo "CPU load started. PID: $!"
→ A process starts consuming CPU. Note the PID printed. You now have a real CPU spike to investigate.
02
Find the heavy process using top and ps
In your original terminal — run these commands to find what is eating CPU.
terminal
# Snapshot — top 5 processes by CPU
ps aux --sort=-%cpu | head -6

# Check system load
cat /proc/loadavg

# Interactive top — press P to sort by CPU, q to quit
top
→ You will see the "yes" process at the top with high %CPU — that is the simulated heavy process. (Its PID differs from the one printed in step 1, which belongs to the timeout wrapper; killing the wrapper stops both.)
03
Investigate how long the process has been running
Take the PID from step 1. How long has this process been consuming CPU?
terminal — replace PID with the one from step 1
# How long has this process been running?
ps -p YOUR_PID -o pid,etime,pcpu,pmem,cmd

# Check what user is running it
ps -p YOUR_PID -o pid,user,pcpu,cmd
→ Shows run time, CPU%, and who is running it. On a real server this tells you if it's a legit process or something unexpected.
04
Make the decision — kill it or escalate
In this lab we know it is our test process — safe to kill. On PROD you would get bridge approval first.
terminal
# Graceful stop (replace with your PID)
kill -15 YOUR_PID

# Confirm CPU came back down
ps aux --sort=-%cpu | head -4
cat /proc/loadavg
→ CPU usage drops. Load average starts decreasing. Server returns to normal. ✅
05
Write the findings for Jira
Document what you found and what you did.
Jira entry
Issue : High CPU on Payment Server
Detected : 14:32 — CPU at 92%
Process : PID 18821 — java PaymentProcessorService
Run time : 47 minutes (abnormal — should complete in 5)
Action : Graceful kill with bridge approval at 14:35
Result : CPU returned to 8% within 30 seconds
Next steps : L3 to investigate why process ran for 47 mins
→ Jira RCA complete. ✅
Day 4 Transaction Timeout RCA — Latency & DB Query Analysis
05 The Simple Idea
Real-life Analogy

Think of a timeout like a restaurant where you order food and it never arrives. After 30 minutes you give up and leave. The restaurant didn't say your order was rejected — it just took too long and you lost patience.

A transaction timeout is the same. The system sent a request but waited too long for a reply and gave up. The root cause is always somewhere in the chain — the app was slow, the DB query took too long, the external API didn't respond in time, or the network had high latency.

What is a Timeout and why does it happen?

A timeout occurs when a system waits longer than the configured limit for a response. Instead of waiting forever, it gives up and marks the transaction as failed. Common timeout values: 5 seconds for APIs, 30 seconds for DB queries, 10 seconds for external gateways.
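You can watch a client-side timeout happen with curl — a minimal sketch; the URL is a placeholder:

terminal — reproduce a client-side timeout (sketch)
# Give up if the API does not answer within 5 seconds
curl --max-time 5 https://api.example.com/payments/status
# Exit code 28 means curl timed out waiting for the response
echo "exit code: $?"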

Common causes: A slow or locked database query. An external gateway (SBP, 1LINK) responding slowly. A network bottleneck between two servers. High CPU causing the app to process requests slowly. A full connection pool making new queries wait.

06 Understanding the Latency Chain

📊 Transaction Response Time — Where Time Is Spent

Normal API: 200 ms
Slow DB query: 2,500 ms
Slow gateway: 8,000 ms
Timeout: >30,000 ms

The Timeout Investigation Chain

Every timeout has a location. You trace it from the outside in — starting from what the client sees and working inward until you find where time is being lost.

Step 1: Check the log timestamp — how long between request received and timeout error? That is the total wait time.

Step 2: Was it an external timeout (SBP, 1LINK gateway) or an internal one (DB, app)? Check which system the timeout message names.

Step 3: If internal — check DB query times. A slow query is the most common cause of internal timeouts in fintech.
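In practice, steps 1 and 2 come down to a couple of grep commands — a sketch against a hypothetical app.log:

terminal — outside-in trace (sketch)
# Step 1 — total wait: first and last log lines for the transaction
grep "TXN-201" app.log | sed -n '1p;$p'

# Step 2 — which system does the timeout error actually name?
grep "TIMEOUT" app.log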

⏱️ Timeout Types — What Each One Tells You
Timeout Type | Where it happens | Root cause | Who to involve
DB_CONNECTION_TIMEOUT | App → Database | Pool exhausted or DB too slow | DBA
QUERY_TIMEOUT | Inside DB | Slow query — no index, full table scan | DBA
SOCKET_TIMEOUT | App → External (SBP/1LINK) | External system slow or down | Vendor
HTTP_TIMEOUT | Client → App API | App is slow to respond — high CPU or slow DB | L3 Dev
READ_TIMEOUT | Between internal services | Downstream service too slow | L3 Dev
LOCK_TIMEOUT | Inside DB | A long query holding a lock blocks others | DBA
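A quick way to see which row of this table you are dealing with — count the timeout types in the log (app.log is again a placeholder):

terminal — classify timeouts by type (sketch)
# Count each timeout type — tells you which team to involve first
grep -o "[A-Z_]*TIMEOUT" app.log | sort | uniq -c | sort -rn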
07 DB Query Analysis — Finding the Slow Query

How slow queries cause timeouts

A slow DB query locks a table or uses so much CPU that all other queries queue up behind it. The payment app is waiting for the DB to respond. If the wait exceeds the timeout limit — the transaction fails with QUERY_TIMEOUT even though the DB is technically running fine.

The fix is almost never to restart the DB. The fix is to identify the slow query, understand why it is slow, and either add an index, optimise the query, or kill the specific query that is causing the blockage.
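You can check the first part — whether a query uses an index — on the SQLite lab DB; a minimal sketch, with idx_txn_status as an example index name:

terminal — index check on the lab DB (sketch)
# Does this query scan the whole table or use an index?
sqlite3 ~/fintech_lab.db "EXPLAIN QUERY PLAN SELECT * FROM transactions WHERE status = 'PENDING';"
# A plain "SCAN" of the table = full table scan. An index fixes it:
sqlite3 ~/fintech_lab.db "CREATE INDEX IF NOT EXISTS idx_txn_status ON transactions(status);"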

SQL — find slow and long-running queries (PostgreSQL / SQLite)
-- PostgreSQL: Find all queries running longer than 30 seconds
SELECT pid, now() - pg_stat_activity.query_start AS duration,
       query, state
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;

-- Find queries waiting for locks
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- SQLite (your lab DB): Check stuck transactions
-- In SQLite: simulate by looking at PENDING rows
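When a single blocking query has to go, the DBA can cancel just that query instead of touching the server — a sketch using psql, with 12345 as a placeholder pid taken from pg_stat_activity:

terminal — cancel one query, not the server (sketch)
# Gentle: cancel the running query but keep the session alive
psql -c "SELECT pg_cancel_backend(12345);"

# Stronger: terminate the whole backend session if cancel is ignored
psql -c "SELECT pg_terminate_backend(12345);"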
08 Day 4 Lab — Investigate a Transaction Timeout

🔬 Lab: Trace and Root-Cause a Timeout

Kali Linux · Log + SQL
01
Create a simulated timeout log
This log shows a realistic timeout scenario with latency build-up visible in the timestamps.
terminal
cat > ~/timeout-scenario.log << 'EOF'
[14:00:00.100] [INFO ] TXN-201 received. Starting processing.
[14:00:00.110] [INFO ] TXN-201 validated. Sending to DB.
[14:00:00.115] [INFO ] DB query started: SELECT balance FROM accounts WHERE id=201
[14:00:04.820] [WARN ] DB query slow: 4705ms elapsed. Expected < 200ms.
[14:00:08.900] [WARN ] DB query still running: 8785ms. Lock contention suspected.
[14:00:15.200] [WARN ] DB response time: 15085ms. Approaching timeout limit.
[14:00:30.115] [ERROR] QUERY_TIMEOUT: DB query exceeded 30000ms limit. Killed.
[14:00:30.116] [ERROR] TXN-201 FAILED: Timeout waiting for DB response.
[14:01:00.100] [INFO ] TXN-202 received. Starting processing.
[14:01:00.120] [ERROR] DB_CONNECTION_TIMEOUT: All connections busy. TXN-202 FAILED.
[14:01:30.100] [INFO ] TXN-203 received. Starting processing.
[14:01:30.115] [INFO ] TXN-203 validated. Sending to DB.
[14:01:30.320] [INFO ] TXN-203 SUCCESS. Response time: 205ms.
EOF
→ Timeout log created. 3 transactions — 2 failed, 1 succeeded after DB recovered.
02
Step 1 — Identify the total wait time
How long did the transaction wait before timing out? Find start and end timestamps.
terminal
# Find when TXN-201 started and when it failed
grep "TXN-201" ~/timeout-scenario.log

# Find all WARN lines — where did slowness start?
grep "WARN\|slow\|contention" ~/timeout-scenario.log

# Find the exact timeout error
grep "TIMEOUT" ~/timeout-scenario.log
→ TXN-201 started at 14:00:00, timed out at 14:00:30. Total wait: 30 seconds. DB query was the bottleneck.
03
Step 2 — Identify the type of timeout
Was it a DB timeout, a gateway timeout, or an API timeout?
terminal
# What type of timeout was it?
grep "TIMEOUT" ~/timeout-scenario.log | awk '{print $3}'

# How many transactions were affected?
grep -c "FAILED" ~/timeout-scenario.log

# Did any succeed after the timeout?
grep "SUCCESS" ~/timeout-scenario.log
→ QUERY_TIMEOUT = DB query was the cause. 2 transactions failed. 1 succeeded after recovery — confirms DB resolved itself.
04
Step 3 — Simulate checking for slow queries in SQLite
On a real server you query pg_stat_activity. Here we simulate with your SQLite lab database.
terminal — SQLite simulation
sqlite3 ~/fintech_lab.db << 'EOF'
-- Simulate finding stuck/slow transactions
SELECT txn_id, status, amount
FROM transactions
WHERE status = 'PENDING'
ORDER BY txn_id;

-- Simulate response time analysis by status
SELECT status, COUNT(*) as count, SUM(amount) as total
FROM transactions
GROUP BY status;
EOF
→ PENDING transactions = simulated stuck/slow queries. This shows how many are waiting and their total value at risk.
05
Write the Timeout RCA for Jira
Put all findings together into a clear RCA entry.
Jira RCA — Timeout
Incident : Transaction Timeout — TXN-201, TXN-202
Start Time : 14:00:00
Timeout At : 14:00:30 (30 second limit exceeded)
Type : QUERY_TIMEOUT — DB query exceeded limit
Root Cause : DB query on accounts table ran ~150x slower
than the expected <200ms before hitting the 30s
limit. WARN logs showed lock contention
at 14:00:04. A long-running query was holding
a lock and blocking all other queries behind it.
Impact : 2 transactions failed. TXN-203 succeeded after
lock was released — confirms self-recovery.
Action : DBA identified and killed the blocking query.
Next Steps : DBA to add index on accounts.id column.
Review query timeout limits with L3 team.
→ Complete, professional timeout RCA ready for Jira. ✅
09 Real L2 Scenarios
01

Monitoring shows CPU at 97% for 10 minutes. You run ps aux --sort=-%cpu | head -5 — a java batch job is at 96%. It has been running 3 hours — normally takes 20 minutes. Stuck in a loop. You announce on bridge, get approval, kill -15 the PID. CPU drops to 8% in 30 seconds.

02

Transactions are timing out but CPU is normal and DB is running. You check the log — SOCKET_TIMEOUT pointing to SBP gateway. It is not your system at all. You check SBP's status page — they are experiencing issues. You inform the client and the bridge: "External timeout — SBP gateway slow. Our system is healthy."

03

All IBFT transactions are timing out after exactly 10 seconds. You check the log — it is always exactly 10s, which means you are hitting the configured API timeout, not random latency. The DB query runs in 9.8 seconds every time, leaving the rest of the chain no headroom, so the total response crosses the 10s limit. Fix: DBA adds a missing index — the query drops to 80ms and the timeouts stop immediately.

04

CPU is at 85% but no single process stands out — all at 15–20% each. This is a traffic surge — too many requests simultaneously. Each is using its fair share but together they overwhelm the server. The fix is to add more capacity or implement rate limiting — not kill any process.
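A quick confirmation of this pattern — sum %CPU across every process; a high total with no single hog means a surge, not a runaway:

terminal — surge vs single hog (sketch)
# Sum %CPU across all processes (skip the header line)
ps aux | awk 'NR>1 {sum+=$3} END {print "total %CPU:", sum}'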

✅ Week 6 · Day 3 & 4 Outcomes