L2 Support Engineer · Fintech · Week 6
Week 6 · Day 3 & Day 4
High CPU Scenario & Txn Timeout RCA
Day 3 — you learn to find what is eating your server's CPU and what to do about it. Day 4 — you learn to trace why transactions are timing out and find the delay in the chain.
Day 3 — High CPU
Day 4 — Timeout RCA
top / ps aux
Latency & DB Query Analysis
Day 3
High CPU Scenario — Process Analysis Using top
01 The Simple Idea
Real-life Analogy
Think of your server's CPU like a chef in a kitchen. When one dish takes all the chef's attention — everything else in the kitchen slows down. Other dishes are left waiting, orders pile up, customers complain.
A high CPU spike is that one dish taking over. One process is consuming all the processor's time and starving everything else. Your job is to identify which process is the greedy one, understand why, and decide whether to wait, kill it, or escalate.
What causes a CPU spike?
Runaway process — a Java application, a batch job, or a database query that goes into an infinite loop or processes far more data than expected.
Sudden traffic surge — too many requests hitting the server at once. Each request uses CPU and together they overwhelm it.
Poorly optimised query — a database query doing a full table scan instead of using an index. On a large table this can pin the CPU for minutes.
Memory pressure — when RAM runs out, the OS starts swapping (using disk as RAM). Swap operations are CPU-heavy and cause the whole server to slow down.
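The memory-pressure case can be confirmed from the shell before you blame any single process. A minimal sketch that reports swap usage from `/proc/meminfo` (Linux only; the optional file argument is just for testing):

```shell
#!/bin/sh
# check_swap.sh — report swap usage from a meminfo file (default: /proc/meminfo).
# Heavy swap usage alongside high CPU points at memory pressure, not one runaway process.
MEMINFO="${1:-/proc/meminfo}"
awk '/^SwapTotal:/ {total=$2}
     /^SwapFree:/  {free=$2}
     END {printf "swap used: %d kB of %d kB\n", total-free, total}' "$MEMINFO"
```

If swap used is high and climbing, killing the top process only buys time — the real fix is freeing or adding RAM.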
02 Reading the top Command — What Each Column Means
top — server process monitor (simulated)
  PID  USER        %CPU  %MEM  COMMAND
18821  paymentapp  92.3  18.4  java PaymentProcessorService
 1204  postgres     4.1   3.2  postgres: query worker
  892  root         0.8   0.4  nginx: worker process
 1055  root         0.2   0.1  sshd
📋 What Each top Column Tells You
| Column | What it means | What to look for |
| PID | Process ID — the unique number of each running process | Note the PID of the heavy process — you need it to kill or investigate it |
| USER | Which user account is running this process | If it is running as root unexpectedly — suspicious |
| %CPU | How much of the CPU this process is using right now | Anything above 80% on a single process is your problem |
| %MEM | How much RAM this process is using | High CPU + high MEM together = serious resource hog |
| COMMAND | The name of the process | Tells you what application it is — java, python, nginx, postgres |
| load average | System load over last 1min, 5min, 15min | Above 1.0 per CPU core = overloaded. e.g. 4.0 on a 4-core = 100% busy |
| id (idle %) | How much CPU is free / doing nothing | Below 10% idle means the server is critically overloaded |
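The "load per core" rule from the table can be checked with a one-liner. A sketch, assuming a Linux box with `nproc` available:

```shell
#!/bin/sh
# Compare the 1-minute load average against the number of CPU cores.
# A ratio above 1.00 means more runnable tasks than cores — the box is overloaded.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
ratio=$(awk -v l="$load1" -v c="$cores" 'BEGIN {printf "%.2f", l/c}')
echo "load: $load1  cores: $cores  load/core: $ratio"
```

This is the same arithmetic as the table's example: a load of 4.0 on a 4-core box gives a ratio of 1.00, i.e. 100% busy.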
03 High CPU Investigation — What to Do in Order
Step 1 — Confirm it is CPU and find the process
Run top and press P to sort by CPU. The guilty process is at the top. Note the PID and the process name. If it is a Java process — it is likely your payment application. If it is postgres — a DB query is doing it.
terminal — CPU investigation commands
# Run top and sort by CPU (press P after opening)
top
# then press P to sort highest CPU first
# Get a snapshot sorted by CPU — no interactive mode
ps aux --sort=-%cpu | head -10
# Find exactly which java threads are running hot
ps -p 18821 -o pid,ppid,pcpu,pmem,cmd
# Check system load average
cat /proc/loadavg
# Check how long the process has been running
ps -p 18821 -o pid,etime,pcpu,cmd
Step 2 — Decide what to do with the process
Wait and monitor — if it is a legitimate batch job that will finish, let it run. Check if it should be running at this time. Watch if CPU is trending down.
Kill the process — if it is stuck in a loop, consuming CPU with no progress, and blocking everything else. Only do this with bridge approval on PROD.
Escalate to L3 — if you are not sure what the process is doing or if killing it might have side effects. Always better to escalate than to accidentally stop a critical service.
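The run-time signal behind the wait/kill/escalate decision can be scripted. `ps` prints elapsed time (`etime`) as `[[dd-]hh:]mm:ss`; a sketch that converts it to seconds so it can be compared against the job's normal duration (the 20-minute threshold below is an illustrative assumption):

```shell
#!/bin/sh
# etime_to_seconds: convert ps etime format ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
  echo "$1" | awk -F'[-:]' '{
    if (NF == 4)      s = $1*86400 + $2*3600 + $3*60 + $4   # dd-hh:mm:ss
    else if (NF == 3) s = $1*3600 + $2*60 + $3              # hh:mm:ss
    else              s = $1*60 + $2                        # mm:ss
    print s
  }'
}
# Example: a process that has run 3h 12m 45s against an assumed 20-minute norm.
secs=$(etime_to_seconds "03:12:45")
echo "running for $secs seconds"
[ "$secs" -gt 1200 ] && echo "ABNORMAL: exceeds expected 20-minute window"
```

On a live box you would feed it the real value: `etime_to_seconds "$(ps -p 18821 -o etime=)"`.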
terminal — process control commands
# Graceful stop — tells the process to stop cleanly
kill -15 18821
# Force kill — only if graceful stop failed
kill -9 18821
# Watch CPU in real time — refresh every 2 seconds
watch -n 2 "ps aux --sort=-%cpu | head -5"
# Check if CPU came down after killing the process
top -bn1 | grep "Cpu(s)"
⚠️ Never use kill -9 on a database process. Forcing a DB to stop can corrupt data. Always use systemctl stop or kill -15 for DB processes, and always with DBA approval.
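That rule can be enforced with a small guard instead of relying on memory. A sketch that refuses to force-kill anything that looks like a database process — the process-name patterns are assumptions, so extend them for your environment:

```shell
#!/bin/sh
# safe_kill: refuse kill -9 for database processes; they need kill -15 / systemctl
# with DBA approval. Everything else gets the force kill.
safe_kill() {
  pid="$1"
  cmd=$(ps -p "$pid" -o comm= 2>/dev/null)
  case "$cmd" in
    postgres*|mysqld*|mariadbd*|mongod*)
      echo "REFUSED: $cmd (PID $pid) is a database — use kill -15 with DBA approval"
      return 1 ;;
    *)
      kill -9 "$pid" ;;
  esac
}
# Demo: spawn a harmless sleep and force-kill it through the guard.
sleep 60 &
safe_kill $! && echo "test process killed"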
04 Day 3 Lab — Analyse a CPU Spike
🔬 Lab: Investigate a High CPU Scenario
Kali Linux · Real Commands
Create a CPU spike to investigate
This command deliberately uses CPU so you can practice investigating a real spike. It runs for 20 seconds then stops automatically.
terminal — create artificial CPU load
# Open a new terminal tab first, then run this
# This creates CPU load for 20 seconds then stops
timeout 20 yes > /dev/null &
echo "CPU load started. PID: $!"
→ A process starts consuming CPU. Note the PID printed. You now have a real CPU spike to investigate.
Find the heavy process using top and ps
In your original terminal — run these commands to find what is eating CPU.
terminal
# Snapshot — top 5 processes by CPU
ps aux --sort=-%cpu | head -6
# Check system load
cat /proc/loadavg
# Interactive top — press P to sort by CPU, q to quit
top
→ You will see the yes process at the top with high %CPU. That is the simulated heavy process.
Investigate how long the process has been running
Check the PID from step 1. How long has this process been consuming CPU?
terminal — replace PID with the one from step 1
# How long has this process been running?
ps -p YOUR_PID -o pid,etime,pcpu,pmem,cmd
# Check what user is running it
ps -p YOUR_PID -o pid,user,pcpu,cmd
→ Shows run time, CPU%, and who is running it. On a real server this tells you if it's a legit process or something unexpected.
Make the decision — kill it or escalate
In this lab we know it is our test process — safe to kill. On PROD you would get bridge approval first.
terminal
# Graceful stop (replace with your PID)
kill -15 YOUR_PID
# Confirm CPU came back down
ps aux --sort=-%cpu | head -4
cat /proc/loadavg
→ CPU usage drops. Load average starts decreasing. Server returns to normal. ✅
Write the findings for Jira
Document what you found and what you did.
Jira entry
Issue : High CPU on Payment Server
Detected : 14:32 — CPU at 92%
Process : PID 18821 — java PaymentProcessorService
Run time : 47 minutes (abnormal — should complete in 5)
Action : Graceful kill with bridge approval at 14:35
Result : CPU returned to 8% within 30 seconds
Next steps : L3 to investigate why process ran for 47 mins
→ Jira RCA complete. ✅
Day 4
Transaction Timeout RCA — Latency & DB Query Analysis
05 The Simple Idea
Real-life Analogy
Think of a timeout like a restaurant where you order food and it never arrives. After 30 minutes you give up and leave. The restaurant didn't say your order was rejected — it just took too long and you lost patience.
A transaction timeout is the same. The system sent a request but waited too long for a reply and gave up. The root cause is always somewhere in the chain — the app was slow, the DB query took too long, the external API didn't respond in time, or the network had high latency.
What is a Timeout and why does it happen?
A timeout occurs when a system waits longer than the configured limit for a response. Instead of waiting forever, it gives up and marks the transaction as failed. Common timeout values: 5 seconds for APIs, 30 seconds for DB queries, 10 seconds for external gateways.
Common causes: A slow or locked database query. An external gateway (SBP, 1LINK) responding slowly. A network bottleneck between two servers. High CPU causing the app to process requests slowly. A full connection pool making new queries wait.
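The give-up behaviour is easy to demonstrate with the coreutils `timeout` command (the same tool the Day 3 lab uses to create load). A minimal sketch — a "service" that takes 5 seconds against a "client" that only waits 2:

```shell
#!/bin/sh
# coreutils timeout exits with status 124 when the time limit is exceeded.
timeout 2 sleep 5
status=$?
if [ "$status" -eq 124 ]; then
  echo "TIMED OUT after 2s — the service was still working, we just stopped waiting"
else
  echo "completed with status $status"
fi
```

Note the restaurant-analogy point in code: the `sleep` was not broken, the caller simply gave up — exactly what a timeout error means.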
06 Understanding the Latency Chain
📊 Transaction Response Time — Where Time Is Spent
| Stage | Time spent |
| Slow DB query | 2,500ms (2.5s) |
| Slow gateway | 8,000ms (8s) |
The Timeout Investigation Chain
Every timeout has a location. You trace it from the outside in — starting from what the client sees and working inward until you find where time is being lost.
Step 1: Check the log timestamp — how long between request received and timeout error? That is the total wait time.
Step 2: Was it an external timeout (SBP, 1LINK gateway) or an internal one (DB, app)? Check which system the timeout message names.
Step 3: If internal — check DB query times. A slow query is the most common cause of internal timeouts in fintech.
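Step 1 — computing the total wait from the log timestamps — can be done with awk rather than by hand. A sketch against the `[HH:MM:SS.mmm]` format used in the Day 4 lab log:

```shell
#!/bin/sh
# elapsed_ms: milliseconds between the first and last log line matching a pattern.
# Expects lines starting with [HH:MM:SS.mmm], as in the lab log.
elapsed_ms() {
  grep "$1" "$2" | awk '{
    gsub(/[\[\]]/, "", $1)                           # strip the brackets
    split($1, t, /[:.]/)                             # hh, mm, ss, mmm
    ms = ((t[1]*3600 + t[2]*60 + t[3]) * 1000) + t[4]
    if (NR == 1) first = ms
    last = ms
  } END { print last - first }'
}
# Demo against two sample lines shaped like the lab log:
cat > /tmp/demo.log << 'EOF'
[14:00:00.100] [INFO ] TXN-201 received. Starting processing.
[14:00:30.116] [ERROR] TXN-201 FAILED: Timeout waiting for DB response.
EOF
elapsed_ms "TXN-201" /tmp/demo.log   # prints the wait in ms (30016 here)
```

The same call against the real lab file would be `elapsed_ms "TXN-201" ~/timeout-scenario.log`.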
⏱️ Timeout Types — What Each One Tells You
| Timeout Type | Where it happens | Root cause | Who to involve |
| DB_CONNECTION_TIMEOUT | App → Database | Pool exhausted or DB too slow | DBA |
| QUERY_TIMEOUT | Inside DB | Slow query — no index, full table scan | DBA |
| SOCKET_TIMEOUT | App → External (SBP/1LINK) | External system slow or down | Vendor |
| HTTP_TIMEOUT | Client → App API | App is slow to respond — high CPU or slow DB | L3 Dev |
| READ_TIMEOUT | Between internal services | Downstream service too slow | L3 Dev |
| LOCK_TIMEOUT | Inside DB | A long query holding a lock blocks others | DBA |
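The routing in the table can be captured in a small helper so the first grep of the log immediately tells you who to page. A sketch — the mapping is copied straight from the table above:

```shell
#!/bin/sh
# route_timeout: map a timeout type (as it appears in the log) to the owning team.
route_timeout() {
  case "$1" in
    DB_CONNECTION_TIMEOUT|QUERY_TIMEOUT|LOCK_TIMEOUT) echo "DBA" ;;
    SOCKET_TIMEOUT)                                   echo "Vendor" ;;
    HTTP_TIMEOUT|READ_TIMEOUT)                        echo "L3 Dev" ;;
    *)                                                echo "Unknown — escalate to L3" ;;
  esac
}
route_timeout QUERY_TIMEOUT    # → DBA
route_timeout SOCKET_TIMEOUT   # → Vendor
```

In practice you would feed it the type pulled from the log, e.g. `route_timeout "$(grep -o '[A-Z_]*TIMEOUT' app.log | head -1)"`.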
07 DB Query Analysis — Finding the Slow Query
How slow queries cause timeouts
A slow DB query locks a table or uses so much CPU that all other queries queue up behind it. The payment app is waiting for the DB to respond. If the wait exceeds the timeout limit — the transaction fails with QUERY_TIMEOUT even though the DB is technically running fine.
The fix is almost never to restart the DB. The fix is to identify the slow query, understand why it is slow, and either add an index, optimise the query, or kill the specific query that is causing the blockage.
SQL — find slow and long-running queries (PostgreSQL / MySQL)
-- PostgreSQL: Find all queries running longer than 30 seconds
SELECT pid, now() - pg_stat_activity.query_start AS duration,
query, state
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - pg_stat_activity.query_start > interval '30 seconds'
ORDER BY duration DESC;
-- Find queries waiting for locks
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
-- SQLite (your lab DB): Check stuck transactions
-- In SQLite: simulate by looking at PENDING rows
08 Day 4 Lab — Investigate a Transaction Timeout
🔬 Lab: Trace and Root-Cause a Timeout
Kali Linux · Log + SQL
Create a simulated timeout log
This log shows a realistic timeout scenario with latency build-up visible in the timestamps.
terminal
cat > ~/timeout-scenario.log << 'EOF'
[14:00:00.100] [INFO ] TXN-201 received. Starting processing.
[14:00:00.110] [INFO ] TXN-201 validated. Sending to DB.
[14:00:00.115] [INFO ] DB query started: SELECT balance FROM accounts WHERE id=201
[14:00:04.820] [WARN ] DB query slow: 4705ms elapsed. Expected < 200ms.
[14:00:08.900] [WARN ] DB query still running: 8785ms. Lock contention suspected.
[14:00:15.200] [WARN ] DB response time: 15085ms. Approaching timeout limit.
[14:00:30.115] [ERROR] QUERY_TIMEOUT: DB query exceeded 30000ms limit. Killed.
[14:00:30.116] [ERROR] TXN-201 FAILED: Timeout waiting for DB response.
[14:01:00.100] [INFO ] TXN-202 received. Starting processing.
[14:01:00.120] [ERROR] DB_CONNECTION_TIMEOUT: All connections busy. TXN-202 FAILED.
[14:01:30.100] [INFO ] TXN-203 received. Starting processing.
[14:01:30.115] [INFO ] TXN-203 validated. Sending to DB.
[14:01:30.320] [INFO ] TXN-203 SUCCESS. Response time: 205ms.
EOF
→ Timeout log created. 3 transactions — 2 failed, 1 succeeded after DB recovered.
Step 1 — Identify the total wait time
How long did the transaction wait before timing out? Find start and end timestamps.
terminal
# Find when TXN-201 started and when it failed
grep "TXN-201" ~/timeout-scenario.log
# Find all WARN lines — where did slowness start?
grep "WARN\|slow\|contention" ~/timeout-scenario.log
# Find the exact timeout error
grep "TIMEOUT" ~/timeout-scenario.log
→ TXN-201 started at 14:00:00, timed out at 14:00:30. Total wait: 30 seconds. DB query was the bottleneck.
Step 2 — Identify the type of timeout
Was it a DB timeout, a gateway timeout, or an API timeout?
terminal
# What type of timeout was it?
grep "TIMEOUT" ~/timeout-scenario.log | awk '{print $3}' | tr -d ':'
# How many transactions were affected?
grep -c "FAILED" ~/timeout-scenario.log
# Did any succeed after the timeout?
grep "SUCCESS" ~/timeout-scenario.log
→ QUERY_TIMEOUT = DB query was the cause. 2 transactions failed. 1 succeeded after recovery — confirms DB resolved itself.
Step 3 — Simulate checking for slow queries in SQLite
On a real server you query pg_stat_activity. Here we simulate with your SQLite lab database.
terminal — SQLite simulation
sqlite3 ~/fintech_lab.db << 'EOF'
-- Simulate finding stuck/slow transactions
SELECT txn_id, status, amount
FROM transactions
WHERE status = 'PENDING'
ORDER BY txn_id;
-- Simulate response time analysis by status
SELECT status, COUNT(*) as count, SUM(amount) as total
FROM transactions
GROUP BY status;
EOF
→ PENDING transactions = simulated stuck/slow queries. This shows how many are waiting and their total value at risk.
Write the Timeout RCA for Jira
Put all findings together into a clear RCA entry.
Jira RCA — Timeout
Incident : Transaction Timeout — TXN-201, TXN-202
Start Time : 14:00:00
Timeout At : 14:00:30 (30 second limit exceeded)
Type : QUERY_TIMEOUT — DB query exceeded limit
Root Cause : DB query on accounts table ran 30x slower
than expected. WARN logs showed lock contention
at 14:00:04. A long-running query was holding
a lock and blocking all other queries behind it.
Impact : 2 transactions failed. TXN-203 succeeded after
lock was released — confirms self-recovery.
Action : DBA identified and killed the blocking query.
Next Steps : DBA to add index on accounts.id column.
Review query timeout limits with L3 team.
→ Complete, professional timeout RCA ready for Jira. ✅
09 Real L2 Scenarios
01
Monitoring shows CPU at 97% for 10 minutes. You run ps aux --sort=-%cpu | head -5 — a java batch job is at 96%. It has been running 3 hours — normally takes 20 minutes. Stuck in a loop. You announce on bridge, get approval, kill -15 the PID. CPU drops to 8% in 30 seconds.
02
Transactions are timing out but CPU is normal and DB is running. You check the log — SOCKET_TIMEOUT pointing to SBP gateway. It is not your system at all. You check SBP's status page — they are experiencing issues. You inform the client and the bridge: "External timeout — SBP gateway slow. Our system is healthy."
03
All IBFT transactions are timing out after exactly 10 seconds. You check the log — it is always exactly 10s. This is the configured API timeout — not random latency. The DB query runs in 9.8 seconds every time. Fix: DBA adds a missing index — query drops to 80ms. All timeouts stop immediately.
04
CPU is at 85% but no single process stands out — all at 15–20% each. This is a traffic surge — too many requests simultaneously. Each is using its fair share but together they overwhelm the server. The fix is to add more capacity or implement rate limiting — not kill any process.
✅ Week 6 · Day 3 & 4 Outcomes
- Read the top command output — identify %CPU, load average, idle%, and the heavy process
- Know the 4 most common causes of a CPU spike in a fintech server
- Use ps aux to find the top CPU-consuming process and get its PID
- Decide whether to wait, kill, or escalate based on the process identity and run time
- Complete the CPU lab — create a real spike on Kali, investigate it with top and ps, kill it, confirm recovery, write the Jira entry
- Explain what a transaction timeout is and why it happens
- Identify the 6 types of timeouts and know which system each one points to
- Read a timeout log and calculate total wait time from timestamps
- Understand how a slow DB query causes a QUERY_TIMEOUT and how to find it
- Complete the timeout RCA lab — trace the full chain, identify root cause, produce a Jira-ready RCA