L2 Support Engineer · Fintech · Extra Learning
Troubleshooting Scenarios
Extra Learning · Reference Guide
Eight real-world problems you will face as an L2 engineer — each one mapped to its symptoms, the action you take, and the exact fix you apply. Memorise this table and you can handle the most common incidents without looking anything up.
API Down
Disk Full
High CPU
Queue Backlog
Missing TXN
Timeout
Cert Expired
DB Slow
01 How to Use This Reference
Purpose
Every scenario follows the same three-column logic used in real incident handling:
Symptoms — what you or the client observes. This is the entry point.
Action — what you check or do first to confirm the root cause.
Fix — the resolution that closes the incident.
Start with the symptom. Match it to a scenario. Follow the action and fix. That is L2 troubleshooting in its cleanest form.
02 The 8 Troubleshooting Scenarios
01
API Down
Service Failure
Symptoms
- 503 errors returned to client
- All requests to the endpoint fail
- Health check endpoint not responding
- Error rate metric spikes to 100%
Action
- Call the /health endpoint first
- Check application logs for startup errors
- Check if the service process is running
- Check disk and memory — may have caused crash
Fix
Restart the service. If it crashes again immediately — check logs for the crash reason before restarting again. Escalate to L3 if restart does not hold.
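The first two checks above can be scripted. This is a minimal sketch, not a definitive implementation: the helper names, the health URL, and the process name `payments-api` are hypothetical, and only the standard `curl` and `pgrep` tools are assumed to be installed.

```shell
# Map an HTTP status code from the health endpoint to a rough verdict.
# "000" is what curl reports when it could not get any HTTP response.
interpret_health_status() {
  case "$1" in
    2??) echo "healthy" ;;
    503) echo "service unavailable" ;;
    000) echo "no response" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Call the health endpoint with a 5-second cap and interpret the result.
# The URL is supplied by the caller; nothing here is tied to one service.
check_health() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1") || code=000
  interpret_health_status "$code"
}

# Example (hypothetical URL and process name):
#   check_health "http://localhost:8080/health"
#   pgrep -f payments-api >/dev/null || echo "process not running"
```

A "no response" verdict together with a missing process points to a crash, so read the startup logs before restarting.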
02
Disk Full
Resource Failure
Symptoms
- Writes failing across all services
- DB transactions returning write errors
- Log files stopped growing
- Applications crashing on file writes
Action
- Run df -h to confirm disk at 100%
- Run du -sh /var/log/* | sort -rh | head
- Identify the biggest folder consuming space
- Check if old logs are the culprit (most common)
Fix
Clean logs. Archive or delete old log files to free space. Monitor disk after cleanup. If it fills quickly again — a rogue log is being written continuously. Find and fix the source.
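To confirm the disk check and keep monitoring after cleanup, the `df` output can be filtered. This is a sketch under assumptions: the helper name and the 90% default are illustrative, and it expects the POSIX `df -P` column layout with mount points that contain no spaces.

```shell
# Read `df -P` output on stdin and print every filesystem at or above
# the given usage percentage (default 90) as "<mount> <percent>".
disks_over_threshold() {
  awk -v limit="${1:-90}" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)          # "97%" -> "97"
    if (pct + 0 >= limit + 0) print $6, pct "%"
  }'
}

# Example:
#   df -P | disks_over_threshold 90
#   du -sh /var/log/* | sort -rh | head
```

Running the same filter a few minutes after cleanup shows whether the disk is filling again, which is the sign of a rogue log.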
03
High CPU
Resource Spike
Symptoms
- Slow application response times
- Requests timing out across all endpoints
- CPU metric above 90% on dashboard
- Load average growing over time
Action
- Run top and press P to sort by CPU
- Run ps aux --sort=-%cpu | head -6
- Identify the process using the most CPU
- Check how long it has been running
Fix
Kill the process — kill -15 PID for graceful stop. Get bridge approval before killing on PROD. Confirm CPU recovers after kill. Never use kill -9 on a DB process.
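The graceful stop and the follow-up check can be combined in one helper. This is a sketch, assuming approval has already been given. The function name and the 10-second default wait are illustrative. It deliberately never falls back to kill -9.

```shell
# Send SIGTERM (kill -15) and wait up to N seconds for the process to exit.
# Never escalates to kill -9: if the process is still alive after the wait,
# it reports that and leaves the decision to a human.
graceful_stop() {
  pid=$1
  wait_secs=${2:-10}
  kill -15 "$pid" 2>/dev/null || { echo "no such process $pid"; return 1; }
  while [ "$wait_secs" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || { echo "stopped $pid"; return 0; }
    sleep 1
    wait_secs=$((wait_secs - 1))
  done
  echo "still running $pid"
  return 2
}

# Example (PID taken from the ps output above):
#   graceful_stop 12345 15
```

After a successful stop, re-run top to confirm that CPU has actually recovered.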
04
Queue Backlog
MQ Stuck
Symptoms
- Transactions stuck in PENDING state
- Delay reported by clients
- MQ queue depth growing, not decreasing
- Consumer not picking up messages
Action
- Check MQ queue depth — is it growing?
- Check consumer service — is it running?
- Check consumer logs for crash errors
- Note the queue depth before touching anything
Fix
Restart consumer in safe order: stop consumer → get bridge approval → restart MQ if needed → restart consumer → monitor queue depth decreasing. Unconsumed messages wait in the queue while the consumer is down; they are not lost as long as the queue and its messages are persistent.
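Noting the depth before the restart only helps if you compare it with a later sample. A minimal sketch of that comparison follows. How the depth is read depends on the MQ product, which this guide does not name, so the two numbers are passed in by hand. The function name is hypothetical.

```shell
# Compare two queue-depth samples taken some time apart.
# The caller supplies the numbers; nothing is fetched from the broker here.
queue_trend() {
  before=$1
  after=$2
  if [ "$after" -gt "$before" ]; then
    echo "growing by $((after - before))"
  elif [ "$after" -lt "$before" ]; then
    echo "draining by $((before - after))"
  else
    echo "flat at $after"
  fi
}

# Example: depth noted before the restart vs. a few minutes after
#   queue_trend 1500 900
```

A "draining" result after the restart confirms that the consumer is picking up messages again.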
05
Missing Transaction
Investigation
Symptoms
- Customer complaint — payment not received
- Transaction ID shows no result in client app
- Money debited but destination not credited
- Status unknown — client has no update
Action
- Search DB for the TXN reference ID
- Check TRANSACTIONS_LOG for current status
- Check MX_MESSAGE for SBP/external response
- Grep logs for TXN ID — full history
Fix
Trace status. If DB shows SUCCESS but client not notified — trigger notification. If PENDING and SBP confirmed — manually update (L3 approval). If FAILED — check reconciliation and refund eligibility.
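The log step can be done with a single literal grep across the relevant files. This is a sketch, assuming the TXN ID appears verbatim in the log lines. The function name, the example ID, and the log path are hypothetical.

```shell
# Print every log line that mentions the transaction ID, prefixed with
# file name and line number, across one or more log files.
# -F treats the ID as a literal string, so characters such as "." in the
# ID are not interpreted as regex syntax.
trace_txn() {
  txn_id=$1
  shift
  grep -nHF -- "$txn_id" "$@"
}

# Example (hypothetical ID and path):
#   trace_txn TXN-20240115-0042 /var/log/app/*.log
```

A literal match also hits longer IDs that start with the same text, so search for the full reference, not a prefix.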
06
Timeout
Symptoms
- Slow API responses — taking seconds not ms
- TIMEOUT errors in logs
- Transactions failing after 30 seconds
- Client reports "request timed out"
Action
- Identify timeout type from the log message
- QUERY_TIMEOUT → DB query too slow
- SOCKET_TIMEOUT → external API not responding
- Check network if SOCKET_TIMEOUT
Fix
Restart gateway if external timeout. If DB query timeout — check for slow queries and add index (do not restart DB). If network — check connectivity to the external endpoint.
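The branching on timeout type can be written as a small classifier. This is a sketch, assuming the log line contains the literal marker QUERY_TIMEOUT or SOCKET_TIMEOUT as described above. The function name and the sample log line are hypothetical.

```shell
# Decide which side to investigate from a single log line.
classify_timeout() {
  case "$1" in
    *QUERY_TIMEOUT*)  echo "db: check for slow queries, do not restart DB" ;;
    *SOCKET_TIMEOUT*) echo "external: check network and the external endpoint" ;;
    *)                echo "unknown: read the full log message" ;;
  esac
}

# Example (hypothetical log line):
#   classify_timeout "2024-01-15 10:02:11 ERROR SOCKET_TIMEOUT after 30000 ms"
```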
07
Certificate Expired
SSL / TLS
Symptoms
- TLS error in logs or client response
- SSL handshake failure messages
- All HTTPS connections to endpoint failing
- Browser shows "certificate not trusted"
Action
- Check the certificate expiry date
- Run: openssl s_client -connect host:443
- Confirm it is the cert and not the config
- Identify who issued and manages the cert
Fix
Renew the certificate. Contact the certificate authority or use the internal cert management tool. After renewing — restart the web server or load balancer to apply the new cert. Verify with openssl after.
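The expiry check, and the verification after renewal, can be reduced to one line of output. This is a sketch using standard `openssl` subcommands. The function names are hypothetical, and the host and port are whatever endpoint you are investigating.

```shell
# Strip the "notAfter=" prefix that `openssl x509 -noout -enddate` prints.
parse_enddate() {
  sed -n 's/^notAfter=//p'
}

# Fetch the server certificate and print its expiry date.
# -servername sends SNI so a host serving several certs returns the right one.
cert_enddate() {
  host=$1
  port=${2:-443}
  echo | openssl s_client -connect "$host:$port" -servername "$host" 2>/dev/null \
    | openssl x509 -noout -enddate | parse_enddate
}

# Example:
#   cert_enddate host 443
```

For a yes/no answer, `openssl x509 -noout -checkend 0` on the same certificate exits non-zero if it has already expired. Run the check again after the web server or load balancer restart to confirm the new date is being served.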
08
DB Slow
Symptoms
- Transaction response times very high
- QUERY_TIMEOUT errors in logs
- All services depending on DB are slow
- DB CPU spiking on the server
Action
- Check pg_stat_activity for long-running queries
- Identify which query is taking the most time
- Check if a lock is being held on a table
- Do NOT restart DB — diagnose first
Fix
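Optimise the query, usually by adding an index. Never restart the DB for a slow query: that wastes time and risks data integrity. Ask the DBA to add the missing index. Response time should drop once the index exists and the query planner uses it. If a held lock is the cause instead, an index will not help; find the blocking session first.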
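The pg_stat_activity check can be kept as a ready-made query. This is a sketch under assumptions: the function name, the 30-second default, and the connection details in the example are illustrative, and the columns used are the standard ones in recent PostgreSQL versions.

```shell
# Print a query listing statements that have been running longer than
# the given number of seconds (default 30), longest first.
# wait_event_type = 'Lock' in the output means the session is waiting on a lock.
long_running_sql() {
  secs=${1:-30}
  cat <<EOF
SELECT pid, state, wait_event_type, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '$secs seconds'
ORDER BY duration DESC;
EOF
}

# Example (host, user and database names are assumptions):
#   long_running_sql 30 | psql -h db-host -U readonly_user -d payments
```

The query only reads a statistics view, so it is safe to run while diagnosing. Pass it a plain number only, since the value is placed directly into the SQL text.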
03 Quick Reference — All 8 Scenarios at a Glance
💡 Keep this table memorised. When a ticket arrives and a client describes a symptom — you should immediately know which scenario it maps to and what to do first.
🔖 Troubleshooting Cheat Sheet
| Problem | Symptom | First Action | Fix |
| --- | --- | --- | --- |
| API Down | 503 errors from endpoint | Check service logs + health endpoint | Restart service |
| Disk Full | Writes fail everywhere | df -h → find biggest folder | Clean logs |
| High CPU | Slow app, all endpoints | top → identify heavy process | kill -15 PID |
| Queue Backlog | Transactions pending, delay | Check queue depth + consumer status | Restart consumer |
| Missing TXN | Customer complaint, no update | DB search by TXN ID + log grep | Trace status |
| Timeout | Slow API, 30s failures | Identify timeout type from log message | Restart gateway / add index |
| Cert Expired | TLS error in logs | openssl s_client → check expiry | Renew certificate |
| DB Slow | Timeout, high response time | Check pg_stat_activity for slow queries | Add index |
✅ Troubleshooting Scenarios — What I Know
- API Down → 503 errors → check service logs and health endpoint → restart service if process has stopped
- Disk Full → writes failing → df -h to confirm, du to find biggest folder → clean old logs to free space
- High CPU → slow app across all endpoints → top / ps aux to identify the heavy process → kill -15 PID with bridge approval
- Queue Backlog → transactions pending, delay → check queue depth and consumer status → restart consumer in safe order
- Missing Transaction → customer complaint, unknown status → search DB and grep logs by TXN ID → trace status and update or escalate
- Timeout → slow API, 30-second failures → identify timeout type from the log error message → restart gateway for external or add index for DB
- Certificate Expired → TLS error → openssl s_client to check expiry → renew cert and restart web server to apply
- DB Slow → timeout and high response time → check pg_stat_activity for long-running queries → ask DBA to add missing index — never restart DB for a slow query