L2 Support Engineer · Fintech · Extra Learning
Troubleshooting Scenarios

Eight real-world problems you will face as an L2 engineer, each mapped to its symptoms, the first action you take, and the fix you apply. Memorise the quick-reference table in section 03 and you can handle the most common incidents without looking anything up.

API Down · Disk Full · High CPU · Queue Backlog · Missing TXN · Timeout · Cert Expired · DB Slow
01 How to Use This Reference
Purpose

Every scenario follows the same three-column logic used in real incident handling:

  • Symptoms: what you or the client observes. This is the entry point.
  • Action: what you check or do first to confirm the root cause.
  • Fix: the resolution that closes the incident.

Start with the symptom. Match it to a scenario. Follow the action and fix. That is L2 troubleshooting in its cleanest form.

02 The 8 Troubleshooting Scenarios
01
API Down
Service Failure
Symptoms
  • 503 errors returned to client
  • All requests to the endpoint fail
  • Health check endpoint not responding
  • Error rate metric spikes to 100%
Action
  • Call the /health endpoint first
  • Check application logs for startup errors
  • Check if the service process is running
  • Check disk and memory usage (running out of either may have caused the crash)
Fix

Restart the service. If it crashes again immediately, check the logs for the crash reason before restarting again. Escalate to L3 if the restart does not hold.
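The health check and follow-up checks above can be sketched in shell. This is a minimal sketch rather than a definitive runbook: the health URL, the service name `payments-api`, and the use of systemd (`systemctl`, `journalctl`) are assumptions for illustration, so substitute whatever your environment actually uses.

```shell
# Map an HTTP status code from the health endpoint to a verdict.
# curl's %{http_code} prints 000 when it gets no HTTP response at all.
classify_health() {
  case "$1" in
    2??) echo "UP" ;;
    503) echo "DOWN: service unavailable" ;;
    000) echo "DOWN: no response" ;;
    *)   echo "CHECK: unexpected HTTP $1" ;;
  esac
}

# Fetch only the status code from a health URL (5-second limit).
health_code() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

# Example usage (hypothetical URL and service name):
#   classify_health "$(health_code https://api.example.internal/health)"
#   systemctl status payments-api        # is the process running?
#   journalctl -u payments-api -n 50     # startup or crash errors
#   df -h; free -m                       # disk and memory
```

Keeping the classification separate from the network call makes it easy to reuse the same verdicts on status codes read from logs or dashboards.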

02
Disk Full
Resource Failure
Symptoms
  • Writes failing across all services
  • DB transactions returning write errors
  • Log files stopped growing
  • Applications crashing on file writes
Action
  • Run df -h to confirm disk at 100%
  • Run du -sh /var/log/* | sort -rh | head
  • Identify the biggest folder consuming space
  • Check if old logs are the culprit (most common)
Fix

Clean logs. Archive or delete old log files to free space, then monitor the disk after cleanup. If it fills again quickly, something is writing to a log continuously: find and fix the source.
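The disk checks above can be wrapped in a few small helpers. This is a sketch under assumptions: the 90% threshold, the `/var/log` default, and the 30-day cutoff in the usage comment are illustrative values, not figures from this guide.

```shell
# Print the used-space percentage (digits only) for the filesystem holding a path.
# df -P forces the one-line-per-filesystem POSIX format; column 5 is "Capacity".
disk_used_pct() {
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is at or above the threshold (default 90).
disk_is_critical() {
  [ "$1" -ge "${2:-90}" ]
}

# List the largest entries under a directory, biggest first.
biggest_in() {
  du -sh "${1:-/var/log}"/* 2>/dev/null | sort -rh | head -n "${2:-10}"
}

# Example usage:
#   pct=$(disk_used_pct /var/log)
#   disk_is_critical "$pct" && biggest_in /var/log
#   # Candidate old logs (review the list before archiving or deleting anything):
#   find /var/log -type f -name '*.log*' -mtime +30 -exec ls -lh {} +
```

Listing candidates with `ls -lh` first, instead of deleting directly, leaves a record of what was removed and how much space it should free.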

03
High CPU
Resource Spike
Symptoms
  • Slow application response times
  • Requests timing out across all endpoints
  • CPU metric above 90% on dashboard
  • Load average growing over time
Action
  • Run top — press P to sort by CPU
  • Run ps aux --sort=-%cpu | head -6
  • Identify the process using the most CPU
  • Check how long it has been running
Fix

Kill the process. Use kill -15 PID for a graceful stop. Get bridge approval before killing anything on PROD. Confirm CPU recovers after the kill. Never use kill -9 on a DB process.
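The rule about never sending kill -9 to a DB process can be enforced with a small guard. This is a sketch only: `safe_kill` and `is_db_process` are hypothetical helper names, and the list of process names treated as databases is an assumption you should extend for your own stack.

```shell
# Decide whether a command name looks like a database process.
# The name list is an assumption; adjust it for your environment.
is_db_process() {
  case "$1" in
    postgres*|mysqld*|oracle*|mongod*) return 0 ;;
    *) return 1 ;;
  esac
}

# Send a signal to a PID, but refuse SIGKILL (9) for DB processes.
safe_kill() {
  sig=$1; pid=$2
  name=$(ps -o comm= -p "$pid") || { echo "no such PID: $pid" >&2; return 1; }
  if [ "$sig" = "9" ] && is_db_process "$name"; then
    echo "refusing kill -9 on DB process $name ($pid)" >&2
    return 2
  fi
  kill "-$sig" "$pid"
}

# Example usage (12345 is a placeholder PID):
#   ps aux --sort=-%cpu | head -6        # top CPU consumers
#   ps -o pid,etime,comm -p 12345        # how long has it been running?
#   safe_kill 15 12345                   # graceful stop, after bridge approval on PROD
```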

04
Queue Backlog
MQ Stuck
Symptoms
  • Transactions stuck in PENDING state
  • Delay reported by clients
  • MQ queue depth growing, not decreasing
  • Consumer not picking up messages
Action
  • Check MQ queue depth — is it growing?
  • Check consumer service — is it running?
  • Check consumer logs for crash errors
  • Note the queue depth before touching anything
Fix

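Restart consumer in safe order: stop consumer → get bridge approval → restart MQ if needed → restart consumer → monitor queue depth decreasing. Messages are not lost while the consumer is down: they wait in the queue. If MQ itself must be restarted, first confirm that the queue and its messages are persistent.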
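Noting the depth before the restart and watching it fall afterwards can be sketched as below. How you read the depth depends on your broker, so the sketch takes a caller-supplied command that prints the current depth as a number; `get_depth` in the usage comment is a hypothetical stand-in for that command.

```shell
# Compare two queue-depth readings taken some time apart.
queue_trend() {
  before=$1; after=$2
  if [ "$after" -gt "$before" ]; then
    echo "growing"
  elif [ "$after" -lt "$before" ]; then
    echo "draining"
  else
    echo "flat"
  fi
}

# Record a baseline, then sample the depth repeatedly and report the trend.
# Arguments: depth command, number of samples (default 5), pause in seconds (default 10).
watch_queue() {
  depth_cmd=$1; samples=${2:-5}; pause=${3:-10}
  prev=$($depth_cmd) || return 1
  echo "baseline depth: $prev"
  i=0
  while [ "$i" -lt "$samples" ]; do
    sleep "$pause"
    cur=$($depth_cmd) || return 1
    echo "depth: $cur ($(queue_trend "$prev" "$cur"))"
    prev=$cur
    i=$((i + 1))
  done
}

# Example usage (get_depth is hypothetical; wrap your broker's own tooling):
#   watch_queue get_depth 6 10
```

A run of "draining" lines after the consumer restart is the signal that the fix has held; "growing" or "flat" means the consumer is still not keeping up.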

05
Missing Transaction
Investigation
Symptoms
  • Customer complaint — payment not received
  • Transaction ID shows no result in client app
  • Money debited but destination not credited
  • Status unknown — client has no update
Action
  • Search DB for the TXN reference ID
  • Check TRANSACTIONS_LOG for current status
  • Check MX_MESSAGE for SBP/external response
  • Grep logs for TXN ID — full history
Fix

Trace status. If the DB shows SUCCESS but the client was not notified, trigger the notification. If the status is PENDING and SBP has confirmed, update it manually (requires L3 approval). If the status is FAILED, check reconciliation and refund eligibility.
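The log search and the status-to-fix mapping above can be sketched as two helpers. The statuses SUCCESS, PENDING and FAILED come from this guide; the function names, the yes/no argument convention, and the fallback branches (already notified, no SBP confirmation yet, unknown status) are assumptions added so the sketch covers every input.

```shell
# Print every log line that mentions a transaction ID (fixed-string match).
# Arguments: TXN ID, then one or more log files.
txn_history() {
  txn_id=$1; shift
  grep -hF -- "$txn_id" "$@"
}

# Map the traced state to the next step.
# Arguments: DB status, client notified (yes/no), SBP confirmed (yes/no).
txn_next_step() {
  case "$1" in
    SUCCESS)
      if [ "$2" = "yes" ]; then echo "no action: client already notified"
      else echo "trigger notification"; fi ;;
    PENDING)
      if [ "$3" = "yes" ]; then echo "manual status update (needs L3 approval)"
      else echo "keep tracing: no SBP confirmation yet"; fi ;;
    FAILED)
      echo "check reconciliation and refund eligibility" ;;
    *)
      echo "unknown status: escalate" ;;
  esac
}

# Example usage (the ID and log paths are placeholders):
#   txn_history TXN123 /var/log/app/*.log
#   txn_next_step PENDING no yes
```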

06
Timeout
Latency Issue
Symptoms
  • Slow API responses — taking seconds not ms
  • TIMEOUT errors in logs
  • Transactions failing after 30 seconds
  • Client reports "request timed out"
Action
  • Identify timeout type from the log message
  • QUERY_TIMEOUT → DB query too slow
  • SOCKET_TIMEOUT → external API not responding
  • Check network if SOCKET_TIMEOUT
Fix

Restart gateway if the timeout is external (SOCKET_TIMEOUT). If it is a DB query timeout (QUERY_TIMEOUT), check for slow queries and add an index (do not restart the DB). If it is a network problem, check connectivity to the external endpoint.
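Identifying the timeout type from the log message can be sketched as a simple pattern match. The two markers QUERY_TIMEOUT and SOCKET_TIMEOUT are the ones named above; the function names and the assumption that the markers appear verbatim in each log line are illustrative.

```shell
# Classify a single log line by the timeout marker it contains.
classify_timeout() {
  case "$1" in
    *QUERY_TIMEOUT*)  echo "db: look for slow queries, do not restart the DB" ;;
    *SOCKET_TIMEOUT*) echo "external: check network and the external API" ;;
    *)                echo "unknown: read the full log entry" ;;
  esac
}

# Count how often each timeout type appears in a log file.
timeout_counts() {
  grep -oE 'QUERY_TIMEOUT|SOCKET_TIMEOUT' "$1" | sort | uniq -c
}

# Example usage (the log path is a placeholder):
#   timeout_counts /var/log/app/gateway.log
#   classify_timeout "$(grep -m1 TIMEOUT /var/log/app/gateway.log)"
```

Counting both types first shows which side dominates, which helps when a single incident produces a mix of DB and external timeouts.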

07
Certificate Expired
SSL / TLS
Symptoms
  • TLS error in logs or client response
  • SSL handshake failure messages
  • All HTTPS connections to endpoint failing
  • Browser shows "certificate not trusted"
Action
  • Check the certificate expiry date
  • Run: openssl s_client -connect host:443
  • Confirm it is the cert and not the config
  • Identify who issued and manages the cert
Fix

Renew the certificate. Contact the certificate authority or use the internal cert management tool. After renewing, restart the web server or load balancer so that it serves the new cert. Verify with openssl afterwards.
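Checking the expiry date before and after renewal can be sketched with two openssl helpers. The function names are hypothetical, and the host in the usage comment is a placeholder. The `-servername` option is added here on the assumption that the endpoint may share an IP with other sites and needs SNI to return the right certificate.

```shell
# Print the expiry date ("notAfter=...") of the certificate a server presents.
# Arguments: host, port (default 443).
cert_enddate() {
  host=$1; port=${2:-443}
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}

# Succeed if the PEM certificate in a file stays valid for at least N more
# seconds (default 0, meaning "not expired right now").
cert_still_valid() {
  openssl x509 -checkend "${2:-0}" -noout -in "$1" >/dev/null
}

# Example usage (placeholder host and file):
#   cert_enddate api.example.internal
#   cert_still_valid /etc/ssl/certs/api.pem 2592000 || echo "expires within 30 days"
```

Running the validity check with a window of days or weeks, not just zero, turns the same command into an early warning instead of a post-incident confirmation.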

08
DB Slow
Database
Symptoms
  • Transaction response times very high
  • QUERY_TIMEOUT errors in logs
  • All services depending on DB are slow
  • DB CPU spiking on the server
Action
  • Check pg_stat_activity for long-running queries
  • Identify which query is taking the most time
  • Check if a lock is being held on a table
  • Do NOT restart DB — diagnose first
Fix

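Optimise query: add index. Never restart the DB for a slow query; that wastes time and risks data integrity. If the slow query is scanning a table for lack of an index, ask the DBA to add it. Response time should drop once the index is built and the query planner starts using it. If the query is instead waiting on a lock, find the blocking session rather than adding an index.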
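The pg_stat_activity check above can be sketched as a helper that prints a query for psql. This assumes PostgreSQL 9.6 or later (for `wait_event_type` and `pg_blocking_pids`); the function name, the 30-second default threshold, and the database name in the usage comment are illustrative.

```shell
# Print SQL that lists non-idle sessions running longer than N seconds,
# longest first, with the PIDs of any sessions blocking them.
long_running_sql() {
  min_seconds=${1:-30}
  # Accept digits only, so the value is safe to place inside the SQL text.
  case "$min_seconds" in
    ''|*[!0-9]*) echo "threshold must be a whole number of seconds" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT pid,
       now() - query_start   AS running_for,
       state,
       wait_event_type,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${min_seconds} seconds'
ORDER BY running_for DESC;
SQL
}

# Example usage (placeholder database name):
#   long_running_sql 30 | psql -X -d payments
```

A non-empty `blocked_by` column points to a lock problem, in which case the blocking session is the thing to investigate, not the slow query itself.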

03 Quick Reference — All 8 Scenarios at a Glance
💡 Keep this table memorised. When a ticket arrives and a client describes a symptom — you should immediately know which scenario it maps to and what to do first.
🔖 Troubleshooting Cheat Sheet
Problem       | Symptom                       | First Action                            | Fix
--------------+-------------------------------+-----------------------------------------+----------------------------
API Down      | 503 errors from endpoint      | Check service logs + health endpoint    | Restart service
Disk Full     | Writes fail everywhere        | df -h → find biggest folder             | Clean logs
High CPU      | Slow app, all endpoints       | top → identify heavy process            | kill -15 PID
Queue Backlog | Transactions pending, delay   | Check queue depth + consumer status     | Restart consumer
Missing TXN   | Customer complaint, no update | DB search by TXN ID + log grep          | Trace status
Timeout       | Slow API, 30s failures        | Identify timeout type from log message  | Restart gateway / add index
Cert Expired  | TLS error in logs             | openssl s_client → check expiry         | Renew certificate
DB Slow       | Timeout, high response time   | Check pg_stat_activity for slow queries | Add index

✅ Troubleshooting Scenarios — What I Know