Extra Learning · Reference Guide

Troubleshooting Scenarios

Eight real-world problems you will face as an L2 engineer — each one mapped to its symptoms, the action you take, and the exact fix you apply. Memorise this table and you can handle the most common incidents without looking anything up.

API Down · Disk Full · High CPU · Queue Backlog · Missing TXN · Timeout · Cert Expired · DB Slow

01 How to Use This Reference

Purpose

Every scenario follows the same three-column logic used in real incident handling:

Symptoms — what you or the client observes. This is the entry point.
Action — what you check or do first to confirm the root cause.
Fix — the resolution that closes the incident.

Start with the symptom. Match it to a scenario. Follow the action and fix. That is L2 troubleshooting in its cleanest form.

02 The 8 Troubleshooting Scenarios

API Down

Service Failure

Symptoms

503 errors returned to client
All requests to the endpoint fail
Health check endpoint not responding
Error rate metric spikes to 100%

Action

Call the /health endpoint first
Check application logs for startup errors
Check if the service process is running
Check disk and memory — may have caused crash

Fix

Restart the service. If it crashes again immediately — check logs for the crash reason before restarting again. Escalate to L3 if restart does not hold.
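The first two checks above can be sketched as a small probe. This is a minimal sketch, not a definitive tool: HEALTH_URL and port 8080 are hypothetical placeholders, and the messages simply restate the Action list.

```shell
#!/bin/sh
# Sketch of the API Down triage: probe /health, then interpret the result.
# HEALTH_URL is a hypothetical placeholder; substitute the real endpoint.
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"

# Print the HTTP status code returned by the health endpoint.
# curl prints 000 when it cannot connect or the request times out.
probe_health() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Map a status code to the next step from the Action list.
interpret_code() {
  case "$1" in
    2??) echo "UP: service is answering" ;;
    000) echo "DOWN: no response, check whether the process is running" ;;
    503) echo "DOWN: 503, check application logs for startup errors" ;;
    *)   echo "DEGRADED: unexpected status $1, check application logs" ;;
  esac
}

# Usage: interpret_code "$(probe_health "$HEALTH_URL")"
```

Keeping the probe separate from the interpretation makes it easy to paste the raw status code into the ticket before acting on it.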

Disk Full

Resource Failure

Symptoms

Writes failing across all services
DB transactions returning write errors
Log files stopped growing
Applications crashing on file writes

Action

Run df -h to confirm disk at 100%
Run du -sh /var/log/* | sort -rh | head
Identify the biggest folder consuming space
Check if old logs are the culprit (most common)

Fix

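Clean logs. Archive or delete old log files to free space, then monitor the disk after cleanup. If a deleted log is still held open by a running process, the space is not released until that process closes the file or restarts, so truncate active logs in place instead of deleting them. If the disk fills again quickly, a rogue log is being written continuously: find and fix the source.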
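As a rough sketch of the cleanup step, the helpers below report disk usage and list old log files without deleting anything. LOG_DIR, the *.log naming pattern, and the 7-day cutoff are assumptions; adjust them to the retention policy in force.

```shell
#!/bin/sh
# Sketch of the Disk Full triage. LOG_DIR and the 7-day cutoff are assumptions.
LOG_DIR="${LOG_DIR:-/var/log}"

# Print the used percentage (digits only) of the filesystem holding a path.
usage_pct() {
  df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# List *.log files under a directory not modified for more than N days.
# This only lists candidates; nothing is archived or deleted.
old_logs() {
  find "$1" -type f -name '*.log' -mtime +"$2"
}

# Usage:
#   usage_pct /
#   old_logs "$LOG_DIR" 7
```

Listing first and deleting second leaves a record of what was removed, which helps if the cleanup is questioned later.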

High CPU

Resource Spike

Symptoms

Slow application response times
Requests timing out across all endpoints
CPU metric above 90% on dashboard
Load average growing over time

Action

Run top — press P to sort by CPU
Run ps aux --sort=-%cpu | head -6
Identify the process using the most CPU
Check how long it has been running

Fix

Kill the process. On PROD, get bridge approval first, then run kill -15 PID for a graceful stop and confirm that CPU recovers. Never use kill -9 on a DB process.
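The graceful stop can be sketched as a function that sends SIGTERM and waits for the process to exit. The function name, the 30-second default, and the return codes are illustrative choices, and PID 12345 in the usage lines is a placeholder.

```shell
#!/bin/sh
# Sketch: send SIGTERM (kill -15) and wait for the process to exit.
# It never escalates to kill -9; that decision stays with a human.
# Return codes: 0 = exited, 1 = still running after timeout, 2 = cannot signal.
graceful_stop() {
  pid="$1"
  timeout="${2:-30}"
  waited=0
  # kill -0 sends no signal; it only tests whether we can signal the PID.
  kill -0 "$pid" 2>/dev/null || return 2
  kill -15 "$pid" 2>/dev/null || return 2
  while kill -0 "$pid" 2>/dev/null; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage, after bridge approval:
#   ps -o pid,etime,pcpu,comm -p 12345   # how long has it been running?
#   graceful_stop 12345 30
```

A return code of 1 means the process ignored SIGTERM within the timeout, which is a reason to escalate, not to reach for kill -9.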

Queue Backlog

MQ Stuck

Symptoms

Transactions stuck in PENDING state
Delay reported by clients
MQ queue depth growing, not decreasing
Consumer not picking up messages

Action

Check MQ queue depth — is it growing?
Check consumer service — is it running?
Check consumer logs for crash errors
Note the queue depth before touching anything

Fix

Restart consumer in safe order: stop consumer → get bridge approval → restart MQ if needed → restart consumer → confirm queue depth is decreasing. Messages on a durable queue are not lost; they wait in the queue until the consumer is back.
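"Is it growing?" is easiest to answer from a few depth readings taken a minute or so apart. The sketch below compares the oldest and newest sample; how a reading is obtained depends on the broker, so the samples are passed in as arguments and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: decide whether a queue is growing from depth samples, oldest first.
# Reading the depth itself is broker-specific and is left out on purpose.
depth_trend() {
  if [ "$#" -lt 2 ]; then
    echo "need-more-samples"
    return 1
  fi
  first="$1"
  for last in "$@"; do :; done   # leaves the final sample in $last
  if [ "$last" -gt "$first" ]; then
    echo "growing"
  elif [ "$last" -lt "$first" ]; then
    echo "draining"
  else
    echo "flat"
  fi
}

# Usage: depth_trend 1200 1350 1500
```

Run it once before the restart to record the baseline, and again afterwards to confirm the backlog is draining.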

Missing Transaction

Investigation

Symptoms

Customer complaint — payment not received
Transaction ID shows no result in client app
Money debited but destination not credited
Status unknown — client has no update

Action

Search DB for the TXN reference ID
Check TRANSACTIONS_LOG for current status
Check MX_MESSAGE for SBP/external response
Grep logs for TXN ID — full history

Fix

Trace status. If DB shows SUCCESS but the client was not notified, trigger the notification. If PENDING and SBP has confirmed, update the status manually (requires L3 approval). If FAILED, check reconciliation and refund eligibility.
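The decision rules above, plus the log search from the Action list, can be sketched as two helpers. The function names are hypothetical. The branches for "PENDING without SBP confirmation" and for unrecognised statuses are assumptions, since the guide does not cover them. The log search assumes each line starts with a sortable timestamp.

```shell
#!/bin/sh
# Sketch of the Missing Transaction rules.
# Arguments: DB status, and whether SBP confirmed the transfer (yes/no).
next_step() {
  case "$1:$2" in
    SUCCESS:*)   echo "trigger client notification" ;;
    PENDING:yes) echo "manual status update, L3 approval required" ;;
    PENDING:*)   echo "keep tracing: no SBP confirmation yet" ;;  # assumption
    FAILED:*)    echo "check reconciliation and refund eligibility" ;;
    *)           echo "unknown status, escalate" ;;               # assumption
  esac
}

# Print every log line that mentions a TXN ID, oldest first.
# Assumes lines begin with a sortable timestamp; $2 is the log directory.
txn_history() {
  grep -rhF -- "$1" "$2" | sort
}

# Usage:
#   txn_history TXN-42 /path/to/logs
#   next_step PENDING yes
```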

Timeout

Latency Issue

Symptoms

Slow API responses — taking seconds not ms
TIMEOUT errors in logs
Transactions failing after 30 seconds
Client reports "request timed out"

Action

Identify timeout type from the log message
QUERY_TIMEOUT → DB query too slow
SOCKET_TIMEOUT → external API not responding
Check network if SOCKET_TIMEOUT

Fix

If the timeout is external (SOCKET_TIMEOUT), restart the gateway. If it is a DB query timeout (QUERY_TIMEOUT), check for slow queries and add an index; do not restart the DB. If the network is at fault, check connectivity to the external endpoint.
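Identifying the timeout type is a plain string match on the log message. A minimal sketch follows, assuming the log lines contain the literal tokens QUERY_TIMEOUT and SOCKET_TIMEOUT as described above. The function names and the sample log format are hypothetical.

```shell
#!/bin/sh
# Sketch: route a log line to a fix based on the timeout type it names.
classify_timeout() {
  case "$1" in
    *QUERY_TIMEOUT*)  echo "db: find the slow query, do not restart the DB" ;;
    *SOCKET_TIMEOUT*) echo "external: check network, then restart gateway" ;;
    *)                echo "unknown: read the full log entry" ;;
  esac
}

# Count each timeout type in a log file to see which one dominates.
count_timeouts() {
  printf 'QUERY_TIMEOUT=%s SOCKET_TIMEOUT=%s\n' \
    "$(grep -c 'QUERY_TIMEOUT' "$1")" \
    "$(grep -c 'SOCKET_TIMEOUT' "$1")"
}

# Usage:
#   classify_timeout "12:00:01 ERROR QUERY_TIMEOUT after 30000ms"
#   count_timeouts /path/to/app.log
```

Counting both types before acting helps when a slow DB and an unresponsive external API show up in the same window.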

Certificate Expired

SSL / TLS

Symptoms

TLS error in logs or client response
SSL handshake failure messages
All HTTPS connections to endpoint failing
Browser shows a certificate warning ("expired" or "not trusted")

Action

Check the certificate expiry date
Run: openssl s_client -connect host:443 </dev/null | openssl x509 -noout -enddate
Confirm it is the cert and not the config
Identify who issued and manages the cert

Fix

Renew the certificate. Contact the certificate authority or use the internal cert management tool. After renewing — restart the web server or load balancer to apply the new cert. Verify with openssl after.
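Both the initial check and the verification after renewal can be done with openssl alone. The sketch below uses hypothetical function names: one helper prints the expiry date a server presents, the other reports whether a certificate file expires within a given number of seconds (via openssl x509 -checkend). The host argument is a placeholder.

```shell
#!/bin/sh
# Sketch: check certificate expiry with openssl.

# Print the notAfter date of the certificate a server presents on port 443.
remote_enddate() {
  openssl s_client -connect "$1:443" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate
}

# Return 0 if the PEM certificate in file $1 expires within $2 seconds
# (use 0 to ask "has it already expired?"), 1 if it stays valid beyond that.
# A file that cannot be read is also reported as 0, so check the path first.
cert_expires_within() {
  if openssl x509 -in "$1" -noout -checkend "$2" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Usage:
#   remote_enddate example.com
#   cert_expires_within /path/to/cert.pem 604800 && echo "renew within 7 days"
```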

DB Slow

Database

Symptoms

Transaction response times very high
QUERY_TIMEOUT errors in logs
All services depending on DB are slow
DB CPU spiking on the server

Action

Check pg_stat_activity for long-running queries
Identify which query is taking the most time
Check if a lock is being held on a table
Do NOT restart DB — diagnose first

Fix

Optimise query: add index. Never restart the DB for a slow query; that wastes time and risks data integrity. Ask the DBA to add the missing index. If the missing index was the cause, response time should drop as soon as the index is built.
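The pg_stat_activity check can be sketched as a query generator piped into psql. This is a minimal sketch: the function name, the 30-second default, and the 80-character query preview are illustrative choices, and pg_blocking_pids assumes PostgreSQL 9.6 or later. Connection settings are expected to come from the usual PG* environment variables.

```shell
#!/bin/sh
# Sketch: list non-idle PostgreSQL sessions whose current query has run
# longer than a threshold, and which PIDs (if any) are blocking them.

# Print the SQL for a threshold given in whole seconds (default 30).
long_running_sql() {
  secs="${1:-30}"
  case "$secs" in
    ''|*[!0-9]*) echo "threshold must be an integer" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${secs} seconds'
ORDER BY runtime DESC;
SQL
}

# Usage: long_running_sql 30 | psql -X
```

A non-empty blocked_by column points at a lock being held, which is a different problem from a missing index and should be reported to the DBA as such.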

03 Quick Reference — All 8 Scenarios at a Glance

💡 Keep this table memorised. When a ticket arrives and a client describes a symptom — you should immediately know which scenario it maps to and what to do first.

🔖 Troubleshooting Cheat Sheet

Problem	Symptom	First Action	Fix
API Down	503 errors from endpoint	Check service logs + health endpoint	Restart service
Disk Full	Writes fail everywhere	`df -h` → find biggest folder	Clean logs
High CPU	Slow app, all endpoints	`top` → identify heavy process	`kill -15 PID`
Queue Backlog	Transactions pending, delay	Check queue depth + consumer status	Restart consumer
Missing TXN	Customer complaint, no update	DB search by TXN ID + log grep	Trace status
Timeout	Slow API, 30s failures	Identify timeout type from log message	Restart gateway / add index
Cert Expired	TLS error in logs	`openssl s_client` → check expiry	Renew certificate
DB Slow	Timeout, high response time	Check `pg_stat_activity` for slow queries	Add index

✅ Troubleshooting Scenarios — What I Know

API Down → 503 errors → check service logs and health endpoint → restart service if process has stopped
Disk Full → writes failing → df -h to confirm, du to find biggest folder → clean old logs to free space
High CPU → slow app across all endpoints → top / ps aux to identify the heavy process → kill -15 PID with bridge approval
Queue Backlog → transactions pending, delay → check queue depth and consumer status → restart consumer in safe order
Missing Transaction → customer complaint, unknown status → search DB and grep logs by TXN ID → trace status and update or escalate
Timeout → slow API, 30-second failures → identify timeout type from the log error message → restart gateway for external or add index for DB
Certificate Expired → TLS error → openssl s_client to check expiry → renew cert and restart web server to apply
DB Slow → timeout and high response time → check pg_stat_activity for long-running queries → ask DBA to add missing index — never restart DB for a slow query