L2 Support Engineer · Fintech · Week 6
Week 6 · Day 1 & Day 2
DB Failure Drill & MQ Stuck Scenario
These are two of the most common and most critical incidents you will face as an L2 engineer. Today you learn exactly what to do when the database goes down and what to do when the message queue gets stuck.
Day 1 — DB Outage
Day 2 — MQ Stuck
Outage Drill
Queue Monitoring
Safe Restart
Day 1
DB Failure Drill — Database Outage Troubleshooting
01 The Simple Idea
Real-life Analogy
Think of the database like the cash register at a supermarket. Every transaction goes through it. If the cash register stops working — the whole store stops. Nobody can pay. Nobody can check stock. Everything freezes.
A DB outage is exactly that. Every service that depends on the database — payment processing, transaction logging, balance updates — all stop the moment the DB goes down. Your job as L2 is to confirm it is the DB, not guess, and follow a clear protocol to get it back.
What causes a DB outage?
Connection pool exhausted — too many open connections, no new ones available. The most common cause in fintech.
DB server crash — the database process stopped. Could be hardware, memory, or a software bug.
Disk full — the database ran out of disk space to write new data. Writes fail, reads may still work.
Lock contention — a long-running query is holding locks, blocking all other queries from completing.
Network issue — the app server can't reach the DB server. DB is fine but the connection is broken.
02 How to Identify a DB Outage
🔴 Signs That Point to a DB Issue
| What you see | What it means | Severity |
| DB_CONNECTION_TIMEOUT in logs | App tried to connect to DB but waited too long — pool may be full or DB is down | High |
| Cannot acquire connection from pool | All connections in pool are taken — no new transactions can proceed | High |
| All transactions FAILED simultaneously | System-wide failure — not isolated. DB is the common dependency. | P1 |
| Logs stopped writing | App cannot write to DB — disk may be full or DB is completely down | P1 |
| Response time spiking on all services | Queries are waiting — DB is slow or locked, not necessarily down | Medium |
| Only reads work, writes fail | Disk is full — DB can read existing data but cannot write new data | High |
03 DB Outage Protocol — What to Do in Order
01
Confirm it is the DB — not the app or network
Check the logs first. Look for DB_CONNECTION_TIMEOUT or pool exhausted. Then try to ping the DB server directly. If ping works but DB connection fails — DB process is down. If ping fails — network issue.
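A minimal sketch of that check, assuming a PostgreSQL database on a host called db01.internal; substitute your real hostname and port.
terminal — connectivity check (assumed host and port)
# db01.internal and port 5432 are placeholders - use your real DB host and port
# Can we reach the DB server at all? (network layer)
ping -c 3 db01.internal
# Is the DB port accepting connections? (process layer, 5432 = PostgreSQL)
nc -z -w 5 db01.internal 5432 \
  && echo "Port open - DB is listening" || echo "Port closed - DB down or blocked"
→ Ping OK but port closed points to the DB process. Ping failing points to the network.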
02
Declare P1 and open the bridge immediately
Do not investigate silently on your own. A DB outage is always a P1. Open the bridge call, notify your lead, and send the initial client alert — before you have answers.
03
Check disk space — it is the most common hidden cause
Run df -h on the DB server. If disk is at 100% — this is the root cause. Clean old logs or archive data before anything else. DB cannot write to a full disk.
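A sketch of the follow-up once df -h shows a full partition, assuming the space is being eaten by logs under /var/log; real paths differ per server, so confirm with the DBA before deleting or compressing anything.
terminal — space cleanup sketch (assumed paths)
# Which directories are eating the space? (/var/log is an assumed location)
du -sh /var/log/* 2>/dev/null | sort -h | tail -5
# Compress logs older than 7 days instead of deleting them outright
find /var/log -name "*.log" -mtime +7 -exec gzip {} \;
→ Free the space, re-test DB writes, and announce every cleanup action on the bridge.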
04
Check if the DB process is actually running
Check if the database service (mysql, postgresql, oracle) is active. If it crashed — it needs to be restarted by the DBA. Do not restart production DB yourself without DBA approval on the bridge.
05
Check the connection pool if DB is running but connections failing
If the DB process is up but connections are being refused — the pool is exhausted. The DBA can clear stuck connections or increase the pool limit as a short-term fix.
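A sketch of what that check might look like from the app server, assuming PostgreSQL on port 5432; your port and configured pool size will differ.
terminal — connection count sketch (assumed PostgreSQL port)
# How many connections does this app server hold open to the DB?
ss -tan | grep ':5432' | grep -c ESTAB
# The DBA can break the count down by state on the DB side (PostgreSQL)
# psql -c "SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;"
→ An established count equal to the configured pool limit confirms pool exhaustion. Report the exact number on the bridge.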
06
Update every 15 minutes on the bridge — even with no news
Keep the bridge informed. "Still investigating with DBA team, no fix yet" is better than silence. Record every finding and action in the Jira ticket in real time.
04 Day 1 Lab — Simulated DB Down on Kali Linux
What this lab simulates
You cannot take down a real production database. But you can simulate the investigation steps — the log reading, the process checks, the disk checks — exactly as you would do them on a real server. This builds the muscle memory so when it happens for real, your hands know what to do.
🔬 Lab: Simulate and Investigate a DB Outage
Kali Linux · Simulated
Create a simulated DB outage log
This creates a log that looks exactly like what you would see during a real DB outage.
terminal
cat > ~/db-outage.log << 'EOF'
[09:14:50] [INFO ] TXN-001 received. Processing...
[09:14:52] [INFO ] TXN-001 SUCCESS
[09:15:00] [WARN ] DB connection pool at 82%
[09:15:01] [WARN ] DB connection pool at 91%
[09:15:02] [WARN ] DB connection pool at 97%
[09:15:03] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:03] [ERROR] TXN-002 FAILED - cannot write to database
[09:15:04] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:04] [ERROR] TXN-003 FAILED - cannot write to database
[09:15:05] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:05] [ERROR] TXN-004 FAILED - cannot write to database
[09:15:10] [ERROR] DB_CONNECTION_REFUSED - connection refused on port 5432
[09:15:10] [ERROR] Database service appears to be down
EOF
→ db-outage.log created with 13 lines including escalating warnings and errors
Step 1 — Confirm it is the DB (read the log)
Search the log for DB-specific errors to confirm the root cause.
terminal
# Count how many DB errors exist
grep -c "DB_CONNECTION" ~/db-outage.log
# Show the exact DB error lines (-i catches both "database" and "Database")
grep -i "DB_CONNECTION\|database" ~/db-outage.log
# Show what happened BEFORE the first error (context)
grep -B 3 "DB_CONNECTION_TIMEOUT" ~/db-outage.log | head -6
→ 4 DB_CONNECTION errors. WARNs at 82%→91%→97% appeared before the first ERROR. Root cause is pool exhaustion.
Step 2 — Check disk space (most common hidden cause)
Always check disk before anything else. A full disk causes DB write failures.
terminal
# Check all disks
df -h
# Check specifically the data folder
df -h /var /home /tmp
→ If any partition shows 100% — that is your root cause. Report it on the bridge immediately.
Step 3 — Check if the DB process is running (SQLite simulation)
On a real server you would check mysql or postgresql. Here we simulate the process check.
terminal — process check commands
# Check if mysql is running (real server)
systemctl status mysql 2>/dev/null || echo "mysql not installed (expected on Kali)"
# Check if postgresql is running
systemctl status postgresql 2>/dev/null || echo "postgresql not installed"
# Check all running processes to see what DB is there
ps aux | grep -E "mysql|postgres|oracle"
# Simulate — is our SQLite DB accessible?
sqlite3 ~/fintech_lab.db "SELECT COUNT(*) FROM transactions;" 2>/dev/null \
&& echo "DB is UP" || echo "DB is DOWN"
→ If you ran the Week 2 SQL lab — DB is UP. This simulates what a healthy DB check looks like. On a real server, a DOWN result means call the DBA.
Step 4 — Count affected transactions and write the Jira RCA
How many transactions failed during this outage? This goes in the incident ticket.
terminal
# Count failed transactions
grep -c "FAILED" ~/db-outage.log
# List which transactions failed (the TXN id is the 3rd field)
grep "FAILED" ~/db-outage.log | awk '{print $3}'
# When did the outage start?
grep -m 1 "ERROR" ~/db-outage.log | awk '{print $1, $2}'
→ 3 transactions failed (TXN-002, TXN-003, TXN-004). Outage started at 09:15:03. RCA: DB connection pool exhausted. ✅
Day 2
MQ Stuck Scenario — Queue Monitoring & Safe Restart
05 The Simple Idea
Real-life Analogy
Think of a Message Queue like a conveyor belt in a factory. Products (messages/transactions) are placed on the belt and workers (processors) pick them up one by one. If the belt stops — products pile up and nobody is working. But the products are still there, safe on the belt, not lost.
When the MQ gets stuck, transactions are not lost — they are just waiting. Your job is to find out why the belt stopped, fix it, and get it moving again. The right restart sequence matters — a wrong restart can lose the messages sitting in the queue.
What is a Message Queue and why does it get stuck?
A Message Queue (MQ) holds transactions in an ordered line until a worker (processor) picks them up. Common MQ tools: RabbitMQ, Apache Kafka, IBM MQ, ActiveMQ.
It gets stuck because:
The consumer/processor crashed and stopped reading from the queue.
The queue is full and cannot accept new messages.
A single poisoned message (bad format) is blocking all others behind it.
The MQ service itself stopped running.
Network connectivity between the app and the MQ broke.
🟡 Signs That the MQ is Stuck
| What you see | What it means | Action |
| Queue depth growing, none processed | Consumer is down — messages piling up, nothing consuming them | Check consumer |
| Transactions stuck in PENDING forever | Message was queued but processor never picked it up | Check queue depth |
| Dead letter queue filling up | Messages failed processing multiple times and were moved here | Investigate each |
| All new transactions hanging | Queue itself may be full or the service is down | P1 — check MQ service |
| One transaction stuck, others OK | Poisoned message — bad format blocking the queue at that position | Identify and remove it |
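When these signs appear, a quick first check might look like this, assuming a RabbitMQ broker and a consumer service named payment-consumer (the example name used in this week's restart drill).
terminal — quick consumer check (assumed RabbitMQ and service name)
# Is the consumer service running? (payment-consumer is an example name)
systemctl status payment-consumer
# Does the broker still see any consumers attached? (RabbitMQ, run on the broker host)
sudo rabbitmqctl list_consumers
→ Queue depth growing with zero attached consumers confirms the "consumer is down" row above.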
06 How to Restart MQ Safely — The Right Order
Why the restart order matters
Restarting an MQ incorrectly can lose all messages sitting in the queue. A safe restart means: confirm what is in the queue, stop the consumer first, restart the MQ service, then restart the consumer. Never restart the MQ service while the consumer is actively writing to it.
01
First — check the queue depth before touching anything
Know how many messages are sitting in the queue. This number should go down after the restart. If you don't check now, you won't know whether any messages were lost.
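If the broker is RabbitMQ, the depth check might look like the sketch below; Kafka, IBM MQ and ActiveMQ have their own equivalents. payment-queue is the queue name from this lab.
terminal — queue depth check (RabbitMQ example)
# List every queue with its current message count
sudo rabbitmqctl list_queues name messages
# Only the payment queue from this scenario
sudo rabbitmqctl list_queues name messages | grep payment-queue
→ Write the number down. It is your baseline for confirming nothing was lost after the restart.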
02
Stop the consumer/processor first — before touching MQ
The consumer is the worker reading from the queue. Stop it first so it is not in the middle of reading a message when you restart the MQ. An interrupted read can corrupt or lose that message.
03
Get bridge approval before restarting on PROD
Announce on the bridge: "I am about to restart the MQ service. Queue depth is [N] messages. Requesting approval." Wait for a verbal OK from the bridge lead before proceeding.
04
Restart the MQ service
Restart the MQ broker service. After restart, verify it is running and accepting connections. Check the queue depth — all messages should still be there.
05
Start the consumer and watch it process
Restart the consumer service. Watch the queue depth — it should start decreasing as messages are picked up. Monitor the logs for errors. Confirm transactions are completing successfully.
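One way to watch the drain, again assuming RabbitMQ and the payment-queue name from this lab.
terminal — watch the queue drain (RabbitMQ example)
# Re-run the depth check every 5 seconds and watch the number fall
# (run as a user with rabbitmqctl access)
watch -n 5 'rabbitmqctl list_queues name messages'
→ A steadily decreasing depth means the consumer is healthy again. A flat or growing depth means stop and investigate before closing anything.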
06
Confirm and close the incident
Queue depth back to zero or processing normally. All pending transactions completing. No new errors in the log. Update Jira, notify clients, close the bridge.
07 Day 2 Lab — Simulate MQ Stuck & Safe Restart
🔬 Lab: Simulate a Stuck Queue and Restart it Safely
Kali Linux · Simulated
Create a simulated MQ stuck log
This shows what a stuck queue looks like in the application logs.
terminal
cat > ~/mq-stuck.log << 'EOF'
[10:00:00] [INFO ] MQ consumer started. Connected to queue: payment-queue
[10:00:10] [INFO ] TXN-010 queued. Queue depth: 1
[10:00:11] [INFO ] TXN-010 picked up by consumer. Processing...
[10:00:12] [INFO ] TXN-010 processed successfully
[10:01:00] [INFO ] TXN-011 queued. Queue depth: 1
[10:01:01] [INFO ] TXN-012 queued. Queue depth: 2
[10:01:02] [INFO ] TXN-013 queued. Queue depth: 3
[10:01:03] [WARN ] Consumer not responding. Queue depth: 3
[10:01:30] [WARN ] Consumer not responding. Queue depth: 5
[10:02:00] [ERROR] Consumer crashed. Queue depth: 7. Messages are piling up.
[10:02:01] [ERROR] TXN-011 stuck in PENDING. Not being consumed.
[10:02:01] [ERROR] TXN-012 stuck in PENDING. Not being consumed.
[10:02:01] [ERROR] TXN-013 stuck in PENDING. Not being consumed.
EOF
→ mq-stuck.log created — shows consumer crash at 10:02 with 7 messages piling up
Investigate — what happened and when did it start?
Read the log to understand the full picture before taking any action.
terminal
# Find when warnings started
grep "WARN\|ERROR" ~/mq-stuck.log
# Count stuck transactions
grep -c "PENDING" ~/mq-stuck.log
# Find when consumer crashed
grep "Consumer crashed" ~/mq-stuck.log
→ Consumer crashed at 10:02:00. 3 transactions stuck in PENDING. Queue depth was 7 at time of crash.
Simulate — check current queue depth in SQLite
Use your SQLite lab DB to simulate checking how many transactions are stuck PENDING.
terminal
sqlite3 ~/fintech_lab.db << 'EOF'
-- Simulate queue depth check
SELECT STATUS, COUNT(*) AS QUEUE_DEPTH
FROM transactions
GROUP BY STATUS;
-- Find stuck transactions
SELECT txn_id, status FROM transactions WHERE status = 'PENDING';
EOF
→ Shows pending transactions = your simulated queue depth. This is what you check before restarting.
Simulate the safe restart sequence
Walk through the restart steps in the terminal — this is the exact order on a real server.
terminal — safe restart simulation
echo "=== SAFE MQ RESTART DRILL ==="
echo "Step 1: Note queue depth before restart"
echo " Queue depth = 7 messages (from log)"
echo "Step 2: Stop the consumer first"
echo " [On real server: systemctl stop payment-consumer]"
sleep 1
echo "Step 3: Get bridge approval..."
echo " Bridge Lead: GO AHEAD"
sleep 1
echo "Step 4: Restart MQ service"
echo " [On real server: systemctl restart rabbitmq-server]"
sleep 1
echo "Step 5: Start consumer"
echo " [On real server: systemctl start payment-consumer]"
sleep 1
echo "Step 6: Monitor queue depth - should decrease"
echo " Queue depth: 7 → 5 → 3 → 1 → 0. Processing resumed."
echo "=== RESTART COMPLETE ==="
→ Full restart drill complete. Sequence: stop consumer → get approval → restart MQ → start consumer → monitor ✅
Write the recovery log entry for Jira
Document everything that happened — this is your post-incident record.
What to write in Jira
Incident : MQ Consumer Crashed — Queue Stuck
Start Time : 10:02:00
Detected : 10:02:01 via log monitoring
Impact : 7 transactions stuck in queue (TXN-011 to TXN-017)
Root Cause : Consumer process crashed — no longer reading from queue
Action : Consumer stopped → MQ restarted with bridge approval
→ Consumer restarted → Queue drained in 4 minutes
Resolution : 10:08:00 — all 7 transactions processed successfully
Next Steps : Investigate why consumer crashed. Add consumer health check.
→ Clean incident record. Paste this directly into Jira ticket. ✅
08 Real L2 Scenarios
01
Monitoring alert fires: all transactions failing. You check the log — DB_CONNECTION_TIMEOUT on every line. You run df -h — disk is at 100%. Root cause found in 60 seconds. You announce on the bridge and the DBA starts archiving old data while you track the transactions that failed.
02
Client reports "transactions stuck since 1 hour." You check PENDING transactions in DB — 23 rows. You check the MQ log — consumer crashed at the same time. You do not restart immediately. You announce queue depth, get bridge approval, do the safe restart. All 23 process within 5 minutes.
03
The queue is growing but only 1 specific transaction keeps failing while others work. This is a poisoned message. You identify it, move it to a holding table, and the queue clears instantly. You investigate the poisoned message separately — it had a malformed IBAN that the processor could not handle.
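A minimal SQLite sketch of that quarantine idea, using the Week 2 lab database. The holding table name quarantine_txns and the transaction id TXN-099 are made up for this example; on a real system the message would usually be moved through the MQ's dead letter queue or an admin tool, not raw SQL.
terminal — quarantine a poisoned transaction (simulated in SQLite)
sqlite3 ~/fintech_lab.db << 'EOF'
-- Holding table with the same columns as transactions (name is illustrative)
CREATE TABLE IF NOT EXISTS quarantine_txns AS SELECT * FROM transactions WHERE 0;
-- Move the poisoned transaction out of the main flow (TXN-099 is an example id)
INSERT INTO quarantine_txns SELECT * FROM transactions WHERE txn_id = 'TXN-099';
DELETE FROM transactions WHERE txn_id = 'TXN-099';
-- Confirm what is now held aside
SELECT txn_id, status FROM quarantine_txns;
EOF
→ The main queue clears, and the poisoned message is preserved for separate investigation.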
04
Someone restarted the MQ without stopping the consumer first. 3 messages were lost. This is why the safe restart order matters. You now need to find which 3 transactions were lost, identify if they were debited, and decide whether to reprocess or refund — a much harder problem than if the restart had been done correctly.
✅ Week 6 · Day 1 & 2 Outcomes
- Identify the 5 most common causes of a DB outage in a fintech system
- Read log errors and confirm it is a DB issue — not an app or network issue
- Follow the 6-step DB outage protocol — confirm, declare P1, check disk, check process, check pool, update bridge
- Complete the DB outage lab — create a simulated outage log and investigate it step by step
- Check disk space, identify FAILED transactions, and produce a Jira-ready RCA from log data
- Explain what a stuck MQ means and why messages are not lost when it stops
- Identify the 5 signs that a message queue is stuck
- Follow the safe MQ restart order — stop consumer → get approval → restart MQ → start consumer → monitor
- Explain what a poisoned message is and how it can block an entire queue
- Complete the MQ lab — simulate a stuck queue, investigate it, run the safe restart drill, and write the Jira record