L2 Support Engineer · Fintech · Week 6
Week 6 · Day 1 & Day 2
DB Failure Drill & MQ Stuck Scenario
These are two of the most common and most critical incidents you will face as an L2 engineer. Today you learn exactly what to do when the database goes down and what to do when the message queue gets stuck.
Day 1 — DB Outage
Day 2 — MQ Stuck
Outage Drill
Queue Monitoring
Safe Restart
Day 1
DB Failure Drill — Database Outage Troubleshooting
01 The Simple Idea
Real-life Analogy
Think of the database like the cash register at a supermarket. Every transaction goes through it. If the cash register stops working — the whole store stops. Nobody can pay. Nobody can check stock. Everything freezes.
A DB outage is exactly that. Every service that depends on the database — payment processing, transaction logging, balance updates — all stop the moment the DB goes down. Your job as L2 is to confirm it is the DB, not guess, and follow a clear protocol to get it back.
What causes a DB outage?
Connection pool exhausted — too many open connections, no new ones available. The most common cause in fintech.
DB server crash — the database process stopped. Could be hardware, memory, or a software bug.
Disk full — the database ran out of disk space to write new data. Writes fail, reads may still work.
Lock contention — a long-running query is holding locks, blocking all other queries from completing.
Network issue — the app server can't reach the DB server. DB is fine but the connection is broken.
02 How to Identify a DB Outage
🔴 Signs That Point to a DB Issue
| What you see | What it means | Severity |
| DB_CONNECTION_TIMEOUT in logs | App tried to connect to DB but waited too long — pool may be full or DB is down | High |
| Cannot acquire connection from pool | All connections in pool are taken — no new transactions can proceed | High |
| All transactions FAILED simultaneously | System-wide failure — not isolated. DB is the common dependency. | P1 |
| Logs stopped writing | App cannot write to DB — disk may be full or DB is completely down | P1 |
| Response time spiking on all services | Queries are waiting — DB is slow or locked, not necessarily down | Medium |
| Only reads work, writes fail | Disk is full — DB can read existing data but cannot write new data | High |
03 DB Outage Protocol — What to Do in Order
01
Confirm it is the DB — not the app or network
Check the logs first. Look for DB_CONNECTION_TIMEOUT or pool exhausted. Then try to ping the DB server directly. If ping works but DB connection fails — DB process is down. If ping fails — network issue.
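A minimal sketch of that check, assuming a PostgreSQL database on a host called db01.internal; substitute your real hostname and port.
terminal — connectivity check (assumed host and port)
# db01.internal and port 5432 are placeholders - use your real DB host and port
# Can we reach the DB server at all? (network layer)
ping -c 3 db01.internal
# Is the DB port accepting connections? (process layer, 5432 = PostgreSQL)
nc -z -w 5 db01.internal 5432 \
  && echo "Port open - DB is listening" || echo "Port closed - DB down or blocked"
→ Ping OK but port closed points to the DB process. Ping failing points to the network.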
02
Declare P1 and open the bridge immediately
Do not investigate silently on your own. A DB outage is always a P1. Open the bridge call, notify your lead, and send the initial client alert — before you have answers.
03
Check disk space — it is the most common hidden cause
Run df -h on the DB server. If disk is at 100% — this is the root cause. Clean old logs or archive data before anything else. DB cannot write to a full disk.
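A sketch of the follow-up once df -h shows a full partition, assuming the space is being eaten by logs under /var/log; real paths differ per server, so confirm with the DBA before deleting or compressing anything.
terminal — space cleanup sketch (assumed paths)
# Which directories are eating the space? (/var/log is an assumed location)
du -sh /var/log/* 2>/dev/null | sort -h | tail -5
# Compress logs older than 7 days instead of deleting them outright
find /var/log -name "*.log" -mtime +7 -exec gzip {} \;
→ Free the space, re-test DB writes, and announce every cleanup action on the bridge.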
04
Check if the DB process is actually running
Check if the database service (mysql, postgresql, oracle) is active. If it crashed — it needs to be restarted by the DBA. Do not restart production DB yourself without DBA approval on the bridge.
05
Check the connection pool if DB is running but connections failing
If the DB process is up but connections are being refused — the pool is exhausted. The DBA can clear stuck connections or increase the pool limit as a short-term fix.
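A sketch of what that check might look like from the app server, assuming PostgreSQL on port 5432; your port and configured pool size will differ.
terminal — connection count sketch (assumed PostgreSQL port)
# How many connections does this app server hold open to the DB?
ss -tan | grep ':5432' | grep -c ESTAB
# The DBA can break the count down by state on the DB side (PostgreSQL)
# psql -c "SELECT state, COUNT(*) FROM pg_stat_activity GROUP BY state;"
→ An established count equal to the configured pool limit confirms pool exhaustion. Report the exact number on the bridge.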
06
Update every 15 minutes on the bridge — even with no news
Keep the bridge informed. "Still investigating with DBA team, no fix yet" is better than silence. Record every finding and action in the Jira ticket in real time.
04 Day 1 Lab — Simulated DB Down on Kali Linux
What this lab simulates
You cannot take down a real production database. But you can simulate the investigation steps — the log reading, the process checks, the disk checks — exactly as you would do them on a real server. This builds the muscle memory so when it happens for real, your hands know what to do.
🔬 Lab: Simulate and Investigate a DB Outage
Kali Linux · Simulated
Create a simulated DB outage log
This creates a log that looks exactly like what you would see during a real DB outage.
terminal
cat > ~/db-outage.log << 'EOF'
[09:14:50] [INFO ] TXN-001 received. Processing...
[09:14:52] [INFO ] TXN-001 SUCCESS
[09:15:00] [WARN ] DB connection pool at 82%
[09:15:01] [WARN ] DB connection pool at 91%
[09:15:02] [WARN ] DB connection pool at 97%
[09:15:03] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:03] [ERROR] TXN-002 FAILED - cannot write to database
[09:15:04] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:04] [ERROR] TXN-003 FAILED - cannot write to database
[09:15:05] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:05] [ERROR] TXN-004 FAILED - cannot write to database
[09:15:10] [ERROR] DB_CONNECTION_REFUSED - connection refused on port 5432
[09:15:10] [ERROR] Database service appears to be down
EOF
→ db-outage.log created with 13 lines including escalating warnings and errors
Step 1 — Confirm it is the DB (read the log)
Search the log for DB-specific errors to confirm the root cause.
terminal
# Count how many DB errors exist
grep -c "DB_CONNECTION" ~/db-outage.log
# Show the exact DB error lines (-i catches both "database" and "Database")
grep -i "DB_CONNECTION\|database" ~/db-outage.log
# Show what happened BEFORE the first error (context)
grep -B 3 "DB_CONNECTION_TIMEOUT" ~/db-outage.log | head -6
→ 4 DB_CONNECTION errors. WARNs at 82%→91%→97% appeared before the first ERROR. Root cause is pool exhaustion.
Step 2 — Check disk space (most common hidden cause)
Always check disk before anything else. A full disk causes DB write failures.
terminal
# Check all disks
df -h
# Check specifically the data folder
df -h /var /home /tmp
→ If any partition shows 100% — that is your root cause. Report it on the bridge immediately.
Step 3 — Check if the DB process is running (SQLite simulation)
On a real server you would check mysql or postgresql. Here we simulate the process check.
terminal — process check commands
# Check if mysql is running (real server)
systemctl status mysql 2>/dev/null || echo "mysql not installed (expected on Kali)"
# Check if postgresql is running
systemctl status postgresql 2>/dev/null || echo "postgresql not installed"
# Check all running processes to see what DB is there
ps aux | grep -E "mysql|postgres|oracle"
# Simulate — is our SQLite DB accessible?
sqlite3 ~/fintech_lab.db "SELECT COUNT(*) FROM transactions;" 2>/dev/null \
&& echo "DB is UP" || echo "DB is DOWN"
→ If you ran the Week 2 SQL lab — DB is UP. This simulates what a healthy DB check looks like. On a real server, a DOWN result means call the DBA.
Step 4 — Count affected transactions and write the Jira RCA
How many transactions failed during this outage? This goes in the incident ticket.
terminal
# Count failed transactions
grep -c "FAILED" ~/db-outage.log
# List which transactions failed (the TXN id is the 3rd field)
grep "FAILED" ~/db-outage.log | awk '{print $3}'
# When did the outage start?
grep -m 1 "ERROR" ~/db-outage.log | awk '{print $1, $2}'
→ 3 transactions failed (TXN-002, TXN-003, TXN-004). Outage started at 09:15:03. RCA: DB connection pool exhausted. ✅
Day 2
MQ Stuck Scenario — Queue Monitoring & Safe Restart
05 The Simple Idea
Real-life Analogy
Think of a Message Queue like a conveyor belt in a factory. Products (messages/transactions) are placed on the belt and workers (processors) pick them up one by one. If the belt stops — products pile up and nobody is working. But the products are still there, safe on the belt, not lost.
When the MQ gets stuck, transactions are not lost — they are just waiting. Your job is to find out why the belt stopped, fix it, and get it moving again. The right restart sequence matters — a wrong restart can lose the messages sitting in the queue.
What is a Message Queue and why does it get stuck?
A Message Queue (MQ) holds transactions in an ordered line until a worker (processor) picks them up. Common MQ tools: RabbitMQ, Apache Kafka, IBM MQ, ActiveMQ.
It gets stuck because:
The consumer/processor crashed and stopped reading from the queue.
The queue is full and cannot accept new messages.
A single poisoned message (bad format) is blocking all others behind it.
The MQ service itself stopped running.
Network connectivity between the app and the MQ broke.
🟡 Signs That the MQ is Stuck
| What you see | What it means | Action |
| Queue depth growing, none processed | Consumer is down — messages piling up, nothing consuming them | Check consumer |
| Transactions stuck in PENDING forever | Message was queued but processor never picked it up | Check queue depth |
| Dead letter queue filling up | Messages failed processing multiple times and were moved here | Investigate each |
| All new transactions hanging | Queue itself may be full or the service is down | P1 — check MQ service |
| One transaction stuck, others OK | Poisoned message — bad format blocking the queue at that position | Identify and remove it |
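When these signs appear, a quick first check might look like this, assuming a RabbitMQ broker and a consumer service named payment-consumer (the example name used in this week's restart drill).
terminal — quick consumer check (assumed RabbitMQ and service name)
# Is the consumer service running? (payment-consumer is an example name)
systemctl status payment-consumer
# Does the broker still see any consumers attached? (RabbitMQ, run on the broker host)
sudo rabbitmqctl list_consumers
→ Queue depth growing with zero attached consumers confirms the "consumer is down" row above.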
06 How to Restart MQ Safely — The Right Order
Why the restart order matters
Restarting an MQ incorrectly can lose all messages sitting in the queue. A safe restart means: confirm what is in the queue, stop the consumer first, restart the MQ service, then restart the consumer. Never restart the MQ service while the consumer is actively writing to it.
01
First — check the queue depth before touching anything
Know how many messages are sitting in the queue. This number should go down after the restart. If you don't check now, you won't know whether any messages were lost.
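If the broker is RabbitMQ, the depth check might look like the sketch below; Kafka, IBM MQ and ActiveMQ have their own equivalents. payment-queue is the queue name from this lab.
terminal — queue depth check (RabbitMQ example)
# List every queue with its current message count
sudo rabbitmqctl list_queues name messages
# Only the payment queue from this scenario
sudo rabbitmqctl list_queues name messages | grep payment-queue
→ Write the number down. It is your baseline for confirming nothing was lost after the restart.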
02
Stop the consumer/processor first — before touching MQ
The consumer is the worker reading from the queue. Stop it first so it is not in the middle of reading a message when you restart the MQ. An interrupted read can corrupt or lose that message.
03
Get bridge approval before restarting on PROD
Announce on the bridge: "I am about to restart the MQ service. Queue depth is [N] messages. Requesting approval." Wait for a verbal OK from the bridge lead before proceeding.
04
Restart the MQ service
Restart the MQ broker service. After restart, verify it is running and accepting connections. Check the queue depth — all messages should still be there.
05
Start the consumer and watch it process
Restart the consumer service. Watch the queue depth — it should start decreasing as messages are picked up. Monitor the logs for errors. Confirm transactions are completing successfully.
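One way to watch the drain, again assuming RabbitMQ and the payment-queue name from this lab.
terminal — watch the queue drain (RabbitMQ example)
# Re-run the depth check every 5 seconds and watch the number fall
# (run as a user with rabbitmqctl access)
watch -n 5 'rabbitmqctl list_queues name messages'
→ A steadily decreasing depth means the consumer is healthy again. A flat or growing depth means stop and investigate before closing anything.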
06
Confirm and close the incident
Queue depth back to zero or processing normally. All pending transactions completing. No new errors in the log. Update Jira, notify clients, close the bridge.
07 Day 2 Lab — Simulate MQ Stuck & Safe Restart
🔬 Lab: Simulate a Stuck Queue and Restart it Safely
Kali Linux · Simulated
Create a simulated MQ stuck log
This shows what a stuck queue looks like in the application logs.
terminal
cat > ~/mq-stuck.log << 'EOF'
[10:00:00] [INFO ] MQ consumer started. Connected to queue: payment-queue
[10:00:10] [INFO ] TXN-010 queued. Queue depth: 1
[10:00:11] [INFO ] TXN-010 picked up by consumer. Processing...
[10:00:12] [INFO ] TXN-010 processed successfully
[10:01:00] [INFO ] TXN-011 queued. Queue depth: 1
[10:01:01] [INFO ] TXN-012 queued. Queue depth: 2
[10:01:02] [INFO ] TXN-013 queued. Queue depth: 3
[10:01:03] [WARN ] Consumer not responding. Queue depth: 3
[10:01:30] [WARN ] Consumer not responding. Queue depth: 5
[10:02:00] [ERROR] Consumer crashed. Queue depth: 7. Messages are piling up.
[10:02:01] [ERROR] TXN-011 stuck in PENDING. Not being consumed.
[10:02:01] [ERROR] TXN-012 stuck in PENDING. Not being consumed.
[10:02:01] [ERROR] TXN-013 stuck in PENDING. Not being consumed.
EOF
→ mq-stuck.log created — shows consumer crash at 10:02 with 7 messages piling up
Investigate — what happened and when did it start?
Read the log to understand the full picture before taking any action.
terminal
# Find when warnings started
grep "WARN\|ERROR" ~/mq-stuck.log
# Count stuck transactions
grep -c "PENDING" ~/mq-stuck.log
# Find when consumer crashed
grep "Consumer crashed" ~/mq-stuck.log
→ Consumer crashed at 10:02:00. 3 transactions stuck in PENDING. Queue depth was 7 at time of crash.
Simulate — check current queue depth in SQLite
Use your SQLite lab DB to simulate checking how many transactions are stuck PENDING.
terminal
sqlite3 ~/fintech_lab.db << 'EOF'
-- Simulate queue depth check
SELECT STATUS, COUNT(*) AS QUEUE_DEPTH
FROM transactions
GROUP BY STATUS;
-- Find stuck transactions
SELECT txn_id, status FROM transactions WHERE status = 'PENDING';
EOF
→ Shows pending transactions = your simulated queue depth. This is what you check before restarting.
Simulate the safe restart sequence
Walk through the restart steps in the terminal — this is the exact order on a real server.
terminal — safe restart simulation
echo "=== SAFE MQ RESTART DRILL ==="
echo "Step 1: Note queue depth before restart"
echo " Queue depth = 7 messages (from log)"
echo "Step 2: Stop the consumer first"
echo " [On real server: systemctl stop payment-consumer]"
sleep 1
echo "Step 3: Get bridge approval..."
echo " Bridge Lead: GO AHEAD"
sleep 1
echo "Step 4: Restart MQ service"
echo " [On real server: systemctl restart rabbitmq-server]"
sleep 1
echo "Step 5: Start consumer"
echo " [On real server: systemctl start payment-consumer]"
sleep 1
echo "Step 6: Monitor queue depth - should decrease"
echo " Queue depth: 7 → 5 → 3 → 1 → 0. Processing resumed."
echo "=== RESTART COMPLETE ==="
→ Full restart drill complete. Sequence: stop consumer → get approval → restart MQ → start consumer → monitor ✅
Write the recovery log entry for Jira
Document everything that happened — this is your post-incident record.
What to write in Jira
Incident : MQ Consumer Crashed — Queue Stuck
Start Time : 10:02:00
Detected : 10:02:01 via log monitoring
Impact : 7 transactions stuck in queue (TXN-011 to TXN-017)
Root Cause : Consumer process crashed — no longer reading from queue
Action : Consumer stopped → MQ restarted with bridge approval
→ Consumer restarted → Queue drained in 4 minutes
Resolution : 10:08:00 — all 7 transactions processed successfully
Next Steps : Investigate why consumer crashed. Add consumer health check.
→ Clean incident record. Paste this directly into Jira ticket. ✅
08 Real L2 Scenarios
01
Monitoring alert fires: all transactions failing. You check the log — DB_CONNECTION_TIMEOUT on every line. You run df -h — disk is at 100%. Root cause found in 60 seconds. You announce on the bridge and the DBA starts archiving old data while you track the transactions that failed.
02
Client reports "transactions stuck since 1 hour." You check PENDING transactions in DB — 23 rows. You check the MQ log — consumer crashed at the same time. You do not restart immediately. You announce queue depth, get bridge approval, do the safe restart. All 23 process within 5 minutes.
03
The queue is growing but only 1 specific transaction keeps failing while others work. This is a poisoned message. You identify it, move it to a holding table, and the queue clears instantly. You investigate the poisoned message separately — it had a malformed IBAN that the processor could not handle.
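A minimal SQLite sketch of that quarantine idea, using the Week 2 lab database. The holding table name quarantine_txns and the transaction id TXN-099 are made up for this example; on a real system the message would usually be moved through the MQ's dead letter queue or an admin tool, not raw SQL.
terminal — quarantine a poisoned transaction (simulated in SQLite)
sqlite3 ~/fintech_lab.db << 'EOF'
-- Holding table with the same columns as transactions (name is illustrative)
CREATE TABLE IF NOT EXISTS quarantine_txns AS SELECT * FROM transactions WHERE 0;
-- Move the poisoned transaction out of the main flow (TXN-099 is an example id)
INSERT INTO quarantine_txns SELECT * FROM transactions WHERE txn_id = 'TXN-099';
DELETE FROM transactions WHERE txn_id = 'TXN-099';
-- Confirm what is now held aside
SELECT txn_id, status FROM quarantine_txns;
EOF
→ The main queue clears, and the poisoned message is preserved for separate investigation.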
04
Someone restarted the MQ without stopping the consumer first. 3 messages were lost. This is why the safe restart order matters. You now need to find which 3 transactions were lost, identify if they were debited, and decide whether to reprocess or refund — a much harder problem than if the restart had been done correctly.
✅ Week 6 · Day 1 & 2 Outcomes
- Identify the 5 most common causes of a DB outage in a fintech system
- Read log errors and confirm it is a DB issue — not an app or network issue
- Follow the 6-step DB outage protocol — confirm, declare P1, check disk, check process, check pool, update bridge
- Complete the DB outage lab — create a simulated outage log and investigate it step by step
- Check disk space, identify FAILED transactions, and produce a Jira-ready RCA from log data
- Explain what a stuck MQ means and why messages are not lost when it stops
- Identify the 5 signs that a message queue is stuck
- Follow the safe MQ restart order — stop consumer → get approval → restart MQ → start consumer → monitor
- Explain what a poisoned message is and how it can block an entire queue
- Complete the MQ lab — simulate a stuck queue, investigate it, run the safe restart drill, and write the Jira record