L2 Support Engineer · Fintech · Week 2
Day 5
Today's Topic
Severity 1 Handling
An S1 is the most critical incident your team will face. Everything is down, real money is affected, and everyone is watching. Today you learn exactly how to respond — calmly, correctly, and fast.
S1 Protocol
Outage Bridge
Incident Management
Communication
🚨 What is a Severity 1?
A Severity 1 (S1) means a complete system outage — core services are down, real customers are affected, real transactions are failing right now. Every minute of downtime = financial loss and SLA breach. This is the highest alert level.
01 The Simple Idea First
Real-life Analogy
Think of an S1 like a fire alarm going off in a hospital.
Everyone has a specific role. Nobody panics and runs randomly. The fire team follows a drill — who evacuates patients, who calls the fire department, who checks the exits, who gives updates to management.
An S1 outage is exactly like that. There is a defined protocol. Everyone knows their role. You join a bridge call, you investigate your area, you give updates on a schedule, and you don't go quiet. The worst thing you can do in an S1 is go silent.
What is an Outage Bridge?
An Outage Bridge is an emergency group call (Zoom, Teams, or phone) that gets opened the moment an S1 is declared. Everyone relevant joins — L2 engineers, L3 developers, the manager, the client's technical team.
The bridge stays open until the issue is resolved. It is the command center of the incident. All findings, decisions, and updates happen on the bridge. Nothing is discussed outside of it during an active S1.
02 The S1 Response Timeline — Minute by Minute
Minute 0 — Alert Fires
S1 Alert Detected
Monitoring tool fires a CRITICAL alert. Payment service is down. Transactions failing across all clients.
→ Do NOT wait. Do NOT finish what you were doing.
→ Open Jira immediately. Create a P1 ticket.
→ Note the exact time the alert fired.
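A quick way to capture that timestamp on your Kali terminal (a minimal sketch; incident-notes.txt is a hypothetical scratch file for your own timeline):
# Record the detection time in UTC so the timeline is unambiguous later
date -u +"%Y-%m-%d %H:%M:%S UTC - S1 alert detected" | tee -a incident-notes.txt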
Within 5 Minutes
First Response — Acknowledge & Notify
You must acknowledge the incident and notify your lead within 5 minutes. Do not investigate silently.
→ Message your lead: "S1 detected. Payment service down since [time]. Investigating now."
→ Send initial client notification — do not wait for root cause first.
→ Open the Outage Bridge call.
⚠️ Rule: Always notify BEFORE you have answers. A client waiting in silence is worse than a client told "we are aware and investigating."
Within 10 Minutes
Initial Investigation — Find the Blast Radius
Understand the scope. How much is affected? Which clients? Which services? Which environments?
→ Check monitoring dashboard — which services are red?
→ Run: df -h, free -h, top — check server health
→ Run: grep "ERROR" /logs/payment-service.log | tail -50
→ Check SENDER_BATCH — how many batches are stuck? (A query sketch follows this checklist.)
→ Report findings on bridge: "Payment API down. DB connection errors in logs. 3 clients affected."
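If your Day 3 lab database has a SENDER_BATCH table, a quick stuck-batch check might look like the sketch below (the status column name is an assumption; adjust to your actual schema):
# Count batches per status to see how many are stuck (assumes a status column)
sqlite3 fintech_lab.db "SELECT status, COUNT(*) FROM SENDER_BATCH GROUP BY status;"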
Every 15 Minutes
Status Updates — Keep Everyone Informed
Every 15 minutes you must give an update — even if you have no new findings. Silence on the bridge is not acceptable.
→ "Update at [time]: Still investigating DB connection issue. DBA team checking connection pool. No ETA yet."
→ Update the Jira ticket with latest findings.
→ Update the client communication channel.
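It is easy to lose track of the clock while deep in logs. A simple reminder loop in a spare terminal can keep you on the 15-minute cadence (a sketch; swap the echo for any notification you prefer):
# Print a reminder every 15 minutes (900 seconds); stop with Ctrl+C
while true; do sleep 900; echo "== $(date +%H:%M) :: post your bridge update now =="; done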
Within 30 Minutes
Root Cause Identified — Escalate if Needed
You should have a working theory by now. Either you can fix it at L2 level or you escalate to L3.
→ If fix is available: implement it, monitor, confirm services recovering.
→ If needs L3: "Escalating to dev team. Root cause appears to be connection pool leak in payment service. L3 team joining bridge now."
→ Never attempt a fix on PROD without bridge approval.
💡 Key rule: On PROD during S1, nothing gets changed without the bridge lead approving it. You say what you want to do, get a verbal OK, then do it.
Resolution
Service Restored — Post-Incident Actions
Services are back. But the job isn't done yet. Post-incident steps are mandatory.
→ Confirm all services are green on monitoring dashboard.
→ Send resolution notification to all affected clients.
→ Update Jira ticket — change status to Resolved.
→ Document full timeline in ticket: what happened, when, what was done.
→ Schedule Post-Incident Review (PIR) meeting within 24 hours.
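For the downtime figure in the ticket and the resolution notice, GNU date (standard on Kali) can do the subtraction (a sketch using the drill's example times):
# Convert detection and resolution times to epoch seconds, then diff in minutes
start=$(date -d "09:15" +%s)
end=$(date -d "09:45" +%s)
echo "Total downtime: $(( (end - start) / 60 )) minutes"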
03 Outage Bridge Protocol — The Rules
01
Join immediately, no excuses
When an S1 bridge is opened, you join within 2 minutes. Every second counts. No waiting to "finish something first."
02
Only speak when you have something relevant
No side conversations. No background noise. Speak when you have a finding, a question, or an update. Keep it short and factual.
03
Never go silent
If you're investigating, say so. Even "still looking, no findings yet" is better than silence. People need to know you're working.
04
Never make PROD changes alone
Announce what you intend to do before doing it. Get approval on the bridge. Then do it. Then confirm what you did.
05
Document everything in real time
While investigating, keep updating the Jira ticket. Every command you ran, every finding, every action taken — write it down as it happens. A terminal-capture sketch follows this list.
06
Separate findings from theories
Say "logs show DB_CONNECTION_TIMEOUT" (fact) not "I think the database crashed" (theory). On an S1 bridge, precision matters.
04 Who Does What During an S1
| Role | Responsibility During S1 | Level |
| --- | --- | --- |
| Bridge Lead / Incident Manager | Runs the bridge call. Assigns tasks, collects updates, makes decisions on PROD changes, communicates to management. | Manager |
| L2 Support Engineer (You) | First responder. Checks logs, monitors dashboards, checks DB tables, reports findings on bridge, creates and updates the Jira ticket, sends client notifications. | L2 — You |
| L3 / Developer | Joins when L2 confirms a code-level or config issue. Makes approved changes to services, restarts processes, deploys hotfixes. | L3 Dev |
| DBA (Database Admin) | Joins when DB issues are identified. Checks connection pools, slow queries, locks, tablespace. Makes DB-level fixes. | DBA |
| Client Technical Team | Joins to confirm impact on their end. Checks their own systems. Confirms when service is restored from their side. | Client |
05 What to Say — Communication Templates
🔴 Initial Alert — Send Within 5 Minutes
INCIDENT ALERT — S1
Dear Team / Client,
We are currently experiencing an issue with [service name] affecting [client name / all clients].
Issue detected at: [time]
Impact: [transactions failing / service unavailable]
Status: Under investigation. Bridge call is open.
We will provide an update within 15 minutes.
🟠 Progress Update — Every 15 Minutes
INCIDENT UPDATE — [Time]
Current Status: Still investigating
Finding so far: [e.g. DB connection pool exhausted. DBA team actively working on fix.]
ETA: [if known / "To be confirmed"]
Next update in 15 minutes.
🟢 Resolution — Send When Service Restored
INCIDENT RESOLVED — [Time]
Service: [service name] has been restored.
Resolved at: [time]
Total downtime: [X minutes]
Root cause: [brief description]
Action taken: [what was done to fix it]
A full Post-Incident Review will follow within 24 hours.
Please confirm services are functioning on your end.
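If you draft these updates from a terminal during the drill, a small heredoc keeps the wording consistent under pressure (purely illustrative; the finding text is the drill's example):
# Fill the 15-minute progress template with the current time
NOW=$(date +%H:%M)
cat <<EOF
INCIDENT UPDATE - $NOW
Current Status: Still investigating
Finding so far: DB connection pool exhausted. DBA team actively working on fix.
ETA: To be confirmed
Next update in 15 minutes.
EOF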
06 Do's and Don'ts During an S1
✅ Do This
- Join the bridge call immediately
- Send the initial notification before you have answers
- Update every 15 minutes — even with no new findings
- Document every action in Jira as you do it
- State facts — "logs show X" not "I think X"
- Ask for approval before any PROD change
- Keep calm — panicking slows everything down
- Confirm service restored from monitoring before closing
❌ Never Do This
- Go silent on the bridge — never acceptable
- Wait until you have answers to notify the client
- Make changes on PROD without bridge approval
- Guess or speculate without evidence
- Close the ticket before client confirms recovery
- Work alone without reporting on bridge
- Blame others on the client call
- Skip the Post-Incident Review
07 Simulated Outage Drill — Hands-on Lab
How to Perform This Drill
You cannot simulate a real production outage on your own machine. However, you can fully simulate the investigation and response steps using your Kali Linux setup, dummy log files, and the SQLite database you built on Day 3. The drill below walks through exactly what you would do — step by step — as if it were real.
The goal is to build muscle memory — so when a real S1 happens, your hands know what to do without thinking.
T+0 — 09:15 AM
Alert received — Create the S1 Jira ticket immediately
Don't investigate first. Create the ticket first so there's a timestamp and a tracking number.
What to write in Jira
Title: "S1 — Payment Service Down — All Transactions Failing"
Priority: P1 | Status: Open | Assigned: You
Description: "Monitoring alert at 09:15 AM. Payment API returning errors. Investigation started."
T+2 — 09:17 AM
Notify your lead — open the bridge
Message before you have answers. This is a rule.
Message to send to lead"S1 detected at 09:15 AM. Payment service down. Transactions failing. Jira ticket created: FIN-1042. Joining bridge now."
T+5 — 09:20 AM
Check server health — open Kali terminal
Run these three commands before touching any logs.
# Check disk space
df -h
# Check RAM
free -h
# Check what process is eating resources
top
✅ Simulated result: Disk OK (60%). RAM OK (50%). But in top — Java process at 98% CPU. That's suspicious.
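top is interactive; for a one-shot snapshot you can paste straight onto the bridge, ps works as well (a small addition to the drill, not part of the original checklist):
# One-shot view of the top 5 CPU consumers (first line is the header)
ps aux --sort=-%cpu | head -6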
T+8 — 09:23 AM
Check the payment service log for errors
Create and investigate a dummy log to simulate this on your Kali machine.
# Create a simulated outage log
cat > outage-drill.log << 'EOF'
[09:14:50] [INFO ] Payment service started normally
[09:14:55] [INFO ] TXN-001 received and queued
[09:15:00] [WARN ] DB connection pool at 88%
[09:15:01] [WARN ] DB connection pool at 95%
[09:15:02] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:02] [ERROR] TXN-001 FAILED - cannot write to DB
[09:15:03] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:03] [ERROR] TXN-002 FAILED - cannot write to DB
[09:15:04] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
EOF
# Now investigate it
grep "ERROR" outage-drill.log
grep -c "ERROR" outage-drill.log
grep "WARN\|ERROR" outage-drill.log
✅ Simulated finding: 5 ERROR lines. DB_CONNECTION_TIMEOUT is the pattern. WARNs appeared at 88% and 95% before the crash.
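When you report this on the bridge, the exact time of the first error matters. grep can pin it down directly (a sketch against the drill log above):
# Show only the first ERROR line to establish when the failure started
grep -m1 "ERROR" outage-drill.log
# Pull out just the timestamps of every ERROR line
grep "ERROR" outage-drill.log | cut -d']' -f1 | tr -d '['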
T+10 — 09:25 AM
Check the database — how many transactions are stuck?
Open your SQLite DB from Day 3 and run the stuck transaction queries.
sqlite3 fintech_lab.db
-- How many transactions are stuck PENDING?
SELECT COUNT(*) FROM transactions WHERE status = 'PENDING';
-- Which clients are affected?
SELECT c.client_name, COUNT(*) as stuck
FROM transactions t
JOIN clients c ON t.client_id = c.client_id
WHERE t.status = 'PENDING'
GROUP BY c.client_name;
✅ Bridge update to give: "Finding at 09:25: DB connection pool exhausted since 09:15. 2 transactions stuck in PENDING. Clients affected: Alpha Bank, Beta Wallet. Root cause: DB pool leak. Requesting DBA to join bridge."
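In a real incident you may prefer not to sit inside an interactive sqlite3 shell; the same queries run non-interactively, so the output can be copied straight into Jira (a sketch using the Day 3 schema shown above):
# Stuck-transaction count without entering the sqlite3 shell
sqlite3 fintech_lab.db "SELECT COUNT(*) FROM transactions WHERE status = 'PENDING';"
# Per-client breakdown with column headers for readability
sqlite3 -header -column fintech_lab.db "SELECT c.client_name, COUNT(*) AS stuck FROM transactions t JOIN clients c ON t.client_id = c.client_id WHERE t.status = 'PENDING' GROUP BY c.client_name;"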
T+30 — 09:45 AM
Service restored — send resolution notification
DBA fixed the connection pool. Services are back. Monitoring shows green.
Resolution message to send"INCIDENT RESOLVED — 09:45 AM. Payment service restored. Root cause: DB connection pool exhausted due to unreleased connections. DBA cleared the pool and increased limit. Total downtime: 30 minutes. Post-Incident Review scheduled for tomorrow. Please confirm transactions processing on your end."
✅ Final step: Update Jira ticket status to RESOLVED. Document full timeline. Close the bridge call.
08 Real L2 Scenarios
01
S1 fires at 2 AM. You are on-call. Do not wait until morning. The protocol does not change based on time of day. Join the bridge, notify, investigate. An S1 at 2 AM is treated the same as one at 2 PM.
02
You are on the bridge and your lead asks "what do you see in the logs?" You say: "Logs show DB_CONNECTION_TIMEOUT errors starting at 09:15:02. WARN messages at 09:15:00 and 09:15:01 showed the pool at 88% and 95% before the crash. Five error lines total." Facts only. No guessing.
03
You found the root cause and know the fix — restart the payment service. Do not just do it. Say on the bridge: "I believe restarting the payment service will clear the connection pool. Requesting approval to proceed." Wait for a "go ahead" before you touch anything on PROD.
04
It has been 20 minutes and you have found nothing. That is also useful information. Say on the bridge: "Update at [time]: No application-level errors found. Logs look clean from my side. Possible infra or network issue. Requesting infra team to check." Never stay silent just because you have nothing to report.
✅ Week 2 · Day 5 Outcomes — Can You Do This?
- Define what an S1 is and what makes it different from a P2 or P3 ticket
- Explain what an Outage Bridge is and how to behave on one
- Follow the S1 response timeline — from T+0 alert to resolution — without missing a step
- Send the correct initial, progress, and resolution notifications to clients
- Know your role as L2 on a bridge — what you own, what you escalate, what needs approval
- Complete the simulated outage drill on Kali Linux — investigate logs, query the DB, report findings
- Document a full incident timeline in a Jira ticket during a live S1
- Respond to an S1 calmly, correctly, and within the required time targets