L2 Support Engineer · Fintech · Week 2
Day 5
Today's Topic
Severity 1 Handling
An S1 is the most critical incident your team will face. Everything is down, real money is affected, and everyone is watching. Today you learn exactly how to respond — calmly, correctly, and fast.
S1 Protocol
Outage Bridge
Incident Management
Communication
🚨 What is a Severity 1?
A Severity 1 (S1) means a complete system outage — core services are down, real customers are affected, real transactions are failing right now. Every minute of downtime = financial loss and SLA breach. This is the highest alert level.
01 The Simple Idea First
Real-life Analogy
Think of an S1 like a fire alarm going off in a hospital.
Everyone has a specific role. Nobody panics and runs randomly. The fire team follows a drill — who evacuates patients, who calls the fire department, who checks the exits, who gives updates to management.
An S1 outage is exactly like that. There is a defined protocol. Everyone knows their role. You join a bridge call, you investigate your area, you give updates on a schedule, and you don't go quiet. The worst thing you can do in an S1 is go silent.
What is an Outage Bridge?
An Outage Bridge is an emergency group call (Zoom, Teams, or phone) that gets opened the moment an S1 is declared. Everyone relevant joins — L2 engineers, L3 developers, the manager, the client's technical team.
The bridge stays open until the issue is resolved. It is the command center of the incident. All findings, decisions, and updates happen on the bridge. Nothing is discussed outside of it during an active S1.
02 The S1 Response Timeline — Minute by Minute
Minute 0 — Alert Fires
S1 Alert Detected
Monitoring tool fires a CRITICAL alert. Payment service is down. Transactions failing across all clients.
→ Do NOT wait. Do NOT finish what you were doing.
→ Open Jira immediately. Create a P1 ticket.
→ Note the exact time the alert fired.
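A quick way to capture that timestamp on your Kali terminal (a minimal sketch; incident-notes.txt is a hypothetical scratch file for your own timeline):
# Record the detection time in UTC so the timeline is unambiguous later
date -u +"%Y-%m-%d %H:%M:%S UTC - S1 alert detected" | tee -a incident-notes.txt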
Within 5 Minutes
First Response — Acknowledge & Notify
You must acknowledge the incident and notify your lead within 5 minutes. Do not investigate silently.
→ Message your lead: "S1 detected. Payment service down since [time]. Investigating now."
→ Send initial client notification — do not wait for root cause first.
→ Open the Outage Bridge call.
⚠️ Rule: Always notify BEFORE you have answers. A client waiting in silence is worse than a client told "we are aware and investigating."
Within 10 Minutes
Initial Investigation — Find the Blast Radius
Understand the scope. How much is affected? Which clients? Which services? Which environments?
→ Check monitoring dashboard — which services are red?
→ Run: df -h, free -h, top — check server health
→ Run: grep "ERROR" /logs/payment-service.log | tail -50
→ Check SENDER_BATCH — how many batches are stuck? (A query sketch follows this checklist.)
→ Report findings on bridge: "Payment API down. DB connection errors in logs. 3 clients affected."
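If your Day 3 lab database has a SENDER_BATCH table, a quick stuck-batch check might look like the sketch below (the status column name is an assumption; adjust to your actual schema):
# Count batches per status to see how many are stuck (assumes a status column)
sqlite3 fintech_lab.db "SELECT status, COUNT(*) FROM SENDER_BATCH GROUP BY status;"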
Every 15 Minutes
Status Updates — Keep Everyone Informed
Every 15 minutes you must give an update — even if you have no new findings. Silence on the bridge is not acceptable.
→ "Update at [time]: Still investigating DB connection issue. DBA team checking connection pool. No ETA yet."
→ Update the Jira ticket with latest findings.
→ Update the client communication channel.
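It is easy to lose track of the clock while deep in logs. A simple reminder loop in a spare terminal can keep you on the 15-minute cadence (a sketch; swap the echo for any notification you prefer):
# Print a reminder every 15 minutes (900 seconds); stop with Ctrl+C
while true; do sleep 900; echo "== $(date +%H:%M) :: post your bridge update now =="; done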
Within 30 Minutes
Root Cause Identified — Escalate if Needed
You should have a working theory by now. Either you can fix it at L2 level or you escalate to L3.
→ If fix is available: implement it, monitor, confirm services recovering.
→ If needs L3: "Escalating to dev team. Root cause appears to be connection pool leak in payment service. L3 team joining bridge now."
→ Never attempt a fix on PROD without bridge approval.
💡 Key rule: On PROD during S1, nothing gets changed without the bridge lead approving it. You say what you want to do, get a verbal OK, then do it.
Resolution
Service Restored — Post-Incident Actions
Services are back. But the job isn't done yet. Post-incident steps are mandatory.
→ Confirm all services are green on monitoring dashboard.
→ Send resolution notification to all affected clients.
→ Update Jira ticket — change status to Resolved.
→ Document full timeline in ticket: what happened, when, what was done.
→ Schedule Post-Incident Review (PIR) meeting within 24 hours.
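For the downtime figure in the ticket and the resolution notice, GNU date (standard on Kali) can do the subtraction (a sketch using the drill's example times):
# Convert detection and resolution times to epoch seconds, then diff in minutes
start=$(date -d "09:15" +%s)
end=$(date -d "09:45" +%s)
echo "Total downtime: $(( (end - start) / 60 )) minutes"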
03 Outage Bridge Protocol — The Rules
01
Join immediately, no excuses
When an S1 bridge is opened, you join within 2 minutes. Every second counts. No waiting to "finish something first."
02
Only speak when you have something relevant
No side conversations. No background noise. Speak when you have a finding, a question, or an update. Keep it short and factual.
03
Never go silent
If you're investigating, say so. Even "still looking, no findings yet" is better than silence. People need to know you're working.
04
Never make PROD changes alone
Announce what you intend to do before doing it. Get approval on the bridge. Then do it. Then confirm what you did.
05
Document everything in real time
While investigating, keep updating the Jira ticket. Every command you ran, every finding, every action taken — write it down as it happens. A terminal-capture sketch follows this list.
06
Separate findings from theories
Say "logs show DB_CONNECTION_TIMEOUT" (fact) not "I think the database crashed" (theory). On an S1 bridge, precision matters.
04 Who Does What During an S1
| Role | Responsibility During S1 | Level |
| --- | --- | --- |
| Bridge Lead / Incident Manager | Runs the bridge call. Assigns tasks, collects updates, makes decisions on PROD changes, communicates to management. | Manager |
| L2 Support Engineer (You) | First responder. Checks logs, monitors dashboards, checks DB tables, reports findings on bridge, creates and updates the Jira ticket, sends client notifications. | L2 — You |
| L3 / Developer | Joins when L2 confirms a code-level or config issue. Makes approved changes to services, restarts processes, deploys hotfixes. | L3 Dev |
| DBA (Database Admin) | Joins when DB issues are identified. Checks connection pools, slow queries, locks, tablespace. Makes DB-level fixes. | DBA |
| Client Technical Team | Joins to confirm impact on their end. Checks their own systems. Confirms when service is restored from their side. | Client |
05 What to Say — Communication Templates
🔴 Initial Alert — Send Within 5 Minutes
INCIDENT ALERT — S1
Dear Team / Client,
We are currently experiencing an issue with [service name] affecting [client name / all clients].
Issue detected at: [time]
Impact: [transactions failing / service unavailable]
Status: Under investigation. Bridge call is open.
We will provide an update within 15 minutes.
🟠 Progress Update — Every 15 Minutes
INCIDENT UPDATE — [Time]
Current Status: Still investigating
Finding so far: [e.g. DB connection pool exhausted. DBA team actively working on fix.]
ETA: [if known / "To be confirmed"]
Next update in 15 minutes.
🟢 Resolution — Send When Service Restored
INCIDENT RESOLVED — [Time]
Service: [service name] has been restored.
Resolved at: [time]
Total downtime: [X minutes]
Root cause: [brief description]
Action taken: [what was done to fix it]
A full Post-Incident Review will follow within 24 hours.
Please confirm services are functioning on your end.
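If you draft these updates from a terminal during the drill, a small heredoc keeps the wording consistent under pressure (purely illustrative; the finding text is the drill's example):
# Fill the 15-minute progress template with the current time
NOW=$(date +%H:%M)
cat <<EOF
INCIDENT UPDATE - $NOW
Current Status: Still investigating
Finding so far: DB connection pool exhausted. DBA team actively working on fix.
ETA: To be confirmed
Next update in 15 minutes.
EOF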
06 Do's and Don'ts During an S1
✅ Do This
- Join the bridge call immediately
- Send the initial notification before you have answers
- Update every 15 minutes — even with no new findings
- Document every action in Jira as you do it
- State facts — "logs show X" not "I think X"
- Ask for approval before any PROD change
- Keep calm — panicking slows everything down
- Confirm service restored from monitoring before closing
❌ Never Do This
- Go silent on the bridge — never acceptable
- Wait until you have answers to notify the client
- Make changes on PROD without bridge approval
- Guess or speculate without evidence
- Close the ticket before client confirms recovery
- Work alone without reporting on bridge
- Blame others on the client call
- Skip the Post-Incident Review
07 Simulated Outage Drill — Hands-on Lab
How to Perform This Drill
You cannot simulate a real production outage on your own machine. However, you can fully simulate the investigation and response steps using your Kali Linux setup, dummy log files, and the SQLite database you built on Day 3. The drill below walks through exactly what you would do — step by step — as if it were real.
The goal is to build muscle memory — so when a real S1 happens, your hands know what to do without thinking.
T+0 — 09:15 AM
Alert received — Create the S1 Jira ticket immediately
Don't investigate first. Create the ticket first so there's a timestamp and a tracking number.
What to write in Jira
Title: "S1 — Payment Service Down — All Transactions Failing"
Priority: P1 | Status: Open | Assigned: You
Description: "Monitoring alert at 09:15 AM. Payment API returning errors. Investigation started."
T+2 — 09:17 AM
Notify your lead — open the bridge
Message before you have answers. This is a rule.
Message to send to lead"S1 detected at 09:15 AM. Payment service down. Transactions failing. Jira ticket created: FIN-1042. Joining bridge now."
T+5 — 09:20 AM
Check server health — open Kali terminal
Run these three commands before touching any logs.
# Check disk space
df -h
# Check RAM
free -h
# Check what process is eating resources
top
✅ Simulated result: Disk OK (60%). RAM OK (50%). But in top — Java process at 98% CPU. That's suspicious.
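top is interactive; for a one-shot snapshot you can paste straight onto the bridge, ps works as well (a small addition to the drill, not part of the original checklist):
# One-shot view of the top 5 CPU consumers (first line is the header)
ps aux --sort=-%cpu | head -6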
T+8 — 09:23 AM
Check the payment service log for errors
Create and investigate a dummy log to simulate this on your Kali machine.
# Create a simulated outage log
cat > outage-drill.log << 'EOF'
[09:14:50] [INFO ] Payment service started normally
[09:14:55] [INFO ] TXN-001 received and queued
[09:15:00] [WARN ] DB connection pool at 88%
[09:15:01] [WARN ] DB connection pool at 95%
[09:15:02] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:02] [ERROR] TXN-001 FAILED - cannot write to DB
[09:15:03] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
[09:15:03] [ERROR] TXN-002 FAILED - cannot write to DB
[09:15:04] [ERROR] DB_CONNECTION_TIMEOUT - pool exhausted
EOF
# Now investigate it
grep "ERROR" outage-drill.log
grep -c "ERROR" outage-drill.log
grep "WARN\|ERROR" outage-drill.log
✅ Simulated finding: 5 ERROR lines. DB_CONNECTION_TIMEOUT is the pattern. WARNs appeared at 88% and 95% before the crash.
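When you report this on the bridge, the exact time of the first error matters. grep can pin it down directly (a sketch against the drill log above):
# Show only the first ERROR line to establish when the failure started
grep -m1 "ERROR" outage-drill.log
# Pull out just the timestamps of every ERROR line
grep "ERROR" outage-drill.log | cut -d']' -f1 | tr -d '['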
T+10 — 09:25 AM
Check the database — how many transactions are stuck?
Open your SQLite DB from Day 3 and run the stuck transaction queries.
sqlite3 fintech_lab.db
-- How many transactions are stuck PENDING?
SELECT COUNT(*) FROM transactions WHERE status = 'PENDING';
-- Which clients are affected?
SELECT c.client_name, COUNT(*) as stuck
FROM transactions t
JOIN clients c ON t.client_id = c.client_id
WHERE t.status = 'PENDING'
GROUP BY c.client_name;
✅ Bridge update to give: "Finding at 09:25: DB connection pool exhausted since 09:15. 2 transactions stuck in PENDING. Clients affected: Alpha Bank, Beta Wallet. Root cause: DB pool leak. Requesting DBA to join bridge."
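In a real incident you may prefer not to sit inside an interactive sqlite3 shell; the same queries run non-interactively, so the output can be copied straight into Jira (a sketch using the Day 3 schema shown above):
# Stuck-transaction count without entering the sqlite3 shell
sqlite3 fintech_lab.db "SELECT COUNT(*) FROM transactions WHERE status = 'PENDING';"
# Per-client breakdown with column headers for readability
sqlite3 -header -column fintech_lab.db "SELECT c.client_name, COUNT(*) AS stuck FROM transactions t JOIN clients c ON t.client_id = c.client_id WHERE t.status = 'PENDING' GROUP BY c.client_name;"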
T+30 — 09:45 AM
Service restored — send resolution notification
DBA fixed the connection pool. Services are back. Monitoring shows green.
Resolution message to send"INCIDENT RESOLVED — 09:45 AM. Payment service restored. Root cause: DB connection pool exhausted due to unreleased connections. DBA cleared the pool and increased limit. Total downtime: 30 minutes. Post-Incident Review scheduled for tomorrow. Please confirm transactions processing on your end."
✅ Final step: Update Jira ticket status to RESOLVED. Document full timeline. Close the bridge call.
08 Real L2 Scenarios
01
S1 fires at 2 AM. You are on-call. Do not wait until morning. The protocol does not change based on time of day. Join the bridge, notify, investigate. An S1 at 2 AM is treated the same as one at 2 PM.
02
You are on the bridge and your lead asks "what do you see in the logs?" You say: "Logs show DB_CONNECTION_TIMEOUT errors starting at 09:15:02. WARN messages at 09:15:00 and 09:15:01 showed the pool at 88% and 95% before the crash. Five error lines total." Facts only. No guessing.
03
You found the root cause and know the fix — restart the payment service. Do not just do it. Say on the bridge: "I believe restarting the payment service will clear the connection pool. Requesting approval to proceed." Wait for a "go ahead" before you touch anything on PROD.
04
It has been 20 minutes and you have found nothing. That is also useful information. Say on the bridge: "Update at [time]: No application-level errors found. Logs look clean from my side. Possible infra or network issue. Requesting infra team to check." Never stay silent just because you have nothing to report.
✅ Week 2 · Day 5 Outcomes — Can You Do This?
- Define what an S1 is and what makes it different from a P2 or P3 ticket
- Explain what an Outage Bridge is and how to behave on one
- Follow the S1 response timeline — from T+0 alert to resolution — without missing a step
- Send the correct initial, progress, and resolution notifications to clients
- Know your role as L2 on a bridge — what you own, what you escalate, what needs approval
- Complete the simulated outage drill on Kali Linux — investigate logs, query the DB, report findings
- Document a full incident timeline in a Jira ticket during a live S1
- Respond to an S1 calmly, correctly, and within the required time targets