Before a client even calls you, the system should already be screaming that something is wrong. That's what monitoring is — your early warning system.
Think about the dashboard of a car. While you drive, it quietly watches everything — fuel level, engine temperature, oil pressure, speed. You don't manually check the engine every 5 minutes. The car tells you when something needs attention.
A small yellow light = warning, keep an eye on it. A red flashing light = stop the car right now.
Monitoring tools do exactly this for your servers and applications. They watch everything 24/7 and send you alerts when something looks wrong — before it becomes a disaster.
Monitoring means your systems are being watched continuously, every second of every day. Tools like Datadog, Zabbix, or Nagios collect metrics from your servers and apps (like CPU usage, memory, and error rates), and a dashboard such as Grafana displays them so you can see everything at a glance.
When a number crosses a limit — say CPU goes above 90% — the monitoring tool fires an alert. That alert lands in your email, Slack, or PagerDuty. As an L2 engineer, you receive these alerts and decide what to do next.
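Under the hood the idea is simple: measure a number, compare it to a limit, and send a message when the limit is crossed. Here's a minimal Python sketch of that loop. It is not how Grafana or Datadog work internally; the webhook URL, the thresholds, and the psutil/requests libraries are just assumptions to make the idea concrete.

```python
import shutil
import psutil    # third-party: pip install psutil
import requests  # third-party: pip install requests

# Placeholder webhook URL -- substitute your own Slack/Teams incoming webhook.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

# Example thresholds; real monitoring tools let you configure these per metric.
CPU_CRITICAL = 90.0   # percent
DISK_WARNING = 80.0   # percent

def check_and_alert():
    alerts = []

    # Sample CPU usage over one second.
    cpu = psutil.cpu_percent(interval=1)
    if cpu > CPU_CRITICAL:
        alerts.append(f"CRITICAL: CPU at {cpu:.0f}% (limit {CPU_CRITICAL:.0f}%)")

    # Check root filesystem usage.
    usage = shutil.disk_usage("/")
    disk_pct = usage.used / usage.total * 100
    if disk_pct > DISK_WARNING:
        alerts.append(f"WARNING: disk at {disk_pct:.0f}% (limit {DISK_WARNING:.0f}%)")

    # Deliver each alert to the chat webhook.
    for message in alerts:
        requests.post(WEBHOOK_URL, json={"text": message}, timeout=5)

if __name__ == "__main__":
    check_and_alert()
```

Real tools run checks like this with agents and rule engines at scale, but the mechanics are the same: metric, threshold, notification.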
Think of it like a shop. The building itself (walls, electricity, AC, water) is the infrastructure. The work happening inside (cashier processing payments, stock being tracked) is the application. Both can have problems. Both need to be watched.
"Is the building healthy?"
"Is the work inside running fine?"
If the alert is infra-related (high CPU, disk full), you check the server, restart services, or clear space. If it's app-related (high error rate, slow API), you dig into the application logs, check the code or configuration, or look at recent deployments. The starting point is different for each.
A very common mistake for new engineers is treating all alerts the same. They're not. Always ask first: is this an infra problem or an app problem?
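Here's a tiny sketch of that first question in code. The keyword lists are made up for illustration; most tools tag alerts for you, but when they don't, this is the mental routing you do in your head.

```python
# Hypothetical keyword lists -- adjust them to the alert names your tool actually sends.
INFRA_KEYWORDS = ("cpu", "memory", "ram", "disk", "server", "connection refused")
APP_KEYWORDS = ("error rate", "api", "login", "transaction", "deployment", "queue")

def classify_alert(message: str) -> str:
    """Return 'infra' or 'app' so you know where to start digging."""
    text = message.lower()
    # Crude substring matching; good enough for a sketch.
    if any(word in text for word in INFRA_KEYWORDS):
        return "infra"   # start on the server: top, df -h, service status
    if any(word in text for word in APP_KEYWORDS):
        return "app"     # start in the app: logs, config, recent deployments
    return "unknown"     # escalate or ask; never guess silently

print(classify_alert("Disk usage reached 82% on DB server"))   # infra
print(classify_alert("Payment API returning 500 errors"))      # app
```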
Looking at the dashboard above — disk space is critical at 96%. Your first action: log into the server, find what's eating disk space (usually old log files), clean it up or archive it. Don't wait — at 100% the server will crash and stop writing any new data.
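If you have to do that investigation by hand, a few lines of Python (or `df -h` and `du -sh` on the shell) will show how full the disk is and which files are the biggest. The log directory below is an assumption; your servers may keep logs somewhere else.

```python
import os
import shutil

# Illustrative path -- check with the app team where logs really live on your servers.
LOG_DIR = "/var/log"

# How full is the disk right now?
usage = shutil.disk_usage("/")
print(f"Disk used: {usage.used / usage.total:.0%}")

# List the ten largest files under the log directory -- the usual suspects.
sizes = []
for root, _dirs, files in os.walk(LOG_DIR):
    for name in files:
        path = os.path.join(root, name)
        try:
            sizes.append((os.path.getsize(path), path))
        except OSError:
            pass  # file rotated or deleted while scanning -- skip it

for size, path in sorted(sizes, reverse=True)[:10]:
    print(f"{size / 1e6:8.1f} MB  {path}")
```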
The transaction error rate warning at 4.2% also needs attention. This is an app-level alert. You'd check the application logs to see what type of transactions are failing and why.
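One quick way to see "what type and why" is to count the error names in the log. This is a sketch, assuming a hypothetical log path and a format that prints lines like `ERROR TxnDeclined: ...`; your application will differ.

```python
import re
from collections import Counter

# Hypothetical log location -- point this at wherever your app actually writes logs.
LOG_FILE = "/var/log/payments/app.log"

errors = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as f:
    for line in f:
        # Count lines like "ERROR TxnDeclined: ..." by their error name.
        match = re.search(r"ERROR\s+(\w+)", line)
        if match:
            errors[match.group(1)] += 1

# The top offenders tell you which transaction type to investigate first.
for name, count in errors.most_common(5):
    print(f"{count:6d}  {name}")
```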
For every alert, ask two questions:

1. Is this Infra or App?
2. What severity is it: OK, Info, Warning, or Critical?
This is exactly what you'll do every morning when you open your monitoring tool. Here's a full classification exercise:
| Alert Message | Type | Severity | What You Should Do |
|---|---|---|---|
| Server CPU at 97% for last 10 minutes | Infra | Critical | Create P1 ticket. Check what process is eating CPU. Restart if needed. |
| Payment API returning 500 errors — 80% failure rate | App | Critical | Drop everything. Check app logs. Likely a P1 — alert the team immediately. |
| Disk usage reached 82% on DB server | Infra | Warning | Schedule log cleanup. Monitor closely. Will become critical at 90%+. |
| Login failures increased by 30% in last hour | App | Warning | Check if it's a specific user, IP, or region. Could be fraud attempt or an auth bug. |
| Nightly backup completed successfully | Infra | Info | Just log it. No action needed. Good news. |
| API response time: 210ms (normal is under 500ms) | App | OK | All good. No action needed. System is healthy. |
| Database server not responding — connection refused | Infra | Critical | P1 immediately. Every app that uses this DB is now broken. Wake up the team. |
| Transaction queue length growing — 500 jobs waiting | App | Warning | Check if the queue processor is running. If stuck, restart the worker service. |
| New deployment pushed to PROD at 3:00 PM | App | Info | Watch closely for 30 minutes post-deploy. Any new alerts after this = likely the deployment caused it. |
| RAM usage dropped back to 45% after restart | Infra | OK / Resolved | Previous warning is cleared. Update the Jira ticket. Inform client if they were affected. |
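Once you've answered the two questions, the first action is almost mechanical. Here's a hypothetical playbook lookup that mirrors the table above; the exact wording should come from your own runbooks, not from this sketch.

```python
# Hypothetical playbook: (type, severity) -> first action. Adapt to your runbooks.
PLAYBOOK = {
    ("infra", "critical"): "Create a P1 ticket and get on the server now.",
    ("infra", "warning"):  "Schedule cleanup or remediation before it turns critical.",
    ("infra", "info"):     "Log it. No action needed.",
    ("app", "critical"):   "P1. Check app logs and alert the team immediately.",
    ("app", "warning"):    "Investigate logs and config; watch the trend.",
    ("app", "info"):       "Note it and watch the next 30 minutes of alerts.",
}

def first_action(alert_type: str, severity: str) -> str:
    return PLAYBOOK.get((alert_type.lower(), severity.lower()),
                        "Unknown combination: escalate to L3.")

print(first_action("Infra", "Critical"))  # Create a P1 ticket and get on the server now.
print(first_action("App", "Warning"))     # Investigate logs and config; watch the trend.
```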
At 7 AM you get a CRITICAL alert: "DB server not responding." No client has called yet. You open a P1 ticket immediately, start investigating, and send an early heads-up to the client — "We detected an issue and are already working on it." This is proactive support and clients love it.
You get a WARNING: disk at 83%. You think "it's not critical yet, I'll handle it later." Two hours later it hits 100% and the server crashes. Warnings are your chance to prevent the fire. Always act on warnings before they become critical.
A CRITICAL alert fires right after a deployment. First question to ask: "Was anything deployed recently?" If yes, the deployment is your prime suspect. You check what changed and whether rolling it back fixes the issue.
You get 20 alerts at once — CPU, memory, API errors, login failures all firing together. Don't panic. Usually one root cause triggers everything else. Start with the infra alerts first — if the server is overloaded, that's probably causing all the app-level alerts too.
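In practice that means sorting the flood before reading it: infra before app, critical before warning. A small sketch with made-up alert records shows the triage order.

```python
# Hypothetical alert records, roughly as a monitoring tool might hand them to you.
alerts = [
    {"msg": "Login failures up 30%",          "type": "app",   "severity": "warning"},
    {"msg": "Payment API 500 errors",         "type": "app",   "severity": "critical"},
    {"msg": "CPU at 97% on app-server-01",    "type": "infra", "severity": "critical"},
    {"msg": "Memory at 92% on app-server-01", "type": "infra", "severity": "warning"},
]

# Triage order: infra before app, critical before warning.
TYPE_RANK = {"infra": 0, "app": 1}
SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2, "ok": 3}

for alert in sorted(alerts, key=lambda a: (TYPE_RANK[a["type"]], SEVERITY_RANK[a["severity"]])):
    print(alert["severity"].upper(), "|", alert["type"], "|", alert["msg"])
```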